F5-TTS

Text-to-speech system for fluent, faithful speech generation with flow matching

Repository activity

Stars: 14.7k
Forks: 2.2k
Open Issues: 52

License

MIT

Languages

Python
Shell
Dockerfile

Get it: GitHub · ArXiv

About F5-TTS

F5-TTS is a text-to-speech system that supports basic TTS, multi-style and multi-speaker generation, and voice chat. It targets lower-latency inference and includes a CLI for running inference from the command line.

The F5-TTS model is a Diffusion Transformer with ConvNeXt V2. At inference time it applies Sway Sampling, a strategy for choosing which flow steps to evaluate. The repository also includes a reproduction of E2 TTS, a Flat-UNet Transformer, and supports custom inference for additional languages. Runtime deployment is available with Triton and TensorRT-LLM.
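The idea behind Sway Sampling can be illustrated with a short sketch. It assumes the sway function has the form t = u + s * (cos(pi/2 * u) - 1 + u), as described in the F5-TTS paper, where u is a uniformly spaced flow step in [0, 1] and s is the sway coefficient. The function names and the default s = -1 are illustrative assumptions, not the project's API.

```python
import math


def sway_sample(u: float, s: float = -1.0) -> float:
    """Map a uniform flow step u in [0, 1] to a swayed step t.

    Assumed form: t = u + s * (cos(pi/2 * u) - 1 + u).
    The endpoints u = 0 and u = 1 map to themselves for any s.
    """
    return u + s * (math.cos(math.pi / 2 * u) - 1 + u)


def sway_schedule(num_steps: int, s: float = -1.0) -> list[float]:
    """Return num_steps + 1 time points from 0 to 1 after applying the sway."""
    return [sway_sample(i / num_steps, s) for i in range(num_steps + 1)]


# Hypothetical example: an 8-step schedule with a negative coefficient.
schedule = sway_schedule(8)
```

With a negative s, the time points cluster near the start of the flow, so more of the step budget is spent on the early steps. Setting s = 0 recovers the uniform schedule.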

The code is released under the MIT License, while the pre-trained models use CC-BY-NC. The repository documents Docker usage, and training and finetuning with Hugging Face Accelerate. Published base models are available on Hugging Face, Model Scope, and Wisemodel.

Key features

Basic TTS with chunk inference
Multi-style and multi-speaker generation
Voice chat powered by Qwen2.5-3B-Instruct
CLI inference
Runtime deployment with Triton and TensorRT-LLM
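Chunk inference, listed above under basic TTS, generally means splitting long input text into shorter pieces that are synthesized one at a time. The sketch below shows one simple way to do that split at sentence boundaries. It is not the project's implementation: the function name `chunk_text` and the `max_chars` limit are hypothetical, chosen only for illustration.

```python
import re


def chunk_text(text: str, max_chars: int = 135) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept as its own chunk
    rather than being cut mid-sentence.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Hypothetical example with a very small limit to force a split.
example_chunks = chunk_text("One. Two. Three.", max_chars=9)
```

Each chunk could then be passed to the synthesizer separately and the resulting audio segments concatenated.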

Details

First released: 2024
Platforms: Web · CLI · Docker
Deployment: Self-hostable · Docker
Language: Python
License: MIT; models CC-BY-NC
Model access: Hugging Face · Model Scope · Wisemodel