Text-to-speech system for fluent, faithful speech generation with flow matching
- Stars: 14.7k
- Forks: 2.2k
- Open Issues: 52
MIT
- Python
- Shell
- Dockerfile

About F5-TTS
F5-TTS is a text-to-speech system for generating speech from text, with support for basic TTS, multi-style and multi-speaker generation, and voice chat. It is aimed at lower-latency inference and includes a CLI for running inference from the command line.
The model uses a Diffusion Transformer with ConvNeXt V2 and applies Sway Sampling, a strategy for choosing flow steps at inference time. The repository also includes a reproduction of E2 TTS, which uses a Flat-UNet Transformer, and offers custom inference with support for more languages. Runtime deployment is available with Triton and TensorRT-LLM.
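To make the idea of inference-time flow step sampling concrete, here is a minimal sketch of a sway-style schedule. It assumes the sway function has the form u + s(cos(πu/2) − 1 + u), with a coefficient s of -1 as an illustrative default; the function names `sway_sample` and `sway_schedule` are hypothetical and are not taken from the F5-TTS codebase.

```python
import math


def sway_sample(u: float, s: float = -1.0) -> float:
    """Map a uniform flow step u in [0, 1] to a swayed step.

    Assumed form: u + s * (cos(pi/2 * u) - 1 + u).
    A negative s pulls the steps toward the start of the flow.
    """
    return u + s * (math.cos(math.pi / 2 * u) - 1 + u)


def sway_schedule(num_steps: int, s: float = -1.0) -> list[float]:
    """Return num_steps + 1 time points running from 0 to 1."""
    return [sway_sample(i / num_steps, s) for i in range(num_steps + 1)]


# Illustrative usage: compare a uniform grid with the swayed one.
schedule = sway_schedule(4)
uniform = [i / 4 for i in range(5)]
for u, t in zip(uniform, schedule):
    print(f"uniform {u:.2f} -> swayed {t:.3f}")
```

With a negative coefficient, the interior points land earlier than their uniform counterparts, so more of the step budget is spent near the beginning of the flow; the endpoints stay at 0 and 1.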
F5-TTS ships its code under the MIT License, while the pre-trained models are released under CC-BY-NC. It provides Docker usage instructions and guidance for training and fine-tuning with Hugging Face Accelerate. Published base models are available on Hugging Face, ModelScope, and Wisemodel.
Key features
- Basic TTS with chunk inference
- Multi-style and multi-speaker generation
- Voice chat powered by Qwen2.5-3B-Instruct
- CLI inference
- Runtime deployment with Triton and TensorRT-LLM
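Chunk inference, listed above, generally means splitting long input text into shorter pieces that are synthesized one at a time and then joined. The sketch below shows one simple way to do the splitting step at sentence boundaries. It is an illustration under assumptions, not the project's implementation: the helper name `chunk_text` and the `max_chars` limit of 135 are hypothetical.

```python
import re


def chunk_text(text: str, max_chars: int = 135) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Illustrative usage: each chunk would be passed to the TTS model in turn.
for piece in chunk_text("One. Two. Three.", max_chars=9):
    print(piece)
```

Splitting on sentence boundaries keeps each chunk a natural prosodic unit, which tends to matter more for speech than hitting an exact length.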
Details
- First released: 2024
- Platforms: Web · CLI · Docker
- Deployment: self-hostable · docker
- Language: Python
- License: MIT (code); CC-BY-NC (models)
- Model access: Hugging Face · ModelScope · Wisemodel
