F5-TTS

Text-to-speech system for fluent, faithful speech generation with flow matching

Repository activity
  • Stars: 14.7k
  • Forks: 2.2k
  • Open Issues: 52
License

MIT

Languages
  • Python
  • Shell
  • Dockerfile

About F5-TTS

F5-TTS is a text-to-speech system that supports basic TTS, multi-style and multi-speaker generation, and voice chat. It is aimed at lower-latency inference and includes a CLI for running inference.

The model uses a Diffusion Transformer with ConvNeXt V2, plus Sway Sampling, a strategy for choosing flow steps at inference time. The project also includes a reproduction of E2 TTS, a Flat-UNet Transformer, and supports custom inference with additional languages. Runtime deployment is available with Triton and TensorRT-LLM.
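
The idea behind Sway Sampling can be sketched in a few lines. This is a minimal illustration, not the project's code: it assumes the mapping has the form t = u + s·(cos(πu/2) − 1 + u) applied to uniformly spaced steps u in [0, 1], and the function names `sway_sample` and `sway_schedule` and the default coefficient s = −1 are assumptions made for this example.

```python
import math


def sway_sample(u: float, s: float = -1.0) -> float:
    """Map a uniform flow step u in [0, 1] to a swayed step t.

    Assumed form: t = u + s * (cos(pi/2 * u) - 1 + u).
    A negative s moves steps toward the start of the flow,
    and s = 0 leaves the uniform schedule unchanged.
    """
    return u + s * (math.cos(math.pi / 2 * u) - 1 + u)


def sway_schedule(num_steps: int, s: float = -1.0) -> list[float]:
    """Return num_steps + 1 flow steps from 0 to 1 after applying the sway mapping."""
    return [sway_sample(i / num_steps, s) for i in range(num_steps + 1)]


if __name__ == "__main__":
    # Compare a uniform schedule (s = 0) with a swayed one (s = -1).
    print([round(t, 3) for t in sway_schedule(8, s=0.0)])
    print([round(t, 3) for t in sway_schedule(8, s=-1.0)])
```

With a negative coefficient the endpoints stay at 0 and 1 while the intermediate steps cluster near the start of the flow, so more of a fixed step budget is spent on the early part of generation.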

The F5-TTS code is released under the MIT License, while the pre-trained models are licensed under CC-BY-NC. The project documents Docker usage, as well as training and finetuning with Hugging Face Accelerate. Published base models are available on Hugging Face, ModelScope, and Wisemodel.

Key features

  • Basic TTS with chunk inference
  • Multi-style and multi-speaker generation
  • Voice chat powered by Qwen2.5-3B-Instruct
  • CLI inference
  • Runtime deployment with Triton and TensorRT-LLM

Details

First released
2024
Platforms
Web · CLI · Docker
Deployment
self-hostable · docker
Language
Python
License
MIT; models CC-BY-NC
Model access
Hugging Face · ModelScope · Wisemodel