Few-shot voice cloning and text-to-speech WebUI that can fine-tune from 1 minute of voice data
- Stars: 58.7k
- Forks: 6.4k
- Open Issues: 875
- License: MIT
- Languages: Python, Shell, CUDA

About GPT-SoVITS-WebUI
GPT-SoVITS-WebUI is a few-shot voice conversion and text-to-speech WebUI. It targets voice cloning workflows where only a short sample or a small training set is available: reference speech is used to clone a voice for synthesized text, with no need for a large recorded corpus.
Zero-shot TTS accepts a 5-second vocal sample for immediate text-to-speech conversion. Few-shot TTS can fine-tune with 1 minute of training data to improve voice similarity and realism. Cross-lingual inference supports English, Japanese, Korean, Cantonese, and Chinese, including inference in a language different from the training dataset.
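Since the two modes are described in terms of clip length (about 5 seconds for zero-shot, about 1 minute for few-shot), it can help to check a recording's duration before using it. The snippet below is a minimal sketch using only Python's standard `wave` module. The function names `wav_duration_seconds` and `long_enough` are hypothetical helpers, not part of GPT-SoVITS-WebUI, and whether the WebUI itself enforces a minimum length is not stated here.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        # Duration = number of audio frames / frames per second.
        return wav.getnframes() / wav.getframerate()


def long_enough(path: str, minimum_s: float = 5.0) -> bool:
    """Hypothetical check: is the clip at least `minimum_s` seconds long?

    The 5-second default mirrors the zero-shot sample length quoted above;
    pass 60.0 to check a clip against the 1-minute few-shot figure.
    """
    return wav_duration_seconds(path) >= minimum_s
```

For example, `long_enough("reference.wav")` would report whether a file named `reference.wav` (a placeholder path) holds at least 5 seconds of audio.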
The WebUI includes tools for separating vocals from accompaniment, automatic training set segmentation, Chinese ASR, and text labeling, all of which help build training datasets and GPT/SoVITS models. It can run from a Windows integrated package, from a Python environment via webui.py, or with Docker Compose; an AutoDL Cloud Docker option is available for users in China.
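To illustrate what automatic training set segmentation involves, here is a minimal sketch of one common approach: splitting audio wherever the signal stays quiet for long enough. This is an illustrative energy-threshold splitter, not the project's own slicer. The function name `split_on_silence` and its parameters and defaults are assumptions chosen for the example.

```python
from typing import List, Tuple


def split_on_silence(
    samples: List[float],
    sample_rate: int,
    threshold: float = 0.02,     # assumed RMS level below which a frame is "silent"
    min_silence_s: float = 0.3,  # assumed minimum pause that ends a segment
    frame_s: float = 0.02,       # analysis frame length in seconds
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of voiced segments."""
    frame_len = max(1, round(sample_rate * frame_s))
    min_silent_frames = max(1, round(min_silence_s / frame_s))

    # Classify each frame as voiced or silent by its RMS energy.
    voiced = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms = (sum(x * x for x in frame) / len(frame)) ** 0.5
        voiced.append(rms >= threshold)

    segments = []
    start = None     # frame index where the current segment began
    silent_run = 0   # consecutive silent frames seen inside a segment
    for idx, is_voiced in enumerate(voiced):
        if is_voiced:
            if start is None:
                start = idx
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silent_frames:
                # Close the segment at the first frame of the pause.
                end = idx - silent_run + 1
                segments.append((start * frame_len, end * frame_len))
                start = None
                silent_run = 0

    # Close a segment still open at the end, trimming trailing silence.
    if start is not None:
        end = len(voiced) - silent_run
        segments.append((start * frame_len, min(end * frame_len, len(samples))))
    return segments
```

Pauses shorter than `min_silence_s` stay inside a segment, so brief gaps between words do not fragment an utterance. Each returned index pair can then be used to cut the recording into clips for ASR and labeling.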
Key features
- Zero-shot TTS from a 5-second vocal sample
- Few-shot fine-tuning with 1 minute of training data
- Cross-lingual inference across English, Japanese, Korean, Cantonese, and Chinese
- Vocal/accompaniment separation and automatic training set segmentation
- Chinese ASR and text labeling tools for dataset creation
Details
- First released: 2024
- Interface: WebUI
- Training data: 5-second sample · 1-minute fine-tune
- Languages: EN · JA · KO · Cantonese · Chinese
- Dataset tools: Separation · segmentation · ASR · labeling
- Self-hosting: Python WebUI · Docker Compose