GPT-SoVITS-WebUI

Few-shot voice cloning and text-to-speech WebUI that can fine-tune from 1 minute of voice data

Repository activity

Stars: 58.7k
Forks: 6.4k
Open Issues: 875

License

MIT

Languages

Python
Shell
Cuda

Get it: GitHub

About GPT-SoVITS-WebUI

GPT-SoVITS-WebUI is a few-shot voice conversion and text-to-speech WebUI. It targets voice cloning workflows where only a short sample or a small training set is available: reference speech defines the voice used for synthesized text, so no large recorded corpus is required.

Zero-shot TTS accepts a 5-second vocal sample for immediate text-to-speech conversion. Few-shot TTS can fine-tune with 1 minute of training data to improve voice similarity and realism. Cross-lingual inference supports English, Japanese, Korean, Cantonese, and Chinese, including inference in a language different from the training dataset.
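
As a rough illustration of the two data sizes mentioned above, the sketch below measures a WAV clip with Python's standard wave module and reports which workflow it has enough audio for. The function names are hypothetical and not part of GPT-SoVITS-WebUI, and treating 5 seconds and 1 minute as minimum durations is an assumption made for this example rather than a rule the project enforces.

```python
import wave

# Durations taken from the description above. Using them as minimum
# thresholds is an assumption of this sketch, not project behavior.
ZERO_SHOT_SECONDS = 5.0
FEW_SHOT_SECONDS = 60.0


def wav_duration_seconds(source):
    """Return the length of a PCM WAV file (path or file object) in seconds."""
    with wave.open(source, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def suggest_workflow(duration_seconds):
    """Hypothetical helper: map a clip length to the workflow it can feed."""
    if duration_seconds >= FEW_SHOT_SECONDS:
        return "few-shot fine-tuning"
    if duration_seconds >= ZERO_SHOT_SECONDS:
        return "zero-shot TTS"
    return "too short"
```

For a local recording, the two helpers could be combined as suggest_workflow(wav_duration_seconds("reference.wav")), where reference.wav is a placeholder file name.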

The WebUI includes tools for separating vocals from accompaniment, automatic training set segmentation, Chinese ASR, and text labeling to help build training datasets and GPT/SoVITS models. It can run from a Windows integrated package, from a Python environment via webui.py, or through Docker Compose, with an AutoDL Cloud Docker option for users in China.

Key features

Zero-shot TTS from a 5-second vocal sample
Few-shot fine-tuning with 1 minute of training data
Cross-lingual inference across English, Japanese, Korean, Cantonese, and Chinese
Vocal and accompaniment separation and automatic training set segmentation
Chinese ASR and text labeling tools for dataset creation

Details

First released: 2024
Interface: WebUI
Training data: 5-sec sample · 1-min fine-tune
Languages: EN · JA · KO · Cantonese · Chinese
Dataset tools: Separation · segmentation · ASR · labeling
Self-hosting: Python WebUI · Docker Compose