GPT-SoVITS-WebUI

Few-shot voice cloning and text-to-speech WebUI that can fine-tune from 1 minute of voice data

Repository activity
  • Stars: 58.7k
  • Forks: 6.4k
  • Open Issues: 875
License

MIT

Languages
  • Python
  • Shell
  • Cuda
Get it: GitHub

About GPT-SoVITS-WebUI

GPT-SoVITS-WebUI is a few-shot voice conversion and text-to-speech WebUI. It targets voice cloning workflows where only a short sample or a small training set is available: reference speech supplies the voice for synthesized text, so no large recorded corpus is required.

Zero-shot TTS accepts a 5-second vocal sample for immediate text-to-speech conversion. Few-shot TTS can fine-tune with 1 minute of training data to improve voice similarity and realism. Cross-lingual inference supports English, Japanese, Korean, Cantonese, and Chinese, including inference in a language different from the training dataset.
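The two data thresholds above can be turned into a quick pre-check before opening the WebUI. The following is a minimal sketch, not part of GPT-SoVITS-WebUI itself: the helper names `wav_duration_seconds` and `suggest_mode` are hypothetical, it reads only uncompressed WAV files through Python's standard `wave` module, and it assumes the 5-second and 1-minute figures can be treated as minimum durations.

```python
import wave

# Thresholds taken from the description above; treating them as hard
# minimums is an assumption of this sketch, not a documented rule.
ZERO_SHOT_MIN_SECONDS = 5.0   # 5-second vocal sample
FEW_SHOT_MIN_SECONDS = 60.0   # 1 minute of training data


def wav_duration_seconds(path):
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def suggest_mode(total_seconds):
    """Suggest a workflow for the given amount of reference audio.

    Hypothetical helper: returns "few-shot", "zero-shot", or "too-short".
    """
    if total_seconds >= FEW_SHOT_MIN_SECONDS:
        return "few-shot"
    if total_seconds >= ZERO_SHOT_MIN_SECONDS:
        return "zero-shot"
    return "too-short"
```

For a folder of clips, summing `wav_duration_seconds` over the files and passing the total to `suggest_mode` indicates whether there is enough material to attempt fine-tuning or only zero-shot inference.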

The WebUI includes tools for separating vocals from accompaniment, automatic training set segmentation, Chinese ASR, and text labeling to help build training datasets and GPT/SoVITS models. It can run from a Windows integrated package, from a Python environment via webui.py, or with Docker Compose; an AutoDL Cloud Docker option is available for users in China.

Key features

  • Zero-shot TTS from a 5-second vocal sample
  • Few-shot fine-tuning with 1 minute of training data
  • Cross-lingual inference across English, Japanese, Korean, Cantonese, and Chinese
  • Vocal and accompaniment separation, plus automatic training set segmentation
  • Chinese ASR and text labeling tools for dataset creation

Details

  • First released: 2024
  • Interface: WebUI
  • Training data: 5-sec sample · 1-min fine-tune
  • Languages: EN · JA · KO · Cantonese · Chinese
  • Dataset tools: Separation · segmentation · ASR · labeling
  • Self-hosting: Python WebUI · Docker Compose