General-purpose speech recognition model for multilingual transcription, translation, and language identification
MIT
- Python

About Whisper
Whisper is a general-purpose speech recognition model that transcribes audio, translates speech into English, and identifies the spoken language. It is trained on diverse audio, and a single model can replace several stages of a traditional speech-processing pipeline.
It is a Transformer sequence-to-sequence model trained jointly on multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. The command-line tool transcribes audio files and accepts an option for the spoken language; the Python API loads a model and returns text from audio input.
Whisper's code and model weights are released under the MIT License. It is installed as a Python package and requires the ffmpeg command-line tool to be present on the system. Several model sizes are available, trading off accuracy against speed and memory.
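As a rough sketch of the two entry points described above, the snippet below assembles an argument list for the `whisper` command-line tool and wraps the Python API in a small function. `build_cli_command` and `transcribe_file` are hypothetical helper names introduced for illustration; the `--model`, `--language`, and `--task` options and the `whisper.load_model` / `model.transcribe` calls come from the `openai-whisper` package. The model name `base` and the file names are placeholders.

```python
from typing import List, Optional


def build_cli_command(audio_path: str, model: str = "base",
                      language: Optional[str] = None,
                      translate: bool = False) -> List[str]:
    """Assemble an argument list for the `whisper` CLI (hypothetical helper)."""
    cmd = ["whisper", audio_path, "--model", model]
    if language is not None:
        # Tell the tool which language is spoken instead of auto-detecting it.
        cmd += ["--language", language]
    if translate:
        # Translate the speech into English rather than transcribing it.
        cmd += ["--task", "translate"]
    return cmd


def transcribe_file(audio_path: str, model_name: str = "base") -> str:
    """Load a model and return the transcribed text (hypothetical helper)."""
    # Imported lazily so the rest of this sketch works without the package.
    # Assumes `pip install -U openai-whisper` and ffmpeg on the PATH.
    import whisper

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return result["text"]


if __name__ == "__main__":
    # Print the command line rather than running it.
    print(" ".join(build_cli_command("audio.mp3", language="Japanese", translate=True)))
```

The argument list can be passed to `subprocess.run` once the package is installed. Smaller models generally load and run faster, while larger ones tend to be more accurate.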
Key features
- Multilingual speech recognition
- Speech translation
- Language identification
- Voice activity detection
- Command-line transcription from audio files
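Language identification can also be called on its own through the lower-level Python API. The following is a minimal sketch, assuming the `openai-whisper` package and ffmpeg are installed; `identify_language` and `most_likely_language` are hypothetical helper names, and `base` is a placeholder model name.

```python
from typing import Dict


def most_likely_language(probs: Dict[str, float]) -> str:
    """Return the language code with the highest probability."""
    return max(probs, key=probs.get)


def identify_language(audio_path: str, model_name: str = "base") -> str:
    """Detect the spoken language of an audio file (hypothetical helper)."""
    # Imported lazily so the pure helper above works without the package.
    import whisper

    model = whisper.load_model(model_name)
    # Load the audio and pad or trim it to the 30-second window the model expects.
    audio = whisper.pad_or_trim(whisper.load_audio(audio_path))
    # Compute the log-Mel spectrogram and move it to the model's device.
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
    # detect_language returns per-language probabilities keyed by language code.
    _, probs = model.detect_language(mel)
    return most_likely_language(probs)
```

Only the first 30 seconds of audio are examined in this sketch, so a recording that switches language later would be labeled by its opening segment.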
Details
- First released: 2022
- Platforms: CLI
- Deployment: Offline-first
- License: MIT
- Runtime: Python 3.8-3.11
- Dependency: ffmpeg
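The runtime and dependency entries above can be checked before installation. The sketch below uses only the Python standard library; the function names are hypothetical, and the version bounds are copied from the Runtime entry, so they reflect the listed range rather than a hard limit enforced by the package.

```python
import shutil
import sys
from typing import Optional, Tuple

# Bounds taken from the "Runtime" entry in the details list (an assumption,
# not a limit read from the package itself).
SUPPORTED_MIN = (3, 8)
SUPPORTED_MAX = (3, 11)


def python_version_supported(version: Optional[Tuple[int, ...]] = None) -> bool:
    """Check whether a (major, minor) version falls inside the listed range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX


def ffmpeg_available() -> bool:
    """Check whether an `ffmpeg` executable can be found on the PATH."""
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    print("Python version in listed range:", python_version_supported())
    print("ffmpeg found on PATH:", ffmpeg_available())
```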
