Inference server for open source embedding and sequence classification models with Docker backend images
- Stars4.9k
- Forks399
- Open Issues184
Apache-2.0
- Rust
- Python
- JavaScript

About Text Embeddings Inference
Text Embeddings Inference is a toolkit for deploying and serving open source text embeddings and sequence classification models. It gives applications an inference endpoint for embedding extraction, re-ranking, sequence classification, and sparse embeddings.
TEI supports Nomic, BERT, CamemBERT, XLM-RoBERTa, JinaBERT, Mistral, Alibaba GTE, Qwen2, MPNet, ModernBERT, Qwen3, and Gemma3 models. It uses token based dynamic batching, optimized inference with Flash Attention, Candle, and cuBLASLt, and Safetensors or ONNX weight loading. It exposes a REST API, Swagger UI, and gRPC images.
Deployment options include Docker images for specific backends and local installation. It supports Metal for local Macs, CPU-only ARM64 hosts, NVIDIA GPUs with CUDA 12.2 or higher drivers, and experimental AMD Instinct support through ROCm. Private or gated Hugging Face models can use HF_TOKEN; instrumentation includes OpenTelemetry tracing and Prometheus metrics.
Key features
- Serves text embeddings, re-rankers, sequence classification, and sparse embeddings
- Token based dynamic batching
- Optimized inference with Flash Attention, Candle, and cuBLASLt
- Safetensors and ONNX weight loading
- OpenTelemetry tracing and Prometheus metrics
Details
- First released
- 2023
- Self-hosting
- Docker images · local install
- API
- REST · Swagger UI · gRPC
- Backends
- CUDA · CPU · Metal · ROCm experimental
- Weights
- Safetensors · ONNX
- Observability
- OpenTelemetry · Prometheus
