Text Embeddings Inference

Inference server for open-source text embedding and sequence classification models, with Docker images for specific backends

Repository activity
  • Stars: 4.9k
  • Forks: 399
  • Open issues: 184
License

Apache-2.0

Languages
  • Rust
  • Python
  • JavaScript
Get it: Docs · GitHub
About Text Embeddings Inference

Text Embeddings Inference (TEI) is a toolkit for deploying and serving open-source text embedding and sequence classification models. It gives applications an inference endpoint for embedding extraction, re-ranking, sequence classification, and sparse embeddings.
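
As a rough illustration of what calling such an endpoint might look like, the sketch below builds JSON requests for embedding extraction and re-ranking using only the Python standard library. The base URL, the `/embed` and `/rerank` routes, and the `inputs`, `query`, and `texts` field names are assumptions not stated on this page; check them against the Swagger UI of your own deployment before relying on them.

```python
import json
import urllib.request

# Assumption: a TEI server is reachable here; host and port are placeholders.
BASE_URL = "http://localhost:8080"


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the given route."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed_request(texts: list[str]) -> urllib.request.Request:
    # Assumed route and field name: POST /embed with an "inputs" field.
    return build_request("/embed", {"inputs": texts})


def rerank_request(query: str, texts: list[str]) -> urllib.request.Request:
    # Assumed route and field names: POST /rerank with "query" and "texts".
    return build_request("/rerank", {"query": query, "texts": texts})


def send(request: urllib.request.Request):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Building the request separately from sending it keeps the payload easy to inspect: `send(embed_request(["What is deep learning?"]))` would perform the actual call once a server is running.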

TEI supports Nomic, BERT, CamemBERT, XLM-RoBERTa, JinaBERT, Mistral, Alibaba GTE, Qwen2, MPNet, ModernBERT, Qwen3, and Gemma3 models. It uses token-based dynamic batching, optimized inference with Flash Attention, Candle, and cuBLASLt, and loads weights in Safetensors or ONNX format. It exposes a REST API with a Swagger UI, and gRPC is available through dedicated images.
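
The idea behind token-based dynamic batching can be sketched as follows: instead of grouping a fixed number of requests, the server groups queued requests until their combined token count would exceed a budget. This is a simplified greedy illustration, not TEI's actual scheduler; `PendingRequest`, `form_batches`, and `max_batch_tokens` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class PendingRequest:
    request_id: str
    token_count: int  # number of tokens after tokenization


def form_batches(queue, max_batch_tokens):
    """Greedily group queued requests, in arrival order, under a token budget."""
    batches = []
    current = []
    current_tokens = 0
    for req in queue:
        if req.token_count > max_batch_tokens:
            raise ValueError(f"{req.request_id} exceeds the token budget")
        # Close the current batch when adding this request would overflow it.
        if current and current_tokens + req.token_count > max_batch_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(req)
        current_tokens += req.token_count
    if current:
        batches.append(current)
    return batches


queue = [
    PendingRequest("a", 300),
    PendingRequest("b", 200),
    PendingRequest("c", 600),
    PendingRequest("d", 100),
    PendingRequest("e", 900),
]
batch_ids = [[r.request_id for r in b] for b in form_batches(queue, 1000)]
print(batch_ids)  # [['a', 'b'], ['c', 'd'], ['e']]
```

Budgeting by tokens rather than by request count keeps the work per batch roughly bounded even when input lengths vary widely, so many short inputs can share a batch while a long input gets one nearly to itself.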

Deployment options include Docker images for specific backends and local installation. Supported hardware includes Macs through Metal (local installation), CPU-only ARM64 hosts, NVIDIA GPUs with drivers supporting CUDA 12.2 or higher, and AMD Instinct GPUs through experimental ROCm support. Private or gated Hugging Face models can be accessed by setting HF_TOKEN. Instrumentation includes OpenTelemetry tracing and Prometheus metrics.
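
A Docker launch along these lines might be assembled as in the sketch below. It only builds the argument list and does not run anything. The image reference is deliberately left as a parameter because it depends on the backend you target; the container port `80` and the `--model-id` server flag are assumptions not stated on this page, and `build_docker_command` is a hypothetical helper.

```python
def build_docker_command(image, model_id, host_port=8080,
                         use_gpu=False, forward_hf_token=False):
    """Assemble a `docker run` argument list for a TEI container."""
    cmd = ["docker", "run", "--rm"]
    if use_gpu:
        # Expose all NVIDIA GPUs to the container.
        cmd += ["--gpus", "all"]
    # Assumption: the server listens on port 80 inside the container.
    cmd += ["-p", f"{host_port}:80"]
    if forward_hf_token:
        # `-e NAME` with no value forwards NAME from the host environment,
        # so the token itself never appears in the command line.
        cmd += ["-e", "HF_TOKEN"]
    # Assumption: the server takes the model through a --model-id flag.
    cmd += [image, "--model-id", model_id]
    return cmd


command = build_docker_command(
    image="<tei-image-for-your-backend>",  # placeholder, supply a real image
    model_id="<org>/<model>",              # placeholder model identifier
    use_gpu=True,
    forward_hf_token=True,
)
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` once the placeholders are replaced with a real image and model.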

Key features

  • Serves text embeddings, re-rankers, sequence classification, and sparse embeddings
  • Token-based dynamic batching
  • Optimized inference with Flash Attention, Candle, and cuBLASLt
  • Safetensors and ONNX weight loading
  • OpenTelemetry tracing and Prometheus metrics

Details

First released
2023
Self-hosting
Docker images · local install
API
REST · Swagger UI · gRPC
Backends
CUDA · CPU · Metal · ROCm (experimental)
Weights
Safetensors · ONNX
Observability
OpenTelemetry · Prometheus