Text Embeddings Inference

Inference server for open-source embedding and sequence classification models, with Docker images for each backend

Repository activity

Stars: 4.9k
Forks: 399
Open Issues: 184

License

Apache-2.0

Languages

Rust
Python
JavaScript

Get it: Docs · GitHub

About Text Embeddings Inference

Text Embeddings Inference (TEI) is a toolkit for deploying and serving open-source text embedding and sequence classification models. It gives applications an inference endpoint for embedding extraction, re-ranking, sequence classification, and sparse embeddings.
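As a rough illustration of what calling such an endpoint can look like, the sketch below prepares JSON requests for embedding extraction and re-ranking. The base URL, the `/embed` and `/rerank` routes, and the body shapes are assumptions made for this example and should be checked against the API documentation of the deployed version; the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: a TEI server is reachable here; adjust for your deployment.
BASE_URL = "http://localhost:8080"


def build_request(route: str, payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the given route."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{route.lstrip('/')}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed_request(texts: list[str]) -> urllib.request.Request:
    # Assumed route and body shape: POST /embed with {"inputs": [...]}.
    return build_request("embed", {"inputs": texts})


def rerank_request(query: str, texts: list[str]) -> urllib.request.Request:
    # Assumed route and body shape: POST /rerank with {"query": ..., "texts": [...]}.
    return build_request("rerank", {"query": query, "texts": texts})


def send(request: urllib.request.Request, timeout: float = 30.0):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `send(embed_request(["What is deep learning?"]))` would return the decoded JSON response. Building the request separately from sending it keeps the payload easy to inspect without a live endpoint.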

TEI supports Nomic, BERT, CamemBERT, XLM-RoBERTa, JinaBERT, Mistral, Alibaba GTE, Qwen2, MPNet, ModernBERT, Qwen3, and Gemma3 models. It uses token-based dynamic batching, runs optimized inference with Flash Attention, Candle, and cuBLASLt, and loads weights from Safetensors or ONNX files. It exposes a REST API with Swagger UI documentation and offers gRPC through separate images.
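The idea behind token-based batching is that a batch is limited by its total token count rather than by a fixed number of requests, so many short inputs or a few long ones can share one forward pass. The sketch below is a simplified greedy grouping written for illustration only; it is not TEI's scheduler, and `PendingRequest`, `batch_by_tokens`, and `max_batch_tokens` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class PendingRequest:
    request_id: str
    token_count: int  # number of tokens after tokenization


def batch_by_tokens(queue: list[PendingRequest], max_batch_tokens: int) -> list[list[PendingRequest]]:
    """Greedily group queued requests so each batch stays within a token budget."""
    batches: list[list[PendingRequest]] = []
    current: list[PendingRequest] = []
    used = 0
    for req in queue:
        if req.token_count > max_batch_tokens:
            # A single request larger than the budget can never be scheduled.
            raise ValueError(f"{req.request_id} exceeds the token budget on its own")
        if current and used + req.token_count > max_batch_tokens:
            # Adding this request would overflow the budget: close the batch.
            batches.append(current)
            current, used = [], 0
        current.append(req)
        used += req.token_count
    if current:
        batches.append(current)
    return batches
```

A real server would also weigh arrival time and padding, which this sketch ignores; it only shows how a token budget, rather than a request count, bounds each batch.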

Deployment options include backend-specific Docker images and local installation. TEI supports Metal on local Macs, CPU-only ARM64 hosts, NVIDIA GPUs with drivers compatible with CUDA 12.2 or higher, and, experimentally, AMD Instinct GPUs through ROCm. Private or gated Hugging Face models can be accessed by setting HF_TOKEN. Instrumentation includes OpenTelemetry tracing and Prometheus metrics.
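A Docker deployment along these lines can be sketched by assembling the `docker run` arguments in code. The example below only builds the argument list and executes nothing. The image reference is left to the caller because it depends on the backend and release; the container port of 80 and the `--model-id` server flag are assumptions to verify against the project documentation, and `docker_run_args` is a hypothetical helper.

```python
def docker_run_args(
    image: str,
    model_id: str,
    host_port: int = 8080,
    use_gpu: bool = False,
    pass_hf_token: bool = False,
) -> list[str]:
    """Assemble an argv list for `docker run`; nothing is executed here."""
    # Assumption: the server inside the container listens on port 80.
    args = ["docker", "run", "--rm", "-p", f"{host_port}:80"]
    if use_gpu:
        # Expose all NVIDIA GPUs to the container.
        args += ["--gpus", "all"]
    if pass_hf_token:
        # `-e NAME` without a value forwards NAME from the host environment,
        # so the token itself never appears on the command line.
        args += ["-e", "HF_TOKEN"]
    # Assumption: the model is selected with a `--model-id` server flag.
    args += [image, "--model-id", model_id]
    return args
```

The resulting list can be handed to `subprocess.run(args, check=True)` once the image and model names have been filled in for the target backend.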

Key features

Serves text embeddings, re-rankers, sequence classification, and sparse embeddings
Token-based dynamic batching
Optimized inference with Flash Attention, Candle, and cuBLASLt
Safetensors and ONNX weight loading
OpenTelemetry tracing and Prometheus metrics

Details

First released: 2023
Self-hosting: Docker images · local install
API: REST · Swagger UI · gRPC
Backends: CUDA · CPU · Metal · ROCm experimental
Weights: Safetensors · ONNX
Observability: OpenTelemetry · Prometheus