High-throughput REST API for serving text embeddings, reranking, CLIP, CLAP, and ColPali models
- Stars: 2.8k
- Forks: 192
- Open Issues: 127
MIT
- Python
- Makefile
- Jinja

About Infinity
Infinity is a high-throughput, low-latency REST API for serving text embeddings, reranking models, CLIP, CLAP, and ColPali. It lets you deploy any embedding, reranking, or sentence-transformer model from Hugging Face behind one API.
The server is built on FastAPI and supports several inference backends: PyTorch, optimum (for ONNX and TensorRT), and CTranslate2, with FlashAttention used to get more out of the accelerator. It runs on NVIDIA CUDA, AMD ROCm, CPU, AWS INF2, or Apple MPS. Requests are batched dynamically, and tokenization runs in dedicated worker threads. Multiple models can be served side by side, with Infinity orchestrating them.
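To make the idea of dynamic batching concrete, here is a minimal generic sketch. It is not Infinity's implementation: the function name `drain_batch`, the batch size, and the wait time are all hypothetical choices for illustration. The sketch collects queued requests until either the batch is full or a short deadline passes, so a single model call can serve several requests.

```python
import queue
import time


def drain_batch(pending, max_batch_size=32, max_wait_s=0.01):
    """Collect up to max_batch_size items, waiting at most max_wait_s in total.

    Generic illustration of dynamic batching; names and defaults are hypothetical.
    """
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # deadline reached: ship whatever has been collected
        try:
            batch.append(pending.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived before the deadline
    return batch


if __name__ == "__main__":
    pending = queue.Queue()
    for text in ["a", "b", "c", "d", "e"]:
        pending.put(text)
    # The first call fills a batch of three; the second returns the leftovers.
    print(drain_batch(pending, max_batch_size=3, max_wait_s=0.2))
    print(drain_batch(pending, max_batch_size=3, max_wait_s=0.2))
```

The trade-off is between latency and throughput: a longer wait yields fuller batches, while a shorter wait returns results sooner under light load.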
The OpenAPI surface is aligned with OpenAI's embeddings spec. CLI v2 accepts every argument either as a command-line flag or as an environment variable, and pre-built Docker images cover CUDA, ROCm, and ONNX-GPU. A separate Python client package is published for application use.
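Because the API follows OpenAI's embeddings spec, a request can be sketched with the standard library alone. The host, port, `/embeddings` path, and model name below are assumptions for illustration and may differ in your deployment; the request and response shapes follow OpenAI's embeddings format (`model` and `input` in, a `data` list of `embedding` entries out).

```python
import json
import urllib.request

# Assumptions for illustration only: adjust to match your own deployment.
BASE_URL = "http://localhost:7997"
MODEL = "BAAI/bge-small-en-v1.5"


def build_embeddings_request(texts, model=MODEL, base_url=BASE_URL):
    """Build a POST request carrying an OpenAI-style embeddings payload."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings_response(raw):
    """Return the embedding vectors from an OpenAI-style response, in input order."""
    payload = json.loads(raw)
    ordered = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in ordered]


if __name__ == "__main__":
    req = build_embeddings_request(["hello world"])
    # Inspect the request without sending it.
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

To send the request against a running server, pass it to `urllib.request.urlopen(req)` and feed the response body to `parse_embeddings_response`.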
Key features
- REST API for embeddings, reranking, CLIP, CLAP, and ColPali
- Deploy models from Hugging Face
- Dynamic batching with worker-thread tokenization
- FastAPI-based server with an API aligned to OpenAI's embeddings spec
- Docker images for CUDA, ROCm, and ONNX-GPU
Details
- First released: 2023
- License: MIT
- Platforms: Linux · Docker · CLI
- Deployment: self-hostable · docker
- Accelerators: CUDA · ROCm · CPU · AWS INF2 · Apple MPS
- Model types: Embeddings · reranking · CLIP · CLAP · ColPali