Infinity

High-throughput REST API for serving text embeddings, reranking, CLIP, CLAP, and ColPali models

Repository activity

Stars: 2.8k
Forks: 192
Open Issues: 127

License

MIT

Languages

Python
Makefile
Jinja

Get it: Website · GitHub · Docker

About Infinity

Infinity is a high-throughput, low-latency REST API for serving text embeddings, reranking models, CLIP, CLAP, and ColPali. It lets you deploy any embedding, reranking, or sentence-transformer model from Hugging Face behind one API.

The server is built on FastAPI and supports several inference backends: PyTorch, Optimum (for ONNX and TensorRT), and CTranslate2. It also uses FlashAttention. It runs on NVIDIA CUDA, AMD ROCm, CPU, AWS INF2, or Apple MPS. Incoming requests are grouped by dynamic batching, and tokenization runs in dedicated worker threads. You can mix and match multiple models and let Infinity orchestrate them.
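To make the idea of dynamic batching concrete, here is a minimal, generic sketch in Python. It is not Infinity's implementation: the class name DynamicBatcher, its parameters (max_batch_size, max_wait_s), and the toy batch function are all hypothetical and chosen only for illustration. The sketch shows one worker thread that waits for a first request, keeps collecting further requests for a short time window, and then processes the whole group in one call.

```python
import queue
import threading
import time
from concurrent.futures import Future


class DynamicBatcher:
    """Hypothetical helper: groups single requests into batches for one call."""

    def __init__(self, batch_fn, max_batch_size=32, max_wait_s=0.01):
        self._batch_fn = batch_fn              # takes a list, returns a list
        self._max_batch_size = max_batch_size  # upper bound on batch size
        self._max_wait_s = max_wait_s          # how long to wait for more items
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, item):
        """Enqueue one item and return a Future for its result."""
        future = Future()
        self._queue.put((item, future))
        return future

    def _run(self):
        while True:
            # Block until at least one request is waiting.
            batch = [self._queue.get()]
            deadline = time.monotonic() + self._max_wait_s
            # Keep filling the batch until it is full or the window closes.
            while len(batch) < self._max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self._queue.get(timeout=remaining))
                except queue.Empty:
                    break
            inputs = [item for item, _ in batch]
            try:
                outputs = self._batch_fn(inputs)
                for (_, future), output in zip(batch, outputs):
                    future.set_result(output)
            except Exception as exc:
                # A failed batch fails every request that was part of it.
                for _, future in batch:
                    future.set_exception(exc)


if __name__ == "__main__":
    # Toy stand-in for a model call: "embeds" each text as its length.
    batcher = DynamicBatcher(lambda texts: [len(t) for t in texts])
    futures = [batcher.submit(t) for t in ["a", "bb", "ccc"]]
    print([f.result(timeout=5) for f in futures])
```

The trade-off in a design like this is between latency and throughput: a longer wait window tends to produce larger batches, at the cost of a small delay for the first request in each batch.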

The OpenAPI surface is aligned with OpenAI's embeddings spec. In the v2 CLI, every argument can be set either as a command-line flag or through an environment variable. Pre-built Docker images cover CUDA, ROCm, and ONNX-GPU. A separate Python client package is published for application use.
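Because the API follows OpenAI's embeddings spec, a request can be sketched with nothing but the Python standard library. The example below is an illustration under assumptions, not an official client: the base URL http://localhost:7997, the /embeddings path, and the model name are placeholders that should be checked against your own deployment, and the helper names build_embeddings_request and embed are hypothetical. The request body uses the "model" and "input" fields from the OpenAI spec, and the response is read from its "data" list.

```python
import json
import urllib.request


def build_embeddings_request(base_url, model, texts):
    """Build an OpenAI-style embeddings POST request (hypothetical helper)."""
    payload = {"model": model, "input": list(texts)}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embeddings",  # path is an assumption
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(base_url, model, texts, timeout=30.0):
    """Send the request and return one embedding vector per input text."""
    request = build_embeddings_request(base_url, model, texts)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-style responses carry one object per input under "data",
    # each with an "index" and an "embedding" field.
    items = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


if __name__ == "__main__":
    # Assumed values: adjust the URL and model name to your own server.
    vectors = embed(
        "http://localhost:7997",
        "BAAI/bge-small-en-v1.5",
        ["What is dynamic batching?"],
    )
    print(len(vectors), "embedding(s) returned")
```

Separating request construction from sending keeps the payload easy to inspect without a running server.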

Key features

REST API for embeddings, reranking, CLIP, CLAP, and ColPali
Deploy models from Hugging Face
Dynamic batching with worker-thread tokenization
FastAPI-based server with OpenAPI-aligned API
Docker images for CUDA, ROCm, and ONNX-GPU

Details

First released: 2023
License: MIT
Platforms: Linux · Docker · CLI
Deployment: Self-hostable · Docker
Accelerators: CUDA · ROCm · CPU · AWS INF2 · Apple MPS
Model types: Embeddings · reranking · CLIP · CLAP · ColPali