Infinity

High-throughput REST API for serving text embeddings, reranking, CLIP, CLAP, and ColPali models

Repository activity
  • Stars: 2.8k
  • Forks: 192
  • Open Issues: 127
License

MIT

Languages
  • Python
  • Makefile
  • Jinja

About Infinity

Infinity is a high-throughput, low-latency REST API for serving text embeddings, reranking models, CLIP, CLAP, and ColPali. It lets you deploy any embedding, reranking, or sentence-transformer model from Hugging Face behind one API.

The server is built on FastAPI. Its inference backends are PyTorch, optimum (for ONNX and TensorRT), and CTranslate2, and it uses FlashAttention. It runs on NVIDIA CUDA, AMD ROCm, CPU, AWS INF2, or Apple MPS. Incoming requests are batched dynamically, and tokenization runs in dedicated worker threads. You can serve multiple models side by side and let Infinity orchestrate them.
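To make the dynamic-batching idea concrete, here is a minimal sketch of the general technique in plain Python. It is not Infinity's implementation: the `DynamicBatcher` class, its parameters (`max_batch_size`, `max_wait_s`), and the `fake_embed` stand-in model are all hypothetical names chosen for illustration. A worker thread waits for the first queued request, keeps collecting more until the batch is full or a short deadline passes, and then makes a single batched call.

```python
import queue
import threading
import time
from concurrent.futures import Future


class DynamicBatcher:
    """Toy dynamic batcher (hypothetical, not Infinity's code)."""

    def __init__(self, batch_fn, max_batch_size=4, max_wait_s=0.01):
        self._batch_fn = batch_fn              # called once per batch of inputs
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_s
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, text):
        """Enqueue one input and return a Future for its result."""
        future = Future()
        self._queue.put((text, future))
        return future

    def _run(self):
        while True:
            # Block until at least one request is waiting.
            batch = [self._queue.get()]
            deadline = time.monotonic() + self._max_wait_s
            # Keep filling the batch until it is full or the deadline passes.
            while len(batch) < self._max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self._queue.get(timeout=remaining))
                except queue.Empty:
                    break
            texts = [text for text, _ in batch]
            try:
                results = self._batch_fn(texts)
                for (_, future), result in zip(batch, results):
                    future.set_result(result)
            except Exception as exc:
                # Report the failure to every caller in this batch.
                for _, future in batch:
                    future.set_exception(exc)


def fake_embed(texts):
    """Stand-in for a model: one 1-dimensional 'embedding' per input."""
    return [[float(len(text))] for text in texts]


batcher = DynamicBatcher(fake_embed, max_batch_size=4, max_wait_s=0.05)
futures = [batcher.submit(t) for t in ["a", "bb", "ccc", "dddd", "eeeee"]]
vectors = [f.result(timeout=5) for f in futures]
print(vectors)
```

The trade-off in this sketch is the usual one: a larger `max_wait_s` yields fuller batches and better accelerator utilization, at the cost of added latency for the first request in each batch.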

The API's OpenAPI schema is aligned with OpenAI's embeddings spec. The v2 CLI accepts every argument either as a command-line flag or as an environment variable, and pre-built Docker images cover CUDA, ROCm, and ONNX-GPU. A separate Python client package is published for application use.
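Because the API follows OpenAI's embeddings spec, a request can be sketched with the standard library alone. The example below is a hedged illustration, not the published client package: the base URL, the port, the `/embeddings` path, and the model name are assumptions to adjust for your deployment, and the helper names (`build_embeddings_request`, `parse_embeddings_response`, `embed`) are hypothetical. The request and response shapes follow the OpenAI embeddings format (`model` and `input` in, a `data` list of `{index, embedding}` objects out).

```python
import json
import urllib.request

# Assumption: where the server listens; change to match your deployment.
BASE_URL = "http://localhost:7997"


def build_embeddings_request(model, texts, base_url=BASE_URL):
    """Build an OpenAI-style embeddings POST request (nothing is sent yet)."""
    payload = {"model": model, "input": texts}
    return urllib.request.Request(
        url=f"{base_url}/embeddings",  # assumed path, mirroring OpenAI's API
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings_response(body):
    """Return the embedding vectors from an OpenAI-style body, in input order."""
    items = sorted(body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(model, texts, base_url=BASE_URL):
    """Send the request to a running server and return the vectors."""
    request = build_embeddings_request(model, texts, base_url)
    with urllib.request.urlopen(request) as response:
        return parse_embeddings_response(json.load(response))
```

With a server running, a call might look like `embed("BAAI/bge-small-en-v1.5", ["hello world"])`, where the model name is only an example of a Hugging Face embedding model and must match whatever the server was started with.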

Key features

  • REST API for embeddings, reranking, CLIP, CLAP, and ColPali
  • Deploy models from Hugging Face
  • Dynamic batching with worker-thread tokenization
  • FastAPI-based server with an OpenAPI schema aligned to OpenAI's API
  • Docker images for CUDA, ROCm, and ONNX-GPU

Details

First released
2023
License
MIT
Platforms
Linux · Docker · CLI
Deployment
Self-hostable · Docker
Accelerators
CUDA · ROCm · CPU · AWS INF2 · Apple MPS
Model types
Embeddings · reranking · CLIP · CLAP · ColPali