Text Generation Inference (TGI)

What it is

Text Generation Inference (TGI) is a specialized toolkit for deploying and serving Large Language Models (LLMs). Developed by Hugging Face, it is designed for high-performance text generation in production environments.

What problem it solves

TGI addresses the engineering challenges of serving LLMs at scale. It implements tensor parallelism for multi-GPU inference, continuous batching of incoming requests to maximize throughput, and optimized CUDA kernels for faster generation, with a Rust-based router in front of the model server.
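
The batching idea can be illustrated with a toy simulation. This is a minimal sketch of continuous batching in general, not TGI's scheduler: the function name, the fixed batch size, and the "one token per request per step" model are all simplifying assumptions made for illustration.

```python
from collections import deque


def simulate_continuous_batching(token_budgets, max_batch_size):
    """Return the request ids that are active at each decode step.

    token_budgets[i] is the number of tokens request i still has to generate.
    Hypothetical toy model: every running request emits one token per step.
    """
    waiting = deque(enumerate(token_budgets))  # (request_id, tokens_left)
    running = {}                               # request_id -> tokens_left
    steps = []
    while waiting or running:
        # Admit waiting requests as soon as a slot is free, rather than
        # waiting for the whole current batch to finish.
        while waiting and len(running) < max_batch_size:
            req_id, tokens_left = waiting.popleft()
            running[req_id] = tokens_left
        steps.append(sorted(running))
        # One decode step: each running request produces one token.
        for req_id in list(running):
            running[req_id] -= 1
            if running[req_id] <= 0:
                del running[req_id]
    return steps


if __name__ == "__main__":
    # Request 2 takes over request 1's slot as soon as request 1 finishes.
    for step, active in enumerate(simulate_continuous_batching([3, 1, 2], 2)):
        print(step, active)
```

With budgets of 3, 1, and 2 tokens and two slots, the toy run finishes in three steps; a static scheme that waits for the first batch to drain before admitting the third request would need five.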

Where it fits in the stack

Infra. It provides the high-performance serving layer for Hugging Face models, bridging the gap between raw weights and a production-ready API.

Typical use cases

  • Enterprise-grade LLM APIs: Powering internal or external model services with high reliability.
  • Multi-GPU Deployment: Serving very large models (e.g., Llama-3-70B) that require tensor parallelism.
  • Real-time Chat: Production backends for applications like HuggingChat that require streaming responses.
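
For the streaming use case, a client typically reads server-sent events line by line. The sketch below assumes each event arrives as a `data:` line carrying a JSON object with a `token` field that has `text` and `special` keys, which matches the shape commonly returned by TGI's `/generate_stream` endpoint; verify it against the version you run. The function name is hypothetical.

```python
import json


def iter_stream_tokens(lines):
    """Yield token texts from the lines of a server-sent event stream.

    Assumed event shape: data:{"token": {"text": ..., "special": ...}, ...}
    """
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        event = json.loads(line[len("data:"):])
        token = event.get("token", {})
        if not token.get("special", False):
            yield token.get("text", "")


if __name__ == "__main__":
    sample = [
        b'data:{"token":{"id":1,"text":"Hello","special":false}}',
        b"",
        b'data:{"token":{"id":2,"text":" world","special":false}}',
    ]
    print("".join(iter_stream_tokens(sample)))
```

The generator accepts any iterable of lines, so it can be fed directly from an HTTP response object that is iterated line by line.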

Strengths

  • Production-Hardened: Battle-tested at Hugging Face for their own Inference API.
  • Advanced Optimizations: Includes Flash Attention, Paged Attention, and optimized kernels.
  • Flexible Serving: Supports a wide range of Hugging Face models out of the box.
  • Enterprise Features: Robust monitoring, streaming support, and Prometheus metrics.
  • Multi-LoRA: Efficiently serve multiple fine-tuned adapters on a single base model.

Limitations

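  • Licensing: The 1.x releases used the Hugging Face Optimized Inference License (HFOIL), which restricts offering TGI as a hosted commercial service. Hugging Face returned the project to Apache 2.0 starting with the 2.0 release, so check the license of the version you deploy.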
  • Setup Complexity: Docker is the primary and recommended way to run it, which may be a barrier for some environments.

When to use it

  • When you need a highly optimized, production-ready server for LLMs in the Hugging Face ecosystem.
  • When you need to scale models across multiple GPUs efficiently.
  • When serving multiple LoRA adapters simultaneously is a requirement.

When not to use it

  • For local development on consumer hardware where simpler tools like Ollama or llama.cpp suffice.
  • If you are pinned to a 1.x release and your commercial model conflicts with the HFOIL license terms.

Licensing and cost

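  • Open Source: Yes (Apache 2.0 from release 2.0 onward; the 1.x releases used HFOIL v1.0, which is source-available rather than open source)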
  • Cost: Free
  • Self-hostable: Yes

Getting started

Installation (Docker)

Docker is the recommended way to run TGI.

Minimal CLI Example

model=google/gemma-2b
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id $model

Advanced Serving: Multi-LoRA and Quantization

TGI can serve multiple LoRA adapters on a single base model and supports several quantization schemes, including AWQ, GPTQ, and bitsandbytes (8-bit and NF4).

docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Meta-Llama-3-8B-Instruct \
    --quantize bitsandbytes-nf4 \
    --lora-adapters "adapter_1=path/to/lora1,adapter_2=path/to/lora2"
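
When the launcher arguments are generated by a deployment script, the `name=source` adapter string is easy to get wrong by hand. The following is a minimal sketch, assuming the comma-separated `name=source` format shown in the command above; the helper names are hypothetical, and `--num-shard` is included as the launcher option for splitting a model across GPUs.

```python
def lora_adapters_arg(adapters):
    """Format {name: path_or_repo} as a comma-separated name=source string."""
    for name, source in adapters.items():
        # Commas and equals signs would make the string ambiguous.
        if "," in name or "=" in name or "," in source:
            raise ValueError(f"unsupported character in adapter entry: {name!r}")
    return ",".join(f"{name}={source}" for name, source in adapters.items())


def launcher_args(model_id, adapters=None, quantize=None, num_shard=None):
    """Build the argument list passed to the TGI container after the image name."""
    args = ["--model-id", model_id]
    if quantize:
        args += ["--quantize", quantize]
    if num_shard:
        args += ["--num-shard", str(num_shard)]
    if adapters:
        args += ["--lora-adapters", lora_adapters_arg(adapters)]
    return args


if __name__ == "__main__":
    print(launcher_args(
        "meta-llama/Meta-Llama-3-8B-Instruct",
        adapters={"adapter_1": "path/to/lora1", "adapter_2": "path/to/lora2"},
        quantize="bitsandbytes-nf4",
    ))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting.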

Querying the API

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{
        "inputs":"The future of AI is",
        "parameters":{
            "max_new_tokens":20,
            "adapter_id": "adapter_1"
        }
    }' \
    -H 'Content-Type: application/json'
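
The same request can be issued from application code. This is a minimal sketch using only the Python standard library; it assumes the server from the examples above is listening on 127.0.0.1:8080 and that the response JSON carries a `generated_text` field. The function names are hypothetical.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=20, adapter_id=None,
                           base_url="http://127.0.0.1:8080"):
    """Build a POST request for the /generate endpoint without sending it."""
    parameters = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        parameters["adapter_id"] = adapter_id  # select a loaded LoRA adapter
    body = json.dumps({"inputs": prompt, "parameters": parameters}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())["generated_text"]


if __name__ == "__main__":
    print(generate("The future of AI is", adapter_id="adapter_1"))
```

Separating request construction from sending makes the payload easy to inspect or unit-test without a GPU server available.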

Monitoring and Observability

TGI exposes a /metrics endpoint for Prometheus, providing detailed insights into:

  • Request latency (TTFT and total generation time).
  • Batch sizes and queue lengths.
  • GPU memory utilization and throughput (tokens/sec).
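
For a quick look at these values without a full Prometheus setup, the text returned by /metrics can be parsed directly. The sketch below handles the plain `name value` lines of the Prometheus text format and assumes no trailing timestamps; the metric names in the sample are illustrative, so check the actual /metrics output of your deployment for the names it exposes.

```python
def parse_prometheus_text(text):
    """Parse Prometheus text exposition into {name_with_labels: float}.

    Simplification: assumes each sample line is "<name>[{labels}] <value>"
    with no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return samples


if __name__ == "__main__":
    sample = (
        "# TYPE tgi_queue_size gauge\n"
        "tgi_queue_size 3\n"
        "tgi_batch_current_size 8\n"
    )
    print(parse_prometheus_text(sample))
```

In practice the input would be the body of a GET request to the /metrics endpoint on the same port as the API.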

Contribution Metadata

  • Last reviewed: 2026-05-17
  • Confidence: high