LiteLLM

What it is

LiteLLM is an open-source AI Gateway (proxy server) and Python SDK that provides a unified OpenAI-compatible interface to 100+ LLM providers — OpenAI, Anthropic, Google Vertex AI, AWS Bedrock, Azure OpenAI, Ollama, and more. It sits between your agents and models, acting as a traffic controller with routing, fallbacks, budget enforcement, virtual keys, and observability built in.

Backed by Y Combinator (W23). MIT-licensed core; enterprise tier available.

What problem it solves

When running multiple AI agents (OpenClaw, OpenHands, Aider, n8n AI nodes) against both local Ollama models and cloud providers, you quickly accumulate problems:

  • Each tool has its own API format and SDK
  • Secrets are scattered across configs
  • There is no central cost tracking
  • Provider outages cascade into agent failures
  • Some tools cannot use Ollama's native API and need an OpenAI-compatible endpoint in front of local models

LiteLLM solves all of these by presenting a single OpenAI-compatible endpoint that any tool can target, while internally routing, falling back, tracking costs, and enforcing budgets.
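As a minimal sketch of what "a single OpenAI-compatible endpoint" means in practice, the snippet below builds a chat completion request against the proxy using only the Python standard library. The base URL, key, and model alias are placeholders borrowed from the examples on this page, and `build_chat_request` is a hypothetical helper, not part of LiteLLM.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,  # the alias from model_list, not the provider's own model ID
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; swap in your own proxy address and virtual key.
request = build_chat_request("http://localhost:4000", "sk-placeholder", "llama3.2", "Hello")
```

With a proxy running, `urllib.request.urlopen(request)` would send it. Switching backends then only means changing the `model` alias; the URL, headers, and payload shape stay the same.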

Where it fits in the stack

Provider Routing / Abstraction Layer. LiteLLM is typically the first hop after an agent makes an LLM API call.

┌──────────────────────────────────────────────────────────┐
│  Agents: OpenHands │ OpenClaw │ Aider │ n8n AI nodes    │
└───────────────────────────┬──────────────────────────────┘
                            │  OpenAI-compatible call
┌───────────────────────────▼──────────────────────────────┐
│                    LiteLLM Proxy (port 4000)              │
│  ┌──────────────┐  ┌────────────┐  ┌───────────────────┐ │
│  │ Virtual Keys │  │  Router    │  │ Budget / Guardrail│ │
│  └──────────────┘  └────────────┘  └───────────────────┘ │
│  ┌──────────────────────────────────────────────────────┐ │
│  │  Logging (Langfuse │ Prometheus │ S3 │ stdout)       │ │
│  └──────────────────────────────────────────────────────┘ │
└──────┬────────────────────┬─────────────────┬─────────────┘
       │                    │                  │
 Ollama (local)      OpenRouter          Anthropic API
 192.168.0.5:30068   (free tier)         (cloud fallback)

Deployment

docker run \
  -v $(pwd)/litellm-config.yaml:/app/config.yaml \
  -p 4000:4000 \
  -e OPENROUTER_API_KEY="${OPENROUTER_API_KEY}" \
  -e ANTHROPIC_API_KEY="${ANTHROPIC_API_KEY}" \
  -e GEMINI_API_KEY="${GEMINI_API_KEY}" \
  -e LITELLM_MASTER_KEY="sk-your-master-key" \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml --detailed_debug

One-click cloud (for testing)

Deploy to Render or Railway with the official template — useful for a shared team proxy before self-hosting.

Core configuration

Full home-lab config (litellm-config.yaml)

model_list:
  # ── Local Ollama (TrueNAS) ──────────────────────────────
  - model_name: llama3.2
    litellm_params:
      model: ollama/llama3.2
      api_base: http://192.168.0.5:30068
    model_info:
      max_tokens: 8192
      supports_function_calling: true

  - model_name: qwen2.5-coder-14b
    litellm_params:
      model: ollama/qwen2.5-coder:14b
      api_base: http://192.168.0.5:30068
    model_info:
      max_tokens: 32768
      supports_function_calling: true

  - model_name: nomic-embed
    litellm_params:
      model: ollama/nomic-embed-text
      api_base: http://192.168.0.5:30068

  # ── MacBook M4 Ollama ───────────────────────────────────
  - model_name: llama3.2-local
    litellm_params:
      model: ollama/llama3.2
      api_base: http://localhost:11434

  # ── Cloud Fallbacks ─────────────────────────────────────
  - model_name: claude-opus
    litellm_params:
      model: anthropic/claude-opus-4.7
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: gemini-flash
    litellm_params:
      model: gemini/gemini-3.5-flash
      api_key: os.environ/GEMINI_API_KEY

  - model_name: openrouter-free
    litellm_params:
      model: openrouter/google/gemma-3-27b-it:free
      api_key: os.environ/OPENROUTER_API_KEY

  # ── Load Balancing ──────────────────────────────────────
  # Distribute traffic across multiple Ollama instances
  - model_name: llama3-balanced
    litellm_params:
      model: ollama/llama3
      api_base: http://ollama-1:11434
    model_info:
      id: "ollama-1"
  - model_name: llama3-balanced
    litellm_params:
      model: ollama/llama3
      api_base: http://ollama-2:11434
    model_info:
      id: "ollama-2"

router_settings:
  routing_strategy: least-busy
  default_fallbacks: ["openrouter-free"]  # used when a model group has no specific fallback
  allowed_fails: 2
  cooldown_time: 60       # seconds before retrying a failed model
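To make the interplay of `allowed_fails`, `cooldown_time`, and the fallback model concrete, here is a toy model of the behavior. It is a simplified sketch, not LiteLLM's implementation: `ToyRouter` and `Deployment` are hypothetical names, it picks the first available deployment rather than the least busy one, and it assumes a deployment enters cooldown once its failure count exceeds `allowed_fails`.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    name: str
    fails: int = 0
    cooldown_until: float = 0.0  # timestamp before which the deployment is skipped


@dataclass
class ToyRouter:
    """Simplified model of allowed_fails / cooldown_time; not LiteLLM's code."""
    deployments: list
    fallback: str
    allowed_fails: int = 2
    cooldown_time: float = 60.0

    def record_failure(self, name: str, now: float) -> None:
        for d in self.deployments:
            if d.name == name:
                d.fails += 1
                # Assumption: cooldown starts once failures exceed allowed_fails.
                if d.fails > self.allowed_fails:
                    d.cooldown_until = now + self.cooldown_time
                    d.fails = 0

    def pick(self, now: float) -> str:
        # Return the first deployment that is not cooling down, else the fallback.
        for d in self.deployments:
            if now >= d.cooldown_until:
                return d.name
        return self.fallback


router = ToyRouter([Deployment("ollama-1"), Deployment("ollama-2")], fallback="openrouter-free")
```

Time is passed in explicitly (`now`) so the behavior can be stepped through deterministically: three failures on `ollama-1` at `now=0` take it out of rotation until `now=60`.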

Advanced Routing Patterns

Priority-based Routing

Useful for favoring a cheaper or local instance while keeping a high-performance cloud instance as a hot standby.

model_list:
  - model_name: coder-model
    litellm_params:
      model: ollama/qwen2.5-coder:14b
      api_base: http://local-gpu:11434
      order: 1  # Lowest number = highest priority
  - model_name: coder-model
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      order: 2

Model-specific Caching

Enable Redis caching per model to reduce costs and latency for repetitive agent queries (like "summarize this log").

router_settings:
  enable_cache: true
  cache_responses: true
  cache_type: redis
  redis_url: "redis://localhost:6379"

model_list:
  - model_name: fast-summary
    litellm_params:
      model: groq/llama-3.1-8b-instant
      ttl: 3600  # Cache for 1 hour

litellm_settings:
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]
  request_timeout: 120

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: "postgresql://litellm:pass@localhost:5432/litellm"  # optional; enables UI + key management

Virtual keys and budget management

Virtual keys allow you to give different agents or users their own API keys with individual rate limits and budget caps. All keys route through the same LiteLLM proxy but are tracked and limited independently.

# Create a virtual key for OpenHands with a $5 budget that resets every 30 days
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "openhands-agent",
    "max_budget": 5.0,
    "budget_duration": "30d",
    "models": ["qwen2.5-coder-14b", "claude-opus"],
    "rpm_limit": 60
  }'

# Create a key for OpenClaw with only local models
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "openclaw-agent",
    "max_budget": 0,
    "models": ["llama3.2", "llama3.2-local"],
    "rpm_limit": 120
  }'

# Create a token-limited key (1M tokens) for a research task
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "research-task",
    "max_token_budget": 1000000,
    "budget_duration": "30d"
  }'

Agents then use their virtual key as the API key:

# OpenHands pointing at LiteLLM with its virtual key
export LLM_BASE_URL="http://192.168.0.5:4000"
export LLM_API_KEY="sk-openhands-virtual-key"
export LLM_MODEL="openai/qwen2.5-coder-14b"
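When several agents need keys, the curl calls can be scripted. The following is a minimal sketch using only the standard library; it reuses the request fields shown in the curl examples, while `build_key_request` and the `AGENT_POLICIES` table are hypothetical names introduced for illustration. The requests are built but not sent.

```python
import json
import urllib.request


def build_key_request(proxy_url, master_key, key_alias, models,
                      max_budget=None, budget_duration=None, rpm_limit=None):
    """Build (but do not send) a POST /key/generate request."""
    body = {"key_alias": key_alias, "models": list(models)}
    optional = {
        "max_budget": max_budget,
        "budget_duration": budget_duration,
        "rpm_limit": rpm_limit,
    }
    # Only include the limits the caller actually set.
    body.update({k: v for k, v in optional.items() if v is not None})
    return urllib.request.Request(
        url=f"{proxy_url.rstrip('/')}/key/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical per-agent policy table mirroring the curl examples.
AGENT_POLICIES = {
    "openhands-agent": {"models": ["qwen2.5-coder-14b"], "max_budget": 5.0, "rpm_limit": 60},
    "openclaw-agent": {"models": ["llama3.2", "llama3.2-local"], "rpm_limit": 120},
}

requests_to_send = [
    build_key_request("http://localhost:4000", "sk-placeholder", alias, **policy)
    for alias, policy in AGENT_POLICIES.items()
]
```

Keeping the policies in one table makes it easier to review which agent may reach which models before any key is issued. In a real script the master key would come from the environment rather than a literal.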

Guardrails

LiteLLM can block or modify requests/responses before they reach the model:

# In litellm-config.yaml (guardrails is a top-level key)
guardrails:
  - guardrail_name: "pii-masking"
    litellm_params:
      guardrail: "aporia"           # or "presidio" for local PII detection
      mode: "during_call"
      default_on: true

  - guardrail_name: "prompt-injection"
    litellm_params:
      guardrail: "lakera_guard"
      mode: "pre_call"

For purely local guardrails without third-party services, use the built-in content filter:

litellm_settings:
  content_policy:
    outgoing:
      block_words: ["password", "api_key", "secret"]
    incoming:
      block_words: ["ignore previous instructions", "jailbreak"]
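The matching rules behind a block-word list are not spelled out here, so the sketch below only illustrates the general idea under one assumption: case-insensitive substring matching. `find_blocked` is a hypothetical helper, not a LiteLLM function.

```python
def find_blocked(text: str, block_words: list[str]) -> list[str]:
    """Return the block words that occur in text (case-insensitive substring match)."""
    lowered = text.lower()
    return [word for word in block_words if word.lower() in lowered]


INCOMING = ["ignore previous instructions", "jailbreak"]

hits = find_blocked("Please IGNORE previous instructions and continue.", INCOMING)
if hits:
    print(f"request rejected, matched: {hits}")
```

Substring matching is blunt: a word like "secret" would also match "secretary", so short block words tend to produce false positives. That is one reason to prefer a dedicated guardrail for anything beyond a coarse filter.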

Logging and observability

litellm_settings:
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]

environment_variables:
  LANGFUSE_PUBLIC_KEY: "pk-lf-..."
  LANGFUSE_SECRET_KEY: "sk-lf-..."
  LANGFUSE_HOST: "http://192.168.0.5:3100"   # self-hosted Langfuse

Prometheus metrics

litellm_settings:
  success_callback: ["prometheus"]

# Metrics exposed at http://localhost:4000/metrics
# Track: litellm_requests_metric, litellm_total_tokens, litellm_request_total_latency_metric

Prometheus Scrape Config

Add this to your prometheus.yml:

scrape_configs:
  - job_name: 'litellm'
    static_configs:
      - targets: ['litellm:4000']
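For a quick check without a full Prometheus setup, the text served at /metrics can be parsed directly. The sketch below sums all samples of one metric from Prometheus text exposition output. It assumes lines carry no trailing timestamp, and the sample text uses a made-up metric name (`demo_requests_total`) purely for illustration; `sum_metric` is a hypothetical helper.

```python
def sum_metric(exposition: str, metric: str) -> float:
    """Sum every sample of `metric` across all label sets."""
    total = 0.0
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name == metric:
            total += float(value)
    return total


# Illustrative sample only; real output comes from the proxy's /metrics endpoint.
SAMPLE = """\
# HELP demo_requests_total Illustrative counter
# TYPE demo_requests_total counter
demo_requests_total{model="llama3.2"} 12
demo_requests_total{model="claude-opus"} 3
"""

print(sum_metric(SAMPLE, "demo_requests_total"))
```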

Grafana Dashboard

Use the official LiteLLM Grafana Dashboard to visualize:

  • Request volume by model/key
  • Latency (p95, p99)
  • Token usage and cost

Management UI

LiteLLM ships a web UI (enabled when database_url is set in config):

  • Accessible at http://localhost:4000/ui
  • Log in with master key credentials
  • Manage virtual keys, view spend by key/model/day, set budgets, test models, view logs

# Start with UI enabled (requires PostgreSQL)
docker run \
  -v $(pwd)/litellm-config.yaml:/app/config.yaml \
  -e DATABASE_URL="postgresql://litellm:pass@postgres:5432/litellm" \
  -e LITELLM_MASTER_KEY="sk-master" \
  -p 4000:4000 \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml

Python SDK (without proxy)

For scripts that don't need a persistent proxy:

import litellm

# Unified call — same interface regardless of provider
response = litellm.completion(
    model="ollama/llama3.2",
    messages=[{"role": "user", "content": "Summarise this text..."}],
    api_base="http://192.168.0.5:30068",
)

# Streaming
for chunk in litellm.completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Explain RAG"}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="")

# Embeddings
embeddings = litellm.embedding(
    model="ollama/nomic-embed-text",
    input=["text to embed"],
    api_base="http://192.168.0.5:30068",
)
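Without the proxy there is no router in front of your script, so any fallback has to live in your own code. A minimal sketch follows, assuming you want to try models in a fixed order: `complete_with_fallback` is a hypothetical helper, and the completion function is passed in so the logic can be exercised without network access.

```python
from typing import Any, Callable, Sequence


def complete_with_fallback(complete: Callable[..., Any], models: Sequence[str], **kwargs: Any) -> Any:
    """Try each model in order and return the first successful response."""
    last_error = None
    for model in models:
        try:
            return complete(model=model, **kwargs)
        except Exception as exc:  # broad on purpose: provider errors vary
            last_error = exc
    raise RuntimeError(f"all models failed: {list(models)}") from last_error
```

In a script this would be called as `complete_with_fallback(litellm.completion, ["ollama/llama3.2", "anthropic/claude-sonnet-4-20250514"], messages=[...])`. Catching every exception keeps the sketch short; a real script would likely retry only on timeouts and rate limits and let authentication errors surface immediately.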

Supported endpoint types

LiteLLM proxies most of the OpenAI API surface, plus a few non-OpenAI endpoints such as /rerank, /messages, and /a2a:

| Endpoint | Use |
|---|---|
| /chat/completions | Standard chat |
| /completions | Legacy text completion |
| /embeddings | Vector embeddings |
| /images/generations | Image generation (DALL-E compatible) |
| /audio/transcriptions | Whisper-compatible transcription |
| /audio/speech | TTS |
| /batches | Batch inference |
| /rerank | Document re-ranking (Cohere-compatible) |
| /messages | Anthropic Messages API passthrough |
| /responses | OpenAI Responses API |
| /a2a | Agent-to-agent protocol |
| /realtime | OpenAI Realtime API (WebSocket) |

Realtime API

Generally available (GA) as of May 2026, LiteLLM's Realtime support provides a unified WebSocket interface for realtime multimodal (audio/text) models.

# litellm-config.yaml
model_list:
  - model_name: gpt-4o-realtime
    litellm_params:
      model: openai/gpt-4o-realtime-preview-2024-10-01

Connect via WebSocket: ws://localhost:4000/v1/realtime?model=gpt-4o-realtime
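Clients usually know the proxy by its HTTP base URL, so the WebSocket address has to be derived from it. A small sketch, assuming the `/v1/realtime?model=...` path shown above; `realtime_url` is a hypothetical helper.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def realtime_url(proxy_base: str, model: str) -> str:
    """Turn an http(s) proxy base URL into the matching ws(s) realtime URL."""
    parts = urlsplit(proxy_base)
    scheme = "wss" if parts.scheme == "https" else "ws"
    query = urlencode({"model": model})
    return urlunsplit((scheme, parts.netloc, "/v1/realtime", query, ""))


print(realtime_url("http://localhost:4000", "gpt-4o-realtime"))
```

The resulting URL can be handed to any WebSocket client, with the virtual key supplied in the `Authorization` header as for HTTP requests.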

Memory API

Introduced in April 2026, the Memory API allows LiteLLM to maintain long-term state across sessions for specific users or agents.

# python snippet
response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "My favorite color is blue."}],
    user="jules-001",
    integrate_memory=True  # Automatically persists and retrieves context
)

MCP Gateway

Hardened in May 2026, the MCP Gateway allows you to securely expose Model Context Protocol (MCP) servers to your agents via the LiteLLM proxy.

# litellm-config.yaml
mcp_servers:
  - name: "google-drive"
    command: "npx"
    args: ["-y", "@modelcontextprotocol/server-google-drive"]
    env:
      GOOGLE_DRIVE_CREDENTIALS: "..."

Integrating with OpenHands

# docker-compose.yml — OpenHands + LiteLLM
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    volumes:
      - ./litellm-config.yaml:/app/config.yaml
    environment:
      LITELLM_MASTER_KEY: "${LITELLM_MASTER_KEY}"
      ANTHROPIC_API_KEY: "${ANTHROPIC_API_KEY}"
      OPENROUTER_API_KEY: "${OPENROUTER_API_KEY}"
    ports:
      - "4000:4000"
    command: --config /app/config.yaml

  openhands:
    image: docker.all-hands.dev/all-hands-ai/openhands:latest
    environment:
      LLM_BASE_URL: "http://litellm:4000"
      LLM_MODEL: "openai/qwen2.5-coder-14b"
      LLM_API_KEY: "${OPENHANDS_VIRTUAL_KEY}"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ~/.openhands-state:/.openhands-state
    ports:
      - "3000:3000"
    depends_on:
      - litellm

Security considerations

  • Master key: Set LITELLM_MASTER_KEY via environment variable; never hardcode in config
  • Virtual key rotation: Revoke and regenerate agent keys periodically; each agent should have its own key
  • Secret management: All provider API keys are loaded from environment; LiteLLM never writes them to disk
  • Network isolation: Run the proxy on an internal Docker network; expose only port 4000 to trusted agents
  • Logging privacy: Prompts and responses are sent to Langfuse when its callback is enabled (Prometheus receives only metrics); keep your logging backend on private infrastructure if requests contain sensitive data

Typical use cases

  • Multi-Agent Orchestration: Providing a single endpoint for OpenHands, OpenClaw, and Aider to share the same model pool and budget.
  • Cost Management: Tracking and limiting spend for local experiments vs. production cloud calls.
  • Resilient AI Pipelines: Implementing automatic failover from local Ollama models to cloud providers (Anthropic/OpenRouter) during heavy load or local downtime.
  • Local Development: Simulating cloud model APIs using local models (e.g., using Qwen 2.5-Coder as an OpenAI-compatible substitute).

Strengths

  • Protocol normalisation: Every agent speaks one language (OpenAI Chat Completions), regardless of backend
  • 100+ provider support: OpenAI, Anthropic, Ollama, Bedrock, Azure, Vertex AI, OpenRouter, Replicate, and more
  • Built-in fallbacks: Automatic failover when a model is down or rate-limited
  • Cost tracking: Per-key and per-model spend tracked in real time
  • Self-hostable: Full control; no third-party telemetry
  • Management UI: Visual key management and spend dashboard without external tooling
  • Embeddable: Works as a long-running proxy or imported as a Python library

Limitations

  • Operational overhead: Adds a service to maintain; needs health checks and restart policies
  • PostgreSQL dependency: Full UI + key persistence requires a Postgres instance
  • Feature parity gaps: Not all provider-specific parameters are exposed; some advanced provider features require raw passthrough
  • Local-model latency: Proxying through LiteLLM adds ~5–20 ms per call vs direct Ollama calls
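
The proxy overhead depends on hardware and configuration, so it is worth measuring on your own setup. A minimal timing sketch follows: `median_latency_ms` is a hypothetical helper that times any zero-argument callable, so the same function can wrap a direct Ollama call and a call through the proxy.

```python
import statistics
import time


def median_latency_ms(call, runs: int = 20) -> float:
    """Median wall-clock latency of call() in milliseconds over `runs` invocations."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

Comparing `median_latency_ms(direct_call)` with `median_latency_ms(proxied_call)` for the same prompt and model gives a rough overhead estimate. The median is used rather than the mean so that a single slow run (for example, a model load) does not dominate the result.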

When to use it

  • When running multiple AI agents with different LLM backends
  • When you need a centralised place to track AI spend and enforce budgets
  • For resilient systems that can survive provider outages via automatic fallback
  • When tools only support OpenAI format but you want to use Ollama or Bedrock
  • For team deployments where different people need different API key access levels

When not to use it

  • If you only ever use one provider and one agent (direct calls are simpler)
  • For very simple, low-volume scripts where a proxy adds unnecessary complexity
  • When sub-5 ms latency is critical and you are already calling Ollama directly

Related

  • OpenRouter — cloud-based model router (alternative to LiteLLM proxy for cloud-only use).
  • Ollama — local model serving backend.
  • OpenHands — software engineering agent; recommended to pair with LiteLLM.
  • OpenClaw — agent platform; can use LiteLLM for model routing.
  • Local LLMs — overview of local model options.
  • vLLM — high-throughput alternative to Ollama for inference.
  • Langfuse — observability backend for LiteLLM.
  • Model Routing Guide — logic for choosing models via LiteLLM.
  • Unified Search — script that can benefit from unified LLM access.

Backlog

  • [x] Perform quarterly technical freshness audit. (Completed 2026-05-26)

Contribution Metadata

  • Last reviewed: 2026-05-26
  • Confidence: high