LiteLLM¶

What it is¶

LiteLLM is an open-source AI Gateway (proxy server) and Python SDK that provides a unified OpenAI-compatible interface to 100+ LLM providers — OpenAI, Anthropic, Google Vertex AI, AWS Bedrock, Azure OpenAI, Ollama, and more. It sits between your agents and models, acting as a traffic controller with routing, fallbacks, budget enforcement, virtual keys, and observability built in.

Backed by Y Combinator (W23). MIT-licensed core; enterprise tier available.

What problem it solves¶

When running multiple AI agents (OpenClaw, OpenHands, Aider, n8n AI nodes) against both local Ollama models and cloud providers, you quickly accumulate problems:

Each tool has its own API format and SDK
Secrets are scattered across configs
There is no central cost tracking
Provider outages cascade into agent failures
Local Ollama models are not OpenAI-compatible by default for some tools

LiteLLM solves all of these by presenting a single OpenAI-compatible endpoint that any tool can target, while internally routing, falling back, tracking costs, and enforcing budgets.

Where it fits in the stack¶

Provider Routing / Abstraction Layer. LiteLLM is typically the first hop after an agent makes an LLM API call.

┌──────────────────────────────────────────────────────────┐
│  Agents: OpenHands │ OpenClaw │ Aider │ n8n AI nodes    │
└───────────────────────────┬──────────────────────────────┘
                            │  OpenAI-compatible call
┌───────────────────────────▼──────────────────────────────┐
│                    LiteLLM Proxy (port 4000)              │
│  ┌──────────────┐  ┌────────────┐  ┌───────────────────┐ │
│  │ Virtual Keys │  │  Router    │  │ Budget / Guardrail│ │
│  └──────────────┘  └──────────┘  └───────────────────┘ │
│  ┌──────────────────────────────────────────────────────┐ │
│  │  Logging (Langfuse │ Prometheus │ S3 │ stdout)       │ │
│  └──────────────────────────────────────────────────────┘ │
└──────┬────────────────────┬─────────────────┬─────────────┘
       │                    │                  │
 Ollama (local)      OpenRouter          Anthropic API
 192.168.0.5:30068   (free tier)         (cloud fallback)

Deployment¶

Docker (recommended)¶

docker run \
  -v $(pwd)/litellm-config.yaml:/app/config.yaml \
  -p 4000:4000 \
  -e OPENROUTER_API_KEY="${OPENROUTER_API_KEY}" \
  -e ANTHROPIC_API_KEY="${ANTHROPIC_API_KEY}" \
  -e LITELLM_MASTER_KEY="sk-your-master-key" \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml --detailed_debug

One-click cloud (for testing)¶

Deploy to Render or Railway with the official template — useful for a shared team proxy before self-hosting.

Core configuration¶

Full home-lab config (`litellm-config.yaml`)¶

model_list:
  # ── Local Ollama (TrueNAS) ──────────────────────────────
  - model_name: llama3.2
    litellm_params:
      model: ollama/llama3.2
      api_base: http://192.168.0.5:30068
    model_info:
      max_tokens: 8192
      supports_function_calling: true

  - model_name: qwen2.5-coder-14b
    litellm_params:
      model: ollama/qwen2.5-coder:14b
      api_base: http://192.168.0.5:30068
    model_info:
      max_tokens: 32768
      supports_function_calling: true

  - model_name: nomic-embed
    litellm_params:
      model: ollama/nomic-embed-text
      api_base: http://192.168.0.5:30068

  # ── MacBook M4 Ollama ───────────────────────────────────
  - model_name: llama3.2-local
    litellm_params:
      model: ollama/llama3.2
      api_base: http://localhost:11434

  # ── Cloud Fallbacks ─────────────────────────────────────
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: openrouter-free
    litellm_params:
      model: openrouter/google/gemma-3-27b-it:free
      api_key: os.environ/OPENROUTER_API_KEY

  # ── Load Balancing ──────────────────────────────────────
  # Distribute traffic across multiple Ollama instances
  - model_name: llama3-balanced
    litellm_params:
      model: ollama/llama3
      api_base: http://ollama-1:11434
    model_info:
      id: "ollama-1"
  - model_name: llama3-balanced
    litellm_params:
      model: ollama/llama3
      api_base: http://ollama-2:11434
    model_info:
      id: "ollama-2"

router_settings:
  routing_strategy: least-busy
  fallback_model: openrouter-free
  allowed_fails: 2
  cooldown_time: 60       # seconds before retrying a failed model

litellm_settings:
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]
  request_timeout: 120

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: "postgresql://litellm:pass@localhost:5432/litellm"  # optional; enables UI + key management

Virtual keys and budget management¶

Virtual keys allow you to give different agents or users their own API keys with individual rate limits and budget caps. All keys route through the same LiteLLM proxy but are tracked and limited independently.

# Create a virtual key for OpenHands with a $5/month budget
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "openhands-agent",
    "max_budget": 5.0,
    "budget_duration": "monthly",
    "models": ["qwen2.5-coder-14b", "claude-sonnet"],
    "rpm_limit": 60
  }'

# Create a key for OpenClaw with only local models
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "openclaw-agent",
    "max_budget": 0,
    "models": ["llama3.2", "llama3.2-local"],
    "rpm_limit": 120
  }'

Agents then use their virtual key as the API key:

# OpenHands pointing at LiteLLM with its virtual key
export LLM_BASE_URL="http://192.168.0.5:4000"
export LLM_API_KEY="sk-openhands-virtual-key"
export LLM_MODEL="openai/qwen2.5-coder-14b"

Guardrails¶

LiteLLM can block or modify requests/responses before they reach the model:

# In litellm-config.yaml
litellm_settings:
  guardrails:
    - guardrail_name: "pii-masking"
      litellm_params:
        guardrail: "aporia"           # or "presidio" for local PII detection
        mode: "during_call"
        default_on: true

    - guardrail_name: "prompt-injection"
      litellm_params:
        guardrail: "lakera_guard"
        mode: "pre_call"

For purely local guardrails without third-party services, use the built-in content filter:

litellm_settings:
  content_policy:
    outgoing:
      block_words: ["password", "api_key", "secret"]
    incoming:
      block_words: ["ignore previous instructions", "jailbreak"]

Logging and observability¶

Langfuse (recommended for home lab)¶

litellm_settings:
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]

environment_variables:
  LANGFUSE_PUBLIC_KEY: "pk-lf-..."
  LANGFUSE_SECRET_KEY: "sk-lf-..."
  LANGFUSE_HOST: "http://192.168.0.5:3100"   # self-hosted Langfuse

Prometheus metrics¶

litellm_settings:
  success_callback: ["prometheus"]

# Metrics exposed at http://localhost:4000/metrics
# Track: litellm_requests_total, litellm_total_tokens, litellm_request_latency_seconds

Prometheus Scrape Config¶

Add this to your prometheus.yml:

scrape_configs:
  - job_name: 'litellm'
    static_configs:
      - targets: ['litellm:4000']

Grafana Dashboard¶

Use the official LiteLLM Grafana Dashboard to visualize: - Request volume by model/key - Latency (p95, p99) - Token usage and cost

Management UI¶

LiteLLM ships a web UI (enabled when database_url is set in config):

Accessible at http://localhost:4000/ui
Login with master key credentials
Manage virtual keys, view spend by key/model/day, set budgets, test models, view logs

# Start with UI enabled (requires PostgreSQL)
docker run \
  -e DATABASE_URL="postgresql://litellm:pass@postgres:5432/litellm" \
  -e LITELLM_MASTER_KEY="sk-master" \
  -p 4000:4000 \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml

Python SDK (without proxy)¶

For scripts that don't need a persistent proxy:

import litellm

# Unified call — same interface regardless of provider
response = litellm.completion(
    model="ollama/llama3.2",
    messages=[{"role": "user", "content": "Summarise this text..."}],
    api_base="http://192.168.0.5:30068",
)

# Streaming
for chunk in litellm.completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Explain RAG"}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="")

# Embeddings
embeddings = litellm.embedding(
    model="ollama/nomic-embed-text",
    input=["text to embed"],
    api_base="http://192.168.0.5:30068",
)

Supported endpoint types¶

LiteLLM proxies the full OpenAI API surface:

Endpoint	Use
`/chat/completions`	Standard chat
`/completions`	Legacy text completion
`/embeddings`	Vector embeddings
`/images/generations`	Image generation (DALL-E compatible)
`/audio/transcriptions`	Whisper-compatible transcription
`/audio/speech`	TTS
`/batches`	Batch inference
`/rerank`	Document re-ranking (Cohere-compatible)
`/messages`	Anthropic Messages API passthrough
`/responses`	OpenAI Responses API
`/a2a`	Agent-to-agent protocol

Integrating with OpenHands¶

# docker-compose.yml — OpenHands + LiteLLM
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    volumes:
      - ./litellm.yaml:/app/config.yaml
    environment:
      LITELLM_MASTER_KEY: "${LITELLM_MASTER_KEY}"
      ANTHROPIC_API_KEY: "${ANTHROPIC_API_KEY}"
      OPENROUTER_API_KEY: "${OPENROUTER_API_KEY}"
    ports:
      - "4000:4000"
    command: --config /app/config.yaml

  openhands:
    image: docker.all-hands.dev/all-hands-ai/openhands:latest
    environment:
      LLM_BASE_URL: "http://litellm:4000"
      LLM_MODEL: "openai/qwen2.5-coder-14b"
      LLM_API_KEY: "${OPENHANDS_VIRTUAL_KEY}"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ~/.openhands-state:/.openhands-state
    ports:
      - "3000:3000"
    depends_on:
      - litellm

Security considerations¶

Master key: Set LITELLM_MASTER_KEY via environment variable; never hardcode in config
Virtual key rotation: Revoke and regenerate agent keys periodically; each agent should have its own key
Secret management: All provider API keys are loaded from environment; LiteLLM never writes them to disk
Network isolation: Run the proxy on an internal Docker network; expose only port 4000 to trusted agents
Logging privacy: Prompts and responses are logged to Langfuse/Prometheus; ensure your logging backend is on private infrastructure if requests contain sensitive data

Strengths¶

Protocol normalisation: Every agent speaks one language (OpenAI Chat Completions), regardless of backend
100+ provider support: OpenAI, Anthropic, Ollama, Bedrock, Azure, Vertex AI, OpenRouter, Replicate, and more
Built-in fallbacks: Automatic failover when a model is down or rate-limited
Cost tracking: Per-key and per-model spend tracked in real time
Self-hostable: Full control; no third-party telemetry
Management UI: Visual key management and spend dashboard without external tooling
Embeddable: Works as a long-running proxy or imported as a Python library

Limitations¶

Operational overhead: Adds a service to maintain; needs health checks and restart policies
PostgreSQL dependency: Full UI + key persistence requires a Postgres instance
Feature parity gaps: Not all provider-specific parameters are exposed; some advanced provider features require raw passthrough
Local-model latency: Proxying through LiteLLM adds ~5–20 ms per call vs direct Ollama calls

When to use it¶

When running multiple AI agents with different LLM backends
When you need a centralised place to track AI spend and enforce budgets
For resilient systems that can survive provider outages via automatic fallback
When tools only support OpenAI format but you want to use Ollama or Bedrock
For team deployments where different people need different API key access levels

When not to use it¶

If you only ever use one provider and one agent (direct calls are simpler)
For very simple, low-volume scripts where a proxy adds unnecessary complexity
When sub-5 ms latency is critical and you are already calling Ollama directly

OpenRouter — cloud-based model router (alternative to LiteLLM proxy for cloud-only use)
Ollama — local model serving backend
OpenHands — software engineering agent; recommended to pair with LiteLLM
OpenClaw — agent platform; can use LiteLLM for model routing
Local LLMs — overview of local model options
vLLM — high-throughput alternative to Ollama for inference

Sources / References¶

Contribution Metadata¶

Last reviewed: 2026-03-21
Confidence: high