LiteLLM¶
What it is¶
LiteLLM is an open-source AI Gateway (proxy server) and Python SDK that provides a unified OpenAI-compatible interface to 100+ LLM providers — OpenAI, Anthropic, Google Vertex AI, AWS Bedrock, Azure OpenAI, Ollama, and more. It sits between your agents and models, acting as a traffic controller with routing, fallbacks, budget enforcement, virtual keys, and observability built in.
Backed by Y Combinator (W23). MIT-licensed core; enterprise tier available.
What problem it solves¶
When running multiple AI agents (OpenClaw, OpenHands, Aider, n8n AI nodes) against both local Ollama models and cloud providers, you quickly accumulate problems:
- Each tool has its own API format and SDK
- Secrets are scattered across configs
- There is no central cost tracking
- Provider outages cascade into agent failures
- Some tools cannot talk to Ollama's native API and expect an OpenAI-compatible endpoint
LiteLLM solves all of these by presenting a single OpenAI-compatible endpoint that any tool can target, while internally routing, falling back, tracking costs, and enforcing budgets.
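As a rough illustration of the single-endpoint idea, the sketch below builds (but does not send) an OpenAI-style chat request aimed at the proxy. The helper name `build_chat_request`, the placeholder key `sk-example`, and the `/v1/chat/completions` path are assumptions for illustration; the model aliases and port 4000 come from the configuration on this page.

```python
import json
import urllib.request

PROXY_URL = "http://localhost:4000"  # assumed proxy address (port 4000 as used on this page)


def build_chat_request(model: str, prompt: str, api_key: str,
                       base_url: str = PROXY_URL) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request (hypothetical helper)."""
    body = {
        "model": model,  # alias defined in the proxy's model_list
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# The request shape is identical whether the alias maps to Ollama or a cloud provider.
local_req = build_chat_request("llama3.2", "Hello", "sk-example")
cloud_req = build_chat_request("claude-opus", "Hello", "sk-example")
# urllib.request.urlopen(local_req) would send it to a running proxy.
```

Only the `model` field differs between the two requests; routing to the actual backend is left to the proxy.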
Where it fits in the stack¶
Provider Routing / Abstraction Layer. LiteLLM is typically the first hop after an agent makes an LLM API call.
┌──────────────────────────────────────────────────────────┐
│ Agents: OpenHands │ OpenClaw │ Aider │ n8n AI nodes      │
└───────────────────────────┬──────────────────────────────┘
                            │ OpenAI-compatible call
┌───────────────────────────▼──────────────────────────────┐
│                LiteLLM Proxy (port 4000)                 │
│ ┌──────────────┐ ┌────────────┐ ┌───────────────────┐    │
│ │ Virtual Keys │ │   Router   │ │ Budget / Guardrail│    │
│ └──────────────┘ └────────────┘ └───────────────────┘    │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Logging (Langfuse │ Prometheus │ S3 │ stdout)        │ │
│ └──────────────────────────────────────────────────────┘ │
└──────┬────────────────────┬─────────────────┬────────────┘
       │                    │                 │
Ollama (local)         OpenRouter       Anthropic API
192.168.0.5:30068      (free tier)    (cloud fallback)
Deployment¶
Docker (recommended)¶
docker run \
  -v "$(pwd)/litellm-config.yaml:/app/config.yaml" \
  -p 4000:4000 \
  -e OPENROUTER_API_KEY="${OPENROUTER_API_KEY}" \
  -e ANTHROPIC_API_KEY="${ANTHROPIC_API_KEY}" \
  -e GEMINI_API_KEY="${GEMINI_API_KEY}" \
  -e LITELLM_MASTER_KEY="${LITELLM_MASTER_KEY}" \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml --detailed_debug
One-click cloud (for testing)¶
Deploy to Render or Railway with the official template — useful for a shared team proxy before self-hosting.
Core configuration¶
Full home-lab config (litellm-config.yaml)¶
model_list:
  # ── Local Ollama (TrueNAS) ──────────────────────────────
  - model_name: llama3.2
    litellm_params:
      model: ollama/llama3.2
      api_base: http://192.168.0.5:30068
    model_info:
      max_tokens: 8192
      supports_function_calling: true
  - model_name: qwen2.5-coder-14b
    litellm_params:
      model: ollama/qwen2.5-coder:14b
      api_base: http://192.168.0.5:30068
    model_info:
      max_tokens: 32768
      supports_function_calling: true
  - model_name: nomic-embed
    litellm_params:
      model: ollama/nomic-embed-text
      api_base: http://192.168.0.5:30068
  # ── MacBook M4 Ollama ───────────────────────────────────
  - model_name: llama3.2-local
    litellm_params:
      model: ollama/llama3.2
      api_base: http://localhost:11434
  # ── Cloud Fallbacks ─────────────────────────────────────
  - model_name: claude-opus
    litellm_params:
      model: anthropic/claude-opus-4.7
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: gemini-flash
    litellm_params:
      model: gemini/gemini-3.5-flash  # Google AI Studio models use the gemini/ prefix
      api_key: os.environ/GEMINI_API_KEY
  - model_name: openrouter-free
    litellm_params:
      model: openrouter/google/gemma-3-27b-it:free
      api_key: os.environ/OPENROUTER_API_KEY
  # ── Load Balancing ──────────────────────────────────────
  # Distribute traffic across multiple Ollama instances:
  # entries sharing a model_name form one load-balanced group
  - model_name: llama3-balanced
    litellm_params:
      model: ollama/llama3
      api_base: http://ollama-1:11434
    model_info:
      id: "ollama-1"
  - model_name: llama3-balanced
    litellm_params:
      model: ollama/llama3
      api_base: http://ollama-2:11434
    model_info:
      id: "ollama-2"
router_settings:
  routing_strategy: least-busy
  default_fallbacks: ["openrouter-free"]  # used when a model group has no fallback of its own
  allowed_fails: 2
  cooldown_time: 60  # seconds before retrying a failed model
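To make the interplay of allowed_fails and cooldown_time concrete, here is a toy model of the idea. It is a sketch under stated assumptions, not LiteLLM's internal code: the `CooldownTracker` class is hypothetical, and it assumes a deployment is benched once its failure count exceeds allowed_fails.

```python
import time


class CooldownTracker:
    """Toy model of allowed_fails / cooldown_time (hypothetical, for illustration)."""

    def __init__(self, allowed_fails: int = 2, cooldown_time: float = 60.0):
        self.allowed_fails = allowed_fails
        self.cooldown_time = cooldown_time
        self.fail_counts = {}   # deployment id -> failures since last success
        self.cooled_until = {}  # deployment id -> time at which it may be retried

    def record_failure(self, deployment, now=None):
        now = time.time() if now is None else now
        self.fail_counts[deployment] = self.fail_counts.get(deployment, 0) + 1
        # Assumption: bench the deployment once failures exceed allowed_fails.
        if self.fail_counts[deployment] > self.allowed_fails:
            self.cooled_until[deployment] = now + self.cooldown_time
            self.fail_counts[deployment] = 0

    def record_success(self, deployment):
        self.fail_counts[deployment] = 0

    def is_available(self, deployment, now=None):
        now = time.time() if now is None else now
        return now >= self.cooled_until.get(deployment, 0.0)


tracker = CooldownTracker(allowed_fails=2, cooldown_time=60)
for _ in range(3):  # the third failure exceeds allowed_fails=2
    tracker.record_failure("ollama-1", now=0)

# Ten seconds later only the healthy instance is a routing candidate.
candidates = [d for d in ["ollama-1", "ollama-2"] if tracker.is_available(d, now=10)]
```

Once the cooldown window has passed, the benched deployment becomes a candidate again, which is the behaviour the 60-second setting above is meant to produce.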
Advanced Routing Patterns¶
Priority-based Routing¶
Useful for favoring a cheaper or local instance while keeping a high-performance cloud instance as a hot standby. Deployments that share a model_name are tried in ascending order.
```yaml
model_list:
  - model_name: coder-model
    litellm_params:
      model: ollama/qwen2.5-coder:14b
      api_base: http://local-gpu:11434
      order: 1  # lowest number = highest priority
  - model_name: coder-model
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      order: 2
```
Model-specific Caching¶
Enable Redis caching to cut cost and latency for repetitive agent queries (like "summarize this log"). The cache is configured under litellm_settings, and the TTL below applies to all cached responses.
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: localhost
    port: 6379
    ttl: 3600  # cache for 1 hour
model_list:
  - model_name: fast-summary
    litellm_params:
      model: groq/llama-3.1-8b-instant
litellm_settings:
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]
  request_timeout: 120
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: "postgresql://litellm:pass@localhost:5432/litellm"  # optional; enables UI + key management
Virtual keys and budget management¶
Virtual keys allow you to give different agents or users their own API keys with individual rate limits and budget caps. All keys route through the same LiteLLM proxy but are tracked and limited independently.
```bash
# Create a virtual key for OpenHands with a $5 budget that resets every 30 days
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "openhands-agent",
    "max_budget": 5.0,
    "budget_duration": "30d",
    "models": ["qwen2.5-coder-14b", "claude-opus"],
    "rpm_limit": 60
  }'

# Create a key for OpenClaw with only local models
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "openclaw-agent",
    "max_budget": 0,
    "models": ["llama3.2", "llama3.2-local"],
    "rpm_limit": 120
  }'

# Create a token-limited key (1M tokens) for a research task
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "research-task",
    "max_token_budget": 1000000,
    "budget_duration": "30d"
  }'
```
Agents then use their virtual key as the API key:
# OpenHands pointing at LiteLLM with its virtual key
export LLM_BASE_URL="http://192.168.0.5:4000"
export LLM_API_KEY="sk-openhands-virtual-key"
export LLM_MODEL="openai/qwen2.5-coder-14b"
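When several agents need keys, the curl calls above can be generated from a small table instead of being written by hand. The sketch below only builds the requests; the helper name `key_generate_request` and the `sk-placeholder` fallback are assumptions, while the endpoint and payload fields mirror the curl examples.

```python
import json
import os
import urllib.request


def key_generate_request(alias, models, max_budget, rpm_limit,
                         base_url="http://localhost:4000"):
    """Build (without sending) a /key/generate call; hypothetical helper."""
    master_key = os.environ.get("LITELLM_MASTER_KEY", "sk-placeholder")
    payload = {
        "key_alias": alias,
        "models": models,
        "max_budget": max_budget,
        "rpm_limit": rpm_limit,
    }
    return urllib.request.Request(
        url=f"{base_url}/key/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# One row per agent: (alias, allowed models, budget in USD, requests per minute).
agents = [
    ("openhands-agent", ["qwen2.5-coder-14b"], 5.0, 60),
    ("openclaw-agent", ["llama3.2", "llama3.2-local"], 0, 120),
]
requests_to_send = [key_generate_request(*agent) for agent in agents]
# Each request could then be sent with urllib.request.urlopen(req).
```

Keeping the agent table in one place makes it easier to see at a glance which agent may reach which models.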
Guardrails¶
LiteLLM can block or modify requests/responses before they reach the model:
# In litellm-config.yaml (guardrails is a top-level key)
guardrails:
  - guardrail_name: "pii-masking"
    litellm_params:
      guardrail: "aporia"  # or "presidio" for local PII detection
      mode: "during_call"
      default_on: true
  - guardrail_name: "prompt-injection"
    litellm_params:
      guardrail: "lakera_guard"
      mode: "pre_call"
For purely local guardrails without third-party services, use the built-in content filter:
litellm_settings:
  content_policy:
    outgoing:
      block_words: ["password", "api_key", "secret"]
    incoming:
      block_words: ["ignore previous instructions", "jailbreak"]
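Conceptually, a block-word filter is a substring check against a list. The sketch below illustrates that idea only; it is not LiteLLM's implementation, the `find_blocked` helper is hypothetical, and case-insensitive substring matching is an assumption about how such a filter might behave.

```python
def find_blocked(text, block_words):
    """Return the block-list entries found in text (case-insensitive substring match)."""
    lowered = text.lower()
    return [word for word in block_words if word.lower() in lowered]


INCOMING = ["ignore previous instructions", "jailbreak"]
OUTGOING = ["password", "api_key", "secret"]

# A prompt that trips the incoming list, and a reply that passes the outgoing list.
prompt_hits = find_blocked("Please IGNORE previous instructions and continue", INCOMING)
reply_hits = find_blocked("The weather is sunny today.", OUTGOING)
```

Plain substring matching is easy to evade and prone to false positives (for example, "secretary" contains "secret"), which is one reason to prefer the dedicated guardrail integrations for anything beyond a basic safety net.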
Logging and observability¶
Langfuse (recommended for home lab)¶
litellm_settings:
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]
environment_variables:
  LANGFUSE_PUBLIC_KEY: "pk-lf-..."
  LANGFUSE_SECRET_KEY: "sk-lf-..."
  LANGFUSE_HOST: "http://192.168.0.5:3100"  # self-hosted Langfuse
Prometheus metrics¶
litellm_settings:
  success_callback: ["prometheus"]
# Metrics exposed at http://localhost:4000/metrics
# Track: litellm_requests_total, litellm_total_tokens, litellm_request_latency_seconds
Prometheus Scrape Config¶
Add this to your prometheus.yml:
scrape_configs:
  - job_name: 'litellm'
    static_configs:
      - targets: ['litellm:4000']
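For a quick check without a full Prometheus setup, the /metrics output can be read directly. The sketch below sums one counter from Prometheus text-format output. The `sum_metric` helper is hypothetical, the sample text is made up for illustration, and the parser assumes samples carry no trailing timestamp.

```python
def sum_metric(exposition_text, metric_name):
    """Sum all samples of one metric in Prometheus text format, across labels."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]  # drop the {label="..."} section
        if name == metric_name:
            total += float(value)
    return total


# Made-up sample; real text would come from GET http://localhost:4000/metrics.
SAMPLE = """\
# TYPE litellm_requests_total counter
litellm_requests_total{model="llama3.2"} 12
litellm_requests_total{model="claude-opus"} 3
litellm_total_tokens{model="llama3.2"} 4096
"""

total_requests = sum_metric(SAMPLE, "litellm_requests_total")
```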
Grafana Dashboard¶
Use the official LiteLLM Grafana Dashboard to visualize:
- Request volume by model/key
- Latency (p95, p99)
- Token usage and cost
Management UI¶
LiteLLM ships a web UI (enabled when database_url is set in config):
- Accessible at http://localhost:4000/ui
- Login with master key credentials
- Manage virtual keys, view spend by key/model/day, set budgets, test models, view logs
# Start with UI enabled (requires PostgreSQL)
docker run \
  -v "$(pwd)/litellm-config.yaml:/app/config.yaml" \
  -e DATABASE_URL="postgresql://litellm:pass@postgres:5432/litellm" \
  -e LITELLM_MASTER_KEY="${LITELLM_MASTER_KEY}" \
  -p 4000:4000 \
  ghcr.io/berriai/litellm:main-latest \
  --config /app/config.yaml
Python SDK (without proxy)¶
For scripts that don't need a persistent proxy:
import litellm

# Unified call: same interface regardless of provider
response = litellm.completion(
    model="ollama/llama3.2",
    messages=[{"role": "user", "content": "Summarise this text..."}],
    api_base="http://192.168.0.5:30068",
)

# Streaming
for chunk in litellm.completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Explain RAG"}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="")

# Embeddings
embeddings = litellm.embedding(
    model="ollama/nomic-embed-text",
    input=["text to embed"],
    api_base="http://192.168.0.5:30068",
)
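Scripts that bypass the proxy also lose its automatic failover. A minimal hand-rolled substitute is sketched below. The names `complete_with_fallback` and `fake_call` are hypothetical, and `fake_call` stands in for a function that would wrap `litellm.completion` so the sketch runs without any server.

```python
from typing import Callable, Sequence


def complete_with_fallback(call: Callable[[str], str], models: Sequence[str]) -> str:
    """Try each model in order and return the first successful result."""
    errors = []
    for model in models:
        try:
            return call(model)
        except Exception as exc:  # broad on purpose: any provider error triggers fallback
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))


def fake_call(model: str) -> str:
    """Stand-in for a wrapper around litellm.completion(model=model, ...)."""
    if model.startswith("ollama/"):
        raise ConnectionError("local server unreachable")
    return f"answer from {model}"


# The local model fails, so the cloud model answers.
result = complete_with_fallback(
    fake_call, ["ollama/llama3.2", "openrouter/google/gemma-3-27b-it:free"]
)
```

Listing the local model first keeps the free path as the default and reaches for the cloud model only on failure.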
Supported endpoint types¶
LiteLLM proxies the OpenAI API surface along with several non-OpenAI endpoints (Anthropic Messages, Cohere-compatible rerank, agent-to-agent):
| Endpoint | Use |
|---|---|
| /chat/completions | Standard chat |
| /completions | Legacy text completion |
| /embeddings | Vector embeddings |
| /images/generations | Image generation (DALL-E compatible) |
| /audio/transcriptions | Whisper-compatible transcription |
| /audio/speech | TTS |
| /batches | Batch inference |
| /rerank | Document re-ranking (Cohere-compatible) |
| /messages | Anthropic Messages API passthrough |
| /responses | OpenAI Responses API |
| /a2a | Agent-to-agent protocol |
| /realtime | OpenAI Realtime API (WebSocket) |
Realtime API¶
GA as of May 2026, LiteLLM provides a unified WebSocket interface for Realtime multimodal (audio/text) models.
# litellm-config.yaml
model_list:
  - model_name: gpt-4o-realtime
    litellm_params:
      model: openai/gpt-4o-realtime-preview-2024-10-01
Connect via WebSocket:
ws://localhost:4000/v1/realtime?model=gpt-4o-realtime
Memory API¶
Introduced in April 2026, the Memory API allows LiteLLM to maintain long-term state across sessions for specific users or agents.
# Python snippet
import litellm

response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "My favorite color is blue."}],
    user="jules-001",
    integrate_memory=True,  # automatically persists and retrieves context
)
MCP Gateway¶
Hardened in May 2026, the MCP Gateway allows you to securely expose Model Context Protocol (MCP) servers to your agents via the LiteLLM proxy.
# litellm-config.yaml
mcp_servers:
  google_drive:
    transport: "stdio"
    command: "npx"
    args: ["-y", "@modelcontextprotocol/server-google-drive"]
    env:
      GOOGLE_DRIVE_CREDENTIALS: "..."
Integrating with OpenHands¶
# docker-compose.yml: OpenHands + LiteLLM
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    volumes:
      - ./litellm-config.yaml:/app/config.yaml
    environment:
      LITELLM_MASTER_KEY: "${LITELLM_MASTER_KEY}"
      ANTHROPIC_API_KEY: "${ANTHROPIC_API_KEY}"
      OPENROUTER_API_KEY: "${OPENROUTER_API_KEY}"
    ports:
      - "4000:4000"
    command: --config /app/config.yaml
  openhands:
    image: docker.all-hands.dev/all-hands-ai/openhands:latest
    environment:
      LLM_BASE_URL: "http://litellm:4000"
      LLM_MODEL: "openai/qwen2.5-coder-14b"
      LLM_API_KEY: "${OPENHANDS_VIRTUAL_KEY}"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ~/.openhands-state:/.openhands-state
    ports:
      - "3000:3000"
    depends_on:
      - litellm
Security considerations¶
- Master key: Set LITELLM_MASTER_KEY via environment variable; never hardcode it in config
- Virtual key rotation: Revoke and regenerate agent keys periodically; each agent should have its own key
- Secret management: All provider API keys are loaded from environment; LiteLLM never writes them to disk
- Network isolation: Run the proxy on an internal Docker network; expose only port 4000 to trusted agents
- Logging privacy: Prompts and responses are logged to Langfuse (Prometheus receives only metrics); ensure your logging backend is on private infrastructure if requests contain sensitive data
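Helper scripts that talk to the proxy can follow the same rules: read the master key from the environment, fail fast when it is missing, and never print it in full. The sketch below shows one way to do that; `load_master_key` and `redact` are hypothetical helper names, and the example key is a placeholder.

```python
import os


def load_master_key(env=None):
    """Read the master key from the environment and fail fast if it is missing."""
    env = os.environ if env is None else env
    key = env.get("LITELLM_MASTER_KEY", "")
    if not key:
        raise RuntimeError("LITELLM_MASTER_KEY is not set")
    return key


def redact(key, visible=4):
    """Shorten a key for log output so the full secret never reaches the logs."""
    return key[:visible] + "***"


# Demonstrated with an explicit mapping so the sketch runs without real secrets.
example_env = {"LITELLM_MASTER_KEY": "sk-abc123"}
master_key = load_master_key(example_env)
log_line = f"using master key {redact(master_key)}"
```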
Typical use cases¶
- Multi-Agent Orchestration: Providing a single endpoint for OpenHands, OpenClaw, and Aider to share the same model pool and budget.
- Cost Management: Tracking and limiting spend for local experiments vs. production cloud calls.
- Resilient AI Pipelines: Implementing automatic failover from local Ollama models to cloud providers (Anthropic/OpenRouter) during heavy load or local downtime.
- Local Development: Simulating cloud model APIs using local models (e.g., using Qwen 2.5-Coder as an OpenAI-compatible substitute).
Strengths¶
- Protocol normalisation: Every agent speaks one language (OpenAI Chat Completions), regardless of backend
- 100+ provider support: OpenAI, Anthropic, Ollama, Bedrock, Azure, Vertex AI, OpenRouter, Replicate, and more
- Built-in fallbacks: Automatic failover when a model is down or rate-limited
- Cost tracking: Per-key and per-model spend tracked in real time
- Self-hostable: Full control; no third-party telemetry
- Management UI: Visual key management and spend dashboard without external tooling
- Embeddable: Works as a long-running proxy or imported as a Python library
Limitations¶
- Operational overhead: Adds a service to maintain; needs health checks and restart policies
- PostgreSQL dependency: Full UI + key persistence requires a Postgres instance
- Feature parity gaps: Not all provider-specific parameters are exposed; some advanced provider features require raw passthrough
- Local-model latency: Proxying through LiteLLM adds ~5–20 ms per call vs direct Ollama calls
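The latency figure is worth checking against your own setup. The sketch below compares median wall-clock latency of two callables. The helper names are hypothetical, and the `time.sleep` stand-ins are placeholders to be swapped for functions that send the same prompt directly to Ollama and through the proxy.

```python
import statistics
import time


def median_latency_ms(fn, runs=20):
    """Call fn() `runs` times and return the median wall-clock latency in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def proxy_overhead_ms(direct_fn, proxied_fn, runs=20):
    """Estimate the overhead as the difference between the two medians."""
    return median_latency_ms(proxied_fn, runs) - median_latency_ms(direct_fn, runs)


def direct():
    time.sleep(0.001)  # placeholder for a direct Ollama request


def proxied():
    time.sleep(0.003)  # placeholder for the same request via the proxy


overhead = proxy_overhead_ms(direct, proxied, runs=5)
```

Medians are used rather than means so that a single slow request, such as a model cold start, does not dominate the estimate.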
When to use it¶
- When running multiple AI agents with different LLM backends
- When you need a centralised place to track AI spend and enforce budgets
- For resilient systems that can survive provider outages via automatic fallback
- When tools only support OpenAI format but you want to use Ollama or Bedrock
- For team deployments where different people need different API key access levels
When not to use it¶
- If you only ever use one provider and one agent (direct calls are simpler)
- For very simple, low-volume scripts where a proxy adds unnecessary complexity
- When sub-5 ms latency is critical and you are already calling Ollama directly
Related tools / concepts¶
- OpenRouter — cloud-based model router (alternative to LiteLLM proxy for cloud-only use).
- Ollama — local model serving backend.
- OpenHands — software engineering agent; recommended to pair with LiteLLM.
- OpenClaw — agent platform; can use LiteLLM for model routing.
- Local LLMs — overview of local model options.
- vLLM — high-throughput alternative to Ollama for inference.
- Langfuse — observability backend for LiteLLM.
- Model Routing Guide — logic for choosing models via LiteLLM.
- Unified Search — script that can benefit from unified LLM access.
Sources / References¶
- LiteLLM Documentation
- GitHub — BerriAI/litellm
- LiteLLM Proxy Quick Start
- Virtual Keys & Budgets
- Supported Providers
Backlog¶
- [x] Perform quarterly technical freshness audit. (Completed 2026-05-26)
Contribution Metadata¶
- Last reviewed: 2026-05-26
- Confidence: high