# Infrastructure
Inference engines, serving stacks, quantisation tools, vector databases, and deployment infrastructure for AI/LLM workloads.
## Contents
| Tool | What it does |
|---|---|
| Aphrodite Engine | Inference engine forked from vLLM for local use |
| ClawRouter | Agent-native routing layer for OpenClaw model selection |
| ExLlamaV2 | Optimized GPTQ/EXL2 inference for consumer GPUs |
| Jan.ai | Local, open-source AI desktop client |
| LiteLLM | Unified LLM API proxy |
| llama.cpp | Lightweight local inference runtime for quantized LLMs |
| LocalAI | Self-hosted OpenAI-compatible local inference platform |
| MLX | Apple's array framework for ML on Apple Silicon |
| Msty | Local-first AI desktop app with model hub |
| Ollama | Local LLM inference server |
| OpenPipe | Data-driven fine-tuning platform |
| SGLang | Fast structured generation runtime from LMSYS |
| Supabase | Postgres-first backend platform for app and workflow state |
| Text Generation Inference (TGI) | Hugging Face's production inference server |
| vLLM | High-throughput LLM serving engine (PagedAttention) |
| ZSE | Fast cold-start LLM inference engine |
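Several of the servers above (vLLM, Ollama, LocalAI, TGI, and LiteLLM as a proxy) expose OpenAI-compatible endpoints, so a thin routing layer can choose a backend per request. A minimal sketch of that idea; the backend fleet, ports, and routing rule here are illustrative assumptions, not ClawRouter's or LiteLLM's actual logic:

```python
from dataclasses import dataclass

@dataclass
class Backend:
    name: str        # server from the table above
    base_url: str    # OpenAI-compatible /v1 endpoint (illustrative defaults)
    max_context: int # context window in tokens

# Hypothetical local fleet; adjust URLs/ports to your deployment.
BACKENDS = [
    Backend("ollama", "http://localhost:11434/v1", 8_192),
    Backend("vllm", "http://localhost:8000/v1", 32_768),
]

def route(prompt_tokens: int) -> Backend:
    """Pick the smallest backend whose context window fits the prompt."""
    for b in sorted(BACKENDS, key=lambda b: b.max_context):
        if prompt_tokens <= b.max_context:
            return b
    raise ValueError("prompt exceeds every backend's context window")
```

In practice the chosen backend's `base_url` would then be handed to any OpenAI-compatible client library, which is what makes this kind of routing layer server-agnostic.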
## Hardware Highlights
As of early 2026, Apple Silicon remains the dominant platform for high-performance local AI inference in homelab settings:
- Apple M5 Pro / M5 Max: Unveiled March 2026, offering up to 4× faster LLM prompt processing compared to previous generations, significantly reducing agentic loop latency.
- Apple M3 Ultra: Benchmark results for 11 MLX models (March 2026) confirm it as a premier choice for running large-scale local models with unified memory.
## Sub-categories
- Inference engines — vLLM, TGI, llama.cpp, MLX, etc.
- Vector databases — Pinecone, Weaviate, Milvus, Qdrant, etc.
- Serving & routing — Load balancers, model routers, API gateways
- Quantisation & optimisation — GGUF, GPTQ, AWQ, etc.
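The point of quantisation formats like GGUF, GPTQ, and AWQ is memory: weight footprint scales roughly as parameters × bits-per-weight ÷ 8, which is what makes large models fit on consumer GPUs or unified memory. A back-of-envelope sketch; the bits-per-weight figures are approximate averages for illustration, not exact per-format specifications:

```python
def model_size_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB: params * bits-per-weight / 8 bytes."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 2**30

# FP16 vs ~4.5-bit quantisation for a 7B-parameter model (illustrative)
fp16 = model_size_gib(7, 16)   # ≈ 13.0 GiB
q4   = model_size_gib(7, 4.5)  # ≈ 3.7 GiB
```

This estimate covers weights only; KV-cache and activation memory grow with context length and come on top of it.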