# Infrastructure
Inference engines, serving stacks, quantisation tools, vector databases, and deployment infrastructure for AI/LLM workloads.
## Contents
| Tool | What it does |
|---|---|
| Aphrodite Engine | Inference engine forked from vLLM for local use |
| ClawRouter | Agent-native routing layer for OpenClaw model selection |
| ExLlamaV2 | Optimized GPTQ/EXL2 inference for consumer GPUs |
| Jan.ai | Local, open-source AI desktop client |
| LiteLLM | Unified LLM API proxy |
| llama.cpp | Lightweight local inference runtime for quantized LLMs |
| LocalAI | Self-hosted OpenAI-compatible local inference platform |
| MLX | Apple's array framework for ML on Apple Silicon |
| Msty | Local-first AI desktop app with model hub |
| Ollama | Local LLM inference server |
| OpenPipe | Data-driven fine-tuning platform |
| SGLang | Fast structured generation runtime from LMSYS |
| Supabase | Postgres-first backend platform for app and workflow state |
| Text Generation Inference (TGI) | Hugging Face's production inference server |
| vLLM | High-throughput LLM serving engine (PagedAttention) |
| ZSE | Fast cold-start LLM inference engine |
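Several of the servers above (vLLM, Ollama, LocalAI, TGI, and LiteLLM as a proxy) expose OpenAI-compatible endpoints, so a thin routing layer can choose a backend per request. A minimal sketch of that idea; the backend fleet, ports, and routing rule here are illustrative assumptions, not ClawRouter's or LiteLLM's actual logic:

```python
from dataclasses import dataclass

@dataclass
class Backend:
    name: str        # server from the table above
    base_url: str    # OpenAI-compatible /v1 endpoint (illustrative defaults)
    max_context: int # context window in tokens

# Hypothetical local fleet; adjust URLs/ports to your deployment.
BACKENDS = [
    Backend("ollama", "http://localhost:11434/v1", 8_192),
    Backend("vllm", "http://localhost:8000/v1", 32_768),
]

def route(prompt_tokens: int) -> Backend:
    """Pick the smallest backend whose context window fits the prompt."""
    for b in sorted(BACKENDS, key=lambda b: b.max_context):
        if prompt_tokens <= b.max_context:
            return b
    raise ValueError("prompt exceeds every backend's context window")
```

In practice the chosen backend's `base_url` would then be handed to any OpenAI-compatible client library, which is what makes this kind of routing layer server-agnostic.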
## Hardware Highlights
As of early 2026, Apple Silicon remains the dominant platform for high-performance local AI inference in homelab settings:
- Apple M5 Pro / M5 Max: Unveiled March 2026, offering up to 4× faster LLM prompt processing compared to previous generations, significantly reducing agentic loop latency.
- Apple M3 Ultra: Benchmark results for 11 MLX models (March 2026) confirm it as a premier choice for running large-scale local models with unified memory.
## Sub-categories
- Inference engines — vLLM, TGI, llama.cpp, MLX, etc.
- Vector databases — Pinecone, Weaviate, Milvus, Qdrant, etc.
- Serving & routing — Load balancers, model routers, API gateways
- Quantisation & optimisation — GGUF, GPTQ, AWQ, etc.
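The point of quantisation formats like GGUF, GPTQ, and AWQ is memory: weight footprint scales roughly as parameters × bits-per-weight ÷ 8, which is what makes large models fit on consumer GPUs or unified memory. A back-of-envelope sketch; the bits-per-weight figures are approximate averages for illustration, not exact per-format specifications:

```python
def model_size_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GiB: params * bits-per-weight / 8 bytes."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 2**30

# FP16 vs ~4.5-bit quantisation for a 7B-parameter model (illustrative)
fp16 = model_size_gib(7, 16)   # ≈ 13.0 GiB
q4   = model_size_gib(7, 4.5)  # ≈ 3.7 GiB
```

This estimate covers weights only; KV-cache and activation memory grow with context length and come on top of it.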