
Local Vision Models Research

What it is

A research summary, current as of May 2026, of local vision-language models (VLMs) and other multimodal models that can run on homelab hardware. These models let AI agents "see" images and video, producing semantic descriptions, object detections, and visual reasoning.

What problem it solves

Automates tagging, captioning, and search over home video and image archives (e.g., "Find the video of the birthday party") without relying on cloud services. Family memories stay private while still benefiting from modern semantic search and agentic visual understanding.

Where it fits in the stack

Processes raw video and image files stored on TrueNAS or managed by Immich. It acts as the Inference Layer for visual data, feeding structured descriptions into a vector database (see Vector Database Comparison) for long-term memory.

Typical use cases

  • Automated Metadata Generation: Generating descriptions for home video frames and images.
  • Semantic Media Search: Natural language search over video content (e.g., "scenes with the dog in the garden").
  • Agentic Visual Reasoning: Allowing a Home Admin Agent to answer questions about the physical world (e.g., "Is the garage door closed in this photo?").
  • Document Analysis: Advanced OCR and table extraction from complex document images using models like Florence-2.

Top Local Vision Models (2026)

Model | Size | Strengths | Best For
InternVL2 | 1B - 76B | State-of-the-art visual reasoning, excellent OCR. | High-accuracy document & scene analysis.
Llama 3.2 Vision | 11B / 90B | Broad knowledge, excellent instruction following. | General-purpose assistant with vision.
Florence-2 | 0.2B - 0.7B | Extremely fast object detection, segmentation, OCR. | High-throughput metadata tagging.
Moondream2 | 1.6B | Compact, efficient, runs on almost any hardware. | Fast, simple image captioning on CPUs.
Qwen2-VL | 2B - 72B | Exceptional at document understanding and multi-image input. | Multi-page PDF analysis and video-as-images.

Strengths

  • Privacy: Zero-egress processing of sensitive personal media.
  • Cost: No per-token or per-image fees of the kind charged by cloud APIs such as GPT-4o.
  • Native Integration: Directly integrates with local storage and n8n workflows.
  • Low Latency: No network round trip; inference runs directly on local NVIDIA or Apple Silicon hardware.

Limitations

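  • VRAM Intensive: Models of 11B parameters and up typically need 12 GB or more of VRAM for comfortable inference; the exact figure depends on quantization and context length.
  • Sequential Processing: Analyzing high-FPS video frame by frame requires significant compute, so token pooling or keyframe extraction is effectively mandatory.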
  • Accuracy: While competitive, local models may still lag behind flagship cloud models in extreme edge cases or high-resolution details.
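
The VRAM figure in the list above can be sanity-checked with back-of-the-envelope arithmetic: weight memory is roughly parameter count times bytes per weight. The sketch below is only an estimate. The function names and the fixed `overhead_gb` value are assumptions for illustration; real overhead (KV cache, activations, image tokens) varies with the runtime, context length, and image resolution.

```python
def estimate_weight_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


def fits_in_vram(params_billions: float, bits_per_weight: int,
                 vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """Compare weights plus an assumed fixed overhead against a VRAM budget.

    overhead_gb is a placeholder, not a measured value.
    """
    needed = estimate_weight_gb(params_billions, bits_per_weight) + overhead_gb
    return needed <= vram_gb


if __name__ == "__main__":
    for bits in (16, 8, 4):
        gb = estimate_weight_gb(11, bits)
        print(f"11B at {bits}-bit: ~{gb:.1f} GB of weights")
```

Under these assumptions an 11B model needs about 22 GB of weights at 16-bit and about 5.5 GB at 4-bit, which is why quantized variants are the usual choice on 12 GB cards.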

When to use it

  • Use Whisper for all audio transcription needs in the homelab.
  • Use CLIP or SigLIP for implementing "search by description" in image/video galleries.
  • Use Florence-2 for specialized tasks like object detection, OCR, and regional captioning.
  • Use Moondream2 for generating quick, natural language captions for personal photos.
  • Use InternVL2 or Llama 3.2 Vision when complex reasoning about an image is required.

When not to use it

  • Do not use for real-time video surveillance analysis on low-power CPU-only nodes.
  • Do not rely on 100% accuracy for critical forensic identification without human verification.
  • Avoid using for long-form video understanding without a robust keyframe extraction or pooling pipeline.

Implementation Patterns

1. Keyframe Extraction (Efficient)

Extract one frame every X seconds or only when significant motion/scene change is detected.

# Extract one frame every 10 seconds
ffmpeg -i input.mp4 -vf "fps=1/10" frame_%04d.jpg
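
The fixed-interval command above covers the "every X seconds" case. For the scene-change case, a minimal sketch of the idea follows: keep a frame only when it differs enough from the last kept keyframe. Frames are represented here as flat sequences of grayscale pixel values, and the function names and the default `threshold` are assumptions for illustration, not part of any particular tool.

```python
from typing import List, Sequence


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute per-pixel difference between two equally sized frames."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_keyframes(frames: Sequence[Sequence[float]],
                     threshold: float = 20.0) -> List[int]:
    """Return indices of frames that differ enough from the last kept keyframe."""
    if not frames:
        return []
    kept = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        # Compare against the last keyframe, not the previous frame,
        # so slow gradual drift still eventually triggers a new keyframe.
        if mean_abs_diff(frames[i], frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept
```

Comparing against the last kept keyframe rather than the immediately preceding frame is a design choice: it avoids missing slow pans where no single frame-to-frame change crosses the threshold.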

2. Video Token Pooling (Advanced)

Some 2026 models support "Video Input" natively by sampling frames and pooling their tokens into a single context window, allowing for better temporal understanding than individual frame analysis.
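
As a rough illustration of the pooling idea, the sketch below averages runs of consecutive token vectors within each frame and then concatenates the pooled tokens in temporal order. It is a simplification under stated assumptions: real models typically pool over a 2-D patch grid inside the network and differ in the details, and the names `pool_frame_tokens` and `build_video_context` are hypothetical.

```python
from typing import List

Vector = List[float]


def pool_frame_tokens(tokens: List[Vector], pool_size: int) -> List[Vector]:
    """Average each run of `pool_size` consecutive token vectors into one."""
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    pooled = []
    for start in range(0, len(tokens), pool_size):
        group = tokens[start:start + pool_size]
        dim = len(group[0])
        pooled.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return pooled


def build_video_context(frames: List[List[Vector]], pool_size: int) -> List[Vector]:
    """Pool every frame, then concatenate the results in temporal order."""
    context: List[Vector] = []
    for frame_tokens in frames:
        context.extend(pool_frame_tokens(frame_tokens, pool_size))
    return context
```

With `pool_size=2`, a clip of two frames at four tokens each shrinks from eight tokens to four, which is the trade the pooling approach makes: less spatial detail per frame in exchange for more frames in one context window.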

3. Visual RAG

Store image descriptions in a Vector DB. When a user asks "Where are the photos of the lake?", the agent searches the descriptions to find relevant timestamps/filenames.
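
The retrieval step can be sketched without any external services. The example below uses word-count vectors and cosine similarity as a stand-in for a real embedding model and vector database; the filenames, descriptions, and stopword list are hypothetical, and a production setup would swap `embed` for calls to an actual embedding model.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

# Small illustrative stopword list; a real embedding model makes this unnecessary.
STOPWORDS = {"a", "an", "the", "of", "in", "at", "on", "with", "are", "is", "where", "through"}


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts with stopwords removed."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(index: Dict[str, str], query: str, top_k: int = 3) -> List[Tuple[str, float]]:
    """Rank stored descriptions against the query; drop entries with no overlap."""
    q = embed(query)
    scored = [(name, cosine(q, embed(desc))) for name, desc in index.items()]
    scored = [s for s in scored if s[1] > 0]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:top_k]


# Hypothetical filenames/timestamps mapped to model-generated descriptions
INDEX = {
    "IMG_0001.jpg": "Two children swimming in a lake at sunset",
    "IMG_0002.jpg": "A dog running through the garden",
    "VID_0003.mp4@00:01:20": "Birthday cake with candles on a kitchen table",
}

if __name__ == "__main__":
    print(search(INDEX, "Where are the photos of the lake?"))
```

Keying the index by filename (or filename plus timestamp for video) is what lets the agent return a concrete location rather than just a matching description.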

Getting started

Running InternVL2 via Ollama

InternVL2 is often the 2026 recommendation for high-accuracy local vision.

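# Pull the InternVL2 model (assuming an 8B variant is published under this tag;
# check the Ollama model library for the exact name)
ollama pull internvl2:8b

# Query with an image: include the file path in the prompt itself
ollama run internvl2:8b "What is in this image? ./kitchen.jpg"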
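
For use from scripts or agents, the same query can go through Ollama's local HTTP API, which accepts base64-encoded images on the `/api/generate` endpoint. The following is a minimal sketch using only the standard library. It assumes an Ollama server on the default port and a vision-capable model already pulled; the helper names are hypothetical.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build a non-streaming /api/generate request with one base64-encoded image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def describe_image(model: str, prompt: str, image_path: str) -> str:
    """Send the image to a running Ollama server and return the generated text."""
    with open(image_path, "rb") as f:
        payload = build_payload(model, prompt, f.read())
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

A call such as `describe_image("internvl2:8b", "What is in this image?", "./kitchen.jpg")` would return the caption as a string, provided that model tag exists on the local server.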

Python: Regional Captioning with Florence-2

Florence-2 is well suited to finding where things are in a frame: its region tasks return bounding boxes alongside captions.

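from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
# Florence-2 ships custom modeling code, hence trust_remote_code=True
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("backyard.jpg").convert("RGB")
# Task token for per-region captions with bounding boxes
task = "<DENSE_REGION_CAPTION>"

inputs = processor(text=task, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
)
# Keep special tokens: the post-processor parses location tokens from them
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
results = processor.post_process_generation(
    generated_text, task=task, image_size=(image.width, image.height)
)
print(results)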

Sources / references

Contribution Metadata

  • Last reviewed: 2026-05-28
  • Confidence: high