# Audio Transcription Research: Whisper Variants for Long-Form Audio
This document compares different Whisper variants for transcribing personal audiobooks and podcasts in a homelab environment, focusing on performance, accuracy, and hardware requirements for long-form audio.
## Comparison Table
| Model Variant | Engine | Speed (vs. Large-v3) | Memory (Approx.) | Multilingual | Best For |
|---|---|---|---|---|---|
| Whisper (Large-v3) | Transformers/OpenAI | 1.0x (Baseline) | ~10GB VRAM | Yes | Maximum accuracy (multilingual) |
| Faster-Whisper | CTranslate2 | 2x - 4x | ~5GB VRAM | Yes | Standard homelab CPU/GPU use |
| Distil-Whisper | Transformers | ~6x | ~5GB VRAM | No (English) | Speed & hallucination resistance |
| Faster-Distil-Whisper | CTranslate2 | ~8x - 10x | ~3GB VRAM | No (English) | Best performance on limited hardware |
| Whisper Turbo | Transformers | ~6x | ~6GB VRAM | Yes | Fast multilingual transcription |
## Key Findings
### 1. Distil-Whisper (distil-large-v3)
- **Performance:** Up to 6x faster than `large-v3`.
- **Accuracy:** Within 1% Word Error Rate (WER) of the original model.
- **Long-Form:** Specifically optimized for long-form audio to reduce hallucinations (repeated phrases) often seen in vanilla Whisper during silence or background noise.
- **Limitation:** Currently supports English only.
### 2. Faster-Whisper
- **Implementation:** Uses CTranslate2, a fast inference engine for Transformer models.
- **Efficiency:** Significantly faster and more memory-efficient than the Hugging Face `transformers` implementation.
- **Flexibility:** Can load `distil-whisper` models, providing the best of both worlds (distilled architecture + CTranslate2 speed).
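The combination described above can be sketched with the `faster-whisper` Python package. This is a minimal example, assuming the package is installed and that the hub model name `distil-large-v3` resolves to a CTranslate2 conversion; the beam size is illustrative.

```python
def transcribe(path: str, model_size: str = "distil-large-v3",
               device: str = "cuda", compute_type: str = "float16"):
    """Transcribe an audio file with faster-whisper, yielding (start, end, text)."""
    # Deferred import so this module loads even without `pip install faster-whisper`.
    from faster_whisper import WhisperModel

    # Load the distilled model through the CTranslate2 backend.
    model = WhisperModel(model_size, device=device, compute_type=compute_type)

    # transcribe() returns a lazy generator of segments plus language info.
    segments, info = model.transcribe(path, beam_size=5)
    for seg in segments:
        yield seg.start, seg.end, seg.text
```

Because the segment generator is lazy, decoding happens as you iterate, so a long audiobook can be streamed to disk segment by segment rather than held in memory.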
### 3. Hardware Requirements
- **GPU:** An NVIDIA GPU with at least 8GB VRAM is recommended for `large` or `distil-large` models in float16.
- **CPU:** `faster-whisper` is highly optimized for CPU (using INT8 quantization), making it viable for NAS-based transcription without a dedicated GPU.
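For a GPU-less NAS, the same model can be loaded on CPU with INT8 quantization. A minimal sketch, assuming `faster-whisper` is installed; the thread count is illustrative and should match the box's available cores.

```python
def build_cpu_model(model_size: str = "distil-large-v3", threads: int = 4):
    """Load a Whisper model for CPU-only inference with INT8 weights."""
    # Deferred import so this module loads even without `pip install faster-whisper`.
    from faster_whisper import WhisperModel

    # compute_type="int8" quantizes weights, cutting memory use and
    # speeding up CPU inference at a small accuracy cost.
    return WhisperModel(model_size, device="cpu",
                        compute_type="int8", cpu_threads=threads)
```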
## Recommendations for Homelab
- **Primary Choice (English):** Use `faster-whisper` with the `distil-large-v3` model. This provides the best balance of speed, low resource usage, and accuracy for English podcasts/audiobooks.
- **Multilingual Choice:** Use `faster-whisper` with `large-v3-turbo`, or standard `large-v3` if accuracy is paramount for non-English content.
- **Pipeline Strategy:** Use Voice Activity Detection (VAD) to skip silence in long-form audio, which further improves speed and prevents hallucinations. `faster-whisper` has integrated Silero VAD support.
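The VAD strategy maps onto `faster-whisper`'s built-in Silero VAD filter. A sketch, assuming a model object built as in the earlier examples; the silence threshold is illustrative.

```python
def transcribe_with_vad(model, path: str):
    """Transcribe while skipping silence via the integrated Silero VAD."""
    segments, _info = model.transcribe(
        path,
        vad_filter=True,  # drop non-speech chunks before decoding
        vad_parameters={"min_silence_duration_ms": 500},  # treat pauses >0.5s as silence
    )
    return [(s.start, s.end, s.text) for s in segments]
```

Skipping silence helps twice: less audio is decoded, and Whisper never sees the empty stretches where it tends to hallucinate repeated phrases.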
- Last reviewed: 2025-05-15
- Confidence: high