Fish Audio (Fish Speech)
What it is
Fish Audio (Fish Speech) is a state-of-the-art multilingual text-to-speech (TTS) platform powered by a Dual-Autoregressive (Dual-AR) architecture. It is designed for high-fidelity, expressive voice synthesis and rapid voice cloning with minimal latency.
What problem it solves
It provides an open-source, high-performance alternative to proprietary TTS services such as ElevenLabs. Its transformer-based architecture, structurally similar to an LLM, enables fine-grained emotional control, and pairing it with inference acceleration frameworks yields a low Real-Time Factor (RTF; see the worked example below).
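To make RTF concrete: it is the ratio of synthesis time to the duration of the audio produced, so values below 1.0 mean faster-than-real-time generation. A minimal sketch (the function is illustrative, not part of Fish Speech):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of audio produced."""
    return synthesis_seconds / audio_seconds

# Generating 10 s of speech in 1.95 s of compute gives RTF = 0.195,
# i.e. roughly 5x faster than real time.
print(real_time_factor(1.95, 10.0))  # 0.195
```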
Where it fits in the stack
Category: AI Assistants & Knowledge / Audio Generation
Typical use cases
- Expressive Narrators: Generating audiobooks or podcast content with precise emotional cues.
- Conversational AI: Powering real-time agents that sound natural and responsive.
- Voice Cloning: Creating high-fidelity digital twins from as little as 10-30 seconds of reference audio.
- Multilingual Content: Synthesizing speech in over 80 languages without phoneme-level preprocessing.
Strengths
- Fine-Grained Emotion Control: Supports inline natural-language tags (e.g., [whisper], [excited], [laughing]) to control prosody and emotion at the sub-word level; see the sketch after this list.
- Innovative Dual-AR Architecture: Combines a 4B-parameter "Slow AR" model for semantic prediction with a 400M-parameter "Fast AR" model for acoustic detail reconstruction.
- Extreme Performance: Powered by SGLang, achieving an RTF of ~0.195 and Time-to-First-Audio (TTFA) of ~100ms on high-end GPUs.
- RL Alignment: Uses Group Relative Policy Optimization (GRPO) to align generated speech with human acoustic preferences.
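A minimal sketch of how the inline emotion tags might be embedded in input text. The tag names follow the examples above; how the text reaches the model (CLI flag, WebUI box, or API field) depends on your setup, so the final step is left as a comment:

```python
# Inline tags mark emotion/prosody at the exact point in the text
# where they appear; tag names follow the examples in this section.
script = (
    "[excited] We just shipped the new release! "
    "[whisper] But keep it quiet until Monday. "
    "[laughing] You know how launch days go."
)

# Pass `script` to whichever Fish Speech interface you use, e.g. as
# the --text argument of the CLI shown under "Getting started".
print(script)
```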
Limitations
- Hardware Intensity: The flagship 4B model requires significant VRAM (ideally NVIDIA H200/A100 or RTX 4090) for optimal throughput.
- Model Size: While optimized, the combined Dual-AR system is larger than lightweight models like Kokoro TTS.
Getting started
Installation
```bash
# Clone the repository
git clone https://github.com/fishaudio/fish-speech.git
cd fish-speech

# Install dependencies using uv
uv sync
```
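Model checkpoints are distributed separately from the repository. A hedged sketch using huggingface_hub is below; the repo id is an assumption, so check the project README for the checkpoint matching your version:

```python
# Download model weights before running inference.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="fishaudio/fish-speech-1.5",  # assumed repo id; verify in the README
    local_dir="checkpoints/fish-speech-1.5",
)
```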
CLI Inference
```bash
# Generate speech from text, using a reference audio for voice cloning
python -m tools.llama.generate \
    --text "Hello, this is a test of Fish Audio S2 Pro." \
    --prompt-text "Reference audio transcript" \
    --prompt-tokens "path/to/reference.wav" \
    --output "output.wav"
```
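For batch jobs it can be convenient to drive the same CLI from Python. The sketch below reuses the flags from the example above; verify them against your checkout, as CLI options can change between releases:

```python
# Batch-generate speech by invoking the CLI once per line of text.
import subprocess

lines = ["First sentence.", "Second sentence."]
for i, text in enumerate(lines):
    subprocess.run(
        [
            "python", "-m", "tools.llama.generate",
            "--text", text,
            "--prompt-text", "Reference audio transcript",
            "--prompt-tokens", "path/to/reference.wav",
            "--output", f"output_{i}.wav",
        ],
        check=True,  # raise if generation fails
    )
```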
WebUI
```bash
# Launch the Gradio-based interface
python -m tools.webui
```
Technical details
- Architecture: Dual-Autoregressive Transformer + RVQ Audio Codec (10 codebooks, ~21 Hz); see the token-budget sketch after this list.
- Training Data: Over 10 million hours of audio spanning 80+ languages.
- Optimization: Supports Continuous Batching, Paged KV Cache, and RadixAttention-based Prefix Caching via SGLang.
- Alignment: Employs Reward Models to score semantic accuracy, timbre similarity, and acoustic preference.
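To make the codec figures concrete, here is a back-of-envelope token budget from the numbers quoted above (~21 Hz frame rate, 10 codebooks), assuming each codebook emits one token per frame:

```python
# Rough sequence-length estimate for the RVQ codec described above.
FRAME_RATE_HZ = 21   # ~21 codec frames per second of audio
NUM_CODEBOOKS = 10   # tokens per frame

def codec_tokens(audio_seconds: float) -> int:
    return round(audio_seconds * FRAME_RATE_HZ * NUM_CODEBOOKS)

print(codec_tokens(10.0))  # ~2100 tokens for 10 s of speech
```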
Related tools / concepts
- Kokoro TTS (Lightweight local alternative)
- Whisper (Audio transcription)
- SGLang (Inference acceleration)
- ElevenLabs (Proprietary comparison)
- ChatTTS (Conversational TTS models)
Contribution Metadata
- Last reviewed: 2026-05-17
- Confidence: high