Fish Audio (Fish Speech)

What it is

Fish Audio (Fish Speech) is a state-of-the-art multilingual text-to-speech (TTS) platform powered by a Dual-Autoregressive (Dual-AR) architecture. It is designed for high-fidelity, expressive voice synthesis and rapid voice cloning with minimal latency.
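
The Dual-AR idea can be sketched in a few lines (all model internals here are random stand-ins, not real Fish Speech code): a large "slow" model proposes one semantic token per frame, and a small "fast" model expands each frame into acoustic tokens for every codebook.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CODEBOOKS = 10   # RVQ codebooks per frame (see Technical details)
FRAMES = 5           # frames to generate (~21 Hz per frame)

def slow_ar_step(history):
    """Stand-in for the large 'Slow AR' model: one semantic token per frame."""
    return int(rng.integers(0, 1024))

def fast_ar_step(semantic_token):
    """Stand-in for the small 'Fast AR' model: acoustic tokens for each codebook."""
    return [int(rng.integers(0, 1024)) for _ in range(NUM_CODEBOOKS)]

semantic, acoustic = [], []
for _ in range(FRAMES):
    s = slow_ar_step(semantic)        # outer (slow) autoregressive loop
    semantic.append(s)
    acoustic.append(fast_ar_step(s))  # inner (fast) loop fills the codebooks

print(len(semantic), len(acoustic), len(acoustic[0]))  # 5 5 10
```

The split lets the expensive semantic model run once per frame while the cheap acoustic model handles the per-codebook detail, which is where the latency savings come from.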

What problem it solves

It provides an open-source, high-performance alternative to proprietary TTS services such as ElevenLabs. Because its transformer stack is structurally identical to an LLM, it can reuse LLM inference acceleration frameworks to reach an industry-leading Real-Time Factor (RTF) while still supporting fine-grained emotional control.
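
RTF is simply synthesis time divided by audio duration; a quick check of what the figures quoted on this page mean (the helper function is illustrative, not part of the library):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means audio is produced faster than it plays back."""
    return synthesis_seconds / audio_seconds

# An RTF of 0.195 means 10 s of audio takes about 1.95 s to synthesize.
print(real_time_factor(1.95, 10.0))  # 0.195
```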

Where it fits in the stack

Category: AI Assistants & Knowledge / Audio Generation

Typical use cases

  • Expressive Narrators: Generating audiobooks or podcast content with precise emotional cues.
  • Conversational AI: Powering real-time agents that sound natural and responsive.
  • Voice Cloning: Creating high-fidelity digital twins from as little as 10-30 seconds of reference audio.
  • Multilingual Content: Synthesizing speech in over 80 languages without phoneme-level preprocessing.

Strengths

  • Fine-Grained Emotion Control: Supports inline natural language tags (e.g., [whisper], [excited], [laughing]) to control prosody and emotion at the sub-word level.
  • Innovative Dual-AR Architecture: Combines a 4B parameter "Slow AR" model for semantic prediction with a 400M parameter "Fast AR" model for acoustic detail reconstruction.
  • Extreme Performance: Powered by SGLang, achieving an RTF of ~0.195 and Time-to-First-Audio (TTFA) of ~100ms on high-end GPUs.
  • RL Alignment: Uses Group Relative Policy Optimization (GRPO) to align generated speech with human acoustic preferences.
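
The inline tag syntax can be illustrated with a small parser (the tag names come from this page; the splitting helper itself is a hypothetical sketch, not part of Fish Speech):

```python
import re

TAG_PATTERN = re.compile(r"\[(\w+)\]")

def split_emotion_tags(text: str):
    """Split a prompt into (tag, text) segments; None means default delivery."""
    segments, current_tag, last = [], None, 0
    for m in TAG_PATTERN.finditer(text):
        chunk = text[last:m.start()].strip()
        if chunk:
            segments.append((current_tag, chunk))
        current_tag, last = m.group(1), m.end()
    tail = text[last:].strip()
    if tail:
        segments.append((current_tag, tail))
    return segments

prompt = "I have a secret. [whisper] Come closer. [excited] You won!"
print(split_emotion_tags(prompt))
# [(None, 'I have a secret.'), ('whisper', 'Come closer.'), ('excited', 'You won!')]
```

Because a tag applies from where it appears until the next tag, emotion can change mid-sentence, which is what "sub-word level" control refers to above.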

Limitations

  • Hardware Intensity: The flagship 4B model requires significant VRAM (ideally NVIDIA H200/A100 or RTX 4090) for optimal throughput.
  • Model Size: While optimized, the combined Dual-AR system is larger than lightweight models like Kokoro TTS.
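
A rough back-of-envelope for the VRAM note above (assuming fp16/bf16 weights at 2 bytes per parameter, and ignoring the KV cache and activations, which add several more GB in practice):

```python
def weight_vram_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate GiB needed just to hold the model weights."""
    return params_billion * 1e9 * bytes_per_param / 1024**3

# Slow AR (4B) + Fast AR (0.4B) in half precision:
print(round(weight_vram_gb(4.0) + weight_vram_gb(0.4), 1))  # ~8.2 GiB for weights alone
```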

Getting started

Installation

# Clone the repository
git clone https://github.com/fishaudio/fish-speech.git
cd fish-speech

# Install dependencies using uv
uv sync

CLI Inference

# 1. Encode the reference audio into VQ prompt tokens (writes fake.npy)
python tools/vqgan/inference.py \
    --input-path "path/to/reference.wav"

# 2. Generate semantic tokens from text, conditioned on the reference (writes codes_0.npy)
python tools/llama/generate.py \
    --text "Hello, this is a test of Fish Audio S2 Pro." \
    --prompt-text "Reference audio transcript" \
    --prompt-tokens "fake.npy"

# 3. Decode the generated tokens back into a waveform (writes fake.wav)
python tools/vqgan/inference.py \
    --input-path "codes_0.npy"

WebUI

# Launch the Gradio-based interface
python -m tools.webui

Technical details

  • Architecture: Dual-Autoregressive Transformer + RVQ Audio Codec (10 codebooks, ~21 Hz).
  • Training Data: Trained on over 10 million hours of audio data across 80+ languages.
  • Optimization: Supports Continuous Batching, Paged KV Cache, and RadixAttention-based Prefix Caching via SGLang.
  • Alignment: Employs Reward Models to score semantic accuracy, timbre similarity, and acoustic preference.
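
The codec numbers above imply the token budget per second of audio (a simple check, assuming all 10 codebooks are emitted for every frame):

```python
FRAME_RATE_HZ = 21   # frames of codec tokens per second of audio
NUM_CODEBOOKS = 10   # RVQ codebooks per frame

tokens_per_second = FRAME_RATE_HZ * NUM_CODEBOOKS
print(tokens_per_second)  # 210 codec tokens per second of audio

# At RTF ~0.195, one second of audio is generated in ~0.195 s of wall time,
# so the system sustains roughly 210 / 0.195 ~= 1077 tokens/s of throughput.
print(round(tokens_per_second / 0.195))  # 1077
```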

Contribution Metadata

  • Last reviewed: 2026-05-17
  • Confidence: high