    AI & Tech

    Fish Audio Releases Fish Audio S2: A New Generation of Expressive Text-to-Speech (TTS) with Absurdly Controllable Emotion

By Naveed Ahmad · March 11, 2026 · 5 Mins Read


    The landscape of Text-to-Speech (TTS) is moving away from modular pipelines toward integrated Large Audio Models (LAMs). Fish Audio’s release of S2-Pro, the flagship model within the Fish Speech ecosystem, represents a shift toward open architectures capable of high-fidelity, multi-speaker synthesis with sub-150ms latency. The release provides a framework for zero-shot voice cloning and granular emotional control using a Dual-Auto-Regressive (AR) approach.

    Architecture: The Dual-AR Framework and RVQ

    The fundamental technical distinction in Fish Audio S2-Pro is its hierarchical Dual-AR architecture. Traditional TTS models often struggle with the trade-off between sequence length and acoustic detail. S2-Pro addresses this by bifurcating the generation process into two specialized stages: a ‘Slow AR’ model and a ‘Fast AR’ model.

    1. The Slow AR Model (4B Parameters): This component operates on the time-axis. It is responsible for processing linguistic input and generating semantic tokens. By utilizing a larger parameter count (approximately 4 billion), the Slow AR model captures long-range dependencies, prosody, and the structural nuances of speech.
    2. The Fast AR Model (400M Parameters): This component processes the acoustic dimension. It predicts the residual codebooks for each semantic token. This smaller, faster model ensures that the high-frequency details of the audio—timbre, breathiness, and texture—are generated with high efficiency.
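The two-stage loop described above can be sketched in a few lines. This is a toy illustration, not the Fish Speech API: `slow_ar` and `fast_ar` are hypothetical stand-ins for the 4B and 400M models, and the token values are arbitrary.

```python
# Toy sketch of the Dual-AR generation loop. slow_ar and fast_ar are
# illustrative stand-ins, not the actual Fish Speech models or API.

def slow_ar(text_tokens, history):
    # Stand-in for the 4B "Slow AR" model: emits one semantic token per
    # time step, conditioned on the text and the tokens generated so far.
    return (sum(text_tokens) + len(history)) % 1024

def fast_ar(semantic_token, num_codebooks=8):
    # Stand-in for the 400M "Fast AR" model: predicts the residual
    # codebook indices (acoustic detail) for one semantic token.
    return [(semantic_token * (k + 1)) % 256 for k in range(num_codebooks)]

def generate(text_tokens, num_steps=4):
    semantic, acoustic = [], []
    for _ in range(num_steps):
        s = slow_ar(text_tokens, semantic)   # time axis: structure, prosody
        semantic.append(s)
        acoustic.append(fast_ar(s))          # depth axis: acoustic detail
    return semantic, acoustic

sem, ac = generate([5, 17, 42])
```

The point of the split is visible in the loop shape: the expensive model runs once per time step, while the cheap model handles the per-token depth dimension.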

    This system relies on Residual Vector Quantization (RVQ). In this setup, raw audio is compressed into discrete tokens across multiple layers (codebooks). The first layer captures the primary acoustic features, while subsequent layers capture the ‘residuals’ or the remaining errors from the previous layer. This allows the model to reconstruct high-fidelity 44.1kHz audio while maintaining a manageable token count for the Transformer architecture.
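The layered-residual idea is easiest to see on scalars. The sketch below uses tiny hand-written codebooks purely for illustration; the model's codebooks are learned and vector-valued.

```python
# Toy illustration of Residual Vector Quantization (RVQ): each codebook
# quantizes whatever error the previous layer left behind. The codebooks
# here are tiny illustrative scalar grids, not learned codebooks.

def quantize(value, codebook):
    # Pick the nearest codebook entry; return its index and value.
    idx = min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))
    return idx, codebook[idx]

def rvq_encode(value, codebooks):
    indices, residual = [], value
    for cb in codebooks:
        idx, q = quantize(residual, cb)
        indices.append(idx)
        residual -= q        # next layer sees only what this layer missed
    return indices

def rvq_decode(indices, codebooks):
    # Reconstruction is the sum of the chosen entries across all layers.
    return sum(cb[i] for i, cb in zip(indices, codebooks))

# Coarse first layer, progressively finer residual layers.
codebooks = [[-1.0, 0.0, 1.0], [-0.5, 0.0, 0.5], [-0.25, 0.0, 0.25]]
idx = rvq_encode(0.8, codebooks)        # -> [2, 1, 0]
approx = rvq_decode(idx, codebooks)     # -> 0.75, within 0.05 of the input
```

Each added layer shrinks the reconstruction error, which is why a stack of small codebooks can approximate 44.1kHz audio with a short token sequence.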

    Emotional Control via In-Context Learning and Inline Tags

    Fish Audio S2-Pro achieves what the developers describe as ‘absurdly controllable emotion’ through two primary mechanisms: zero-shot in-context learning and natural language inline control.

    In-Context Learning (ICL):

    Unlike older generations of TTS that required explicit fine-tuning to mimic a specific voice, S2-Pro utilizes the Transformer’s ability to perform in-context learning. By providing a reference audio clip—ideally between 10 and 30 seconds—the model extracts the speaker’s identity and emotional state. The model treats this reference as a prefix in its context window, allowing it to continue the “sequence” in the same voice and style.
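Conceptually, the "reference as prefix" trick is just context assembly. The layout markers and token values below are invented for illustration; the real Fish Speech prompt format is not documented in this article.

```python
# Schematic of zero-shot cloning via in-context learning: the reference
# clip's transcript and audio tokens are placed in the context window as
# a prefix, so the continuation inherits voice and style. The marker
# strings and token values are illustrative, not the real prompt format.

def build_prompt(ref_text_tokens, ref_audio_tokens, target_text_tokens):
    return (
        ["<ref_text>"] + ref_text_tokens
        + ["<ref_audio>"] + ref_audio_tokens
        + ["<target_text>"] + target_text_tokens
        + ["<generate_audio>"]       # model continues "in the same voice"
    )

prompt = build_prompt(["hi"], [101, 102, 103], ["hello", "world"])
```

No weights change: the speaker identity lives entirely in the prefix, which is what makes the cloning "zero-shot".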

    Inline Control Tags:

    The model supports dynamic emotional transitions within a single generation pass. Because the model was trained on data containing descriptive linguistic markers, developers can insert natural language tags directly into the text prompt. For example:

    [whisper] I have a secret [laugh] that I cannot tell you.

    The model interprets these tags as instructions to modify the acoustic tokens in real-time, adjusting pitch, intensity, and rhythm without requiring a separate emotional embedding or external control vector.
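To make the tag format concrete, here is a small parser that splits a prompt of that shape into (style, text) segments. The model itself consumes the tags in-stream rather than via a parser; this is only a reader's illustration of the format.

```python
import re

# Illustrative parser for the inline-control format shown above: split a
# prompt into (tag, text) segments. The tag names follow the article's
# example; the real model interprets tags in-stream during generation.

def split_tags(prompt, default="neutral"):
    segments, style = [], default
    for part in re.split(r"(\[[a-z]+\])", prompt):
        if not part.strip():
            continue
        if part.startswith("[") and part.endswith("]"):
            style = part[1:-1]       # a tag updates the active style
        else:
            segments.append((style, part.strip()))
    return segments

segs = split_tags("[whisper] I have a secret [laugh] that I cannot tell you.")
# -> [("whisper", "I have a secret"), ("laugh", "that I cannot tell you.")]
```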

    Performance Benchmarks and SGLang Integration

When integrating TTS into real-time applications, the primary constraint is ‘Time to First Audio’ (TTFA). Fish Audio S2-Pro is optimized for sub-150ms latency, with benchmarks on NVIDIA H200 hardware reaching approximately 100ms.

    Several technical optimizations contribute to this performance:

    • SGLang and RadixAttention: S2-Pro is designed to work with SGLang, a high-performance serving framework. It utilizes RadixAttention, which allows for efficient Key-Value (KV) cache management. In a production environment where the same “master” voice prompt (reference clip) is used repeatedly, RadixAttention caches the prefix’s KV states. This eliminates the need to re-compute the reference audio for every request, significantly reducing the prefill time.
    • Multi-Speaker Single-Pass Generation: The architecture allows for multiple speaker identities to be present within the same context window. This permits the generation of complex dialogues or multi-character narrations in a single inference call, avoiding the latency overhead of switching models or reloading weights for different speakers.

    Technical Implementation and Data Scaling

    The Fish Speech repository provides a Python-based implementation utilizing PyTorch. The model was trained on a diverse dataset comprising over 300,000 hours of multi-lingual audio. This scale is what enables the model’s robust performance across different languages and its ability to handle ‘non-verbal’ vocalizations like sighs or hesitations.

    The training pipeline involves:

    1. VQ-GAN Training: Training the quantizer to map audio into a discrete latent space.
    2. LLM Training: Training the Dual-AR transformers to predict those latent tokens based on text and acoustic prefixes.

    The VQ-GAN used in S2-Pro is specifically tuned to minimize artifacts during the decoding process, ensuring that even at high compression ratios, the reconstructed audio remains ‘transparent’ (indistinguishable from the source to the human ear).

    Key Takeaways

    • Dual-AR Architecture (Slow/Fast): Unlike single-stage models, S2-Pro splits tasks between a 4B parameter ‘Slow AR’ model (for linguistic and prosodic structure) and a 400M parameter ‘Fast AR’ model (for acoustic refinement), optimizing both detail and speed.
    • Sub-150ms Latency: Engineered for real-time conversational AI, the model achieves a Time-to-First-Audio (TTFA) of ~100ms on high-end hardware, making it suitable for live agents and interactive applications.
    • Hierarchical RVQ Encoding: By using Residual Vector Quantization, the system compresses 44.1kHz audio into discrete tokens across multiple layers. This allows the model to reconstruct complex vocal textures—including breaths and sighs—without the computational bloat of raw waveforms.
    • Zero-Shot In-Context Learning: Developers can clone a voice and its emotional state by providing a 10–30 second reference clip. The model treats this as a prefix, adopting the speaker’s timbre and prosody without requiring additional fine-tuning.
    • RadixAttention & SGLang Integration: Optimized for production, S2-Pro leverages RadixAttention to cache KV states of voice prompts. This allows for nearly instant generation when using the same speaker repeatedly, drastically reducing prefill overhead.

Check out the Model Card and Repo.




