Inworld AI has launched Inworld TTS-1.5, an upgrade to its TTS-1 family that targets realtime voice agents with strict constraints on latency, quality, and cost. TTS-1.5 is described as the number one ranked text-to-speech system on Artificial Analysis and is designed to be more expressive and more stable than prior generations while remaining suitable for large-scale consumer deployments.
Realtime latency for interactive agents
TTS-1.5 focuses on P90 time to first audio, the latency that 90 percent of requests stay under and a critical metric for user-perceived responsiveness. For TTS-1.5 Max, P90 time to first audio is under 250 ms. For TTS-1.5 Mini, it is under 130 ms. According to Inworld, these values are about 4 times faster than the prior TTS generation.
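To make the P90 metric concrete, here is a minimal sketch of how a 90th percentile could be computed from a set of measurements. It uses the nearest-rank method, which is one common convention and not necessarily the one Inworld uses. The function name `p90` and the sample values are hypothetical and exist only for illustration.

```python
import math


def p90(samples_ms):
    """Return the 90th percentile of latency samples (nearest-rank method)."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # 1-based rank of the smallest value that covers 90% of the samples.
    rank = math.ceil(len(ordered) * 9 / 10)
    return ordered[rank - 1]


# Made-up time-to-first-audio measurements in milliseconds (not vendor data).
ttfa_ms = [110, 95, 120, 105, 240, 100, 115, 98, 125, 102]
print(p90(ttfa_ms))  # prints 125
```

Note that the single 240 ms outlier does not move the result: P90 describes what most requests experience, while ignoring the slowest tenth.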
The TTS-1.5 stack supports streaming over WebSocket, so synthesis and playback can start as soon as the first audio chunk is generated. In practice this keeps end-to-end interaction latency in the same range as typical realtime language model responses when models run on modern GPUs, which is important when TTS is part of a full agent pipeline.
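The following sketch illustrates how time to first audio could be measured on any chunked stream. It does not call the Inworld API: `fake_tts_stream` is a hypothetical stand-in that simulates a streaming connection with fixed delays, and `measure_ttfa` is an illustrative helper name. With a real WebSocket client, the same loop would consume incoming audio frames instead.

```python
import asyncio
import time


async def fake_tts_stream(chunks=3, first_delay_s=0.05, gap_s=0.01):
    """Hypothetical stand-in for a streaming TTS connection (simulated delays)."""
    await asyncio.sleep(first_delay_s)  # simulated wait before the first chunk
    for i in range(chunks):
        if i:
            await asyncio.sleep(gap_s)  # simulated gap between later chunks
        yield b"\x00" * 320  # placeholder bytes, not real audio


async def measure_ttfa(stream):
    """Return (time_to_first_audio_seconds, total_bytes) for an async chunk stream."""
    start = time.perf_counter()
    ttfa = None
    total_bytes = 0
    async for chunk in stream:
        if ttfa is None:
            # First chunk arrived: this is the point where playback could begin.
            ttfa = time.perf_counter() - start
        total_bytes += len(chunk)
    return ttfa, total_bytes


ttfa, total_bytes = asyncio.run(measure_ttfa(fake_tts_stream()))
print(f"time to first audio: {ttfa * 1000:.0f} ms, received {total_bytes} bytes")
```

The point of the design is that the timer stops at the first chunk, not at the end of the stream, which is why streaming delivery matters more for perceived responsiveness than total synthesis time.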
Inworld recommends TTS-1.5 Max for most applications because it balances latency near 200 ms with higher stability and quality. TTS-1.5 Mini is positioned for latency-sensitive workloads such as real-time gaming or highly responsive voice agents where every millisecond matters.
Expression, stability and benchmark position
TTS-1.5 builds on TTS-1 and delivers about 30% more expressive range and about 40% better stability than the earlier models.
Here, expression refers to features such as prosody, emphasis, and emotional variation. Stability is measured by metrics such as word error rate and output consistency across long sequences and varied prompts. A lower word error rate means fewer issues like truncated sentences, unintended word substitutions, or artifacts, which is important when TTS output is driven directly from generated language model text.
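For readers unfamiliar with word error rate, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. For TTS, the hypothesis is commonly a transcript of the synthesized audio compared against the input text. The article does not say how Inworld computes its figure, so the function and the example strings below are illustrative assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substituted word ("box") and one truncated word ("jumps").
print(word_error_rate("the quick brown fox jumps", "the quick brown box"))  # prints 0.4
```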
Pricing and cost profile at consumer scale
TTS-1.5 is priced in two main configurations. Inworld TTS-1.5 Mini costs $5 per 1 million characters, which is about $0.005 per minute of speech. TTS-1.5 Max costs $10 per 1 million characters, which is about $0.01 per minute.
This cost profile makes it feasible to run TTS continuously in high-usage products such as voice-native companions, education platforms, or customer support lines without TTS becoming the dominant variable cost.
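The per-minute figures above follow from the per-character prices if one minute of speech corresponds to roughly 1,000 characters. That rate is an assumption implied by the quoted numbers, not a value stated by Inworld, and real speech rates vary by language and voice. A short sketch of the arithmetic, with hypothetical names:

```python
# Prices quoted in the article, in US dollars per 1 million characters.
PRICE_PER_MILLION_CHARS = {"TTS-1.5 Mini": 5.0, "TTS-1.5 Max": 10.0}

# Assumption: about 1,000 characters per minute of speech. This is the rate
# implied by the article's per-minute figures, not an official number.
CHARS_PER_MINUTE = 1000


def cost_per_minute(price_per_million, chars_per_minute=CHARS_PER_MINUTE):
    """Convert a per-million-character price into a per-minute cost in dollars."""
    return price_per_million * chars_per_minute / 1_000_000


for model, price in PRICE_PER_MILLION_CHARS.items():
    print(f"{model}: ${cost_per_minute(price):.3f} per minute")
```

Changing `chars_per_minute` shows how sensitive the per-minute cost is to speaking rate: at 800 characters per minute, for example, each minute costs 20% less than at 1,000.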
Multilingual support, voice cloning and deployment options
Inworld TTS-1.5 supports 15 languages: English, Spanish, French, Korean, Dutch, Chinese, German, Italian, Japanese, Polish, Portuguese, Russian, Hindi, Arabic, and Hebrew. This allows a single TTS pipeline to cover a wide set of markets without separate models per region.
The system provides instant voice cloning and professional voice cloning. Instant voice cloning can create a custom voice from about 15 seconds of audio and is exposed directly in the Inworld portal and through the API. Professional voice cloning uses at least 30 minutes of clean audio, with 20 minutes or more recommended for best results, and targets branded voices and less common accents.
For deployment, TTS-1.5 is available as a cloud API and also as an on-prem solution, where the full model runs inside the customer's infrastructure for data sovereignty and compliance. The same quality profile is maintained across both deployment modes, and the models integrate with partner platforms such as LiveKit, Pipecat, and Vapi for end-to-end voice agent stacks.
Key Takeaways
- Inworld TTS-1.5 delivers realtime performance, with P90 time to first audio under 250 ms for the Max model and under 130 ms for the Mini model, about 4 times faster than the prior generation.
- The model increases expressiveness by about 30% and improves stability with about 40% lower word error rate.
- Pricing is optimized for consumer scale: TTS-1.5 Mini costs about $5 per 1 million characters and TTS-1.5 Max costs about $10 per 1 million characters, which is significantly cheaper per minute than many competing systems.
- TTS-1.5 supports 15 languages and offers instant and professional voice cloning, enabling custom and branded voices from short reference audio or longer recorded datasets.
- The system is available as a cloud API and as an on-prem deployment, and integrates with existing voice agent stacks, which makes it suitable for production realtime agents that require explicit guarantees on latency, quality, and data control.

