Unlike text, which is comparatively uniform, spoken language is richly layered, with cultural nuances, colloquialisms and emotion. Startups building voice-first AI models are now doubling down on one thing above all else: the depth and diversity of datasets.
Why voice is emerging as the frontline interface
In India, where oral tradition plays a pivotal role in communication, voice isn’t just a convenience; it’s a necessity. “We’re not an English-first or even a text-first country. Even when we type in Hindi, we often use the English script instead of Devanagari. That’s exactly why we need to build voice-first models, because oral tradition plays such a vital role in our culture,” said Abhishek Upperwal, chief executive officer (CEO) of Soket AI Labs.
Voice is also proving critical for customer service and accessibility. “Voice plays a crucial role in bridging accessibility gaps, particularly for users with disabilities,” said Mahesh Makhija, leader, technology consulting, at EY.
“Many consumers even prefer voicing complaints over typing, simply because talking feels more direct and human. Moreover, voice is far more frictionless than navigating mobile apps or interfaces, especially for users who are digitally illiterate, older, or not fluent in English,” said Makhija, adding that “speaking in vernacular languages opens access to the next half a billion consumers, which is a major focus for enterprises.”
Startups like Gnani.ai are already deploying voice systems across banking and financial services to streamline customer support, assist with loan applications, and eliminate virtual queues. “The best way to reach people, regardless of literacy levels or demographics, is through voice in the local language, so it is important to capture the tonality of the conversations,” said Ganesh Gopalan, CEO of Gnani.ai.
The hunt for rich, real-world data
As of mid-2025, India’s AI landscape shows a clear tilt toward text-based AI, with over 90 Indian companies active in the space, compared to 57 in voice-based AI. Text-based platforms tend to focus on document processing, chat interfaces, and analytics. In contrast, voice-based companies are more concentrated in customer service, telephony, and regional language access, according to data from Tracxn.
In terms of funding, voice-first AI startups have attracted larger funding rounds at later stages, while text AI startups show broader distribution, especially at earlier stages.
For example, Skit.ai, a voice-first AI firm, raised a total of $47.6 million across five funding rounds. Similarly, Yellow.ai has cumulatively secured around $102 million, including a major $78.15 million Series C round in 2021, making it one of the top-funded startups in voice AI, data from Tracxn shows.
However, data remains the foundational challenge for voice models. Voice AI systems need large, diverse datasets that cover not only different languages, but also regional accents, slang and emotional tonality.
Chaitanya C., co-founder and chief technology officer of Ozonetel Communications, put it simply: “The datasets matter the most. Speaking as an AI engineer, I can say it is not about anything else; it is all about the data.”
The IndiaAI Mission has allocated ₹199.55 crore for datasets, nearly 2% of the mission’s total ₹10,300 crore budget, while 44% has gone to compute. “Investments solely in compute are inherently transient; their value fades once consumed. However, investments in datasets build durable, reusable assets that continue to deliver value over time,” said Chaitanya.
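As a quick check of the figures above, the dataset share can be computed directly. This is a minimal sketch: the variable and function names are illustrative, and the compute amount is only an estimate implied by the quoted 44% share, not a figure reported in the article.

```python
# Budget figures as quoted in the article (in ₹ crore).
TOTAL_BUDGET_CRORE = 10_300.0
DATASET_ALLOCATION_CRORE = 199.55
COMPUTE_SHARE_PERCENT = 44.0  # quoted share; the rupee amount is not given


def share_percent(part: float, total: float) -> float:
    """Return `part` as a percentage of `total`."""
    return part / total * 100.0


dataset_share = share_percent(DATASET_ALLOCATION_CRORE, TOTAL_BUDGET_CRORE)

# Assumption: compute amount estimated from the quoted 44% share.
compute_estimate_crore = TOTAL_BUDGET_CRORE * COMPUTE_SHARE_PERCENT / 100.0

print(f"Dataset share of budget: {dataset_share:.2f}%")
print(f"Implied compute allocation: about ₹{compute_estimate_crore:,.0f} crore")
```

On these numbers the dataset share works out to about 1.94% of the budget, and the 44% compute share implies roughly ₹4,532 crore, more than twenty times the dataset allocation.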
He also emphasised the lack of rich, culturally relevant data in regional languages like Telugu and Kannada. “The amount of data easily available in English, in comparison with Telugu and Kannada or Hindi, it’s not even comparable,” he said. “Somewhere it is just not good, it wouldn’t be as good as an English story, which is why I wouldn’t want it to tell a Telugu story for my kid.”
“Some movie comes out, nobody’s going to write it in government documents, but people are going to talk about it, and that’s lost,” he added, pointing out that government datasets often lack cultural nuance and everyday language.
Gopalan of Gnani.ai agreed. “The colloquial language is often very different from the written form. Language experts have a great career path ahead of them because they not only understand the language technically, but also know how to converse naturally and grasp colloquial nuances.”
Startups are now using creative methods to fill these gaps. “First, we collect data directly from the field using multiple methods, and we’re careful with how we handle that data. Second, we use synthetic data in some cases. Third, we augment that synthetic data further. In addition, we also leverage a substantial amount of open-source data available from universities and other sources,” Gopalan said.
Synthetic data is artificially generated data that mimics real-world data for use in training, testing, or validating models.
Upperwal added that Soket AI uses a similar approach: “We start by training smaller AI models with the limited real voice data we have. Once these smaller models are fairly accurate, we use them to generate synthetic voice data, essentially creating new, artificial examples of speech.”
However, some companies consciously steer clear of synthetic data.
Ankush Sabarwal, CEO and founder of CoRover AI, said the company relies entirely on real data, deliberately avoiding synthetic data. “If I’m a consumer and I’m interacting with an AI bot, the AI bot will become intelligent by the virtue of it interacting with a human like me.”
The ethical labyrinth of voice AI
As companies begin to scale their data pipelines, the new Digital Personal Data Protection (DPDP) Act will shape how they collect and use voice data.
“The DPDP law emphasizes three key areas. First, it mandates clear, specific, and informed consent before collecting data. Second, it enforces purpose limitation; data can only be used for legitimate, stated purposes like KYC or employment, not unrelated model training. Third, it requires data localization, meaning critical personal data must reside on servers in India,” said Makhija.
He added, “Companies have begun including consent notices at the start of customer calls, often mentioning AI training. However, the exact process of how this data flows into model training pipelines is still evolving and will become clearer as DPDP rules are fully implemented.”
Outsourcing voice data collection raises red flags, too. “For a deep-tech company like ours, voice data is one of the strongest forms of IP (intellectual property) we have, and outsourcing it could compromise its integrity and ownership. What if someone is using copyrighted material?” said Gopalan.