For most of the past two years, enterprises have spent enormous resources on a single question: how do you make AI voice models good enough? Speech recognition that stumbled, mispronunciations, error rates too high for production: the problems were obvious, and so, it seemed, was the solution. Fix the model, the thinking went. But what if they were solving the wrong problem?
A customer opens an app and speaks, asking about a recent order or trying to resolve a problem. There are no menus to navigate or buttons to press. The interaction feels natural, almost conversational. When it works, it works because the models are now genuinely good. When it fails, the models are usually not the problem.
In practice, delivering that experience relies on a complex stack of models, pipelines and compute infrastructure working in real time. Voice interfaces are improving not because the challenges have disappeared, but because companies are getting better at identifying where systems break down in real-world use and refining how those systems work together.
Moving into production
Enterprises are now pushing agentic AI beyond pilots and proof-of-concept deployments. Customer service, internal support and operational workflows are beginning to show measurable returns, prompting organizations to extend these systems further. Voice is a natural next step, reducing friction in real-time interaction and enabling more flexible use across contexts.
But the shift into production has exposed how fragile the stack can be.
Speech is converted into text, interpreted, routed through enterprise systems and checked against policies. A response is generated, validated and converted back into speech. Each step may be fast in isolation. String them together, and delays compound quickly.
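A rough way to see how those delays compound is to add up a per-stage budget. The sketch below is illustrative only: the stage names follow the steps described above, but every millisecond value is a made-up assumption, not a measurement from any vendor or deployment in this article, and it assumes the stages run strictly one after another.

```python
# Hypothetical per-stage latencies in milliseconds. Illustrative values only;
# none of these numbers come from the companies quoted in the article.
STAGE_LATENCY_MS = {
    "speech_to_text": 80,
    "interpretation": 120,
    "enterprise_routing": 150,
    "policy_check": 40,
    "response_generation": 200,
    "validation": 30,
    "text_to_speech": 90,
}


def total_latency_ms(stages):
    """Sum the latencies of stages that run sequentially."""
    return sum(stages.values())


def slowest_stage(stages):
    """Return the name of the stage contributing the most delay."""
    return max(stages, key=stages.get)


if __name__ == "__main__":
    print("end-to-end:", total_latency_ms(STAGE_LATENCY_MS), "ms")
    print("slowest stage:", slowest_stage(STAGE_LATENCY_MS))
```

With these assumed numbers, no single stage exceeds 200 milliseconds, yet the end-to-end delay comes to 710 milliseconds.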
Where it breaks
And it doesn't break where most enterprises expect. The real failures live in the seams between systems, in the orchestration layers. Latency is the most visible symptom. Even small delays disrupt conversational flow, but the source of delay is often deeper in the system.
Arto Yeritsyan, founder and CEO of AI-powered video, audio and voice generation company Async, pointed to one issue inside the text-to-speech layer. Large language models often output raw text containing variables such as currency amounts or IDs. Traditional systems either mispronounce these or require preprocessing, introducing delay.
“That delay kills conversational flow,” he said. “Traditional approaches can add 200 to 500 milliseconds.”
By handling normalization within the streaming pipeline itself, Async now delivers time to first audio in under 100 milliseconds.
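The general idea of normalizing inside a stream, rather than waiting for the full response, can be sketched as follows. This is not Async's implementation, which the article does not describe; the function names, the dollar-only pattern and the word-boundary buffering are all assumptions chosen to keep the example small. A production normalizer would also spell out the digits and handle IDs, dates and other currencies.

```python
import re

# Assumed pattern: US dollar amounts such as "$12" or "$12.50".
_CURRENCY = re.compile(r"\$(\d+)(?:\.(\d{2}))?")


def _unit(n, singular):
    """Format a count with a singular or plural unit name."""
    return f"{n} {singular}" if n == 1 else f"{n} {singular}s"


def _speak_currency(match):
    """Turn a matched amount into a speakable phrase."""
    dollars = int(match.group(1))
    cents = int(match.group(2)) if match.group(2) else 0
    spoken = _unit(dollars, "dollar")
    if cents:
        spoken += " and " + _unit(cents, "cent")
    return spoken


def normalize_stream(chunks):
    """Yield normalized text as chunks arrive.

    Only the trailing partial word is held back, so an amount split across
    two chunks (for example "$12" followed by ".50") is still normalized
    as a single token.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Text up to the last whitespace is complete; keep the remainder.
        cut = max(buffer.rfind(" "), buffer.rfind("\n"))
        if cut == -1:
            continue
        ready, buffer = buffer[:cut + 1], buffer[cut + 1:]
        yield _CURRENCY.sub(_speak_currency, ready)
    if buffer:
        yield _CURRENCY.sub(_speak_currency, buffer)


if __name__ == "__main__":
    pieces = ["Your refund of $12", ".50 is on ", "its way."]
    print("".join(normalize_stream(pieces)))
```

Because only the last unfinished word waits in the buffer, the first words can be passed to speech synthesis before the rest of the response has been generated.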

“In production, it’s primarily an infrastructure and integration problem,” said Sayali Patil, who has worked on real-time conversational AI systems. “Your model can be excellent and your deployment still fails because the surrounding systems can’t keep up.”
Patil said systems that appear robust in staging can fail under real call-center load, where even 400-millisecond delays are enough to break conversational flow. The challenge is not just speech recognition or response generation, but coordinating everything around them, from CRM lookups to downstream queries.
Orchestration and control
Production deployments are built as chains of specialized components. On the input side, providers such as Deepgram handle speech-to-text in real-world conditions, including background noise and accents. On the output side, other providers, including ElevenLabs, focus on generating natural, expressive speech across languages and personas.
Nick Holda, vice president of AI Technology Partnerships at IBM, said the underlying voice components at the front and back of the system can be very fast, but problems tend to emerge in the middle of the pipeline.
“Usually, where this breaks down in the enterprise context is when it comes to bridging to those systems,” he said. “It’s possible to connect those pipes and have it work, but very slowly.”
Enterprise voice deployments also introduce constraints that do not exist in consumer applications. Errors are not just frustrating. A business system that exposes the wrong customer record, mishandles health information or ignores a permissions boundary creates a very different level of risk.
To operate safely at scale, those risks have to be managed within the system itself. Governance is essential for safe production use, but it also adds operational complexity. Poorly integrated controls can create new points of delay and failure; well-integrated ones make scale possible.
Where voice AI actually delivers
This is also why voice is not replacing other interfaces so much as joining them. Customers want to interact across multiple channels.
The companies getting traction with enterprise voice AI right now are the ones that have started focusing on the workflows surrounding the AI voice model.
Dippu Kumar Singh, emerging technologies lead at Fujitsu North America, said the company has used generative AI in contact centers to automate after-call work, including summarizing customer interactions and extracting action items from transcripts. In production environments, he said, post-call processing can take as long as the call itself, averaging 6.3 minutes per interaction. Fujitsu reduced that to less than 3.1 minutes by automating extraction directly from the transcript.
For Singh, the misconception is that enterprise voice AI is mainly about routing customers to the right FAQ or replacing human agents outright. In practice, much of the value lies in reducing administrative burden, grounding outputs tightly in conversation context and making existing workflows more efficient.
Yeritsyan's 100-millisecond optimization, Patil's staging-versus-production gap and Singh's six minutes of post-call paperwork are different problems. The diagnosis is the same: voice AI fails not where it speaks, but where it hands off.




