A customer opens an app and speaks, asking about a recent order or trying to resolve a problem. There are no menus to navigate or buttons to press. The interaction feels natural, almost conversational.
In practice, delivering that experience relies on a complex stack of models, pipelines and compute infrastructure working in real time. Voice interfaces are improving not because the challenges have disappeared, but because companies are getting better at identifying where systems break down in real-world use and refining how those systems work together.
In the process, large technology providers and specialized startups are being pulled together around the same problem: how to make voice reliable enough for enterprise use.
Moving into production
Enterprises are now pushing agentic AI beyond pilots and proof-of-concept deployments. Customer service, internal support and operational workflows are beginning to show measurable returns, prompting organizations to extend these systems further. Voice is a natural next step, reducing friction in real-time interaction and enabling more flexible use across contexts.
For IBM, the challenge is not just improving voice quality but making voice interactions work reliably inside complex enterprise systems.
“Enterprises are seeing initial ROI, and as they do that, they’re looking to extend it and make the experience even lower friction,” said Nick Holda, vice president of AI Technology Partnerships at IBM.
A system, not a model
That shift is driving a more modular approach. Rather than relying on a single system, production deployments are built as chains of specialized components.
Speech is converted into text, interpreted, routed through enterprise systems and checked against policies. A response is generated, validated and converted back into speech. Each step may be fast in isolation. String them together, and delays compound quickly.
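The arithmetic behind that compounding is simple to sketch. The stage names and latency figures below are invented for illustration, not measurements of any real system; the point is only that sequential stages sum, so the user waits for the total, not the slowest step.

```python
# Hypothetical per-stage latencies (seconds) for a voice pipeline;
# the numbers are illustrative only.
STAGE_LATENCY = {
    "speech_to_text": 0.15,
    "interpretation": 0.10,
    "enterprise_routing": 0.40,   # CRM lookups, policy checks
    "response_generation": 0.30,
    "validation": 0.05,
    "text_to_speech": 0.12,
}

def end_to_end_latency(stages: dict[str, float]) -> float:
    """Stages run sequentially, so the caller experiences the sum."""
    return sum(stages.values())

total = end_to_end_latency(STAGE_LATENCY)
# Each stage is fast in isolation, yet the chained total exceeds a second.
```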
IBM’s watsonx Orchestrate platform sits at the center of that process, coordinating requests across models, enterprise applications and policy engines, while integrating specialist providers at the edges of the pipeline.

On the input side, partners such as Deepgram handle speech-to-text in real-world conditions, including background noise and accents. On the output side, other providers, including ElevenLabs, focus on generating natural, expressive speech across languages and personas.
Where it breaks
Holda said the underlying voice components at the front and back of the system can be very fast, but problems tend to emerge in the middle of the pipeline.
“Usually, where this breaks down in the enterprise context is when it comes to bridging to those systems,” he said. “It’s possible to connect those pipes and have it work, but very slowly.”
Latency is the most visible symptom. Even small delays disrupt conversational flow, but the source of delay is often deeper in the system.
Arto Yeritsyan, founder and CEO of AI-powered video, audio and voice generation company Async, points to one issue inside the text-to-speech layer. Large language models often output raw text containing variables such as currency amounts or IDs. Traditional systems either mispronounce these or require preprocessing, introducing delay.
“That delay kills conversational flow,” he said. “Traditional approaches can add 200 to 500 milliseconds.”
Async addresses this by handling normalization within the streaming pipeline itself, reducing time to first audio to under 100 milliseconds. It is a small optimization, but in real-time systems, delays at one point in the stack compound with delays elsewhere.
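Async's actual implementation is not public, but the idea can be sketched: normalize tokens as they stream through, so synthesis can begin on the first token instead of waiting for a whole-utterance preprocessing pass. The currency rule below is a hypothetical stand-in for the many cases (IDs, dates, units) a real normalizer handles.

```python
import re

def speak_number(token: str) -> str:
    """Hypothetical normalizer: expand a currency amount into speakable text."""
    m = re.fullmatch(r"\$(\d+)(?:\.(\d{2}))?", token)
    if not m:
        return token
    dollars, cents = m.group(1), m.group(2)
    out = f"{dollars} dollars"
    if cents and cents != "00":
        out += f" and {cents} cents"
    return out

def stream_normalized(tokens):
    """Normalize token-by-token inside the stream, rather than in a
    separate pass over the full LLM output."""
    for token in tokens:
        yield speak_number(token)

# LLM output arrives as a token stream:
list(stream_normalized(["Your", "order", "total", "is", "$42.50"]))
```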
“In production, it’s primarily an infrastructure and integration problem,” said Sayali Patil, who has worked on real-time conversational AI systems. “Your model can be excellent and your deployment still fails because the surrounding systems can’t keep up.”
Patil said systems that appear robust in staging can fail under real call-center load, where even 400-millisecond delays are enough to break conversational flow. The challenge is not just speech recognition or response generation, but coordinating everything around them, from CRM lookups to downstream queries.
That is where orchestration layers become critical.
Orchestration and control
In IBM’s architecture, watsonx Orchestrate acts as the coordinating layer between voice interfaces and enterprise systems. It interprets requests, determines which systems need to be queried, retrieves data, checks permissions and applies policy controls before a response is returned.
A single request may require an AI system to access multiple internal applications, verify whether the user is authorized to receive the information requested, screen for sensitive or regulated data, and then generate a response in language that is accurate, compliant and natural enough to speak aloud. All of that must happen fast enough to sustain the illusion of a seamless conversation.
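The control flow just described can be sketched in miniature. This is not watsonx Orchestrate's API; the data stores and function names below are invented, and a real orchestration layer would call live enterprise systems rather than in-memory dictionaries.

```python
from dataclasses import dataclass

@dataclass
class Request:
    user_id: str
    intent: str      # e.g. "order_status"
    order_id: str

# Hypothetical stand-ins for a permissions service and a system of record.
PERMISSIONS = {"alice": {"order_status"}}
ORDERS = {"A-100": {"owner": "alice", "status": "shipped"}}

def orchestrate(req: Request) -> str:
    # 1. Authorization check before any data leaves the system.
    if req.intent not in PERMISSIONS.get(req.user_id, set()):
        return "I'm not able to share that information."
    # 2. Retrieve from the system of record, scoped to the caller.
    order = ORDERS.get(req.order_id)
    if order is None or order["owner"] != req.user_id:
        return "I couldn't find that order on your account."
    # 3. Return a response phrased naturally enough to speak aloud.
    return f"Your order {req.order_id} is currently {order['status']}."
```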
“If the response takes more than 10 seconds, that experience is burned,” Holda said. “We pay a lot of attention to what we call end-to-end trip time.”
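A trip-time ceiling like the one Holda describes can be enforced as a simple budget check around the pipeline call. The wrapper below is an illustrative sketch, not IBM's tooling; only the 10-second figure comes from the quote above.

```python
import time

LATENCY_BUDGET_S = 10.0  # the "more than 10 seconds" ceiling described above

def with_trip_time(pipeline, *args):
    """Run a pipeline call and report elapsed time plus a budget verdict."""
    start = time.perf_counter()
    result = pipeline(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed <= LATENCY_BUDGET_S

result, elapsed, within_budget = with_trip_time(
    lambda q: f"answer to {q}", "order status"
)
```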
Enterprise voice deployments also introduce constraints that do not exist in consumer applications. Errors are not just frustrating. A business system that exposes the wrong customer record, mishandles health information or ignores a permissions boundary creates a very different level of risk.
To operate safely at scale, those risks have to be managed within the system itself.
Holda compares governance to brakes on a car: counterintuitively, it is the brakes that let a driver go fast with confidence. Those controls are necessary to make voice usable in production environments, but they also add complexity. If poorly integrated, they become another source of latency and failure. If well integrated, they create the conditions for scale.
Voice as part of the system
This is also why voice is not replacing other interfaces so much as joining them.
“It’s an ‘and’, not an ‘or’,” Holda said. “Customers want to interact in a multichannel fashion.”
Users move between voice, chat and text depending on context, often beginning an interaction in one channel and continuing it in another. Underneath, the same orchestration layer must preserve context, apply the same rules and deliver a consistent result. Voice is one part of a broader multimodal system, not a standalone feature.
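Preserving context across channels amounts to keeping one session record regardless of how a turn arrives. The sketch below is a minimal, invented illustration; production systems would use a durable session store and apply the same policy checks on every channel.

```python
# One session record shared by voice and chat, so a conversation started
# by voice can continue in text without losing history.
SESSIONS: dict[str, dict] = {}

def handle_turn(session_id: str, channel: str, utterance: str) -> dict:
    """Append a turn to the channel-agnostic session history."""
    session = SESSIONS.setdefault(session_id, {"history": []})
    session["history"].append({"channel": channel, "text": utterance})
    return session

handle_turn("s1", "voice", "Where is my order?")
handle_turn("s1", "chat", "Actually, cancel it.")
# SESSIONS["s1"]["history"] now spans both channels.
```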
That wider system view is also changing how enterprises think about business value. The goal is not always to automate the full interaction, but to reduce friction inside it.
Dippu Kumar Singh, emerging technologies lead at Fujitsu North America, said the company has used generative AI in contact centers to automate after-call work, including summarizing customer interactions and extracting action items from transcripts. In production environments, he said, post-call processing can take as long as the call itself, averaging 6.3 minutes per interaction. Fujitsu reduced that to less than 3.1 minutes by automating extraction directly from the transcript.
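Extracting action items from a transcript can be illustrated with a deliberately naive rule: flag agent sentences containing commitment phrases. The transcript and rules below are invented; production systems like the one Singh describes ground a generative model on the transcript rather than using keyword matching.

```python
import re

TRANSCRIPT = """\
Agent: I've refunded the duplicate charge to your card.
Customer: Great. Can you also email me the receipt?
Agent: Sure, I'll send the receipt today and follow up on the shipping delay.
"""

def extract_action_items(transcript: str) -> list[str]:
    """Naive sketch: collect agent sentences that promise a follow-up."""
    actions = []
    for line in transcript.splitlines():
        if not line.startswith("Agent:"):
            continue
        text = line.removeprefix("Agent:").strip()
        for sentence in re.split(r"(?<=[.?!])\s+", text):
            if re.search(r"\bI(?:'ll| will)\b", sentence):
                actions.append(sentence)
    return actions
```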
For Singh, the misconception is that enterprise voice AI is mainly about routing customers to the right FAQ or replacing human agents outright. In practice, much of the value lies in reducing administrative burden, grounding outputs tightly in conversation context and making existing workflows more efficient.
That same principle applies across voice systems more broadly. The hard part is not simply making AI speak. It is making the surrounding systems fast, accurate and governable enough that speaking becomes useful.
As enterprise agentic AI moves into production, voice exposes the entire stack. Users expect immediacy, clarity and continuity. Any weakness in integration, retrieval, policy enforcement or response generation surfaces immediately in the conversation.
Making AI sound human is now only one layer of the problem. Making it work across real systems, in real time, is the larger task.