Inside a sleek office, a brunette woman in a burgundy turtleneck speaks to the camera. “Hey there! It’s me, Riley,” she begins. “I know — I sound different, brighter, bolder, more alive. That's because I’ve just been reborn, powered by the next-generation audio and video rendering engine.”
Riley is an AI avatar developed by Colossyan, a London-based startup. She’s among a growing flock of digital humans built for customer service, training and marketing.
Colossyan’s founder and CEO, Dominik Mate Kovacs, began working on the technology in 2018, when the divide between synthetic and real humans appeared insurmountable. “It wasn’t a gap; it was an abyss,” Kovacs told The Infinite Loop.

Colossyan: Real people making AI avatars. Credit: Colossyan
That abyss now has a bridge: reverse-engineering the human presence, organ by organ.
The skeletal foundation
In the early days of AI video, a dependence on motion capture left avatars looking stiff or unnatural. Synthesia, a platform used by over 80% of the Fortune 100, avoids these pitfalls by focusing on structural integrity.
The company’s latest engine, Express-2, moves beyond its predecessors by calculating how a human frame actually shifts. Instead of just animating a face, the system predicts the avatar’s skeleton and movement. “This includes the face and mouth areas, which we’re able to predict extremely accurately with this approach,” said Sundar Solai, senior product manager at Synthesia. “And then we layer the avatar’s appearance on top of this skeleton.”
To match body movements to sound, a frontier foundation model generates “candidate” motions for the audio input. A CLIP-like model then evaluates the candidates and selects the best match. Finally, a Diffusion Transformer (DiT) renders the selected motion as the finished avatar. The result: an avatar whose gestures don’t just accompany speech — they anticipate it.
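Synthesia hasn’t published the internals of this pipeline, but the pattern it describes (propose several candidate motions, score them against the audio with a cross-modal model, then render the winner) is easy to sketch. The Python below is purely illustrative; every class and method name is a hypothetical placeholder, not Synthesia’s code.

```python
# Illustrative sketch of a candidate-and-rank motion pipeline. All classes and
# methods are hypothetical placeholders standing in for large learned models.
import numpy as np

class MotionGenerator:
    """Stand-in for a foundation model that proposes skeleton motions for an audio clip."""
    def propose(self, audio: np.ndarray, n_candidates: int = 8) -> list[np.ndarray]:
        # Each candidate: (frames, joints, 3) array of skeleton joint positions.
        return [np.random.randn(120, 52, 3) for _ in range(n_candidates)]

class CrossModalScorer:
    """Stand-in for a CLIP-like model that embeds audio and motion in a shared space."""
    def embed_audio(self, audio: np.ndarray) -> np.ndarray:
        return np.random.randn(512)
    def embed_motion(self, motion: np.ndarray) -> np.ndarray:
        return np.random.randn(512)
    def score(self, audio: np.ndarray, motion: np.ndarray) -> float:
        a, m = self.embed_audio(audio), self.embed_motion(motion)
        return float(a @ m / (np.linalg.norm(a) * np.linalg.norm(m)))  # cosine similarity

class DiffusionRenderer:
    """Stand-in for a Diffusion Transformer that turns a skeleton track into video frames."""
    def render(self, motion: np.ndarray, appearance: dict) -> np.ndarray:
        return np.zeros((motion.shape[0], 512, 512, 3), dtype=np.uint8)

def animate(audio: np.ndarray, appearance: dict) -> np.ndarray:
    generator, scorer, renderer = MotionGenerator(), CrossModalScorer(), DiffusionRenderer()
    candidates = generator.propose(audio)                          # 1. propose candidate motions
    best = max(candidates, key=lambda m: scorer.score(audio, m))   # 2. keep the best audio-motion match
    return renderer.render(best, appearance)                       # 3. layer appearance on the skeleton
```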
Sundar Solai, senior product manager at Synthesia. Credit: Synthesia
But a skeleton is merely a frame. A sterner test of avatars lies in the face.
The trusted face
If the skeleton provides the movement, the face provides the connection. A single misplaced micro-expression can shatter not only the illusion, but also the user’s trust.
D-ID, an Israeli pioneer in the space, discovered this while developing a de-identification product. Its software made people unrecognizable to facial recognition algorithms while remaining unchanged to the human eye. “As the underlying tech matured, we realized the same core capabilities of privacy protection could bring faces to life for communication, learning, support, and more,” said Tal Ron-Pereg, D-ID’s director of product.
The pivot wasn’t easy. The human face is driven by 43 distinct muscles capable of producing over 10,000 unique configurations — a combinatorial problem that defeated early attempts at digital realism. D-ID relies on deep learning to bridge this gap by prioritizing the flow between frames over rigid 3D models. The system first learns how facial muscles behave when expressing emotion, then converts text into synchronized gestures that look human.
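D-ID hasn’t disclosed its architecture, but one way to read “prioritizing the flow between frames” is a model that predicts per-frame facial muscle intensities and penalizes abrupt jumps between consecutive frames. The sketch below is only a guess at that general shape, not D-ID’s code; the network, dimensions, and loss weights are all invented.

```python
# Hypothetical sketch: predict per-frame facial action-unit intensities from audio
# features, and penalize frame-to-frame jitter so motion flows instead of snapping.
import torch
import torch.nn as nn

class FacialMotionNet(nn.Module):
    def __init__(self, audio_dim=80, hidden=256, n_action_units=43):
        super().__init__()
        # A GRU carries temporal context, so each frame's expression depends on its neighbors.
        self.rnn = nn.GRU(audio_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_action_units)   # one intensity per facial muscle group

    def forward(self, audio_feats):               # (batch, frames, audio_dim)
        h, _ = self.rnn(audio_feats)
        return torch.sigmoid(self.head(h))        # (batch, frames, 43) intensities in [0, 1]

def loss_fn(pred, target, smooth_weight=0.1):
    recon = torch.mean((pred - target) ** 2)                  # match the reference expressions
    flow = torch.mean((pred[:, 1:] - pred[:, :-1]) ** 2)      # discourage jumps between frames
    return recon + smooth_weight * flow
```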
The resulting lip movements occur within 30 milliseconds of the audio — tighter than broadcast-grade standards.
The avatars are also designed to handle real-time dialogue. By combining LLMs with grounding mechanisms (tools that keep the AI anchored to facts) and memory, they can maintain context across conversational turns, detect when users speak, and manage interruptions with minimal latency. “The visual interactions are synchronized with the speech, mirroring real human dynamics,” Ron-Pereg said.
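As a rough illustration of how those pieces (speech detection, grounding, memory, and interruption handling) can fit into a single turn of dialogue, here is a hypothetical conversation loop. The interfaces it calls are placeholders invented for this sketch, not D-ID’s API.

```python
# Hypothetical sketch of one turn in a real-time avatar conversation: transcribe the
# user, ground the reply in retrieved facts, keep a rolling memory, and stop speaking
# if the user barges in. All objects passed in are placeholder interfaces.
from dataclasses import dataclass, field

@dataclass
class ConversationState:
    memory: list[dict] = field(default_factory=list)    # prior turns kept for context

def handle_turn(state, user_audio, asr, retriever, llm, tts, vad):
    user_text = asr.transcribe(user_audio)
    facts = retriever.lookup(user_text)                  # grounding: anchor the reply to known facts
    reply = llm.generate(history=state.memory, facts=facts, user=user_text)
    state.memory.append({"user": user_text, "assistant": reply})

    audio_out = []
    for chunk in tts.stream(reply):                      # stream speech so playback starts early
        if vad.user_is_speaking():                       # barge-in: the user interrupted
            break                                        # yield the floor immediately
        audio_out.append(chunk)
    return audio_out
```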
The vocal soul
A convincing face still isn’t enough. The moment an avatar opens its mouth, a second illusion has to hold — one that took years longer to perfect.
For years, flat text-to-speech engines failed to smoothly integrate sounds with visuals, leaving a “sync gap” that human ears could instantly detect. In response, developers have turned to AI models that treat voice and video as a single, unified performance.
Synthesia’s EXPRESS-Voice engine is a leading example of this shift. The system preserves the speaker’s identity through a two-stage Transformer architecture. An 800-million-parameter model first generates the coarse phonetic structure of the speech, which a second model then refines by adding fine acoustic detail.
Both models operate directly on text tokens and are conditioned on reference audio rather than an explicit speaker embedding, so the system doesn’t need a stored voice sample for every speaker. To keep the models stable, Synthesia trained them on a curated dataset of high-quality human recordings.
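The engine itself isn’t public, but the coarse-then-refine pattern Solai describes resembles published token-based speech models. A schematic sketch of that pattern, with invented names, sizes, and shapes, might look like this; prepending reference-audio tokens as a prompt is one way a model can pick up a voice without a dedicated speaker-embedding vector.

```python
# Schematic sketch of a two-stage, coarse-to-fine speech pipeline conditioned on
# reference audio rather than a fixed speaker embedding. Sizes and names are illustrative.
import torch
import torch.nn as nn

class CoarseStage(nn.Module):
    """Large Transformer: text tokens plus reference-audio tokens -> coarse acoustic tokens."""
    def __init__(self, vocab=4096, dim=1024, layers=24):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=16, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, text_tokens, ref_audio_tokens):
        # The reference audio is prepended as a prompt, so the model imitates its voice
        # characteristics without any dedicated speaker-embedding vector.
        x = torch.cat([ref_audio_tokens, text_tokens], dim=1)
        h = self.backbone(self.embed(x))
        return self.head(h)                       # logits over coarse acoustic tokens

class RefinementStage(nn.Module):
    """Smaller model that adds fine acoustic detail on top of the coarse token sequence."""
    def __init__(self, vocab=4096, dim=512, layers=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, coarse_tokens):
        return self.head(self.backbone(self.embed(coarse_tokens)))  # logits over fine tokens
```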
The approach mitigates a familiar issue: the hegemony of the American accent.
“A common industry problem with AI voices is that they’d often sound too American, even if the original speaker doesn’t have an American accent,” Solai said. “AI models are ultimately a representation of their training data, and so we focused on creating a more diverse dataset of English speakers encompassing accents and speech patterns of all varieties.”
Lower-resource languages introduce a second layer of complexity. To navigate these linguistic hurdles, Colossyan works with multiple voice providers around the world, handpicking the best ones for the location, language, or use case. The strategy has yielded “huge success in Southeast Asia,” Kovacs said, but “penetrating the MENA region with all the Arabic dialects is really hard.”
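In practice, that kind of handpicking can be as simple as a routing table keyed by language and region, with a general-purpose fallback. The table and provider names below are invented for illustration and don’t reflect Colossyan’s actual vendors.

```python
# Hypothetical sketch of routing a synthesis request to a voice provider by locale.
# Provider names and the mapping itself are invented for illustration.
PROVIDER_TABLE = {
    ("vi", "VN"): "provider_a",   # e.g. a vendor strong on Vietnamese voices
    ("th", "TH"): "provider_a",
    ("ar", "SA"): "provider_b",   # Arabic dialects may need different vendors per country
    ("ar", "EG"): "provider_c",
}

def pick_voice_provider(language: str, region: str, default: str = "provider_default") -> str:
    """Return the preferred provider for a locale, falling back to a general-purpose default."""
    return PROVIDER_TABLE.get((language, region), default)

print(pick_voice_provider("ar", "EG"))   # -> provider_c
print(pick_voice_provider("en", "GB"))   # -> provider_default
```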

Dominik Mate Kovacs, founder and CEO of Colossyan. Credit: Colossyan
The scar tissue
The illusion is fragile at the edges. A camera placed too far back, a subject with heavy facial hair — small deviations from the expected that, Solai said, “can present more challenges to the model.”
The hardest problem isn’t building a convincing avatar. It’s keeping it convincing at speed. Every frame demands fresh calculations: muscle shifts in micro-expressions, light diffusing through synthetic skin, voice-to-mouth sync that must hold within milliseconds. “The remaining bottlenecks are often at the convergence of realism and speed, which is required to support real-time use cases,” said Ron-Pereg. No single failure is large. But the human brain, evolved over millennia to read faces, needs only one.
Nonetheless, the abyss Kovacs entered in 2018 has shrunk dramatically. By treating the human presence as a series of engineering problems, AI avatar developers are systematically bridging the uncanny valley.
Thumbnail credit: Colossyan