Stop Building Voice Wrappers: The Architecture Behind Reliable Voice Agents

Hakim Amri

ML Engineer
Updated: 26 Jun 2026
Published: 26 Jun 2026
Reading time: 20 min

Executive summary

Voice AI only creates real customer value when it feels like a live conversation, not a chatbot with speech added. Customers expect instant acknowledgement, natural turn-taking, and continuity while complex reasoning happens in the background. By separating real-time interaction from deeper reasoning through a Fast Brain / Slow Brain architecture, enterprises can turn existing AI agents into responsive, reliable voice experiences that are ready for production.
The key takeaway: do not block interaction on reasoning. Voice agents need a conversational control layer, not just STT and TTS around a chatbot.

A Voice Agent Is Not a Chatbot That Talks

Modern AI can reason, retrieve data, write code, and coordinate multi-step workflows. That is why we have already helped many teams invest in chat-based agents for support, sales, onboarding, internal operations, and domain-specific workflows.

Alongside that, the market trend around Voice AI is clear. With $1.23B raised by Voice AI startups in January 2026 alone, Voice AI is making its way into enterprise architectures, the goal being to let users converse with a chatbot.

The next step therefore looks obvious: take the chatbot that already works, add speech-to-text on the way in, add text-to-speech on the way out, and now you have a voice agent.

Except you usually do not.

What you get is a talking API.

The difference matters. A talking API can process spoken requests. A conversational voice agent can participate in a live interaction. It keeps the experience alive by keeping interaction latency as low as possible while the real reasoning happens somewhere else.

This is the architectural gap most Voice AI projects hit after the first demo. The model may be strong. The business logic may be correct. The TTS voice may sound natural. But the experience still feels wrong.

Across several Voice AI projects at ML6, we observed a simple reason for this: voice is not just another input/output channel. Voice changes the control flow of the product, because the product must remain responsive while it processes a request.

[Figure: voice AI 1]

The trap: confusing voice with conversation

A chat interface is usually built around discrete turns:

  1. The user sends a message.
  2. The backend reasons, retrieves, calls tools, or queries databases.
  3. The model returns an answer.
  4. The UI displays the result.

That pattern works well for text. In text, waiting is normal. The user can see that something is loading with visual cues and automatic “thinking, loading, or processing” messages. They can scroll, think, or edit their next message. The interface does not have to fill every second.

Voice is different.

In a voice conversation, silence is information. A 300-millisecond pause can feel natural. A two-second pause can feel hesitant. Five seconds of silence can make the user wonder whether the system crashed, whether the microphone stopped working, or whether they should repeat themselves.

Human conversation is also not made of clean API calls. People pause mid-thought. They restart sentences. They say "actually..." and change direction. They interrupt. They ask "are you still there?" while the other person is thinking. They use backchannels like "uh-huh" or "yeah" that are not new requests, but signals that the conversation is still alive.

When you put STT and TTS around a synchronous chat system, the architecture still assumes this:

request → reasoning → answer

But the user expects this:

continuous listening → bot presence signals → reasoning in the background → interruption handling → adaptive response

That mismatch is why many voice agents feel brittle even when each individual model in the stack is good.

Why a simple STT/TTS layer breaks down

The most common failure mode in Voice AI is not bad speech recognition or bad speech synthesis. Those can be problems, but they are not the core problem.

The core problem is that a request-based system is being asked to behave like a real-time conversational system.

1. The biological latency crisis

A normal chat agent can take time. It may need to retrieve documents, run a search, call an internal API, inspect a user account, query a database, or synthesize a multi-step answer. Depending on the domain, that may take 1.5 seconds, 5 seconds, or longer.

For text, that delay can be acceptable. The interface gives the user visual evidence that the system is working: a loading spinner, a typing indicator, or simply the persistent chat window. The user can also look away, reread the previous message, or start thinking about their next input without feeling that the interaction has collapsed.

For voice, the same delay is experienced as silence. There is no visible surface carrying the interaction forward. The user has to infer solely from the absence of sound whether the agent is thinking, listening, frozen, or waiting for them to speak again.

Humans are extremely sensitive to turn-taking latency. We expect the other side to react quickly, even if the final answer takes longer. That reaction does not need to be the full answer. It can be a small bot presence signal: "Got it," "Let me check that," or "I understand what you are asking."

But it must happen fast.

If the system waits until the full reasoning chain is complete before saying anything, the conversation dies during the most important part of the interaction: the moment right after the user stops speaking.

2. Busy brain, dead air

Now imagine a banking assistant.

The user asks:

"Can you check when my phone subscription was paid last month?"

This is not a simple text generation task. The assistant may need to authenticate the user, identify the right subscription, call a transaction API, inspect payments, and generate a safe answer.

In a synchronous chat architecture, the same component that manages the conversation is also busy doing the work. While it is querying the backend, it cannot gracefully manage the human. It cannot acknowledge uncertainty, handle a user correcting their initial request, or react when the user interrupts.

The brain is busy, so the mouth goes silent.

Adding canned filler phrases helps only superficially. A one-time "let me check" is better than silence, but it is not enough. The user may speak again. They may clarify. They may change the date range. They may ask whether the assistant is still there. A real voice agent has to keep listening and managing the interaction while the slow work continues.

3. Turn-taking is not a silence threshold

Many early voice systems use a simple rule: when the user stops speaking for N milliseconds, treat the turn as complete.

That works in demos. It fails in real conversations.

A silence threshold cannot reliably distinguish between:

  • a user finishing a request;
  • a user pausing mid-sentence;
  • a user thinking;
  • a user backchanneling;
  • a user correcting themselves;
  • a user interrupting the agent;
  • background noise or cross-talk.

For example:

"I need to change my flight from Paris to... actually, no, wait, can you first check the refund policy?"

A rigid system may prematurely send "I need to change my flight from Paris" to the reasoning backend, start the wrong workflow, and then treat the correction as a new request. From the user's perspective, the assistant is not listening. It is just chopping speech into API calls.
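To make the failure concrete, here is a minimal sketch of silence-threshold endpointing applied to the utterance above. The `Word` type, the timings, and the 700 ms threshold are all hypothetical values chosen for illustration, not measurements from a real system.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str      # a transcribed speech segment
    start_ms: int  # when the segment starts
    end_ms: int    # when the segment ends

def split_on_silence(words: list[Word], threshold_ms: int) -> list[str]:
    """Naive endpointing: close the turn whenever a gap exceeds threshold_ms."""
    turns, current = [], []
    for i, word in enumerate(words):
        current.append(word.text)
        is_last = i == len(words) - 1
        gap = 0 if is_last else words[i + 1].start_ms - word.end_ms
        if is_last or gap > threshold_ms:
            turns.append(" ".join(current))
            current = []
    return turns

# Hypothetical timings: the speaker hesitates for 900 ms after "to".
utterance = [
    Word("I need to change my flight from Paris to", 0, 2400),
    Word("actually, no, wait,", 3300, 4200),
    Word("can you first check the refund policy?", 4300, 6500),
]
turns = split_on_silence(utterance, threshold_ms=700)
print(turns)
```

Under these assumed timings, the hesitation closes the first turn early, so the unfinished flight request reaches the backend as if it were complete and the correction arrives as a separate request.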

4. Discrete turns fight continuous interaction

The deeper issue is control flow.

A chat system treats each message as a unit of work. A voice conversation is a stream of events. The system needs to react to speech start, partial transcripts, pauses, barge-ins, confirmations, cancellations, slow-tool results, and final answer delivery.

If every new user utterance restarts the whole pipeline, the agent becomes fragile. If the user interrupts while the system is generating a response, the system should not simply restart, ignore the interruption, or continue speaking over them. It should update the conversation state.

This is not a model problem. It is an architecture problem.
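One way to picture the difference is an event handler that updates a session instead of restarting a pipeline. The sketch below is illustrative only: the event names, the `Session` fields, and the `apply` function are assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class EventType(Enum):
    PARTIAL_TRANSCRIPT = auto()
    BARGE_IN = auto()
    CANCEL = auto()
    SLOW_TOOL_RESULT = auto()

@dataclass
class Session:
    transcript: str = ""
    agent_speaking: bool = False
    pending_task: Optional[str] = None
    result: Optional[str] = None

def apply(session: Session, kind: EventType, payload: str = "") -> Session:
    """Update the session in place for one event; nothing restarts the pipeline."""
    if kind is EventType.PARTIAL_TRANSCRIPT:
        session.transcript = payload      # latest partial replaces the previous one
    elif kind is EventType.BARGE_IN:
        session.agent_speaking = False    # yield the floor, keep the pending task
    elif kind is EventType.CANCEL:
        session.pending_task = None       # only an explicit cancel drops the task
    elif kind is EventType.SLOW_TOOL_RESULT:
        session.result, session.pending_task = payload, None
    return session

session = Session(agent_speaking=True, pending_task="check_payment")
for kind, payload in [
    (EventType.BARGE_IN, ""),
    (EventType.PARTIAL_TRANSCRIPT, "are you still"),
    (EventType.PARTIAL_TRANSCRIPT, "are you still there"),
    (EventType.SLOW_TOOL_RESULT, "example result"),
]:
    apply(session, kind, payload)
print(session)
```

In this sketch the barge-in silences the agent but leaves `pending_task` alone, so the result can still be delivered once it arrives.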

The key reframing: decouple interaction from reasoning

To turn a chat interface into a successful voice agent, the goal should not be "add voice to the chatbot."

The goal should be:

Build a conversational layer that decouples interaction from reasoning.

The existing chat agent, business logic, tools, and retrieval pipelines do not necessarily need to be thrown away for a voice use case. In many enterprise systems, that backend is the hard-earned part. It contains the domain knowledge, compliance logic, integrations, prompts, evaluation work, and edge-case handling.

But it should no longer be responsible for the live mechanics of conversation.

That is where the Fast Brain / Slow Brain architecture becomes useful.

Fast Brain / Slow Brain: using dual-process thinking for interaction design

The idea is inspired by Kahneman's dual-process theory: a fast, reactive system for immediate behavior and a slower, deliberate system for deeper reasoning.

In AI, people often talk about this distinction in the context of reasoning quality: use a fast model for easy tasks and a slower model for hard tasks. That is useful, but it misses the more important point for Voice AI.

In voice, fast/slow is not only about how the system thinks. It is about how the system behaves.

The Fast Brain manages the human.

The Slow Brain manages the problem.

The Fast Brain is the interaction layer. It handles the real-time mechanics of the conversation: listening, giving presence signals, turn-taking, interruptions, conversational state, possibly lightweight routing, and the spoken delivery.

The Slow Brain is the reasoning layer. It handles the expensive work: retrieval, business logic, database queries, planning, and final answer synthesis.

The two systems communicate asynchronously. The Fast Brain can send a request to the Slow Brain, but it does not block while waiting. It continues to manage the live conversation.
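A minimal sketch of that non-blocking handoff, using Python's `asyncio`. The `slow_brain` and `fast_brain` functions are hypothetical stand-ins, and the sleep simulates backend latency; a real Fast Brain would use the wake-ups to listen and respond rather than simply loop.

```python
import asyncio

async def slow_brain(request: str) -> str:
    """Stand-in for the existing chat agent; the sleep simulates tool calls."""
    await asyncio.sleep(0.05)
    return f"answer to: {request}"

async def fast_brain(request: str) -> list[str]:
    spoken = ["Got it, let me check that."]          # immediate presence signal
    task = asyncio.create_task(slow_brain(request))  # hand off without awaiting
    while True:
        # Wake up regularly instead of blocking on the result; a real Fast Brain
        # would use these wake-ups to listen, handle barge-in, or speak again.
        done, _ = await asyncio.wait({task}, timeout=0.01)
        if done:
            break
    spoken.append(task.result())
    return spoken

spoken = asyncio.run(fast_brain("when was my subscription paid?"))
print(spoken)
```

The acknowledgement is produced before the Slow Brain has done any work, which is the property the rest of the architecture builds on.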

[Figure: voice AI blog 2, the Fast Brain / Slow Brain architecture]

The Fast Brain: the interaction layer

The Fast Brain is not the place for heavy business reasoning. Its job is to keep the conversation alive.

It should be optimized for latency, interruptibility, and state awareness. In practice, it behaves less like a classic chatbot and more like a real-time controller.

Its responsibilities include:

Instant and continuous presence signals

Presence signals are short conversational responses that keep the interaction alive. They can include, for example, acknowledgements or progress updates.

When the system receives an input, it should quickly signal that it heard the user.

This can be as small as:

"Got it."

or as specific as:

"Okay, I will check the payment date for that subscription."

This instant signal does two things. It reduces perceived latency, and it confirms that the system understood enough to proceed.

Additionally, when the Slow Brain needs time, the Fast Brain should generate short, context-aware presence signals:

"Let me look that up."

"I am checking the latest transaction now."

"This may take a moment because I need to verify it against your account history."

These should not be random fillers. They should reflect the actual system state. If the system is querying a database, say that. If it is comparing multiple records, say that. If it needs confirmation, ask for it.
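As a rough sketch, presence signals can be derived from the state the Slow Brain reports rather than picked at random. The status names and the phrases for comparing records and asking for confirmation are assumptions made for this example.

```python
from enum import Enum
from typing import Optional

class SlowBrainStatus(Enum):
    # Hypothetical statuses the Slow Brain could report while working.
    QUERYING_DATABASE = "querying_database"
    COMPARING_RECORDS = "comparing_records"
    NEEDS_CONFIRMATION = "needs_confirmation"

PRESENCE_SIGNALS = {
    SlowBrainStatus.QUERYING_DATABASE: "I am checking the latest transaction now.",
    SlowBrainStatus.COMPARING_RECORDS: "I am comparing a few records, one moment.",
    SlowBrainStatus.NEEDS_CONFIRMATION: "Before I continue, can you confirm that?",
}

def presence_signal(status: SlowBrainStatus, last_spoken: Optional[str] = None) -> Optional[str]:
    """Return a phrase that reflects the real system state, or None to avoid repeating it."""
    phrase = PRESENCE_SIGNALS[status]
    return None if phrase == last_spoken else phrase

first = presence_signal(SlowBrainStatus.QUERYING_DATABASE)
repeat = presence_signal(SlowBrainStatus.QUERYING_DATABASE, last_spoken=first)
print(first, repeat)
```

Returning `None` for an unchanged state is one possible way to keep the agent from repeating the same line; a production system might rephrase instead.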

Conversational state

The Fast Brain tracks what is happening in the interaction:

  • Is the user still speaking?
  • Has the user completed a thought?
  • Is the Slow Brain currently working?
  • Has the user interrupted the current answer?
  • Has the user modified the request?
  • Is the system waiting for confirmation?
  • Is the system allowed to speak now?

This state is separate from the reasoning state. That separation is critical.
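The questions above map naturally onto a small state object. The field names below are hypothetical, and the rule in `may_speak` is one simple assumption about when the agent is allowed to take the floor.

```python
from dataclasses import dataclass

@dataclass
class ConversationState:
    user_speaking: bool = False          # Is the user still speaking?
    thought_complete: bool = False       # Has the user completed a thought?
    slow_brain_busy: bool = False        # Is the Slow Brain currently working?
    user_interrupted: bool = False       # Has the user interrupted the current answer?
    request_modified: bool = False       # Has the user modified the request?
    awaiting_confirmation: bool = False  # Is the system waiting for confirmation?

    @property
    def may_speak(self) -> bool:
        """Assumed rule: the agent speaks only when the user is not holding the floor."""
        return not self.user_speaking and not self.user_interrupted

waiting = ConversationState(thought_complete=True, slow_brain_busy=True)
talking = ConversationState(user_speaking=True, slow_brain_busy=True)
print(waiting.may_speak, talking.may_speak)
```

Note that `slow_brain_busy` does not affect `may_speak`: the agent can still talk while the reasoning layer works, which is the point of keeping the two states separate.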

Turn-taking

The Fast Brain decides when to listen, when to speak, when to wait, and when to yield.

It should treat speech as a stream, not as a sequence of perfectly separated turns. That means it needs to handle partial information and revise its decision as the user continues speaking.

Interruption handling

A useful voice agent must be built for barge-in.

If the assistant is speaking and the user interrupts, the Fast Brain should stop or fade the output, listen to the new input, and decide whether to cancel, update, or continue the Slow Brain task.

Without this, the interaction feels non-human. The user is forced to wait for the machine to finish, even when the machine is no longer answering the right question.
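The "stop the output" half of barge-in can be sketched with a cancellable playback task. The `speak` coroutine is a hypothetical stand-in for streaming TTS, and the timings are arbitrary; deciding what to do with the Slow Brain task is left out of this example.

```python
import asyncio

ANSWER = "Your request is being processed and I will read out every detail of the result now"

async def speak(text: str, spoken: list[str]) -> None:
    """Stand-in for streaming TTS: emits one word per tick and can be cancelled."""
    for word in text.split():
        spoken.append(word)
        await asyncio.sleep(0.02)

async def barge_in_demo() -> tuple[list[str], bool]:
    spoken: list[str] = []
    playback = asyncio.create_task(speak(ANSWER, spoken))
    await asyncio.sleep(0.05)   # the user starts talking partway through
    playback.cancel()           # stop the output instead of talking over them
    try:
        await playback
    except asyncio.CancelledError:
        pass                    # expected: playback was interrupted on purpose
    return spoken, playback.cancelled()

spoken, was_cancelled = asyncio.run(barge_in_demo())
print(len(spoken), "words spoken before the interruption")
```

Keeping track of which words were actually spoken is useful afterwards, because the Fast Brain then knows what the user did and did not hear.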

Minimal routing

The Fast Brain can decide whether a request needs the Slow Brain at all.

Some inputs do not require heavy reasoning:

"Can you repeat that?"
"Stop."
"Wait."
"Uh-huh."
"No, I meant last month."
"Are you still there?"

Sending every one of these to the full backend is wasteful and often harmful. The Fast Brain should handle interaction-level intents locally and route only domain work to the Slow Brain.
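A deliberately simple sketch of such a local router, using keyword patterns over the transcript. The intent names and regular expressions are assumptions for illustration; a production Fast Brain would more likely use a small classifier or a fast LLM for this step.

```python
import re

# Hypothetical interaction-level intents, checked in order.
INTERACTION_INTENTS = {
    "repeat": re.compile(r"\b(repeat|say that again)\b"),
    "stop": re.compile(r"^(stop|wait)\b"),
    "backchannel": re.compile(r"^(uh-huh|yeah|okay|mm-hm)[.!]?$"),
    "presence_check": re.compile(r"\bare you (still )?there\b"),
    "correction": re.compile(r"^(no|actually),? i meant\b"),
}

def route(utterance: str) -> str:
    """Return an interaction intent handled locally, or 'slow_brain' for domain work."""
    text = utterance.strip().lower()
    for intent, pattern in INTERACTION_INTENTS.items():
        if pattern.search(text):
            return intent
    return "slow_brain"

for example in ["Can you repeat that?", "Uh-huh.", "Are you still there?",
                "Can you check when my phone subscription was paid last month?"]:
    print(example, "->", route(example))
```

A correction such as "No, I meant last month." is recognized locally here, but the Fast Brain would still need to pass the updated date range to any pending Slow Brain task.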

Final response rendering

The Slow Brain may produce the answer content, but the Fast Brain should decide how to deliver it conversationally.

A long, structured answer that works in chat may be terrible when spoken. The Fast Brain can compress it, sequence it, ask whether the user wants more detail, adapt it to what happened during the conversation, or even discard it if it is no longer relevant.

For voice, final answer generation is not just "read the text aloud." It is response design.

This does introduce governance concerns. In many enterprise systems, the existing chat agent is already constrained by policies around safety, compliance, tone, or what can and cannot be said. If the Fast Brain reformulates the Slow Brain’s answer, it becomes part of the governed answer path as well.

The Fast Brain can adapt delivery, but it should not silently change the meaning, compliance posture, or policy guarantees of the Slow Brain’s answer.
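One conservative form of response design is sequencing: splitting a long chat answer into short spoken turns without rewording it, so the governed content stays intact. The function name, the two-sentence limit, and the sample answer below are assumptions for this sketch.

```python
import re

def to_spoken_chunks(answer: str, max_sentences: int = 2) -> list[str]:
    """Group sentences into short spoken turns without rewording them."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    return [
        " ".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]

# Hypothetical chat answer, used only to illustrate the chunking.
chat_answer = (
    "Your phone subscription was paid last month. "
    "The payment came from your main account. "
    "The same payment is scheduled again next month. "
    "You can change the paying account in the app. "
    "Changes apply from the following billing cycle."
)
chunks = to_spoken_chunks(chat_answer)
for chunk in chunks:
    print(chunk)
```

The Fast Brain could speak the first chunk and ask whether the user wants the rest. Because the chunks join back into the original text, this kind of adaptation changes delivery only, not meaning.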

The Slow Brain: the reasoning layer

The Slow Brain is where the existing chat agent usually belongs.

It should be allowed to be slow, deliberate, and tool-heavy. Its job is not to manage live turn-taking. Its job is to solve the problem correctly.

Its responsibilities include:

Tool calling

The Slow Brain can call internal APIs, execute workflows, query CRMs, inspect transactions, create tickets, or retrieve account-specific information. It could also run RAG pipelines over policies, contracts, support docs, product catalogs, or enterprise knowledge bases.

Domain logic

It can enforce business rules, apply compliance constraints, follow escalation policies, and use domain-specific prompts or models.

Answer generation

It can synthesize the result into a structured answer payload for the Fast Brain.

That payload does not need to be the final spoken response. A better pattern is for the Slow Brain to return the facts, confidence, source references, and recommended answers. The Fast Brain then decides how to deliver them in the live conversation.
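A possible shape for that payload is sketched below. The field names, the 0.7 confidence threshold, and the hedging prefix are all assumptions for illustration; the article only specifies that facts, confidence, source references, and recommended answers are returned.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class SlowBrainResult:
    facts: list[str]
    confidence: float                                  # assumed to be in [0, 1]
    sources: list[str] = field(default_factory=list)   # references for auditability
    recommended_answer: str = ""

def render(result: SlowBrainResult, min_confidence: float = 0.7) -> str:
    """Pick what to say; hedge out loud when the assumed threshold is not met."""
    answer = result.recommended_answer or " ".join(result.facts)
    if result.confidence < min_confidence:
        return "I am not fully certain, but " + answer
    return answer

# Hypothetical result, used only to exercise the rendering logic.
result = SlowBrainResult(
    facts=["A payment to the phone provider was found last month."],
    confidence=0.9,
    sources=["transactions-api"],
    recommended_answer="Yes, your phone subscription was paid last month.",
)
print(render(result))
```

Keeping the sources in the payload, even when they are never spoken, gives observability tooling something to log against each answer.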

The invariant is simple:

The Fast Brain never blocks. The Slow Brain never owns the interaction.

A concrete example: “Are you still there?”

Imagine a user asking a banking assistant a slightly complex but very practical question:

“Can you check whether my phone subscription was paid last month, and if yes, tell me from which account?”

This is not a simple FAQ. The assistant may need to authenticate the user, inspect recent transactions, identify the merchant, match it to a recurring subscription, and then generate a useful answer.

In a chat interface, waiting a few seconds is acceptable. In a voice interaction, silence immediately feels broken.

The basic STT/TTS wrapper approach

In a naive voice setup, the system still behaves like a request-response API:

[Figure: Voice AI blog 3, the basic STT/TTS wrapper interaction]

When the user asks if the bot is still there, the system detects a new user input. Because the architecture is built around discrete turns, it treats “Are you still there?” as the next request.

The pipeline therefore restarts, and the assistant simply responds to the latest request.

But this is wrong from the user’s perspective.

The assistant has technically answered the latest utterance, but it has lost the interactional thread. The previous request may still exist somewhere in the conversation context, but the active control flow has moved on. The system is no longer managing the original task as an ongoing interaction.

The user now has to repeat themselves:

User: I asked whether my phone subscription was paid last month.

This is the core failure: the model may have memory of the previous request, but the voice system does not have a conversational state machine that understands the previous request is still pending.

The problem is not that the LLM forgot. The problem is that the architecture has no separate interaction layer responsible for saying:

“I heard you. I’m still checking that.”

Instead, every user utterance competes to become the new main request.

The Fast/Slow Brain approach

In a Fast Brain / Slow Brain architecture, the same interaction behaves differently:

[Figure: Voice AI blog 4, the Fast Brain / Slow Brain interaction]

The Slow Brain receives the complex banking request and starts working asynchronously.

Meanwhile, the Fast Brain remains active. It does not block while the Slow Brain is working. Its job is to manage the live conversation.

The important difference is that asking “Are you still there?” does not overwrite the original request. The Fast Brain understands it as an interaction-management utterance, not a new domain task for the Slow Brain.

The Slow Brain continues working in the background until it returns the result, and the Fast Brain then delivers the answer.

The result is not just lower perceived latency. It is a fundamentally different interaction model. The assistant no longer behaves like a voice-enabled API endpoint. It behaves like an agent that can stay present while thinking.
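The scenario can be replayed in a few lines of `asyncio`. This is a sketch under stated assumptions: `slow_brain` is a hypothetical stand-in for the banking backend, an `asyncio.Event` simulates its latency, and a substring check stands in for real intent detection.

```python
import asyncio

async def slow_brain(request: str, backend_done: asyncio.Event) -> str:
    """Stand-in for the banking backend; the event simulates its latency."""
    await backend_done.wait()
    return f"[result for: {request}]"

async def conversation() -> list[str]:
    spoken: list[str] = []
    backend_done = asyncio.Event()
    request = "Was my phone subscription paid last month?"
    pending = asyncio.create_task(slow_brain(request, backend_done))
    spoken.append("Let me check that.")

    # The user speaks again while the original request is still pending.
    utterance = "Are you still there?"
    if "still there" in utterance.lower() and not pending.done():
        # Interaction-level utterance: answer locally, leave the task untouched.
        spoken.append("I heard you. I'm still checking that.")

    backend_done.set()            # the backend finishes
    spoken.append(await pending)  # the original request is answered, not lost
    return spoken

spoken = asyncio.run(conversation())
print("\n".join(spoken))
```

The user never has to repeat the question: the pending task holds the original request while the Fast Brain answers the presence check.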

Choosing the right voice stack for the Fast Brain

Once you accept that you need a conversational layer, the next question is implementation.

There are three common options for building the speech-to-speech loop. None is universally best. The right choice depends on latency requirements, customization, and how much control you need over the intermediate representations.

The three voice stacks, compared by how they work, main strengths, main tradeoffs, and best fit:

STT → LLM → TTS

  • How it works: User audio is transcribed by an STT model. The transcript is passed to the Fast Brain LLM. The response text is then spoken by a TTS model.
  • Main strengths: Most modular approach. High control over each layer. Clear visibility into what text the LLM received. STT and TTS can be tuned independently. Works well with a properly designed Fast Brain.
  • Main tradeoffs: Highest latency of the three options, because every stage adds time. More logic to define for managing the Fast Brain, unless a platform is used.
  • Best fit: Regulated or enterprise environments; teams that need control and auditability; domains where correctness matters more than ultra-low latency.

AudioLM → TTS

  • How it works: The Fast Brain reasons more directly over audio or audio-derived representations. A TTS layer still controls the final spoken output.
  • Main strengths: Lower latency than a classic STT pipeline. Preserves some control over the generated voice and response rendering through the TTS layer. Less logic to define for managing the Fast Brain.
  • Main tradeoffs: Observability is more complex because input audio is directly ingested by the generative model. A parallel transcript can help, but may not match the exact model input. Less tuning flexibility for user input understanding.
  • Best fit: Products that need faster interaction than classic STT; teams that still want custom TTS voices; use cases where partial observability is acceptable.

Native speech-to-speech

  • How it works: A native speech model listens and responds directly, removing many intermediate steps.
  • Main strengths: Fastest and most fluid option. Can offer more natural prosody and interruption behavior. Least logic to define for managing the Fast Brain.
  • Main tradeoffs: Lowest control and transparency. Harder to know exactly what the model heard or how it represented the request. Usually less TTS tuning capability, which can be a limitation when custom voice is important.
  • Best fit: Consumer experiences where naturalness is the main differentiator; prototypes or products where speed matters more than auditability; teams that do not need custom TTS voices.

The migration path: do not rebuild the whole agent

The most useful part of the Fast Brain / Slow Brain approach is that it requires neither throwing away the existing backend nor building a separate one.

Most teams already have a Slow Brain. It may be a chat agent, an agentic workflow, a RAG system, a customer-service automation backend, or a set of deterministic business services.

The migration path is:

  1. Keep the existing backend as the Slow Brain.
  2. Wrap it in an asynchronous task interface.
  3. Build a Fast Brain that owns live conversation state.
  4. Route only domain work to the Slow Brain.
  5. Let the Fast Brain handle presence signals, interruptions, waiting, corrections, and spoken response design.
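Step 2 can be quite small when the existing backend is a blocking function. The sketch below wraps a hypothetical synchronous `existing_chat_agent` with `asyncio.to_thread`; the function, the `SlowBrain` class, and its `submit` method are illustrative names, not part of any specific framework.

```python
import asyncio
import time

def existing_chat_agent(message: str) -> str:
    """Stand-in for a blocking, synchronous backend call."""
    time.sleep(0.05)
    return f"backend answer to: {message}"

class SlowBrain:
    """Asynchronous task interface around the unchanged backend."""

    def submit(self, message: str) -> asyncio.Task:
        # Run the blocking call in a worker thread so the event loop stays free.
        return asyncio.create_task(asyncio.to_thread(existing_chat_agent, message))

async def main() -> tuple[bool, str]:
    task = SlowBrain().submit("check last month's payment")
    returned_immediately = not task.done()   # control is back before the answer
    return returned_immediately, await task

returned_immediately, answer = asyncio.run(main())
print(returned_immediately, answer)
```

The backend code itself is untouched; only the way it is called changes, which is why the migration does not require a rewrite.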

This is how you move from a voiced chatbot to a voice-native agent without rewriting an entire system.

What good looks like

A strong voice agent should be judged less by whether it can speak, and more by whether it behaves conversationally.

A good system should be able to:

  • respond immediately without waiting for heavy reasoning;
  • keep the conversation going while tool calls run;
  • handle interruptions without losing state;
  • distinguish corrections from new requests;
  • adapt long text answers into voice-friendly responses;
  • preserve business logic and compliance boundaries;
  • recover gracefully when the Slow Brain fails.
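The last point, graceful recovery, can be sketched as a timeout plus a spoken fallback around the Slow Brain call. The fallback phrases, the timeout values, and the helper names are assumptions made for this example.

```python
import asyncio
from typing import Awaitable

async def answer_or_fallback(work: Awaitable[str], timeout_s: float) -> str:
    """Return the Slow Brain answer, or a spoken fallback if it is late or fails."""
    try:
        return await asyncio.wait_for(work, timeout=timeout_s)
    except asyncio.TimeoutError:
        return "This is taking longer than expected. Shall I keep trying?"
    except Exception:
        return "I could not complete that check. Do you want me to try again?"

async def healthy() -> str:
    return "Your subscription payment was found."

async def broken() -> str:
    raise RuntimeError("backend unavailable")

async def main() -> list[str]:
    return [
        await answer_or_fallback(healthy(), 1.0),
        await answer_or_fallback(asyncio.sleep(1, result="late"), 0.01),
        await answer_or_fallback(broken(), 1.0),
    ]

results = asyncio.run(main())
print("\n".join(results))
```

In each failure case the user hears something and is offered a next step, instead of being left in silence.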

The core product question becomes:

Can the user interact with the system while the system is thinking?

If the answer is no, you do not yet have a conversational voice agent. You have a voice wrapper.

Conclusion: stop building voice wrappers

Conversationality is an architectural feature, not just a model capability.

Better speech recognition helps. Better voices help. Lower-latency models help. But none of them remove the need for a conversational control layer.

If the system blocks interaction while it reasons, users will experience silence. If every pause is treated as the end of a turn, users will feel interrupted. If every correction restarts the pipeline, users will feel unheard. If the final chat answer is simply read aloud, users will feel like they are listening to a form, not talking to an assistant.

The solution is to decouple interaction from reasoning.

The Fast Brain should manage the human: presence signals, turn-taking, interruption, state, and spoken delivery.

The Slow Brain should manage the problem: retrieval, tools, business logic, safety, and answer generation.

Fast/slow is not just how the system thinks. It is how the system behaves.

That is the difference between adding voice to a chatbot and building a voice agent that can actually hold a conversation.

Where ML6 fits in

At ML6, we help organizations move from Voice AI demos to production-ready conversational systems.

Many teams already have strong chat agents, RAG pipelines, tool integrations, business rules, and domain-specific workflows. The challenge is turning those systems into voice experiences that feel live, responsive, and reliable. It requires the right architecture: a Fast Brain that manages the interaction, a Slow Brain that handles the reasoning, and the engineering discipline to make both work in production.

Through our Conversational AI offering, we build custom AI agents and voice-enabled assistants tailored to real business workflows: customer support, internal operations, onboarding, sales enablement, procurement, field service, and knowledge work. Our role is to help you preserve the value of your existing backend while adding the conversational layer needed for real-time voice interaction.

Beyond individual assistants, we help enterprises productionize these systems through strong AI engineering practices: evaluation, observability, latency optimization, guardrails, integration with enterprise systems, and governance. The goal is not just to make the agent speak. The goal is to make it useful, safe, measurable, and robust enough for business-critical environments.

Whether you are exploring Voice AI for the first time, extending an existing chat agent into a voice channel, or trying to move a prototype into production, ML6 can help you design and build the architecture that makes conversational AI enterprise-ready.

If you are exploring how voice-native agents could create value in your organization, we are happy to start the conversation.

About the author

Hakim Amri

Hakim is passionate about building AI solutions that create tangible business value. As a Machine Learning Engineer at ML6, he focuses on designing and shipping scalable Voice AI and Generative AI applications, with hands-on experience across cloud platforms, MLOps, LLMs, and end-to-end ML systems. Hakim combines a strong technical foundation with a business engineering background, allowing him to bridge stakeholder needs and engineering requirements in enterprise environments.
