
Thomas De Decker
Machine Learning Engineer
Think about the last time you spoke to a digital assistant. Now, imagine that this assistant truly understands you and responds naturally in your language. What if it even acts on your behalf? But before you picture a sudden robot takeover, let's clarify: this is not about replacing your human touch. This is the rapidly advancing reality of AI voice agents. Continue reading to find out if your business can leverage voice agents.
We are no longer limited to simple text-based interactions, where you type messages and get text replies. Now, envision intelligent voice assistants that understand spoken requests and proactively get things done. This technology offers a spectrum of possibilities, from hands-free control over your existing applications to fully autonomous actions (e.g., scheduling appointments). For businesses navigating today’s global markets, where efficient and natural communication is paramount, grasping the full potential of this technology is becoming increasingly important.
AI is no longer just a buzzword; it’s a strategic reality. You already know about ChatGPT, chatbots, and Large Language Models. Now, there’s a new game-changer on the rise: AI Agents. If you're unfamiliar with the term, it’s time to catch up—here’s what you need to know.
Leading AI research lab Anthropic has published an interesting article[1] on building effective agents. They use the umbrella term "agentic systems" and distinguish between two ways these systems operate: workflows and agents.
A workflow is like following a set of instructions. Imagine you have a really detailed instruction manual for assembling furniture. It tells you exactly which screw goes where and in what order. That is similar to how an AI workflow operates. It uses AI, more specifically a powerful language model, along with other tools to follow a path pre-defined by the people who built it. The steps and tools are decided in advance, like in that fixed instruction manual.
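To make this concrete, here is a minimal sketch of a workflow in Python. The `call_llm` helper is a hypothetical stand-in for any LLM API; the point is that the developer, not the model, fixes the steps and their order.

```python
# A minimal AI-workflow sketch: the steps and their order are fixed
# by the developer, like the instruction manual described above.

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM API call (OpenAI, Anthropic, ...).
    return f"<LLM response to: {prompt[:40]}...>"

def support_ticket_workflow(ticket: str) -> str:
    # Step 1: always classify first.
    category = call_llm(f"Classify this ticket (billing/technical/other): {ticket}")
    # Step 2: always summarize second.
    summary = call_llm(f"Summarize this ticket in one sentence: {ticket}")
    # Step 3: always draft a reply last, using the results of steps 1 and 2.
    return call_llm(f"Write a polite reply for a {category} ticket. Summary: {summary}")
```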
An agent is more like a capable, intelligent assistant. Let's say you give it a task: “plan a three-day trip to Ghent for under €500, including accommodation and activities”. The agent takes it from there and decides which websites to check, how to compare prices, and in what order to do things. An agent uses its intelligence and available tools to figure out the best way to achieve a goal.
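By contrast, an agent sketch puts the model in the driver's seat: at each turn, the model chooses which tool to call next until it decides it is done. As before, `call_llm` and the tools here are hypothetical stubs, not a real API.

```python
# A minimal agent-loop sketch: the model, not the developer, decides
# which tool to call next. All names here are hypothetical stubs.
import json

def call_llm(goal: str, history: list) -> str:
    # Stand-in for an LLM that returns a JSON action, e.g.
    # {"tool": "search_hotels", "args": {"city": "Ghent"}} or {"tool": "finish", ...}.
    return json.dumps({"tool": "finish", "args": {"answer": "itinerary..."}})

TOOLS = {
    "search_hotels": lambda city: f"hotels in {city}",
    "compare_prices": lambda items: f"cheapest of {items}",
}

def run_agent(goal: str, max_steps: int = 10) -> str:
    history = []
    for _ in range(max_steps):  # cap the loop so the agent cannot run forever
        action = json.loads(call_llm(goal, history))
        if action["tool"] == "finish":
            return action["args"]["answer"]
        # Execute the tool the model chose and feed the result back.
        result = TOOLS[action["tool"]](**action["args"])
        history.append({"action": action, "result": result})
    return "gave up after max_steps"
```

Note the structural difference: in the workflow, the control flow lives in the code; in the agent, it emerges from the model's decisions at runtime, which is exactly why agents are harder to test and manage.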
This difference is essential for businesses. AI workflows are best for predictable, routine tasks that need consistency and simpler development. AI agents are better suited for complex, changing goals that require adaptability, although they are typically harder to design and manage.
As AI agents become smarter, the way we interact with them must be equally intuitive. That’s where other modalities, such as vision and voice, come into play. Among these, voice is a particularly compelling interface because it is inherently natural and intuitive for humans. Giving AI agents the power of speech unlocks significant possibilities for businesses.
Currently, the dominant architecture enabling this voice is a sequential three-step pipeline: Speech-to-Text (STT) → Large Language Model (LLM) → Text-to-Speech (TTS). First, when a user speaks, a real-time STT component, also known as Automatic Speech Recognition (ASR), captures the voice input and transcribes it into text. This text then serves as input for an LLM, which acts as the agent’s cognitive core: it interprets the request and generates a textual response (Natural Language Generation, NLG). Finally, the TTS component takes the LLM-generated text and synthesizes it into audible speech for the user. This modular approach allows for flexibility and independent optimization of each component.
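A minimal sketch of this pipeline, assuming the open-source `openai-whisper` package for STT and the OpenAI Python SDK for the LLM and TTS steps (other providers slot in the same way, and exact method names may differ across SDK versions). Each stage can be swapped independently, which is the modularity point made above.

```python
# STT -> LLM -> TTS pipeline sketch.
# Assumes `pip install openai-whisper openai` and an OPENAI_API_KEY.
import whisper
from openai import OpenAI

client = OpenAI()

def voice_turn(audio_path: str) -> str:
    # 1. Speech-to-Text: transcribe the user's spoken audio.
    stt_model = whisper.load_model("base")
    user_text = stt_model.transcribe(audio_path)["text"]

    # 2. LLM: generate the agent's reply from the transcript.
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_text}],
    ).choices[0].message.content

    # 3. Text-to-Speech: synthesize the reply as audio for playback.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
    speech.stream_to_file("reply.mp3")
    return reply
```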
While the three-step pipeline is the most widely used approach, an alternative known as Speech-to-Speech (S2S) is gaining increasing attention. In an S2S system, the AI transforms spoken input directly into spoken output, potentially bypassing the intermediate textual representation and the distinct LLM text-generation stage altogether.
The benefits of S2S include the possibility of reduced latency, leading to faster interactions by eliminating the sequential processing steps of the traditional pipeline. Furthermore, S2S allows you to prompt the model directly, influencing both its understanding of the spoken input and the nuances of its output (e.g. language, accent, tone, style). Finally, processing speech directly enables a richer understanding of the spoken input, capturing information that would be lost during text transcription.
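Conceptually, S2S collapses the three stages into a single model call. The sketch below is purely illustrative: `speech_to_speech` is a hypothetical interface, not a real API, but it shows how a style prompt can steer the spoken output directly.

```python
# Hypothetical S2S interface (illustrative only, not a real API):
# audio in, audio out, with a prompt steering language, accent, and tone.

def speech_to_speech(audio_in: bytes, style_prompt: str) -> bytes:
    # One model handles listening, reasoning, and speaking in a single
    # step; no intermediate transcript is produced along the way.
    return b""  # placeholder: a real system returns synthesized audio

# The raw input bytes would normally come from a microphone or a file.
reply_audio = speech_to_speech(
    b"\x00" * 16000,
    style_prompt="Answer in Dutch, warm tone, speak slowly.",
)
```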
However, S2S also presents disadvantages and challenges: it gives up the modularity and independent optimization of the pipeline approach, and the absence of an intermediate text transcript makes interactions harder to log, inspect, and moderate.
While voice offers clear benefits, remember that voice data (recordings, voiceprints) is subject to specific regulations. In the European Union, this data is sensitive personal data under GDPR, requiring explicit consent and strong security, and similar regulations apply in other regions. Businesses must be mindful of and prioritize compliance with all applicable data privacy laws when using voice agents.
We've explored voice agents and how they work, but let's cut to the chase: why should your business care in the first place? Fully automating customer interactions with a purely autonomous system can feel impersonal.
However, the true power of AI voice agents lies in their ability to blend the efficiency of automation with the natural and intuitive modality of human speech. You have the flexibility to deploy them at various levels of autonomy for voice interactions, tailoring their implementation to address specific communication challenges. The crucial step is to analyze your unique use cases and pinpoint where voice agents can provide the most significant benefit, whether it’s augmenting human agents with instant information, automating high-volume routine voice inquiries, or optimizing voice-based customer journeys.
So, how do you identify those high-impact use cases? While voice isn’t a universal solution, especially for tasks needing visual data or complex non-verbal input, it provides distinct advantages in specific scenarios. Consider prioritising voice when users need hands-free interaction, when the phone is already your primary channel, when accessibility matters, or when a natural conversational flow beats typing.
Businesses could strategically apply voice agents to, for example, automating high-volume routine phone inquiries, scheduling appointments, or giving human agents instant, hands-free access to information.
You can achieve tangible, measurable business value by strategically identifying the best use cases like these, and customer experience improves along with it. The key is to move beyond the theoretical and focus on practical applications that address your specific challenges and opportunities. Also, remember that the voice itself is key: it shapes the User Experience (UX), more specifically the Voice Experience (VE), and it represents your brand. A customized, natural, brand-aligned voice builds trust and enhances perception.
However, remember that not every use case is the same. While completely overhauling a business process with the latest voice agents might seem exciting at first, it is not the only way forward. Some applications benefit greatly from the improved customer experience of simply adding an STT or TTS component to an existing application; think of chatbots with the option to read a message aloud or transcribe user input. Adding the audio modality can also enhance accessibility. At the same time, some industries will see a complete makeover within a few years because voice and audio are their core channel: think of the impact voice agents will have on every business that depends on phone calls. There, entirely new business models and solutions will emerge, while others will dwindle away.
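As a low-effort example of adding the audio modality, a read-aloud option can be bolted onto an existing text chatbot with an offline TTS library such as `pyttsx3` (a sketch, assuming `pip install pyttsx3`; any TTS service could be swapped in the same way):

```python
# Adding a "read this message aloud" option to an existing text chatbot,
# using the offline pyttsx3 library (any TTS service works similarly).
import pyttsx3

def read_aloud(message: str) -> None:
    engine = pyttsx3.init()
    engine.say(message)   # queue the bot's reply for speech
    engine.runAndWait()   # block until playback finishes

# e.g. wire this to a speaker icon next to each chatbot reply:
read_aloud("Your order has shipped and should arrive on Friday.")
```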
Furthermore, while constantly improving, even advanced voice agents can struggle to accurately interpret speech amid heavy background noise, understand strong regional accents or highly nuanced language such as sarcasm, or maintain coherent context across very long interactions. Additionally, the interface itself makes tasks that require visual information (like links or data displays) or complex data input significantly harder to execute through purely voice-based interaction.
Thus, you should carefully consider the level of autonomy required for your use case and the importance of the audio modality. Always remember that the user’s needs and the value they derive are the most crucial drivers for any successful implementation.
Investing in voice agents now isn't about chasing the hype; it's about recognizing a strategic imperative. Rapid advancements in AI agents, particularly real-time Speech-to-Speech models such as OpenAI's GPT-4o realtime models[2], Google’s Gemini models[3], and AWS’s newest Nova[4], are enabling a new generation of voice agents capable of low-latency, more natural conversations with improved handling of nuances like tone, emotion, and even language. Embracing these cutting-edge voice technologies early provides hands-on experience that drives innovative user interactions and enhances your organization’s adaptability, setting you up for sustained success beyond the constraints of traditional text.
This post has shown that AI agents represent a pivotal shift in artificial intelligence, moving beyond passive tools to autonomous systems capable of dynamically planning and executing tasks. The integration of voice enhances this potential, creating more intuitive and efficient interactions. However, it also introduces crucial considerations, particularly in regions like the EU, where data privacy regulations such as the GDPR and the upcoming AI Act demand proactive compliance and ethical deployment.
Resources
[1] https://www.anthropic.com/engineering/building-effective-agents
[2] https://openai.com/index/introducing-the-realtime-api/
[3] https://developers.googleblog.com/en/gemini-2-5-flash-pro-live-api-veo-2-gemini-api/