The 4 steps an AI voice agent runs
An AI voice agent in 2026 is a four-block pipeline that fires every time the caller says something. The whole loop takes 0.8–2 seconds, and the person on the line never notices it.
- ASR (Automatic Speech Recognition): the agent "hears" the caller's speech and turns it into text
- NLU + LLM: the language model understands intent, picks a reaction, and drafts the reply
- Logic + Memory: the dialog engine checks the conversation branch, the caller's state, and updates the CRM
- TTS (Text-to-Speech): the synthesizer speaks the reply in a voice you can't tell from a human
Each block is its own neural network or module, and they run in real time: while the caller is still finishing a sentence, ASR is already transcribing the start, the LLM is getting the first tokens, and TTS is beginning to synthesize. This is called streaming — it's what gives the conversation its "live" pace.
What separates a modern AI voice agent from a legacy IVR isn't one technology — it's a pipeline of four neural networks running in streaming mode. Without the LLM, it isn't a voice agent, it's an auto-attendant.
Which AI models are used in 2026
Under the hood of a modern voice agent is a stack of best-in-class models, each owning one block of the pipeline:
| Block | Recommended models & tools |
|---|---|
| ASR (speech-to-text) | Whisper Large-v3, Deepgram Nova-3, AssemblyAI |
| LLM (reasoning) | GPT-4o, Claude Opus 4.7, Gemini 3 Pro |
| TTS (voice synthesis) | ElevenLabs Multilingual v3, OpenAI TTS HD |
| VAD (pause detection) | Silero VAD, WebRTC VAD, pyannote |
| Orchestration | Vapi, Retell, Bland |
| Telephony | Twilio, SIP / VoIP trunks |
In real PrimexAI deployments we combine the best of each layer: most often Deepgram or Whisper for ASR, GPT-4o or Claude for the LLM, and ElevenLabs for the voice, orchestrated on Vapi, Retell, or Bland and wired to phone numbers through Twilio. For lighter-weight projects we lean on faster, cheaper models where the script doesn't need a top-tier LLM.
How the agent hears speech — ASR
ASR (speech recognition) is the first block. The audio stream from the phone line (8 kHz or 16 kHz mono) is sliced into 100–300-millisecond chunks and fed to a neural network that returns text.
The hard parts of ASR over a phone line:
- The narrow frequency band (300–3,400 Hz) cuts the high end — higher-pitched voices come through worse
- Line noise: cellular interference, wind, someone mumbling next to the caller
- Filler words ("um," "you know," "like") and mid-sentence changes of direction
- Accents and regional dialects
Modern ASR engines hit 92–97% accuracy on a clean line and 82–88% on a noisy one. That's enough to get the meaning — the LLM on the next step reconstructs poorly recognized stretches from context. For a deeper overview of the technology, see our guide on what an AI voice agent is.
How the agent understands meaning — NLU & LLM
Once it has the transcript, the agent has to figure out what the caller wants: a yes or no, a question, an objection, an agreement to book. Before 2022 this was done with intent classification — an intent classifier. Accuracy ran 75–85%, which meant a lot of misreads.
In 2026, classification is replaced by the LLM. The language model gets four things as input:
- The system prompt — the agent's role, tone, and call goals
- The knowledge base — products, pricing, answers to objections
- The history of the current call — everything both sides have said
- The caller's latest utterance — the recognized text
The LLM returns the agent's next line plus flags (switch the script branch, escalate to a human, close the deal). On GPT-4o or Claude Opus, comprehension accuracy is 95%+ across any scenario, including off-script objections.
A natural voice — TTS
TTS (speech synthesis) is the last block. The LLM's reply text is turned into sound. This is where the gap between a "robocall voice" and a real person comes down to which engine you pick.
What sets 2026-grade TTS apart:
- Emotional coloring — the agent can deliver a line with a smile, surprise, or a touch of regret
- Contextual pauses — before hard words, after exclamations, between thoughts
- Prosody — correct stress and intonation that highlights the key words
- Custom voices — you can clone a specific person's voice (legally — with their consent)
In 2026, ElevenLabs Multilingual v3 and OpenAI TTS HD deliver quality where 70–80% of listeners can't tell they're talking to AI.
Connecting to CRM and telephony
A voice agent without integrations is a toy. To run in production, it needs three connections:
Telephony. The agent connects to your phone system over a SIP/VoIP trunk — most often Twilio. It takes inbound calls, places outbound ones, reads the caller ID, and can warm-transfer the call to a human rep.
CRM. On every call the agent either finds the contact record (by number) or creates a new one. When the call ends, it updates the stage, tags, and fields like "need," "budget," and "timeline," and assigns a task to the right rep. Supported: HubSpot, Salesforce, GoHighLevel, Pipedrive — and anything else over a REST API.
External systems. Booking and scheduling tools (Calendly, Acuity, Cal.com), applicant-tracking systems in HR (Greenhouse, Lever), and industry platforms — so the agent can read and write data in real time.
A real call transcript
Scenario: reactivating a dental practice's dormant patient list. The contact is a woman, 38, whose last visit was 8 months ago.
Agent: Hi, good afternoon! This is Dina calling from Implantum Dental. Is this Maria?
Patient: Yes, speaking.
Agent: Maria, I see your last visit with us was a cleaning back in August. I wanted to let you know we've just opened up our schedule for preventive check-ups at 30% off. Would it be okay if I looked at some open dates for you?
Patient: Let me think about it, I'm busy right now.
Agent: Of course, I won't keep you. Just so I understand — is a check-up something you'd want in principle, just not right now? Or are you seeing a different practice these days?
Patient: No, in principle yes — I've been meaning to for a while.
Agent: Great. Then two options: I can book you for a date that works right now, or text you a link so you can pick the time yourself — whichever you prefer?
Patient: Let's do the text.
Agent: Got it — I'll send it to this number within a minute. Have a great day!
What happened technically: ASR turned speech into text, the LLM caught the "I'm busy" objection and didn't give up — it gently qualified whether the need was there at all. Once it had a yes, it offered two ways to close. The agent didn't make the sale on the call, but it captured a warm lead and queued an SMS task. A human rep picks the lead up the moment the patient clicks the link. For more on scenarios in this industry, see our dentistry case study.
Quality metrics for a voice agent
To judge whether an agent is working well, PrimexAI looks at six metrics:
- Connect rate — % of calls that reach a live person. Normal range 30–50% (depends on the list).
- Engagement rate — % of conversations longer than 30 seconds. Normal range 60–80%.
- Target action rate — % of calls with the target action (a booking, a yes). Normal range 8–22% on reactivation.
- WER (Word Error Rate) — recognition accuracy. Target ≤10%.
- Drop rate — % of abrupt hang-ups by the caller. Target ≤15%.
- Scenario completion — how many conversations cleared every branch without falling back to a human. Target 80%+.
Where voice agents break
Common situations where an agent stumbles:
- Unusual names and last names — ASR can mishear them. Fix: verify against the CRM record before the call.
- Interruptions — the caller talks over the agent. Fix: VAD plus interrupt-handling in the engine.
- Emotional reactions (anger, crying, profanity) — an LLM emotion classifier plus an automatic hand-off to a human.
- Numbers and abbreviations — addresses, ZIP codes, account or order numbers. Fix: spell-mode TTS and a confirming read-back.
- Ambiguous "yes" answers (when the caller agrees without knowing to what). Fix: control questions.
Want to see how an AI voice agent would work in your industry?
On a free diagnostics call I'll play recordings of real conversations from your space and run the ROI math on your own numbers.
Free diagnostics →FAQ
Is an agent really different from a 2010s auto-attendant?
Fundamentally. An auto-attendant is a recording plus DTMF key presses. A modern agent is a pipeline of four neural networks with an LLM at its core that holds a free-flowing conversation, handles objections, and understands context.
How long does it take to process one utterance?
In streaming mode — 0.8–2 seconds from the moment the caller stops speaking to the start of the agent's reply. That's on par with a live rep's latency and doesn't read as "robotic."
Can I use only free, open models?
You can — for example Whisper Open Source for ASR, a self-hosted LLM (Llama 3.1, Mistral) and an open TTS engine. Quality will be below the best-in-class stack, but for simple scripts and low volumes it works.
Can the agent transfer the call to a human?
Yes. Via SIP redirect or the REFER method, the agent transfers the call without dropping it: the caller speaks with the agent, then is smoothly connected to a live rep who sees the contact record and the conversation history.
How do I train the agent for my specifics?
Through the system prompt plus a knowledge base. You spell out products, pricing, FAQs, common objections, and the call script. No model fine-tuning is needed — the LLM is flexible enough to work from instructions (in-context learning).