How to Build a Voice Sales Agent: From Setup to Production Ideas
- Jayant Upadhyaya
- Jan 21
- 8 min read
Voice agents are moving fast from “cool demo” to real business tool. Instead of typing into a chatbot, users can talk naturally, get answers instantly, and even be routed to the right specialist. For sales, this matters because voice is faster, more human, and closer to how real customer conversations happen.
This guide explains how to build a voice sales agent that can:
listen in real time
convert speech to text
understand what the customer wants
pull product info from external context (your sales materials)
speak back instantly in a natural voice
handle objections and keep the conversation on message
scale to multi-agent workflows (sales + pricing + technical specialist)
The approach shown here follows a workshop-style build that uses:
a real-time voice infrastructure layer (LiveKit)
a fast inference provider (Cerebras)
voice components for speech-to-text and text-to-speech (Cartesia)
voice activity detection and turn-taking logic (Silero VAD + end-of-turn prediction)
an agent SDK that wires it all together
Even if you swap providers later, the architecture stays the same.
Why Voice Agents Are Different From Chatbots

A chatbot is simple: user types a message, the system replies.
A voice agent is more complex because it must handle:
real-time audio streaming
interruptions
pauses and turn-taking
fast response so it feels natural
continuous state across the conversation
A good voice agent has no "send" button telling it the user is finished, the way a chatbot waits for a typed message. It has to detect when the user is done speaking, without interrupting on every pause, and it needs to start responding quickly, even while still "thinking."
That’s why voice agents are usually built as stateful systems that run multiple processes at the same time.
What You Build in This Project
The goal is a sales agent that can have natural conversations and use your company’s sales materials in real time.
By the end, the agent can:
greet visitors
answer product questions using provided context
respond with pricing details
handle common objections with pre-written responses
qualify leads and route deeper questions to specialists
speak in a responsive, streaming way
To reach that, it helps to first understand the main building blocks.
The 3 Phases Inside Every Voice Agent

A voice agent can be explained as three main phases that run in a loop:
1) Listening Phase (Speech to Text)
The agent listens to spoken audio and converts it into text in real time.
This includes:
Speech-to-Text (STT): converts voice → text
Voice Activity Detection (VAD): detects when someone is speaking
End-of-Turn Detection: decides when the user is actually done
End-of-turn detection matters because:
people pause mid-sentence
people think while speaking
if the agent interrupts too early, the experience feels bad
A common approach is:
VAD detects speech segments
a small model (often running on CPU) predicts if the user is done
the agent waits if it predicts the user will continue
This improves conversation flow a lot.
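Here is a toy sketch of that decision logic, with made-up stand-ins for the VAD and the end-of-turn model (a real build would use Silero VAD plus a small prediction model; the function names and threshold below are illustrative assumptions):

```python
# Hypothetical stand-ins for the real components.
def vad_detects_silence(audio_window) -> bool:
    """Pretend VAD: True when the current audio window contains no speech."""
    return audio_window == "silence"

def end_of_turn_probability(transcript_so_far: str) -> float:
    """Pretend end-of-turn model: low score if the sentence looks unfinished."""
    unfinished = transcript_so_far.rstrip().endswith(("and", "but", "so", ","))
    return 0.2 if unfinished else 0.9

def should_respond(audio_window, transcript_so_far, threshold=0.7) -> bool:
    # Only consider responding once VAD says the user stopped talking...
    if not vad_detects_silence(audio_window):
        return False
    # ...and only if the turn-prediction model thinks they are actually done.
    return end_of_turn_probability(transcript_so_far) >= threshold

print(should_respond("silence", "we need about fifty seats and"))  # False: keep waiting
print(should_respond("silence", "we need about fifty seats."))     # True: respond now
```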
2) Thinking Phase (LLM Reasoning + Context Retrieval)
Once the text is ready, the agent sends it to a large language model (LLM), which acts as the “brain.”
In this phase, the agent may:
interpret what the user wants
decide what to say
look up details from your company documents
call tools (like searching a knowledge base or checking stock)
A key point: LLMs are general-purpose. They know a lot, but they do not know your internal details unless you provide them.
For a sales agent, accuracy matters. You want correct pricing, correct product claims, and consistent messaging. That means the LLM must be grounded in your materials.
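As a rough illustration of the grounding step, a single "think" call against an OpenAI-compatible endpoint might look like the sketch below. The base URL, model id, API key, and placeholder context string are assumptions, not fixed parts of the build:

```python
from openai import OpenAI

# Any OpenAI-compatible endpoint works here; base_url and model are placeholders
# for whichever fast inference provider you use.
client = OpenAI(base_url="https://api.cerebras.ai/v1", api_key="YOUR_KEY")

COMPANY_CONTEXT = """
PRODUCT: ...        # in practice, load your real sales materials here
PRICING: ...
OBJECTION HANDLING: ...
"""

def think(user_utterance: str) -> str:
    response = client.chat.completions.create(
        model="llama-3.3-70b",  # example model id; check your provider's catalog
        messages=[
            # Grounding: company-specific claims must come from the context.
            {"role": "system", "content": (
                "You are a voice sales agent. Use only the context below for "
                "pricing and product claims; ask a clarifying question if unsure.\n\n"
                + COMPANY_CONTEXT
            )},
            {"role": "user", "content": user_utterance},
        ],
    )
    return response.choices[0].message.content
```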
3) Speaking Phase (Text to Speech)
The agent turns the response into audio and streams it back.
This usually uses:
Text-to-Speech (TTS): converts text → spoken audio
streaming output so the agent can start speaking before the full response is finished
Streaming is important. If the agent waits for the full response, it feels slow. If it speaks as tokens arrive, it feels responsive and human-like.
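A minimal sketch of that idea, using an OpenAI-compatible streaming call and a placeholder speak() function standing in for real streaming TTS (the endpoint and model id are again assumptions):

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.cerebras.ai/v1", api_key="YOUR_KEY")  # placeholder endpoint

def speak(sentence: str) -> None:
    """Stand-in for a streaming TTS call; a real agent streams audio to the caller."""
    print(f"[TTS] {sentence}")

def respond_streaming(messages: list) -> None:
    stream = client.chat.completions.create(
        model="llama-3.3-70b",  # example model id
        messages=messages,
        stream=True,
    )
    buffer = ""
    for chunk in stream:
        buffer += chunk.choices[0].delta.content or ""
        # Flush each completed sentence to TTS so speech starts before the reply ends.
        while True:
            cuts = [buffer.index(p) for p in ".!?" if p in buffer]
            if not cuts:
                break
            cut = min(cuts) + 1
            speak(buffer[:cut].strip())
            buffer = buffer[cut:]
    if buffer.strip():
        speak(buffer.strip())
```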
Why Real-Time Infrastructure Matters
Most of the web runs over HTTP, which was designed for discrete request-and-response exchanges. Voice agents need continuous, low-latency, two-way audio streaming. That is a different job.
Real-time voice agents typically use WebRTC, which is built for real-time audio and video.
A platform like LiveKit handles this layer. It:
manages audio streams
supports low latency (often under 100ms globally)
supports multiple concurrent sessions
can be self-hosted (it is open source)
provides an agent SDK so you don’t have to build the orchestration from scratch
In simple terms: LiveKit is the “phone line” and “call room” for your agent.
Why Inference Speed Matters for Voice
Voice is unforgiving. People notice even small delays.
If an agent takes too long to respond, users feel like it is broken or unnatural.
That’s why fast inference is a big deal for voice agents, especially in live conversations. The workshop build highlights Cerebras for this reason: fast token generation helps reduce the delay between user speech and agent response.
Two main factors affect speed:
hardware architecture
software decoding strategy
The Hardware Problem: Memory Bandwidth Bottlenecks

On GPUs like the H100, many values needed during inference (weights, activations, KV cache) are often stored off-chip. That means cores repeatedly fetch data from external memory. This creates memory bandwidth bottlenecks.
Cerebras takes a different approach:
a wafer-scale chip (very large physical size)
hundreds of thousands of cores
each core has direct on-chip memory (SRAM) next to it
data access is faster because it is local
The key concept: less time moving data around, more time computing.
For real-time voice, that can translate into a faster, smoother experience.
The Software Trick: Speculative Decoding
Standard decoding generates one token at a time, sequentially.
Speculative decoding speeds things up by using two models:
a small “draft” model generates tokens quickly
a larger model verifies them
This can improve speed while maintaining quality.
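To see why this helps, here is a deliberately toy, deterministic sketch. Real speculative decoding works on token probabilities with an accept/reject rule, but the shape of the loop is the same: the draft proposes a few tokens cheaply, and the target verifies them in a batch instead of generating each one sequentially.

```python
# Toy "models": fixed token sequences standing in for real LLMs.
TARGET = "our starter plan is twelve dollars per seat per month".split()
DRAFT  = "our starter plan is ten dollars per seat per month".split()

def target_next(prefix):            # expensive model (one pass can verify many drafts)
    return TARGET[len(prefix)] if len(prefix) < len(TARGET) else None

def draft_next(prefix):             # cheap model, sometimes wrong
    return DRAFT[len(prefix)] if len(prefix) < len(DRAFT) else None

def speculative_decode(k=4):
    out = []
    while target_next(out) is not None:
        proposal = list(out)
        for _ in range(k):                        # draft k tokens quickly
            tok = draft_next(proposal)
            if tok is None:
                break
            proposal.append(tok)
        if len(proposal) == len(out):             # draft produced nothing: ask the target
            out.append(target_next(out))
            continue
        for tok in proposal[len(out):]:           # target verifies the drafted tokens
            if tok == target_next(out):
                out.append(tok)                   # accepted
            else:
                out.append(target_next(out))      # first mismatch: take the target's token
                break
    return " ".join(out)

print(speculative_decode())  # -> "our starter plan is twelve dollars per seat per month"
```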
For voice, faster token generation means:
the agent starts speaking sooner
less awkward silence
smoother interaction
Step 1: Set Up the Agent Stack
A production-style voice agent needs a few core pieces:
LiveKit Agent SDK (or equivalent orchestration layer)
STT provider (speech to text)
TTS provider (text to speech)
VAD (voice activity detection)
LLM provider (the reasoning model)
a way to load company context (docs, pricing, objection handling)
In the workshop build, the stack includes:
LiveKit agents
Cartesia for speech components
Silero for VAD
an OpenAI-compatible interface layer
a hosted LLM (example: Llama 3.3)
The exact providers can change, but the roles stay the same.
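As a rough sketch of how the roles map onto code, here is what the wiring can look like with the LiveKit Agents SDK and its plugins. Class and parameter names follow recent v1.x versions and may differ in yours; the model id and endpoint are placeholders:

```python
# Sketch of wiring the components together; treat this as a shape, not a spec.
from livekit.agents import AgentSession
from livekit.plugins import cartesia, silero, openai

def build_session() -> AgentSession:
    return AgentSession(
        stt=cartesia.STT(),              # speech -> text (plugin availability varies by version)
        tts=cartesia.TTS(),              # text -> streamed speech
        vad=silero.VAD.load(),           # detects when someone is speaking
        llm=openai.LLM(                  # any OpenAI-compatible LLM endpoint
            model="llama-3.3-70b",
            base_url="https://api.cerebras.ai/v1",
        ),
    )
```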
Step 2: Teach the Sales Agent Your Business

LLMs are powerful but not automatically correct about your company.
If someone asks:
“What’s your pricing?”
“What’s included in the plan?”
“How do integrations work?”
“What’s your refund policy?”
A general model might:
guess
hallucinate
give vague answers
say “I don’t know”
The solution is to load context into the agent.
What should you load?
A strong sales agent typically needs:
Product information
product description
key features
who it’s for
what problem it solves
Pricing
plan names
price ranges
billing rules
trial details
any limits
Value and benefits
measurable results
outcomes
differentiators
Objection handling
Pre-written answers for common objections like:
“It’s too expensive”
“We already have a tool”
“We’re not ready”
“What about security?”
“How long does setup take?”
Objection handling is important because it keeps the agent consistent and prevents it from wandering off-message.
Structure matters
Instead of dumping a huge document, it helps to organize the context clearly, like:
sections
bullet-style info in the backend (even if you don’t want bullet points spoken aloud)
labeled answers the agent can retrieve quickly
The agent can then reference this context instead of improvising.
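For example, a structured context might look like the sketch below. All product names, prices, and answers here are invented placeholders:

```python
# One simple way to structure the sales context: clearly labeled sections the
# agent can be pointed at, instead of one undifferentiated document.
SALES_CONTEXT = {
    "product": {
        "name": "AcmeFlow",
        "description": "Workflow automation for mid-size support teams.",
        "key_features": ["no-code builder", "CRM integrations", "SLA dashboards"],
        "ideal_customer": "support teams of 20-200 agents",
    },
    "pricing": {
        "plans": {"Starter": "$49/seat/month", "Pro": "$89/seat/month"},
        "billing": "annual or monthly; annual saves 20%",
        "trial": "14-day free trial, no credit card",
    },
    "value": ["cuts ticket handling time by roughly 30%", "deploys in under a week"],
    "objections": {
        "too expensive": "Most teams offset the cost within two months through faster resolution. Would an ROI walkthrough help?",
        "already have a tool": "Understood. What is the one thing your current tool doesn't do well?",
        "security": "We are SOC 2 Type II compliant and data stays in your region.",
    },
}
```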
Step 3: Build the Sales Agent Class
At a high level, a voice sales agent class usually does four things:
1) Load the business context
This is often a function like load_context() that pulls:
product info
pricing
objection responses
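A minimal load_context() sketch, assuming the structured SALES_CONTEXT dictionary from the previous example; it flattens the sections into one labeled text block the LLM can be instructed to rely on:

```python
import json

def load_context(context: dict = SALES_CONTEXT) -> str:
    # Turn each labeled section into a block the model can quote from.
    sections = []
    for name, body in context.items():
        sections.append(f"## {name.upper()}\n{json.dumps(body, indent=2)}")
    return "\n\n".join(sections)
```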
2) Define the voice behavior rules
Voice responses should be spoken-friendly:
avoid bullet points in speech
keep sentences short
ask clarifying questions when needed
confirm details naturally
stay polite, helpful, and on-message
Also, an important rule: only use the provided context for company-specific claims.
This reduces hallucinations.
3) Initialize the main components
The agent is wired with:
STT
VAD
LLM
TTS
conversation state manager
This is usually done in a setup or super() call.
4) Start the conversation automatically
A method like on_enter() triggers when a user joins the call or room.
Instead of silence, the agent begins with a greeting, like a real salesperson would:
welcoming message
short question to understand intent
offer to answer questions
This small detail improves the experience a lot.
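Put together, a sketch of the class might look like this. It reuses load_context() from Step 2 and follows the Agent base-class shape from recent LiveKit Agents versions; exact method names can vary by SDK version:

```python
from livekit.agents import Agent

VOICE_RULES = """
You are a friendly sales agent speaking out loud.
- Keep sentences short; never read out bullet points or markdown.
- Only use the provided company context for pricing and product claims.
- If unsure, ask a clarifying question instead of guessing.
"""

class SalesAgent(Agent):
    def __init__(self) -> None:
        # Combine the voice behavior rules with the business context from Step 2.
        super().__init__(
            instructions=VOICE_RULES + "\n\nCOMPANY CONTEXT:\n" + load_context()
        )

    async def on_enter(self) -> None:
        # Runs when the agent joins the room: greet instead of staying silent.
        await self.session.generate_reply(
            instructions="Greet the visitor warmly, then ask one short question "
                         "about what brings them here today."
        )
```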
Step 4: Launch the Agent Session
A typical entry point function acts like a “start button.”
It usually does three things:
1) Connect to a LiveKit room. Think of it like joining a conference call.
2) Create an instance of the sales agent. This uses your configured STT, TTS, LLM, and context.
3) Start a session loop. This manages the back-and-forth conversation:
listen → transcribe → think → speak → repeat
Once running, the agent can handle real-time voice conversations.
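A sketch of that entry point, reusing build_session() from Step 1 and the SalesAgent class from Step 3; it follows the LiveKit Agents worker pattern, and details vary by SDK version:

```python
from livekit.agents import JobContext, WorkerOptions, cli

async def entrypoint(ctx: JobContext) -> None:
    await ctx.connect()                                       # 1) join the LiveKit room
    session = build_session()                                 # 2) STT/TTS/VAD/LLM wired in Step 1
    await session.start(room=ctx.room, agent=SalesAgent())    # 3) listen -> think -> speak loop

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```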
How to Make the Sales Agent More Robust

A single agent can handle basic sales conversations, but real sales calls can go deeper.
Two upgrades make voice agents much more useful:
Upgrade 1: Multi-Agent System (Specialists)
LLMs have limited context windows and limited “focus.”
A single agent trying to be:
greeter
qualifier
technical expert
pricing negotiator
often becomes messy.
A better approach is a team of agents, like a real company:
Greeting agent
welcomes the user
asks what they need
routes them correctly
Main sales agent
qualifies the lead
explains product basics
captures intent
Technical specialist agent
answers API and integration questions
handles deeper technical details
Pricing specialist agent
talks about budget, ROI
handles negotiation and plan comparisons
The key feature: handoff
The greeting agent detects intent and routes the user to the right specialist.
This improves:
accuracy
clarity
user trust
consistency
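Here is an SDK-agnostic sketch of the routing idea. Some agent SDKs support handoff natively (for example, by letting a tool call return the next agent); the plain routing table below only shows the concept, and the specialist descriptions are invented placeholders:

```python
class SpecialistAgent:
    """Placeholder: in a real build these would be Agent subclasses with
    their own instructions and their own slice of the company context."""
    def __init__(self, name: str, focus: str):
        self.name, self.focus = name, focus

SPECIALISTS = {
    "technical": SpecialistAgent("technical", "APIs, integrations, architecture"),
    "pricing":   SpecialistAgent("pricing", "budget, ROI, plan comparisons"),
    "general":   SpecialistAgent("sales", "qualification and product basics"),
}

def route(detected_intent: str) -> SpecialistAgent:
    # Fall back to the general sales agent when the intent is unclear.
    return SPECIALISTS.get(detected_intent, SPECIALISTS["general"])

print(route("pricing").focus)  # -> "budget, ROI, plan comparisons"
```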
Upgrade 2: Tool Calling
Sometimes the agent needs real-time facts that are not inside the static context.
Tool calling allows the agent to:
search documents
pull product specs from an external source
check pricing tables
query inventory or stock
fetch company policy text
retrieve integration docs
Tool calling makes the agent more dynamic and reduces “made up” answers.
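A sketch of a single tool. The function-tool decorator and RunContext type follow recent livekit-agents versions (names and signatures may differ in yours; other frameworks expose the same idea under different names), and the price table is an invented placeholder:

```python
from livekit.agents import Agent, RunContext, function_tool

class SalesAgentWithTools(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="Use tools for pricing lookups; never guess prices.")

    @function_tool()
    async def lookup_price(self, context: RunContext, plan_name: str) -> str:
        """Return the current price for a named plan."""
        # In production this would query a pricing service or database.
        prices = {"starter": "$49/seat/month", "pro": "$89/seat/month"}
        return prices.get(plan_name.lower(), "I don't have pricing for that plan.")
```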
Practical Notes for Production Use
If the goal is to actually use this in a real business, a few principles matter.
1) Keep responses grounded
For sales, incorrect claims are risky. Strong rules help:
use provided context for company specifics
ask clarifying questions when unsure
avoid guessing pricing or policies
2) Optimize for latency
Voice agents must feel immediate. The main ways to reduce delay:
faster inference (token speed)
streaming TTS
good end-of-turn detection
reliable low-latency voice transport (WebRTC)
3) Design for interruptions
People interrupt each other in real conversation. Your agent should:
stop speaking when user speaks
handle partial turns
continue smoothly
4) Log and improve
Store transcripts and outcomes (with permission and privacy controls) so you can:
find where the agent fails
add better objection handling
update sales materials
improve routing logic
Closing Thought: Why Voice Sales Agents Are Taking Off
Voice agents work well in sales because they reduce friction. Users can simply speak. No typing. No menus. No learning curve.
For businesses, voice agents can:
answer questions instantly
qualify leads 24/7
keep messaging consistent
route technical questions to specialists
scale customer conversations without scaling headcount
The most important part is not just the model. It is the system: real-time voice transport, turn-taking, context grounding, and careful orchestration.
Build those pieces correctly, and a voice sales agent can feel surprisingly close to a real conversation.