How to Build a Voice Sales Agent: From Setup to Production Ideas
- Jayant Upadhyaya
- Jan 21
- 8 min read
Voice agents are moving fast from “cool demo” to real business tool. Instead of typing into a chatbot, users can talk naturally, get answers instantly, and even be routed to the right specialist. For sales, this matters because voice is faster, more human, and closer to how real customer conversations happen.
This guide explains how to build a voice sales agent that can:
listen in real time
convert speech to text
understand what the customer wants
pull product info from external context (your sales materials)
speak back instantly in a natural voice
handle objections and keep the conversation on message
scale to multi-agent workflows (sales + pricing + technical specialist)
The approach shown here follows a workshop-style build that uses:
a real-time voice infrastructure layer (LiveKit)
a fast inference provider (Cerebras)
voice components for speech-to-text and text-to-speech (Cartesia)
voice activity detection and turn-taking logic (Silero VAD + end-of-turn prediction)
an agent SDK that wires it all together
Even if you swap providers later, the architecture stays the same.
Why Voice Agents Are Different From Chatbots

A chatbot is simple: user types a message, the system replies.
A voice agent is more complex because it must handle:
real-time audio streaming
interruptions
pauses and turn-taking
fast response so it feels natural
continuous state across the conversation
A good voice agent has no "send" button telling it the user is finished, the way a chatbot waits for a typed message. It has to detect when the user is done speaking, without interrupting on every pause, and it needs to start responding quickly, even while still "thinking."
That’s why voice agents are usually built as stateful systems that run multiple processes at the same time.
What You Build in This Project
The goal is a sales agent that can have natural conversations and use your company’s sales materials in real time.
By the end, the agent can:
greet visitors
answer product questions using provided context
respond with pricing details
handle common objections with pre-written responses
qualify leads and route deeper questions to specialists
speak in a responsive, streaming way
To reach that, it helps to first understand the main building blocks.
The 3 Phases Inside Every Voice Agent

A voice agent can be explained as three main phases that run in a loop:
1) Listening Phase (Speech to Text)
The agent listens to spoken audio and converts it into text in real time.
This includes:
Speech-to-Text (STT): converts voice → text
Voice Activity Detection (VAD): detects when someone is speaking
End-of-Turn Detection: decides when the user is actually done
End-of-turn detection matters because:
people pause mid-sentence
people think while speaking
if the agent interrupts too early, the experience feels bad
A common approach is:
VAD detects speech segments
a small model (often running on CPU) predicts if the user is done
the agent waits if it predicts the user will continue
This improves conversation flow a lot.
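Here is a toy sketch of that decision logic, with made-up stand-ins for the VAD and the end-of-turn model (a real build would use Silero VAD plus a small prediction model; the function names and threshold below are illustrative assumptions):

```python
# Hypothetical stand-ins for the real components.
def vad_detects_silence(audio_window) -> bool:
    """Pretend VAD: True when the current audio window contains no speech."""
    return audio_window == "silence"

def end_of_turn_probability(transcript_so_far: str) -> float:
    """Pretend end-of-turn model: low score if the sentence looks unfinished."""
    unfinished = transcript_so_far.rstrip().endswith(("and", "but", "so", ","))
    return 0.2 if unfinished else 0.9

def should_respond(audio_window, transcript_so_far, threshold=0.7) -> bool:
    # Only consider responding once VAD says the user stopped talking...
    if not vad_detects_silence(audio_window):
        return False
    # ...and only if the turn-prediction model thinks they are actually done.
    return end_of_turn_probability(transcript_so_far) >= threshold

print(should_respond("silence", "we need about fifty seats and"))  # False: keep waiting
print(should_respond("silence", "we need about fifty seats."))     # True: respond now
```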
2) Thinking Phase (LLM Reasoning + Context Retrieval)
Once the text is ready, the agent sends it to a large language model (LLM), which acts as the “brain.”
In this phase, the agent may:
interpret what the user wants
decide what to say
look up details from your company documents
call tools (like searching a knowledge base or checking stock)
A key point: LLMs are general-purpose. They know a lot, but they do not know your internal details unless you provide them.
For a sales agent, accuracy matters. You want correct pricing, correct product claims, and consistent messaging. That means the LLM must be grounded in your materials.
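As a rough illustration of the grounding step, a single "think" call against an OpenAI-compatible endpoint might look like the sketch below. The base URL, model id, API key, and placeholder context string are assumptions, not fixed parts of the build:

```python
from openai import OpenAI

# Any OpenAI-compatible endpoint works here; base_url and model are placeholders
# for whichever fast inference provider you use.
client = OpenAI(base_url="https://api.cerebras.ai/v1", api_key="YOUR_KEY")

COMPANY_CONTEXT = """
PRODUCT: ...        # in practice, load your real sales materials here
PRICING: ...
OBJECTION HANDLING: ...
"""

def think(user_utterance: str) -> str:
    response = client.chat.completions.create(
        model="llama-3.3-70b",  # example model id; check your provider's catalog
        messages=[
            # Grounding: company-specific claims must come from the context.
            {"role": "system", "content": (
                "You are a voice sales agent. Use only the context below for "
                "pricing and product claims; ask a clarifying question if unsure.\n\n"
                + COMPANY_CONTEXT
            )},
            {"role": "user", "content": user_utterance},
        ],
    )
    return response.choices[0].message.content
```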
3) Speaking Phase (Text to Speech)
The agent turns the response into audio and streams it back.
This usually uses:
Text-to-Speech (TTS): converts text → spoken audio
streaming output so the agent can start speaking before the full response is finished
Streaming is important. If the agent waits for the full response, it feels slow. If it speaks as tokens arrive, it feels responsive and human-like.
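A minimal sketch of that idea, using an OpenAI-compatible streaming call and a placeholder speak() function standing in for real streaming TTS (the endpoint and model id are again assumptions):

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.cerebras.ai/v1", api_key="YOUR_KEY")  # placeholder endpoint

def speak(sentence: str) -> None:
    """Stand-in for a streaming TTS call; a real agent streams audio to the caller."""
    print(f"[TTS] {sentence}")

def respond_streaming(messages: list) -> None:
    stream = client.chat.completions.create(
        model="llama-3.3-70b",  # example model id
        messages=messages,
        stream=True,
    )
    buffer = ""
    for chunk in stream:
        buffer += chunk.choices[0].delta.content or ""
        # Flush each completed sentence to TTS so speech starts before the reply ends.
        while True:
            cuts = [buffer.index(p) for p in ".!?" if p in buffer]
            if not cuts:
                break
            cut = min(cuts) + 1
            speak(buffer[:cut].strip())
            buffer = buffer[cut:]
    if buffer.strip():
        speak(buffer.strip())
```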
Why Real-Time Infrastructure Matters
Most of the web runs over HTTP, which was designed for discrete request-and-response exchanges. Voice agents need continuous, low-latency, two-way audio streaming. That is a different job.
Real-time voice agents typically use WebRTC, which is built for real-time audio and video.
A platform like LiveKit handles this layer. It:
manages audio streams
supports low latency (often under 100ms globally)
supports multiple concurrent sessions
can be self-hosted (it is open source)
provides an agent SDK so you don’t have to build the orchestration from scratch
In simple terms: LiveKit is the “phone line” and “call room” for your agent.
Why Inference Speed Matters for Voice
Voice is unforgiving. People notice even small delays.
If an agent takes too long to respond, users feel like it is broken or unnatural.
That’s why fast inference is a big deal for voice agents, especially in live conversations. The workshop build highlights Cerebras for this reason: fast token generation helps reduce the delay between user speech and agent response.
Two main factors affect speed:
hardware architecture
software decoding strategy
The Hardware Problem: Memory Bandwidth Bottlenecks

On GPUs like the H100, many values needed during inference (weights, activations, KV cache) are often stored off-chip. That means cores repeatedly fetch data from external memory. This creates memory bandwidth bottlenecks.
Cerebras takes a different approach:
a wafer-scale chip (very large physical size)
hundreds of thousands of cores
each core has direct on-chip memory (SRAM) next to it
data access is faster because it is local
The key concept: less time moving data around, more time computing.
For real-time voice, that can translate into a faster, smoother experience.
The Software Trick: Speculative Decoding
Standard decoding generates one token at a time, sequentially.
Speculative decoding speeds things up by using two models:
a small “draft” model generates tokens quickly
a larger model verifies them
This can improve speed while maintaining quality.
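To see why this helps, here is a deliberately toy, deterministic sketch. Real speculative decoding works on token probabilities with an accept/reject rule, but the shape of the loop is the same: the draft proposes a few tokens cheaply, and the target verifies them in a batch instead of generating each one sequentially.

```python
# Toy "models": fixed token sequences standing in for real LLMs.
TARGET = "our starter plan is twelve dollars per seat per month".split()
DRAFT  = "our starter plan is ten dollars per seat per month".split()

def target_next(prefix):            # expensive model (one pass can verify many drafts)
    return TARGET[len(prefix)] if len(prefix) < len(TARGET) else None

def draft_next(prefix):             # cheap model, sometimes wrong
    return DRAFT[len(prefix)] if len(prefix) < len(DRAFT) else None

def speculative_decode(k=4):
    out = []
    while target_next(out) is not None:
        proposal = list(out)
        for _ in range(k):                        # draft k tokens quickly
            tok = draft_next(proposal)
            if tok is None:
                break
            proposal.append(tok)
        if len(proposal) == len(out):             # draft produced nothing: ask the target
            out.append(target_next(out))
            continue
        for tok in proposal[len(out):]:           # target verifies the drafted tokens
            if tok == target_next(out):
                out.append(tok)                   # accepted
            else:
                out.append(target_next(out))      # first mismatch: take the target's token
                break
    return " ".join(out)

print(speculative_decode())  # -> "our starter plan is twelve dollars per seat per month"
```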
For voice, faster token generation means:
the agent starts speaking sooner
less awkward silence
smoother interaction
Step 1: Set Up the Agent Stack
A production-style voice agent needs a few core pieces:
LiveKit Agent SDK (or equivalent orchestration layer)
STT provider (speech to text)
TTS provider (text to speech)
VAD (voice activity detection)
LLM provider (the reasoning model)
a way to load company context (docs, pricing, objection handling)
In the workshop build, the stack includes:
LiveKit agents
Cartesia for speech components
Silero for VAD
an OpenAI-compatible interface layer
a hosted LLM (example: Llama 3.3)
The exact providers can change, but the roles stay the same.
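As a rough sketch of how the roles map onto code, here is what the wiring can look like with the LiveKit Agents SDK and its plugins. Class and parameter names follow recent v1.x versions and may differ in yours; the model id and endpoint are placeholders:

```python
# Sketch of wiring the components together; treat this as a shape, not a spec.
from livekit.agents import AgentSession
from livekit.plugins import cartesia, silero, openai

def build_session() -> AgentSession:
    return AgentSession(
        stt=cartesia.STT(),              # speech -> text (plugin availability varies by version)
        tts=cartesia.TTS(),              # text -> streamed speech
        vad=silero.VAD.load(),           # detects when someone is speaking
        llm=openai.LLM(                  # any OpenAI-compatible LLM endpoint
            model="llama-3.3-70b",
            base_url="https://api.cerebras.ai/v1",
        ),
    )
```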
Step 2: Teach the Sales Agent Your Business

LLMs are powerful but not automatically correct about your company.
If someone asks:
“What’s your pricing?”
“What’s included in the plan?”
“How do integrations work?”
“What’s your refund policy?”
A general model might:
guess
hallucinate
give vague answers
say “I don’t know”
The solution is to load context into the agent.
What should you load?
A strong sales agent typically needs:
Product information
product description
key features
who it’s for
what problem it solves
Pricing
plan names
price ranges
billing rules
trial details
any limits
Value and benefits
measurable results
outcomes
differentiators
Objection handling
Pre-written answers for common objections like:
“It’s too expensive”
“We already have a tool”
“We’re not ready”
“What about security?”
“How long does setup take?”
Objection handling is important because it keeps the agent consistent and prevents it from wandering off-message.
Structure matters
Instead of dumping a huge document, it helps to organize the context clearly, like:
sections
bullet-style info in the backend (even if you don’t want bullet points spoken aloud)
labeled answers the agent can retrieve quickly
The agent can then reference this context instead of improvising.
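For example, a structured context might look like the sketch below. All product names, prices, and answers here are invented placeholders:

```python
# One simple way to structure the sales context: clearly labeled sections the
# agent can be pointed at, instead of one undifferentiated document.
SALES_CONTEXT = {
    "product": {
        "name": "AcmeFlow",
        "description": "Workflow automation for mid-size support teams.",
        "key_features": ["no-code builder", "CRM integrations", "SLA dashboards"],
        "ideal_customer": "support teams of 20-200 agents",
    },
    "pricing": {
        "plans": {"Starter": "$49/seat/month", "Pro": "$89/seat/month"},
        "billing": "annual or monthly; annual saves 20%",
        "trial": "14-day free trial, no credit card",
    },
    "value": ["cuts ticket handling time by roughly 30%", "deploys in under a week"],
    "objections": {
        "too expensive": "Most teams offset the cost within two months through faster resolution. Would an ROI walkthrough help?",
        "already have a tool": "Understood. What is the one thing your current tool doesn't do well?",
        "security": "We are SOC 2 Type II compliant and data stays in your region.",
    },
}
```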
Step 3: Build the Sales Agent Class
At a high level, a voice sales agent class usually does four things:
1) Load the business context
This is often a function like load_context() that pulls:
product info
pricing
objection responses
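A minimal load_context() sketch, assuming the structured SALES_CONTEXT dictionary from the previous example; it flattens the sections into one labeled text block the LLM can be instructed to rely on:

```python
import json

def load_context(context: dict = SALES_CONTEXT) -> str:
    # Turn each labeled section into a block the model can quote from.
    sections = []
    for name, body in context.items():
        sections.append(f"## {name.upper()}\n{json.dumps(body, indent=2)}")
    return "\n\n".join(sections)
```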
2) Define the voice behavior rules
Voice responses should be spoken-friendly:
avoid bullet points in speech
keep sentences short
ask clarifying questions when needed
confirm details naturally
stay polite, helpful, and on-message
Also, an important rule: only use the provided context for company-specific claims.
This reduces hallucinations.
3) Initialize the main components
The agent is wired with:
STT
VAD
LLM
TTS
conversation state manager
This is usually done in a setup or super() call.
4) Start the conversation automatically
A method like on_enter() triggers when a user joins the call or room.
Instead of silence, the agent begins with a greeting, like a real salesperson would:
welcoming message
short question to understand intent
offer to answer questions
This small detail improves the experience a lot.
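Put together, a sketch of the class might look like this. It reuses load_context() from Step 2 and follows the Agent base-class shape from recent LiveKit Agents versions; exact method names can vary by SDK version:

```python
from livekit.agents import Agent

VOICE_RULES = """
You are a friendly sales agent speaking out loud.
- Keep sentences short; never read out bullet points or markdown.
- Only use the provided company context for pricing and product claims.
- If unsure, ask a clarifying question instead of guessing.
"""

class SalesAgent(Agent):
    def __init__(self) -> None:
        # Combine the voice behavior rules with the business context from Step 2.
        super().__init__(
            instructions=VOICE_RULES + "\n\nCOMPANY CONTEXT:\n" + load_context()
        )

    async def on_enter(self) -> None:
        # Runs when the agent joins the room: greet instead of staying silent.
        await self.session.generate_reply(
            instructions="Greet the visitor warmly, then ask one short question "
                         "about what brings them here today."
        )
```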
Step 4: Launch the Agent Session
A typical entry point function acts like a “start button.”
It usually does three things:
1) Connect to a LiveKit room. Think of it like joining a conference call.
2) Create an instance of the sales agent. This uses your configured STT, TTS, LLM, and context.
3) Start a session loop. This manages the back-and-forth conversation:
listen → transcribe → think → speak → repeat
Once running, the agent can handle real-time voice conversations.
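A sketch of that entry point, reusing build_session() from Step 1 and the SalesAgent class from Step 3; it follows the LiveKit Agents worker pattern, and details vary by SDK version:

```python
from livekit.agents import JobContext, WorkerOptions, cli

async def entrypoint(ctx: JobContext) -> None:
    await ctx.connect()                                       # 1) join the LiveKit room
    session = build_session()                                 # 2) STT/TTS/VAD/LLM wired in Step 1
    await session.start(room=ctx.room, agent=SalesAgent())    # 3) listen -> think -> speak loop

if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```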
How to Make the Sales Agent More Robust

A single agent can handle basic sales conversations, but real sales calls can go deeper.
Two upgrades make voice agents much more useful:
Upgrade 1: Multi-Agent System (Specialists)
LLMs have limited context windows and limited “focus.”
A single agent trying to be:
greeter
qualifier
technical expert
pricing negotiator
often becomes messy.
A better approach is a team of agents, like a real company:
Greeting agent
welcomes the user
asks what they need
routes them correctly
Main sales agent
qualifies the lead
explains product basics
captures intent
Technical specialist agent
answers API and integration questions
handles deeper technical details
Pricing specialist agent
talks about budget, ROI
handles negotiation and plan comparisons
The key feature: handoff
The greeting agent detects intent and routes the user to the right specialist.
This improves:
accuracy
clarity
user trust
consistency
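Here is an SDK-agnostic sketch of the routing idea. Some agent SDKs support handoff natively (for example, by letting a tool call return the next agent); the plain routing table below only shows the concept, and the specialist descriptions are invented placeholders:

```python
class SpecialistAgent:
    """Placeholder: in a real build these would be Agent subclasses with
    their own instructions and their own slice of the company context."""
    def __init__(self, name: str, focus: str):
        self.name, self.focus = name, focus

SPECIALISTS = {
    "technical": SpecialistAgent("technical", "APIs, integrations, architecture"),
    "pricing":   SpecialistAgent("pricing", "budget, ROI, plan comparisons"),
    "general":   SpecialistAgent("sales", "qualification and product basics"),
}

def route(detected_intent: str) -> SpecialistAgent:
    # Fall back to the general sales agent when the intent is unclear.
    return SPECIALISTS.get(detected_intent, SPECIALISTS["general"])

print(route("pricing").focus)  # -> "budget, ROI, plan comparisons"
```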
Upgrade 2: Tool Calling
Sometimes the agent needs real-time facts that are not inside the static context.
Tool calling allows the agent to:
search documents
pull product specs from an external source
check pricing tables
query inventory or stock
fetch company policy text
retrieve integration docs
Tool calling makes the agent more dynamic and reduces “made up” answers.
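A sketch of a single tool. The function-tool decorator and RunContext type follow recent livekit-agents versions (names and signatures may differ in yours; other frameworks expose the same idea under different names), and the price table is an invented placeholder:

```python
from livekit.agents import Agent, RunContext, function_tool

class SalesAgentWithTools(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="Use tools for pricing lookups; never guess prices.")

    @function_tool()
    async def lookup_price(self, context: RunContext, plan_name: str) -> str:
        """Return the current price for a named plan."""
        # In production this would query a pricing service or database.
        prices = {"starter": "$49/seat/month", "pro": "$89/seat/month"}
        return prices.get(plan_name.lower(), "I don't have pricing for that plan.")
```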
Practical Notes for Production Use
If the goal is to actually use this in a real business, a few principles matter.
1) Keep responses grounded
For sales, incorrect claims are risky. Strong rules help:
use provided context for company specifics
ask clarifying questions when unsure
avoid guessing pricing or policies
2) Optimize for latency
Voice agents must feel immediate. The main ways to reduce delay:
faster inference (token speed)
streaming TTS
good end-of-turn detection
reliable low-latency voice transport (WebRTC)
3) Design for interruptions
People interrupt each other in real conversation. Your agent should:
stop speaking when user speaks
handle partial turns
continue smoothly
4) Log and improve
Store transcripts and outcomes (with permission and privacy controls) so you can:
find where the agent fails
add better objection handling
update sales materials
improve routing logic
Closing Thought: Why Voice Sales Agents Are Taking Off
Voice agents work well in sales because they reduce friction. Users can simply speak. No typing. No menus. No learning curve.
For businesses, voice agents can:
answer questions instantly
qualify leads 24/7
keep messaging consistent
route technical questions to specialists
scale customer conversations without scaling headcount
The most important part is not just the model. It is the system: real-time voice transport, turn-taking, context grounding, and careful orchestration.
Build those pieces correctly, and a voice sales agent can feel surprisingly close to a real conversation.