How to Build a Voice Sales Agent: From Setup to Production Ideas

  • Writer: Jayant Upadhyaya
  • Jan 21
  • 8 min read

Voice agents are moving fast from “cool demo” to a real business tool. Instead of typing into a chatbot, users can talk naturally, get answers instantly, and even be routed to the right specialist. For sales, this matters because voice is faster, more human, and closer to how real customer conversations happen.


This guide explains how to build a voice sales agent that can:

  • listen in real time

  • convert speech to text

  • understand what the customer wants

  • pull product info from external context (your sales materials)

  • speak back instantly in a natural voice

  • handle objections and keep the conversation on message

  • scale to multi-agent workflows (sales + pricing + technical specialist)


The approach shown here follows a workshop-style build that uses:

  • a real-time voice infrastructure layer (LiveKit)

  • a fast inference provider (Cerebras)

  • voice components for speech-to-text and text-to-speech (Cartesia)

  • voice activity detection and turn-taking logic (Silero VAD + end-of-turn prediction)

  • an agent SDK that wires it all together


Even if you swap providers later, the architecture stays the same.


Why Voice Agents Are Different From Chatbots


Image: smartphone with text bubbles on the left, a person speaking into a microphone with sound waves on the right (AI image generated by Gemini)

A chatbot is simple: user types a message, the system replies.


A voice agent is more complex because it must handle:

  • real-time audio streaming

  • interruptions

  • pauses and turn-taking

  • fast response so it feels natural

  • continuous state across the conversation


A voice agent has no “send” button to tell it the user is finished. It has to detect when the user is done speaking, without interrupting on every pause. It also needs to start responding quickly, even while still “thinking.”


That’s why voice agents are usually built as stateful systems that run multiple processes at the same time.


What You Build in This Project


The goal is a sales agent that can have natural conversations and use your company’s sales materials in real time.


By the end, the agent can:

  • greet visitors

  • answer product questions using provided context

  • respond with pricing details

  • handle common objections with pre-written responses

  • qualify leads and route deeper questions to specialists

  • speak in a responsive, streaming way


To reach that, it helps to first understand the main building blocks.


The 3 Phases Inside Every Voice Agent


Image: diagram of “Listening,” “Reasoning & Context,” and “Speech Output” arranged in a cycle (AI image generated by Gemini)

A voice agent can be explained as three main phases that run in a loop:


1) Listening Phase (Speech to Text)

The agent listens to spoken audio and converts it into text in real time.


This includes:

  • Speech-to-Text (STT): converts voice → text

  • Voice Activity Detection (VAD): detects when someone is speaking

  • End-of-Turn Detection: decides when the user is actually done


End-of-turn detection matters because:

  • people pause mid-sentence

  • people think while speaking

  • if the agent interrupts too early, the experience feels bad


A common approach is:

  • VAD detects speech segments

  • a small model (often running on CPU) predicts if the user is done

  • the agent waits if it predicts the user will continue


This improves conversation flow a lot.
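
The approach above can be sketched as a small decision function. This is a minimal illustration, not the Silero or LiveKit API: the class name, the thresholds, and the `p_done` score (the end-of-turn model’s estimate that the user has finished) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TurnDetector:
    """Hypothetical end-of-turn gate: VAD silence plus a model score."""

    min_silence_ms: int = 300    # assumed: shorter pauses are ignored
    max_silence_ms: int = 1500   # assumed: always respond after this much silence
    done_threshold: float = 0.6  # assumed cutoff for the end-of-turn model

    def should_respond(self, silence_ms: int, p_done: float) -> bool:
        # The user is still speaking, or the pause is too short to judge.
        if silence_ms < self.min_silence_ms:
            return False
        # A long silence ends the turn regardless of the model's prediction.
        if silence_ms >= self.max_silence_ms:
            return True
        # In between, defer to the end-of-turn model.
        return p_done >= self.done_threshold


detector = TurnDetector()
print(detector.should_respond(silence_ms=500, p_done=0.2))  # mid-sentence pause
print(detector.should_respond(silence_ms=500, p_done=0.9))  # likely finished
```

The upper silence limit is a safety net: even if the model keeps predicting that the user will continue, the agent eventually takes its turn.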


2) Thinking Phase (LLM Reasoning + Context Retrieval)

Once the text is ready, the agent sends it to a large language model (LLM), which acts as the “brain.”


In this phase, the agent may:

  • interpret what the user wants

  • decide what to say

  • look up details from your company documents

  • call tools (like searching a knowledge base or checking stock)


A key point: LLMs are general-purpose. They know a lot, but they do not know your internal details unless you provide them.


For a sales agent, accuracy matters. You want correct pricing, correct product claims, and consistent messaging. That means the LLM must be grounded in your materials.


3) Speaking Phase (Text to Speech)

The agent turns the response into audio and streams it back.


This usually uses:

  • Text-to-Speech (TTS): converts text → spoken audio

  • streaming output so the agent can start speaking before the full response is finished


Streaming is important. If the agent waits for the full response, it feels slow. If it speaks as tokens arrive, it feels responsive and human-like.
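
One common way to get this effect is to cut the LLM’s token stream into sentences and hand each sentence to TTS as soon as it is complete. The sketch below assumes tokens arrive as plain strings; the function name and the simple punctuation rule are illustrative, and a real splitter would need to handle cases like decimals and abbreviations.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as the token stream completes it."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush whatever is left when the stream ends without punctuation.
    if buffer.strip():
        yield buffer.strip()


stream = ["Sure", ".", " Our", " plan", " starts", " soon", "!", " Any", " questions"]
for sentence in sentences_from_tokens(stream):
    print(sentence)  # each sentence could be sent to TTS here
```

With this pattern, the first sentence can be playing while the model is still generating the rest of the reply.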


Why Real-Time Infrastructure Matters


HTTP, the web’s most common protocol, was designed for request-and-response exchanges: the client asks, the server answers. Voice agents need continuous, low-latency streaming in both directions. That is a different job.


Real-time voice agents typically use WebRTC, which is built for real-time audio and video.


A platform like LiveKit handles this layer. It:

  • manages audio streams

  • supports low latency (often under 100ms globally)

  • supports multiple concurrent sessions

  • can be self-hosted (it is open source)

  • provides an agent SDK so you don’t have to build the orchestration from scratch


In simple terms: LiveKit is the “phone line” and “call room” for your agent.


Why Inference Speed Matters for Voice


Voice is unforgiving. People notice even small delays.


If an agent takes too long to respond, users feel like it is broken or unnatural.

That’s why fast inference is a big deal for voice agents, especially in live conversations. The workshop build highlights Cerebras for this reason: fast token generation helps reduce the delay between user speech and agent response.


Two main factors affect speed:

  1. hardware architecture

  2. software decoding strategy


The Hardware Problem: Memory Bandwidth Bottlenecks


Image: diagram comparing a traditional GPU with slow data flow (left) to integrated on-chip memory with fast data movement (right) (AI image generated by Gemini)

On GPUs like the H100, much of the data needed during inference (weights, activations, KV cache) is stored off-chip. That means cores repeatedly fetch data from external memory, so memory bandwidth, rather than raw compute, often becomes the bottleneck.


Cerebras takes a different approach:

  • a wafer-scale chip (very large physical size)

  • hundreds of thousands of cores

  • each core has direct on-chip memory (SRAM) next to it

  • data access is faster because it is local


The key concept: less time moving data around, more time computing.

For real-time voice, that can translate into a faster, smoother experience.
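
A rough back-of-the-envelope calculation shows why bandwidth matters. If generating each token requires reading every model weight once from memory, then bandwidth divided by model size gives an upper bound on tokens per second for a single conversation. The numbers below are illustrative assumptions (a 70-billion-parameter model in 16-bit precision and a round 3 TB/s of memory bandwidth), not measurements of any specific chip.

```python
def max_decode_tokens_per_second(
    param_count: float,
    bytes_per_param: float,
    bandwidth_bytes_per_s: float,
) -> float:
    """Upper bound when every token requires one full read of the weights."""
    model_bytes = param_count * bytes_per_param
    return bandwidth_bytes_per_s / model_bytes


# Illustrative inputs only; check vendor specs for real figures.
bound = max_decode_tokens_per_second(
    param_count=70e9,            # assumed model size
    bytes_per_param=2,           # 16-bit weights
    bandwidth_bytes_per_s=3e12,  # assumed off-chip memory bandwidth
)
print(f"{bound:.1f} tokens/s upper bound for a single stream")
```

Batching, caching, and multi-chip setups change the picture in practice, but the estimate shows why keeping data close to the cores helps.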


The Software Trick: Speculative Decoding


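Standard decoding generates one token at a time, sequentially.

Speculative decoding speeds things up by using two models:

  • a small “draft” model proposes several tokens quickly

  • a larger model checks those tokens in a single pass and keeps the ones it agrees with


When the draft is usually right, several tokens are accepted per pass of the larger model, so output arrives faster. Rejected tokens are replaced by the larger model’s own choice, which is how quality is maintained.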

For voice, faster token generation means:

  • the agent starts speaking sooner

  • less awkward silence

  • smoother interaction
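
The draft-and-verify idea can be shown with a toy example. This is a simplified greedy version in which both “models” are plain Python functions over a fixed sentence; real systems score all draft positions in one batched forward pass and use probabilistic acceptance rules. All names here are hypothetical.

```python
from typing import Callable, List

NextToken = Callable[[List[str]], str]  # toy "model": context -> next token

SENTENCE = "our starter plan includes five seats".split()


def target_model(context: List[str]) -> str:
    """Toy large model: always continues the fixed sentence correctly."""
    return SENTENCE[len(context)] if len(context) < len(SENTENCE) else "<eos>"


def draft_model(context: List[str]) -> str:
    """Toy small model: agrees with the target except for one word."""
    token = target_model(context)
    return "ten" if token == "five" else token


def speculative_step(context: List[str], draft: NextToken,
                     target: NextToken, k: int = 3) -> List[str]:
    """One round: draft proposes k tokens, target keeps the agreeing prefix."""
    proposed: List[str] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    accepted: List[str] = []
    for token in proposed:
        expected = target(context + accepted)
        if token != expected:
            # First mismatch: use the target's token and stop this round.
            accepted.append(expected)
            return accepted
        accepted.append(token)
    # Every draft token matched, so the target adds one more token.
    accepted.append(target(context + accepted))
    return accepted


print(speculative_step([], draft_model, target_model))            # 4 tokens in one round
print(speculative_step(SENTENCE[:4], draft_model, target_model))  # draft rejected
```

In this greedy form the output always matches what the larger model would have produced alone; the draft model only changes how many tokens are produced per round.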


Step 1: Set Up the Agent Stack


A production-style voice agent needs a few core pieces:

  • LiveKit Agent SDK (or equivalent orchestration layer)

  • STT provider (speech to text)

  • TTS provider (text to speech)

  • VAD (voice activity detection)

  • LLM provider (the reasoning model)

  • a way to load company context (docs, pricing, objection handling)


In the workshop build, the stack includes:

  • LiveKit agents

  • Cartesia for speech components

  • Silero for VAD

  • an OpenAI-compatible interface layer

  • a hosted LLM (example: Llama 3.3)


The exact providers can change, but the roles stay the same.
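
One way to keep roles and providers separate is a small config object. The sketch below is only an illustration of that idea: the class and field names are made up for the example and are not part of any SDK.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VoiceStack:
    """Hypothetical config: one field per role, provider names as values."""

    transport: str
    stt: str
    tts: str
    vad: str
    llm: str


# The workshop stack described above.
workshop = VoiceStack(
    transport="LiveKit",
    stt="Cartesia",
    tts="Cartesia",
    vad="Silero",
    llm="Llama 3.3 via an OpenAI-compatible endpoint",
)

# Swapping one provider leaves every other role untouched.
swapped = replace(workshop, llm="another-hosted-model")
print(swapped)
```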


Step 2: Teach the Sales Agent Your Business


Image: microphone with sound waves, connected to icons for product info, pricing, FAQs, and objection handling (AI image generated by Gemini)

LLMs are powerful but not automatically correct about your company.


If someone asks:

  • “What’s your pricing?”

  • “What’s included in the plan?”

  • “How do integrations work?”

  • “What’s your refund policy?”


A general model might:

  • guess

  • hallucinate

  • give vague answers

  • say “I don’t know”


The solution is to load context into the agent.


What should you load?

A strong sales agent typically needs:


Product information

  • product description

  • key features

  • who it’s for

  • what problem it solves


Pricing

  • plan names

  • price ranges

  • billing rules

  • trial details

  • any limits


Value and benefits

  • measurable results

  • outcomes

  • differentiators


Objection handling

Pre-written answers for common objections like:

  • “It’s too expensive”

  • “We already have a tool”

  • “We’re not ready”

  • “What about security?”

  • “How long does setup take?”


Objection handling is important because it keeps the agent consistent and prevents it from wandering off-message.


Structure matters

Instead of dumping a huge document, it helps to organize the context clearly, like:

  • sections

  • bullet-style info in the backend (even if you don’t want bullet points spoken aloud)

  • labeled answers the agent can retrieve quickly


The agent can then reference this context instead of improvising.
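
A minimal sketch of what organized context might look like. Everything here is placeholder data for a made-up company, and the helper names are assumptions; the point is the shape: labeled sections with short, retrievable answers.

```python
# Placeholder sales materials for a fictional company.
SALES_CONTEXT = {
    "product": {
        "description": "ExampleCo is a scheduling tool for small clinics.",
        "who it is for": "Clinics with one to twenty staff.",
    },
    "pricing": {
        "starter plan": "Placeholder price, billed monthly.",
        "trial": "Placeholder trial length.",
    },
    "objections": {
        "too expensive": "Acknowledge the concern, then explain the time saved.",
        "already have a tool": "Ask what works today and what is missing.",
    },
}


def render_context(context: dict) -> str:
    """Flatten the sections into labeled text for the system prompt."""
    lines = []
    for section, entries in context.items():
        lines.append(f"## {section.upper()}")
        for label, answer in entries.items():
            lines.append(f"- {label}: {answer}")
    return "\n".join(lines)


def find_objection_response(context: dict, utterance: str):
    """Return the pre-written answer whose label appears in the utterance."""
    text = utterance.lower()
    for label, answer in context["objections"].items():
        if label in text:
            return answer
    return None


print(render_context(SALES_CONTEXT))
print(find_objection_response(SALES_CONTEXT, "Honestly it sounds too expensive."))
```

The bullet formatting lives only in the prompt. The voice rules tell the agent not to read it aloud as a list.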


Step 3: Build the Sales Agent Class


At a high level, a voice sales agent class usually does four things:


1) Load the business context

This is often a function like load_context() that pulls:

  • product info

  • pricing

  • objection responses


2) Define the voice behavior rules

Voice responses should be spoken-friendly:

  • avoid bullet points in speech

  • keep sentences short

  • ask clarifying questions when needed

  • confirm details naturally

  • stay polite, helpful, and on-message


Also, an important rule: only use the provided context for company-specific claims.

This reduces hallucinations.


3) Initialize the main components

The agent is wired with:

  • STT

  • VAD

  • LLM

  • TTS

  • conversation state manager


This is usually done in the class constructor, often by passing the components to a super().__init__() call.


4) Start the conversation automatically

A method like on_enter() triggers when a user joins the call or room.


Instead of silence, the agent begins with a greeting, like a real salesperson would:

  • welcoming message

  • short question to understand intent

  • offer to answer questions


This small detail improves the experience a lot.
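
The four responsibilities above can be sketched in plain Python. This is not the LiveKit Agent class: it has no base class, the components are left as optional placeholders, and the rule text, the greeting, and the JSON-string form of load_context() are assumptions made so the example runs on its own.

```python
import json

VOICE_RULES = (
    "You are a voice sales agent. Keep sentences short and never use bullet points. "
    "Ask a clarifying question when the request is unclear. "
    "Only use the provided context for company-specific claims."
)


def load_context(raw_json: str) -> dict:
    """Parse sales materials; a real build might read them from files."""
    return json.loads(raw_json)


class SalesAgent:
    def __init__(self, context: dict, stt=None, vad=None, llm=None, tts=None):
        # Components are injected so providers can be swapped later.
        self.stt, self.vad, self.llm, self.tts = stt, vad, llm, tts
        # Behavior rules plus business context form the system instructions.
        self.instructions = VOICE_RULES + "\n\nCONTEXT:\n" + json.dumps(context, indent=2)
        # Conversation state: a list of (speaker, text) turns.
        self.history = []

    def on_enter(self) -> str:
        """Called when a user joins: greet instead of waiting in silence."""
        greeting = "Hi, thanks for stopping by! What brings you here today?"
        self.history.append(("agent", greeting))
        return greeting


agent = SalesAgent(load_context('{"pricing": {"starter": "placeholder"}}'))
print(agent.on_enter())
```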


Step 4: Launch the Agent Session


A typical entry point function acts like a “start button.”


It usually does three things:

  1. Connect to a LiveKit room. Think of it like joining a conference call.

  2. Create an instance of the sales agent. This uses your configured STT, TTS, LLM, and context.

  3. Start a session loop. This manages the back-and-forth conversation: listen → transcribe → think → speak → repeat


Once running, the agent can handle real-time voice conversations.
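
The loop itself is simple once the components exist. The sketch below uses stand-in functions for STT, the LLM, and TTS so it runs without any provider. It skips the room connection entirely, and all names are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

Transcript = List[Tuple[str, str]]


def run_session(
    audio_turns: Iterable[bytes],
    stt: Callable[[bytes], str],
    llm: Callable[[Transcript, str], str],
    tts: Callable[[str], None],
) -> Transcript:
    """listen -> transcribe -> think -> speak, repeated for every user turn."""
    transcript: Transcript = []
    for audio in audio_turns:                # listen
        user_text = stt(audio)               # transcribe
        reply = llm(transcript, user_text)   # think, with history as state
        transcript.append(("user", user_text))
        transcript.append(("agent", reply))
        tts(reply)                           # speak
    return transcript


# Stand-ins so the loop can run without real providers.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(history: Transcript, text: str) -> str:
    return f"You asked: {text}"


spoken: List[str] = []
log = run_session([b"what is the price"], fake_stt, fake_llm, spoken.append)
print(log)
```

In a real build these steps overlap rather than run strictly in sequence, which is what the agent SDK’s session handles for you.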


How to Make the Sales Agent More Robust


Image: flowchart with avatars and waveforms showing a routing process across Greeting, Sales, Technical, and Pricing (AI image generated by Gemini)

A single agent can handle basic sales conversations, but real sales calls can go deeper.


Two upgrades make voice agents much more useful:


Upgrade 1: Multi-Agent System (Specialists)

LLMs have limited context windows and limited “focus.”


A single agent trying to be:

  • greeter

  • qualifier

  • technical expert

  • pricing negotiator

often becomes messy.


A better approach is a team of agents, like a real company:

Greeting agent

  • welcomes the user

  • asks what they need

  • routes them correctly


Main sales agent

  • qualifies the lead

  • explains product basics

  • captures intent


Technical specialist agent

  • answers API and integration questions

  • handles deeper technical details


Pricing specialist agent

  • talks about budget, ROI

  • handles negotiation and plan comparisons


The key feature: handoff

The greeting agent detects intent and routes the user to the right specialist.


This improves:

  • accuracy

  • clarity

  • user trust

  • consistency
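
A handoff can be as simple as mapping detected intent to an agent name. The sketch below uses keyword matching purely for illustration; in practice the routing decision is more often made by the LLM itself. The route names and keyword lists are assumptions.

```python
import re

# Hypothetical keyword lists per specialist.
ROUTES = {
    "technical": {"api", "integration", "integrations", "sdk", "webhook"},
    "pricing": {"price", "pricing", "cost", "discount", "budget"},
}


def route(utterance: str, default: str = "sales") -> str:
    """Pick the specialist whose keywords appear in the utterance."""
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    for agent_name, keywords in ROUTES.items():
        if words & keywords:
            return agent_name
    # No specialist matched, so stay with the main sales agent.
    return default


print(route("How does your API integration work?"))
print(route("What does it cost?"))
print(route("Tell me about the product"))
```

Matching on whole words avoids false hits such as “api” inside “rapid.”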


Upgrade 2: Tool Calling

Sometimes the agent needs real-time facts that are not inside the static context.


Tool calling allows the agent to:

  • search documents

  • pull product specs from an external source

  • check pricing tables

  • query inventory or stock

  • fetch company policy text

  • retrieve integration docs


Tool calling makes the agent more dynamic and reduces “made up” answers.
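
Under the hood, tool calling usually means the model emits a structured request and your code runs the matching function. The sketch below shows only that dispatch step. The JSON shape, the registry, and the check_stock tool with its made-up inventory are all assumptions, not a specific provider’s format.

```python
import json

TOOLS = {}


def tool(fn):
    """Register a function so the agent can call it by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def check_stock(sku: str) -> dict:
    inventory = {"STARTER-01": 12}  # made-up data for the example
    return {"sku": sku, "in_stock": inventory.get(sku, 0)}


def dispatch(tool_call_json: str) -> str:
    """Run the requested tool and return its result as JSON for the LLM."""
    call = json.loads(tool_call_json)
    fn = TOOLS.get(call["name"])
    if fn is None:
        # Report the problem back to the model instead of crashing the call.
        return json.dumps({"error": f"unknown tool {call['name']}"})
    return json.dumps(fn(**call["arguments"]))


print(dispatch('{"name": "check_stock", "arguments": {"sku": "STARTER-01"}}'))
```

The result goes back to the model as text, and the model turns it into a spoken answer.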


Practical Notes for Production Use


If the goal is to actually use this in a real business, a few principles matter.


1) Keep responses grounded

For sales, incorrect claims are risky. Strong rules help:

  • use provided context for company specifics

  • ask clarifying questions when unsure

  • avoid guessing pricing or policies


2) Optimize for latency

Voice agents must feel immediate. The main ways to reduce delay:

  • faster inference (token speed)

  • streaming TTS

  • good end-of-turn detection

  • reliable low-latency voice transport (WebRTC)
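
To see how these pieces add up, it helps to write out a latency budget. The function below sums the delay between the user finishing and the first audio coming back. All the numbers in the example are invented for illustration, not benchmarks.

```python
def time_to_first_audio_ms(
    end_of_turn_ms: float,
    llm_first_token_ms: float,
    tokens_before_tts: int,
    tokens_per_second: float,
    tts_first_byte_ms: float,
    network_ms: float,
) -> float:
    """Delay from the user's last word to the agent's first audio."""
    # Time to generate enough tokens (e.g. one sentence) to start TTS.
    llm_buffer_ms = tokens_before_tts * 1000 / tokens_per_second
    return (end_of_turn_ms + llm_first_token_ms + llm_buffer_ms
            + tts_first_byte_ms + network_ms)


# Same pipeline, two made-up generation speeds.
common = dict(end_of_turn_ms=500, llm_first_token_ms=200,
              tokens_before_tts=20, tts_first_byte_ms=100, network_ms=50)
print(time_to_first_audio_ms(tokens_per_second=100, **common))
print(time_to_first_audio_ms(tokens_per_second=1000, **common))
```

In this made-up budget, end-of-turn detection is the largest single term, which is why it appears in the list above alongside inference speed.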


3) Design for interruptions

People interrupt each other in real conversation. Your agent should:

  • stop speaking when user speaks

  • handle partial turns

  • continue smoothly
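
Stopping mid-sentence is mostly a concurrency problem: playback and listening run at the same time, and whichever finishes first wins. Here is a minimal asyncio sketch, with a sleep standing in for audio playback and an Event standing in for the VAD signal. The function names are hypothetical.

```python
import asyncio
from typing import List, Tuple


async def speak(sentences: List[str], spoken: List[str]) -> None:
    for sentence in sentences:
        spoken.append(sentence)
        await asyncio.sleep(0.05)  # stands in for audio playback time


async def speak_until_interrupted(
    sentences: List[str], user_started_speaking: asyncio.Event
) -> Tuple[List[str], bool]:
    """Play sentences, but stop as soon as the user starts talking."""
    spoken: List[str] = []
    playback = asyncio.create_task(speak(sentences, spoken))
    interrupt = asyncio.create_task(user_started_speaking.wait())
    done, pending = await asyncio.wait(
        {playback, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )
    # Cancel whichever task lost the race and wait for it to unwind.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return spoken, interrupt in done


async def demo() -> Tuple[List[str], bool]:
    user_speaks = asyncio.Event()
    # Simulate the user cutting in shortly after playback starts.
    asyncio.get_running_loop().call_later(0.07, user_speaks.set)
    return await speak_until_interrupted(["One.", "Two.", "Three.", "Four."], user_speaks)


print(asyncio.run(demo()))
```

Knowing which sentences were actually spoken lets the agent keep its conversation state accurate after an interruption.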


4) Log and improve

Store transcripts and outcomes (with permission and privacy controls) so you can:

  • find where the agent fails

  • add better objection handling

  • update sales materials

  • improve routing logic
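
A simple starting point is one JSON line per turn, with obvious personal data masked before it is written. The sketch below is only an illustration: the field names are assumptions, and the email pattern is a rough example of a privacy control, not a complete one.

```python
import io
import json
import re
from datetime import datetime, timezone

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def log_turn(stream, call_id: str, role: str, text: str, outcome=None) -> dict:
    """Append one conversation turn as a JSON line, masking email addresses."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "call_id": call_id,
        "role": role,
        "text": EMAIL.sub("[email]", text),
        "outcome": outcome,  # e.g. "routed_to_pricing"; labels are up to you
    }
    stream.write(json.dumps(record) + "\n")
    return record


buffer = io.StringIO()  # a real build would write to a file or database
log_turn(buffer, "call-1", "user", "Reach me at jo@example.com please")
log_turn(buffer, "call-1", "agent", "Thanks, noted.", outcome="lead_captured")
print(buffer.getvalue())
```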

Closing Thought: Why Voice Sales Agents Are Taking Off


Voice agents work well in sales because they reduce friction. Users can simply speak. No typing. No menus. No learning curve.


For businesses, voice agents can:

  • answer questions instantly

  • qualify leads 24/7

  • keep messaging consistent

  • route technical questions to specialists

  • scale customer conversations without scaling headcount


The most important part is not just the model. It is the system: real-time voice transport, turn-taking, context grounding, and careful orchestration.

Build those pieces correctly, and a voice sales agent can feel surprisingly close to a real conversation.
