A deep-dive series on Webex AI Agent
This blog post is dedicated to the memory of Jay Patel, an enthusiastic champion of our AI Agent vision and a tireless advocate for every millisecond of improvement.
Why low latency matters for voice AI
In voice AI, latency is everything. Humans naturally pause for only a few hundred milliseconds between speaking turns, so when an AI agent waits longer, the conversation immediately feels robotic and inattentive. For AI agents, staying within that natural pause window is critical, because even small delays can break the flow and frustrate customers.
On the telephony network (PSTN) the challenge is even harder: roughly 500 ms of latency is introduced across the call path, leaving only a few hundred milliseconds for turn detection, retrieval, reasoning, and speech synthesis. Efficiency in every component is essential to keep conversations flowing naturally.


But speed alone isn't enough. Smaller models may be fast, but real customer-facing agents must also be accurate, instruction-following, hallucination-resistant, and enterprise-grade, qualities tiny models simply don't deliver. Larger models provide that intelligence, but at the cost of added latency. And that latency matters most once you move beyond web demos and into real calls. Delivering speed is easy in web demos, where connections are higher fidelity and avoid the PSTN's extra latency and encoding overhead. The real challenge is delivering that intelligence while still responding in under a second on real telephony paths.
This is where Cisco takes a unique approach. By combining the intelligence of high-quality models with deep latency engineering, Webex AI Agent delivers responses that are smart, immediate, and feel human.
Our approach
To deliver natural, low-latency responses, we use a modular pipeline: Voice Activity Detection (VAD) → Automatic Speech Recognition (ASR) → business logic → Large Language Model (LLM) → Text-to-Speech (TTS). This structure provides transparency and allows each component to be tuned for both speed and quality.
VAD detects when the customer starts and stops speaking, enabling barge-in and turn taking, while ASR converts speech to text, a critical step since all downstream logic depends on its accuracy. The business logic layer then interprets the transcript, managing turn detection (with entity-aware checks like digit sequences), grounding LLM responses with retrieval-augmented generation (RAG), which fetches relevant facts from company knowledge bases to prevent hallucinations, and handling additional decisions such as small-talk detection and tool usage.
The LLM generates the answer using transcript, context, and retrieved data, and TTS produces the natural audio the customer hears. We currently deploy trillion-parameter commercial models alongside Cisco's internal models to balance accuracy and latency, as detailed in our Webex AI Transparency Note. VAD and turn detection ensure we know precisely when to speak, RAG and business logic ground the answers, and the LLM and TTS deliver high-quality responses, all within strict timing constraints. Behind the scenes, several proprietary Cisco models provide additional intelligence and latency optimizations, further enhancing accuracy and responsiveness.
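To make the flow concrete, here is a minimal sketch of such a pipeline, assuming hypothetical component interfaces (vad, asr, logic, llm, tts); it illustrates the shape of the stages, not the actual Webex implementation:

```python
# A minimal sketch of the modular pipeline, with all component interfaces
# as illustrative assumptions rather than real Webex APIs.

async def handle_turn(audio_frames, vad, asr, logic, llm, tts):
    """Process one customer turn: listen, ground, answer, speak."""
    transcript_parts = []
    async for frame in audio_frames:
        event = vad.process(frame)                 # "speech" / "end_of_speech"
        if event == "speech":
            transcript_parts.append(asr.transcribe(frame))
        elif event == "end_of_speech":
            break
    transcript = " ".join(p for p in transcript_parts if p)

    # Business logic: entity-aware turn detection (e.g. wait out a digit
    # sequence), then retrieval-augmented grounding before the LLM call.
    if not logic.is_turn_complete(transcript):
        return None                                # keep listening
    facts = logic.retrieve(transcript)             # RAG lookup in the KB

    answer = await llm.generate(transcript, context=facts)
    return tts.synthesize(answer)                  # audio for playback
```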
These choices will continue to evolve as the industry and model capabilities advance.


Generating the first part of the final answer upfront
A key optimization is generating the beginning of the final answer while the customer is still speaking. Instead of waiting for a full transcript and completed retrievals, we pre-compute an initial segment so that playback can start immediately once VAD and turn detection (TD) confirm the end of turn. This mirrors human conversation: skilled agents begin speaking while still formulating the rest of their response. By generating the first part early, we maintain sub-second responsiveness while heavier, more accurate models continue processing in the background.
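A minimal sketch of this idea, assuming hypothetical llm, tts, and player interfaces (generate_opening and generate_continuation are illustrative names, not actual APIs):

```python
import asyncio

# Sketch of upfront generation: a cheap call drafts the opening while the
# caller is still speaking, and the full answer later continues from that
# exact prefix. All objects and method names here are assumptions.

async def respond_early(partial_transcript, llm, tts, player):
    # Kicked off before end of speech: produce a stable opening such as
    # "I'm happy to help with that" from the partial transcript.
    opening_task = asyncio.create_task(llm.generate_opening(partial_transcript))

    # ... VAD and turn detection confirm the end of turn here ...

    opening = await opening_task
    player.play(tts.synthesize(opening))           # playback starts at once

    # Heavier reasoning and RAG run in the background; the final answer is
    # generated as a continuation of the opening, never a replacement.
    rest = await llm.generate_continuation(partial_transcript, prefix=opening)
    player.play(tts.synthesize(rest))
```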
Why This Matters
Without upfront generation, we would be forced to use much smaller, less accurate models to meet latency budgets. That means real tradeoffs: the LLM would need to be lightweight with weaker instruction following, the ASR might have to be ultra-fast but less accurate, and the TTS would likely rely on faster but robotic-sounding engines. In practice, customers would get responses sooner, but those responses would be noticeably less helpful, natural, and trustworthy.
By generating and caching the first part early, we create valuable time for slower, higher-quality reasoning, retrieval, and synthesis. That breathing room lets larger, more capable models run in the background to perform the heavier reasoning and retrieval needed to generate intelligent responses, all before the customer ever notices a delay. The result is speed without compromise: quick responses that are both accurate and natural.
Challenges and Design Choices
Generating early output introduces several requirements:
- Incomplete input: The early segment must be stable, contextually plausible, and able to continue naturally after the full reasoning completes.
- Continuation model: We never discard the early part; the final answer simply continues from it.
- Mid-sentence flexibility: The early part doesn't have to be a full sentence (e.g., "I'm happy to help…"), making it blend seamlessly into the final answer.
- Multiple candidates: We generate several potential starts and pick the best one as more context arrives (see the sketch after this list).
This design delivers a single, smooth response from the customer's perspective, even as complex processing continues behind the scenes.
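One way the multiple-candidates step could look, with the candidate list and plausibility scorer as assumptions for illustration:

```python
# Sketch of the multiple-candidates choice: several openings are drafted from
# successive partial transcripts, and the best one is chosen once the final
# transcript is known. The scorer is a hypothetical plausibility model.

def pick_opening(candidates, final_transcript, score):
    # candidates: list of (partial_transcript, opening) pairs cached earlier.
    ranked = sorted(
        candidates,
        key=lambda pair: score(pair[1], final_transcript),
        reverse=True,
    )
    _, best_opening = ranked[0]
    return best_opening            # the final answer continues from this
```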


Additional latency optimizations
Across the stack, we apply a range of engineering optimizations, each shaving small amounts of time to keep the pipeline fast even when using higher-quality models.
Streaming
- Low-latency ASR/TTS streaming
- LLM token streaming (sketched after this list)
Infrastructure
- Regional media colocation
- Reserved capacity for critical LLM calls
Modelling
- Hybrid multi-model combinations for lightweight tasks
- Robust End of Speech (EOS) detection combining VAD, ASR, and custom signals
Caching
- Common prompt caching
- Pre-synthesized audio
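As an illustration of the streaming items, one common pattern (an assumption here, not the exact Webex mechanism) is to flush streamed LLM tokens to TTS at phrase boundaries so synthesis begins before the full answer exists:

```python
# Sketch: forward streamed LLM tokens to TTS in phrase-sized chunks.
# The chunking rule and the llm_stream/tts/player interfaces are
# illustrative assumptions.

async def stream_answer(llm_stream, tts, player):
    buffer = ""
    async for token in llm_stream:                 # tokens arrive as generated
        buffer += token
        # Flush on sentence-like boundaries so TTS receives natural phrases.
        if len(buffer) > 20 and buffer.endswith((".", "!", "?", ",")):
            player.play(await tts.synthesize(buffer))
            buffer = ""
    if buffer:
        player.play(await tts.synthesize(buffer))  # trailing fragment
```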
On the infrastructure side, we build several layers of resilience to keep latencies predictable under real production conditions. This includes an LLM proxy with regional failover, parallel "safety-net" requests to hedge against slow LLM responses, coordinated retries and caching for ASR and TTS paths, and a system-wide orchestration layer that dynamically adjusts model sizes and fallback strategies.
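A parallel safety-net request can be sketched as a simple hedge: if the primary model has not answered within a deadline, fire a backup and take whichever finishes first. The 400 ms hedge delay and the two coroutine functions are illustrative assumptions, not Webex configuration:

```python
import asyncio

# Sketch of a "safety-net" hedge against slow LLM responses.

async def hedged_generate(primary, backup, prompt, hedge_after=0.4):
    first = asyncio.create_task(primary(prompt))
    done, _ = await asyncio.wait({first}, timeout=hedge_after)
    if done:
        return first.result()                      # primary answered in time

    second = asyncio.create_task(backup(prompt))
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()                              # drop the slower request
    return done.pop().result()
```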
Real-life latency numbers
We measure latency from the moment the customer stops speaking to when they hear the first audio. In practice, the breakdown looks like this:
- VAD EOS: ~500 ms
- Turn Detection: <75 ms p99
- First part of answer: already generated and cached by EOS
- TTS for the first segment: usually 10–20 ms (due to cache hits)
- PSTN return path: ~500 ms
Because the early segment is ready immediately at EOS, playback can start almost instantly. Meanwhile, the heavier generation and RAG retrieval complete in parallel, seamlessly continuing the answer. The result is natural, sub-second responsiveness despite larger models running in the background.
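For a rough sanity check, here is a back-of-the-envelope sum of the nominal stage figures above, taking the upper bounds quoted and treating the cached first segment as free:

```python
# Nominal per-stage figures from the breakdown above, in milliseconds.
budget_ms = {
    "vad_eos": 500,            # ~500 ms
    "turn_detection": 75,      # <75 ms p99
    "first_segment": 0,        # already generated and cached by EOS
    "tts_first_segment": 20,   # usually 10-20 ms on cache hits
    "pstn_return": 500,        # ~500 ms
}

print(sum(budget_ms.values()), "ms to first audio")   # 1095 ms
```

That lands near 1.1 s to the first audible words, leaving headroom within the ~1.3 s end-to-end figure cited below.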
Our low-latency advantage
Delivering a truly human, natural, and immediate voice AI experience requires more than connecting ASR, LLM, and TTS. It demands careful orchestration of every component: precision turn detection, early-answer generation, caching, and resilient infrastructure, all working together to minimize latency without compromising intelligence.
Webex AI Agent combines high-quality models with deep latency engineering to consistently achieve ~1.3 second PSTN latencies over real telephony paths. The result is an AI agent that feels human, attentive, and reliable, helping enterprises meet customer expectations while maintaining accuracy, grounding, and enterprise-grade reliability.
Discover how Webex AI Agent can bring fast, natural, and accurate voice experiences to your PSTN interactions: reach out to your Webex sales representative or partner for a personalized demo.

