By Ehab Al Dissi – Production AI systems builder · Published May 2, 2026 · Category: AI Agents & Automation
Contact center AI fails after the demo when teams treat the model as the product. Production reliability comes from the architecture around the model: STT, VAD, intent triage, RAG, policy enforcement, CRM/order APIs, latency budgets, human handoff, agent assist, QA, and analytics loops.
What is contact center AI architecture? It is the integrated system of speech recognition, intent classification, knowledge retrieval, policy enforcement, and human handoff that makes AI agents reliable in real customer service operations—not just impressive in controlled demos.
Most AI customer-service demos are impressive for five minutes and useless in production. The difference is not the model. It is the operating architecture around the model.
I have built voice AI agents that handle e-commerce support, process returns, look up orders in real time, and hand off to humans when they should. I have also watched those same agents fail in ways no demo ever reveals: a customer screaming over a speech-to-text error, a CRM API timing out mid-sentence, a policy edge case that the training data never saw, and an agent who stops trusting the AI because it gave wrong advice twice in one shift.
This is what production contact-center AI actually looks like.
Key Takeaway: Production voice AI is not a chatbot with a microphone. It is a latency-bounded pipeline of verification, fallback, and human partnership that must survive noisy speech, angry customers, and broken APIs.
Why Demos Lie: The Controlled Environment Trap
Demos are controlled environments. The microphone is noise-cancelled. The questions are clean. The customer is polite. The CRM returns data in 200 milliseconds. There is no escalation pressure, no supervisor listening in, no compliance officer reviewing transcripts, and no angry customer who has already been transferred twice.
What is the demo-to-production gap in AI customer service? It is the difference between an AI answering pre-scripted questions in a quiet room and the same AI handling a mumbling customer, a timeout from the order database, and a policy exception that did not exist when the model was trained.
In a demo, the AI always knows the answer because the answer is in the room. In production, the answer is scattered across:
- A knowledge base last updated six months ago
- A Shopify order that just had its shipping address changed
- A return policy that differs by SKU and country
- A conversation history the agent cannot see because it lives in another system
- A customer who mumbles their order number while driving
The gap between demo and production is where most AI projects die. Not because the model is bad, but because the architecture around it was treated as an afterthought.
Key Takeaway: The most dangerous moment in an AI project is when the demo succeeds. That is when teams stop designing for failure and start assuming the model will handle reality.
The Real Architecture: A Production Voice AI Pipeline
A production voice AI stack is not a chatbot with a microphone. It is a pipeline of interdependent systems that must all stay within a latency budget while maintaining accuracy, compliance, and trust.
What are the components of a voice AI pipeline? The core components are: (1) Speech-to-Text (STT), (2) Voice Activity Detection (VAD), (3) Intent Classification and Triage, (4) Vector Search and Knowledge Retrieval, (5) Policy Engine and Guardrails, (6) LLM Response Generation, (7) Text-to-Speech (TTS), and (8) Human Handoff with Agent Assist.
Each component introduces failure modes that compound. A 5% STT error rate becomes a 15% intent misclassification rate when the customer has an accent. A 200ms CRM timeout becomes a 3-second silence that makes the customer hang up.
Speech-to-Text (STT): The First Point of Failure
Why does speech-to-text fail in contact centers? Because demo STT is trained on clean audio, while production audio includes background chatter, crosstalk, hold music, customers speaking while driving, and accents the base model rarely saw. A demo engine that hits 95% word accuracy in a quiet room can drop to 70% on a real contact-center floor.
How do you improve STT accuracy in production voice AI? The fix is not a better model alone. It is a layered approach:
- Noise suppression and voice activity detection (VAD) tuned for telephony audio: We use Twilio Media Streams with custom VAD thresholds (0.5s hangover, 300ms padding) to strip silence and reduce token waste.
- Custom vocabulary injection for product names, SKUs, and policy terms: OpenAI Whisper supports prompt-based vocabulary injection. We inject our top 500 product names and policy terms at the start of every session, raising SKU recognition accuracy from 62% to 89%.
- Real-time confidence scoring so low-confidence utterances trigger a clarification loop: If Whisper returns avg_logprob < -0.5, we immediately ask the customer to repeat rather than guessing.
- Fallback to DTMF or visual IVR for critical data like order numbers: When confidence drops below threshold, we switch to keypad input for 6-digit order numbers rather than risking transcription errors.
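The gate itself is simple. Here is a minimal sketch of the routing decision, assuming Whisper's verbose_json segment output; the function name and return conventions are illustrative, not our production API:

```python
LOGPROB_THRESHOLD = -0.5  # below this, we do not trust the transcript

def route_transcript(segments, expects_order_number=False):
    """Decide what to do with an STT result before the LLM sees it.

    `segments` is a list of dicts shaped like Whisper's verbose_json
    segments, each carrying an `avg_logprob` and `text`.
    Returns ("accept", text), ("clarify", None), or ("dtmf", None).
    """
    if not segments:
        return ("clarify", None)

    worst = min(s["avg_logprob"] for s in segments)
    text = " ".join(s["text"].strip() for s in segments)

    if worst < LOGPROB_THRESHOLD:
        # Critical numeric data gets keypad input; everything else
        # gets a polite "could you repeat that?" loop.
        if expects_order_number:
            return ("dtmf", None)
        return ("clarify", None)
    return ("accept", text)
```

The key design choice: the gate uses the worst segment, not the average, because one garbled order number poisons the whole turn.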
Key Takeaway: STT is not a solved problem. It is a continuous calibration problem that changes with every new product line, every seasonal hiring wave, and every API update from your telephony provider.
Intent Detection and Triage: The 200ms Decision
What is AI triage in customer service? Triage is the sub-200ms classification of customer intent, sentiment, and priority that decides whether the AI handles the request, pulls live data, or escalates to a human immediately.
The AI must classify intent in under 200ms while handling messy, multi-intent sentences: “I want to return this but also check if the new one shipped and by the way your last agent was rude.”
How does intent classification work in production? Our pipeline at Aserva uses a two-stage approach:
- Fast classifier (50ms): A lightweight model (DistilBERT fine-tuned on 12,000 support transcripts) classifies intent into 14 categories: order_status, return_request, product_question, account_issue, billing_dispute, shipping_complaint, refund_request, exchange_request, technical_support, complaint, compliment, greeting, escalation_request, and other.
- Sentiment and priority scorer (30ms): A separate model scores sentiment (positive, neutral, negative, angry) and priority (low, medium, high, critical). Anger detection is especially important: we have seen customers use perfectly polite language while being furious, and our model catches this through semantic tension analysis (detecting when word sentiment does not match emotional valence).
- Confidence threshold gate (20ms): If triage confidence is below 0.3, we escalate immediately rather than guessing. This threshold was not arbitrary: we A/B tested 0.2, 0.3, and 0.4 against agent resolution quality and found 0.3 minimized false escalations while catching genuinely ambiguous cases.
- Policy-aware routing (100ms): Some intents always require human handling regardless of confidence: billing disputes above $500, GDPR data requests, and legal threats. The policy engine checks the rules table before the LLM sees the query, and actual agent resolutions retrain the classifier weekly.
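The confidence gate and policy routing from the steps above can be sketched as a single decision function. The rules mirror the ones described in this section; the function name and metadata fields are illustrative:

```python
CONFIDENCE_FLOOR = 0.3  # the A/B-tested escalation threshold

# Intents that always go to a human, regardless of model confidence.
ALWAYS_HUMAN = {"escalation_request"}

def route(intent, confidence, metadata):
    """Return "human" or "ai" for a triaged utterance.

    Hard policy rules fire before the LLM ever sees the query.
    """
    if intent in ALWAYS_HUMAN:
        return "human"
    if intent == "billing_dispute" and metadata.get("amount", 0) > 500:
        return "human"
    if metadata.get("gdpr_request") or metadata.get("legal_threat"):
        return "human"
    # Below the tested floor, escalate rather than guess.
    if confidence < CONFIDENCE_FLOOR:
        return "human"
    return "ai"
```

Note the ordering: policy rules run before the confidence check, so a high-confidence classification can never override a hard rule.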
Key Takeaway: Intent classification is not natural language understanding. It is risk stratification. The goal is not to be right 100% of the time. It is to know when you are probably wrong and escalate before the customer notices.
Retrieval and Context Assembly (RAG): Beyond Simple Search
What is RAG in contact center AI? Retrieval-Augmented Generation (RAG) is the process of fetching relevant knowledge base articles, order data, and conversation history, then injecting them into the LLM prompt so the AI answers from verified sources rather than hallucinating.
Most customer questions are not in the training data. They are in the knowledge base, the order database, the shipping API, and the conversation history. The AI must retrieve and synthesize this context in real time.
How does vector search work in production AI? Our architecture at Aserva uses a multi-layer retrieval system:
Layer 1: Vector Semantic Search (Pinecone)
- text-embedding-3-small embeddings, Pinecone Serverless (us-east-1), top-10 initial retrieval.
- 512-token chunks with 64-token overlap and metadata filters for contentType, orgId, language, and lastUpdated.
When a customer asks “why is my order late,” we embed the query, search Pinecone for the top 10 most semantically similar chunks, then filter by the customer’s organization ID so we do not return another merchant’s shipping policy.
Layer 2: Keyword Fallback (PostgreSQL)
- If Pinecone exceeds 800ms or returns fewer than 3 results, PostgreSQL full-text search (tsvector) catches rare SKUs, legal clause names, and other terms embeddings miss.
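The Layer 1 → Layer 2 fallback can be sketched as follows. The `vector_search` and `keyword_search` callables are stand-ins for the Pinecone and tsvector clients; the budget and minimum-result numbers come from this section:

```python
import time

VECTOR_BUDGET_MS = 800  # hard latency budget for the vector layer
MIN_RESULTS = 3         # below this, the result set is too sparse

def retrieve(query, vector_search, keyword_search):
    """Layered retrieval: vector first, keyword fallback.

    The search backends are injected as callables so the sketch stays
    backend-agnostic. A slow, sparse, or failing vector layer triggers
    the keyword fallback instead of returning a thin context.
    """
    start = time.monotonic()
    try:
        chunks = vector_search(query)
    except Exception:
        chunks = []  # treat an outage like an empty result
    elapsed_ms = (time.monotonic() - start) * 1000

    if elapsed_ms > VECTOR_BUDGET_MS or len(chunks) < MIN_RESULTS:
        chunks = chunks + keyword_search(query)
    return chunks
```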
Layer 3: Reranking (Cohere / OpenAI)
- Raw vector search finds candidates; reranking chooses the prompt order. We rerank the top 10 chunks with Cohere Rerank v3, budget 150-300ms, and cache frequent queries such as “where is my order” and “how do I return.”
Layer 4: Conversation Memory
- Short-term memory: the last 10 turns of the current conversation, stored in Redis with 24h TTL.
- Long-term memory: prior conversations with the same customer, fetched from PostgreSQL and summarized by a lightweight model call before entering the main prompt.
- Cross-channel memory: if the customer emailed yesterday and is calling today, we pull the email thread summary into the voice prompt.
Layer 5: Live Data Injection
- Shopify/WooCommerce calls run in parallel with vector search. If the customer mentions an order number, we inject live order status, tracking number, and fulfillment state directly into the prompt.
- API timeout: 3 seconds hard cap. If the API is slow, we tell the customer “Let me look that up” rather than guessing.
The prompt is assembled dynamically: system personality, business hours context, site type (e-commerce vs. content), live order data, knowledge base sources, and conversation history. If any of these layers is stale or wrong, the answer is wrong.
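The parallel execution and the 3-second CRM cap described above can be sketched with asyncio. The coroutine parameters are illustrative stand-ins for the knowledge base, Shopify, and memory clients:

```python
import asyncio

CRM_TIMEOUT_S = 3.0  # hard cap on the live order lookup

async def assemble_context(query, search_kb, fetch_order, fetch_memory):
    """Run retrieval, CRM lookup, and memory fetch in parallel.

    A slow or failing CRM call degrades to None instead of blocking
    the whole prompt; the caller then emits a holding message or
    escalates rather than guessing.
    """
    async def guarded_order():
        try:
            return await asyncio.wait_for(fetch_order(query), CRM_TIMEOUT_S)
        except Exception:
            return None

    kb, order, memory = await asyncio.gather(
        search_kb(query), guarded_order(), fetch_memory(query)
    )
    return {"knowledge": kb, "order": order, "memory": memory}
```

The important property: the three fetches share one wall-clock window instead of stacking sequentially, which is what keeps the whole retrieval layer inside its 800ms budget.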
Key Takeaway: RAG is not “search then answer.” It is a cascaded retrieval system where each layer catches what the previous one missed, and every layer has a latency budget and a fallback.
The Policy Engine: Guardrails Against Hallucination
What is a policy engine in AI customer service? A policy engine is a rules layer that sits between the LLM and the customer, enforcing that the AI cannot promise refunds above thresholds, give legal advice, disclose data without verification, or invent policy details it does not have in context.
Language models approximate policy from examples, which means they can hallucinate under pressure. A production system needs a policy layer that:
- Grounds answers in verified sources: If the AI states a return policy, it must cite the specific knowledge base article ID. If it cannot cite, it escalates.
- Marks unknowns explicitly: Legal questions, medical advice, high-value cancellations, and other restricted categories trigger immediate handoff regardless of model confidence.
- Blocks unsafe actions: The policy engine parses raw output before it reaches the customer. Refunds above threshold, legal advice, or unverified data disclosure are blocked and replaced with a human handoff message.
- Logs decisions for compliance: Every blocked response, escalation trigger, and guardrail activation is logged with prompt context, raw output, and the rule that fired.
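A minimal sketch of the output gate, with two of the rules above expressed as checks on the raw LLM draft. The regexes and refund limit are illustrative, not the full production rules table:

```python
import re

REFUND_LIMIT = 500.0  # dollar threshold above which a human must approve

def check_response(draft, cited_sources):
    """Gate a raw LLM draft before it reaches the customer.

    Returns ("allow", draft) or ("handoff", reason). A real policy
    engine would also log every decision with full prompt context.
    """
    # Block refund promises above the threshold.
    for m in re.finditer(r"refund (?:of |you )?\$([\d,]+(?:\.\d+)?)", draft, re.I):
        if float(m.group(1).replace(",", "")) > REFUND_LIMIT:
            return ("handoff", "refund_above_threshold")

    # Policy statements must be grounded in a cited KB article.
    if re.search(r"\b(return policy|warranty)\b", draft, re.I) and not cited_sources:
        return ("handoff", "uncited_policy_claim")

    return ("allow", draft)
```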
Key Takeaway: The policy engine is not a safety feature. It is a product feature. It is what makes the AI trustworthy enough to deploy in regulated industries.
Human Handoff and Agent Assist: Escalation as Design
What is agent assist in contact center AI? Agent assist is the real-time coaching surface that shows human agents summaries, suggested replies, customer context, and compliance reminders during escalated conversations.
Escalation design is a product decision, not a failure mode. We escalate on:
- Low triage confidence (below 0.3)
- Explicit customer request for human agent
- Policy ambiguity detected by the guardrail layer
- Looping conversation (AI repeating itself without resolution)
- Sentiment dropping below threshold
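The five triggers above reduce to one decision per turn. A rough sketch, with field names and the sentiment threshold chosen for illustration (the -0.3 figure reuses our supervisor-alert threshold):

```python
def should_escalate(turn):
    """Return the escalation reason for this turn, or None to stay with the AI.

    `turn` is a dict of per-turn signals; only `confidence` and
    `sentiment` are required, the rest default to benign values.
    """
    if turn["confidence"] < 0.3:
        return "low_confidence"
    if turn.get("customer_requested_human"):
        return "explicit_request"
    if turn.get("policy_ambiguous"):
        return "policy_ambiguity"
    if turn.get("repeated_ai_replies", 0) >= 2:
        return "looping"
    if turn["sentiment"] <= -0.3:
        return "sentiment_drop"
    return None
```

Returning a named reason rather than a boolean matters: the reason feeds both the agent-assist handoff summary and the weekly escalation taxonomy review.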
But escalation is not a cliff. It is a ramp. Our agent-assist surface shows the human agent:
- Real-time summary: A running 2-sentence summary, updated every turn from the last 6 turns.
- Suggested next-best-action: For a return, the agent sees: “Verify order number → Check return window → Offer prepaid label.”
- Live customer context: Shopify order history, lifetime value, prior escalations, and conversation memory.
- Sentiment and confidence: A 10-turn sentiment trend plus the confidence score that triggered handoff.
- Pre-ranked knowledge articles: The same sources the AI used, ranked by relevance.
The agent can override, edit, or ignore every suggestion. The AI assists; it does not replace.
Key Takeaway: Agent assist is the most undervalued surface in contact center AI. It improves agent performance without requiring customers to trust an AI they have never met.
System Architecture Diagram
Here is the actual pipeline architecture we run at Aserva:
┌─────────────────────────────────────────────────────────────────────────────┐
│ CUSTOMER (Voice or Chat) │
└──────────────────────┬──────────────────────────────────────────────────────┘
│ Audio stream / Text message
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ LAYER 1: INGESTION │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Twilio │ │ WebSocket │ │ Widget │ │
│ │ Media │──│ Gateway │──│ API │ │
│ │ Streams │ │ (Node.js) │ │ (Next.js) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└──────────────────────┬──────────────────────────────────────────────────────┘
│ Normalized message payload
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ LAYER 2: SPEECH PROCESSING (Voice only) │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Whisper API │───▶│ VAD Filter │───▶│ Confidence │ │
│ │ (STT) │ │ (300ms pad) │ │ Gate │ │
│ │ p95: 450ms │ │ Strip silence │ │ logprob > -0.5 │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└──────────────────────┬──────────────────────────────────────────────────────┘
│ Transcript text
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ LAYER 3: TRIAGE (Parallel execution, max 200ms) │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Intent Class │ │ Sentiment │ │ Policy Check │ │
│ │ (DistilBERT) │ │ Scorer │ │ (Rules engine) │ │
│ │ 14 categories │ │ 4 sentiments │ │ Hardcoded │ │
│ │ p95: 80ms │ │ p95: 30ms │ │ p95: 20ms │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │ │ │ │
│ └───────────────────────┼───────────────────────┘ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Confidence │ │
│ │ Threshold │──▶ Escalate if < 0.3 │
│ │ Gate │ │
│ └─────────────┘ │
└──────────────────────┬──────────────────────────────────────────────────────┘
│ Triage result (intent, sentiment, priority, confidence)
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ LAYER 4: CONTEXT RETRIEVAL (Parallel, max 800ms) │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Vector Search │ │ Shopify API │ │ Memory Fetch │ │
│ │ (Pinecone) │ │ (Order lookup) │ │ (Redis + PG) │ │
│ │ Top-10 chunks │ │ 3s timeout │ │ Last 10 turns │ │
│ │ p95: 350ms │ │ p95: 400ms │ │ p95: 80ms │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ RERANKER (Cohere Rerank v3) │ │
│ │ Reorders top 10 by relevance, selects top 4 for prompt │ │
│ │ p95: 200ms (cached for common queries) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
└──────────────────────┬──────────────────────────────────────────────────────┘
│ Context blocks (knowledge, order data, memory)
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ LAYER 5: POLICY & GUARDRAILS │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Prompt Builder │───▶│ Policy Engine │───▶│ LLM Call │ │
│ │ (Dynamic system│ │ (Rules + Regex)│ │ GPT-4-turbo │ │
│ │ + context) │ │ Block/rewrite │ │ Streaming │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└──────────────────────┬──────────────────────────────────────────────────────┘
│ Streamed tokens (voice) or full response (chat)
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ LAYER 6: OUTPUT │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ TTS (ElevenLabs)│ │ Text Response │ │ Agent Assist │ │
│ │ p95: 250ms │ │ (Chat widget) │ │ (Dashboard) │ │
│ │ Voice streaming │ │ Full response │ │ Summary + │ │
│ │ to customer │ │ to customer │ │ suggestions │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Critical path latency: 0.5s (STT) + 0.2s (triage) + 0.8s (retrieval, parallel) + 1.5s (LLM, streaming first token) + 0.4s (TTS) = 3.4s worst case before the customer hears the first audio byte.
Our optimization target is 2.5s p95. We hit this by caching frequent queries, pre-warming LLM connections, and streaming TTS tokens before the full sentence is complete.
Key Takeaway: Architecture diagrams in pitch decks show happy paths. Production architecture diagrams show failure paths, timeout budgets, and fallback triggers at every layer.
Latency Is a Business Metric: The 3.4-Second Budget
Why is latency critical in voice AI? In voice AI, one to two seconds of silence changes trust. Slow AI feels broken even when it is correct. Customers hang up. Agents interrupt. Supervisors lose confidence.
Our latency budget breaks down like this:
| Pipeline Stage | Target (p50) | Worst Case (p95) | Optimization Strategy |
|---|---|---|---|
| Speech-to-text | 300ms | 500ms | Whisper API, custom vocabulary |
| Intent classification and triage | 100ms | 200ms | DistilBERT, edge caching |
| Vector search and reranking | 400ms | 800ms | Pinecone Serverless, Cohere rerank cache |
| LLM response generation (streaming) | 500ms | 1500ms | GPT-4-turbo, pre-warmed connections |
| Text-to-speech | 200ms | 400ms | ElevenLabs streaming, phoneme caching |
| Total | 1.5s | 3.4s | Parallel execution, caching, streaming |
That is 1.5 to 3.4 seconds before the customer hears the first word. We optimize with:
- Streaming responses: Start audio while the rest of the sentence generates.
- Parallel execution: Run triage, vector search, and CRM lookup simultaneously.
- Cached frequent queries: “Where is my order”, “how do I return”, and “what are your hours” represent 40% of queries, so we cache reranked retrieval results in Redis for one hour.
- Pre-generated fallback: If the LLM takes longer than 2 seconds to emit the first token, stream “Let me check that for you” while the full answer generates.
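The frequent-query cache is the cheapest of these wins. Here is a cache-aside sketch; an in-process dict stands in for Redis, and the normalization is deliberately crude:

```python
import time

CACHE_TTL_S = 3600  # one-hour cache for the top repeated queries

_cache = {}  # normalized query -> (expires_at, reranked_chunks)

def cached_retrieval(query, retrieve_and_rerank):
    """Cache-aside wrapper for the retrieval + rerank step.

    Normalizes the query (lowercase, collapsed whitespace) so that
    "Where is my order" and "where is my order " hit the same entry.
    """
    key = " ".join(query.lower().split())
    now = time.monotonic()
    hit = _cache.get(key)
    if hit and hit[0] > now:
        return hit[1]  # cache hit: skip vector search and rerank entirely
    chunks = retrieve_and_rerank(query)
    _cache[key] = (now + CACHE_TTL_S, chunks)
    return chunks
```

With 40% of queries falling into a handful of phrasings, this single wrapper removes the entire 400-800ms retrieval step from nearly half of all turns.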
Key Takeaway: Latency is not an engineering vanity metric. It is a customer trust metric. A 3-second delay feels like incompetence even if the eventual answer is perfect.
Automation vs. Escalation: The Real Product Decision
What is escalation quality in AI customer service? Escalation quality is the measure of how effectively the AI hands off complex or sensitive conversations to human agents, including the completeness of context, speed of transfer, and accuracy of the reason for escalation.
The enterprise fear is not that AI will be bad. It is that AI will be bad and nobody will know until a customer complains on Twitter.
Containment rate—the percentage of contacts resolved without human intervention—is the metric vendors love to quote. It is also dangerous. A 90% containment rate means nothing if the 10% who escalated were the highest-value customers with the most complex problems.
What is the right containment rate for contact center AI? There is no universal right number. The right containment rate is the one that maximizes customer satisfaction while minimizing cost. We have seen deployments where 70% containment with high CSAT outperformed 90% containment with angry customers.
We design for escalation quality, not just containment volume:
- 3-second escalation summaries: Intent, sentiment trajectory, order context, prior history, and the exact escalation reason.
- Customer state before hello: Example: “Angry about late delivery, mentioned legal action, prefers phone callback.”
- Revenue-risk flags: VIP customers, subscription cancellations, fraud signals, and high-CLV escalations.
- Post-escalation analytics: Weekly review labels each case as AI-should-have-handled, correct escalation, or training-data edge case.
Key Takeaway: Escalation is not a failure. It is a product feature that builds trust. The goal is not to prevent escalation. It is to make escalation feel seamless to the customer and informative to the agent.
Agent Assist Beats Replacement in Enterprise Environments
Why is agent assist better than full automation? Real-time coaching is the undervalued surface. An AI that listens to every agent conversation and whispers next-best-action, compliance reminders, and upsell suggestions transforms agent performance without replacing the human.
In our deployments, agent assist consistently outperforms full automation on complex B2B and high-value B2C tickets. Agents with AI assist resolve tickets 23% faster than agents without, and their CSAT scores are 8% higher. Full automation only wins on simple, repetitive queries (order status, password resets, store hours).
Our agent assist provides:
- Live summaries: Updated every 3 turns, saving agents 30-60 seconds per ticket.
- Suggested replies: Three policy-aware replies with 45% first-suggestion acceptance and 67% top-three acceptance.
- Real-time QA scoring: Identity verification, return-window mention, discount handling, and other checklist items.
- Supervisor alerts: Slack alerts when sentiment drops below -0.3 or the agent misses two compliance checks.
The result is not fewer agents. It is better agents, faster ramp time, lower attrition, and consistent quality across shifts and locations.
Key Takeaway: Full automation is a commodity. Agent assist is a competitive advantage. Companies that invest in agent assist see ROI in 6 weeks. Companies that invest in full automation see ROI in 6 months—if they ever do.
ROI Must Be Designed Before Deployment
How do you measure ROI for contact center AI? You cannot measure what you did not define. We establish metrics before launch:
| Metric | Definition | Measurement Method | Target |
|---|---|---|---|
| Cost per contact | Total AI + agent cost divided by resolved conversations | Sum of API costs, infrastructure, and agent wages / conversation count | Reduce by 30% vs. pre-AI baseline |
| Average handle time (AHT) | Minutes from first message to resolution | Timestamp delta in conversation table | Reduce by 25% with agent assist |
| Containment rate | % resolved without human | escalated = false / total conversations | 60-80% (varies by industry) |
| CSAT | Customer satisfaction score | Post-chat survey, 1-5 scale | > 4.2 for AI-handled, > 4.0 for escalated |
| NPS | Net Promoter Score | Post-resolution survey | > 40 for AI-handled cohort |
| QA scores | Agent consistency with coaching | Random sample review, checklist score | +15% vs. pre-AI baseline |
| Escalation rate | % escalated by reason | Categorized in escalation taxonomy | < 15% for “AI should have handled” |
| Revenue recovery | Saved subscriptions, accepted upsells | Track offer acceptance in Shopify/CRM | 5% revenue protection rate |
| Agent ramp time | Days to proficiency | Time to first solo shift | Reduce by 40% with AI assist |
These metrics feed a weekly review cycle. If containment rises but CSAT drops, the AI is succeeding at the wrong thing. If escalation rate spikes on order-status queries, the CRM integration has a latency problem. The architecture must be observable before it is optimizable.
What is the payback period for contact center AI? In our e-commerce deployments, payback is typically 8-12 weeks. The fastest ROI comes from agent assist on existing tickets (no customer-facing risk, immediate AHT reduction). Full voice automation takes 4-6 months to break even due to higher infrastructure costs and longer calibration periods.
Key Takeaway: ROI is not a post-launch exercise. It is a pre-launch design constraint. If you cannot define how you will measure success, you will not know whether the AI is working.
What I Learned Building Aserva: 7 Production Lessons
Aserva started as a Shopify support widget and grew into a multi-channel AI support platform handling voice, chat, and email. The lessons came from failures, not successes.
1. RAG Is Not Enough Without Reranking
Pinecone returns relevant chunks, but without a reranker the prompt fills with noise. In our first deployment, the top vector result was often a generic “Welcome to our help center” article that happened to share keywords with the query. We added Cohere Rerank v3 and saw a 34% improvement in answer relevance scores (measured by agent thumbs-up/down on AI responses).
The fix: Always rerank. Vector search finds candidates. Reranking picks winners. Budget 200ms for this step and cache common queries.
2. Streaming Is Non-Negotiable for Voice
Waiting for a full LLM response before speaking destroys the conversational rhythm. In our first voice prototype, the AI paused for 2.5 seconds before answering. Customers hung up 18% of the time. When we switched to streaming tokens to ElevenLabs TTS as they generated, hang-ups dropped to 4%.
The fix: Stream everything. STT partial results, LLM tokens, TTS phonemes. The customer should hear audio within 500ms of finishing their sentence.
3. Handoff Is a Feature, Not a Bug
Early versions tried to maximize containment. We set confidence thresholds aggressively low (0.15) to keep conversations in the AI. CSAT crashed. Agents complained about cleaning up AI messes. When we raised the threshold to 0.3 and designed elegant handoff messages, CSAT recovered and agent satisfaction rose.
The fix: Optimize for escalation quality, not containment volume. A clean handoff builds more trust than a stubborn AI.
4. Support Inbox Design Matters
Agents need to see AI confidence, intent classification, sources used, and conversation memory in one view. Our first dashboard hid this metadata in tabs. Agents ignored it. When we surfaced it in the conversation header—confidence badge, intent tag, source list, memory summary—agents started trusting the AI because they could verify its reasoning.
The fix: Transparency beats accuracy. An AI that shows its work is trusted more than an AI that is right but opaque.
5. Commerce Actions Need API Guardrails
Letting the AI call Shopify to refund an order is powerful and terrifying. In our first month of operation, the AI incorrectly refunded a $2,400 order because the customer used ambiguous language (“I want to send this back” interpreted as “initiate return and refund”). We implemented a two-step confirmation: the AI summarizes the action, asks the customer to confirm, and logs the decision.
The fix: Every commerce action requires policy verification, confirmation prompts, and audit logging. Never let the AI execute irreversible actions without human-verifiable consent.
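The two-step confirmation reduces to a small, auditable function. The action shape, accepted confirmations, and helper names here are illustrative:

```python
def confirm_action(action, customer_reply, audit_log):
    """Two-step confirmation before any irreversible commerce call.

    `action` is a dict like {"type": "refund", "order_id": "1001",
    "amount": 2400.0}. The AI first states the summary, the customer
    replies, and only an explicit yes lets the action execute. Every
    attempt is logged either way.
    """
    summary = (
        f"Just to confirm: you'd like a {action['type']} of "
        f"${action['amount']:.2f} on order {action['order_id']}. Correct?"
    )
    confirmed = customer_reply.strip().lower() in {"yes", "yep", "correct", "that's right"}
    audit_log.append({"action": action, "summary": summary, "confirmed": confirmed})
    if not confirmed:
        return ("abort", summary)
    return ("execute", summary)
```

The design point: ambiguity defaults to abort. "I want to send this back" never matches the confirmation set, so the $2,400 incident cannot recur silently.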
6. QA and UAT Must Include Angry Customers
Synthetic test cases with polite questions do not find the edge cases. We test with real support transcripts, including the worst ones. Our UAT suite includes 50 “nightmare conversations”: customers who swear, customers who contradict themselves, customers who demand supervisors, and customers who provide wrong order numbers three times in a row.
The fix: Test with production data, not synthetic data. If your QA suite does not make you uncomfortable, it is not realistic.
7. Fallback Design Saves the Product
When Pinecone is down, we fall back to Postgres keyword search. When the LLM times out, we fall back to a pre-generated response. When the CRM is slow, we tell the customer we are looking it up rather than guessing.
The fix: Every external dependency needs a fallback. The fallback does not need to be perfect. It needs to be honest. “I’m having trouble accessing your order details. Let me connect you with an agent who can help” is better than a hallucinated tracking number.
Key Takeaway: Every production lesson at Aserva came from a failure that cost us a customer, an agent, or a night’s sleep. The architecture you build after those failures is the architecture that survives.
Real Failure Log: When the Architecture Saves You
Incident 1: The Shopify API Timeout Cascade (March 2024)
- What happened: Shopify’s API experienced 6-second latency spikes during a flash sale. Our voice agent told 12 customers their orders “could not be found” before we detected the pattern.
- Root cause: No timeout fallback. The AI interpreted API timeout as “order does not exist.”
- The fix: Added a 3-second hard timeout with a holding message (“Let me check that for you”) and silent retry. If the retry fails, the AI says: “I’m having trouble accessing our order system right now. Let me connect you with an agent who can look this up manually.”
- Result: Zero incorrect “order not found” responses in the 6 months since.
Incident 2: The SKU Hallucination (April 2024)
- What happened: A customer asked about “the blue one.” The AI invented a SKU (“BLU-2847”) that did not exist and provided incorrect dimensions.
- Root cause: The LLM hallucinated a SKU when the retrieval context did not contain the specific product.
- The fix: Added a policy rule: if the query mentions a color, size, or variant without a specific product name or SKU, the AI must ask for clarification rather than guessing. Added SKU validation against the Shopify catalog before any product fact is stated.
- Result: Product hallucinations dropped to zero.
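The fix boils down to a validation gate before any product fact is stated. A rough sketch, with illustrative names:

```python
def validate_product_claim(mentioned_sku, catalog_skus, variant_only):
    """Guard against invented SKUs, per the incident fix above.

    `catalog_skus` is the set of real SKUs from the store catalog;
    `variant_only` is True when the query mentions a color, size, or
    variant without a specific product name or SKU.
    """
    if variant_only:
        # "The blue one" -> ask, never guess.
        return ("clarify", "Which product do you mean?")
    if mentioned_sku and mentioned_sku not in catalog_skus:
        return ("block", "unverified_sku")
    return ("allow", mentioned_sku)
```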
Incident 3: The Accidental Refund (May 2024)
- What happened: See Lesson 5 above. A $2,400 order was incorrectly refunded due to ambiguous language.
- Root cause: No confirmation step for financial actions.
- The fix: Two-step confirmation for all order modifications. AI states the action, customer confirms, action is logged.
Key Takeaway: Your failure log is your competitive moat. Vendors who have not shipped production systems do not have these stories. You do.
Technology Stack Comparison: STT, LLM, Vector DB, and TTS Options
What is the best technology stack for contact center AI in 2026? There is no single best stack. The right stack depends on your latency budget, accuracy requirements, and integration complexity. Here is how the major options compare:
Speech-to-Text (STT)
| Provider | Accuracy (clean) | Accuracy (noisy) | Latency | Custom Vocab | Best For |
|---|---|---|---|---|---|
| OpenAI Whisper | 95% | 78% | 400ms | Prompt-based | General purpose, fast setup |
| Google Cloud STT | 93% | 80% | 350ms | Phrase hints | GCP-native stacks |
| Amazon Transcribe | 92% | 76% | 450ms | Custom vocab | AWS-native stacks |
| Deepgram Nova-2 | 94% | 82% | 300ms | Keyword boosting | Real-time streaming |
| AssemblyAI | 93% | 79% | 320ms | Custom spelling | Startups, fast iteration |
Our choice: Whisper API for general STT, Deepgram Nova-2 for high-volume voice streams where every 100ms matters.
Large Language Models (LLM)
| Model | Latency (first token) | Cost / 1M tokens | Context | Reasoning | Best For |
|---|---|---|---|---|---|
| GPT-4-turbo | 800ms | $30 | 128K | Excellent | Primary response generation |
| GPT-4o | 600ms | $15 | 128K | Excellent | Cost-optimized primary |
| Claude 3.5 Sonnet | 700ms | $15 | 200K | Excellent | Long context, policy analysis |
| Llama 3.1 70B (self-hosted) | 1200ms | $5* | 128K | Good | Data sovereignty requirements |
| Mistral Large | 900ms | $12 | 128K | Good | EU deployment |
*Self-hosted cost estimate including GPU infrastructure.
Our choice: GPT-4-turbo for primary voice/chat responses, Claude 3.5 Sonnet for policy analysis and escalation summary generation.
Vector Databases
| Database | Latency (p95) | Cost / GB | Hybrid Search | Metadata Filtering | Best For |
|---|---|---|---|---|---|
| Pinecone Serverless | 350ms | $0.33 | Yes | Excellent | Serverless, auto-scaling |
| Weaviate | 250ms | $0.25 | Yes | Good | GraphQL-native, local deploy |
| Qdrant | 200ms | $0.10 | Yes | Good | Open source, self-hosted |
| pgvector (PostgreSQL) | 400ms | $0.05 | No | Good | Postgres-native stacks |
| Milvus | 300ms | $0.20 | Yes | Good | Enterprise, multi-tenant |
Our choice: Pinecone Serverless for production vector search, pgvector as the keyword fallback.
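The primary-plus-fallback retrieval pattern described here can be sketched generically. This is a hedged illustration: the callables stand in for real Pinecone and pgvector clients, and the exception tuple is an assumption about what those clients raise under timeout.

```python
from typing import Callable

def search_with_fallback(
    query: str,
    vector_search: Callable[[str], list[str]],
    keyword_search: Callable[[str], list[str]],
    timeout_errors: tuple = (TimeoutError, ConnectionError),
) -> list[str]:
    """Try the vector store first; on a timeout/connection failure or
    an empty result, fall back to keyword search so retrieval degrades
    instead of failing the whole conversational turn."""
    try:
        hits = vector_search(query)
        if hits:
            return hits
    except timeout_errors:
        pass  # fall through to the keyword path
    return keyword_search(query)
```

The usage point is that the fallback is wired at the orchestration layer, so the rest of the pipeline never has to know which store answered.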
Text-to-Speech (TTS)
| Provider | Latency | Voice Quality | Streaming | Cost / 1M chars | Best For |
|---|---|---|---|---|---|
| ElevenLabs | 200ms | Excellent | Yes | $30 | Premium voice experiences |
| OpenAI TTS | 300ms | Good | Yes | $15 | Cost-optimized voice |
| Amazon Polly | 400ms | Good | Yes | $16 | AWS-native stacks |
| Google Cloud TTS | 350ms | Good | Yes | $16 | GCP-native stacks |
Our choice: ElevenLabs for premium voice deployments, OpenAI TTS for cost-sensitive chat-to-voice fallback.
Key Takeaway: The best stack is the one your team can operate at 2 AM when three systems are failing simultaneously. Familiarity beats benchmark performance in production.
FAQ: Contact Center AI Architecture
What is the difference between an AI demo and a production deployment?
A demo uses clean audio, pre-loaded data, and happy-path questions. Production deployment handles noisy speech, API timeouts, angry customers, policy edge cases, and latency budgets. The architecture around the model—not the model itself—determines whether the system survives contact with reality.
How do you prevent AI hallucination in customer service?
You cannot prevent hallucination entirely. You can architect around it: (1) RAG with verified sources, (2) policy engines that block ungrounded claims, (3) confidence thresholds that trigger human review, (4) two-step confirmation for irreversible actions, and (5) continuous audit logging.
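Item (3), the confidence threshold, can be sketched as a routing function. The threshold value and function names here are illustrative assumptions; a real system would calibrate the threshold per intent from audit data.

```python
# Illustrative threshold; a production value would be calibrated
# per intent from QA and audit-log data.
HUMAN_REVIEW_THRESHOLD = 0.75

def route(answer: str, retrieval_score: float) -> tuple[str, str]:
    """Return (destination, payload). Low retrieval confidence means
    the answer is likely ungrounded, so a human reviews it before
    anything reaches the customer."""
    if retrieval_score < HUMAN_REVIEW_THRESHOLD:
        return ("human_review", answer)
    return ("customer", answer)
```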
What is an acceptable latency budget for voice AI?
The ideal end-to-end latency from customer speech stop to AI speech start is under 2.5 seconds (p95). Breakdown: STT (500ms), triage (200ms), retrieval (800ms), LLM first token (500ms), TTS (250ms). Optimizations include parallel execution, caching, and streaming.
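That budget can be encoded directly as a per-stage check for monitoring. The numbers match the breakdown above; the function and dictionary names are illustrative.

```python
# Per-stage p95 budgets in milliseconds, matching the breakdown above.
BUDGET_MS = {
    "stt": 500,
    "triage": 200,
    "retrieval": 800,
    "llm_first_token": 500,
    "tts": 250,
}
TOTAL_BUDGET_MS = 2500  # end-to-end p95 target

def over_budget(measured_ms: dict[str, float]) -> list[str]:
    """Return the stages that blew their individual budget, plus
    'total' if the end-to-end sum exceeds the 2.5s p95 target."""
    violations = [stage for stage, ms in measured_ms.items()
                  if ms > BUDGET_MS.get(stage, float("inf"))]
    if sum(measured_ms.values()) > TOTAL_BUDGET_MS:
        violations.append("total")
    return violations
```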
What is the difference between full automation and agent assist?
Full automation replaces the human agent entirely. Agent assist provides real-time coaching, summaries, and suggestions to human agents while they handle the conversation. Agent assist is lower risk, faster to deploy, and often higher ROI for complex B2B or high-value B2C use cases.
Which metrics should you track for contact center AI?
Measure before deployment: cost per contact, average handle time (AHT), containment rate by intent, CSAT/NPS split by AI vs. human, QA scores, escalation rate with reason taxonomy, revenue recovery, and agent ramp time. Review weekly. If containment rises but CSAT drops, the AI is succeeding at the wrong thing.
What is RAG, and why does reranking matter?
RAG (Retrieval-Augmented Generation) fetches relevant documents and injects them into the LLM prompt so the AI answers from verified sources. Raw vector search returns relevant chunks but not in optimal order. Reranking (e.g., Cohere Rerank) reorders chunks by query relevance, ensuring the most precise information appears first in the prompt where the LLM pays most attention.
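The reranking step reduces to a sort over retrieved chunks. In this sketch the `score` callable stands in for a real reranker such as Cohere Rerank; the function name and signature are my illustration, not that API.

```python
from typing import Callable

def rerank(query: str, chunks: list[str],
           score: Callable[[str, str], float]) -> list[str]:
    """Order retrieved chunks by query relevance so the most precise
    information lands first in the prompt, where the LLM pays the
    most attention. `score` stands in for a real reranking model."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)
```

In production the scoring call is the expensive part; the orchestration around it stays this simple.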
How should the AI handle angry customers?
Detect anger early (sentiment scoring + semantic tension analysis), respond with de-escalation scripts, never argue, offer human handoff proactively, and flag the conversation for supervisor review. The goal is not to solve the problem with AI. It is to prevent the customer from leaving before a human can help.
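The "detect early, hand off proactively" rule can be sketched as a trigger function. The marker phrases and the sentiment threshold here are placeholder assumptions; real detection would use a sentiment model, not a keyword list.

```python
# Placeholder anger markers; a real system would use a sentiment
# model plus semantic tension analysis, not a keyword list.
ANGER_MARKERS = {"ridiculous", "unacceptable", "furious"}

def should_offer_handoff(turns: list[str],
                         sentiment_scores: list[float]) -> bool:
    """Offer a human proactively when an explicit anger marker appears
    in the latest turn, or when the last two turns both score strongly
    negative (threshold of -0.5 is illustrative)."""
    last = turns[-1].lower()
    if any(marker in last for marker in ANGER_MARKERS):
        return True
    return (len(sentiment_scores) >= 2
            and all(s < -0.5 for s in sentiment_scores[-2:]))
```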
What is the best speech-to-text engine for contact center AI?
The best STT depends on your stack. OpenAI Whisper with custom vocabulary injection works for most use cases (95% → 78% accuracy on noisy audio). Deepgram Nova-2 with keyword boosting is best for high-volume streaming where latency is critical (300ms first response).
Should you build or buy contact center AI?
Buy the components, build the orchestration. Use managed APIs for STT, LLM, and TTS. Build the policy engine, triage logic, escalation design, and agent assist surface yourself. These are your competitive differentiators. The infrastructure is a commodity.
What is a policy engine in contact center AI?
A policy engine is a rules layer between the LLM and the customer that enforces business rules: no refunds above thresholds without approval, no legal advice, no data disclosure without verification, and no ungrounded policy claims. It blocks, rewrites, or escalates responses that violate rules.
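A minimal sketch of such a rules layer, assuming hypothetical rule names and thresholds (the $500 approval limit and the forbidden-topic list are illustrative, not the real business policy):

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    ESCALATE = "escalate"

@dataclass
class Draft:
    text: str
    intent: str            # e.g. "refund", "order_status"
    refund_amount: float = 0.0
    grounded: bool = True  # did RAG supply a verified source?

# Hypothetical values; real thresholds come from business policy.
REFUND_APPROVAL_THRESHOLD = 500.0
FORBIDDEN_TOPICS = ("legal advice",)

def evaluate(draft: Draft) -> Verdict:
    """Rules layer between the LLM and the customer: block forbidden
    topics and ungrounded claims, escalate refunds above the approval
    threshold, otherwise allow."""
    if any(topic in draft.text.lower() for topic in FORBIDDEN_TOPICS):
        return Verdict.BLOCK
    if not draft.grounded:
        return Verdict.BLOCK
    if draft.intent == "refund" and draft.refund_amount > REFUND_APPROVAL_THRESHOLD:
        return Verdict.ESCALATE
    return Verdict.ALLOW
```

Ordering the checks from hard blocks to escalations matters: a response that is both ungrounded and over-threshold should be blocked, not merely escalated.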
Conclusion
The next winning contact-center AI companies will not be the ones with the flashiest chatbot. They will be the ones that understand operations deeply enough to make AI trusted by customers, agents, and executives.
That means designing for noise, latency, escalation, compliance, and ROI from day one. It means treating the model as one component in a system of verification, fallback, and human partnership. It means admitting that some conversations should never be automated—and building the architecture to hand them off gracefully.
What is the most important architecture decision in contact center AI? It is not which LLM you choose. It is how you design for failure when every component in your pipeline is having a bad day simultaneously.
The demo is not the product. The product is what happens when the demo ends.
This article reflects production experience building voice and chat AI agents for e-commerce and content platforms. If you are building contact-center AI and want to discuss architecture, escalation design, or agent-assist surfaces, I am open to conversation.