11  Retrieval-Augmented Generation (RAG) at Scale

Note: Chapter Overview

Retrieval-Augmented Generation combines the power of embedding-based retrieval with large language model generation, enabling LLMs to answer questions grounded in enterprise knowledge rather than relying solely on parametric memory. This chapter explores production RAG systems at scale: enterprise architecture patterns that handle billion-document corpora, context window optimization strategies that maximize information density while respecting token limits, multi-stage retrieval pipelines that balance recall and precision across filtering and reranking stages, evaluation frameworks that measure end-to-end quality beyond simple metrics, and techniques for handling contradictory information when sources disagree. These patterns enable RAG systems that serve millions of users with accurate, attributable, up-to-date responses.

With robust data engineering in place (Chapter 23), the foundation exists to build advanced applications that leverage embeddings at scale. Retrieval-Augmented Generation (RAG) has emerged as the dominant pattern for grounding large language models in enterprise knowledge. Rather than fine-tuning models on proprietary data (expensive, slow to update, risk of hallucination), RAG retrieves relevant context from vector databases and includes it in the LLM prompt. This approach enables accurate answers over billion-document corpora, maintains attribution to sources, updates knowledge in real-time, and scales to trillion-row datasets—all critical requirements for enterprise deployment.

11.1 Enterprise RAG Architecture Patterns

Production RAG systems serve thousands of concurrent users querying billion-document knowledge bases with sub-second latency and high accuracy. Enterprise RAG architectures decompose this challenge into specialized components: query understanding, retrieval, reranking, context assembly, generation, and response validation. Each component must scale independently while maintaining end-to-end quality.

11.1.1 The RAG Pipeline

A complete RAG system comprises six stages:

  1. Query Understanding: Parse user intent, extract entities, expand with synonyms
  2. Retrieval: Vector search for top-k relevant documents (k=100-1000)
  3. Reranking: Reorder results by relevance using cross-encoder (reduce to k=5-20)
  4. Context Assembly: Fit selected documents into context window
  5. Generation: LLM generates response given query + context
  6. Validation: Verify response accuracy, check for hallucinations
Show Vector Store Setup
from dataclasses import dataclass
from typing import List, Optional

import faiss
import numpy as np


@dataclass
class Document:
    """Document with embedding."""
    doc_id: str
    text: str
    embedding: Optional[np.ndarray] = None
    metadata: Optional[dict] = None

    def __post_init__(self):
        if self.metadata is None:
            self.metadata = {}


class VectorStore:
    """FAISS-based vector store for document retrieval."""
    def __init__(self, embedding_dim: int = 768):
        self.embedding_dim = embedding_dim
        self.index = faiss.IndexFlatIP(embedding_dim)
        self.documents: List[Document] = []

    def add_documents(self, documents: List[Document]):
        """Add documents to the vector store."""
        embeddings = np.array([doc.embedding for doc in documents]).astype('float32')
        faiss.normalize_L2(embeddings)
        self.index.add(embeddings)
        self.documents.extend(documents)

    def search(self, query_embedding: np.ndarray, k: int = 5) -> List[Document]:
        """Search for top-k most similar documents."""
        query_embedding = query_embedding.astype('float32').reshape(1, -1)
        faiss.normalize_L2(query_embedding)
        distances, indices = self.index.search(query_embedding, k)
        return [self.documents[i] for i in indices[0] if i != -1]  # FAISS pads with -1 when fewer than k results exist

# Usage example
store = VectorStore(embedding_dim=768)
docs = [
    Document(doc_id="1", text="Machine learning basics", embedding=np.random.rand(768)),
    Document(doc_id="2", text="Deep learning with PyTorch", embedding=np.random.rand(768))
]
store.add_documents(docs)
results = store.search(np.random.rand(768), k=2)
print(f"Found {len(results)} documents")
Found 2 documents
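
The six stages map naturally onto a thin orchestration layer. The sketch below wires placeholder implementations of each stage into a single answer() function, reusing the VectorStore and Document classes above; the query embedding, reranking scores, and LLM call are random or string stand-ins rather than real models.

import numpy as np


def answer(query: str, store: VectorStore, k_retrieve: int = 100, k_final: int = 5) -> dict:
    """Minimal end-to-end RAG pipeline; every model call is a placeholder."""
    # 1. Query understanding: normalize (real systems add intent and entity extraction)
    cleaned = query.strip().lower()

    # 2. Retrieval: high-recall vector search (random embedding stands in for an encoder)
    query_embedding = np.random.rand(768)
    candidates = store.search(query_embedding, k=min(k_retrieve, len(store.documents)))

    # 3. Reranking: random scores stand in for a cross-encoder over (query, doc) pairs
    reranked = sorted(candidates, key=lambda d: np.random.rand(), reverse=True)[:k_final]

    # 4. Context assembly: concatenate texts, keeping document IDs for attribution
    context = "\n".join(f"[{d.doc_id}] {d.text}" for d in reranked)

    # 5. Generation: placeholder for an LLM call with query + context in the prompt
    response = f"(LLM answer to '{cleaned}' grounded in {len(reranked)} documents)"

    # 6. Validation: trivial check that every cited ID appears in the assembled context
    validated = all(f"[{d.doc_id}]" in context for d in reranked)
    return {"response": response, "sources": [d.doc_id for d in reranked], "validated": validated}

result = answer("What is machine learning?", store)
print(f"Answer cites {len(result['sources'])} sources, validated={result['validated']}")
Answer cites 2 sources, validated=True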
Tip: Enterprise RAG Best Practices

Architecture:

  • Decouple components (retrieval, reranking, generation)
  • Use async/parallel processing where possible
  • Implement circuit breakers for each component
  • Cache frequent queries and intermediate results

Query Processing:

  • Always classify intent (different strategies per type)
  • Extract and normalize entities
  • Use query expansion for better recall
  • Parse metadata filters from natural language

Retrieval:

  • Start with high k (100-1000) for recall
  • Use multiple retrieval strategies (vector + keyword)
  • Apply metadata filters early (before reranking)
  • Log retrieval metrics for continuous improvement

Reranking:

  • Essential for production accuracy (10-20% improvement)
  • Use cross-encoder models (more accurate than bi-encoders)
  • Batch reranking requests for efficiency
  • Consider two-stage reranking (coarse then fine)
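
As a concrete illustration of the caching practice listed above, the sketch below shows a minimal in-process LRU cache for retrieval results, keyed on a hash of the normalized query string. Production systems would typically use a shared cache such as Redis; the size limit and eviction policy here are illustrative choices.

import hashlib
from collections import OrderedDict


class QueryCache:
    """Small in-process LRU cache for retrieval results."""
    def __init__(self, max_entries: int = 10000):
        self.max_entries = max_entries
        self._cache = OrderedDict()  # insertion-ordered dict gives us LRU eviction

    def _key(self, query: str) -> str:
        return hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()

    def get(self, query: str):
        key = self._key(query)
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        return None

    def put(self, query: str, results):
        key = self._key(query)
        self._cache[key] = results
        self._cache.move_to_end(key)
        if len(self._cache) > self.max_entries:
            self._cache.popitem(last=False)  # evict least recently used entry

# Usage: consult the cache before running retrieval
cache = QueryCache(max_entries=1000)
if cache.get("what is rag?") is None:
    cache.put("what is rag?", ["doc-1", "doc-7"])  # hypothetical document IDs
print(cache.get("what is rag?"))
['doc-1', 'doc-7']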

11.2 Context Window Optimization

LLMs have fixed context windows (4K-128K tokens), but enterprise knowledge bases contain millions of documents. Context window optimization maximizes information density: selecting the most relevant passages, removing redundancy, compressing verbose content, and structuring information for LLM comprehension.

11.2.1 The Context Window Challenge

Problem: Retrieved documents often exceed context limits.

  • 10 documents × 1,000 tokens each = 10K tokens
  • Typical LLM limit: 4K-8K tokens
  • Need to reduce 10K → 4K tokens while preserving key information

Naive approach: Truncate each document. Problem: this may cut off critical information and often removes conclusions.

Better approach: Extract relevant passages, deduplicate, and compress.

Show Passage Extractor
from typing import List, Tuple
import re


class PassageExtractor:
    """Extract relevant passages from long documents."""
    def __init__(self, max_passage_length: int = 512, overlap: int = 50):
        self.max_passage_length = max_passage_length  # maximum passage length in words
        self.overlap = overlap  # number of trailing sentences carried into the next passage

    def extract_passages(self, text: str) -> List[Tuple[str, int, int]]:
        """Split text into overlapping passages.

        Returns: List of (passage_text, start_idx, end_idx)
        """
        sentences = re.split(r'(?<=[.!?])\s+', text)
        passages = []
        current_passage = []
        current_length = 0
        start_idx = 0

        for sentence in sentences:
            sentence_length = len(sentence.split())

            if current_length + sentence_length > self.max_passage_length:
                if current_passage:
                    passage_text = ' '.join(current_passage)
                    end_idx = start_idx + len(passage_text)
                    passages.append((passage_text, start_idx, end_idx))

                    # Keep overlap (measured in sentences) so adjacent passages share context
                    overlap_text = current_passage[-self.overlap:]
                    current_passage = overlap_text + [sentence]
                    start_idx = end_idx - len(' '.join(overlap_text))
                    current_length = sum(len(s.split()) for s in current_passage)
                else:
                    # A single sentence exceeds the limit; keep it so it is not silently dropped
                    current_passage = [sentence]
                    current_length = sentence_length
            else:
                current_passage.append(sentence)
                current_length += sentence_length

        if current_passage:
            passage_text = ' '.join(current_passage)
            passages.append((passage_text, start_idx, start_idx + len(passage_text)))

        return passages

# Usage example
extractor = PassageExtractor(max_passage_length=100, overlap=20)
text = "This is a long document. " * 50
passages = extractor.extract_passages(text)
print(f"Extracted {len(passages)} passages from document")
print(f"First passage: {passages[0][0][:100]}...")
Extracted 32 passages from document
First passage: This is a long document. This is a long document. This is a long document. This is a long document. ...
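
Extraction alone still leaves near-duplicate passages that waste tokens. The sketch below deduplicates passages with word-level Jaccard similarity (a cheap stand-in for embedding-based similarity) and packs the survivors into a word budget; the 0.8 threshold and the word-count proxy for tokens are illustrative assumptions.

from typing import List


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two passages."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

def deduplicate(texts: List[str], threshold: float = 0.8) -> List[str]:
    """Keep the first occurrence of each near-duplicate passage."""
    kept: List[str] = []
    for text in texts:
        if all(jaccard(text, existing) < threshold for existing in kept):
            kept.append(text)
    return kept

def assemble_context(texts: List[str], budget_words: int = 3000) -> str:
    """Pack passages into a word budget (a rough proxy for tokens)."""
    selected, used = [], 0
    for text in texts:
        cost = len(text.split())
        if used + cost > budget_words:
            break
        selected.append(text)
        used += cost
    return "\n\n".join(selected)

# Usage with the passages extracted above
unique = deduplicate([p for p, _, _ in passages])
context = assemble_context(unique, budget_words=150)
print(f"{len(passages)} passages -> {len(unique)} after dedup, {len(context.split())} words in context")
32 passages -> 1 after dedup, 100 words in context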
Tip: Context Window Optimization Best Practices

Passage extraction:

  • Use sentence embeddings for relevance scoring
  • Keep consecutive sentences for narrative flow
  • Extract different amounts per query type (factual: less, explanation: more)

Deduplication:

  • Use MinHash or embeddings for semantic similarity
  • Set threshold based on acceptable information loss (0.8-0.9)
  • Keep first occurrence (usually most complete)

Token counting:

  • Use tokenizer from target LLM (different tokenizers vary)
  • Count precisely, don’t estimate (estimation errors compound)
  • Reserve tokens for query, instructions, output (typically 20-30%)

Hierarchical assembly:

  • Always include document titles/metadata
  • Prioritize key passages over full text
  • Add detail progressively until limit reached
Warning: Context Window Pitfalls

Common mistakes that degrade RAG quality:

Over-truncation: Cutting documents mid-sentence or mid-paragraph loses context. Solution: truncate at sentence/paragraph boundaries.

Lost citations: After extraction or summarization, claims can’t be attributed. Solution: maintain document IDs throughout processing.

Query not in context: The original query was not included in the prompt. Solution: always include the query, even if redundant.

Exceeding limit: Token estimation is off and actual usage exceeds the limit. Solution: use the actual tokenizer and add a 10% safety buffer.
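
For the last pitfall, counting with the model’s actual tokenizer is cheap and exact. A minimal sketch, assuming the target LLM uses a tiktoken-compatible tokenizer; the cl100k_base encoding, window size, and reservation values are illustrative placeholders.

# pip install tiktoken
import tiktoken


def fits_in_context(prompt: str, context_window: int = 8192,
                    reserved_for_output: int = 1024, safety_margin: float = 0.10) -> bool:
    """Count prompt tokens exactly and keep a safety buffer below the window."""
    enc = tiktoken.get_encoding("cl100k_base")  # assumption: compatible with the target LLM
    prompt_tokens = len(enc.encode(prompt))
    budget = int((context_window - reserved_for_output) * (1 - safety_margin))
    return prompt_tokens <= budget

print(fits_in_context("Answer using only the context below.\n" + "context " * 500))
True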

11.3 Multi-Stage Retrieval Systems

Single-stage retrieval (retrieve top-k, done) sacrifices either recall or latency. Multi-stage retrieval separates concerns: early stages optimize for recall (don’t miss relevant documents), later stages optimize for precision (rank best documents highest). This enables billion-document search with high accuracy and low latency.

11.3.1 The Multi-Stage Architecture

Stage 1: Coarse Retrieval (Recall-focused)

  • Goal: Don’t miss relevant documents
  • Method: Fast vector search (ANN)
  • Scale: Search full corpus (1B+ documents)
  • Output: Top-1000 candidates
  • Latency: 50-100ms

Stage 2: Reranking (Precision-focused)

  • Goal: Rank best documents highest
  • Method: Cross-encoder model
  • Scale: Rerank 1000 candidates
  • Output: Top-20 documents
  • Latency: 50-200ms

Stage 3: Final Selection (Context-focused)

  • Goal: Maximize context window utilization
  • Method: Passage extraction, deduplication
  • Scale: Process 20 documents
  • Output: Optimized context
  • Latency: 10-50ms

Show Multi-Stage Retriever
from typing import List
import numpy as np


class MultiStageRetriever:
    """Two-stage retrieval: fast first-stage, accurate second-stage."""
    def __init__(self, vector_store, reranker_model=None):
        self.vector_store = vector_store
        self.reranker_model = reranker_model

    def retrieve(self, query: str, query_embedding: np.ndarray,
                 k: int = 5, first_stage_k: int = 20) -> List[Document]:
        """Retrieve documents using two-stage approach.

        Stage 1: Fast vector search retrieves top-N candidates
        Stage 2: Reranker scores candidates and returns top-K
        """
        # Stage 1: Fast vector search
        candidates = self.vector_store.search(query_embedding, k=first_stage_k)

        # Stage 2: Rerank with more expensive model
        if self.reranker_model:
            scores = []
            for doc in candidates:
                # Placeholder: a real reranker scores each (query, doc.text) pair with a cross-encoder
                score = np.random.rand()
                scores.append(score)

            # Sort by reranker score
            ranked_indices = np.argsort(scores)[::-1]
            candidates = [candidates[i] for i in ranked_indices[:k]]

        return candidates[:k]

# Usage example
store = VectorStore(embedding_dim=768)
docs = [Document(doc_id=str(i), text=f"Doc {i}",
                 embedding=np.random.rand(768)) for i in range(100)]
store.add_documents(docs)

retriever = MultiStageRetriever(vector_store=store)
results = retriever.retrieve("sample query", np.random.rand(768), k=5, first_stage_k=20)
print(f"Retrieved {len(results)} documents after two-stage retrieval")
Retrieved 5 documents after two-stage retrieval
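
In production, the placeholder scores in the reranking stage above would come from a real cross-encoder. A minimal sketch, assuming the sentence-transformers package and its public cross-encoder/ms-marco-MiniLM-L-6-v2 checkpoint; substitute whatever reranker your stack standardizes on.

# pip install sentence-transformers
from typing import List

from sentence_transformers import CrossEncoder

# Load once at startup; loading per request would dominate latency
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: List[Document], k: int = 5) -> List[Document]:
    """Score each (query, document) pair with the cross-encoder and keep the top-k."""
    pairs = [(query, doc.text) for doc in candidates]
    scores = cross_encoder.predict(pairs)  # higher score = more relevant
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:k]]

# reranked = rerank("sample query", results, k=5)  # drop-in replacement for stage 2 above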
Tip: Multi-Stage Retrieval Best Practices

Stage separation:

  • Early stages: Fast, high recall (don’t miss relevant docs)
  • Later stages: Slow, high precision (rank best docs highest)
  • Each stage should reduce candidates 50-90%

Stage selection:

  • Always include: Vector retrieval (stage 1) + Reranking (stage 2)
  • Optional: Keyword filter, diversity filter, metadata filter
  • Add stages based on failure analysis (what’s missing? what’s wrong?)

Performance optimization:

  • Cache vector search results (query embeddings stable)
  • Batch reranking requests (100 docs × 1ms each = 100ms, batched = 20ms)
  • Run filters in parallel when possible (keyword + metadata)
  • Monitor stage latencies separately (find bottlenecks)

Quality monitoring:

  • Track recall @ each stage (is stage 1 missing relevant docs?)
  • Track precision @ each stage (is stage 2 improving ranking?)
  • A/B test stage variations (does keyword filter help?)

11.4 RAG Evaluation Frameworks

RAG systems combine retrieval and generation, requiring evaluation beyond standard IR or NLG metrics. RAG evaluation frameworks measure end-to-end quality: retrieval relevance, context utilization, answer accuracy, factual consistency, attribution quality, and user satisfaction.

11.4.1 The RAG Evaluation Challenge

Traditional IR metrics (Recall@k, MRR, NDCG):

  • Measure retrieval quality only
  • Don’t capture if LLM used retrieved context
  • Don’t measure answer accuracy

Traditional NLG metrics (BLEU, ROUGE, BERTScore):

  • Measure generation quality only
  • Don’t capture if answer grounded in context
  • Don’t detect hallucinations

RAG needs both, and more: did the system retrieve relevant documents AND generate an accurate answer grounded in those documents?
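
For the retrieval half of that question, standard IR metrics are a concrete starting point. A minimal sketch of Recall@k and MRR over a toy labeled set; the query IDs, document IDs, and relevance judgments are illustrative placeholders for real evaluation data.

from typing import Dict, List, Set, Tuple


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(retrieved: List[str], relevant: Set[str]) -> float:
    """Reciprocal rank of the first relevant document (0 if none retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Toy evaluation set: query id -> (retrieved doc IDs in rank order, relevant doc IDs)
eval_set: Dict[str, Tuple[List[str], Set[str]]] = {
    "q1": (["d3", "d1", "d9"], {"d1", "d2"}),
    "q2": (["d5", "d6", "d7"], {"d7"}),
}
avg_recall = sum(recall_at_k(r, rel, k=3) for r, rel in eval_set.values()) / len(eval_set)
avg_mrr = sum(mrr(r, rel) for r, rel in eval_set.values()) / len(eval_set)
print(f"Recall@3: {avg_recall:.2f}, MRR: {avg_mrr:.2f}")
Recall@3: 0.75, MRR: 0.42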

Show Hybrid Search
from typing import Dict, List
import numpy as np


class HybridSearch:
    """Combine dense (vector) and sparse (BM25) retrieval."""
    def __init__(self, vector_store, bm25_index=None, alpha: float = 0.5):
        self.vector_store = vector_store
        self.bm25_index = bm25_index
        self.alpha = alpha

    def search(self, query: str, query_embedding: np.ndarray, k: int = 5) -> List[Document]:
        """Hybrid search combining dense and sparse retrieval.

        Score = alpha * dense_score + (1 - alpha) * sparse_score
        """
        # Dense retrieval
        dense_results = self.vector_store.search(query_embedding, k=k*2)
        dense_scores = {doc.doc_id: 1.0 / (i + 1) for i, doc in enumerate(dense_results)}

        # Sparse retrieval (BM25)
        if self.bm25_index:
            # Placeholder: a real BM25 index would score the query against its own candidate set
            sparse_scores = {doc.doc_id: np.random.rand() for doc in dense_results}
        else:
            sparse_scores = {doc.doc_id: 0.0 for doc in dense_results}

        # Combine scores
        combined_scores = {}
        all_doc_ids = set(dense_scores.keys()) | set(sparse_scores.keys())

        for doc_id in all_doc_ids:
            dense_score = dense_scores.get(doc_id, 0.0)
            sparse_score = sparse_scores.get(doc_id, 0.0)
            combined_scores[doc_id] = self.alpha * dense_score + (1 - self.alpha) * sparse_score

        # Sort by combined score
        sorted_ids = sorted(combined_scores.keys(), key=lambda x: combined_scores[x], reverse=True)

        # Return top-k documents
        id_to_doc = {doc.doc_id: doc for doc in dense_results}
        return [id_to_doc[doc_id] for doc_id in sorted_ids[:k] if doc_id in id_to_doc]

# Usage example
store = VectorStore(embedding_dim=768)
docs = [Document(doc_id=str(i), text=f"Doc {i}",
                 embedding=np.random.rand(768)) for i in range(50)]
store.add_documents(docs)

hybrid = HybridSearch(vector_store=store, alpha=0.7)
results = hybrid.search("sample query", np.random.rand(768), k=5)
print(f"Hybrid search returned {len(results)} documents")
Hybrid search returned 5 documents
Tip: RAG Evaluation Best Practices

Evaluation data:

  • Start with 100-500 query-answer pairs
  • Cover diversity of query types (factual, how-to, comparison, etc.)
  • Include hard cases (contradictory docs, missing info, ambiguous queries)
  • Get human annotations for ground truth (expensive but essential)

Automated metrics:

  • Retrieval: Recall@10, Recall@100, MRR
  • Generation: Semantic similarity to ground truth (SentenceTransformers)
  • Faithfulness: NLI models (check entailment between context and answer)
  • Attribution: Check if citations support claims

Human evaluation:

  • Sample 10-20% for human review
  • Ask: Is answer accurate? Is answer complete? Are citations correct?
  • Use majority vote from 3+ annotators
  • Expensive but ground truth for calibrating automated metrics

Continuous evaluation:

  • Evaluate on every model/prompt change
  • Track metrics over time (detect regressions)
  • A/B test in production (measure user satisfaction)

11.5 Handling Contradictory Information

Real-world knowledge bases contain contradictions: different sources disagree, information becomes outdated, perspectives conflict. Contradiction handling strategies enable RAG systems to navigate disagreements: detecting conflicts, weighing source credibility, presenting multiple perspectives, and updating knowledge as information evolves.

11.5.1 The Contradiction Challenge

Types of contradictions:

  1. Temporal: Information changes over time
    • “Product price is $99” (2023) vs “$149” (2024)
    • Solution: Prioritize recent information
  2. Source disagreement: Different sources conflict
    • Source A: “API supports OAuth2” vs Source B: “API uses API keys”
    • Solution: Weigh by source authority/credibility
  3. Perspective differences: Subjective judgments vary
    • Review 1: “Excellent product” vs Review 2: “Poor quality”
    • Solution: Present multiple perspectives
  4. Partial vs complete: One source has partial information
    • Doc 1: “Supports Python” vs Doc 2: “Supports Python, Java, Go”
    • Solution: Prefer more complete information
Show Query Routing
from typing import Any, Dict, List
import numpy as np


class QueryRouter:
    """Route queries to appropriate retrieval strategy."""
    def __init__(self, strategies: Dict[str, Any]):
        self.strategies = strategies

    def route_query(self, query: str, query_embedding: np.ndarray) -> str:
        """Determine which retrieval strategy to use.

        Routes based on query type:
        - Factual queries -> Dense retrieval
        - Keyword queries -> Sparse retrieval (BM25)
        - Complex queries -> Hybrid retrieval
        """
        query_lower = query.lower()

        if any(word in query_lower for word in ['what', 'when', 'where', 'who']):
            return 'dense'
        elif len(query.split()) <= 3:
            return 'sparse'
        else:
            return 'hybrid'

    def retrieve(self, query: str, query_embedding: np.ndarray, k: int = 5) -> List[Document]:
        """Route and retrieve documents."""
        strategy_name = self.route_query(query, query_embedding)
        strategy = self.strategies.get(strategy_name)

        if strategy:
            # Handle different strategy interfaces
            if isinstance(strategy, VectorStore):
                return strategy.search(query_embedding, k=k)
            else:
                return strategy.search(query, query_embedding, k=k)
        else:
            first_strategy = list(self.strategies.values())[0]
            if isinstance(first_strategy, VectorStore):
                return first_strategy.search(query_embedding, k=k)
            else:
                return first_strategy.search(query, query_embedding, k=k)

# Usage example
store = VectorStore(embedding_dim=768)
docs = [Document(doc_id=str(i), text=f"Doc {i}",
                 embedding=np.random.rand(768)) for i in range(50)]
store.add_documents(docs)

strategies = {
    'dense': store,
    'hybrid': HybridSearch(vector_store=store, alpha=0.7)
}

router = QueryRouter(strategies=strategies)
results = router.retrieve("What is machine learning?", np.random.rand(768), k=5)
print(f"Query routed and retrieved {len(results)} documents")
Query routed and retrieved 5 documents
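
The resolution strategies listed above can be combined into a simple scoring rule. A minimal sketch that weighs recency against source authority and presents both claims when the scores are too close to call; the authority weights, the 25%-per-year recency decay, and the tie margin are illustrative assumptions.

from dataclasses import dataclass
from typing import List


@dataclass
class Claim:
    text: str
    source_type: str  # e.g. "official_docs", "community", "forum"
    year: int

# Illustrative authority weights per source type
AUTHORITY = {"official_docs": 1.0, "community": 0.7, "forum": 0.4}

def resolve(claims: List[Claim], current_year: int = 2024, tie_margin: float = 0.1) -> List[Claim]:
    """Score claims by recency + authority; return one winner, or both when too close to call."""
    def score(claim: Claim) -> float:
        recency = max(0.0, 1.0 - 0.25 * (current_year - claim.year))  # decay 25% per year
        return 0.5 * recency + 0.5 * AUTHORITY.get(claim.source_type, 0.5)

    ranked = sorted(claims, key=score, reverse=True)
    if len(ranked) > 1 and score(ranked[0]) - score(ranked[1]) < tie_margin:
        return ranked[:2]  # conflicting but comparable: present both perspectives
    return ranked[:1]

claims = [
    Claim("Product price is $99", "official_docs", 2023),
    Claim("Product price is $149", "community", 2024),
]
for claim in resolve(claims):
    print(f"{claim.text} ({claim.source_type}, {claim.year})")
Product price is $99 (official_docs, 2023)
Product price is $149 (community, 2024)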
Tip: Contradiction Handling Best Practices

Detection:

  • Use NLI models for semantic contradiction detection
  • Extract claims with high precision (false contradictions confuse users)
  • Focus on factual contradictions (prices, dates, specifications)
  • Ignore stylistic differences (different phrasings of same fact)

Resolution strategies:

  • Temporal: Always prefer recent information (but show date)
  • Source authority: Build credibility scores per source type
  • Confidence: Use when other signals unavailable
  • Present multiple: When confident both are valid (perspectives)

User experience:

  • Always show sources when contradictions exist
  • Indicate confidence level (“likely”, “possibly”, “conflicting sources”)
  • Provide dates when information might change
  • Allow users to see all perspectives (expandable sections)

Continuous improvement:

  • Log user selections when presented with contradictions
  • Update source credibility based on user preferences
  • Retrain contradiction detection on corrected examples
Warning: Contradiction Pitfalls

Over-resolving: Automatically picking one answer when both are valid. Example: “best database for X” has multiple valid answers. Solution: recognize when a question has multiple valid answers.

Temporal confusion: Using old information because it is higher quality. Example: a detailed 2022 guide vs a brief 2024 update. Solution: always prioritize recency for rapidly changing topics.

Authority bias: Always trusting the “authoritative” source. Example: official docs are outdated while community docs are current. Solution: consider recency and authority together.

Hidden contradictions: Not detecting subtle conflicts. Example: “Supports OAuth2” vs “Requires API keys” (implicit contradiction). Solution: use semantic contradiction detection, not just exact mismatches.

11.6 Conversational AI and Chatbots

RAG powers modern conversational AI systems—customer service bots, internal assistants, and domain-specific copilots. Embedding-based chatbots move beyond scripted responses to semantic understanding: matching user intent to relevant knowledge, maintaining conversation context, and generating grounded responses.

11.6.1 Intent Classification with Embeddings

Traditional chatbots use keyword matching or rule-based intent classification. Embedding-based systems understand semantic intent:

Show Intent Classifier
import numpy as np
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Intent:
    """Chatbot intent with example utterances."""
    name: str
    description: str
    examples: List[str]
    embedding: np.ndarray = None  # Centroid of example embeddings


class IntentClassifier:
    """Embedding-based intent classification for chatbots."""

    def __init__(self, intents: List[Intent], encoder):
        self.intents = intents
        self.encoder = encoder
        self._compute_intent_embeddings()

    def _compute_intent_embeddings(self):
        """Compute centroid embedding for each intent from examples."""
        for intent in self.intents:
            if intent.examples:
                example_embeddings = [self.encoder.encode(ex) for ex in intent.examples]
                intent.embedding = np.mean(example_embeddings, axis=0)

    def classify(self, user_message: str, threshold: float = 0.5) -> Tuple[str, float]:
        """Classify user message into intent with confidence score."""
        message_embedding = self.encoder.encode(user_message)

        best_intent = None
        best_score = -1

        for intent in self.intents:
            if intent.embedding is not None:
                # Cosine similarity
                score = np.dot(message_embedding, intent.embedding) / (
                    np.linalg.norm(message_embedding) * np.linalg.norm(intent.embedding)
                )
                if score > best_score:
                    best_score = score
                    best_intent = intent.name

        if best_score < threshold:
            return "unknown", best_score

        return best_intent, best_score

    def get_similar_examples(self, user_message: str, k: int = 3) -> List[Tuple[str, str, float]]:
        """Find most similar training examples for few-shot prompting."""
        message_embedding = self.encoder.encode(user_message)

        all_examples = []
        for intent in self.intents:
            for example in intent.examples:
                example_embedding = self.encoder.encode(example)
                score = np.dot(message_embedding, example_embedding) / (
                    np.linalg.norm(message_embedding) * np.linalg.norm(example_embedding)
                )
                all_examples.append((intent.name, example, score))

        all_examples.sort(key=lambda x: x[2], reverse=True)
        return all_examples[:k]


# Example usage with mock encoder
class MockEncoder:
    def encode(self, text):
        # In production, use sentence-transformers or similar.
        # Python's str hash is salted per process, so these embeddings (and the
        # printed confidence below) vary between runs.
        np.random.seed(hash(text) % 2**32)
        return np.random.randn(384)

encoder = MockEncoder()
intents = [
    Intent("order_status", "Check order status", ["Where is my order?", "Track my package", "Order status"]),
    Intent("return_request", "Request a return", ["I want to return this", "How do I return?", "Return policy"]),
    Intent("product_info", "Product information", ["Tell me about this product", "Product specs", "Features"]),
]

classifier = IntentClassifier(intents, encoder)
intent, confidence = classifier.classify("When will my package arrive?")
print(f"Intent: {intent}, Confidence: {confidence:.3f}")
Intent: unknown, Confidence: 0.061

11.6.2 Conversation Context Management

Chatbots must maintain context across conversation turns. Embeddings enable semantic context windows that retrieve relevant conversation history:

Show Conversation Manager
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np


@dataclass
class ConversationTurn:
    """Single turn in conversation."""
    role: str  # "user" or "assistant"
    content: str
    embedding: Optional[np.ndarray] = None
    timestamp: float = 0.0


@dataclass
class ConversationContext:
    """Manages conversation history with semantic retrieval."""
    turns: List[ConversationTurn] = field(default_factory=list)
    max_turns: int = 50

    def add_turn(self, role: str, content: str, encoder, timestamp: float = 0.0):
        """Add a turn to conversation history."""
        embedding = encoder.encode(content)
        turn = ConversationTurn(role=role, content=content, embedding=embedding, timestamp=timestamp)
        self.turns.append(turn)

        # Trim old turns if needed
        if len(self.turns) > self.max_turns:
            self.turns = self.turns[-self.max_turns:]

    def get_relevant_context(self, current_query: str, encoder, k: int = 5) -> List[ConversationTurn]:
        """Retrieve most relevant previous turns for current query."""
        if not self.turns:
            return []

        query_embedding = encoder.encode(current_query)

        # Score each turn by relevance
        scored_turns = []
        for i, turn in enumerate(self.turns[:-1]):  # Exclude current turn
            if turn.embedding is not None:
                similarity = np.dot(query_embedding, turn.embedding) / (
                    np.linalg.norm(query_embedding) * np.linalg.norm(turn.embedding)
                )
                # Boost recent turns slightly
                recency_boost = 0.1 * (i / len(self.turns))
                scored_turns.append((turn, similarity + recency_boost))

        # Sort by score and return top-k
        scored_turns.sort(key=lambda x: x[1], reverse=True)
        return [turn for turn, score in scored_turns[:k]]

    def build_context_prompt(self, current_query: str, encoder, max_tokens: int = 2000) -> str:
        """Build context string for LLM prompt."""
        relevant = self.get_relevant_context(current_query, encoder)

        context_parts = []
        token_estimate = 0

        for turn in relevant:
            turn_text = f"{turn.role}: {turn.content}"
            turn_tokens = len(turn_text.split()) * 1.3  # Rough token estimate

            if token_estimate + turn_tokens > max_tokens:
                break

            context_parts.append(turn_text)
            token_estimate += turn_tokens

        return "\n".join(context_parts)


# Example usage
context = ConversationContext()
encoder = MockEncoder()

context.add_turn("user", "I ordered a laptop last week", encoder)
context.add_turn("assistant", "I can help you track your laptop order. What's your order number?", encoder)
context.add_turn("user", "It's ORDER-12345", encoder)
context.add_turn("assistant", "Order ORDER-12345 shipped yesterday and should arrive Friday.", encoder)
context.add_turn("user", "What about the warranty?", encoder)

# Retrieve relevant context for warranty question
relevant = context.get_relevant_context("What about the warranty?", encoder, k=3)
print(f"Retrieved {len(relevant)} relevant turns for warranty question")
Retrieved 3 relevant turns for warranty question

11.6.3 Response Selection vs Generation

Chatbots can either select from pre-written responses or generate new ones. Embeddings enable hybrid approaches:

Show Hybrid Response System
from dataclasses import dataclass
from typing import List, Optional, Tuple
import numpy as np


@dataclass
class CannedResponse:
    """Pre-written response for common queries."""
    id: str
    intent: str
    response: str
    embedding: Optional[np.ndarray] = None


class HybridResponseSystem:
    """Combines response selection with RAG-based generation."""

    def __init__(self, canned_responses: List[CannedResponse], encoder,
                 selection_threshold: float = 0.85):
        self.responses = canned_responses
        self.encoder = encoder
        self.selection_threshold = selection_threshold
        self._compute_response_embeddings()

    def _compute_response_embeddings(self):
        """Pre-compute embeddings for canned responses."""
        for response in self.responses:
            response.embedding = self.encoder.encode(response.response)

    def get_response(self, user_query: str, intent: str) -> Tuple[str, str]:
        """
        Get response for user query.
        Returns (response_text, method) where method is 'selected' or 'generated'.
        """
        query_embedding = self.encoder.encode(user_query)

        # Find best matching canned response for this intent
        best_response = None
        best_score = -1

        for response in self.responses:
            if response.intent == intent and response.embedding is not None:
                score = np.dot(query_embedding, response.embedding) / (
                    np.linalg.norm(query_embedding) * np.linalg.norm(response.embedding)
                )
                if score > best_score:
                    best_score = score
                    best_response = response

        # If high confidence match, use canned response
        if best_score >= self.selection_threshold and best_response:
            return best_response.response, "selected"

        # Otherwise, would trigger RAG generation (placeholder)
        return f"[Generated response for: {user_query}]", "generated"


# Example usage
responses = [
    CannedResponse("r1", "order_status", "You can track your order at example.com/track"),
    CannedResponse("r2", "return_request", "Returns are accepted within 30 days. Visit example.com/returns"),
    CannedResponse("r3", "product_info", "Our products come with a 1-year warranty."),
]

system = HybridResponseSystem(responses, MockEncoder())
response, method = system.get_response("How do I track my package?", "order_status")
print(f"Response ({method}): {response}")
Response (generated): [Generated response for: How do I track my package?]
Tip: Conversational AI Best Practices

Intent Classification:

  • Few-shot examples: 5-10 examples per intent is often sufficient with good embeddings
  • Hierarchical intents: Parent → child classification for complex domains
  • Fallback handling: Route low-confidence queries to human agents or clarification
  • Active learning: Log low-confidence queries for labeling and model improvement

Context Management:

  • Semantic retrieval: Don’t just use last N turns—retrieve semantically relevant history
  • Entity tracking: Maintain extracted entities (order numbers, product names) across turns
  • Session boundaries: Clear context appropriately between sessions
  • Privacy: Exclude sensitive information from context retrieval

Response Strategy:

  • Canned for compliance: Use pre-written responses for legal, safety, policy questions
  • Generated for flexibility: Use RAG for complex, context-dependent queries
  • Hybrid routing: Classify query type to select response strategy
  • Guardrails: Always validate generated responses before sending

11.7 Embedding-Based Summarization

Summarization with embeddings identifies representative content—selecting sentences or passages that best capture document meaning. Unlike generative summarization, embedding-based approaches are extractive, selecting existing text rather than generating new text.

11.7.1 Representative Sentence Selection

The core idea: sentences with embeddings closest to the document centroid are most representative:

Show Extractive Summarizer
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class Sentence:
    """Sentence with embedding."""
    text: str
    embedding: np.ndarray
    position: int  # Position in original document


class ExtractiveSummarizer:
    """Embedding-based extractive summarization."""

    def __init__(self, encoder):
        self.encoder = encoder

    def summarize(self, document: str, num_sentences: int = 3,
                  diversity_weight: float = 0.3) -> List[str]:
        """
        Extract representative sentences from document.

        Args:
            document: Input text
            num_sentences: Number of sentences to extract
            diversity_weight: Balance between relevance (0) and diversity (1)
        """
        # Split into sentences (simplified)
        raw_sentences = [s.strip() for s in document.replace('!', '.').replace('?', '.').split('.') if s.strip()]

        if len(raw_sentences) <= num_sentences:
            return raw_sentences

        # Compute embeddings
        sentences = []
        for i, text in enumerate(raw_sentences):
            embedding = self.encoder.encode(text)
            sentences.append(Sentence(text=text, embedding=embedding, position=i))

        # Compute document centroid
        all_embeddings = np.array([s.embedding for s in sentences])
        centroid = np.mean(all_embeddings, axis=0)

        # Select sentences using MMR (Maximal Marginal Relevance)
        selected = []
        remaining = sentences.copy()

        for _ in range(num_sentences):
            best_sentence = None
            best_score = -float('inf')

            for sentence in remaining:
                # Relevance: similarity to centroid
                relevance = np.dot(sentence.embedding, centroid) / (
                    np.linalg.norm(sentence.embedding) * np.linalg.norm(centroid)
                )

                # Diversity: dissimilarity to already selected sentences
                if selected:
                    max_sim_to_selected = max(
                        np.dot(sentence.embedding, s.embedding) / (
                            np.linalg.norm(sentence.embedding) * np.linalg.norm(s.embedding)
                        )
                        for s in selected
                    )
                    diversity = 1 - max_sim_to_selected
                else:
                    diversity = 1

                # MMR score
                score = (1 - diversity_weight) * relevance + diversity_weight * diversity

                if score > best_score:
                    best_score = score
                    best_sentence = sentence

            if best_sentence:
                selected.append(best_sentence)
                remaining.remove(best_sentence)

        # Return in original document order
        selected.sort(key=lambda s: s.position)
        return [s.text for s in selected]

    def summarize_multi_document(self, documents: List[str], num_sentences: int = 5) -> List[str]:
        """Summarize multiple documents by finding representative sentences across all."""
        all_sentences = []

        for doc_idx, document in enumerate(documents):
            raw_sentences = [s.strip() for s in document.replace('!', '.').replace('?', '.').split('.') if s.strip()]
            for i, text in enumerate(raw_sentences):
                embedding = self.encoder.encode(text)
                all_sentences.append(Sentence(text=text, embedding=embedding, position=i + doc_idx * 1000))

        if len(all_sentences) <= num_sentences:
            return [s.text for s in all_sentences]

        # Compute global centroid
        all_embeddings = np.array([s.embedding for s in all_sentences])
        centroid = np.mean(all_embeddings, axis=0)

        # Score by distance to centroid
        scores = []
        for sentence in all_sentences:
            score = np.dot(sentence.embedding, centroid) / (
                np.linalg.norm(sentence.embedding) * np.linalg.norm(centroid)
            )
            scores.append((sentence, score))

        scores.sort(key=lambda x: x[1], reverse=True)
        return [s.text for s, _ in scores[:num_sentences]]


# Example usage
summarizer = ExtractiveSummarizer(MockEncoder())
document = """
Machine learning has transformed how we process data.
Deep learning models can recognize patterns in images and text.
Neural networks require large amounts of training data.
Transfer learning allows models to leverage pre-trained knowledge.
Embeddings represent data as dense vectors for similarity computation.
"""
summary = summarizer.summarize(document, num_sentences=2)
print(f"Summary ({len(summary)} sentences):")
for s in summary:
    print(f"  - {s}")
Summary (2 sentences):
  - Machine learning has transformed how we process data
  - Deep learning models can recognize patterns in images and text

11.7.2 Cluster-Based Summarization

For longer documents, cluster sentences first, then select representatives from each cluster:

Show Cluster-Based Summarizer
from typing import List, Dict
import numpy as np


class ClusterSummarizer:
    """Cluster-based summarization for long documents."""

    def __init__(self, encoder):
        self.encoder = encoder

    def summarize(self, document: str, num_clusters: int = 3) -> List[str]:
        """
        Summarize by clustering sentences and selecting cluster representatives.
        """
        # Split and embed sentences
        raw_sentences = [s.strip() for s in document.replace('!', '.').replace('?', '.').split('.') if s.strip()]

        if len(raw_sentences) <= num_clusters:
            return raw_sentences

        embeddings = np.array([self.encoder.encode(s) for s in raw_sentences])

        # Simple k-means clustering
        centroids = self._kmeans(embeddings, num_clusters)

        # Assign sentences to clusters
        clusters: Dict[int, List[tuple]] = {i: [] for i in range(num_clusters)}
        for i, (sentence, embedding) in enumerate(zip(raw_sentences, embeddings)):
            distances = [np.linalg.norm(embedding - c) for c in centroids]
            cluster_id = np.argmin(distances)
            clusters[cluster_id].append((sentence, embedding, i))

        # Select representative from each cluster (closest to centroid)
        representatives = []
        for cluster_id, members in clusters.items():
            if not members:
                continue

            centroid = centroids[cluster_id]
            best_sentence = min(
                members,
                key=lambda x: np.linalg.norm(x[1] - centroid)
            )
            representatives.append((best_sentence[0], best_sentence[2]))  # text, position

        # Return in document order
        representatives.sort(key=lambda x: x[1])
        return [text for text, _ in representatives]

    def _kmeans(self, embeddings: np.ndarray, k: int, max_iters: int = 100) -> np.ndarray:
        """Simple k-means clustering."""
        # Initialize centroids randomly
        indices = np.random.choice(len(embeddings), k, replace=False)
        centroids = embeddings[indices].copy()

        for _ in range(max_iters):
            # Assign points to nearest centroid
            assignments = []
            for emb in embeddings:
                distances = [np.linalg.norm(emb - c) for c in centroids]
                assignments.append(np.argmin(distances))

            # Update centroids
            new_centroids = []
            for i in range(k):
                cluster_points = embeddings[np.array(assignments) == i]
                if len(cluster_points) > 0:
                    new_centroids.append(cluster_points.mean(axis=0))
                else:
                    new_centroids.append(centroids[i])

            new_centroids = np.array(new_centroids)

            # Check convergence
            if np.allclose(centroids, new_centroids):
                break

            centroids = new_centroids

        return centroids


# Example
cluster_summarizer = ClusterSummarizer(MockEncoder())
long_doc = """
The economy grew by 3% this quarter. Employment rates improved significantly.
New technology startups raised record funding. AI companies led the investment surge.
Climate change policies face opposition. Environmental groups demand stronger action.
Sports teams prepare for the championship. Fans eagerly await the final matches.
"""
summary = cluster_summarizer.summarize(long_doc, num_clusters=3)
print(f"Cluster-based summary:")
for s in summary:
    print(f"  - {s}")
Cluster-based summary:
  - The economy grew by 3% this quarter
  - Employment rates improved significantly
  - Climate change policies face opposition
Tip: Summarization Best Practices

Extraction Strategy:

  • MMR for diversity: Avoid selecting redundant sentences
  • Position bias: First/last sentences often contain key information
  • Length normalization: Don’t over-favor short or long sentences
  • Cluster-based: For long documents, cluster then select representatives

Quality Considerations:

  • Coherence: Selected sentences should flow logically
  • Coverage: Summary should cover main topics, not just one aspect
  • Redundancy: Remove near-duplicate information
  • Context preservation: Include enough context for sentences to be understandable

Scale Considerations:

  • Pre-compute embeddings: For document collections, embed once and reuse
  • Hierarchical summarization: Summarize sections, then summarize summaries
  • Incremental updates: For streaming documents, maintain running summaries
  • Caching: Cache summaries for frequently accessed documents
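
As one example of the scale practices above, hierarchical summarization can reuse the extractive summarizer directly: summarize each section, then summarize the concatenation of the section summaries. A minimal sketch reusing the ExtractiveSummarizer and example documents defined above; the section list and sentence counts are illustrative.

def hierarchical_summarize(sections, summarizer, per_section: int = 2, final_sentences: int = 3):
    """Summarize each section, then summarize the concatenation of section summaries."""
    section_summaries = []
    for section in sections:
        section_summaries.extend(summarizer.summarize(section, num_sentences=per_section))
    intermediate = '. '.join(section_summaries) + '.'
    return summarizer.summarize(intermediate, num_sentences=final_sentences)

# Usage with the extractive summarizer and example documents from this section
sections = [document, long_doc]
final_summary = hierarchical_summarize(sections, summarizer, per_section=2, final_sentences=2)
print(f"Hierarchical summary has {len(final_summary)} sentences")
Hierarchical summary has 2 sentences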

11.8 Key Takeaways

  • RAG combines retrieval and generation for grounded LLM responses: Retrieving relevant context from vector databases enables accurate answers over billion-document corpora while maintaining attribution and enabling real-time knowledge updates

  • Enterprise RAG requires multi-component architecture: Query understanding, retrieval, reranking, context assembly, generation, and validation each play critical roles, and each must scale independently

  • Context window optimization maximizes information density: Passage extraction, deduplication, and hierarchical assembly enable fitting relevant information within LLM token limits while preserving key facts

  • Multi-stage retrieval balances recall and precision: Early stages (vector search) optimize for recall across billion-doc corpora, later stages (reranking, diversity) optimize for precision with expensive models on small candidate sets

  • RAG evaluation requires measuring beyond retrieval and generation: End-to-end metrics must capture retrieval relevance, context utilization, answer accuracy, factual consistency, attribution quality, and user satisfaction

  • Contradiction handling enables navigating disagreements in knowledge bases: Temporal resolution (prefer recent), source authority weighting (prefer credible), and multi-perspective presentation handle conflicts when sources disagree

  • Production RAG demands comprehensive engineering: Caching, batching, circuit breakers, monitoring, A/B testing, and continuous evaluation separate research prototypes from production systems serving millions of users

  • Conversational AI leverages embeddings for semantic intent matching: Embedding-based chatbots classify user intent from examples, retrieve semantically relevant conversation history, and combine canned responses with generated content for appropriate flexibility and compliance

  • Embedding-based summarization extracts representative content: Centroid-based selection and MMR diversity ensure summaries capture key information without redundancy, while cluster-based approaches handle long documents by selecting representatives from each topic cluster

11.9 Looking Ahead

This chapter demonstrated how RAG leverages embeddings for grounded generation at enterprise scale. Chapter 12 expands semantic search beyond text: multi-modal search across text, images, audio, and video; code search for software intelligence; scientific literature and patent search with domain-specific understanding; media and content discovery across creative assets; and knowledge graph integration for structured reasoning. These applications demonstrate embeddings’ versatility across diverse modalities and domains.

11.10 Further Reading

11.10.1 RAG Foundations

  • Lewis, Patrick, et al. (2020). “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” NeurIPS.
  • Guu, Kelvin, et al. (2020). “REALM: Retrieval-Augmented Language Model Pre-Training.” ICML.
  • Izacard, Gautier, et al. (2021). “Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering.” EACL.

11.10.2 Retrieval Systems

  • Karpukhin, Vladimir, et al. (2020). “Dense Passage Retrieval for Open-Domain Question Answering.” EMNLP.
  • Xiong, Lee, et al. (2021). “Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval.” ICLR.
  • Santhanam, Keshav, et al. (2022). “ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction.” NAACL.

11.10.3 Context Optimization

  • Jiang, Zhengbao, et al. (2023). “Long-Form Factuality in Large Language Models.” arXiv.
  • Liu, Nelson F., et al. (2023). “Lost in the Middle: How Language Models Use Long Contexts.” arXiv.
  • Press, Ofir, et al. (2022). “Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.” ICLR.

11.10.4 RAG Evaluation

  • Chen, Daixuan, et al. (2023). “CRUD-RAG: Benchmarking Retrieval-Augmented Generation for Time-Sensitive Knowledge.” arXiv.
  • Es, Shahul, et al. (2023). “RAGAS: Automated Evaluation of Retrieval Augmented Generation.” arXiv.
  • Liu, Yang, et al. (2023). “Evaluating the Factuality of Large Language Models.” ACL.

11.10.5 Production Systems

  • Anthropic (2023). “Claude 2 System Card.”
  • OpenAI (2023). “GPT-4 Technical Report.”
  • Thoppilan, Romal, et al. (2022). “LaMDA: Language Models for Dialog Applications.” arXiv.

11.10.6 Contradiction Detection

  • Welleck, Sean, et al. (2019). “Dialogue Natural Language Inference.” ACL.
  • Honovich, Or, et al. (2022). “TRUE: Re-evaluating Factual Consistency Evaluation.” NAACL.
  • Wang, Cunxiang, et al. (2020). “CARE: Commonsense-Aware Reasoning for Conversational AI.” ACL.

11.10.7 Multi-Stage Retrieval

  • Nogueira, Rodrigo, et al. (2019). “Passage Re-ranking with BERT.” arXiv.
  • Gao, Luyu, et al. (2021). “Rethink Training of BERT Rerankers in Multi-Stage Retrieval Pipeline.” ECIR.
  • Carbonell, Jaime, and Jade Goldstein (1998). “The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries.” SIGIR.