Retrieval-Augmented Generation combines the power of embedding-based retrieval with large language model generation, enabling LLMs to answer questions grounded in enterprise knowledge rather than relying solely on parametric memory. This chapter explores production RAG systems at scale: enterprise architecture patterns that handle billion-document corpora, context window optimization strategies that maximize information density while respecting token limits, multi-stage retrieval pipelines that balance recall and precision across filtering and reranking stages, evaluation frameworks that measure end-to-end quality beyond simple metrics, and techniques for handling contradictory information when sources disagree. These patterns enable RAG systems that serve millions of users with accurate, attributable, up-to-date responses.
With robust data engineering in place (Chapter 23), the foundation exists to build advanced applications that leverage embeddings at scale. Retrieval-Augmented Generation (RAG) has emerged as the dominant pattern for grounding large language models in enterprise knowledge. Rather than fine-tuning models on proprietary data (expensive, slow to update, and prone to hallucination), RAG retrieves relevant context from vector databases and includes it in the LLM prompt. This approach enables accurate answers over billion-document corpora, maintains attribution to sources, updates knowledge in real time, and scales to trillion-row datasets—all critical requirements for enterprise deployment.
11.1 Enterprise RAG Architecture Patterns
Production RAG systems serve thousands of concurrent users querying billion-document knowledge bases with sub-second latency and high accuracy. Enterprise RAG architectures decompose this challenge into specialized components: query understanding, retrieval, reranking, context assembly, generation, and response validation. Each component must scale independently while maintaining end-to-end quality.
11.1.1 The RAG Pipeline
A complete RAG system comprises six stages:
Query Understanding: Parse user intent, extract entities, expand with synonyms
Retrieval: Vector search for top-k relevant documents (k=100-1000)
Reranking: Reorder results by relevance using cross-encoder (reduce to k=5-20)
Context Assembly: Fit selected documents into context window
Generation: LLM generates response given query + context
Validation: Verify response accuracy, check for hallucinations
Vector store setup:

from dataclasses import dataclass
from typing import List, Optional
import faiss
import numpy as np

@dataclass
class Document:
    """Document with embedding."""
    doc_id: str
    text: str
    embedding: Optional[np.ndarray] = None
    metadata: dict = None

    def __post_init__(self):
        if self.metadata is None:
            self.metadata = {}

class VectorStore:
    """FAISS-based vector store for document retrieval."""

    def __init__(self, embedding_dim: int = 768):
        self.embedding_dim = embedding_dim
        self.index = faiss.IndexFlatIP(embedding_dim)
        self.documents: List[Document] = []

    def add_documents(self, documents: List[Document]):
        """Add documents to the vector store."""
        embeddings = np.array([doc.embedding for doc in documents]).astype('float32')
        faiss.normalize_L2(embeddings)
        self.index.add(embeddings)
        self.documents.extend(documents)

    def search(self, query_embedding: np.ndarray, k: int = 5) -> List[Document]:
        """Search for top-k most similar documents."""
        query_embedding = query_embedding.astype('float32').reshape(1, -1)
        faiss.normalize_L2(query_embedding)
        distances, indices = self.index.search(query_embedding, k)
        return [self.documents[i] for i in indices[0]]

# Usage example
store = VectorStore(embedding_dim=768)
docs = [
    Document(doc_id="1", text="Machine learning basics", embedding=np.random.rand(768)),
    Document(doc_id="2", text="Deep learning with PyTorch", embedding=np.random.rand(768))
]
store.add_documents(docs)
results = store.search(np.random.rand(768), k=2)
print(f"Found {len(results)} documents")
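The later stages (reranking, context assembly, generation, validation) are covered throughout this chapter. As a bridge, here is a minimal orchestration sketch that wires the six stages around the VectorStore defined above. This is a sketch under assumptions, not a reference implementation: the rerank and generate callables are placeholder stand-ins introduced only for illustration.

import numpy as np
from typing import Callable, List

def run_rag_pipeline(
    query: str,
    query_embedding: np.ndarray,
    store: VectorStore,                                   # VectorStore from the block above
    rerank: Callable[[str, List[Document]], List[Document]],
    generate: Callable[[str, List[Document]], str],
    k_retrieve: int = 100,
    k_rerank: int = 5,
) -> str:
    """Minimal end-to-end flow: retrieve -> rerank -> assemble context -> generate."""
    # Retrieval: recall-oriented, large k
    candidates = store.search(query_embedding, k=min(k_retrieve, len(store.documents)))
    # Reranking: precision-oriented, small k
    top_docs = rerank(query, candidates)[:k_rerank]
    # Context assembly + generation (placeholders here; real systems call an LLM)
    return generate(query, top_docs)

# Placeholder stages for illustration only
identity_rerank = lambda query, docs: docs
stub_generate = lambda query, docs: f"Answer to {query!r} grounded in {len(docs)} documents"

print(run_rag_pipeline("What is RAG?", np.random.rand(768), store, identity_rerank, stub_generate))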
Best practices for each pipeline stage:
Query Understanding:
Always classify intent (different strategies per type)
Extract and normalize entities
Use query expansion for better recall
Parse metadata filters from natural language
Retrieval:
Start with high k (100-1000) for recall
Use multiple retrieval strategies (vector + keyword)
Apply metadata filters early (before reranking)
Log retrieval metrics for continuous improvement
Reranking:
Essential for production accuracy (10-20% improvement)
Use cross-encoder models (more accurate than bi-encoders); see the sketch after this list
Batch reranking requests for efficiency
Consider two-stage reranking (coarse then fine)
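As referenced above, a cross-encoder scores each (query, document) pair jointly rather than comparing pre-computed embeddings. The sketch below assumes the sentence-transformers package and one commonly used public checkpoint; the integration with the earlier Document objects is an illustrative assumption, not the chapter's reference implementation.

from sentence_transformers import CrossEncoder  # assumed dependency

def rerank_with_cross_encoder(query, documents, top_k=10,
                              model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    """Rerank retrieved Documents by jointly scoring (query, text) pairs."""
    # In production, load the model once and reuse it across requests
    model = CrossEncoder(model_name)
    # Batch all pairs into one predict() call for efficiency
    pairs = [(query, doc.text) for doc in documents]
    scores = model.predict(pairs)
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

# Usage (assumes `store` from the vector store example)
# candidates = store.search(query_embedding, k=100)
# top_docs = rerank_with_cross_encoder("how do embeddings work?", candidates, top_k=5)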
11.2 Context Window Optimization
LLMs have fixed context windows (4K-128K tokens), but enterprise knowledge bases contain millions of documents. Context window optimization maximizes information density: selecting the most relevant passages, removing redundancy, compressing verbose content, and structuring information for LLM comprehension.
11.2.1 The Context Window Challenge
Problem: Retrieved documents often exceed context limits
10 documents × 1000 tokens each = 10K tokens
Typical LLM limit: 4K-8K tokens
Need to reduce 10K → 4K while preserving key information
Naive approach: Truncate each document
Problem: May cut off critical information, often removes conclusions
from typing import List, Tuple
import re

class PassageExtractor:
    """Extract relevant passages from long documents."""

    def __init__(self, max_passage_length: int = 512, overlap: int = 50):
        self.max_passage_length = max_passage_length
        self.overlap = overlap

    def extract_passages(self, text: str) -> List[Tuple[str, int, int]]:
        """Split text into overlapping passages.

        Returns:
            List of (passage_text, start_idx, end_idx)
        """
        sentences = re.split(r'(?<=[.!?])\s+', text)
        passages = []
        current_passage = []
        current_length = 0
        start_idx = 0
        for sentence in sentences:
            sentence_length = len(sentence.split())
            if current_length + sentence_length > self.max_passage_length:
                if current_passage:
                    passage_text = ' '.join(current_passage)
                    end_idx = start_idx + len(passage_text)
                    passages.append((passage_text, start_idx, end_idx))
                    # Keep overlap
                    overlap_text = current_passage[-self.overlap:]
                    current_passage = overlap_text + [sentence]
                    start_idx = end_idx - len(' '.join(overlap_text))
                    current_length = sum(len(s.split()) for s in current_passage)
            else:
                current_passage.append(sentence)
                current_length += sentence_length
        if current_passage:
            passage_text = ' '.join(current_passage)
            passages.append((passage_text, start_idx, start_idx + len(passage_text)))
        return passages

# Usage example
extractor = PassageExtractor(max_passage_length=100, overlap=20)
text = "This is a long document. " * 50
passages = extractor.extract_passages(text)
print(f"Extracted {len(passages)} passages from document")
print(f"First passage: {passages[0][0][:100]}...")
Extracted 32 passages from document
First passage: This is a long document. This is a long document. This is a long document. This is a long document. ...
Tip: Context Window Optimization Best Practices
Passage extraction:
Use sentence embeddings for relevance scoring
Keep consecutive sentences for narrative flow
Extract different amounts per query type (factual: less, explanation: more)
Deduplication:
Use MinHash or embeddings for semantic similarity
Set threshold based on acceptable information loss (0.8-0.9)
Keep first occurrence (usually most complete)
Token counting:
Use the tokenizer of the target LLM (token counts differ across tokenizers)
Reserve tokens for the query, instructions, and output (typically 20-30%); see the assembly sketch after this list
Hierarchical assembly:
Always include document titles/metadata
Prioritize key passages over full text
Add detail progressively until limit reached
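A minimal sketch of token-budgeted context assembly following the checklist above: count tokens with a real tokenizer (tiktoken here, as an assumed stand-in for the target LLM's tokenizer), skip near-duplicate passages by embedding similarity, and add passages in priority order until the budget (minus a safety reserve) is exhausted. The function and parameter names are illustrative.

import numpy as np
import tiktoken  # assumed tokenizer; swap in the target LLM's tokenizer

def assemble_context(passages, embeddings, budget_tokens=4000,
                     reserve_frac=0.25, dedup_threshold=0.9):
    """Greedily pack passages into a token budget, skipping near-duplicates.

    passages: list of (doc_id, text) in priority order (e.g., reranked)
    embeddings: list of np.ndarray aligned with passages
    """
    enc = tiktoken.get_encoding("cl100k_base")
    # Reserve tokens for the query, instructions, and output
    budget = int(budget_tokens * (1 - reserve_frac))
    selected, selected_embs, used = [], [], 0
    for (doc_id, text), emb in zip(passages, embeddings):
        # Semantic deduplication against already-selected passages
        if any(np.dot(emb, s) / (np.linalg.norm(emb) * np.linalg.norm(s)) > dedup_threshold
               for s in selected_embs):
            continue
        cost = len(enc.encode(text))
        if used + cost > budget:
            break
        selected.append(f"[{doc_id}] {text}")   # keep doc IDs for attribution
        selected_embs.append(emb)
        used += cost
    return "\n\n".join(selected)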
Warning: Context Window Pitfalls
Common mistakes that degrade RAG quality:
Over-truncation: Cutting documents mid-sentence or mid-paragraph loses context - Solution: Truncate at sentence/paragraph boundaries
Lost citations: After extraction/summarization, can’t attribute claims - Solution: Maintain document IDs throughout processing
Query not in context: Forgot to include original query in prompt - Solution: Always include query, even if redundant
Exceeding limit: Token estimation off, actual usage exceeds limit - Solution: Use actual tokenizer, add 10% safety buffer
11.3 Multi-Stage Retrieval Systems
Single-stage retrieval (retrieve top-k, done) sacrifices either recall or latency. Multi-stage retrieval separates concerns: early stages optimize for recall (don’t miss relevant documents), later stages optimize for precision (rank best documents highest). This enables billion-document search with high accuracy and low latency.
11.3.1 The Multi-Stage Architecture
Stage 1: Coarse Retrieval (Recall-focused)
Goal: Don’t miss relevant documents
Method: Fast vector search (ANN)
Scale: Search full corpus (1B+ documents)
Output: Top-1000 candidates
Latency: 50-100ms
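The flat index in the earlier VectorStore scans every vector; at billion-document scale, the coarse stage typically uses an approximate index instead. Below is a minimal FAISS IVF sketch of this coarse stage; the corpus is random stand-in data and the nlist/nprobe values are illustrative, not tuned recommendations.

import faiss
import numpy as np

d, nlist = 768, 1024             # embedding dim, number of coarse clusters (illustrative)
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)

xb = np.random.rand(100_000, d).astype("float32")   # stand-in corpus embeddings
faiss.normalize_L2(xb)
index.train(xb)                  # learn the coarse clusters
index.add(xb)

index.nprobe = 16                # clusters probed per query: recall/latency trade-off
xq = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(xq)
scores, ids = index.search(xq, 1000)   # top-1000 candidates passed to later stages
print(ids.shape)                 # (1, 1000)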
Track precision at each stage (is stage 2 improving the ranking?)
A/B test stage variations (does keyword filter help?)
11.4 RAG Evaluation Frameworks
RAG systems combine retrieval and generation, requiring evaluation beyond standard IR or NLG metrics. RAG evaluation frameworks measure end-to-end quality: retrieval relevance, context utilization, answer accuracy, factual consistency, attribution quality, and user satisfaction.
11.4.1 The RAG Evaluation Challenge
Traditional IR metrics (Recall@k, MRR, NDCG):
Measure retrieval quality only
Don’t capture if LLM used retrieved context
Don’t measure answer accuracy
Traditional NLG metrics (BLEU, ROUGE, BERTScore):
Measure generation quality only
Don’t capture if answer grounded in context
Don’t detect hallucinations
RAG evaluation needs both, and more: did the system retrieve relevant documents AND generate an accurate answer grounded in those documents?
Hybrid search (dense + sparse retrieval):

from typing import Dict
import numpy as np

class HybridSearch:
    """Combine dense (vector) and sparse (BM25) retrieval."""

    def __init__(self, vector_store, bm25_index=None, alpha: float = 0.5):
        self.vector_store = vector_store
        self.bm25_index = bm25_index
        self.alpha = alpha

    def search(self, query: str, query_embedding: np.ndarray, k: int = 5) -> List[Document]:
        """Hybrid search combining dense and sparse retrieval.

        Score = alpha * dense_score + (1 - alpha) * sparse_score
        """
        # Dense retrieval
        dense_results = self.vector_store.search(query_embedding, k=k*2)
        dense_scores = {doc.doc_id: 1.0 / (i + 1) for i, doc in enumerate(dense_results)}

        # Sparse retrieval (BM25)
        if self.bm25_index:
            sparse_scores = {doc.doc_id: np.random.rand() for doc in dense_results}
        else:
            sparse_scores = {doc.doc_id: 0.0 for doc in dense_results}

        # Combine scores
        combined_scores = {}
        all_doc_ids = set(dense_scores.keys()) | set(sparse_scores.keys())
        for doc_id in all_doc_ids:
            dense_score = dense_scores.get(doc_id, 0.0)
            sparse_score = sparse_scores.get(doc_id, 0.0)
            combined_scores[doc_id] = self.alpha * dense_score + (1 - self.alpha) * sparse_score

        # Sort by combined score
        sorted_ids = sorted(combined_scores.keys(), key=lambda x: combined_scores[x], reverse=True)

        # Return top-k documents
        id_to_doc = {doc.doc_id: doc for doc in dense_results}
        return [id_to_doc[doc_id] for doc_id in sorted_ids[:k] if doc_id in id_to_doc]

# Usage example
store = VectorStore(embedding_dim=768)
docs = [Document(doc_id=str(i), text=f"Doc {i}", embedding=np.random.rand(768)) for i in range(50)]
store.add_documents(docs)
hybrid = HybridSearch(vector_store=store, alpha=0.7)
results = hybrid.search("sample query", np.random.rand(768), k=5)
print(f"Hybrid search returned {len(results)} documents")
Hybrid search returned 5 documents
Tip: RAG Evaluation Best Practices
Evaluation data:
Start with 100-500 query-answer pairs
Cover diversity of query types (factual, how-to, comparison, etc.)
Include hard cases (contradictory docs, missing info, ambiguous queries)
Get human annotations for ground truth (expensive but essential)
Automated metrics:
Retrieval: Recall@10, Recall@100, MRR (see the metrics sketch after this list)
Generation: Semantic similarity to ground truth (SentenceTransformers)
Faithfulness: NLI models (check entailment between context and answer)
Attribution: Check if citations support claims
Human evaluation:
Sample 10-20% for human review
Ask: Is answer accurate? Is answer complete? Are citations correct?
Use majority vote from 3+ annotators
Expensive but ground truth for calibrating automated metrics
Continuous evaluation:
Evaluate on every model/prompt change
Track metrics over time (detect regressions)
A/B test in production (measure user satisfaction)
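A minimal sketch of the automated retrieval metrics listed above (Recall@k and MRR) over a small labeled evaluation set. The data format used here is an illustrative assumption; in practice the retrieved rankings come from the pipeline under test and the relevant-document sets come from human annotation.

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank of the first relevant document (0 if none retrieved)."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Evaluation set format (assumed): query -> (retrieved ranking, ground-truth relevant ids)
eval_set = {
    "how do I return an item?": (["d7", "d2", "d9"], {"d2"}),
    "track my order": (["d1", "d4", "d8"], {"d4", "d8"}),
}
print("Recall@10:", sum(recall_at_k(r, rel, 10) for r, rel in eval_set.values()) / len(eval_set))
print("MRR:      ", sum(mrr(r, rel) for r, rel in eval_set.values()) / len(eval_set))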
11.5 Handling Contradictory Information
Real-world knowledge bases contain contradictions: different sources disagree, information becomes outdated, perspectives conflict. Contradiction handling strategies enable RAG systems to navigate disagreements: detecting conflicts, weighing source credibility, presenting multiple perspectives, and updating knowledge as information evolves.
11.5.1 The Contradiction Challenge
Types of contradictions:
Temporal: Information changes over time
“Product price is $99” (2023) vs “$149” (2024)
Solution: Prioritize recent information (a recency-weighted scoring sketch follows this list)
Source disagreement: Different sources conflict
Source A: “API supports OAuth2” vs Source B: “API uses API keys”
Solution: Allow users to see all perspectives (expandable sections)
Continuous improvement:
Log user selections when presented with contradictions
Update source credibility based on user preferences
Retrain contradiction detection on corrected examples
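One simple way to act on the temporal and authority signals above is to score each conflicting source by a weighted combination of recency and credibility, then either pick the top source or surface all perspectives when scores are close. The Source structure, the weights, and the half-life value are illustrative assumptions.

from dataclasses import dataclass
import time

@dataclass
class Source:
    doc_id: str
    claim: str
    timestamp: float      # unix time of last update
    authority: float      # 0-1 credibility score (e.g., official docs vs forum post)

def resolve_conflict(sources, recency_weight=0.6, half_life_days=180.0, margin=0.1):
    """Rank conflicting sources; return (winner, present_all) where present_all signals
    that scores are too close to silently pick one answer."""
    now = time.time()
    def score(s):
        age_days = (now - s.timestamp) / 86400.0
        recency = 0.5 ** (age_days / half_life_days)     # exponential decay with age
        return recency_weight * recency + (1 - recency_weight) * s.authority
    ranked = sorted(sources, key=score, reverse=True)
    present_all = len(ranked) > 1 and score(ranked[0]) - score(ranked[1]) < margin
    return ranked[0], present_all

sources = [
    Source("a", "Product price is $99", timestamp=time.time() - 500 * 86400, authority=0.9),
    Source("b", "Product price is $149", timestamp=time.time() - 30 * 86400, authority=0.6),
]
winner, show_both = resolve_conflict(sources)
print(winner.claim, "| show both perspectives:", show_both)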
Warning: Contradiction Pitfalls
Over-resolving: Automatically picking one answer when both are valid - Example: “Best database for X” has multiple valid answers - Solution: Recognize when question has multiple valid answers
Temporal confusion: Using old information because it’s higher quality - Example: Detailed 2022 guide vs brief 2024 update - Solution: Always prioritize recency for rapidly changing topics
Authority bias: Always trusting “authoritative” source - Example: Official docs outdated, community docs current - Solution: Consider recency + authority together
Hidden contradictions: Not detecting subtle conflicts - Example: “Supports OAuth2” vs “Requires API keys” (implicit contradiction) - Solution: Use semantic contradiction detection, not just exact mismatches
11.6 Conversational AI and Chatbots
RAG powers modern conversational AI systems—customer service bots, internal assistants, and domain-specific copilots. Embedding-based chatbots move beyond scripted responses to semantic understanding: matching user intent to relevant knowledge, maintaining conversation context, and generating grounded responses.
11.6.1 Intent Classification with Embeddings
Traditional chatbots use keyword matching or rule-based intent classification. Embedding-based systems understand semantic intent:
Intent classifier:

import numpy as np
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Intent:
    """Chatbot intent with example utterances."""
    name: str
    description: str
    examples: List[str]
    embedding: np.ndarray = None  # Centroid of example embeddings

class IntentClassifier:
    """Embedding-based intent classification for chatbots."""

    def __init__(self, intents: List[Intent], encoder):
        self.intents = intents
        self.encoder = encoder
        self._compute_intent_embeddings()

    def _compute_intent_embeddings(self):
        """Compute centroid embedding for each intent from examples."""
        for intent in self.intents:
            if intent.examples:
                example_embeddings = [self.encoder.encode(ex) for ex in intent.examples]
                intent.embedding = np.mean(example_embeddings, axis=0)

    def classify(self, user_message: str, threshold: float = 0.5) -> Tuple[str, float]:
        """Classify user message into intent with confidence score."""
        message_embedding = self.encoder.encode(user_message)
        best_intent = None
        best_score = -1
        for intent in self.intents:
            if intent.embedding is not None:
                # Cosine similarity
                score = np.dot(message_embedding, intent.embedding) / (
                    np.linalg.norm(message_embedding) * np.linalg.norm(intent.embedding)
                )
                if score > best_score:
                    best_score = score
                    best_intent = intent.name
        if best_score < threshold:
            return "unknown", best_score
        return best_intent, best_score

    def get_similar_examples(self, user_message: str, k: int = 3) -> List[Tuple[str, str, float]]:
        """Find most similar training examples for few-shot prompting."""
        message_embedding = self.encoder.encode(user_message)
        all_examples = []
        for intent in self.intents:
            for example in intent.examples:
                example_embedding = self.encoder.encode(example)
                score = np.dot(message_embedding, example_embedding) / (
                    np.linalg.norm(message_embedding) * np.linalg.norm(example_embedding)
                )
                all_examples.append((intent.name, example, score))
        all_examples.sort(key=lambda x: x[2], reverse=True)
        return all_examples[:k]

# Example usage with mock encoder
class MockEncoder:
    def encode(self, text):
        # In production, use sentence-transformers or similar
        np.random.seed(hash(text) % 2**32)
        return np.random.randn(384)

encoder = MockEncoder()
intents = [
    Intent("order_status", "Check order status", ["Where is my order?", "Track my package", "Order status"]),
    Intent("return_request", "Request a return", ["I want to return this", "How do I return?", "Return policy"]),
    Intent("product_info", "Product information", ["Tell me about this product", "Product specs", "Features"]),
]
classifier = IntentClassifier(intents, encoder)
intent, confidence = classifier.classify("When will my package arrive?")
print(f"Intent: {intent}, Confidence: {confidence:.3f}")
Intent: unknown, Confidence: 0.061
11.6.2 Conversation Context Management
Chatbots must maintain context across conversation turns. Embeddings enable semantic context windows that retrieve relevant conversation history:
Conversation manager:

from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np

@dataclass
class ConversationTurn:
    """Single turn in conversation."""
    role: str  # "user" or "assistant"
    content: str
    embedding: Optional[np.ndarray] = None
    timestamp: float = 0.0

@dataclass
class ConversationContext:
    """Manages conversation history with semantic retrieval."""
    turns: List[ConversationTurn] = field(default_factory=list)
    max_turns: int = 50

    def add_turn(self, role: str, content: str, encoder, timestamp: float = 0.0):
        """Add a turn to conversation history."""
        embedding = encoder.encode(content)
        turn = ConversationTurn(role=role, content=content, embedding=embedding, timestamp=timestamp)
        self.turns.append(turn)
        # Trim old turns if needed
        if len(self.turns) > self.max_turns:
            self.turns = self.turns[-self.max_turns:]

    def get_relevant_context(self, current_query: str, encoder, k: int = 5) -> List[ConversationTurn]:
        """Retrieve most relevant previous turns for current query."""
        if not self.turns:
            return []
        query_embedding = encoder.encode(current_query)
        # Score each turn by relevance
        scored_turns = []
        for i, turn in enumerate(self.turns[:-1]):  # Exclude current turn
            if turn.embedding is not None:
                similarity = np.dot(query_embedding, turn.embedding) / (
                    np.linalg.norm(query_embedding) * np.linalg.norm(turn.embedding)
                )
                # Boost recent turns slightly
                recency_boost = 0.1 * (i / len(self.turns))
                scored_turns.append((turn, similarity + recency_boost))
        # Sort by score and return top-k
        scored_turns.sort(key=lambda x: x[1], reverse=True)
        return [turn for turn, score in scored_turns[:k]]

    def build_context_prompt(self, current_query: str, encoder, max_tokens: int = 2000) -> str:
        """Build context string for LLM prompt."""
        relevant = self.get_relevant_context(current_query, encoder)
        context_parts = []
        token_estimate = 0
        for turn in relevant:
            turn_text = f"{turn.role}: {turn.content}"
            turn_tokens = len(turn_text.split()) * 1.3  # Rough token estimate
            if token_estimate + turn_tokens > max_tokens:
                break
            context_parts.append(turn_text)
            token_estimate += turn_tokens
        return "\n".join(context_parts)

# Example usage
context = ConversationContext()
encoder = MockEncoder()
context.add_turn("user", "I ordered a laptop last week", encoder)
context.add_turn("assistant", "I can help you track your laptop order. What's your order number?", encoder)
context.add_turn("user", "It's ORDER-12345", encoder)
context.add_turn("assistant", "Order ORDER-12345 shipped yesterday and should arrive Friday.", encoder)
context.add_turn("user", "What about the warranty?", encoder)

# Retrieve relevant context for warranty question
relevant = context.get_relevant_context("What about the warranty?", encoder, k=3)
print(f"Retrieved {len(relevant)} relevant turns for warranty question")
Retrieved 3 relevant turns for warranty question
11.6.3 Response Selection vs Generation
Chatbots can either select from pre-written responses or generate new ones. Embeddings enable hybrid approaches:
Hybrid response system:

from dataclasses import dataclass
from typing import List, Optional, Tuple
import numpy as np

@dataclass
class CannedResponse:
    """Pre-written response for common queries."""
    id: str
    intent: str
    response: str
    embedding: Optional[np.ndarray] = None

class HybridResponseSystem:
    """Combines response selection with RAG-based generation."""

    def __init__(self, canned_responses: List[CannedResponse], encoder, selection_threshold: float = 0.85):
        self.responses = canned_responses
        self.encoder = encoder
        self.selection_threshold = selection_threshold
        self._compute_response_embeddings()

    def _compute_response_embeddings(self):
        """Pre-compute embeddings for canned responses."""
        for response in self.responses:
            response.embedding = self.encoder.encode(response.response)

    def get_response(self, user_query: str, intent: str) -> Tuple[str, str]:
        """
        Get response for user query.

        Returns (response_text, method) where method is 'selected' or 'generated'.
        """
        query_embedding = self.encoder.encode(user_query)
        # Find best matching canned response for this intent
        best_response = None
        best_score = -1
        for response in self.responses:
            if response.intent == intent and response.embedding is not None:
                score = np.dot(query_embedding, response.embedding) / (
                    np.linalg.norm(query_embedding) * np.linalg.norm(response.embedding)
                )
                if score > best_score:
                    best_score = score
                    best_response = response
        # If high confidence match, use canned response
        if best_score >= self.selection_threshold and best_response:
            return best_response.response, "selected"
        # Otherwise, would trigger RAG generation (placeholder)
        return f"[Generated response for: {user_query}]", "generated"

# Example usage
responses = [
    CannedResponse("r1", "order_status", "You can track your order at example.com/track"),
    CannedResponse("r2", "return_request", "Returns are accepted within 30 days. Visit example.com/returns"),
    CannedResponse("r3", "product_info", "Our products come with a 1-year warranty."),
]
system = HybridResponseSystem(responses, MockEncoder())
response, method = system.get_response("How do I track my package?", "order_status")
print(f"Response ({method}): {response}")
Response (generated): [Generated response for: How do I track my package?]
Tip: Conversational AI Best Practices
Intent Classification:
Few-shot examples: 5-10 examples per intent is often sufficient with good embeddings
Hierarchical intents: Parent → child classification for complex domains
Fallback handling: Route low-confidence queries to human agents or clarification
Active learning: Log low-confidence queries for labeling and model improvement
Context Management:
Semantic retrieval: Don’t just use last N turns—retrieve semantically relevant history
Session boundaries: Clear context appropriately between sessions
Privacy: Exclude sensitive information from context retrieval
Response Strategy:
Canned for compliance: Use pre-written responses for legal, safety, policy questions
Generated for flexibility: Use RAG for complex, context-dependent queries
Hybrid routing: Classify query type to select response strategy
Guardrails: Always validate generated responses before sending (a minimal grounding check is sketched below)
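Here is a minimal sketch of that guardrail step: before sending a generated answer, check that each sentence has at least one sufficiently similar supporting passage among the retrieved context. Embedding similarity is a weak proxy; an NLI entailment model would be a stronger check. The threshold, sentence splitting, and use of the MockEncoder from earlier are illustrative assumptions.

import numpy as np

def is_grounded(answer: str, context_passages, encoder, threshold: float = 0.6):
    """Flag answer sentences with no sufficiently similar supporting passage."""
    passage_embs = [encoder.encode(p) for p in context_passages]
    if not passage_embs:
        return False, [answer]
    unsupported = []
    for sentence in [s.strip() for s in answer.split(".") if s.strip()]:
        s_emb = encoder.encode(sentence)
        # Best cosine similarity against any retrieved passage
        support = max(
            np.dot(s_emb, p) / (np.linalg.norm(s_emb) * np.linalg.norm(p))
            for p in passage_embs
        )
        if support < threshold:
            unsupported.append(sentence)
    return len(unsupported) == 0, unsupported

# Usage with the MockEncoder defined earlier (real systems use a trained encoder or NLI model)
ok, flagged = is_grounded(
    "Returns are accepted within 30 days.",
    ["Returns are accepted within 30 days. Visit example.com/returns"],
    MockEncoder(),
)
print("grounded:", ok, "| unsupported sentences:", flagged)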
11.7 Embedding-Based Summarization
Summarization with embeddings identifies representative content—selecting sentences or passages that best capture document meaning. Unlike generative summarization, embedding-based approaches are extractive, selecting existing text rather than generating new text.
11.7.1 Representative Sentence Selection
The core idea: sentences with embeddings closest to the document centroid are most representative:
Extractive summarizer:

from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Sentence:
    """Sentence with embedding."""
    text: str
    embedding: np.ndarray
    position: int  # Position in original document

class ExtractiveSummarizer:
    """Embedding-based extractive summarization."""

    def __init__(self, encoder):
        self.encoder = encoder

    def summarize(self, document: str, num_sentences: int = 3, diversity_weight: float = 0.3) -> List[str]:
        """
        Extract representative sentences from document.

        Args:
            document: Input text
            num_sentences: Number of sentences to extract
            diversity_weight: Balance between relevance (0) and diversity (1)
        """
        # Split into sentences (simplified)
        raw_sentences = [s.strip() for s in document.replace('!', '.').replace('?', '.').split('.') if s.strip()]
        if len(raw_sentences) <= num_sentences:
            return raw_sentences

        # Compute embeddings
        sentences = []
        for i, text in enumerate(raw_sentences):
            embedding = self.encoder.encode(text)
            sentences.append(Sentence(text=text, embedding=embedding, position=i))

        # Compute document centroid
        all_embeddings = np.array([s.embedding for s in sentences])
        centroid = np.mean(all_embeddings, axis=0)

        # Select sentences using MMR (Maximal Marginal Relevance)
        selected = []
        remaining = sentences.copy()
        for _ in range(num_sentences):
            best_sentence = None
            best_score = -float('inf')
            for sentence in remaining:
                # Relevance: similarity to centroid
                relevance = np.dot(sentence.embedding, centroid) / (
                    np.linalg.norm(sentence.embedding) * np.linalg.norm(centroid)
                )
                # Diversity: dissimilarity to already selected sentences
                if selected:
                    max_sim_to_selected = max(
                        np.dot(sentence.embedding, s.embedding) / (
                            np.linalg.norm(sentence.embedding) * np.linalg.norm(s.embedding)
                        )
                        for s in selected
                    )
                    diversity = 1 - max_sim_to_selected
                else:
                    diversity = 1
                # MMR score
                score = (1 - diversity_weight) * relevance + diversity_weight * diversity
                if score > best_score:
                    best_score = score
                    best_sentence = sentence
            if best_sentence:
                selected.append(best_sentence)
                remaining.remove(best_sentence)

        # Return in original document order
        selected.sort(key=lambda s: s.position)
        return [s.text for s in selected]

    def summarize_multi_document(self, documents: List[str], num_sentences: int = 5) -> List[str]:
        """Summarize multiple documents by finding representative sentences across all."""
        all_sentences = []
        for doc_idx, document in enumerate(documents):
            raw_sentences = [s.strip() for s in document.replace('!', '.').replace('?', '.').split('.') if s.strip()]
            for i, text in enumerate(raw_sentences):
                embedding = self.encoder.encode(text)
                all_sentences.append(Sentence(text=text, embedding=embedding, position=i + doc_idx * 1000))
        if len(all_sentences) <= num_sentences:
            return [s.text for s in all_sentences]

        # Compute global centroid
        all_embeddings = np.array([s.embedding for s in all_sentences])
        centroid = np.mean(all_embeddings, axis=0)

        # Score by distance to centroid
        scores = []
        for sentence in all_sentences:
            score = np.dot(sentence.embedding, centroid) / (
                np.linalg.norm(sentence.embedding) * np.linalg.norm(centroid)
            )
            scores.append((sentence, score))
        scores.sort(key=lambda x: x[1], reverse=True)
        return [s.text for s, _ in scores[:num_sentences]]

# Example usage
summarizer = ExtractiveSummarizer(MockEncoder())
document = """
Machine learning has transformed how we process data.
Deep learning models can recognize patterns in images and text.
Neural networks require large amounts of training data.
Transfer learning allows models to leverage pre-trained knowledge.
Embeddings represent data as dense vectors for similarity computation.
"""
summary = summarizer.summarize(document, num_sentences=2)
print(f"Summary ({len(summary)} sentences):")
for s in summary:
    print(f"  - {s}")
Summary (2 sentences):
- Machine learning has transformed how we process data
- Deep learning models can recognize patterns in images and text
11.7.2 Cluster-Based Summarization
For longer documents, cluster sentences first, then select representatives from each cluster:
Cluster-based summarizer:

from typing import List, Dict
import numpy as np

class ClusterSummarizer:
    """Cluster-based summarization for long documents."""

    def __init__(self, encoder):
        self.encoder = encoder

    def summarize(self, document: str, num_clusters: int = 3) -> List[str]:
        """
        Summarize by clustering sentences and selecting cluster representatives.
        """
        # Split and embed sentences
        raw_sentences = [s.strip() for s in document.replace('!', '.').replace('?', '.').split('.') if s.strip()]
        if len(raw_sentences) <= num_clusters:
            return raw_sentences
        embeddings = np.array([self.encoder.encode(s) for s in raw_sentences])

        # Simple k-means clustering
        centroids = self._kmeans(embeddings, num_clusters)

        # Assign sentences to clusters
        clusters: Dict[int, List[tuple]] = {i: [] for i in range(num_clusters)}
        for i, (sentence, embedding) in enumerate(zip(raw_sentences, embeddings)):
            distances = [np.linalg.norm(embedding - c) for c in centroids]
            cluster_id = np.argmin(distances)
            clusters[cluster_id].append((sentence, embedding, i))

        # Select representative from each cluster (closest to centroid)
        representatives = []
        for cluster_id, members in clusters.items():
            if not members:
                continue
            centroid = centroids[cluster_id]
            best_sentence = min(
                members,
                key=lambda x: np.linalg.norm(x[1] - centroid)
            )
            representatives.append((best_sentence[0], best_sentence[2]))  # text, position

        # Return in document order
        representatives.sort(key=lambda x: x[1])
        return [text for text, _ in representatives]

    def _kmeans(self, embeddings: np.ndarray, k: int, max_iters: int = 100) -> np.ndarray:
        """Simple k-means clustering."""
        # Initialize centroids randomly
        indices = np.random.choice(len(embeddings), k, replace=False)
        centroids = embeddings[indices].copy()
        for _ in range(max_iters):
            # Assign points to nearest centroid
            assignments = []
            for emb in embeddings:
                distances = [np.linalg.norm(emb - c) for c in centroids]
                assignments.append(np.argmin(distances))
            # Update centroids
            new_centroids = []
            for i in range(k):
                cluster_points = embeddings[np.array(assignments) == i]
                if len(cluster_points) > 0:
                    new_centroids.append(cluster_points.mean(axis=0))
                else:
                    new_centroids.append(centroids[i])
            new_centroids = np.array(new_centroids)
            # Check convergence
            if np.allclose(centroids, new_centroids):
                break
            centroids = new_centroids
        return centroids

# Example
cluster_summarizer = ClusterSummarizer(MockEncoder())
long_doc = """
The economy grew by 3% this quarter. Employment rates improved significantly.
New technology startups raised record funding. AI companies led the investment surge.
Climate change policies face opposition. Environmental groups demand stronger action.
Sports teams prepare for the championship. Fans eagerly await the final matches.
"""
summary = cluster_summarizer.summarize(long_doc, num_clusters=3)
print(f"Cluster-based summary:")
for s in summary:
    print(f"  - {s}")
Cluster-based summary:
- The economy grew by 3% this quarter
- Employment rates improved significantly
- Climate change policies face opposition
Tip: Summarization Best Practices
Extraction Strategy:
MMR for diversity: Avoid selecting redundant sentences
Position bias: First/last sentences often contain key information
Length normalization: Don’t over-favor short or long sentences
Cluster-based: For long documents, cluster then select representatives
Quality Considerations:
Coherence: Selected sentences should flow logically
Coverage: Summary should cover main topics, not just one aspect
Redundancy: Remove near-duplicate information
Context preservation: Include enough context for sentences to be understandable
Scale Considerations:
Pre-compute embeddings: For document collections, embed once and reuse
Hierarchical summarization: Summarize sections, then summarize summaries (sketched after this list)
Incremental updates: For streaming documents, maintain running summaries
Caching: Cache summaries for frequently accessed documents
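A small sketch of the hierarchical pattern referenced above, reusing the ExtractiveSummarizer and MockEncoder defined earlier: summarize each section, then summarize the concatenated section summaries. Section boundaries are assumed to be given; the helper name and parameters are illustrative.

def hierarchical_summary(sections, summarizer, per_section=2, final_sentences=3):
    """Summarize each section, then summarize the combined section summaries."""
    section_summaries = []
    for section in sections:
        section_summaries.extend(summarizer.summarize(section, num_sentences=per_section))
    combined = ". ".join(section_summaries)
    return summarizer.summarize(combined, num_sentences=final_sentences)

# Usage
sections = [
    "The economy grew by 3% this quarter. Employment rates improved significantly. Inflation remained stable.",
    "New technology startups raised record funding. AI companies led the investment surge. Valuations climbed.",
]
for s in hierarchical_summary(sections, ExtractiveSummarizer(MockEncoder())):
    print("-", s)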
11.8 Key Takeaways
RAG combines retrieval and generation for grounded LLM responses: Retrieving relevant context from vector databases enables accurate answers over billion-document corpora while maintaining attribution and enabling real-time knowledge updates
Enterprise RAG requires multi-component architecture: Query understanding, retrieval, reranking, context assembly, generation, and validation each play critical roles, and each must scale independently
Context window optimization maximizes information density: Passage extraction, deduplication, and hierarchical assembly enable fitting relevant information within LLM token limits while preserving key facts
Multi-stage retrieval balances recall and precision: Early stages (vector search) optimize for recall across billion-doc corpora, later stages (reranking, diversity) optimize for precision with expensive models on small candidate sets
RAG evaluation requires measuring beyond retrieval and generation: End-to-end metrics must capture retrieval relevance, context utilization, answer accuracy, factual consistency, attribution quality, and user satisfaction
Contradiction handling enables navigating disagreements in knowledge bases: Temporal resolution (prefer recent), source authority weighting (prefer credible), and multi-perspective presentation handle conflicts when sources disagree
Production RAG demands comprehensive engineering: Caching, batching, circuit breakers, monitoring, A/B testing, and continuous evaluation separate research prototypes from production systems serving millions of users
Conversational AI leverages embeddings for semantic intent matching: Embedding-based chatbots classify user intent from examples, retrieve semantically relevant conversation history, and combine canned responses with generated content for appropriate flexibility and compliance
Embedding-based summarization extracts representative content: Centroid-based selection and MMR diversity ensure summaries capture key information without redundancy, while cluster-based approaches handle long documents by selecting representatives from each topic cluster
11.9 Looking Ahead
This chapter demonstrated how RAG leverages embeddings for grounded generation at enterprise scale. Chapter 12 expands semantic search beyond text: multi-modal search across text, images, audio, and video; code search for software intelligence; scientific literature and patent search with domain-specific understanding; media and content discovery across creative assets; and knowledge graph integration for structured reasoning. These applications demonstrate embeddings’ versatility across diverse modalities and domains.
11.10 Further Reading
11.10.1 RAG Foundations
Lewis, Patrick, et al. (2020). “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” NeurIPS.
Guu, Kelvin, et al. (2020). “REALM: Retrieval-Augmented Language Model Pre-Training.” ICML.
Izacard, Gautier, et al. (2021). “Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering.” EACL.
11.10.2 Retrieval Systems
Karpukhin, Vladimir, et al. (2020). “Dense Passage Retrieval for Open-Domain Question Answering.” EMNLP.
Xiong, Lee, et al. (2021). “Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval.” ICLR.
Santhanam, Keshav, et al. (2022). “ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction.” NAACL.
11.10.3 Context Optimization
Jiang, Zhengbao, et al. (2023). “Long-Form Factuality in Large Language Models.” arXiv.
Liu, Nelson F., et al. (2023). “Lost in the Middle: How Language Models Use Long Contexts.” arXiv.
Press, Ofir, et al. (2022). “Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.” ICLR.
11.10.4 RAG Evaluation
Chen, Daixuan, et al. (2023). “CRUD-RAG: Benchmarking Retrieval-Augmented Generation for Time-Sensitive Knowledge.” arXiv.
Es, Shahul, et al. (2023). “RAGAS: Automated Evaluation of Retrieval Augmented Generation.” arXiv.
Liu, Yang, et al. (2023). “Evaluating the Factuality of Large Language Models.” ACL.
11.10.5 Production Systems
Anthropic (2023). “Claude 2 System Card.”
OpenAI (2023). “GPT-4 Technical Report.”
Thoppilan, Romal, et al. (2022). “LaMDA: Language Models for Dialog Applications.” arXiv.
11.10.6 Contradiction Detection
Welleck, Sean, et al. (2019). “Dialogue Natural Language Inference.” ACL.