Semantic search transcends traditional keyword matching, enabling organizations to find meaning across modalities: images, code, scientific papers, media assets, and interconnected knowledge. This chapter explores multi-modal semantic search architectures that unify text, vision, and audio embeddings for cross-modal retrieval; code search systems that understand program semantics beyond syntax for software intelligence; scientific literature and patent search at research scale, with citation networks and entity resolution; media and content discovery engines that match visual style and creative intent; and enterprise knowledge graphs that connect entities through learned embeddings. Together, these capabilities shift search from surface-level matching to semantic understanding, surfacing insights hidden in unstructured data.
After mastering RAG for text (Chapter 11), the next frontier is semantic search beyond text. Traditional search operates on keywords: match query terms to document terms, rank by term frequency. This works for text but fails for images (no keywords), code (syntax vs semantics), scientific literature (citation networks matter), media (style and composition), and knowledge graphs (relationships matter more than attributes). Embedding-based semantic search solves these challenges by representing all modalities—text, images, code, papers, media, entities—in a unified vector space where similarity reflects semantic meaning, not surface features.
12.1 Multi-Modal Semantic Search
Multi-modal search finds content across different modalities: search images with text queries (“sunset over mountains”), search text with image queries (upload photo, find similar articles), search videos with audio queries (hum a melody, find the song). Multi-modal embeddings map different modalities into a shared vector space where cross-modal similarity is meaningful.
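As a concrete sketch of the shared-space idea, the class below indexes images and answers text queries with cosine similarity over unit-normalized vectors. The encoders are placeholders (a real system would plug in CLIP-style text and image towers), and names such as CrossModalIndex are illustrative rather than a specific library API.

import numpy as np

class CrossModalIndex:
    """Text-to-image retrieval over a shared embedding space."""

    def __init__(self, embedding_dim: int = 512):
        self.embedding_dim = embedding_dim
        self.items = []       # indexed assets and their metadata
        self.vectors = []     # unit-normalized embeddings in the shared space

    def add_image(self, item_id: str, image, metadata: dict = None):
        """Index an image by its embedding in the shared space."""
        vec = self.encode_image(image)
        self.items.append({'item_id': item_id, 'modality': 'image', 'metadata': metadata})
        self.vectors.append(vec / np.linalg.norm(vec))

    def search_with_text(self, query: str, k: int = 5):
        """Retrieve the k images closest to a text query."""
        q = self.encode_text(query)
        q = q / np.linalg.norm(q)
        sims = np.array(self.vectors) @ q      # cosine similarity on unit vectors
        top = np.argsort(sims)[-k:][::-1]
        return [self.items[i] for i in top]

    def encode_text(self, text: str):
        """Placeholder: a real system would call a text encoder (e.g., CLIP's text tower)."""
        return np.random.rand(self.embedding_dim)

    def encode_image(self, image):
        """Placeholder: a real system would call an image encoder (e.g., CLIP's vision tower)."""
        return np.random.rand(self.embedding_dim)

# Usage example
index = CrossModalIndex()
index.add_image("img-001", image=None, metadata={'caption': 'sunset over mountains'})
print(index.search_with_text("sunset over mountains", k=1))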
12.2 Code Search and Software Intelligence
Code search finds functions, classes, and patterns in massive codebases—but traditional search fails because code semantics differ from syntax. Semantic code search uses embeddings to find code by intent (“sort a list”), not by keywords, enabling software intelligence for code completion, bug detection, and API discovery.
12.2.1 The Code Search Challenge
Code has unique properties:
Syntax vs semantics: list.sort() and sorted(list) are syntactically different but semantically similar
Multiple representations: Code, comments, docstrings, test cases all describe intent
Polyglot: Multiple languages (Python, Java, C++, JavaScript)
Challenge: Find code that does X (semantic intent), not code that contains X (keyword match).
Show Code Search System
import ast
from typing import List, Tuple


class CodeSearchEngine:
    """Semantic code search using embeddings."""

    def __init__(self, code_encoder, embedding_dim=768):
        self.code_encoder = code_encoder
        self.embedding_dim = embedding_dim
        self.code_snippets = []
        self.embeddings = []

    def add_code(self, code: str, metadata: dict = None):
        """Index code snippet."""
        # Parse AST to extract functions/classes
        try:
            tree = ast.parse(code)
            for node in ast.walk(tree):
                if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
                    snippet = ast.get_source_segment(code, node)
                    self.code_snippets.append({'code': snippet, 'metadata': metadata})
        except SyntaxError:
            # If parsing fails, index as-is
            self.code_snippets.append({'code': code, 'metadata': metadata})

    def search(self, natural_language_query: str, k: int = 5) -> List[dict]:
        """Search code using natural language query."""
        # Encode query
        query_emb = self.encode_query(natural_language_query)
        # Encode all code snippets
        code_embs = [self.encode_code(s['code']) for s in self.code_snippets]
        # Compute similarities
        similarities = [self.cosine_similarity(query_emb, code_emb)
                        for code_emb in code_embs]
        # Return top-k
        top_indices = sorted(range(len(similarities)),
                             key=lambda i: similarities[i], reverse=True)[:k]
        return [self.code_snippets[i] for i in top_indices]

    def encode_query(self, query: str):
        """Encode natural language query."""
        return [0.0] * self.embedding_dim  # Placeholder: call a real text encoder here

    def encode_code(self, code: str):
        """Encode code snippet."""
        return [0.0] * self.embedding_dim  # Placeholder: call a real code encoder here

    def cosine_similarity(self, a, b):
        """Compute cosine similarity, guarding against zero vectors."""
        import numpy as np
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(np.dot(a, b) / denom) if denom else 0.0


# Usage example
engine = CodeSearchEngine(code_encoder=None)
engine.add_code("def sort_list(items): return sorted(items)")
results = engine.search("sort a list", k=3)
print(f"Found {len(results)} code snippets")
Found 1 code snippets
Tip: Code Search Best Practices
Training:
Pre-training: Use CodeBERT or GraphCodeBERT (pre-trained on GitHub)
Fine-tuning: Fine-tune on domain-specific code (internal codebase) (see Chapter 14 for guidance on when to fine-tune vs. train from scratch)
Data augmentation: Rename variables, reformat code (preserve semantics)
Hard negatives: Mine hard negatives (similar code, different semantics) (see Chapter 15)
Indexing:
Function-level: Index individual functions, not entire files
Deduplication: Remove duplicate functions (common in forks)
Metadata: Include docstrings, comments, test cases
Incremental: Update index as new code is added (CI/CD integration)
Search quality:
Reranking: Use a cross-encoder to rerank the top-100 results (see the sketch after this list)
Diversity: Ensure diverse results (not all bubble sort variants)
Filtering: Filter by language, library, recency
Personalization: Rank by user’s coding style and preferences
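A minimal sketch of the reranking step listed above, assuming the sentence-transformers package and its publicly available ms-marco cross-encoder checkpoint (used here only for illustration; a cross-encoder fine-tuned on query/code pairs would fit production better). Candidates come from the bi-encoder search shown earlier.

from typing import List
from sentence_transformers import CrossEncoder  # assumes sentence-transformers is installed

def rerank(query: str, candidates: List[dict], top_n: int = 10) -> List[dict]:
    """Rerank bi-encoder candidates with a cross-encoder.

    `candidates` are the dicts returned by CodeSearchEngine.search,
    e.g. {'code': ..., 'metadata': ...}.
    """
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    pairs = [(query, c['code']) for c in candidates]
    scores = model.predict(pairs)                       # one relevance score per pair
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_n]]

# Usage: rerank the top-100 results from the bi-encoder stage
# reranked = rerank("sort a list", engine.search("sort a list", k=100), top_n=10)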
12.3 Scientific Literature and Patent Search
Scientific research produces millions of papers annually—PubMed has 35M+ articles, arXiv adds 200K/year, and patent offices hold 100M+ patents. Semantic literature search finds relevant research by understanding concepts, methods, and relationships, combining citation networks with entity resolution for authors, institutions, and compounds to enable discovery at research scale.
Challenge: Find research that matches the query’s concepts (semantic relevance), not just its terms (keyword frequency).
Show Scientific Literature Search
from typing import List, Dict, Optional
import numpy as np


class ScientificPaperSearch:
    """Search scientific papers using embeddings and citation networks."""

    def __init__(self, paper_encoder, embedding_dim=768):
        self.paper_encoder = paper_encoder
        self.embedding_dim = embedding_dim
        self.papers = []
        self.citation_graph = {}  # paper_id -> [cited_paper_ids]

    def add_paper(self, paper_id: str, title: str, abstract: str,
                  citations: List[str] = None):
        """Index scientific paper."""
        # Encode paper
        text = f"{title} {abstract}"
        embedding = self.encode_paper(text)
        self.papers.append({
            'paper_id': paper_id,
            'title': title,
            'abstract': abstract,
            'embedding': embedding
        })
        # Update citation graph
        if citations:
            self.citation_graph[paper_id] = citations

    def search(self, query: str, k: int = 10, use_citations: bool = True) -> List[dict]:
        """Search papers using semantic similarity and citation network."""
        # Encode query
        query_emb = self.encode_paper(query)
        # Compute semantic similarities
        scores = []
        for paper in self.papers:
            semantic_score = self.cosine_similarity(query_emb, paper['embedding'])
            # Boost score using citation network
            if use_citations and paper['paper_id'] in self.citation_graph:
                citation_boost = len(self.citation_graph[paper['paper_id']]) * 0.01
                final_score = semantic_score + citation_boost
            else:
                final_score = semantic_score
            scores.append(final_score)
        # Return top-k
        top_indices = sorted(range(len(scores)),
                             key=lambda i: scores[i], reverse=True)[:k]
        return [self.papers[i] for i in top_indices]

    def encode_paper(self, text: str):
        """Encode paper text to embedding."""
        return np.random.rand(self.embedding_dim)  # Placeholder

    def cosine_similarity(self, a, b):
        """Compute cosine similarity."""
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Usage example
search_engine = ScientificPaperSearch(paper_encoder=None)
search_engine.add_paper("paper1", "Deep Learning", "Neural networks...",
                        citations=["paper2"])
results = search_engine.search("machine learning", k=5)
print(f"Found {len(results)} relevant papers")
Found 1 relevant papers
Tip: Scientific Search Best Practices
Domain-specific embeddings:
Pre-training: Use SPECTER (citation-based), SciBERT (scientific text)
Fine-tuning: Fine-tune on domain-specific corpora (biomedical, physics)
Multi-field: Encode title + abstract + full text (weight by importance)
Citation context: Include sentences that cite the paper
Citation graph:
Co-citation: Papers cited together are related (see the sketch after this list)
Bibliographic coupling: Papers citing the same work are related
PageRank: Rank by citation graph centrality
Temporal weighting: Recent citations matter more
Entity linking:
Named entity recognition: Extract entities (chemicals, genes, diseases)
Entity disambiguation: Link to knowledge base (PubChem, UniProt)
Relation extraction: Extract relationships between entities
Entity embeddings: Embed entities in same space as papers
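The co-citation and bibliographic coupling signals listed above can be computed directly from the citation_graph dictionary that ScientificPaperSearch maintains. A minimal sketch follows; the citation_signals helper is illustrative, not part of the class.

from itertools import combinations
from collections import Counter
from typing import Dict, List, Tuple

def citation_signals(citation_graph: Dict[str, List[str]]) -> Tuple[Counter, Counter]:
    """Compute co-citation and bibliographic coupling counts.

    citation_graph maps paper_id -> list of cited paper_ids, the same
    structure ScientificPaperSearch maintains.
    """
    co_citation = Counter()   # (a, b) -> number of papers that cite both a and b
    coupling = Counter()      # (a, b) -> number of references shared by a and b

    # Co-citation: two papers appearing together in the same reference list
    for refs in citation_graph.values():
        for a, b in combinations(sorted(set(refs)), 2):
            co_citation[(a, b)] += 1

    # Bibliographic coupling: two citing papers that share references
    papers = list(citation_graph)
    for a, b in combinations(papers, 2):
        shared = len(set(citation_graph[a]) & set(citation_graph[b]))
        if shared:
            coupling[tuple(sorted((a, b)))] += shared

    return co_citation, coupling

# Usage example
graph = {"p1": ["p3", "p4"], "p2": ["p3", "p5"], "p6": ["p1", "p2"]}
co_cit, coup = citation_signals(graph)
print(co_cit)   # ("p1", "p2") are co-cited once by p6; ("p3", "p4") and ("p3", "p5") once each
print(coup)     # ("p1", "p2") share one reference (p3)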
12.4 Media and Content Discovery
Media assets—images, videos, audio—represent trillions of files across organizations. Semantic media search finds content by visual style, composition, audio characteristics, and creative intent, enabling discovery beyond metadata tagging and filename matching.
12.4.1 The Media Discovery Challenge
Media has unique properties:
Visual style: Color palette, composition, lighting
Creative intent: Mood, emotion, message
Temporal dynamics: Video and audio evolve over time
Challenge: Find visually similar or stylistically related media, not keyword matches on filenames.
Show Media Discovery System
from typing import List, Tuple
import numpy as np
from dataclasses import dataclass


@dataclass
class MediaAsset:
    """Media asset with content and style embeddings."""
    asset_id: str
    file_path: str
    content_embedding: np.ndarray
    style_embedding: np.ndarray
    metadata: dict = None


class MediaDiscoveryEngine:
    """Search media by visual similarity and style."""

    def __init__(self, content_encoder, style_encoder):
        self.content_encoder = content_encoder
        self.style_encoder = style_encoder
        self.assets = []

    def add_asset(self, asset_id: str, file_path: str, image):
        """Index media asset."""
        # Extract content and style embeddings
        content_emb = self.encode_content(image)
        style_emb = self.encode_style(image)
        asset = MediaAsset(
            asset_id=asset_id,
            file_path=file_path,
            content_embedding=content_emb,
            style_embedding=style_emb
        )
        self.assets.append(asset)

    def search_by_content(self, query_image, k: int = 10) -> List[MediaAsset]:
        """Find visually similar content."""
        query_emb = self.encode_content(query_image)
        similarities = [np.dot(query_emb, asset.content_embedding)
                        for asset in self.assets]
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [self.assets[i] for i in top_indices]

    def search_by_style(self, query_image, k: int = 10) -> List[MediaAsset]:
        """Find assets with similar visual style."""
        query_emb = self.encode_style(query_image)
        similarities = [np.dot(query_emb, asset.style_embedding)
                        for asset in self.assets]
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [self.assets[i] for i in top_indices]

    def encode_content(self, image):
        """Encode semantic content."""
        return np.random.rand(512)  # Placeholder

    def encode_style(self, image):
        """Encode visual style."""
        return np.random.rand(256)  # Placeholder


# Usage example
engine = MediaDiscoveryEngine(content_encoder=None, style_encoder=None)
print("Media discovery engine initialized with content and style encoders")
Media discovery engine initialized with content and style encoders
Tip: Media Search Best Practices
Visual features:
Content embeddings: Use CLIP, ResNet, or ViT for semantic content
Style embeddings: Use Gram matrices or style-specific encoders
Multi-scale: Extract features at multiple resolutions
Color histograms: Supplement embeddings with color features
Duplicate detection:
Perceptual hashing: pHash, dHash for near-duplicate detection (see the sketch after this list)
Hamming distance: Fast comparison (XOR + popcount)
Clustering: Group near-duplicates for review
Threshold tuning: Balance false positives vs false negatives
Performance:
Pre-compute embeddings: Encode assets offline during ingestion
GPU batching: Batch encode 100-1000 images per GPU
Caching: Cache embeddings in vector database
Progressive loading: Show low-res previews while searching
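A minimal sketch of difference hashing (dHash) and Hamming-distance comparison mentioned above, assuming images arrive as grayscale NumPy arrays and using a crude block-mean resize; production systems typically rely on a dedicated library such as ImageHash and a proper image resize.

import numpy as np

def dhash(gray: np.ndarray, hash_size: int = 8) -> np.ndarray:
    """Difference hash: downsample to (hash_size x hash_size+1) and compare neighbors."""
    h, w = gray.shape
    # Crude block-mean resize; a real pipeline would use PIL or OpenCV resizing
    rows = np.array_split(np.arange(h), hash_size)
    cols = np.array_split(np.arange(w), hash_size + 1)
    small = np.array([[gray[np.ix_(r, c)].mean() for c in cols] for r in rows])
    # Each bit records whether a pixel is brighter than its right-hand neighbor
    return (small[:, 1:] > small[:, :-1]).flatten()

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits; near-duplicates typically fall well below 10 of 64."""
    return int(np.count_nonzero(a != b))

# Usage example: a duplicate with a mild brightness change hashes almost identically
rng = np.random.default_rng(0)
img = rng.random((64, 64))
near_dup = np.clip(img + 0.02, 0, 1)
print(hamming(dhash(img), dhash(near_dup)))   # small distance -> near-duplicate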
12.5 Enterprise Knowledge Graphs
Enterprise knowledge graphs connect entities—customers, products, employees, documents—through relationships. Embedding-based knowledge graphs use learned embeddings to represent entities and relations, enabling link prediction, entity resolution, and graph-aware search that understands how entities relate.
12.5.1 The Knowledge Graph Challenge
Traditional knowledge graphs use discrete representations (triples: subject-predicate-object). Embedding-based graphs represent entities and relations as vectors, enabling:
Link prediction: Predict missing relationships
Entity resolution: Merge duplicate entities
Multi-hop reasoning: Answer complex queries across relationships
Similarity search: Find similar entities by embeddings
Challenge: Learn embeddings that preserve graph structure and semantics.
Show Knowledge Graph Embeddings
import torch
import torch.nn as nn
import numpy as np
from typing import List, Tuple, Dict


class KnowledgeGraphEmbedding:
    """TransE-based knowledge graph embedding."""

    def __init__(self, num_entities: int, num_relations: int, embedding_dim: int = 128):
        self.embedding_dim = embedding_dim
        # Entity and relation embeddings
        self.entity_embeddings = nn.Embedding(num_entities, embedding_dim)
        self.relation_embeddings = nn.Embedding(num_relations, embedding_dim)
        # Initialize
        nn.init.xavier_uniform_(self.entity_embeddings.weight)
        nn.init.xavier_uniform_(self.relation_embeddings.weight)

    def score_triple(self, head: torch.Tensor, relation: torch.Tensor,
                     tail: torch.Tensor) -> torch.Tensor:
        """Score a triple (head, relation, tail) using TransE."""
        head_emb = self.entity_embeddings(head)
        rel_emb = self.relation_embeddings(relation)
        tail_emb = self.entity_embeddings(tail)
        # TransE: h + r ≈ t
        score = torch.norm(head_emb + rel_emb - tail_emb, p=2, dim=-1)
        return -score  # Negate so higher is better

    def predict_tail(self, head: int, relation: int, k: int = 10) -> List[Tuple[int, float]]:
        """Predict most likely tail entities for (head, relation, ?)."""
        head_tensor = torch.tensor([head])
        rel_tensor = torch.tensor([relation])
        # Score all possible tails
        all_tails = torch.arange(self.entity_embeddings.num_embeddings)
        scores = []
        for tail in all_tails:
            tail_tensor = torch.tensor([tail])
            score = self.score_triple(head_tensor, rel_tensor, tail_tensor)
            scores.append((tail.item(), score.item()))
        # Return top-k
        scores.sort(key=lambda x: x[1], reverse=True)
        return scores[:k]


# Usage example
kg = KnowledgeGraphEmbedding(num_entities=1000, num_relations=50, embedding_dim=128)
predictions = kg.predict_tail(head=0, relation=5, k=5)
print(f"Top 5 predicted tail entities: {[p[0] for p in predictions]}")
Top 5 predicted tail entities: [0, 840, 794, 359, 912]
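The class above only scores triples; learning useful embeddings requires training. Below is a minimal sketch of one contrastive update with random tail corruption and a margin ranking loss; the training_step helper and its margin of 1.0 are illustrative assumptions, not a fixed recipe.

import torch

def training_step(kg: KnowledgeGraphEmbedding, triples: torch.Tensor,
                  optimizer: torch.optim.Optimizer, margin: float = 1.0) -> float:
    """One contrastive update: positive triples vs. randomly corrupted tails.

    `triples` is a LongTensor of shape (batch, 3) holding (head, relation, tail) ids.
    """
    heads, rels, tails = triples[:, 0], triples[:, 1], triples[:, 2]
    num_entities = kg.entity_embeddings.num_embeddings

    # Negative sampling: corrupt the tail of each positive triple
    neg_tails = torch.randint(0, num_entities, tails.shape)

    pos_scores = kg.score_triple(heads, rels, tails)        # higher = more plausible
    neg_scores = kg.score_triple(heads, rels, neg_tails)

    # Margin ranking loss: positives should outscore negatives by at least `margin`
    loss = torch.clamp(margin - pos_scores + neg_scores, min=0).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage example
kg = KnowledgeGraphEmbedding(num_entities=1000, num_relations=50, embedding_dim=128)
opt = torch.optim.Adam(
    list(kg.entity_embeddings.parameters()) + list(kg.relation_embeddings.parameters()),
    lr=1e-3,
)
batch = torch.randint(0, 50, (256, 3))          # column 1: relation ids
batch[:, 0] = torch.randint(0, 1000, (256,))    # column 0: head entity ids
batch[:, 2] = torch.randint(0, 1000, (256,))    # column 2: tail entity ids
print(f"batch loss: {training_step(kg, batch, opt):.4f}")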
Tip: Knowledge Graph Embedding Best Practices
Model selection:
TransE: Simple, works well for 1-to-1 relations
DistMult: Better for symmetric relations
ComplEx: Handles asymmetric and inverse relations
RotatE: State-of-the-art for complex relations
Training:
Negative sampling: Sample false triples for contrastive learning
Hard negatives: Mine hard negatives (plausible but false)
Regularization: L2 regularization on embeddings
Batch training: Use large batches (1000-10000 triples)
Applications:
Link prediction: Predict missing relationships
Entity resolution: Merge duplicate entities by embedding similarity
Graph completion: Fill in missing edges
Multi-hop reasoning: Answer complex queries (e.g., “customers who bought products similar to X”; see the sketch after this list)
Entity disambiguation: Distinguish different entities that share the same name
Scalability:
Billion-entity graphs require distributed training
Full graph materialization doesn’t fit in memory
Subgraph sampling required for large graphs
Interpretability:
Embeddings are black boxes (hard to debug)
Relation semantics may not align with vector operations
Need attribution methods to explain predictions
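A minimal sketch of multi-hop reasoning in TransE’s vector space, following the path-query idea of composing relation vectors (h + r1 + r2 ≈ t) over the KnowledgeGraphEmbedding class above; the relation ids in the usage line are hypothetical and this is an illustration, not a full query engine.

import torch
from typing import List, Tuple

def predict_path_tail(kg: KnowledgeGraphEmbedding, head: int,
                      relation_path: List[int], k: int = 5) -> List[Tuple[int, float]]:
    """Answer a path query (head, r1, r2, ..., ?) by composing relation vectors.

    Returns (entity_id, distance) pairs, smallest distance first.
    """
    with torch.no_grad():
        query = kg.entity_embeddings(torch.tensor([head]))
        for rel in relation_path:
            query = query + kg.relation_embeddings(torch.tensor([rel]))
        # Distance from the composed query vector to every entity embedding
        all_entities = kg.entity_embeddings.weight            # (num_entities, dim)
        dists = torch.norm(all_entities - query, p=2, dim=-1)
        top = torch.topk(-dists, k)                           # smallest distances first
    return [(idx.item(), -score.item()) for idx, score in zip(top.indices, top.values)]

# Usage example (reusing `kg` from the training sketch above; relation ids are hypothetical)
print(predict_path_tail(kg, head=0, relation_path=[5, 7], k=5))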
12.6 Key Takeaways
Multi-modal search unifies text, images, audio, and video in shared embedding spaces: Cross-modal retrieval (query text, retrieve images) requires contrastive training on paired data and separate per-modality encoders that project to a common vector space
Code search transcends syntax to find code by semantic intent: Semantic code embeddings trained on code-docstring pairs enable natural language queries like “sort a list” to find relevant implementations across languages and coding styles
Scientific literature search leverages citation networks and domain embeddings: SPECTER and SciBERT embeddings combined with citation graph analysis (co-citation, bibliographic coupling) enable discovery of related research beyond keyword matching
Media discovery finds visual similarity and creative style: Separate embeddings for content (semantic meaning) and style (color, composition, texture) enable both “find similar images” and “find images with similar aesthetic” use cases
Knowledge graph embeddings enable link prediction and entity resolution: TransE and related models represent entities and relations as vectors, enabling prediction of missing relationships, merging of duplicate entities, and graph-aware similarity search
Semantic search beyond text requires domain-specific encoders: General-purpose embeddings (CLIP, BERT) provide baseline capabilities, but production systems need fine-tuning on domain-specific data (code repositories, scientific papers, media assets)—see Chapter 14 for a decision framework on choosing the right level of customization
Search quality depends on training data quality: Multi-modal alignment requires clean paired data, code search needs accurate code-docstring pairs, and knowledge graphs need high-quality relationship annotations
12.7 Looking Ahead
Part IV (Advanced Applications) continues with Chapter 13, which applies embeddings to recommendation systems: embedding-based collaborative filtering that scales to billions of users and items, cold start solutions using content embeddings and meta-learning, real-time personalization with streaming embeddings, diversity and fairness constraints that prevent filter bubbles, and cross-domain recommendation transfer that leverages embeddings across product categories and platforms.
12.8 Further Reading
12.8.1 Multi-Modal Learning
Radford, Alec, et al. (2021). “Learning Transferable Visual Models From Natural Language Supervision (CLIP).” ICML.
Jia, Chao, et al. (2021). “Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision (ALIGN).” ICML.
Girdhar, Rohit, et al. (2023). “ImageBind: One Embedding Space To Bind Them All.” CVPR.
Baltrusaitis, Tadas, et al. (2019). “Multimodal Machine Learning: A Survey and Taxonomy.” IEEE TPAMI.
12.8.2 Code Search and Software Intelligence
Feng, Zhangyin, et al. (2020). “CodeBERT: A Pre-Trained Model for Programming and Natural Languages.” EMNLP.
Guo, Daya, et al. (2021). “GraphCodeBERT: Pre-training Code Representations with Data Flow.” ICLR.
Husain, Hamel, et al. (2019). “CodeSearchNet Challenge: Evaluating the State of Semantic Code Search.” arXiv.
Chen, Mark, et al. (2021). “Evaluating Large Language Models Trained on Code (Codex).” arXiv.
12.8.3 Scientific Literature Search
Cohan, Arman, et al. (2020). “SPECTER: Document-level Representation Learning using Citation-informed Transformers.” ACL.
Beltagy, Iz, et al. (2019). “SciBERT: A Pretrained Language Model for Scientific Text.” EMNLP.
Lo, Kyle, et al. (2020). “S2ORC: The Semantic Scholar Open Research Corpus.” ACL.
Priem, Jason, et al. (2022). “OpenAlex: A Fully-Open Index of Scholarly Works, Authors, Venues, and Concepts.” arXiv.
12.8.4 Media and Content Discovery
Gatys, Leon A., et al. (2016). “Image Style Transfer Using Convolutional Neural Networks.” CVPR.
Johnson, Justin, et al. (2016). “Perceptual Losses for Real-Time Style Transfer and Super-Resolution.” ECCV.
Simonyan, Karen, and Andrew Zisserman (2014). “Very Deep Convolutional Networks for Large-Scale Image Recognition.” ICLR.
Dosovitskiy, Alexey, et al. (2021). “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT).” ICLR.
12.8.5 Knowledge Graph Embeddings
Bordes, Antoine, et al. (2013). “Translating Embeddings for Modeling Multi-relational Data (TransE).” NeurIPS.
Yang, Bishan, et al. (2015). “Embedding Entities and Relations for Learning and Inference in Knowledge Bases (DistMult).” ICLR.
Trouillon, Théo, et al. (2016). “Complex Embeddings for Simple Link Prediction (ComplEx).” ICML.
Sun, Zhiqing, et al. (2019). “RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space.” ICLR.
Wang, Quan, et al. (2017). “Knowledge Graph Embedding: A Survey of Approaches and Applications.” IEEE TKDE.