12  Semantic Search Beyond Text

Note: Chapter Overview

Semantic search transcends traditional keyword matching, enabling organizations to find meaning across modalities: images, code, scientific papers, media assets, and interconnected knowledge. This chapter explores multi-modal semantic search architectures that unify text, vision, and audio embeddings for cross-modal retrieval; code search systems that understand program semantics beyond syntax; scientific literature and patent search at research scale, with citation networks and entity resolution; media and content discovery engines that match visual style and creative intent; and enterprise knowledge graphs that connect entities through learned embeddings. Together, these capabilities move search from surface matching to semantic understanding, unlocking insights hidden in unstructured data across modalities.

After mastering RAG for text (Chapter 11), the next frontier is semantic search beyond text. Traditional search operates on keywords: match query terms to document terms, rank by term frequency. This works for text but fails for images (no keywords), code (syntax vs semantics), scientific literature (citation networks matter), media (style and composition), and knowledge graphs (relationships matter more than attributes). Embedding-based semantic search solves these challenges by representing all modalities—text, images, code, papers, media, entities—in a unified vector space where similarity reflects semantic meaning, not surface features.
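
To make the unified-space idea concrete, here is a minimal sketch of cross-modal retrieval: a text query is embedded and compared against precomputed image embeddings that live in the same vector space. The text_encoder here is an assumption standing in for a CLIP-style model, not a specific library API.

import numpy as np


def cross_modal_search(query: str, image_embeddings: np.ndarray,
                       text_encoder, k: int = 5) -> np.ndarray:
    """Return indices of the k images that best match a text query.

    Assumes text_encoder maps a string into the same vector space
    as the image embeddings (e.g., a CLIP-style dual encoder).
    """
    q = text_encoder(query)
    q = q / np.linalg.norm(q)  # Normalize the query vector
    norms = np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    sims = (image_embeddings / norms) @ q  # Cosine similarities
    return np.argsort(sims)[-k:][::-1]  # Top-k indices, highest first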

12.2 Code Search and Software Intelligence

Code search finds functions, classes, and patterns in massive codebases—but traditional search fails because code semantics differ from syntax. Semantic code search uses embeddings to find code by intent (“sort a list”), not keywords, enabling software intelligence for code completion, bug detection, and API discovery.

12.2.1 The Code Search Challenge

Code has unique properties:

  • Syntax vs semantics: list.sort() and sorted(list) are syntactically different but semantically similar
  • Multiple representations: Code, comments, docstrings, test cases all describe intent
  • Compositional: Functions compose; understanding requires context
  • Polyglot: Multiple languages (Python, Java, C++, JavaScript)

Challenge: Find code that does X (semantic intent), not code that contains X (keyword match).

The following sketch outlines a minimal semantic code search engine:
import ast
from typing import List

import numpy as np


class CodeSearchEngine:
    """Semantic code search using embeddings."""
    def __init__(self, code_encoder, embedding_dim=768):
        self.code_encoder = code_encoder
        self.embedding_dim = embedding_dim
        self.code_snippets = []
        self.embeddings = []

    def add_code(self, code: str, metadata: dict = None):
        """Index code snippet."""
        # Parse AST to extract functions/classes
        try:
            tree = ast.parse(code)
            for node in ast.walk(tree):
                if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
                    snippet = ast.get_source_segment(code, node)
                    self.code_snippets.append({'code': snippet, 'metadata': metadata})
        except SyntaxError:
            # If parsing fails (e.g., non-Python or partial code), index as-is
            self.code_snippets.append({'code': code, 'metadata': metadata})

    def search(self, natural_language_query: str, k: int = 5) -> List[dict]:
        """Search code using natural language query."""
        # Encode query
        query_emb = self.encode_query(natural_language_query)

        # Encode all code snippets
        code_embs = [self.encode_code(s['code']) for s in self.code_snippets]

        # Compute similarities
        similarities = [self.cosine_similarity(query_emb, code_emb)
                        for code_emb in code_embs]

        # Return top-k
        top_indices = sorted(range(len(similarities)),
                             key=lambda i: similarities[i], reverse=True)[:k]
        return [self.code_snippets[i] for i in top_indices]

    def encode_query(self, query: str):
        """Encode natural language query (placeholder; use a real encoder such as CodeBERT)."""
        return [0.0] * self.embedding_dim  # Placeholder

    def encode_code(self, code: str):
        """Encode code snippet (placeholder; use a real encoder such as CodeBERT)."""
        return [0.0] * self.embedding_dim  # Placeholder

    def cosine_similarity(self, a, b):
        """Compute cosine similarity, guarding against zero-norm vectors."""
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(np.dot(a, b) / denom) if denom > 0 else 0.0

# Usage example
engine = CodeSearchEngine(code_encoder=None)
engine.add_code("def sort_list(items): return sorted(items)")
results = engine.search("sort a list", k=3)
print(f"Found {len(results)} code snippets")
Found 1 code snippets
Tip: Code Search Best Practices

Training:

  • Pre-training: Use CodeBERT or GraphCodeBERT (pre-trained on GitHub)
  • Fine-tuning: Fine-tune on domain-specific code (internal codebase) (see Chapter 14 for guidance on when to fine-tune vs. train from scratch)
  • Data augmentation: Rename variables, reformat code (preserve semantics)
  • Hard negatives: Mine hard negatives (similar code, different semantics; see Chapter 15, and the training sketch after this list)
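
A minimal sketch of such contrastive fine-tuning, assuming you already have batches of paired query and code embeddings from some encoder: an InfoNCE-style loss treats each query's paired code as the positive and the other codes in the batch (ideally mined hard negatives) as negatives.

import torch
import torch.nn.functional as F


def contrastive_loss(query_embs: torch.Tensor, code_embs: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE loss over a batch of (query, code) pairs.

    Row i of query_embs pairs with row i of code_embs; every other
    code in the batch serves as a negative.
    """
    q = F.normalize(query_embs, dim=-1)
    c = F.normalize(code_embs, dim=-1)
    logits = q @ c.T / temperature  # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0))  # Positives sit on the diagonal
    return F.cross_entropy(logits, labels)

# Usage: loss = contrastive_loss(query_embs, code_embs); loss.backward()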

Indexing:

  • Function-level: Index individual functions, not entire files
  • Deduplication: Remove duplicate functions (common in forks)
  • Metadata: Include docstrings, comments, test cases
  • Incremental: Update index as new code is added (CI/CD integration)

Search quality:

  • Reranking: Use a cross-encoder to rerank the top-100 results (see the sketch after this list)
  • Diversity: Ensure diverse results (not all bubble sort variants)
  • Filtering: Filter by language, library, recency
  • Personalization: Rank by user’s coding style and preferences
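
A reranking sketch using the sentence-transformers CrossEncoder API; the model name below is illustrative, and a cross-encoder fine-tuned on query-code pairs would be preferable if available.

from sentence_transformers import CrossEncoder


def rerank(query: str, candidates: list, top_n: int = 10) -> list:
    """Rerank bi-encoder candidates with a cross-encoder.

    The cross-encoder scores each (query, candidate) pair jointly,
    which is slower but more accurate than embedding similarity.
    """
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]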

12.4 Media and Content Discovery

Media assets—images, videos, audio—accumulate at enormous scale across organizations. Semantic media search finds content by visual style, composition, audio characteristics, and creative intent, enabling discovery beyond metadata tagging and filename matching.

12.4.1 The Media Discovery Challenge

Media has unique properties:

  • Visual style: Color palette, composition, lighting
  • Creative intent: Mood, emotion, message
  • Temporal dynamics: Video and audio evolve over time
  • Quality variation: Resolution, noise, compression artifacts
  • Massive scale: Petabytes of media files

Challenge: Find visually similar or stylistically related media, not keyword matches on filenames.

The following sketch outlines a minimal media discovery engine with separate content and style embeddings:
from typing import List
import numpy as np
from dataclasses import dataclass


@dataclass
class MediaAsset:
    """Media asset with content and style embeddings."""
    asset_id: str
    file_path: str
    content_embedding: np.ndarray
    style_embedding: np.ndarray
    metadata: dict = None


class MediaDiscoveryEngine:
    """Search media by visual similarity and style."""
    def __init__(self, content_encoder, style_encoder):
        self.content_encoder = content_encoder
        self.style_encoder = style_encoder
        self.assets = []

    def add_asset(self, asset_id: str, file_path: str, image):
        """Index media asset."""
        # Extract content and style embeddings
        content_emb = self.encode_content(image)
        style_emb = self.encode_style(image)

        asset = MediaAsset(
            asset_id=asset_id,
            file_path=file_path,
            content_embedding=content_emb,
            style_embedding=style_emb
        )
        self.assets.append(asset)

    def search_by_content(self, query_image, k: int = 10) -> List[MediaAsset]:
        """Find visually similar content."""
        query_emb = self.encode_content(query_image)

        similarities = [np.dot(query_emb, asset.content_embedding)
                        for asset in self.assets]
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [self.assets[i] for i in top_indices]

    def search_by_style(self, query_image, k: int = 10) -> List[MediaAsset]:
        """Find assets with similar visual style."""
        query_emb = self.encode_style(query_image)

        similarities = [np.dot(query_emb, asset.style_embedding)
                        for asset in self.assets]
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [self.assets[i] for i in top_indices]

    def encode_content(self, image):
        """Encode semantic content (placeholder; use CLIP, ResNet, or ViT)."""
        return np.random.rand(512)  # Placeholder: random stand-in embedding

    def encode_style(self, image):
        """Encode visual style (placeholder; e.g., a Gram-matrix descriptor)."""
        return np.random.rand(256)  # Placeholder: random stand-in embedding

# Usage example
engine = MediaDiscoveryEngine(content_encoder=None, style_encoder=None)
print("Media discovery engine initialized with content and style encoders")
Media discovery engine initialized with content and style encoders
Tip: Media Search Best Practices

Visual features:

  • Content embeddings: Use CLIP, ResNet, or ViT for semantic content
  • Style embeddings: Use Gram matrices or style-specific encoders (a Gram-matrix sketch follows this list)
  • Multi-scale: Extract features at multiple resolutions
  • Color histograms: Supplement embeddings with color features
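
One common style descriptor is the Gram matrix of CNN feature maps, as in Gatys et al. (2016). A minimal sketch, assuming features is a (channels, height, width) activation tensor extracted from some pretrained CNN layer:

import numpy as np


def gram_style_embedding(features: np.ndarray) -> np.ndarray:
    """Flattened, normalized Gram matrix of CNN feature maps.

    features: (channels, height, width) activations from one layer.
    The Gram matrix captures which channels co-activate, a proxy
    for texture and style, independent of spatial layout.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    gram = flat @ flat.T / (c * h * w)  # (channels, channels)
    return gram.flatten()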

Duplicate detection:

  • Perceptual hashing: pHash, dHash for near-duplicate detection (a dHash sketch follows this list)
  • Hamming distance: Fast comparison (XOR + popcount)
  • Clustering: Group near-duplicates for review
  • Threshold tuning: Balance false positives vs false negatives
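
A minimal difference-hash (dHash) sketch using Pillow, with Hamming distance computed via XOR and popcount; the duplicate threshold noted below is illustrative and should be tuned on your data.

from PIL import Image


def dhash(image: Image.Image, hash_size: int = 8) -> int:
    """Difference hash: one bit per horizontal gradient sign."""
    img = image.convert("L").resize((hash_size + 1, hash_size))
    pixels = list(img.getdata())
    bits = 0
    for row in range(hash_size):
        for col in range(hash_size):
            left = pixels[row * (hash_size + 1) + col]
            right = pixels[row * (hash_size + 1) + col + 1]
            bits = (bits << 1) | (left < right)
    return bits


def hamming(a: int, b: int) -> int:
    """Count differing bits (XOR + popcount; Python 3.10+)."""
    return (a ^ b).bit_count()

# Rule of thumb: Hamming distance <= ~10 on a 64-bit hash suggests near-duplicates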

Performance:

  • Pre-compute embeddings: Encode assets offline during ingestion
  • GPU batching: Batch encode 100-1000 images per GPU (a batching sketch follows this list)
  • Caching: Cache embeddings in vector database
  • Progressive loading: Show low-res previews while searching
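
A minimal batched-encoding sketch; encode_batch stands in for whatever encoder you use, and the batch size should be tuned to available GPU memory.

import numpy as np


def encode_in_batches(images: list, encode_batch, batch_size: int = 256) -> np.ndarray:
    """Encode assets in fixed-size batches to amortize GPU overhead.

    encode_batch: callable mapping a list of images to an
    (n, dim) array of embeddings.
    """
    chunks = []
    for i in range(0, len(images), batch_size):
        chunks.append(encode_batch(images[i:i + batch_size]))
    return np.concatenate(chunks, axis=0)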

12.5 Enterprise Knowledge Graphs

Enterprise knowledge graphs connect entities—customers, products, employees, documents—through relationships. Embedding-based knowledge graphs use learned embeddings to represent entities and relations, enabling link prediction, entity resolution, and graph-aware search that understands how entities relate.

12.5.1 The Knowledge Graph Challenge

Traditional knowledge graphs use discrete representations (triples: subject-predicate-object). Embedding-based graphs represent entities and relations as vectors, enabling:

  • Link prediction: Predict missing relationships
  • Entity resolution: Merge duplicate entities
  • Multi-hop reasoning: Answer complex queries across relationships
  • Similarity search: Find similar entities by embeddings

Challenge: Learn embeddings that preserve graph structure and semantics.

The following sketch shows a minimal TransE-style knowledge graph embedding model:
import torch
import torch.nn as nn
from typing import List, Tuple


class KnowledgeGraphEmbedding:
    """TransE-based knowledge graph embedding."""
    def __init__(self, num_entities: int, num_relations: int, embedding_dim: int = 128):
        self.embedding_dim = embedding_dim

        # Entity and relation embeddings
        self.entity_embeddings = nn.Embedding(num_entities, embedding_dim)
        self.relation_embeddings = nn.Embedding(num_relations, embedding_dim)

        # Initialize
        nn.init.xavier_uniform_(self.entity_embeddings.weight)
        nn.init.xavier_uniform_(self.relation_embeddings.weight)

    def score_triple(self, head: torch.Tensor, relation: torch.Tensor,
                     tail: torch.Tensor) -> torch.Tensor:
        """Score a triple (head, relation, tail) using TransE."""
        head_emb = self.entity_embeddings(head)
        rel_emb = self.relation_embeddings(relation)
        tail_emb = self.entity_embeddings(tail)

        # TransE: h + r ≈ t
        score = torch.norm(head_emb + rel_emb - tail_emb, p=2, dim=-1)
        return -score  # Negate so higher is better

    def predict_tail(self, head: int, relation: int, k: int = 10) -> List[Tuple[int, float]]:
        """Predict most likely tail entities for (head, relation, ?)."""
        num_entities = self.entity_embeddings.num_embeddings
        heads = torch.full((num_entities,), head, dtype=torch.long)
        rels = torch.full((num_entities,), relation, dtype=torch.long)
        tails = torch.arange(num_entities)

        # Score every candidate tail in one batched call
        with torch.no_grad():
            scores = self.score_triple(heads, rels, tails)

        # Return top-k (highest scores first)
        top = torch.topk(scores, k)
        return list(zip(top.indices.tolist(), top.values.tolist()))

# Usage example
kg = KnowledgeGraphEmbedding(num_entities=1000, num_relations=50, embedding_dim=128)
predictions = kg.predict_tail(head=0, relation=5, k=5)
print(f"Top 5 predicted tail entities: {[p[0] for p in predictions]}")
Top 5 predicted tail entities: [0, 840, 794, 359, 912]
Tip: Knowledge Graph Embedding Best Practices

Model selection (score functions are sketched after this list):

  • TransE: Simple, works well for 1-to-1 relations
  • DistMult: Better for symmetric relations
  • ComplEx: Handles asymmetric and inverse relations
  • RotatE: Models composition, inversion, and symmetry via relational rotation
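
The models differ mainly in how they score a triple. Minimal sketches follow (higher = more plausible), where ComplEx and RotatE assume complex-valued embedding tensors:

import torch


def transe_score(h, r, t):
    """TransE: translation in real space, h + r ≈ t."""
    return -torch.norm(h + r - t, p=2, dim=-1)


def distmult_score(h, r, t):
    """DistMult: trilinear product; symmetric in h and t."""
    return (h * r * t).sum(dim=-1)


def complex_score(h, r, t):
    """ComplEx: Re(<h, r, conj(t)>) with complex embeddings."""
    return torch.real((h * r * torch.conj(t)).sum(dim=-1))


def rotate_score(h, r, t):
    """RotatE: relation rotates h in complex space, h ∘ r ≈ t (|r_i| = 1)."""
    return -(h * r - t).abs().sum(dim=-1)  # L1 distance on complex entries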

Training:

  • Negative sampling: Sample false triples for contrastive learning (a training sketch follows this list)
  • Hard negatives: Mine hard negatives (plausible but false)
  • Regularization: L2 regularization on embeddings
  • Batch training: Use large batches (1000-10000 triples)
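
A minimal training-step sketch for the TransE model above, using randomly corrupted tails as negatives and a margin ranking loss; hard-negative mining would replace the random corruption with plausible-but-false tails.

import torch


def transe_margin_loss(kg, pos_triples: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Margin ranking loss with corrupted-tail negatives.

    pos_triples: (batch, 3) long tensor of (head, relation, tail) ids;
    kg is the KnowledgeGraphEmbedding defined earlier.
    """
    h, r, t = pos_triples[:, 0], pos_triples[:, 1], pos_triples[:, 2]
    num_entities = kg.entity_embeddings.num_embeddings
    t_neg = torch.randint(num_entities, t.shape)  # Random corrupted tails

    pos_score = kg.score_triple(h, r, t)  # Higher = more plausible
    neg_score = kg.score_triple(h, r, t_neg)
    # Hinge: push positives above negatives by at least `margin`
    return torch.clamp(margin - pos_score + neg_score, min=0).mean()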

Applications:

  • Link prediction: Predict missing relationships
  • Entity resolution: Merge duplicate entities by embedding similarity
  • Graph completion: Fill in missing edges
  • Multi-hop reasoning: Answer complex queries (e.g., “customers who bought products similar to X”); a two-hop sketch follows this list
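
Under TransE, relation composition is approximately vector addition, so a two-hop query can be sketched by chaining translations. This is illustrative only; composition errors accumulate, and dedicated multi-hop models do better in practice.

import torch


def two_hop_tails(kg, head: int, rel1: int, rel2: int, k: int = 5) -> list:
    """Answer (head, rel1, ?x) then (?x, rel2, ?) in one shot:
    under TransE, h + r1 + r2 ≈ t for the two-hop target."""
    with torch.no_grad():
        h = kg.entity_embeddings(torch.tensor([head]))
        r1 = kg.relation_embeddings(torch.tensor([rel1]))
        r2 = kg.relation_embeddings(torch.tensor([rel2]))
        target = h + r1 + r2  # Composed translation
        all_t = kg.entity_embeddings.weight  # (num_entities, dim)
        dist = torch.norm(all_t - target, dim=-1)  # Distance to every entity
        top = torch.topk(-dist, k)  # Smallest distances first
    return top.indices.tolist()
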
Warning: Graph Embedding Challenges

Data quality:

  • Incomplete graphs (missing edges) degrade embeddings
  • Noisy relations (incorrect edges) poison training
  • Entity disambiguation (same name, different entities)

Scalability:

  • Billion-entity graphs require distributed training
  • Full graph materialization doesn’t fit in memory
  • Subgraph sampling required for large graphs

Interpretability:

  • Embeddings are black boxes (hard to debug)
  • Relation semantics may not align with vector operations
  • Need attribution methods to explain predictions

12.6 Key Takeaways

  • Multi-modal search unifies text, images, audio, and video in shared embedding spaces: Cross-modal retrieval (query text, retrieve images) requires contrastive training on paired data and separate per-modality encoders that project to a common vector space

  • Code search transcends syntax to find code by semantic intent: Semantic code embeddings trained on code-docstring pairs enable natural language queries like “sort a list” to find relevant implementations across languages and coding styles

  • Scientific literature search leverages citation networks and domain embeddings: SPECTER and SciBERT embeddings combined with citation graph analysis (co-citation, bibliographic coupling) enable discovery of related research beyond keyword matching

  • Media discovery finds visual similarity and creative style: Separate embeddings for content (semantic meaning) and style (color, composition, texture) enable both “find similar images” and “find images with similar aesthetic” use cases

  • Knowledge graph embeddings enable link prediction and entity resolution: TransE and related models represent entities and relations as vectors, enabling prediction of missing relationships, merging of duplicate entities, and graph-aware similarity search

  • Semantic search beyond text requires domain-specific encoders: General-purpose embeddings (CLIP, BERT) provide baseline capabilities, but production systems need fine-tuning on domain-specific data (code repositories, scientific papers, media assets)—see Chapter 14 for a decision framework on choosing the right level of customization

  • Search quality depends on training data quality: Multi-modal alignment requires clean paired data, code search needs accurate code-docstring pairs, and knowledge graphs need high-quality relationship annotations

12.7 Looking Ahead

Part IV (Advanced Applications) continues with Chapter 13, which applies embeddings to recommendation systems: embedding-based collaborative filtering that scales to billions of users and items, cold-start solutions using content embeddings and meta-learning, real-time personalization with streaming embeddings, diversity and fairness constraints that prevent filter bubbles, and cross-domain recommendation transfer that leverages embeddings across product categories and platforms.

12.8 Further Reading

12.8.1 Multi-Modal Learning

  • Radford, Alec, et al. (2021). “Learning Transferable Visual Models From Natural Language Supervision (CLIP).” ICML.
  • Jia, Chao, et al. (2021). “Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision (ALIGN).” ICML.
  • Girdhar, Rohit, et al. (2023). “ImageBind: One Embedding Space To Bind Them All.” CVPR.
  • Baltrušaitis, Tadas, et al. (2019). “Multimodal Machine Learning: A Survey and Taxonomy.” IEEE TPAMI.

12.8.2 Code Search and Software Intelligence

  • Feng, Zhangyin, et al. (2020). “CodeBERT: A Pre-Trained Model for Programming and Natural Languages.” EMNLP.
  • Guo, Daya, et al. (2021). “GraphCodeBERT: Pre-training Code Representations with Data Flow.” ICLR.
  • Husain, Hamel, et al. (2019). “CodeSearchNet Challenge: Evaluating the State of Semantic Code Search.” arXiv.
  • Chen, Mark, et al. (2021). “Evaluating Large Language Models Trained on Code (Codex).” arXiv.

12.8.4 Media and Content Discovery

  • Gatys, Leon A., et al. (2016). “Image Style Transfer Using Convolutional Neural Networks.” CVPR.
  • Johnson, Justin, et al. (2016). “Perceptual Losses for Real-Time Style Transfer and Super-Resolution.” ECCV.
  • Simonyan, Karen, and Andrew Zisserman (2014). “Very Deep Convolutional Networks for Large-Scale Image Recognition.” ICLR.
  • Dosovitskiy, Alexey, et al. (2021). “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT).” ICLR.

12.8.5 Knowledge Graph Embeddings

  • Bordes, Antoine, et al. (2013). “Translating Embeddings for Modeling Multi-relational Data (TransE).” NeurIPS.
  • Yang, Bishan, et al. (2015). “Embedding Entities and Relations for Learning and Inference in Knowledge Bases (DistMult).” ICLR.
  • Trouillon, Théo, et al. (2016). “Complex Embeddings for Simple Link Prediction (ComplEx).” ICML.
  • Sun, Zhiqing, et al. (2019). “RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space.” ICLR.
  • Wang, Quan, et al. (2017). “Knowledge Graph Embedding: A Survey of Approaches and Applications.” IEEE TKDE.