9  Code Embeddings

Note: Chapter Overview

This chapter covers code embeddings—representations that convert source code into vectors capturing program semantics. We explore how these embeddings understand what code does, not just how it’s written, enabling applications from semantic code search to vulnerability detection.

9.1 What Are Code Embeddings?

Code embeddings convert source code into vectors that capture program semantics—what the code does, not just how it’s written. Two functions that sum a list of numbers should have similar embeddings whether implemented with a loop or the built-in sum() function.

The challenge with code is that syntax varies widely while functionality remains the same. Variable names, formatting, and implementation choices differ between programmers, but the underlying logic may be identical. Code embeddings must see through surface differences to capture semantic similarity.

The example below uses a general text model for demonstration. Production systems use specialized code models such as CodeBERT or StarCoder, which are trained on millions of code repositories and understand programming language syntax and semantics.
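
If you want to try a code-specific encoder right away, the sketch below shows one way to pull vectors out of the microsoft/codebert-base checkpoint with Hugging Face transformers. Mean-pooling the token states into a single vector is our assumption here; it is a common convention, not the only choice.

import torch
from transformers import AutoModel, AutoTokenizer

# Code-specific encoder sketch (assumes the transformers library and the
# microsoft/codebert-base checkpoint are available)
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
encoder = AutoModel.from_pretrained("microsoft/codebert-base")

def embed_code(snippet):
    inputs = tokenizer(snippet, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (1, tokens, 768)
    return hidden.mean(dim=1).squeeze(0)               # mean-pool to a single vector

print(embed_code("def add(a, b): return a + b").shape)  # torch.Size([768])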

9.2 Creating Code Embeddings

"""
Code Embeddings: Source Code as Vectors
"""

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# General text model for demo (production: use CodeBERT, StarCoder, etc.)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Same functionality, different implementations
code_snippets = {
    'sum_loop': '''
def sum_numbers(nums):
    total = 0
    for n in nums:
        total += n
    return total
''',
    'sum_builtin': '''
def sum_numbers(numbers):
    return sum(numbers)
''',
    'reverse_loop': '''
def reverse_list(lst):
    result = []
    for i in range(len(lst)-1, -1, -1):
        result.append(lst[i])
    return result
''',
    'reverse_slice': '''
def reverse_list(items):
    return items[::-1]
''',
}

embeddings = {name: model.encode(code) for name, code in code_snippets.items()}

print(f"Embedding dimension: {len(embeddings['sum_loop'])}\n")
print("Code embedding similarities:\n")
print("Same functionality, different implementation:")
sum_sim = cosine_similarity([embeddings['sum_loop']], [embeddings['sum_builtin']])[0][0]
rev_sim = cosine_similarity([embeddings['reverse_loop']], [embeddings['reverse_slice']])[0][0]
print(f"  sum (loop) ↔ sum (builtin):       {sum_sim:.3f}")
print(f"  reverse (loop) ↔ reverse (slice): {rev_sim:.3f}")

print("\nDifferent functionality:")
cross_sim = cosine_similarity([embeddings['sum_loop']], [embeddings['reverse_loop']])[0][0]
print(f"  sum ↔ reverse:                    {cross_sim:.3f}")
Embedding dimension: 384

Code embedding similarities:

Same functionality, different implementation:
  sum (loop) ↔ sum (builtin):       0.809
  reverse (loop) ↔ reverse (slice): 0.822

Different functionality:
  sum ↔ reverse:                    0.272

Functions with the same purpose cluster together even with different implementations. The two sum functions are more similar to each other than to the reverse functions, and vice versa. This enables powerful applications like “find code similar to this function” or “detect if this code was copied from somewhere.”

9.3 When to Use Code Embeddings

Use code embeddings for semantic code search, code clone detection, vulnerability detection, code recommendation, and repository organization.
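
As a quick illustration of the first application, the sketch below reuses the model and embeddings objects from Section 9.2 to rank the example functions against a natural-language query. A production search system would swap in a code-specific model and an approximate nearest-neighbor index, but the shape of the workflow is the same.

# Semantic code search sketch: embed a natural-language query and rank
# the functions from Section 9.2 by cosine similarity
query = "add up all the values in a list"
query_vec = model.encode(query)

scores = {
    name: cosine_similarity([query_vec], [vec])[0][0]
    for name, vec in embeddings.items()
}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:15s} {score:.3f}")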

Beyond this overview, the book doesn’t include dedicated chapters on code embedding applications. If you’d like to see them covered in future editions, reach out to the author.

9.4 Advanced: How Code Models Learn

Note: Optional Section

This section explains how code embedding models capture program semantics. Skip if you just need to use pre-built embeddings.

9.4.1 Code as Natural Language

The simplest approach treats code as text and applies standard NLP techniques. This works surprisingly well because code has structure, naming conventions, and patterns that convey meaning.

# These look different but serve the same purpose
# A text model picks up on shared vocabulary: "sum", "numbers", "return"

def sum_numbers_v1(nums):
    return sum(nums)

def sum_numbers_v2(numbers):
    total = 0
    for n in numbers:
        total = total + n
    return total

9.4.2 Abstract Syntax Trees (AST)

More sophisticated models parse code into its structural representation:

import ast

code = "def add(a, b): return a + b"
tree = ast.parse(code)

# Printing ast.dump(tree) reveals the structure (simplified below):
# FunctionDef(name='add', args=['a', 'b'], body=[Return(BinOp(...))])
print(ast.dump(tree))

Training on ASTs helps models understand that variable names don’t change program behavior.
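
A toy way to see this invariance with Python’s built-in ast module is to rewrite every identifier to a placeholder and compare the resulting dumps. Real models learn the invariance from data rather than applying a rule like this, so treat the snippet purely as an illustration.

import ast

class NormalizeNames(ast.NodeTransformer):
    """Rename every variable and argument to a placeholder."""
    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="VAR", ctx=node.ctx), node)

    def visit_arg(self, node):
        node.arg = "VAR"
        return node

def normalized_dump(code):
    return ast.dump(NormalizeNames().visit(ast.parse(code)))

# Different variable names, identical structure
print(normalized_dump("def add(a, b): return a + b") ==
      normalized_dump("def add(x, y): return x + y"))   # True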

9.4.3 Data Flow Graphs

GraphCodeBERT goes further by modeling how data flows through programs:

# Data flow: x → y → z
x = input()
y = x.upper()
z = len(y)

Understanding data flow helps detect bugs where data is used before initialization or after it’s freed.
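
For intuition, the deliberately simplified sketch below recovers those def-use edges from the snippet above using the ast module. GraphCodeBERT’s actual data-flow extraction is more involved; this is only meant to show what a data-flow edge is.

import ast

code = """
x = input()
y = x.upper()
z = len(y)
"""

defined = set()
edges = []
for stmt in ast.parse(code).body:
    if isinstance(stmt, ast.Assign):
        target = stmt.targets[0].id
        # Which previously defined variables does the right-hand side read?
        reads = [n.id for n in ast.walk(stmt.value)
                 if isinstance(n, ast.Name) and n.id in defined]
        edges.extend((src, target) for src in reads)
        defined.add(target)

print(edges)   # [('x', 'y'), ('y', 'z')]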

9.4.4 Contrastive Training

Retrieval-oriented code models in the CodeBERT family are commonly fine-tuned with contrastive objectives:

  1. Natural Language → Code: Match documentation to the code it describes
  2. Code → Code: Match semantically equivalent implementations
  3. Negative sampling: Push apart unrelated code pairs

# Positive pair: documentation matches code
doc = "Returns the sum of all numbers in the list"
code = "def sum_list(nums): return sum(nums)"

# Negative pair: documentation doesn't match
doc = "Returns the sum of all numbers in the list"
code = "def reverse_list(lst): return lst[::-1]"

9.5 Practical Considerations

9.5.1 Embedding Granularity

Decide what to embed based on your use case:

  • Function-level: Best for code search and clone detection
  • File-level: Good for repository organization
  • Line/block-level: Useful for vulnerability detection

9.5.2 Multi-Language Support

Models like StarCoder support 80+ programming languages. For cross-language search:

# A universal code model embeds these similarly
# because they both sort a list

# Python
sorted_list = sorted(items)

# JavaScript
const sortedList = [...items].sort();  // copy, then sort
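
One way to sanity-check this behavior is to embed both snippets plus an unrelated one and compare similarities. The sketch below reuses the general text model from Section 9.2 as a stand-in; for dependable cross-language results you would use a genuine multi-language code model.

# Cross-language sanity check (general text model as a stand-in)
py_code = "sorted_list = sorted(items)"
js_code = "const sortedList = [...items].sort();"
unrelated = "total = sum(values)"

py_vec, js_vec, other_vec = model.encode([py_code, js_code, unrelated])
print(cosine_similarity([py_vec], [js_vec])[0][0])      # same intent, different language
print(cosine_similarity([py_vec], [other_vec])[0][0])   # different intent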

9.5.3 Handling Long Code

Code often exceeds model context limits. Strategies include:

  1. Chunking: Split into functions/classes (see the sketch after this list)
  2. Hierarchical encoding: Embed chunks, then combine
  3. Summarization: Use docstrings/comments plus key lines
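
To make the first two strategies concrete, the sketch below splits a file into function-level chunks with the ast module, embeds each chunk, and mean-pools the chunk vectors into one file-level vector. It assumes the model object from Section 9.2 is still in scope, and mean pooling is only the simplest possible combiner.

import ast
import numpy as np

source = '''
def add(a, b):
    return a + b

def reverse_items(items):
    return items[::-1]
'''

# 1. Chunking: one string per top-level function
chunks = [ast.get_source_segment(source, node)
          for node in ast.parse(source).body
          if isinstance(node, ast.FunctionDef)]

# 2. Hierarchical encoding: embed chunks, then combine (naive mean pooling)
chunk_vectors = model.encode(chunks)
file_vector = np.mean(chunk_vectors, axis=0)
print(len(chunks), file_vector.shape)   # 2 (384,)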

9.6 Key Takeaways

  • Code embeddings capture what code does, not just how it’s written—semantically equivalent implementations cluster together
  • General text models work for basic tasks, but specialized models (CodeBERT, StarCoder) understand programming language structure
  • Training approaches range from treating code as text to parsing ASTs and data flow graphs
  • Applications include semantic search, clone detection, vulnerability finding, and code recommendation
  • Granularity matters: embed functions for search, files for organization, blocks for vulnerability detection

9.7 Looking Ahead

This completes Part II on embedding types. Chapter 10 explores advanced patterns like hybrid embeddings, multi-vector representations, and quantized embeddings that extend these foundational types.

9.8 Further Reading

  • Feng, Z., et al. (2020). “CodeBERT: A Pre-Trained Model for Programming and Natural Languages.” EMNLP Findings
  • Guo, D., et al. (2021). “GraphCodeBERT: Pre-training Code Representations with Data Flow.” ICLR
  • Li, R., et al. (2023). “StarCoder: may the source be with you!” arXiv:2305.06161