9  Code Embeddings

Note: Chapter Overview

This chapter covers code embeddings—representations that convert source code into vectors capturing program semantics. We explore how these embeddings understand what code does, not just how it’s written, enabling applications from semantic code search to vulnerability detection.

9.1 What Are Code Embeddings?

Code embeddings convert source code into vectors that capture program semantics—what the code does, not just how it’s written. Two functions that sum a list of numbers should have similar embeddings whether implemented with a loop or the built-in sum() function.

The challenge with code is that syntax varies widely while functionality remains the same. Variable names, formatting, and implementation choices differ between programmers, but the underlying logic may be identical. Code embeddings must see through surface differences to capture semantic similarity.

The example below uses a general text model for demonstration. Production systems use specialized code models such as CodeBERT or StarCoder, which are trained on millions of code repositories and understand programming language syntax and semantics.
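
If you want to try a code-specific encoder right away, the sketch below shows one way to pull vectors out of the microsoft/codebert-base checkpoint with Hugging Face transformers. Mean-pooling the token states into a single vector is our assumption here; it is a common convention, not the only choice.

import torch
from transformers import AutoModel, AutoTokenizer

# Code-specific encoder sketch (assumes the transformers library and the
# microsoft/codebert-base checkpoint are available)
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
encoder = AutoModel.from_pretrained("microsoft/codebert-base")

def embed_code(snippet):
    inputs = tokenizer(snippet, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (1, tokens, 768)
    return hidden.mean(dim=1).squeeze(0)               # mean-pool to a single vector

print(embed_code("def add(a, b): return a + b").shape)  # torch.Size([768])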

9.2 Creating Code Embeddings

"""
Code Embeddings: Source Code as Vectors
"""

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# General text model for demo (production: use CodeBERT, StarCoder, etc.)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Same functionality, different implementations
code_snippets = {
    'sum_loop': '''
def sum_numbers(nums):
    total = 0
    for n in nums:
        total += n
    return total
''',
    'sum_builtin': '''
def sum_numbers(numbers):
    return sum(numbers)
''',
    'reverse_loop': '''
def reverse_list(lst):
    result = []
    for i in range(len(lst)-1, -1, -1):
        result.append(lst[i])
    return result
''',
    'reverse_slice': '''
def reverse_list(items):
    return items[::-1]
''',
}

embeddings = {name: model.encode(code) for name, code in code_snippets.items()}

print(f"Embedding dimension: {len(embeddings['sum_loop'])}\n")
print("Code embedding similarities:\n")
print("Same functionality, different implementation:")
sum_sim = cosine_similarity([embeddings['sum_loop']], [embeddings['sum_builtin']])[0][0]
rev_sim = cosine_similarity([embeddings['reverse_loop']], [embeddings['reverse_slice']])[0][0]
print(f"  sum (loop) ↔ sum (builtin):       {sum_sim:.3f}")
print(f"  reverse (loop) ↔ reverse (slice): {rev_sim:.3f}")

print("\nDifferent functionality:")
cross_sim = cosine_similarity([embeddings['sum_loop']], [embeddings['reverse_loop']])[0][0]
print(f"  sum ↔ reverse:                    {cross_sim:.3f}")
Embedding dimension: 384

Code embedding similarities:

Same functionality, different implementation:
  sum (loop) ↔ sum (builtin):       0.809
  reverse (loop) ↔ reverse (slice): 0.822

Different functionality:
  sum ↔ reverse:                    0.272

Functions with the same purpose cluster together even with different implementations. The two sum functions are more similar to each other than to the reverse functions, and vice versa. This enables powerful applications like “find code similar to this function” or “detect if this code was copied from somewhere.”

9.3 When to Use Code Embeddings

Use code embeddings for semantic code search, code clone detection, vulnerability detection, code recommendation, and repository organization.
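
As a quick illustration of the first application, the sketch below reuses the model and embeddings objects from Section 9.2 to rank the example functions against a natural-language query. A production search system would swap in a code-specific model and an approximate nearest-neighbor index, but the shape of the workflow is the same.

# Semantic code search sketch: embed a natural-language query and rank
# the functions from Section 9.2 by cosine similarity
query = "add up all the values in a list"
query_vec = model.encode(query)

scores = {
    name: cosine_similarity([query_vec], [vec])[0][0]
    for name, vec in embeddings.items()
}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:15s} {score:.3f}")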

Beyond this overview, the book doesn’t include dedicated chapters on code embedding applications. If you’d like to see them covered in future editions, reach out to the author.

9.4 Advanced: How Code Models Learn

Note: Optional Section

This section explains how code embedding models capture program semantics. Skip if you just need to use pre-built embeddings.

9.4.1 Code as Natural Language

The simplest approach treats code as text and applies standard NLP techniques. This works surprisingly well because code has structure, naming conventions, and patterns that convey meaning.

# These look different but serve the same purpose
# A text model picks up on shared vocabulary: "sum", "numbers", "return"

def sum_numbers_v1(nums):
    return sum(nums)

def sum_numbers_v2(numbers):
    total = 0
    for n in numbers:
        total = total + n
    return total

9.4.2 Abstract Syntax Trees (AST)

More sophisticated models parse code into its structural representation:

import ast

code = "def add(a, b): return a + b"
tree = ast.parse(code)

# Printing ast.dump(tree) reveals the structure (simplified below):
# FunctionDef(name='add', args=['a', 'b'], body=[Return(BinOp(...))])
print(ast.dump(tree))

Training on ASTs helps models understand that variable names don’t change program behavior.
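
A toy way to see this invariance with Python’s built-in ast module is to rewrite every identifier to a placeholder and compare the resulting dumps. Real models learn the invariance from data rather than applying a rule like this, so treat the snippet purely as an illustration.

import ast

class NormalizeNames(ast.NodeTransformer):
    """Rename every variable and argument to a placeholder."""
    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="VAR", ctx=node.ctx), node)

    def visit_arg(self, node):
        node.arg = "VAR"
        return node

def normalized_dump(code):
    return ast.dump(NormalizeNames().visit(ast.parse(code)))

# Different variable names, identical structure
print(normalized_dump("def add(a, b): return a + b") ==
      normalized_dump("def add(x, y): return x + y"))   # True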

9.4.3 Data Flow Graphs

GraphCodeBERT goes further by modeling how data flows through programs:

# Data flow: x → y → z
x = input()
y = x.upper()
z = len(y)

Understanding data flow helps detect bugs where data is used before initialization or after it’s freed.
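
For intuition, the deliberately simplified sketch below recovers those def-use edges from the snippet above using the ast module. GraphCodeBERT’s actual data-flow extraction is more involved; this is only meant to show what a data-flow edge is.

import ast

code = """
x = input()
y = x.upper()
z = len(y)
"""

defined = set()
edges = []
for stmt in ast.parse(code).body:
    if isinstance(stmt, ast.Assign):
        target = stmt.targets[0].id
        # Which previously defined variables does the right-hand side read?
        reads = [n.id for n in ast.walk(stmt.value)
                 if isinstance(n, ast.Name) and n.id in defined]
        edges.extend((src, target) for src in reads)
        defined.add(target)

print(edges)   # [('x', 'y'), ('y', 'z')]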

9.4.4 Contrastive Training

Retrieval-oriented code models in the CodeBERT family are commonly fine-tuned with contrastive objectives:

  1. Natural Language → Code: Match documentation to the code it describes
  2. Code → Code: Match semantically equivalent implementations
  3. Negative sampling: Push apart unrelated code pairs

# Positive pair: documentation matches code
doc = "Returns the sum of all numbers in the list"
code = "def sum_list(nums): return sum(nums)"

# Negative pair: documentation doesn't match
doc = "Returns the sum of all numbers in the list"
code = "def reverse_list(lst): return lst[::-1]"

9.5 Practical Considerations

9.5.1 Embedding Granularity

Decide what to embed based on your use case:

  • Function-level: Best for code search and clone detection
  • File-level: Good for repository organization
  • Line/block-level: Useful for vulnerability detection

9.5.2 Multi-Language Support

Models like StarCoder support 80+ programming languages. For cross-language search:

# A universal code model embeds these similarly
# because they both sort a list

# Python
sorted_list = sorted(items)

# JavaScript
const sortedList = [...items].sort();  // copy, then sort
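
One way to sanity-check this behavior is to embed both snippets plus an unrelated one and compare similarities. The sketch below reuses the general text model from Section 9.2 as a stand-in; for dependable cross-language results you would use a genuine multi-language code model.

# Cross-language sanity check (general text model as a stand-in)
py_code = "sorted_list = sorted(items)"
js_code = "const sortedList = [...items].sort();"
unrelated = "total = sum(values)"

py_vec, js_vec, other_vec = model.encode([py_code, js_code, unrelated])
print(cosine_similarity([py_vec], [js_vec])[0][0])      # same intent, different language
print(cosine_similarity([py_vec], [other_vec])[0][0])   # different intent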

9.5.3 Handling Long Code

Code often exceeds model context limits. Strategies include:

  1. Chunking: Split into functions/classes (see the sketch after this list)
  2. Hierarchical encoding: Embed chunks, then combine
  3. Summarization: Use docstrings/comments plus key lines
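
To make the first two strategies concrete, the sketch below splits a file into function-level chunks with the ast module, embeds each chunk, and mean-pools the chunk vectors into one file-level vector. It assumes the model object from Section 9.2 is still in scope, and mean pooling is only the simplest possible combiner.

import ast
import numpy as np

source = '''
def add(a, b):
    return a + b

def reverse_items(items):
    return items[::-1]
'''

# 1. Chunking: one string per top-level function
chunks = [ast.get_source_segment(source, node)
          for node in ast.parse(source).body
          if isinstance(node, ast.FunctionDef)]

# 2. Hierarchical encoding: embed chunks, then combine (naive mean pooling)
chunk_vectors = model.encode(chunks)
file_vector = np.mean(chunk_vectors, axis=0)
print(len(chunks), file_vector.shape)   # 2 (384,)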

9.6 Key Takeaways

  • Code embeddings capture what code does, not just how it’s written—semantically equivalent implementations cluster together
  • General text models work for basic tasks, but specialized models (CodeBERT, StarCoder) understand programming language structure
  • Training approaches range from treating code as text to parsing ASTs and data flow graphs
  • Applications include semantic search, clone detection, vulnerability finding, and code recommendation
  • Granularity matters: embed functions for search, files for organization, blocks for vulnerability detection

9.7 Looking Ahead

This completes Part II on embedding types. Chapter 10 explores advanced patterns like hybrid embeddings, multi-vector representations, and quantized embeddings that extend these foundational types.

9.8 Further Reading

  • Feng, Z., et al. (2020). “CodeBERT: A Pre-Trained Model for Programming and Natural Languages.” EMNLP Findings
  • Guo, D., et al. (2021). “GraphCodeBERT: Pre-training Code Representations with Data Flow.” ICLR
  • Li, R., et al. (2023). “StarCoder: may the source be with you!” arXiv:2305.06161