
L02 - Embeddings & Positional Encoding: Giving Numbers Meaning

How words become vectors in space, and how we tell time without a clock


In L01 - Tokenization, we turned text into IDs. But to a neural network, ID 464 and ID 465 are just arbitrary numbers. There is no inherent relationship between them.

In this post, we solve two problems:

  1. Meaning: How do we represent words so that “King” is mathematically closer to “Queen” than it is to “Toaster”?

  2. Order: Since the attention-based model we’ll build in the next lesson processes all tokens at once, how do we tell it that “The dog bit the man” is different from “The man bit the dog”?

By the end of this post, you’ll understand:

  1. How an embedding layer turns token IDs into vectors that capture meaning.

  2. Why a parallel, attention-based model needs explicit position information.

  3. How sine/cosine positional encodings provide that information, and how they are added to the embeddings.


Part 1: The Embedding Space

Imagine every word is a point in a high-dimensional room. Words with similar meanings are “clumped” together. This is an Embedding Space.

An Embedding Layer is simply a big lookup table. If our vocabulary size is 10,000 and we want each word to be represented by a vector of 512 numbers, our table is a $10,000 \times 512$ matrix.

The Lookup Operation

When the model sees ID 5, it doesn’t do math on the number 5. It simply grabs row 5 of the embedding matrix and uses that 512-number vector as the word’s representation.

(Figure: token ID 5 selecting row 5 of the embedding matrix.)
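
A minimal numpy sketch of that lookup (the matrix here is filled with random numbers; in a real model its values are learned during training):

```python
import numpy as np

vocab_size, d_model = 10_000, 512

# The "embedding layer" is just a matrix with one row per token ID.
# Random values stand in for the learned parameters.
embedding_matrix = np.random.randn(vocab_size, d_model) * 0.02

token_id = 5
word_vector = embedding_matrix[token_id]   # grab row 5 -- no arithmetic on "5" itself

print(word_vector.shape)                   # (512,)
```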

Part 2: The Problem of Order

Unlike the MLPs we built in the Neural Networks from Scratch series, which consume one fixed-size input at a time, the attention mechanism we’ll introduce next works on a whole sequence in parallel. It looks at every word in a sentence at the exact same time.

Without help, the attention-based model sees the sentence “The dog bit the man” as a bag of words. It has no idea which word came first.

We fix this by adding Positional Encodings.


Part 3: Positional Encoding (The Sine/Cosine Trick)

This is often the most confusing part of the Transformer architecture, so let’s derive it from scratch.

The Problem: Why not just count?

If we want to represent the order of words, the simplest idea is to just assign an integer to each position: the first word gets 0, the second gets 1, the third gets 2, and so on.

Why this fails:

  1. Exploding Values: For a long document, the 5,000th word would have the value 5000. Neural networks hate large, unbounded numbers; they cause gradients to explode and training to become unstable.

  2. Inconsistent Steps (if normalized): You might try dividing by the total length (e.g., 0.0, 0.5, 1.0 for a 3-word sentence). But then the “time distance” between neighbouring words changes depending on the sentence length, as the quick sketch below shows. We need a method where the step size is bounded and consistent.
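
A quick sketch of that inconsistency (the sentence lengths are arbitrary examples):

```python
# Normalize positions by sentence length: pos / (length - 1)
def normalized_positions(length):
    return [pos / (length - 1) for pos in range(length)]

print(normalized_positions(3))    # [0.0, 0.5, 1.0]           -> neighbours are 0.5 apart
print(normalized_positions(11))   # [0.0, 0.1, 0.2, ..., 1.0] -> neighbours are 0.1 apart
# The same "one word later" step means different things in different sentences.
```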

The Intuition: The “Binary Clock”

So, how do we represent numbers that get bigger and bigger without using huge values? We use patterns. Think of how binary numbers work.

Let’s start with a simple 3-bit binary counter to see the pattern clearly:

| Position | Bit 2 (Slow) | Bit 1 (Medium) | Bit 0 (Fast) | Binary |
|----------|--------------|----------------|--------------|--------|
| 0        | 0            | 0              | 0            | 000    |
| 1        | 0            | 0              | 1            | 001    |
| 2        | 0            | 1              | 0            | 010    |
| 3        | 0            | 1              | 1            | 011    |
| 4        | 1            | 0              | 0            | 100    |
| 5        | 1            | 0              | 1            | 101    |
| 6        | 1            | 1              | 0            | 110    |
| 7        | 1            | 1              | 1            | 111    |

Notice the pattern?

Each column oscillates at a different frequency. Together, they create a unique combination for every row, using only 0s and 1s.
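
You can reproduce the table in a couple of lines and watch each bit flip at its own rate:

```python
for pos in range(8):
    bit2, bit1, bit0 = format(pos, "03b")   # e.g. 5 -> '1', '0', '1'
    print(pos, bit2, bit1, bit0)            # bit 0 flips every step, bit 1 every 2, bit 2 every 4
```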

From Binary to Continuous (The Spectrum)

Transformers adapt this binary idea using continuous waves (Sine and Cosine). But instead of just “fast” and “slow,” we have a smooth spectrum of frequencies across the embedding dimensions.

The Formula

For a position $pos$ and dimension pair index $i$ (covering dimensions $2i$ and $2i+1$):

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

Don’t let the 10000... term scare you. It is just a “wavelength knob.” Let’s plug in some real numbers to see it in action.

Example: Plugging in the Numbers

Imagine we have a model with $d_{model} = 512$. This means we have 256 pairs of Sine/Cosine waves.

Case 1: The “Fast” Pair (Dimensions 0 & 1). We are at the start of the vector ($i = 0$).

$$\text{Denominator} = 10000^{0} = 1$$

This pair acts like the “Seconds Hand” (High Precision): $\sin(pos)$ completes a full cycle roughly every 6 positions.

Case 2: The “Slow” Pair (Dimensions 510 & 511). We are at the end of the vector ($i = 255$).

$$\text{Denominator} = 10000^{510/512} \approx 10000$$

This pair acts like the “Hour Hand” (Long-term Context). It takes ~62,800 words to complete one cycle!
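
A quick sanity check of those numbers (the constants come straight from the formula above):

```python
import math

d_model = 512

for i in (0, 255):
    denominator = 10000 ** (2 * i / d_model)
    wavelength = 2 * math.pi * denominator   # positions needed for one full sine cycle
    print(f"i={i}: denominator ≈ {denominator:.0f}, wavelength ≈ {wavelength:.0f} positions")

# Roughly: i=0 gives a wavelength of ~6 positions; i=255 gives ~60,000 positions
# (the ~62,800 figure above rounds the denominator up to 10000: 2π · 10000 ≈ 62,832).
```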

Visualization

Let’s generate the matrix. In the plot below, each row is a position in the sequence and each column is an embedding dimension.

Notice the “barber pole” pattern? That is the frequencies getting slower as you move to the right (higher dimensions).
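
Here is a minimal numpy/matplotlib sketch that builds and plots the matrix (the 100-position length is just a demo choice):

```python
import numpy as np
import matplotlib.pyplot as plt

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2) pair index
    angles = positions / (10000 ** (2 * i / d_model))    # (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

pe = positional_encoding(max_len=100, d_model=512)

plt.figure(figsize=(10, 6))
plt.imshow(pe, aspect="auto", cmap="RdBu")
plt.xlabel("Embedding dimension")
plt.ylabel("Position")
plt.colorbar()
plt.show()
```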

(Figure: heatmap of the positional-encoding matrix; positions run down the vertical axis, embedding dimensions across the horizontal axis.)

Part 4: Putting it Together

The final input to our model is:

$$\text{Input} = \text{Embedding}(w) + \text{PositionalEncoding}(pos)$$

Now, the vector for “dog” at position 2 is slightly different from the vector for “dog” at position 5. The “meaning” is the same, but the “stamp” of its location is unique.

```python
# Pseudo-code for the input pipeline
word_ids = [464, 2068, 7586] # "The quick brown"
embeddings = embedding_layer(word_ids) # Shape: [3, 512]
positions = positional_encoding_layer(range(len(word_ids))) # Shape: [3, 512]

final_input = embeddings + positions
```
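
And a small runnable numpy version of the same pipeline, where a random matrix stands in for the learned embedding table and `positional_encoding` follows the formula from Part 3:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)
    return pe

vocab_size, d_model = 10_000, 512
embedding_matrix = np.random.randn(vocab_size, d_model) * 0.02   # stands in for learned weights

word_ids = [464, 2068, 7586]                               # "The quick brown"
embeddings = embedding_matrix[word_ids]                    # (3, 512) -- one row per token
positions = positional_encoding(len(word_ids), d_model)    # (3, 512)

final_input = embeddings + positions                       # element-wise sum
print(final_input.shape)                                   # (3, 512)
```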

Concrete Example: What the Numbers Look Like

Let’s see what these vectors actually contain (showing just the first 8 dimensions; the embedding values are made-up “learned” numbers, and the positional encodings here are computed with a toy $d_{model} = 8$ so the sin/cos pattern is easy to spot):

```python
# Token: "The" (position 0)
embedding_The = [ 0.21,  0.15, -0.33,  0.08, -0.12,  0.19,  0.05, -0.28, ...]  # Learned
pos_encoding_0 = [ 0.00,  1.00,  0.00,  1.00,  0.00,  1.00,  0.00,  1.00, ...]  # Fixed (sin/cos)
final_input_0  = [ 0.21,  1.15, -0.33,  1.08, -0.12,  1.19,  0.05,  0.72, ...]  # Sum

# Token: "quick" (position 1)
embedding_quick = [-0.18,  0.42,  0.11, -0.25,  0.37, -0.14,  0.22,  0.09, ...]  # Learned
pos_encoding_1  = [ 0.84,  0.54,  0.10,  0.99,  0.01,  1.00,  0.00,  1.00, ...]  # Fixed (sin/cos)
final_input_1   = [ 0.66,  0.96,  0.21,  0.74,  0.38,  0.86,  0.22,  1.09, ...]  # Sum

# Token: "brown" (position 2)
embedding_brown = [ 0.09, -0.31,  0.44,  0.17, -0.08,  0.26, -0.13,  0.35, ...]  # Learned
pos_encoding_2  = [ 0.91, -0.42,  0.20,  0.98,  0.02,  1.00,  0.00,  1.00, ...]  # Fixed (sin/cos)
final_input_2   = [ 1.00, -0.73,  0.64,  1.15, -0.06,  1.26, -0.13,  1.35, ...]  # Sum
```

Key observations:

  1. Embeddings have learned values (positive and negative) that capture word meaning

  2. Positional encodings follow the sin/cos pattern (notice the periodic structure)

  3. Final input is simply the element-wise sum of the two

  4. Same word at different positions gets different final vectors (different PE added)


Summary

  1. Embeddings map discrete IDs to continuous vectors in which closeness in space reflects similarity in meaning.

  2. Positional Encodings inject a sense of order into a model that otherwise sees everything at once.

  3. Addition: We simply add these two vectors together. The model learns to separate the “meaning” signal from the “position” signal during training.


Next Up: L03 – The Attention Mechanism. This is the “Aha!” moment of the entire series. We will build the logic that allows the model to decide which words in a sentence are most relevant to each other.