
NN02 - Training an Edge-Detection Neural Network from Scratch

A visual guide to how neural networks learn, using edge detection as our example


Neural networks can seem intimidating. Terms like “cross-entropy loss,” “softmax,” and “backpropagation” sound complex. But at their core, these are just clever combinations of simple operations.

This tutorial explains how neural networks learn — specifically, how they automatically discover the right weights through training. We’ll use edge detection as our running example: training a network to look at a small image patch and answer “Is there a vertical edge here?”

By the end, you'll understand:

  • How a neuron's weighted sum and ReLU activation detect patterns

  • How softmax turns raw scores into probabilities

  • How cross-entropy loss measures how wrong a prediction is

  • How gradient descent, the chain rule, and backpropagation find the right weights automatically

Companion post: For a deeper dive into how a single neuron works as a pattern detector (weights, ReLU, bias), see NN01 - Edge Detection Intuition: A Single Neuron as Pattern Matching. This tutorial focuses on the training process — how the network discovers those weights automatically.


Part 1: The Problem - Detecting Edges

Let's start with a concrete task. Given a 5x5 grayscale image patch, we want to classify it as one of two classes: Edge (a vertical dark-to-bright transition) or No Edge (roughly uniform brightness).

Here are some examples:

[Figure: example 5×5 patches, edge and no-edge]

The Network’s Job

Our neural network will:

  1. Take 25 pixel values as input (the 5×5 patch flattened into a vector)

  2. Process them through weights and activations

  3. Output two scores: one for “Edge”, one for “No Edge”

  4. Convert scores to probabilities using softmax

  5. Compare to the true label using the loss function

📐 Important: Notation Guide

  • Lowercase (x, z, w) → Scalar (single number), e.g. x = 0.5

  • Lowercase bold (\mathbf{x}, \mathbf{z}) → Vector (array), e.g. \mathbf{x} = [0.1, 0.5, 0.9]

  • Subscript (x_i, w_i) → Element i of a vector, e.g. x_0 = 0.1

  • Uppercase (W, X) → Matrix (grid), e.g. W is 25×8

Let’s build up each piece.

[Figure: network architecture, 25 inputs → 8 hidden (ReLU) → 2 outputs (softmax)]

Seeing the Forward Pass in Action

Let’s trace real numbers through the network to see the big picture. Don’t worry about understanding every detail yet — we’ll explain each concept in the sections that follow:

For now, just watch the shapes and values flow through:

Input image (5×5):
[[0.1 0.1 0.5 0.9 0.9]
 [0.1 0.1 0.5 0.9 0.9]
 [0.1 0.1 0.5 0.9 0.9]
 [0.1 0.1 0.5 0.9 0.9]
 [0.1 0.1 0.5 0.9 0.9]]

This is an EDGE: dark on left, bright on right

Flattened to vector x: shape (1, 25)
x = [0.1, 0.1, ... 0.9]

--- Layer 1 ---
W1: shape (25, 8) (25 inputs × 8 hidden)
    [-0.33, +0.30, ...]
    [+0.38, -0.26, ...]  ...
b1: shape (1, 8)
    [+0.00, +0.00, ...]

z1 = x @ W1 + b1,  shape: (1, 8)
z1 = [-0.27, -1.06, +0.58, ...]

h = ReLU(z1)  -- negative values become 0
h  = [0.00, 0.00, 0.58, ...]

--- Layer 2 ---
W2: shape (8, 2) (8 hidden × 2 outputs)
    [+0.21, -0.18]
    [+0.66, +0.21]  ...
b2: shape (1, 2)
    [+0.00, +0.00]

z2 = h @ W2 + b2,  shape: (1, 2)
z2 = [-0.07, -0.41]  (Edge, No-Edge)

--- Softmax ---
p = softmax(z2)
p = [0.583, 0.417]

→ Prediction: Edge (58.3% confident)
  (Random weights - prediction may be wrong!)
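The trace above can be reproduced with a few lines of NumPy. A minimal sketch of the forward pass (the weights here are freshly drawn at random, so the exact numbers will differ from the trace):

```python
import numpy as np

rng = np.random.default_rng(0)

# The 5x5 edge patch from the trace above: dark on the left, bright on the right
img = np.tile([0.1, 0.1, 0.5, 0.9, 0.9], (5, 1))
x = img.reshape(1, 25)                    # flatten to shape (1, 25)

# Random weights, zero biases (as in the trace)
W1 = rng.normal(0, 0.3, (25, 8)); b1 = np.zeros((1, 8))
W2 = rng.normal(0, 0.3, (8, 2));  b2 = np.zeros((1, 2))

z1 = x @ W1 + b1                  # hidden pre-activations, shape (1, 8)
h = np.maximum(0, z1)             # ReLU: negatives become 0
z2 = h @ W2 + b2                  # logits [Edge, No-Edge], shape (1, 2)
exp_z = np.exp(z2 - np.max(z2))   # numerically stable softmax
p = exp_z / np.sum(exp_z)

print("probabilities:", p)        # random weights, so roughly 50/50 and possibly wrong
```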

Part 2: How a Neuron Detects Patterns

A neuron performs a simple two-step calculation:

Step 1: Weighted Sum

z = \sum_{i} x_i \cdot w_i + b

Each input x_i (a scalar — one pixel value) is multiplied by its weight w_i (also a scalar), all products are summed, and a bias b (scalar) is added. The result z is a single number.

For one neuron: z = \mathbf{x} \cdot \mathbf{w} + b, where \mathbf{x} is the input vector (25 pixels) and \mathbf{w} is that neuron's weight vector (25 weights).

For all neurons at once: \mathbf{z}_1 = \mathbf{x} \cdot W_1 + \mathbf{b}_1, where W_1 is a matrix (25×8) — each column is one neuron's weights. This computes all 8 hidden neurons in one operation!

Step 2: Activation (ReLU)

h = \text{ReLU}(z) = \max(0, z)

ReLU (Rectified Linear Unit) is the simplest activation: it passes positive values through unchanged and clamps negative values to zero.

The activation function adds non-linearity. Without it, stacking layers would be pointless — the whole network would just be one big linear equation!

Why non-linearity matters:

The problem with pure linear layers: A linear function always does the same thing to every input: multiply and add. If you chain two linear functions, you still just get multiply and add: \mathbf{z}_2 = (\mathbf{x} W_1) W_2 = \mathbf{x} (W_1 W_2) — it collapses to a single operation. You can only draw straight lines.

How ReLU fixes this: ReLU makes a decision based on the input:

if z > 0: pass it through

if z ≤ 0: block it (output 0)

This “if/else” behavior is the key! A purely linear function has no “if” — it blindly does the same multiplication regardless of input. ReLU says “it depends on the value.” Different inputs get treated differently.

With many neurons, each making its own if/else decision, the network can carve up the input space into regions, handling each region differently. That’s how it learns complex, non-linear patterns.
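The collapse of stacked linear layers, and how ReLU prevents it, can be checked with a tiny made-up example (the 2-input network below is purely for illustration):

```python
import numpy as np

x = np.array([[1.0, -2.0]])               # toy input
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0]])               # first "layer"
W2 = np.array([[1.0],
               [1.0]])                    # second "layer"

# Two linear layers collapse into one linear map: same result either way
two_linear = (x @ W1) @ W2                # [[-1.0]]
one_linear = x @ (W1 @ W2)                # [[-1.0]]
assert np.allclose(two_linear, one_linear)

# With ReLU in between, the -2 is blocked and the result changes
with_relu = np.maximum(0, x @ W1) @ W2    # [[1.0]]
assert not np.allclose(with_relu, one_linear)
```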

Implementation:

z1 = x @ W1 + b1   # matrix multiply: (1,25) @ (25,8) = (1,8)
h = np.maximum(0, z1)  # ReLU applied element-wise

The Key Insight

The weights define what pattern the neuron responds to.

The output h is essentially a score for how well the input matches the weight pattern. High score = good match. Zero = no match (or negative match blocked by ReLU).

The chart below shows ReLU in action:

[Figure: the ReLU activation function]

Now Let’s See Pattern Detection in Action

Below we show how the weights create a “template” that the neuron matches against:

[Figure: weight template matched against an input patch]

Why This Works

The weights encode a template — what pattern the neuron is looking for:

Weight                | Meaning
----------------------|---------------------------
Negative (left side)  | "I want dark pixels here"
Positive (right side) | "I want bright pixels here"

How the weighted sum responds:

Input        | Weight        | Product | Effect
-------------|---------------|---------|--------------------------------
Dark (0.1)   | Negative (−1) | −0.1    | ~ Weak penalty
Bright (0.9) | Positive (+1) | +0.9    | ✓ Strong positive contribution
Bright (0.9) | Negative (−1) | −0.9    | ✗ Strong negative contribution
Dark (0.1)   | Positive (+1) | +0.1    | ~ Weak contribution

When input matches template: The strong positive products dominate the weak penalties → large sum → strong output

When input doesn’t match: Products cancel or go negative → ReLU blocks it → zero output
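The table's logic can be checked directly with a hypothetical template whose weights are −1 on the left two columns and +1 on the right two (these exact values are illustrative, not learned):

```python
import numpy as np

# Hypothetical edge template: -1 = "want dark", +1 = "want bright"
w = np.zeros((5, 5))
w[:, :2] = -1.0
w[:, 3:] = +1.0

edge = np.tile([0.1, 0.1, 0.5, 0.9, 0.9], (5, 1))  # dark -> bright: an edge
flat = np.full((5, 5), 0.5)                        # uniform: no edge

for name, patch in [("edge", edge), ("flat", flat)]:
    z = np.sum(patch * w)          # weighted sum (bias omitted)
    h = max(0.0, z)                # ReLU
    print(f"{name}: z = {z:+.2f}, ReLU output = {h:.2f}")
```

The edge patch scores z = +8.0 (bright pixels land on positive weights), while the uniform patch's products cancel to z = 0.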

Testing on Different Inputs

How does our edge detector respond to different patterns? The chart below tests 5 inputs:

[Figure: edge detector responses to 5 different test inputs]

Part 3: From Scores to Probabilities - Softmax

We saw that neurons produce scores — higher means better pattern match. But for classification, we need probabilities: “What’s the chance this is an edge?”

Our network has two output neurons (one for “Edge”, one for “No Edge”), each producing a score. So we have a vector of scores \mathbf{z} = [z_{edge}, z_{no\_edge}]. These raw scores can be any number — positive, negative, or zero.

📘 Terminology: Logits

The raw scores before softmax are called logits. You’ll see this term everywhere in machine learning.

  • Logits can be any real number (−∞ to +∞)

  • Softmax converts logits → probabilities (0 to 1, summing to 1)

Softmax converts this score vector into a probability vector where:

  1. All values are positive

  2. They sum to 1.0

The Intuition: It’s Just a Ratio

At its core, softmax answers: “What fraction of the total is each score?”

If we only had positive scores, we could just divide by the sum:

p_j = \frac{z_j}{\sum_k z_k}

Here p_j is the resulting probability for class j, z_j is that class's score, and the denominator sums the scores of all classes k.

For example, scores [3, 1] would give p_0 = 3/(3+1) = 0.75 and p_1 = 1/(3+1) = 0.25.

The problem: Scores can be negative or zero, which breaks this.

The solution: First apply e^z to make everything positive, then take the ratio.
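You can see both the problem and the fix in two lines (the scores here are made up):

```python
import numpy as np

z = np.array([2.0, -1.0])            # logits can be negative

naive = z / np.sum(z)                # plain ratio: [2.0, -1.0] -- a "probability" below 0!
p = np.exp(z) / np.sum(np.exp(z))    # exponentiate first: valid probabilities

print("naive ratio:", naive)
print("softmax:    ", p)             # all positive, sums to 1
```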

The Formula

p_j = \frac{e^{z_j}}{\sum_{k} e^{z_k}}

The exponential also amplifies differences — a score of 5 vs 3 becomes e^5 vs e^3 (148 vs 20), making the network more confident.

The visualization below shows the 3-step process:

[Figure: softmax as a 3-step process]

Implementation:

# z is a vector of scores (logits) for all classes
z = np.array([2.5, 0.5])            # e.g., [Edge score, No-Edge score]

exp_z = np.exp(z)                   # exponentiate each element
p = exp_z / np.sum(exp_z)           # normalize → probability vector

print(f"Logits z: {z}")             # [2.5, 0.5]
print(f"exp(z):   {exp_z}")         # [12.18, 1.65]
print(f"Probs p:  {p}")             # [0.88, 0.12]

This computes all p_j values at once: p[j] = \frac{e^{z_j}}{\sum_k e^{z_k}}

Note: In practice, we subtract the max for numerical stability. This prevents overflow with large scores but gives the same result:

exp_z = np.exp(z - np.max(z))        # stable exponentiation
p = exp_z / np.sum(exp_z)            # normalize

Part 4: Measuring Wrongness - The Loss Function

Now we can measure how wrong the network is. We need a single number (scalar) that tells us “how bad was this prediction?”

The Setup

The network outputs a probability vector \mathbf{p} = [P_{edge}, P_{no\_edge}].

                 | P(Edge) | P(No Edge) | True Label | p_correct
-----------------|---------|------------|------------|----------
Good prediction  | 0.95    | 0.05       | Edge       | 0.95
Uncertain        | 0.5     | 0.5        | Edge       | 0.5
Wrong prediction | 0.1     | 0.9        | Edge       | 0.1

p_{correct} is a scalar — whichever probability corresponds to the true class.

The Formula: Cross-Entropy Loss

L = -\log(p_{correct})

This is called cross-entropy loss. The full formula is L = -\sum_i y_i \log(p_i).

Think of it as measuring surprise. If you predict 99% chance of rain and it rains, you’re not surprised (low loss). If you predict 1% chance of rain and it rains, you’re very surprised (high loss)! Cross-entropy measures this “surprise” when reality (\mathbf{y}) differs from your prediction (\mathbf{p}).

Since \mathbf{y} is one-hot (all zeros except one 1), the sum simplifies to just -\log(p_{correct}). We explain this simplification in the Summation Trick callout below.

Why This Works

The logarithm severely punishes confident wrong answers: -\log(0.99) \approx 0.01, but -\log(0.5) \approx 0.69 and -\log(0.01) \approx 4.6.

A network that says “I’m 99% sure it’s NOT an edge” when it IS an edge should be penalized heavily. The chart below shows this curve:

[Figure: the cross-entropy loss curve, −log(p)]

📘 The Summation Trick

You might see the loss formula written as: L = -\sum_{j} y_j \cdot \log(p_j)

This sums over all classes. But since \mathbf{y} is one-hot encoded (1 for the correct class, 0 for all others), only one term survives!

For example, if “Edge” is correct: \mathbf{y} = [1, 0], \mathbf{p} = [0.88, 0.12]

L = -(1 \cdot \log(0.88) + 0 \cdot \log(0.12)) = -\log(0.88) = 0.128

The zeros kill all terms except the correct class — giving us -\log(p_{correct}).

Implementation:

# p is the probability vector from softmax
# y is the one-hot encoded true label
p = np.array([0.88, 0.12])    # predicted: [P(Edge), P(No Edge)]
y = np.array([1, 0])          # true label: Edge (one-hot)

loss = -np.sum(y * np.log(p)) # cross-entropy

print(f"Predictions p: {p}")           # [0.88, 0.12]
print(f"True label y:  {y}")           # [1, 0]
print(f"Loss:          {loss:.3f}")    # 0.128

The loss is small (0.128) because the network correctly predicted “Edge” with high confidence (88%).


Part 5: Learning - How Do We Find the Right Weights?

So far we’ve assumed the weights are “correct” (negative on left, positive on right). But how does the network learn these weights from scratch?

The answer is gradient descent:

  1. Start with random weights

  2. Measure the loss (how wrong are we?)

  3. Compute the gradient (which direction makes loss smaller?)

  4. Update weights in that direction

  5. Repeat!

We’ve already covered step 2 (the loss function in Part 4). This part explains what gradients are and how we use them (steps 3-4). Part 6 shows how to compute gradients through composed functions (chain rule), and Part 7 puts it all together with backpropagation.

The Gradient

The gradient tells us: if I increase this weight slightly, does the loss go up or down?

\frac{\partial L}{\partial w} = \text{"how much does loss change when I change } w \text{?"}

Worked Example:

Say we have a weight w = 0.5, and we compute (using methods we explain in Part 6) that the gradient is:

\frac{\partial L}{\partial w} = -0.32

The gradient is negative, meaning: if we increase w, the loss will decrease. That’s what we want! So we should increase w.

Update: With learning rate \alpha = 0.1:

w_{new} = 0.5 - 0.1 \times (-0.32) = 0.532

We increased w from 0.5 to 0.532. The network is learning!

The Update Rule

w_{\text{new}} = w_{\text{old}} - \alpha \cdot \frac{\partial L}{\partial w}

We move opposite to the gradient (downhill), scaled by the learning rate \alpha.

📘 Learning Rate (\alpha)

The learning rate controls how big a step we take with each update:

  • Too large (\alpha = 1.0): Steps are too big, we overshoot and bounce around, may never converge

  • Too small (\alpha = 0.0001): Steps are tiny, learning is very slow

  • Just right: Converges smoothly to a good solution

Think of it like descending a mountain in fog. Large steps might make you miss the valley and end up climbing the other side. Small steps are safe but slow.

Practical advice: Start with \alpha = 0.01 or 0.001 and experiment. If loss explodes or oscillates wildly, reduce it. If loss decreases too slowly, increase it. There’s no universal best value — it depends on your network and data.
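A quick way to feel this is gradient descent on the simple loss L(w) = w^2, whose gradient is 2w (the starting point and step count below are arbitrary):

```python
def descend(lr, w=5.0, steps=20):
    """Run `steps` gradient-descent updates on L(w) = w**2."""
    for _ in range(steps):
        w -= lr * 2 * w          # dL/dw = 2w
    return w

print(descend(0.1))      # converges toward the minimum at 0
print(descend(1.0))      # w flips sign every step: oscillates, never settles
print(descend(0.0001))   # barely moves from 5.0
```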


Implementation:

# Example: one weight update step
w = 0.5                    # current weight
dL_dw = -0.32              # gradient (computed via chain rule)
learning_rate = 0.1

w_new = w - learning_rate * dL_dw
print(f"Old weight:    {w}")           # 0.5
print(f"Gradient:      {dL_dw}")       # -0.32
print(f"Update step:   {-learning_rate * dL_dw}")  # 0.032
print(f"New weight:    {w_new}")       # 0.532

For a full layer with weight matrix W and bias vector \mathbf{b}:

W = W - learning_rate * dL_dW  # update all weights at once
b = b - learning_rate * dL_db  # update all biases at once

The gradient dL_dW has the same shape as W — each element tells us how to adjust the corresponding weight.

[Figure: gradient descent down the loss curve]

Part 6: The Chain Rule

Before we learn how networks update their weights, we need one key idea from calculus: the chain rule.

The Problem

We want to know: how does the loss (a scalar) change when we tweak one weight (also a scalar)? But the weight doesn’t directly affect the loss — it goes through several steps:

w \xrightarrow{\times x} z \xrightarrow{\text{softmax}} p \xrightarrow{-\log} L

How do we compute \frac{\partial L}{\partial w} when w affects L indirectly?

The Chain Rule

If y = f(g(x)) — that is, y depends on g, and g depends on x — then:

\frac{\partial y}{\partial x} = \frac{\partial y}{\partial g} \cdot \frac{\partial g}{\partial x}

Multiply the derivatives along the path. All these are scalars.

Intuition: If g increases by 2 when x increases by 1, and y increases by 3 when g increases by 1, then y increases by 2 \times 3 = 6 when x increases by 1.

A Concrete Example

In neural networks, a single weight contributes to the score: z = w \cdot x (scalars here).

This means \frac{\partial z}{\partial w} = x — the gradient of a weight equals its input!

Let’s verify with L = z^2 (simplified loss), w = 3, and x = 2:

Applying the chain rule:

\frac{\partial L}{\partial w} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial w} = 2z \cdot x = 12 \times 2 = 24

Verification: If w goes from 3 to 3.01, then z goes from 6 to 6.02, and L goes from 36 to 36.2404. The increase of about 0.24 matches 24 × 0.01, just as the gradient predicts.

This pattern — weight gradient = input × gradient from above — is the core of backpropagation.
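We can also confirm the 24 numerically with a central finite difference:

```python
x = 2.0
w = 3.0

def L(w_):
    z = w_ * x          # forward: z = w * x
    return z ** 2       # simplified loss L = z^2

analytic = 2 * (w * x) * x                       # chain rule: 2z * x = 24
eps = 1e-6
numeric = (L(w + eps) - L(w - eps)) / (2 * eps)  # central difference

print(analytic, round(numeric, 4))               # 24.0 24.0
```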


Part 7: Backpropagation - Computing Gradients Efficiently

In Part 5, we saw the update rule requires \frac{\partial L}{\partial w} for each weight. But our network has many weights — how do we compute all these gradients efficiently?

From Scalars to Matrices

So far we’ve talked about single weights. In practice:

  • The input \mathbf{x} is a vector (25 pixels)

  • Each layer’s weights form a matrix W (e.g., 25×8)

  • Each layer’s output \mathbf{z} is a vector

The forward pass becomes: \mathbf{z} = \mathbf{x} W + \mathbf{b} (matrix multiplication)

Backpropagation

Backpropagation applies the chain rule efficiently using matrix operations. We compute gradients layer by layer, starting from the loss and working backwards.

For a layer with input x\mathbf{x} (vector), weights WW (matrix), and output z\mathbf{z} (vector):

\frac{\partial L}{\partial W} = \mathbf{x}^T \cdot \frac{\partial L}{\partial \mathbf{z}}

The gradient for the entire weight matrix is computed in one operation!

[Figure: backpropagation through the network, layer by layer]

The Key Gradients

For our two-layer edge detector (25 → 8 → 2), we compute gradients layer by layer, working backwards from the loss. Each formula below is derived in the Appendix.

Step 1: Output layer gradient

\frac{\partial L}{\partial \mathbf{z}_2} = \mathbf{p} - \mathbf{y}

This elegant result combines softmax and cross-entropy derivatives (Appendix A.1). If prediction is \mathbf{p} = [0.9, 0.1] and truth is \mathbf{y} = [1, 0], the gradient is \mathbf{p} - \mathbf{y} = [-0.1, +0.1]: the update pushes the “Edge” score up and the “No-Edge” score down.

Step 2: Gradients for W2W_2 and b2\mathbf{b}_2

\frac{\partial L}{\partial W_2} = \mathbf{h}^T \cdot \frac{\partial L}{\partial \mathbf{z}_2}

\frac{\partial L}{\partial \mathbf{b}_2} = \frac{\partial L}{\partial \mathbf{z}_2}

(Appendix A.2 and A.3)

Step 3: Backpropagate to hidden layer

\frac{\partial L}{\partial \mathbf{h}} = \frac{\partial L}{\partial \mathbf{z}_2} \cdot W_2^T

(Appendix A.4)

Step 4: Through ReLU

\frac{\partial L}{\partial \mathbf{z}_1} = \frac{\partial L}{\partial \mathbf{h}} \odot \mathbf{1}_{\mathbf{z}_1 > 0}

The \odot means element-wise multiply. ReLU passes gradients through where z_1 > 0, blocks them where z_1 \leq 0. (Appendix A.5)

Step 5: Gradients for W1W_1 and b1\mathbf{b}_1

\frac{\partial L}{\partial W_1} = \mathbf{x}^T \cdot \frac{\partial L}{\partial \mathbf{z}_1}

\frac{\partial L}{\partial \mathbf{b}_1} = \frac{\partial L}{\partial \mathbf{z}_1}

(Same pattern as Step 2)

Each gradient has the same shape as its parameter: \frac{\partial L}{\partial W_1} is 25×8, \frac{\partial L}{\partial W_2} is 8×2, etc.

How Gradients Update Weights

Once we have all gradients, we update every parameter:

W_1 = W_1 - \alpha \cdot \frac{\partial L}{\partial W_1}, \quad \mathbf{b}_1 = \mathbf{b}_1 - \alpha \cdot \frac{\partial L}{\partial \mathbf{b}_1}
W_2 = W_2 - \alpha \cdot \frac{\partial L}{\partial W_2}, \quad \mathbf{b}_2 = \mathbf{b}_2 - \alpha \cdot \frac{\partial L}{\partial \mathbf{b}_2}

Implementation: (backward pass — all matrix/vector operations)

# Shapes for our network: x(1,25), W1(25,8), h1(1,8), W2(8,2), p(1,2)
dL_dz2 = p - y              # (1,2) output gradient
dL_dW2 = h1.T @ dL_dz2      # (8,1) @ (1,2) = (8,2) weight gradient
dL_db2 = dL_dz2             # (1,2) bias gradient
dL_dh1 = dL_dz2 @ W2.T      # (1,2) @ (2,8) = (1,8) backprop to hidden
dL_dz1 = dL_dh1 * (z1 > 0)  # (1,8) element-wise, ReLU gradient
dL_dW1 = x.T @ dL_dz1       # (25,1) @ (1,8) = (25,8) weight gradient
dL_db1 = dL_dz1             # (1,8) bias gradient
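A standard sanity check is to compare backprop gradients against finite differences. A sketch with random data (the `forward` helper and the choice of entry `(3, 1)` are just for this check):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((1, 25))
y = np.array([[1.0, 0.0]])
W1 = rng.normal(0, 0.1, (25, 8)); b1 = np.zeros((1, 8))
W2 = rng.normal(0, 0.1, (8, 2));  b2 = np.zeros((1, 2))

def forward(W1_):
    z1 = x @ W1_ + b1
    h1 = np.maximum(0, z1)
    z2 = h1 @ W2 + b2
    e = np.exp(z2 - np.max(z2))
    p = e / np.sum(e)
    return -np.sum(y * np.log(p)), z1, h1, p

loss, z1, h1, p = forward(W1)

# Backprop, exactly as in the snippet above
dL_dz2 = p - y
dL_dh1 = dL_dz2 @ W2.T
dL_dz1 = dL_dh1 * (z1 > 0)
dL_dW1 = x.T @ dL_dz1

# Finite-difference check on one arbitrary entry of W1
i, j, eps = 3, 1, 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[i, j] += eps
Wm[i, j] -= eps
numeric = (forward(Wp)[0] - forward(Wm)[0]) / (2 * eps)

print(f"backprop: {dL_dW1[i, j]:+.8f}  numeric: {numeric:+.8f}")
```

If the two numbers disagree, there is a bug somewhere in the backward pass.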

Part 8: Watching It Learn

Let’s put it all together! We’ll train a tiny network on an edge detection task and watch the loss decrease.

The Training Loop Structure:

for epoch in range(num_epochs):
    for x, y in training_data:

        # Forward pass (through both layers)
        z1 = x @ W1 + b1              # hidden layer
        h1 = relu(z1)                 # ReLU activation
        z2 = h1 @ W2 + b2             # output layer
        p = softmax(z2)               # probabilities
        loss = -np.sum(y * np.log(p)) # cross-entropy

        # Backward pass (chain rule, layer by layer)
        dL_dz2 = p - y                # output gradient
        dL_dW2 = h1.T @ dL_dz2        # output weights
        dL_db2 = dL_dz2               # output biases
        dL_dh1 = dL_dz2 @ W2.T        # backprop to hidden
        dL_dz1 = dL_dh1 * (z1 > 0)    # through ReLU
        dL_dW1 = x.T @ dL_dz1         # hidden weights
        dL_db1 = dL_dz1               # hidden biases

        # Update all weights and biases
        W1 -= lr * dL_dW1
        b1 -= lr * dL_db1
        W2 -= lr * dL_dW2
        b2 -= lr * dL_db2

Each piece we’ve learned slots into this loop. Let’s see it in action:

[Figure: training curves (loss and accuracy) and learned W1 weights]
Final accuracy: 100.0%

Network architecture: 25 -> 8 (ReLU) -> 2 (softmax)
Total parameters: W1(25x8) + b1(8) + W2(8x2) + b2(2) = 226

What Happened?

  1. Loss decreased — the network got better at predicting

  2. Accuracy increased — from ~50% (random guessing) to nearly 100%

  3. W1 learned edge patterns — the first hidden neuron’s weights (W1[:, 0] reshaped to 5×5) show the classic edge detector: negative on left, positive on right!

The network discovered the edge detection pattern automatically through gradient descent. We never told it what an edge looks like — it figured it out from examples.

Each of the 8 hidden neurons learns a different pattern. The output layer (W2) then combines these patterns to make the final Edge/No-Edge decision.


Try It Yourself: Complete Edge Detector in ~80 Lines

Here’s a self-contained example with everything we’ve learned: a two-layer network with ReLU activation. It includes data generation, forward pass, backward pass (with backpropagation through both layers), and training.

Architecture: 25 inputs → 8 hidden neurons (ReLU) → 2 outputs (softmax)
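The code itself did not survive the page export; below is a compact reconstruction consistent with the architecture and training loop described in this tutorial. The data generator, seed, learning rate, and epoch count are assumptions, so the printed losses will not match the output below exactly:

```python
import numpy as np

rng = np.random.default_rng(42)

def make_patch(is_edge):
    """5x5 patch: vertical dark->bright edge, or a uniform (no-edge) patch."""
    if is_edge:
        split = rng.integers(1, 5)                 # edge column position
        patch = np.full((5, 5), 0.1)
        patch[:, split:] = 0.9
    else:
        patch = np.full((5, 5), rng.uniform(0.1, 0.9))
    return patch + rng.normal(0, 0.02, (5, 5))     # a little noise

def make_data(n):
    X, Y = [], []
    for i in range(n):
        is_edge = i % 2 == 0
        X.append(make_patch(is_edge).ravel())
        Y.append(np.array([1.0, 0.0]) if is_edge else np.array([0.0, 1.0]))
    return X, Y

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

# Architecture: 25 -> 8 (ReLU) -> 2 (softmax)
W1 = rng.normal(0, 0.1, (25, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.1, (8, 2));  b2 = np.zeros(2)
lr = 0.1

X_train, Y_train = make_data(40)
for epoch in range(100):
    total = 0.0
    for x, y in zip(X_train, Y_train):
        # Forward pass
        z1 = x @ W1 + b1; h = relu(z1)
        z2 = h @ W2 + b2; p = softmax(z2)
        total += -np.sum(y * np.log(p + 1e-12))
        # Backward pass
        dz2 = p - y
        dW2 = np.outer(h, dz2); db2 = dz2
        dz1 = (dz2 @ W2.T) * (z1 > 0)
        dW1 = np.outer(x, dz1); db1 = dz1
        # Update
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    if epoch % 20 == 0:
        print(f"Epoch {epoch:3d}: Loss = {total / len(X_train):.4f}")

# Evaluate on fresh patches
X_test, Y_test = make_data(20)
correct = sum(
    int(np.argmax(softmax(relu(x @ W1 + b1) @ W2 + b2)) == np.argmax(y))
    for x, y in zip(X_test, Y_test)
)
print(f"Test Accuracy: {correct}/20 = {100 * correct / 20:.1f}%")
```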

Training 2-layer network: 25 -> 8 (ReLU) -> 2
  Epoch  0: Loss = 0.1423
  Epoch 20: Loss = 0.0005
  Epoch 40: Loss = 0.0002
  Epoch 60: Loss = 0.0001
  Epoch 80: Loss = 0.0001

Test Accuracy: 20/20 = 100.0%

Hidden layer: 8 neurons, each looking for a different pattern
Output layer: combines hidden features to classify Edge vs No-Edge

Total parameters: 226

Summary

We’ve built up a complete picture of how neural networks learn:

The Forward Pass:

  1. Neurons compute weighted sums + ReLU activation (Part 2)

  2. Softmax converts scores to probabilities (Part 3)

  3. Cross-entropy loss measures how wrong we are (Part 4)

The Backward Pass:

  4. Gradient descent updates weights opposite to the gradient (Part 5)

  5. Chain rule lets us compute derivatives through composed functions (Part 6)

  6. Backpropagation efficiently computes all gradients layer by layer (Part 7)

Key Formulas:

  • Weighted sum + activation: z = \sum_i x_i w_i + b, \quad h = \max(0, z)

  • Softmax: p_j = e^{z_j} / \sum_k e^{z_k}

  • Cross-entropy loss: L = -\log(p_{correct})

  • Update rule: w_{new} = w_{old} - \alpha \cdot \partial L / \partial w

  • Output gradient: \partial L / \partial \mathbf{z}_2 = \mathbf{p} - \mathbf{y}

The Magic: By repeating forward-pass, loss, backward-pass, update thousands of times, the network discovers the right weights automatically!


Next Up: Building a Flexible Neural Network — Generalize the code from this tutorial to handle any number of layers, with a clean Layer class abstraction. Same principles, but now you can build networks of any depth!




Appendix: Deriving the Backpropagation Gradients

This appendix derives each gradient formula used in Part 7. All derivations use the chain rule from Part 6.

A.1: Softmax + Cross-Entropy Gradient

Goal: Show that \frac{\partial L}{\partial z_j} = p_j - y_j

Setup:

  • \mathbf{z} — the logits (output-layer scores)

  • \mathbf{p} = \text{softmax}(\mathbf{z}) — the predicted probabilities

  • \mathbf{y} — the one-hot true label

Step 1: Derivative of loss w.r.t. softmax output

Starting from the cross-entropy loss:

L = -\sum_i y_i \log(p_i)

To find \frac{\partial L}{\partial p_i}, we differentiate. Since \frac{d}{dx}[\log(x)] = \frac{1}{x}:

\frac{\partial L}{\partial p_i} = -\frac{y_i}{p_i}

Step 2: Derivative of softmax w.r.t. logits

We need \frac{\partial p_i}{\partial z_j} — how changing logit z_j affects output p_i.

Key insight: Why two indices, i and j?

Unlike simpler functions, softmax couples all outputs — every p_i depends on all logits via the denominator \sum_k e^{z_k}. So changing any z_j affects every p_i, not just p_j. We use i and j as two different indices into the same set of classes.

Using the quotient rule \frac{d}{dx}\left[\frac{u}{v}\right] = \frac{v \cdot u' - u \cdot v'}{v^2} with u = e^{z_i} and v = \sum_k e^{z_k}:

Case 1: When i = j (differentiating p_i w.r.t. its own logit z_i)

\frac{\partial p_i}{\partial z_i} = \frac{e^{z_i} \cdot \sum_k e^{z_k} - e^{z_i} \cdot e^{z_i}}{(\sum_k e^{z_k})^2} = \frac{e^{z_i}(\sum_k e^{z_k} - e^{z_i})}{(\sum_k e^{z_k})^2}
= \frac{e^{z_i}}{\sum_k e^{z_k}} \cdot \frac{\sum_k e^{z_k} - e^{z_i}}{\sum_k e^{z_k}} = p_i(1 - p_i)

Case 2: When i \neq j (differentiating p_i w.r.t. a different logit z_j)

\frac{\partial p_i}{\partial z_j} = \frac{0 - e^{z_i} \cdot e^{z_j}}{(\sum_k e^{z_k})^2} = -\frac{e^{z_i}}{\sum_k e^{z_k}} \cdot \frac{e^{z_j}}{\sum_k e^{z_k}} = -p_i p_j

Summary: \frac{\partial p_i}{\partial z_j} = p_i(\delta_{ij} - p_j) where \delta_{ij} is 1 if i = j, else 0.

Step 3: Chain rule

To find \frac{\partial L}{\partial z_j}, we sum over all classes i:

\frac{\partial L}{\partial z_j} = \sum_i \frac{\partial L}{\partial p_i} \cdot \frac{\partial p_i}{\partial z_j} = \sum_i \left(-\frac{y_i}{p_i}\right) \cdot p_i(\delta_{ij} - p_j)
= \sum_i -y_i(\delta_{ij} - p_j) = \sum_i (-y_i \delta_{ij} + y_i p_j)
= -y_j + p_j \sum_i y_i

Since \mathbf{y} is one-hot, \sum_i y_i = 1:

\frac{\partial L}{\partial z_j} = -y_j + p_j = p_j - y_j \quad \checkmark

In vector form: \frac{\partial L}{\partial \mathbf{z}_2} = \mathbf{p} - \mathbf{y}
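The result is easy to verify numerically (the logits and label below are arbitrary):

```python
import numpy as np

y = np.array([1.0, 0.0])                 # one-hot: class 0 is correct
z = np.array([2.0, -1.0])                # arbitrary logits

def loss(z_):
    e = np.exp(z_ - np.max(z_))
    p = e / np.sum(e)
    return -np.sum(y * np.log(p))

e = np.exp(z - np.max(z))
p = e / np.sum(e)
analytic = p - y                          # the derived gradient

# Central finite difference on each logit
eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(2)[j]) - loss(z - eps * np.eye(2)[j])) / (2 * eps)
    for j in range(2)
])

print("analytic:", analytic)
print("numeric: ", numeric)
```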


A.2: Weight Gradient

Goal: Show that \frac{\partial L}{\partial W_2} = \mathbf{h}^T \cdot \frac{\partial L}{\partial \mathbf{z}_2}

Setup: The output layer computes \mathbf{z}_2 = \mathbf{h} \cdot W_2 + \mathbf{b}_2, where \mathbf{h} is the hidden layer output (1×8) and W_2 is the weight matrix (8×2).

Derivation:

For a single weight w_{ij} connecting hidden neuron i to output j:

z_{2j} = \sum_i h_i \cdot w_{ij} + b_j

By chain rule:

\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial z_{2j}} \cdot \frac{\partial z_{2j}}{\partial w_{ij}} = \frac{\partial L}{\partial z_{2j}} \cdot h_i

In matrix form, this gives us:

\frac{\partial L}{\partial W_2} = \mathbf{h}^T \cdot \frac{\partial L}{\partial \mathbf{z}_2} \quad \checkmark

The gradient has the same shape as W_2 (8×2).


A.3: Bias Gradient

Goal: Show that \frac{\partial L}{\partial \mathbf{b}_2} = \frac{\partial L}{\partial \mathbf{z}_2}

Derivation:

Since z_{2j} = \sum_i h_i \cdot w_{ij} + b_j:

\frac{\partial z_{2j}}{\partial b_j} = 1

By chain rule:

\frac{\partial L}{\partial b_j} = \frac{\partial L}{\partial z_{2j}} \cdot \frac{\partial z_{2j}}{\partial b_j} = \frac{\partial L}{\partial z_{2j}} \cdot 1

Therefore:

\frac{\partial L}{\partial \mathbf{b}_2} = \frac{\partial L}{\partial \mathbf{z}_2} \quad \checkmark

A.4: Backpropagating to Hidden Layer

Goal: Show that \frac{\partial L}{\partial \mathbf{h}} = \frac{\partial L}{\partial \mathbf{z}_2} \cdot W_2^T

Derivation:

Each hidden neuron h_i affects all output neurons. By chain rule, we sum over all outputs:

\frac{\partial L}{\partial h_i} = \sum_j \frac{\partial L}{\partial z_{2j}} \cdot \frac{\partial z_{2j}}{\partial h_i}

Since z_{2j} = \sum_i h_i \cdot w_{ij} + b_j, we have \frac{\partial z_{2j}}{\partial h_i} = w_{ij}:

\frac{\partial L}{\partial h_i} = \sum_j \frac{\partial L}{\partial z_{2j}} \cdot w_{ij}

In matrix form:

\frac{\partial L}{\partial \mathbf{h}} = \frac{\partial L}{\partial \mathbf{z}_2} \cdot W_2^T \quad \checkmark

This is the key insight of backpropagation: gradients flow backward through the same weights used in the forward pass, but transposed.


A.5: Gradient Through ReLU

Goal: Show that \frac{\partial L}{\partial \mathbf{z}_1} = \frac{\partial L}{\partial \mathbf{h}} \odot \mathbf{1}_{\mathbf{z}_1 > 0}

Setup: The hidden layer applies ReLU: h_i = \max(0, z_{1i})

Derivation:

The derivative of ReLU is:

\frac{\partial h_i}{\partial z_{1i}} = \begin{cases} 1 & \text{if } z_{1i} > 0 \\ 0 & \text{if } z_{1i} \leq 0 \end{cases}

By chain rule:

\frac{\partial L}{\partial z_{1i}} = \frac{\partial L}{\partial h_i} \cdot \frac{\partial h_i}{\partial z_{1i}}

In vector form, using \odot for element-wise multiplication:

\frac{\partial L}{\partial \mathbf{z}_1} = \frac{\partial L}{\partial \mathbf{h}} \odot \mathbf{1}_{\mathbf{z}_1 > 0} \quad \checkmark

ReLU acts as a “gate”: gradients flow through where z_1 > 0, and are blocked where z_1 \leq 0.