Chapter 2

Neurons That Learn

In Chapter 1, you set the weight and bias by hand. That was fun, but imagine doing that for a network with billions of weights. Impossible! In this chapter, we'll discover how neurons figure out the right weights on their own: by learning from examples.

🎸 What Does "Training" Mean?

Have you ever tuned a guitar? You pluck a string, listen to the sound, and then turn the tuning peg a tiny bit. Pluck again — closer? Turn a little more. You keep repeating this cycle until the note sounds right. You never calculate the exact position of the peg with a formula — you just listen and adjust.

Training a neural network works the same way. We show the network an example (like an input-output pair), let it make a prediction, check how wrong it was, and then nudge the weights slightly to make it less wrong. Then we do it again. And again. Thousands of times. Each time, the network gets a tiny bit better.

But to do this, we need a way to measure "how wrong" the network is. That measurement has a name: loss.

📊 What is Loss?

Loss is a single number that tells you how wrong the network's predictions are. Think of it like a score in golf: lower is better. A loss of 0 means the network is perfect — every prediction exactly matches the correct answer. A high loss means the network is way off.

For example, if the correct answer is 1.0 and the network predicts 0.3, the loss captures that gap. The specific formula we'll use is called Mean Squared Error (MSE). It takes the difference between the prediction and the correct answer, squares it (so errors in either direction count as positive and can't cancel each other out), and averages the result across all examples. But you don't need to memorize that. Just remember: loss = how wrong we are. Lower is better.
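If you'd like to see MSE as code, here is a minimal sketch in plain Python. The function name mean_squared_error and the four-example batch are made up for illustration; only the 1.0 versus 0.3 example comes from the text above.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between predictions and correct answers."""
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)

# The example from the text: correct answer 1.0, prediction 0.3.
# The gap is -0.7, and squaring it gives roughly 0.49.
single_loss = mean_squared_error([0.3], [1.0])

# A hypothetical batch of four predictions and their correct answers.
batch_loss = mean_squared_error([0.3, 0.9, 0.1, 0.6], [1.0, 1.0, 0.0, 1.0])

print("one example:", single_loss)
print("four examples:", batch_loss)
```

Notice that the one bad guess (0.3) contributes far more to the batch loss than the near misses do. Squaring makes big mistakes count extra.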

🔄 What is an Epoch?

When the network looks at every single training example once, that's called one epoch (pronounced "EP-ock"). If you have 4 training examples and the network sees all 4, that's 1 epoch. If it sees all 4 again, that's 2 epochs. Training usually takes hundreds or thousands of epochs — each pass through the data makes the network a little bit smarter.

Think of epochs like laps around a track. One lap = one epoch. The more laps you run, the more fit you get (up to a point). Similarly, the more epochs you train, the lower the loss usually goes, although the improvement eventually levels off.
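In code, epochs are usually just an outer loop around the training examples. The sketch below only counts examples to show the bookkeeping; the example names and the variable names are placeholders, and no actual learning happens here.

```python
# Placeholder training set: four hypothetical examples, matching the "4 examples" in the text.
training_examples = ["example 1", "example 2", "example 3", "example 4"]
num_epochs = 3

examples_seen = 0
for epoch in range(1, num_epochs + 1):
    for example in training_examples:
        # A real training loop would run one training step on `example` here.
        examples_seen += 1
    print(f"Epoch {epoch} done: {examples_seen} examples seen so far")
```

Three epochs over four examples means the network has looked at an example twelve times in total, even though there are only four distinct examples.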

🧩 How Learning Works: Step by Step

Let's walk through exactly what happens during one training step. There are 5 stages, and they repeat over and over:

1. Forward Pass: Feed an input into the network and let it calculate an output. This is exactly what you did in Chapter 1: input goes in, gets multiplied by weights, bias gets added, sigmoid squishes the result. The network makes its best guess.
2. Calculate the Loss: Compare the network's guess to the correct answer. How far off was it? This gives us the loss, one number summarizing the error.
3. Backpropagation: This is the clever part. The network works backwards from the loss to figure out which weights were most responsible for the error. It's like tracing a wrong answer on a test back to the specific thing you misunderstood. The term backpropagation (or "backprop" for short) literally means "propagating the error backward" through the network.
4. Update the Weights: Now that we know which weights caused the error, we nudge them in the direction that would reduce the loss. Big error → bigger nudge. Small error → tiny nudge. The size of the nudge is controlled by a setting called the learning rate (more on that below).
5. Repeat: Do it all again with the next example. After one full pass through all examples (one epoch), the network is a little smarter. After hundreds of epochs, it can be surprisingly accurate.
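The five stages can be sketched for a single sigmoid neuron in plain Python. This is one possible implementation, not the code behind this page: the function name training_step, the starting weights, and the learning rate of 0.5 are all assumptions chosen for illustration, and the loss here is the squared error on a single example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def training_step(weights, bias, inputs, target, learning_rate):
    # Stage 1, forward pass: weighted sum plus bias, squished by sigmoid.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    prediction = sigmoid(z)

    # Stage 2, loss: squared error for this one example.
    loss = (prediction - target) ** 2

    # Stage 3, backpropagation: the chain rule, worked backwards from the loss.
    dloss_dprediction = 2.0 * (prediction - target)
    dprediction_dz = prediction * (1.0 - prediction)   # slope of sigmoid
    dloss_dz = dloss_dprediction * dprediction_dz
    weight_gradients = [dloss_dz * x for x in inputs]  # each weight's share of the blame
    bias_gradient = dloss_dz

    # Stage 4, update: nudge each value against its gradient.
    new_weights = [w - learning_rate * g for w, g in zip(weights, weight_gradients)]
    new_bias = bias - learning_rate * bias_gradient
    return new_weights, new_bias, loss

# Arbitrary starting values (assumed for this sketch).
weights, bias = [0.2, -0.4], 0.1
inputs, target = [1.0, 1.0], 1.0

# Stage 5, repeat: here we reuse the same example five times to watch the loss fall.
losses = []
for _ in range(5):
    weights, bias, loss = training_step(weights, bias, inputs, target, learning_rate=0.5)
    losses.append(loss)

print([round(value, 4) for value in losses])
```

Each printed loss should be smaller than the one before it. In real training the loop would cycle through many different examples rather than repeating one.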

This cycle (forward, measure error, trace backwards, adjust, repeat) is the heartbeat of neural network training. Modern AI systems built on neural networks, including ChatGPT, were trained with this same basic loop.

📉 Gradient Descent: Rolling Downhill

Here's a beautiful way to visualize training. Imagine that we plot the loss (how wrong the network is) as a landscape of hills and valleys. The height of the landscape at any point represents the loss for a particular weight value. Our goal is to find the lowest valley — the weight that gives the smallest loss.

Gradient descent is the algorithm we use to search for that valley. Don't let the fancy name scare you: it literally just means "going downhill, step by step." Picture dropping a ball onto the landscape and letting gravity do the work. The ball rolls downhill, always moving toward lower loss. The gradient is just a fancy word for the slope of the landscape where the ball is sitting. Strictly speaking it points uphill, so the ball rolls the opposite way. That is the "descent" part.
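To make "slope" concrete, here is a small sketch that measures it numerically. The bowl-shaped loss function below is invented for this example (its lowest point is at w = 3), and the helper names slope and downhill_direction are hypothetical. Real networks compute slopes with backpropagation rather than by probing left and right like this.

```python
def loss(w):
    """Hypothetical loss landscape: a bowl whose lowest point is at w = 3."""
    return (w - 3.0) ** 2

def slope(w, eps=1e-6):
    """Estimate the gradient at w by comparing the loss just right and just left of it."""
    return (loss(w + eps) - loss(w - eps)) / (2 * eps)

def downhill_direction(w, tolerance=1e-6):
    """Return +1 (increase w), -1 (decrease w), or 0 (the ground is flat here)."""
    g = slope(w)
    if abs(g) < tolerance:
        return 0
    # A positive slope means the loss rises to the right, so downhill is to the left.
    return -1 if g > 0 else 1

for w in (0.0, 3.0, 5.0):
    print(f"w = {w}: slope = {slope(w):+.3f}, step direction = {downhill_direction(w):+d}")
```

Left of the valley the slope is negative, so the step goes right. Right of the valley the slope is positive, so the step goes left. At the bottom the slope is essentially zero and there is nowhere better to go.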

The learning rate controls how big each step is. A small learning rate means the ball takes tiny, careful steps (slow but precise). A large learning rate means big, bold leaps (fast but might overshoot the valley and bounce around). Try adjusting the learning rate slider below and see the difference!

Try this: Click "Drop the Ball" and watch it roll to the bottom. Then reset, crank the learning rate up to 0.5, and drop again — see how it behaves differently?

[Interactive demo: gradient descent ball with a "Drop the Ball" button and a learning rate slider (default 0.10)]

What you just saw is the core of how every neural network learns. The ball finding the valley is the network finding the best weights. In real networks, the landscape has millions of dimensions (one per weight) instead of just one — but the principle is identical: follow the slope downhill.
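Here is a rough sketch of the same idea in code, for readers who want to experiment outside the demo. It uses an invented one-weight landscape, loss = (w - 3)^2, so the learning rates that work well here are specific to this bowl and will not match the slider values in the demo above. The function name roll_downhill is hypothetical.

```python
def gradient(w):
    """Slope of the hypothetical loss (w - 3)^2 at position w."""
    return 2.0 * (w - 3.0)

def roll_downhill(start, learning_rate, steps=20):
    """Run gradient descent from `start` and return every position visited."""
    w = start
    path = [w]
    for _ in range(steps):
        w -= learning_rate * gradient(w)   # step against the slope
        path.append(w)
    return path

for learning_rate in (0.01, 0.3, 0.9, 1.1):
    path = roll_downhill(start=0.0, learning_rate=learning_rate)
    distance = abs(path[-1] - 3.0)
    print(f"learning rate {learning_rate}: first step lands at {path[1]:.2f}, "
          f"distance from the valley after 20 steps = {distance:.4f}")
```

On this particular bowl, 0.01 creeps along and is still far away after 20 steps, 0.3 settles in quickly, 0.9 jumps over the valley on every step but slowly closes in, and 1.1 overshoots by more each time and ends up farther away than it started.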

🎯 Activity: Train on the AND Gate

Let's put this all together and watch a neuron actually learn. We'll train it on a classic problem: the AND gate.

An AND gate is a simple rule: the output is 1 only when both inputs are 1. Here are all 4 possible input combinations:

[0, 0] → 0 (neither is 1, so output is 0)
[0, 1] → 0 (only one is 1, so output is 0)
[1, 0] → 0 (only one is 1, so output is 0)
[1, 1] → 1 (both are 1, so output is 1! ✅)

Our neuron starts with random weights — it has no idea what the AND gate is. Then we train it: show it the 4 examples over and over, and let gradient descent adjust the weights to minimize the loss. After enough epochs, the neuron should figure out the AND gate on its own!
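For the curious, here is one way the whole experiment could look in plain Python. Treat it as a sketch under assumptions rather than the code running in the activity: it uses MSE loss, updates the weights once per epoch using the average gradient over all 4 examples, and picks a starting range, seed, and epoch count arbitrarily. The names AND_EXAMPLES, predict, and train_and_gate are made up for this example.

```python
import math
import random

AND_EXAMPLES = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, inputs):
    return sigmoid(weights[0] * inputs[0] + weights[1] * inputs[1] + bias)

def train_and_gate(epochs=2000, learning_rate=2.0, seed=0):
    rng = random.Random(seed)
    # Small random starting values: the neuron knows nothing about AND yet.
    weights = [rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)]
    bias = rng.uniform(-0.5, 0.5)
    loss_history = []
    n = len(AND_EXAMPLES)

    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        total_squared_error = 0.0
        for inputs, target in AND_EXAMPLES:
            p = predict(weights, bias, inputs)           # forward pass
            error = p - target
            total_squared_error += error ** 2            # loss for this example
            dloss_dz = 2.0 * error * p * (1.0 - p)       # backpropagation (chain rule)
            grad_w[0] += dloss_dz * inputs[0]
            grad_w[1] += dloss_dz * inputs[1]
            grad_b += dloss_dz
        # One update per epoch, using the gradient averaged over all examples.
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
        loss_history.append(total_squared_error / n)     # MSE measured before this update
    return weights, bias, loss_history

weights, bias, loss_history = train_and_gate()
print("weights:", [round(w, 2) for w in weights], "bias:", round(bias, 2))
for inputs, target in AND_EXAMPLES:
    print(inputs, "->", round(predict(weights, bias, inputs), 3), "(target", target, ")")
```

After training, the prediction for [1, 1] should sit close to 1 and the other three close to 0. Try changing seed or learning_rate to see how the final weights move around while the behavior stays the same.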

🗺️ Reading the Decision Boundary

On the left below, you'll see a colorful 2D grid. Here's how to read it:

The purple/dark regions are where the neuron outputs a value close to 1 (it's saying "yes, this is AND").
The light/bright regions are where the neuron outputs a value close to 0 (it's saying "no").
The 4 dots represent the 4 input combinations: (0,0), (0,1), (1,0), and (1,1). Each dot shows the neuron's current prediction.
As training progresses, you should see the purple region shift so that only the (1,1) dot ends up in the purple zone — that means the neuron learned the AND gate!
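A decision boundary picture is just the neuron's output evaluated at many points on a grid. The sketch below draws a tiny text version of one. The weights (6, 6) and bias (-9) are hypothetical values standing in for a trained neuron, and the function names are invented for this example.

```python
import math

def neuron_output(x1, x2, weights=(6.0, 6.0), bias=-9.0):
    """Output of a sigmoid neuron using hypothetical 'already trained' values."""
    z = weights[0] * x1 + weights[1] * x2 + bias
    return 1.0 / (1.0 + math.exp(-z))

def decision_grid(resolution=5):
    """Return text rows, top row is x2 = 1, bottom row is x2 = 0, x1 runs left to right.

    '#' marks points where the neuron says "yes" (output above 0.5), '.' marks "no".
    """
    steps = [i / (resolution - 1) for i in range(resolution)]
    rows = []
    for x2 in reversed(steps):
        rows.append("".join("#" if neuron_output(x1, x2) > 0.5 else "." for x1 in steps))
    return rows

for row in decision_grid():
    print(row)
```

The "#" marks should cluster in the top-right corner around (1, 1), which is the text equivalent of the purple zone described above. Raising resolution gives a finer picture.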

📈 Reading the Loss Chart

On the right, you'll see a chart that tracks the loss over time:

X axis = Epoch (training round number). Each tick is one pass through all 4 examples.
Y axis = Loss (error). How wrong the network is overall. Remember: lower is better!
You should see the line drop from high to low as training progresses — that's the network getting smarter!

Try clicking Train and watch both visualizations update in real time. You can also try changing the learning rate — a higher rate makes training faster but can be unstable. A lower rate is safer but slower.

[Interactive demo: AND gate trainer with a Train button, a learning rate slider (default 2.0), live readouts for Epoch, Loss, Weight 1, Weight 2, and Bias, a decision boundary grid, and a loss-over-time chart (X: epoch, Y: loss)]

After training completes, look at the final weights. You should see both weights are positive (meaning the neuron cares about both inputs) and the bias is negative (meaning the neuron's default answer is "no" — it only says "yes" when both inputs push hard enough to overcome the negative bias). That's the AND gate, learned from scratch!
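You can check this reasoning with a few lines of arithmetic. A sigmoid neuron says "yes" (output above 0.5) exactly when its weighted sum is positive, so the sketch below tests the sign of that sum for all four inputs. The specific weights and biases are hypothetical, and behaves_like_and is a name invented for this example.

```python
def weighted_sum(w1, w2, bias, x1, x2):
    return w1 * x1 + w2 * x2 + bias

def behaves_like_and(w1, w2, bias):
    """True if the weighted sum is positive for input (1, 1) and negative for the rest."""
    return (
        weighted_sum(w1, w2, bias, 0, 0) < 0      # the bias alone must say "no"
        and weighted_sum(w1, w2, bias, 0, 1) < 0  # one input alone can't beat the bias
        and weighted_sum(w1, w2, bias, 1, 0) < 0
        and weighted_sum(w1, w2, bias, 1, 1) > 0  # both inputs together can
    )

print(behaves_like_and(6.0, 6.0, -9.0))    # bias sits between one weight and both weights
print(behaves_like_and(6.0, 6.0, -3.0))    # bias too weak: a single input already says "yes"
print(behaves_like_and(6.0, 6.0, -13.0))   # bias too strong: even (1, 1) says "no"
```

So the sign pattern alone is not quite enough: the bias also has to be more negative than either weight on its own, but less negative than both weights combined.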

💡 Things to Try

Set the learning rate to 0.5 — training is slow but steady. Then try 5.0 — see how the loss jumps around wildly? That's because big steps overshoot the valley.
Watch the decision boundary as training progresses. The purple region should creep toward the top-right corner, covering only (1,1).
Try the gradient descent ball above with different learning rates. At 0.01 it barely moves. At 0.5 it bounces past the minimum. Around 0.1 is the sweet spot.
Reset and watch the loss chart closely: notice how it drops fast at first, then slows down and flattens out. When the loss stops improving noticeably, we say that training has converged.
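One common way to detect convergence in code is to stop when the loss improves by less than some small tolerance. The sketch below does this on an invented one-weight loss, (w - 3)^2. The function name train_until_converged, the tolerance, and the learning rate are all assumptions for illustration.

```python
def loss(w):
    """Hypothetical loss landscape with its lowest point at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    return 2.0 * (w - 3.0)

def train_until_converged(w, learning_rate=0.1, tolerance=1e-6, max_steps=1000):
    """Take gradient descent steps until the loss improves by less than `tolerance`."""
    history = [loss(w)]
    for _ in range(max_steps):
        w -= learning_rate * gradient(w)
        history.append(loss(w))
        improvement = history[-2] - history[-1]
        if improvement < tolerance:
            break
    return w, history

final_w, history = train_until_converged(w=0.0)
print("steps taken:", len(history) - 1)
print("first improvement:", history[0] - history[1])
print("last improvement:", history[-2] - history[-1])
print("final weight:", round(final_w, 4))
```

The first step improves the loss by a lot and the last one by almost nothing, which is the same fast-then-slow shape you see on the loss chart.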

🤯 What Just Happened?

Let's step back and appreciate what you just witnessed. You took a neuron that started with completely random weights — it had no idea what an AND gate was. Then you clicked "Train," and the neuron figured it out by itself through nothing more than looking at examples and adjusting its weights to reduce error.

No one told the neuron "make weight 1 positive" or "set the bias to -5." It discovered those values on its own through gradient descent. That's machine learning.

Now scale this up: instead of 2 inputs and 1 neuron, imagine millions of neurons with billions of connections, trained on trillions of words from the internet. The same process — forward pass, calculate loss, backpropagate, update weights, repeat — is how ChatGPT learned to write, reason, and have conversations.

🔑 The key insight: Neural networks aren't programmed with rules.
They learn patterns from examples, one tiny weight adjustment at a time.
Next up: what happens when we connect many neurons in layers?