In Chapter 1, you set the weight and bias by hand. That was fun, but imagine doing that for a network with billions of weights. Impossible! In this chapter, we'll discover how neurons figure out the right weights on their own: by learning from examples.
Have you ever tuned a guitar? You pluck a string, listen to the sound, and then turn the tuning peg a tiny bit. Pluck again. Closer? Turn a little more. You keep repeating this cycle until the note sounds right. You never calculate the exact position of the peg with a formula; you just listen and adjust.
Training a neural network works the same way. We show the network an example (like an input-output pair), let it make a prediction, check how wrong it was, and then nudge the weights slightly to make it less wrong. Then we do it again. And again. Thousands of times. Each time, the network gets a tiny bit better.
But to do this, we need a way to measure "how wrong" the network is. That measurement has a name: loss.
Loss is a single number that tells you how wrong the network's predictions are. Think of it like a score in golf: lower is better. A loss of 0 means the network is perfect, with every prediction exactly matching the correct answer. A high loss means the network is way off.
For example, if the correct answer is 1.0 and the network predicts 0.3, the loss captures that gap. The specific formula we'll use is called Mean Squared Error (MSE). It takes the difference between the prediction and the correct answer, squares it (so errors in either direction count as positive, and big errors count extra), and averages across all examples. But you don't need to memorize that. Just remember: loss = how wrong we are. Lower is better.
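If you'd like to see MSE as code, here is a minimal sketch in plain Python. The function name `mean_squared_error` and the variable names are made up for this illustration; only the 1.0 versus 0.3 example comes from the text above.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared gaps between predictions and correct answers."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)


# The example from the text: the correct answer is 1.0, the network says 0.3.
single_loss = mean_squared_error([0.3], [1.0])      # (0.3 - 1.0)^2, about 0.49

# Guessing too high by the same amount costs the same, thanks to the square.
overshoot_loss = mean_squared_error([1.7], [1.0])   # (1.7 - 1.0)^2, about 0.49

# Perfect predictions give a loss of exactly 0.
perfect_loss = mean_squared_error([1.0, 0.0], [1.0, 0.0])

print(single_loss, overshoot_loss, perfect_loss)
```

Notice that guessing 0.7 too low and 0.7 too high produce the same loss: squaring throws away the sign of the error.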
When the network looks at every single training example once, that's called one epoch (pronounced "EP-ock"). If you have 4 training examples and the network sees all 4, that's 1 epoch. If it sees all 4 again, that's 2 epochs. Training usually takes hundreds or thousands of epochs, and each pass through the data makes the network a little bit smarter.
Think of epochs like laps around a track. One lap = one epoch. The more laps you run, the more fit you get (up to a point). Similarly, the more epochs you train, the lower the loss generally goes, and the network keeps getting better.
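In code, epochs are usually just an outer loop around the training data. The sketch below only counts examples so you can see the bookkeeping; the four (input, answer) pairs are placeholder values, and a real loop would do the learning where the comment sits.

```python
# Four made-up (input, correct answer) pairs. The values are placeholders.
training_examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

num_epochs = 3
examples_seen = 0

for epoch in range(1, num_epochs + 1):
    # One full pass over every training example = one epoch (one lap).
    for x, correct_answer in training_examples:
        # A real training loop would predict, measure loss, and adjust weights here.
        examples_seen += 1
    print(f"epoch {epoch} finished, examples seen so far: {examples_seen}")
```

With 4 examples and 3 epochs, the network sees 12 examples in total, but only 4 different ones.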
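Let's walk through exactly what happens during one training step. There are 5 stages, and they repeat over and over:
1. Forward pass: the network takes an example and makes a prediction.
2. Calculate loss: we measure how wrong that prediction was.
3. Backpropagate: we trace the error backwards to see how much each weight contributed to it.
4. Update weights: we nudge each weight slightly to make the error smaller.
5. Repeat: we do it all again with the next example.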
This cycle (forward, measure error, trace backwards, adjust, repeat) is the heartbeat of neural network training. Modern neural networks, including the one behind ChatGPT, are trained this way.
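Here is one way that cycle could look for the simplest possible case: a single neuron with one weight, one bias, and no activation function, trained on one example with squared error. All of that is an assumption made for this sketch (the neuron from Chapter 1 may be set up differently), and the function name `training_step` is invented for illustration.

```python
def training_step(weight, bias, x, target, learning_rate):
    """One pass through the cycle for a neuron computing weight * x + bias."""
    # Stage 1, forward: make a prediction.
    prediction = weight * x + bias

    # Stage 2, measure error: squared difference from the correct answer.
    loss = (prediction - target) ** 2

    # Stage 3, trace backwards: how does the loss change if we nudge
    # the weight or the bias? (Chain rule on the two lines above.)
    d_loss_d_prediction = 2 * (prediction - target)
    grad_weight = d_loss_d_prediction * x
    grad_bias = d_loss_d_prediction

    # Stage 4, adjust: step each value a little in the downhill direction.
    weight -= learning_rate * grad_weight
    bias -= learning_rate * grad_bias
    return weight, bias, loss


weight, bias = 0.0, 0.0
losses = []

# Stage 5, repeat.
for step in range(20):
    weight, bias, loss = training_step(
        weight, bias, x=2.0, target=1.0, learning_rate=0.05
    )
    losses.append(loss)

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.12f}")
```

The loss starts at 1.0 (the neuron predicts 0 when the answer is 1) and shrinks with every repetition, which is exactly the "tiny bit better each time" behavior described above.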
Here's a beautiful way to visualize training. Imagine that we plot the loss (how wrong the network is) as a landscape of hills and valleys. The height of the landscape at any point represents the loss for a particular weight value. Our goal is to find the lowest valley: the weight that gives the smallest loss.
Gradient descent is the algorithm that finds that valley. Don't let the fancy name scare you: it literally just means "going downhill, step by step." Picture dropping a ball onto the landscape and letting gravity do the work. The ball rolls downhill, always moving toward lower loss. The gradient is just a fancy word for the slope under the ball. It points uphill, so the ball rolls the opposite way.
The learning rate controls how big each step is. A small learning rate means the ball takes tiny, careful steps (slow but precise). A large learning rate means big, bold leaps (fast but might overshoot the valley and bounce around). Try adjusting the learning rate slider below and see the difference!
Try this: Click "Drop the Ball" and watch it roll to the bottom. Then reset, crank the learning rate up to 0.5, and drop again. See how it behaves differently?
What you just saw is the core of how every neural network learns. The ball finding the valley is the network finding the best weights. In real networks, the landscape has millions of dimensions (one per weight) instead of just one β but the principle is identical: follow the slope downhill.
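The rolling ball can also be sketched in a few lines of Python. The bowl-shaped landscape `loss(w) = (w - 3)^2` is a made-up stand-in (it is not the landscape from the interactive demo), and the three learning rates are chosen only to show slow, comfortable, and bouncy behavior on this particular bowl.

```python
def loss(w):
    """A hypothetical one-weight landscape with its valley at w = 3."""
    return (w - 3.0) ** 2


def gradient(w):
    """Slope of the landscape at w. Positive means uphill is to the right."""
    return 2.0 * (w - 3.0)


def gradient_descent(start, learning_rate, steps):
    """Roll the ball: repeatedly step against the slope, recording the path."""
    w = start
    path = [w]
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
        path.append(w)
    return path


for learning_rate in (0.01, 0.1, 0.9):
    path = gradient_descent(start=0.0, learning_rate=learning_rate, steps=20)
    # Count how often the ball jumps from one side of the valley to the other.
    crossings = sum(
        1 for a, b in zip(path, path[1:]) if (a - 3.0) * (b - 3.0) < 0
    )
    print(
        f"learning rate {learning_rate}: ended at w = {path[-1]:.3f}, "
        f"loss = {loss(path[-1]):.5f}, crossed the valley {crossings} times"
    )
```

On this bowl, the smallest rate is still far from the valley after 20 steps, the middle rate gets close without ever overshooting, and the largest rate leaps over the valley on every single step. Which rates count as "small" or "large" depends on the shape of the landscape, so treat these numbers as illustrative only.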
Let's put this all together and watch a neuron actually learn. We'll train it on a classic problem: the AND gate.
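An AND gate is a simple rule: the output is 1 only when both inputs are 1. Here are all 4 possible input combinations:
Input 1 = 0, Input 2 = 0: output 0
Input 1 = 0, Input 2 = 1: output 0
Input 1 = 1, Input 2 = 0: output 0
Input 1 = 1, Input 2 = 1: output 1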
Our neuron starts with random weights, so it has no idea what the AND gate is. Then we train it: show it the 4 examples over and over, and let gradient descent adjust the weights to minimize the loss. After enough epochs, the neuron should figure out the AND gate on its own!
On the left below, you'll see a colorful 2D grid. Here's how to read it:
On the right, you'll see a chart that tracks the loss over time:
Try clicking Train and watch both visualizations update in real time. You can also try changing the learning rate: a higher rate makes training faster but can be unstable. A lower rate is safer but slower.
[Chart: loss over time. X-axis: epoch (training round). Y-axis: loss (error).]
After training completes, look at the final weights. You should see both weights are positive (meaning the neuron cares about both inputs) and the bias is negative (meaning the neuron's default answer is "no": it only says "yes" when both inputs push hard enough to overcome the negative bias). That's the AND gate, learned from scratch!
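If you want to reproduce this experiment outside the interactive demo, here is a rough sketch. It makes several assumptions that the text does not pin down: the neuron squashes its output with a sigmoid, the weights are updated after every example, and the learning rate (0.5), epoch count (5000), and random seed are arbitrary picks. The demo above may use different settings.

```python
import math
import random

# The AND gate: ((input 1, input 2), correct output).
AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]


def sigmoid(z):
    """Squash any number into the range 0..1 (an assumed activation)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(w1, w2, bias, x1, x2):
    return sigmoid(w1 * x1 + w2 * x2 + bias)


def train_and_gate(epochs=5000, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    # Start from random weights: the neuron knows nothing yet.
    w1, w2, bias = (rng.uniform(-1, 1) for _ in range(3))
    loss_history = []
    for _ in range(epochs):
        total_squared_error = 0.0
        for (x1, x2), target in AND_EXAMPLES:
            p = predict(w1, w2, bias, x1, x2)          # forward pass
            error = p - target
            total_squared_error += error ** 2          # measure error
            # Trace backwards: slope of error^2 through the sigmoid.
            grad_z = 2 * error * p * (1 - p)
            # Adjust: nudge each value downhill.
            w1 -= learning_rate * grad_z * x1
            w2 -= learning_rate * grad_z * x2
            bias -= learning_rate * grad_z
        # Average squared error over this epoch's 4 examples (MSE).
        loss_history.append(total_squared_error / len(AND_EXAMPLES))
    return w1, w2, bias, loss_history


w1, w2, bias, loss_history = train_and_gate()
print(f"w1 = {w1:.2f}, w2 = {w2:.2f}, bias = {bias:.2f}")
print(f"loss: {loss_history[0]:.4f} at the start, {loss_history[-1]:.4f} at the end")
for (x1, x2), target in AND_EXAMPLES:
    print(f"{x1} AND {x2}: predicted {predict(w1, w2, bias, x1, x2):.3f}, correct {target}")
```

Nothing in this code mentions "positive weights" or "negative bias". If training succeeds, those values fall out of the updates on their own, because a negative bias with two positive weights is the only way this neuron can say "yes" to (1, 1) and "no" to everything else.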
Let's step back and appreciate what you just witnessed. You took a neuron that started with completely random weights, with no idea what an AND gate was. Then you clicked "Train," and the neuron figured it out by itself through nothing more than looking at examples and adjusting its weights to reduce error.
No one told the neuron "make weight 1 positive" or "set the bias to -5." It discovered those values on its own through gradient descent. That's machine learning.
Now scale this up: instead of 2 inputs and 1 neuron, imagine millions of neurons with billions of connections, trained on trillions of words from the internet. The same process (forward pass, calculate loss, backpropagate, update weights, repeat) is how ChatGPT learned to write, reason, and have conversations.
The key insight: Neural networks aren't programmed with rules.
They learn patterns from examples, one tiny weight adjustment at a time.
Next up: what happens when we connect many neurons in layers?