The secret ingredient of transformers: letting each word decide which other words matter most.
In the last chapter, we turned words into numbers — vectors that capture meaning. Now we need the neural network to actually understand sentences. And understanding a sentence means understanding how words relate to each other.
Consider this sentence: "The animal didn't cross the street because it was too tired." What does "it" refer to? The animal, obviously. But how does a computer figure that out? "It" and "animal" are separated by several words. The computer needs some way to connect them.
Before 2017, the most popular approach was called a Recurrent Neural Network (RNN). An RNN reads a sentence one word at a time, from left to right, like a person reading aloud. It keeps a running "memory" of what it's read so far. Sounds reasonable, right?
The problem: RNNs forget. By the time an RNN reaches word #50 in a sentence, its memory of word #1 has usually faded to almost nothing, like trying to recall the first word of a paragraph after reading the whole thing. For short sentences this was okay. For anything longer (articles, conversations, code) the approach struggled.
In 2017, researchers at Google published a paper called "Attention Is All You Need." Their idea was simple but powerful: throw away the left-to-right reading entirely. Instead, let every word look at every other word directly, all at once, regardless of distance. Word #50 can look right back at word #1 just as easily as it looks at word #49.
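This mechanism is called attention (or, more precisely, self-attention, because the words in a sentence attend to one another), and it's the core idea behind GPT, ChatGPT, and pretty much every modern language model. Without it, these models would not work the way they do.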
Attention has three key components, and they have slightly intimidating names: Query, Key, and Value. But the concept is surprisingly intuitive once you hear the right analogy.
Imagine you walk into a party, a room full of people. You have a question on your mind, say, "Who here knows about cooking?" That question is your Query.
Each person at the party is wearing a name tag that describes their expertise. One tag says "Chef," another says "Programmer," another says "Artist." These name tags are the Keys.
You scan the room, comparing your Query to each person's Key. "Chef" matches your cooking question really well! "Programmer" doesn't match at all. "Artist"... maybe a little (food art is a thing). Based on these matches, you decide how much attention to pay to each person.
Then you listen to what each person actually has to say. The chef tells you an amazing recipe. The programmer talks about code (not useful for your cooking question). The artist mentions food presentation. What they tell you is the Value — the actual useful information you gather.
Your final "answer" is a blend of everyone's Values, weighted by how well their Key matched your Query. You pay 80% attention to the chef, 5% to the programmer, and 15% to the artist.
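To make the blending step concrete, here is a minimal sketch in Python. The 80% / 5% / 15% weights come from the party example, but the two-number "Values" for each guest are made up purely for illustration, and the function name blend is hypothetical.

```python
# Attention shares from the party example above.
weights = {"chef": 0.80, "programmer": 0.05, "artist": 0.15}

# Made-up Values: a tiny two-number summary of what each guest says.
# Real Values are much longer vectors and are learned, not hand-written.
values = {
    "chef": [1.0, 0.0],
    "programmer": [0.0, 1.0],
    "artist": [0.5, 0.5],
}

def blend(weights, values):
    """Return the weighted sum of the Value vectors."""
    size = len(next(iter(values.values())))
    result = [0.0] * size
    for name, weight in weights.items():
        for i, component in enumerate(values[name]):
            result[i] += weight * component
    return result

answer = blend(weights, values)
print(answer)  # dominated by the chef's Value, with a little of the others mixed in
```

Because the chef holds most of the weight, the blended answer ends up close to the chef's Value. The programmer barely shifts it.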
In a neural network, every word in a sentence creates all three: a Query, a Key, and a Value. These are just vectors (lists of numbers) computed from the word's embedding.
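As a rough sketch of what "computed from the word's embedding" can look like: in transformer models, each of the three vectors is typically produced by multiplying the embedding by its own weight matrix. The numbers below are invented for illustration (in a real model the matrices are learned during training), and the names matvec, W_query, W_key, and W_value are placeholders.

```python
def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Made-up 3-number embedding for a single word.
embedding = [1.0, 2.0, 0.5]

# Made-up 2x3 weight matrices. A trained model learns these values.
W_query = [[0.1, 0.0, 0.2], [0.0, 0.3, 0.0]]
W_key = [[0.2, 0.1, 0.0], [0.0, 0.0, 0.4]]
W_value = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# The same embedding yields three different vectors, one per role.
query = matvec(W_query, embedding)
key = matvec(W_key, embedding)
value = matvec(W_value, embedding)

print(query, key, value)
```

The point of using three separate matrices is that the same word can ask one thing (its Query), advertise another (its Key), and pass along a third (its Value).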
For each word, its Query is compared to every word's Key, including its own. The better the match, the higher the score. The scores are then normalized into weights that add up to 100%, like the percentages in the party example. The output for that word is a weighted blend of all the Values, using those weights.
So in our example sentence — "The animal didn't cross the street because it was too tired" — when the network processes the word "it," its Query essentially asks "who or what am I referring to?" The word "animal" has a Key that matches well, so "it" pays a lot of attention to "animal." The network successfully connects the two words, even though they're far apart.
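The compare-then-blend procedure described above can be sketched in a few lines of plain Python. This follows the scaled dot-product form used in transformer models: a dot product measures each Query-Key match, the scores are divided by the square root of the Key length, and a softmax turns them into weights. The three words and their vectors are hand-picked so that "it" matches "animal". They are assumptions for illustration, not values from a trained model.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product: larger when two vectors point the same way."""
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # Compare this word's Query to every word's Key (including its own).
        scores = [dot(q, k) / scale for k in keys]
        weights = softmax(scores)
        # Blend all the Values using those weights.
        blended = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Hand-picked toy vectors (assumptions, not learned values).
words = ["animal", "street", "it"]
queries = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
keys = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = attention(queries, keys, values)
it_weights = dict(zip(words, weights[2]))
print(it_weights)  # "animal" receives the largest share of the attention from "it"
```

With these toy numbers, the Query for "it" lines up with the Key for "animal", so most of the weight lands there and the output for "it" is pulled toward the Value of "animal".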
The grid below is called an attention heatmap. Here's how to read it:
Type a sentence below and watch the attention pattern change. Try sentences where pronouns refer to earlier nouns (like "The dog chased the cat because it was fast") and see if the heatmap picks up on the connection.
What just happened? You're seeing a simulated version of what happens inside a transformer's attention layer. Each word computed a Query and compared it to every other word's Key. The resulting weights — shown as the colored cells — determine how much information flows between words. This is how the network builds an understanding of the relationships between words, not just the individual words themselves.
Note: this demo uses a simplified heuristic to generate plausible attention patterns. Real attention weights are learned during training and can be surprisingly complex — sometimes capturing grammar, sometimes meaning, sometimes patterns that humans can't easily interpret.
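If you want to draw a rough heatmap yourself, the sketch below prints a grid of attention weights as text, with darker characters for larger weights. It assumes each row is the word doing the looking and each column is the word being looked at. The weights, the shade characters, and the function names are all invented for illustration.

```python
SHADES = " .:*#"  # light to dark; an arbitrary choice for this sketch

def shade(weight):
    """Map a weight between 0 and 1 to one of the shade characters."""
    index = min(int(weight * len(SHADES)), len(SHADES) - 1)
    return SHADES[index]

def render_heatmap(words, weights):
    """Return one text line per word: the word, then one shaded cell per column."""
    width = max(len(w) for w in words)
    lines = []
    for word, row in zip(words, weights):
        cells = " ".join(shade(w) for w in row)
        lines.append(f"{word.rjust(width)} | {cells}")
    return "\n".join(lines)

# Made-up attention weights; each row sums to 1.
words = ["animal", "street", "it"]
weights = [
    [0.70, 0.10, 0.20],
    [0.10, 0.80, 0.10],
    [0.85, 0.05, 0.10],
]

print(render_heatmap(words, weights))
```

In the last row, the darkest cell sits in the first column, which is the text version of "it" paying most of its attention to "animal".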
Here's another piece worth knowing: in GPT and other real transformer models, there isn't just one attention mechanism running — there are many running in parallel. These are called attention heads.
Why? Because words relate to each other in many different ways simultaneously. Think about the sentence "The tired old dog slowly chased the energetic young cat." There are multiple types of relationships happening:
A single attention head tends to learn one "style" of paying attention. By running many heads at once (the largest GPT-3 model uses 96 attention heads per layer!), the model can capture all these different relationship types simultaneously. Each head produces its own set of attention weights, and the heads' outputs are joined together and mixed to give the final, rich representation of each word.
It's like having 96 different people at the party, each listening for something different — one is listening for topic relevance, another for emotional tone, another for who's replying to whom, and so on. Together, they build a much fuller picture than any single listener could.
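Here is a simplified sketch of running several heads side by side. It makes two loud assumptions: each head simply gets its own slice of the Query, Key, and Value vectors (real models give each head its own learned projections), and the heads' outputs are only concatenated (real models follow this with another learned mixing step). All names and numbers are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Single-head scaled dot-product attention."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return outputs

def split_heads(vectors, num_heads):
    """Cut each vector into num_heads equal slices; return one list of slices per head."""
    head_size = len(vectors[0]) // num_heads
    return [
        [vec[h * head_size:(h + 1) * head_size] for vec in vectors]
        for h in range(num_heads)
    ]

def multi_head_attention(queries, keys, values, num_heads):
    """Run attention separately in each head, then join the results per word."""
    q_heads = split_heads(queries, num_heads)
    k_heads = split_heads(keys, num_heads)
    v_heads = split_heads(values, num_heads)
    head_outputs = [attend(q, k, v) for q, k, v in zip(q_heads, k_heads, v_heads)]
    return [
        [x for head in head_outputs for x in head[word]]
        for word in range(len(queries))
    ]

# Made-up 4-number vectors for three words, split across two heads.
queries = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
keys = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
values = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]]

combined = multi_head_attention(queries, keys, values, num_heads=2)
print(combined)
```

Each head sees only its own slice, so it computes its own attention weights independently of the others. That independence is what lets different heads specialize in different kinds of relationships.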
This chapter covered one of the most important ideas in modern AI. Let's recap:
In the next chapter, we'll put it all together and see how attention fits into the full Transformer architecture — the complete blueprint behind GPT, ChatGPT, and all their cousins.