Chapter 6: Paying Attention

🧠 The Problem With Reading Left to Right

In the last chapter, we turned words into numbers — vectors that capture meaning. Now we need the neural network to actually understand sentences. And understanding a sentence means understanding how words relate to each other.

Consider this sentence: "The animal didn't cross the street because it was too tired." What does "it" refer to? The animal, obviously. But how does a computer figure that out? "It" and "animal" are separated by several words. The computer needs some way to connect them.

Before 2017, the most popular approach was called a Recurrent Neural Network (RNN). An RNN reads a sentence one word at a time, from left to right, like a person reading aloud. It keeps a running "memory" of what it's read so far. Sounds reasonable, right?

The problem: RNNs forget. By the time an RNN reaches word #50 in a sentence, its memory of word #1 has faded to almost nothing — like trying to remember the first word of a paragraph after reading the whole thing. For short sentences this was okay. For anything longer — articles, conversations, code — it fell apart.
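To get a feel for how quickly a running memory can fade, here is a toy sketch. It is not how a real RNN works (real ones use learned weights and nonlinearities). It simply assumes that each new word shrinks the old memory by a fixed factor. The name `DECAY` and its value of 0.9 are made up for illustration.

```python
# Toy model only: real RNNs use learned weight matrices and nonlinearities.
DECAY = 0.9  # hypothetical fraction of the old memory kept at each step


def first_word_share(num_words: int, decay: float = DECAY) -> float:
    """Fraction of word #1's signal left in memory after reading num_words words."""
    share = 1.0
    for _ in range(num_words - 1):  # one shrink per additional word read
        share *= decay
    return share


for n in (5, 20, 50):
    print(f"after {n} words: {first_word_share(n):.4f}")
```

Even with 90% of the memory kept at every step, the first word's share drops below 1% by word #50 in this toy setup.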

In 2017, researchers at Google published a paper called "Attention Is All You Need." Their idea was simple but powerful: throw away the left-to-right reading entirely. Instead, let every word look at every other word directly, all at once, regardless of distance. Word #50 can look right back at word #1 just as easily as it looks at word #49.

This mechanism is called attention, and it's the core idea behind GPT, ChatGPT, and pretty much every modern language model. Without it, none of this works.

📚 Query, Key, and Value: The Party Analogy

Attention has three main components, and they have slightly intimidating names: Query, Key, and Value. But the concept is surprisingly intuitive once you hear the right analogy.

Imagine you're at a party 🎉

You walk into a room full of people. You have a question on your mind — say, "Who here knows about cooking?" That question is your Query.

Each person at the party is wearing a name tag that describes their expertise. One tag says "Chef," another says "Programmer," another says "Artist." These name tags are the Keys.

You scan the room, comparing your Query to each person's Key. "Chef" matches your cooking question really well! "Programmer" doesn't match at all. "Artist"... maybe a little (food art is a thing). Based on these matches, you decide how much attention to pay to each person.

Then you listen to what each person actually has to say. The chef tells you an amazing recipe. The programmer talks about code (not useful for your cooking question). The artist mentions food presentation. What they tell you is the Value — the actual useful information you gather.

Your final "answer" is a blend of everyone's Values, weighted by how well their Key matched your Query. You pay 80% attention to the chef, 5% to the programmer, and 15% to the artist.
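The blending step in the party story is just a weighted sum. Here is a minimal sketch using the 80/5/15 split from above. The Value vectors are invented for illustration: each one is a made-up pair of numbers standing for [recipe detail, presentation tips].

```python
import numpy as np

# Hypothetical Values: what each guest says, as [recipe_detail, presentation_tips].
values = np.array([
    [0.9, 0.1],  # chef
    [0.0, 0.2],  # programmer
    [0.5, 0.8],  # artist
])

# Attention weights from the analogy: 80% chef, 5% programmer, 15% artist.
weights = np.array([0.80, 0.05, 0.15])

# The final "answer" is the weighted sum of everyone's Values.
answer = weights @ values
print(answer)
```

Because the chef gets most of the weight, the answer ends up looking mostly like what the chef said, with a little of the artist mixed in.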

How this works for words

In a neural network, every word in a sentence creates all three: a Query, a Key, and a Value. These are just vectors (lists of numbers) computed from the word's embedding.

The Query says: "What kind of information am I looking for?"
The Key says: "Here's what kind of information I have."
The Value says: "And here's the actual information itself."
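Here is one way to sketch how the three vectors could be computed from an embedding. It assumes the standard transformer recipe, where each one comes from multiplying the embedding by its own weight matrix. The sizes, the word list, and the random numbers are all placeholders. In a real model, the embeddings and the matrices `W_q`, `W_k`, and `W_v` are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

words = ["the", "animal", "was", "tired"]
d_model, d_head = 8, 4  # hypothetical embedding size and Query/Key/Value size

# Stand-in embeddings: one row of numbers per word (learned in a real model).
embeddings = rng.normal(size=(len(words), d_model))

# One projection matrix per role. Random here; learned in practice.
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

Q = embeddings @ W_q  # what each word is looking for
K = embeddings @ W_k  # what each word offers
V = embeddings @ W_v  # the information each word carries

print(Q.shape, K.shape, V.shape)
```

Every word gets one row in each of `Q`, `K`, and `V`, so the same word plays all three roles at once.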

For each word, its Query is compared to every word's Key, including its own. The better the match, the higher the score. The scores are then turned into attention weights that add up to 100%, like the 80/5/15 split at the party. The output for that word is a weighted blend of all the Values, using those weights.
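The compare-then-blend procedure described above can be sketched in a few lines of NumPy. This follows the scaled dot-product form from the "Attention Is All You Need" paper: the match score is a dot product, scores are divided by the square root of the vector size, and a softmax turns them into weights. Those details go beyond what this chapter covers, and the random inputs are placeholders rather than real word vectors.

```python
import numpy as np


def softmax(x, axis=-1):
    """Turn scores into positive weights that sum to 1 along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    """Scaled dot-product attention for one sentence (one row per word)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # every Query compared with every Key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted blend of the Values


rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))  # 5 words, size-4 vectors
output, weights = attention(Q, K, V)
print(weights.round(2))
```

Row i of `weights` is word i's attention over all the words, and row i of `output` is its blended result.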

So in our example sentence — "The animal didn't cross the street because it was too tired" — when the network processes the word "it," its Query essentially asks "who or what am I referring to?" The word "animal" has a Key that matches well, so "it" pays a lot of attention to "animal." The network successfully connects the two words, even though they're far apart.

🔥 Interactive Attention Heatmap

How to read this visualization

The grid below is called an attention heatmap. Here's how to read it:

Each row represents a word asking the question: "Which other words should I pay attention to?"
Each column represents a word that might get attended to.
The color intensity shows how much attention is being paid. Darker purple = more attention. Lighter or white = less attention.
The diagonal (top-left to bottom-right) is often strong because words tend to pay attention to themselves — which makes sense, since a word's own meaning is always relevant.
Dark off-diagonal cells show where words are paying attention to other words. These are the interesting ones: they reveal which words the network thinks are related.
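If you want to see the same layout outside the interactive demo, here is a small text-only sketch. The weights are made up for illustration (they are not from a trained model), and denser characters stand in for darker colors. The names `SHADES`, `shade`, and `render` are hypothetical helpers.

```python
import numpy as np

words = ["the", "dog", "chased", "it"]

# Made-up attention weights for illustration; each row sums to 1.
weights = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.60, 0.25, 0.05],
    [0.05, 0.35, 0.50, 0.10],
    [0.05, 0.55, 0.10, 0.30],
])

SHADES = " .:*#"  # lightest to darkest


def shade(w):
    """Map a weight in [0, 1] to a character; denser means more attention."""
    index = min(int(w * len(SHADES)), len(SHADES) - 1)
    return SHADES[index]


def render(words, weights):
    """Rows are the words asking; columns are the words being attended to."""
    lines = [" " * 8 + "".join(f"{w:>8}" for w in words)]
    for word, row in zip(words, weights):
        lines.append(f"{word:>8}" + "".join(f"{shade(w):>8}" for w in row))
    return "\n".join(lines)


print(render(words, weights))
```

In this invented example, the row for "it" puts most of its weight in the "dog" column, which is the kind of off-diagonal cell worth looking for.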

Type a sentence below and watch the attention pattern change. Try sentences where pronouns refer to earlier nouns (like "The dog chased the cat because it was fast") and see if the heatmap picks up on the connection.


What just happened? You're seeing a simulated version of what happens inside a transformer's attention layer. Each word computed a Query and compared it to every word's Key, its own included. The resulting weights, shown as the colored cells, determine how much information flows between words. This is how the network builds an understanding of the relationships between words, not just the individual words themselves.

Note: this demo uses a simplified heuristic to generate plausible attention patterns. Real attention weights are learned during training and can be surprisingly complex — sometimes capturing grammar, sometimes meaning, sometimes patterns that humans can't easily interpret.

💡 Things to Try

Type "The bank by the river" vs "The bank approved the loan" — notice how the attention pattern for "bank" changes depending on context.
Try a longer sentence and watch how distant words can still attend to each other. This is the key advantage over older models like RNNs, whose memory of distant words faded.
Look at the diagonal: words usually attend to themselves somewhat. But the off-diagonal connections are where the magic happens.
Try "She gave him the book because he asked" — can you spot which words "he" and "she" attend to?

🎭 Multi-Head Attention: Looking at Everything at Once

Here's another piece worth knowing: in GPT and other real transformer models, there isn't just one attention mechanism running — there are many running in parallel. These are called attention heads.

Why? Because words relate to each other in many different ways simultaneously. Think about the sentence "The tired old dog slowly chased the energetic young cat." There are multiple types of relationships happening:

Grammatical relationships: "dog" is the subject of "chased." "cat" is the object. One attention head might specialize in tracking subject-verb-object structure.
Descriptive relationships: "tired," "old," and "slowly" all describe the dog's side of the action. Another head might specialize in connecting adjectives and adverbs to the nouns and verbs they modify.
Positional relationships: Maybe a head just pays attention to nearby words — the words immediately before and after.
Semantic relationships: "dog" and "cat" are both animals. "chased" implies a specific kind of interaction between them. Another head might focus on these meaning-based connections.

A single attention head produces only one set of attention weights per word, so it can capture only one "style" of paying attention at a time. By running many heads at once (the largest GPT-3 model uses 96 attention heads per layer!), the model can capture all these different relationship types simultaneously. Each head produces its own set of attention weights, and their outputs are combined to give the final, rich representation of each word.

It's like having 96 different people at the party, each listening for something different — one is listening for topic relevance, another for emotional tone, another for who's replying to whom, and so on. Together, they build a much fuller picture than any single listener could.
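To make "many heads, then combine" concrete, here is a minimal NumPy sketch. It assumes the common transformer layout: the Query, Key, and Value vectors are split into equal slices (one per head), each head runs its own attention, and the results are glued back together and passed through a final mixing matrix, called `W_o` here. All sizes and weights are random placeholders, and the head count is 4 rather than 96 to keep things small.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    n, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v  # each has shape (n, d_model)

    def split(M):
        # (n, d_model) -> (num_heads, n, d_head): one slice per head
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, n, n)
    weights = softmax(scores, axis=-1)   # a separate attention grid per head
    heads = weights @ Vh                 # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # glue heads together
    return concat @ W_o, weights         # final mix of all heads


rng = np.random.default_rng(0)
n, d_model, num_heads = 6, 16, 4  # hypothetical sizes
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

output, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(output.shape, weights.shape)
```

Each slice `weights[h]` is the heatmap for one head, so a model with 4 heads has 4 different grids for the same sentence.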

🎯 What You Just Learned

This chapter covered one of the most important ideas in modern AI. Let's recap:

The old approach (RNNs) read left to right and forgot the beginning of long sentences. This was a fundamental limitation.
Attention lets every word look at every other word directly, regardless of how far apart they are. This solved the forgetting problem.
Each word creates three things: a Query (what it's looking for), a Key (what it offers), and a Value (its actual information). The Query-Key match determines attention weights, and the Values are what gets passed along.
Attention heatmaps visualize these relationships as a grid, where each cell shows how much one word attends to another.
Multi-head attention runs many attention mechanisms in parallel, each free to specialize in a different type of relationship (grammar, meaning, position, etc.).

In the next chapter, we'll put it all together and see how attention fits into the full Transformer architecture — the complete blueprint behind GPT, ChatGPT, and all their cousins.