The architecture behind ChatGPT, Claude, Gemini, and virtually every modern AI — explained from scratch.
If you've used ChatGPT, Claude, Gemini, or any modern AI chatbot, you've been talking to a Transformer. A Transformer is not a specific product or company — it's an architecture, which is a fancy word for "blueprint" or "design pattern." Just like every house needs a blueprint that says where the walls, doors, and windows go, every AI language model needs an architecture that says how information flows through it. The Transformer is that blueprint, and it's the one that nearly every major AI company uses today.
The Transformer was invented in 2017 by a team of researchers at Google, in a paper titled "Attention Is All You Need." Before this paper, AI researchers used older designs called RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks). These older designs processed text one word at a time, like reading a book by looking at one word, then the next, then the next. This was painfully slow, and worse — by the time the model reached the end of a long sentence, it had often "forgotten" what was at the beginning. Imagine trying to understand a whole paragraph but only being allowed to look at one word at a time, and your memory gets fuzzier with each new word. That was the problem.
Transformers solved this by processing all words at once (in parallel) and using attention (Chapter 6!) to let every word "look at" every other word simultaneously. Instead of reading left-to-right and hoping you remember the beginning, a Transformer sees the entire sentence at once and decides which words are important for understanding each other word. This is not only more accurate — it's massively faster, because modern computer chips (GPUs) are built to do many calculations at the same time.
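A single Transformer layer has two main parts: a self-attention step, where tokens share information with each other, and a feed-forward network, where each token's representation is processed on its own.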
Real language models stack many of these layers on top of each other. Each layer refines the model's understanding a little more. Think of it like editing an essay: the first pass catches the basic meaning, the second pass catches nuance, the third catches subtle implications, and so on. The smallest GPT-2 model stacks 12 layers. The largest GPT-3 model stacks 96 layers. GPT-4 reportedly uses 120+ layers, though OpenAI has not confirmed this. More layers generally means deeper understanding, but also more computation.
Now let's watch a Transformer in action. The interactive demo below shows data flowing through a single Transformer layer, processing the sentence "The cat sat on the ___" and predicting what word comes next.
We're going to feed the sentence "The cat sat on the ___" into a Transformer and see how it predicts the missing word. The blank (___) represents the position where the model will generate its prediction. Every time ChatGPT writes a word, it's doing exactly this — predicting what comes next based on everything before it.
Remember tokenization from Chapter 5? The very first thing a Transformer does is split the
input text into tokens — small pieces that the model can work with. In our example,
"The cat sat on the ___" becomes six tokens: The, cat, sat,
on, the, and ___. In real models, tokens aren't always whole words —
a long word like "understanding" might become two tokens: "under" + "standing". But the idea is the same:
break text into manageable pieces.
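To make the idea concrete, here is a minimal sketch of tokenization. It assumes the simplest possible rule (split on spaces) and a made-up vocabulary built on the fly; the names `tokenize` and `build_vocab` are hypothetical helpers for illustration. Real models use learned sub-word tokenizers instead.

```python
def tokenize(text):
    """Split on whitespace. Real tokenizers use learned sub-word rules instead."""
    return text.split()


def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


tokens = tokenize("The cat sat on the ___")
vocab = build_vocab(tokens)
token_ids = [vocab[tok] for tok in tokens]

print(tokens)
print(token_ids)
```

Notice that "The" and "the" get different IDs in this toy version, because the split is case-sensitive.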
Computers can't think about words directly — they need numbers. So each token gets converted into a
vector (a list of numbers) called an embedding (Chapter 5). The word "cat"
might become something like [0.2, -0.5, 0.8, 1.1, ...] — a list of hundreds of numbers that
capture its meaning. Words with similar meanings (like "cat" and "kitten") end up with similar numbers.
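One common way to check whether two embeddings are "similar" is cosine similarity, which measures whether two vectors point in the same direction. The sketch below uses tiny 4-number vectors that are invented for illustration; real embeddings are learned during training and have hundreds of numbers.

```python
import math


def cosine_similarity(a, b):
    """Return a value near 1.0 for similar directions, lower for dissimilar ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, made up for this example.
embeddings = {
    "cat": [0.2, -0.5, 0.8, 1.1],
    "kitten": [0.3, -0.4, 0.7, 1.0],
    "table": [-0.9, 0.6, 0.1, -0.2],
}

cat_vs_kitten = cosine_similarity(embeddings["cat"], embeddings["kitten"])
cat_vs_table = cosine_similarity(embeddings["cat"], embeddings["table"])
print(cat_vs_kitten, cat_vs_table)
```

With these made-up numbers, "cat" scores much closer to "kitten" than to "table".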
But there's a problem: if we just use embeddings, the model can't tell the difference between "The cat sat on the mat" and "The mat sat on the cat" — the same words, different order, very different meaning! So the Transformer adds positional encoding — extra numbers that tell the model where each word appears in the sentence. Position 1, position 2, position 3, etc. Now the model knows both what each word means AND where it appears.
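Here is a sketch of one way to build those "extra numbers": the sine-and-cosine scheme from the original Transformer paper. Many newer models use other schemes, so treat this as one option rather than the method every model uses. Positions are counted from 0 here, and the helper names and the toy embedding are assumptions for illustration.

```python
import math


def positional_encoding(position, dim):
    """Sinusoidal encoding: even indices use sine, odd indices use cosine."""
    encoding = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def add_position(embedding, position):
    """Add the position's encoding to the word's embedding, number by number."""
    pe = positional_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]


cat = [0.2, -0.5, 0.8, 1.1]         # hypothetical toy embedding
cat_at_start = add_position(cat, 0)
cat_at_end = add_position(cat, 4)
print(cat_at_start)
print(cat_at_end)
```

The same word now produces a different vector depending on where it sits, which is what lets the model tell "cat sat on the mat" apart from "mat sat on the cat".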
This is the magic ingredient — the thing that makes Transformers special. In self-attention (Chapter 6), every token gets to "look at" every other token and decide which ones are most relevant.
When processing the word "sat," the attention mechanism might learn that "cat" is very relevant (it's the one doing the sitting), and that "on" is relevant (it tells us sat WHERE). Meanwhile, when processing "the" (the second one), attention might focus on "___" because it's part of the phrase "the ___": whatever the blank is, "the" is its article.
The curved lines in the diagram represent these attention connections. Every word connects to every other word, but some connections are stronger (higher attention weight) than others. This is how the model builds understanding — not from individual words in isolation, but from how they relate to each other.
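The attention weights described above can be sketched in a few lines. This is a simplified version of scaled dot-product attention: it assumes the token vectors are used directly as queries, keys, and values, whereas real models first pass them through learned weight matrices. The 2-number vectors are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """For each token, weight every token's value by how well its key matches."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical toy vectors for three tokens.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(vectors, vectors, vectors)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's attention weights over all tokens: the "strength" of each curved line.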
After attention has mixed information between words, each word's representation passes through a small feed-forward neural network — the same kind of network you built in Chapters 1–3! It's just layers of neurons with weights and activation functions.
While attention lets words share information, the feed-forward network lets each word think deeply about the information it gathered. Think of it this way: attention is like a group discussion where everyone shares their perspective, and the feed-forward network is like going home afterward and really processing what you heard. This step adds layers of nuance and abstract understanding that simple attention can't capture alone.
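As a rough sketch, the feed-forward step is two linear layers with an activation function in between, applied to each token's vector separately. The weights below are invented for illustration (real ones are learned), and ReLU is assumed as the activation; many models use other activations.

```python
def relu(x):
    """Activation function: keep positive values, zero out negative ones."""
    return max(0.0, x)


def linear(vector, weights, bias):
    """One layer of neurons: a weighted sum plus a bias for each output."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


def feed_forward(vector, w1, b1, w2, b2):
    """Expand to a wider hidden layer, apply ReLU, then project back down."""
    hidden = [relu(h) for h in linear(vector, w1, b1)]
    return linear(hidden, w2, b2)


# Hypothetical weights: 2 inputs -> 4 hidden neurons -> 2 outputs.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[0.5, 0.5, 0.0, 0.0], [1.0, 0.0, 1.0, 0.0]]
b2 = [0.1, -0.1]

token_vector = [1.0, -2.0]
ffn_output = feed_forward(token_vector, w1, b1, w2, b2)
print(ffn_output)
```

Unlike attention, this function never looks at the other tokens: it runs once per token, using the same weights each time.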
Now we need to actually predict a word. The model takes the final vector (list of numbers) for the ___
position and multiplies it by a large weight matrix called the vocabulary projection.
The model's vocabulary might contain 50,000+ words and word-pieces. For each one, the model calculates a score:
how likely is this word to come next?
These scores get converted to probabilities using a function called softmax, which makes all the numbers positive and sum to 100%. So you might get: "mat" = 42%, "floor" = 18%, "table" = 12%, "ground" = 9%, and thousands of other words sharing the remaining 19%.
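The scoring and softmax steps can be sketched like this. The four-word vocabulary, the weights, and the final vector are all made up for illustration, so the resulting percentages will not match the ones quoted above.

```python
import math


def vocabulary_scores(final_vector, output_weights):
    """One score per vocabulary entry: dot product of its weights with the final vector."""
    return {
        word: sum(w * x for w, x in zip(row, final_vector))
        for word, row in output_weights.items()
    }


def softmax(scores):
    """Make every score positive and scale them so they sum to 1 (that is, 100%)."""
    peak = max(scores.values())
    exps = {word: math.exp(s - peak) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}


# Hypothetical values; a real model has tens of thousands of rows.
output_weights = {
    "mat": [1.0, 0.5],
    "floor": [0.6, 0.2],
    "table": [0.3, 0.3],
    "banana": [-1.0, -0.5],
}
final_vector = [2.0, 1.0]

scores = vocabulary_scores(final_vector, output_weights)
probabilities = softmax(scores)
for word, p in probabilities.items():
    print(f"{word}: {p:.1%}")
```

Subtracting the largest score before calling `math.exp` does not change the result; it only keeps the numbers from overflowing.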
Finally, the model picks a word. The simplest approach is to always pick the word with the highest probability (this is called greedy decoding). But often, models use temperature to add controlled randomness — at low temperature, the model almost always picks the top word ("mat"); at high temperature, it might surprise you with "blanket" or "carpet." This is the same temperature concept from our training demos! Higher temperature = more creative but less predictable.
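Here is a sketch of greedy decoding and temperature sampling, using the example probabilities from above. As a simplification, the thousands of remaining words are lumped into a single hypothetical "<other>" entry. Real models divide the raw scores by the temperature before softmax; dividing the log-probabilities, as done here, gives the same result.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Sharpen (temperature < 1) or flatten (temperature > 1) a distribution."""
    scaled = {w: math.log(p) / temperature for w, p in probs.items()}
    peak = max(scaled.values())
    exps = {w: math.exp(s - peak) for w, s in scaled.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


def greedy(probs):
    """Always pick the single most likely word."""
    return max(probs, key=probs.get)


def sample(probs, temperature, rng):
    """Pick a word at random, weighted by the temperature-adjusted probabilities."""
    adjusted = apply_temperature(probs, temperature)
    words = list(adjusted)
    weights = [adjusted[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


probs = {"mat": 0.42, "floor": 0.18, "table": 0.12, "ground": 0.09, "<other>": 0.19}
rng = random.Random(0)  # fixed seed so runs are repeatable

cold = apply_temperature(probs, 0.5)
hot = apply_temperature(probs, 2.0)
print(greedy(probs))
print(round(cold["mat"], 2), round(hot["mat"], 2))
print(sample(probs, 1.0, rng))
```

At temperature 0.5 "mat" takes a larger share than its original 42%, and at temperature 2.0 it takes a smaller one, which leaves more room for surprises.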
You might be wondering: okay, the Transformer predicted one word ("mat"). But ChatGPT writes entire paragraphs, essays, even code. How does it go from predicting one word to writing a whole response?
The answer is beautifully simple: it does it one word at a time.
GPT is what's called an autoregressive model (a fancy word that just means "it uses its own output as input"). Here's how it works:
"The cat sat on the" → Transformer predicts: "mat""The cat sat on the mat" → Transformer predicts: ".""The cat sat on the mat." → Transformer predicts: "It""The cat sat on the mat. It" → Transformer predicts: "was"Every single time, the entire Transformer runs from scratch on the full input so far. When ChatGPT is writing the 500th word of a response, it's looking at all 499 previous words through attention to decide what word #500 should be. This is why longer responses take longer to generate — there's more to process each time.
The context window is the maximum number of tokens the model can see at once. It's like the model's working memory. GPT-3.5 had a context window of about 4,000 tokens (~3,000 words). GPT-4 Turbo expanded this to 128,000 tokens, roughly 300 pages of text. That means it can read an entire novel and answer questions about it, all in one go. Claude (the AI by Anthropic) supports up to 200,000 tokens. The bigger the context window, the more the model can "remember" during a single conversation.
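One simple way to picture the limit: if the conversation grows past the context window, something has to be left out. The sketch below assumes the simplest strategy, keeping only the most recent tokens. The function name is hypothetical, and real products may handle overflow differently.

```python
def fit_to_context(tokens, context_window):
    """Keep only the most recent tokens that fit in the context window."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    return tokens[-context_window:]


conversation = ["The", "cat", "sat", "on", "the", "mat", ".", "It", "was", "happy"]
visible = fit_to_context(conversation, 4)
print(visible)
```

With a window of 4, the model in this toy example would no longer "see" the cat at all.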
Every impressive thing a language model does — writing code, translating languages, summarizing documents, having conversations — boils down to this simple loop: read everything so far → predict the next token → add it → repeat. The "intelligence" comes from the Transformer's ability to understand context through attention, refined through billions of parameters trained on enormous amounts of text. That's it. That's the whole trick.
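That loop is short enough to write out. In the sketch below, `toy_predict_next` is a hypothetical stand-in for the Transformer: it just looks up the last token in a hand-written table, while a real model would run attention over the whole input. The loop around it has the same shape either way.

```python
def toy_predict_next(tokens):
    """Hypothetical stand-in for a Transformer: look up the last token in a table."""
    table = {"the": "mat", "mat": ".", ".": "It", "It": "was"}
    return table.get(tokens[-1], "<end>")


def generate(prompt_tokens, max_new_tokens):
    """Read everything so far, predict the next token, add it, repeat."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = toy_predict_next(tokens)
        if next_token == "<end>":
            break
        tokens.append(next_token)  # the output becomes part of the next input
    return tokens


result = generate(["The", "cat", "sat", "on", "the"], max_new_tokens=4)
print(" ".join(result))
```

The key line is `tokens.append(next_token)`: each prediction is fed back in as input, which is what "autoregressive" means.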
Next up: Chapter 8 — we'll zoom out and see what happens when you scale these building blocks from our tiny playground to the massive models powering today's AI revolution.