Chapter 5: How Computers Read Words

🌉 Bridging the Gap: From Numbers to Language

So far in this course, we've been feeding numbers into neural networks: inputs like pixel values, coordinates, or simple measurements. The networks learned to do useful things with those numbers, such as classifying and predicting, as you saw in the playground you just tried in Chapter 4.

But here's the thing: ChatGPT works with text — words, sentences, paragraphs, entire conversations. You type a question in plain English (or Spanish, or Japanese, or any language), and it types back an answer. How on earth does a neural network, which only understands numbers, process human language?

This chapter is where we bridge that gap. We're going to answer two big questions:

How do we chop text into pieces a computer can work with? (This is called tokenization.)
How do we turn those pieces into numbers that capture meaning? (This is called embedding.)

By the end of this chapter, you'll understand the very first thing that happens when you type a message into ChatGPT — before any "thinking" occurs, before any response is generated. It all starts here: turning your words into numbers.

✂️ Tokenization: Chopping Text Into Pieces

Before a language model can do anything with your text, it needs to break that text into small, manageable chunks. These chunks are called tokens. Think of tokens as the "atoms" of language for a computer — the smallest meaningful pieces it works with.

Why not just use whole words?

Your first instinct might be: "Just split on spaces! Each word is one token." That sounds reasonable, but it falls apart quickly. Here's why:

There are WAY too many words. The English language alone has over 170,000 words in current use. Add in names, slang, technical jargon, typos, other languages, and you're looking at millions. The computer would need a separate entry for every single one.
New and rare words appear all the time. What about a long word like "unforgettable" that happens to be missing from the list? Or a new name like "ChatGPT"? Or someone's username like "xXDragonSlayer99Xx"? A whole-word system would see these as completely unknown and would have no idea what to do with them.
Related words look totally different. "run", "running", "runner", and "runs" are obviously related, but as whole words they'd each get completely separate, unrelated entries.
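
To make the problem concrete, here is a minimal sketch of a whole-word tokenizer. The tiny vocabulary and the "<unk>" placeholder are assumptions made up for this example, not taken from any real model.

```python
# Toy whole-word vocabulary (hypothetical): every word form needs its own entry.
vocab = {"i", "like", "to", "run", "running", "runner", "runs"}

def word_tokenize(text):
    """Split on whitespace; any word missing from the vocabulary becomes <unk>."""
    return [w if w in vocab else "<unk>" for w in text.lower().split()]

print(word_tokenize("I like running"))   # ['i', 'like', 'running']
print(word_tokenize("I like ChatGPT"))   # ['i', 'like', '<unk>']
```

Notice that "run", "running", "runner", and "runs" each take up a separate slot, and "ChatGPT" is simply lost.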

The solution: Subword Tokenization (BPE)

Modern language models use a clever trick called Byte Pair Encoding, or BPE for short. Here's the basic idea, explained step by step:

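Start with individual characters. Your vocabulary begins with just the alphabet: a, b, c, ... z, plus punctuation and spaces. Every possible word can be spelled out character by character. (GPT-style tokenizers actually start one level lower, with the raw bytes that make up the text, which is why they can also spell out emoji and other writing systems.)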
Find the most common pair. Look through a huge pile of text (billions of words from the internet) and find which two characters appear next to each other most often. Maybe it's "t" + "h" → "th".
Merge that pair into a new token. Now "th" is a single token in your vocabulary.
Repeat thousands of times. Next most common pair might be "th" + "e" → "the". Then maybe "in" + "g" → "ing". Keep going until you have a vocabulary of about 50,000 to 100,000 tokens.
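
The four steps above can be sketched in a few lines of Python. This is a toy, character-level version run on a handful of words rather than billions; the function name train_bpe and the tiny word list are made up for illustration, and real tokenizers add many details this skips.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merges from a list of words (toy, character-level version)."""
    # Each word is a tuple of symbols, weighted by how often the word occurs.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols across the whole corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # the most common pair
        merges.append(best)
        # Rewrite every word with that pair fused into one new symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

words = ["the"] * 5 + ["then"] * 2 + ["this"]
merges, corpus = train_bpe(words, num_merges=2)
print(merges)  # [('t', 'h'), ('th', 'e')]
```

On this tiny word list, the first merge is "t" + "h" and the second is "th" + "e", so "the" ends up as a single token after just two rounds.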

The result? Common words like "the" and "and" become single tokens. But rare or long words get split into smaller, recognizable pieces. For example:

"unbelievable" might become → ["un", "##believ", "##able"]
"ChatGPT" might become → ["Chat", "##G", "##PT"]
"the" stays as → ["the"] (it's common enough to be its own token)

What's with the "##" prefix?

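You'll notice some tokens start with ##. This is a marker that means "I'm a continuation: I'm part of a larger word, not a word by myself." So when you see ["un", "##believ", "##able"], the computer knows that "##believ" and "##able" attach to "un" to form one word. Without the ## marker, the computer might think each piece is a separate word. A note on conventions: the ## marker comes from WordPiece, the tokenizer used by models like BERT, and this chapter's demo borrows it because it's easy to read. GPT-style BPE tokenizers mark word boundaries differently: they attach the preceding space to the start of a token, so " believable" with its leading space is a different token from "believable" without it.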
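
Here is a rough sketch of how a word could be split into ##-marked pieces. It uses greedy longest-match, which is the WordPiece-style approach rather than BPE's merge replay, and the vocabulary is a made-up handful of entries chosen to reproduce the examples above.

```python
def split_with_markers(word, vocab):
    """Greedy longest-match split; every piece after the first gets a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Try the longest remaining chunk first, then shrink it one character at a time.
        while end > start:
            piece = word[start:end]
            candidate = piece if start == 0 else "##" + piece
            if candidate in vocab:
                pieces.append(candidate)
                break
            end -= 1
        else:
            return ["<unk>"]  # nothing in the vocabulary fits at this position
        start = end
    return pieces

# Hypothetical vocabulary, just large enough for the examples in this chapter.
vocab = {"un", "##believ", "##able", "the", "Chat", "##G", "##PT"}

print(split_with_markers("unbelievable", vocab))  # ['un', '##believ', '##able']
print(split_with_markers("ChatGPT", vocab))       # ['Chat', '##G', '##PT']
print(split_with_markers("the", vocab))           # ['the']
```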

Real numbers

GPT-4 has a vocabulary of roughly 100,000 tokens. That might sound like a lot, but remember: it has to cover every language, programming code, emoji, mathematical notation, and more. Each token gets assigned a unique ID number. (The IDs in this chapter are invented for illustration, like token #4821 = "hello" and token #952 = "##ing".) These ID numbers are what actually get fed into the neural network.
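
In code, the token-to-ID step is nothing more than a dictionary lookup. The sketch below uses the illustrative IDs from the paragraph above plus two invented ones; none of them are real GPT-4 IDs.

```python
# Hypothetical vocabulary: every ID here is made up for illustration.
token_to_id = {"hello": 4821, "##ing": 952, "the": 11, "sing": 2070}
id_to_token = {i: tok for tok, i in token_to_id.items()}

def encode(tokens):
    """Turn a list of tokens into the ID numbers the network actually sees."""
    return [token_to_id[tok] for tok in tokens]

def decode(ids):
    """Turn ID numbers back into tokens."""
    return [id_to_token[i] for i in ids]

ids = encode(["sing", "##ing"])
print(ids)          # [2070, 952]
print(decode(ids))  # ['sing', '##ing']
```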

Try it yourself!

Type something in the box below and watch it get tokenized. Notice how longer or unusual words get split into subword pieces (marked with ##), while short common words stay whole.

What just happened? The tokenizer examined your text and split it into pieces from its vocabulary. Each colored chip is one token — one entry that the neural network will process. Notice that common short words stay whole, while longer words get broken into familiar sub-pieces. This is exactly what happens inside ChatGPT before it even begins to "think" about your message.

Try typing something unusual — a made-up word, a long compound word like "antidisestablishmentarianism", or even some gibberish. Watch how the tokenizer still breaks it into recognizable pieces. That's the beauty of BPE: it can handle any text, even words it's never seen before, by splitting them into known sub-parts.

📍 Word Embeddings: Words as Points in Space

Okay, so we've chopped our text into tokens and assigned each one an ID number. But here's a problem: those ID numbers are arbitrary. Token #4821 ("hello") and token #4822 ("help") sit right next to each other in the list, yet their meanings have little in common, while two words that mean nearly the same thing might have IDs tens of thousands apart. Being numerically close says nothing about being close in meaning; an ID is just a position in a list.

We need something better. We need a way to represent words as numbers where similar words have similar numbers. That's where embeddings come in.

What is a vector? (Don't panic — it's just a list of numbers)

An embedding turns each token into a vector. And a vector is literally just a list of numbers. That's it. No fancy math needed to understand the concept.

Here's an analogy: your house has a location on Earth, described by two numbers — latitude and longitude. Those two numbers are a "vector" that pinpoints where you live. Houses that are near each other have similar latitude and longitude values.

Word embeddings work the same way, except instead of physical location, they describe location in "meaning space." Instead of just 2 numbers (latitude and longitude), each word gets hundreds of numbers that together capture different aspects of the word's meaning. You can think of it as giving every word coordinates in a vast, multidimensional map of meaning.

Why does this work? The magic of "closeness"

Here's what makes embeddings useful: words with similar meanings end up at similar coordinates. "Dog" and "cat" are close together. "King" and "queen" are close together. "Happy" and "joyful" are practically on top of each other.
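
One common way to measure "closeness" between two vectors is cosine similarity, which is near 1 when two vectors point in the same direction and near 0 when they're unrelated. The sketch below uses hand-picked 3-number vectors; they are invented for illustration, not taken from a trained model.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings: real ones are learned and have hundreds of numbers each.
embeddings = {
    "dog":   [0.90, 0.80, 0.10],
    "cat":   [0.85, 0.75, 0.20],
    "pizza": [0.10, 0.20, 0.95],
}

dog_cat = cosine_similarity(embeddings["dog"], embeddings["cat"])
dog_pizza = cosine_similarity(embeddings["dog"], embeddings["pizza"])
print(f"dog vs cat:   {dog_cat:.2f}")    # close to 1
print(f"dog vs pizza: {dog_pizza:.2f}")  # much lower
```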

But it gets even cooler. The directions in this space capture relationships. There's a well-known example that surprised researchers when they first found it:

king − man + woman ≈ queen

Yes, really. If you take the embedding vector for "king," subtract the vector for "man," and add the vector for "woman," the result is a point in space that's very close to "queen." The direction from "man" to "woman" captures something about gender, and you can apply that direction to other words. In real trained embeddings such as word2vec, this works approximately: "queen" typically comes out as the nearest word once you set aside the three input words themselves. Not every relationship comes out this cleanly, but the effect is real.
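
Here is the same arithmetic as a minimal sketch. The 3-number vectors are hand-built so the analogy works out exactly (think of the slots as "royalty", "male", "female"); real embeddings are learned, have no such tidy labels, and only land near the answer.

```python
import math

# Hand-built toy vectors: [royalty, male, female]. Purely illustrative.
vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.5, 0.5],
}

def analogy(a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c), skipping the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(candidates, key=lambda w: math.dist(vectors[w], target))

print(analogy("king", "man", "woman"))  # queen
```

Skipping the three input words when searching for the nearest neighbor mirrors how this test is usually run on real embeddings.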

How are embeddings learned?

You might wonder: who decides that "king" should be at coordinates [0.7, 0.8, ...] and "queen" at [0.75, 0.85, ...]? The answer is: nobody. The network learns them during training.

Here's the rough idea: the embedding starts as random numbers. During training, the network reads billions of sentences. Every time it sees "dog" and "puppy" used in similar contexts (like "I walked my ___" or "The ___ barked"), it nudges their embedding vectors a tiny bit closer together. Over millions of examples, words that appear in similar contexts end up with similar embeddings. It's like the network is building its own dictionary of meaning — not from definitions, but from patterns of usage.
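
You can see the "similar contexts lead to similar vectors" idea without training a network at all. The sketch below swaps in a simpler stand-in technique: it counts each word's neighbors and compares the counts. Real models learn embeddings by gradual nudging during training, not by counting, and the six sentences are made up for the example.

```python
from collections import Counter, defaultdict
import math

sentences = [
    "i walked my dog",
    "i walked my puppy",
    "the dog barked",
    "the puppy barked",
    "i ate my pizza",
    "the pizza cooled",
]

# For each word, count the words that appear immediately before or after it.
contexts = defaultdict(Counter)
for sentence in sentences:
    words = sentence.split()
    for i, word in enumerate(words):
        for j in (i - 1, i + 1):
            if 0 <= j < len(words):
                contexts[word][words[j]] += 1

def similarity(a, b):
    """Cosine similarity between two words' neighbor-count vectors."""
    keys = set(contexts[a]) | set(contexts[b])
    dot = sum(contexts[a][k] * contexts[b][k] for k in keys)
    norm_a = math.sqrt(sum(v * v for v in contexts[a].values()))
    norm_b = math.sqrt(sum(v * v for v in contexts[b].values()))
    return dot / (norm_a * norm_b)

print(similarity("dog", "puppy") > similarity("dog", "pizza"))  # True
```

"dog" and "puppy" share all of their neighbors in this tiny corpus, so they score higher than "dog" and "pizza", which only share some.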

Explore the embedding space

The scatter plot below shows words positioned according to their embeddings, reduced to 2D so we can actually see them. Notice how words cluster by category: animals together, emotions together, royalty together. You can drag the words around to see what happens when you move them — but notice how the original positions already group similar words together.

Each dot is a word, colored by category. Words with similar meanings naturally cluster together. Drag any word to reposition it — but the initial layout shows how embeddings group related concepts.

💡 Things to Try

Type your own text in the tokenizer above — try "unbelievable" and see how BPE splits it into subwords it knows.
Try typing an emoji or a word in another language. The tokenizer can handle it because BPE works at the byte level!
In the embedding space, notice how "cat" and "dog" are close together, but "pizza" is far away. That's because training places words that are used in similar contexts at similar coordinates.
Drag a word to a different cluster — in a real model, this would change its meaning entirely!
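
The byte-level point in the list above is easy to check for yourself. Python can show the bytes behind any piece of text; the helper name to_bytes is just a label for this sketch.

```python
def to_bytes(text):
    """Return the UTF-8 bytes of a string as a list of numbers from 0 to 255."""
    return list(text.encode("utf-8"))

for text in ["a", "é", "日", "🙂"]:
    byte_values = to_bytes(text)
    print(text, len(byte_values), "byte(s):", byte_values)
```

A plain letter is one byte, an accented letter is two, and an emoji is four. Since any text at all boils down to values between 0 and 255, a tokenizer that starts from bytes never meets a character it can't spell out.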

What just happened? You're looking at a simplified version of what happens inside every language model. Each of those dots represents a word's embedding: its "coordinates" in meaning space. In this demo, we're showing just 2 dimensions so it fits on your screen. Real models use far more: the largest GPT-3 model uses 12,288 dimensions per token, which means 12,288 numbers for every single token! (OpenAI hasn't published the figure for GPT-4.) We obviously can't visualize 12,288-dimensional space, but the principle is the same: similar words cluster together, and the directions between words capture meaningful relationships.
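
One way to picture how a token ID becomes its list of numbers is a table lookup: the model keeps one row per token, and the ID picks the row. The sketch below assumes a six-token vocabulary and 4 numbers per token, both invented to keep the output small, and fills the table with random values the way an untrained model would start out.

```python
import random

random.seed(0)  # fixed seed so the random starting values are repeatable

# Hypothetical six-token vocabulary; a token's ID is its position in the list.
vocab = ["hello", ",", "how", "are", "you", "?"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

EMBED_DIM = 4  # real models use hundreds or thousands of numbers per token

# Embedding table: one row of EMBED_DIM numbers per token, random before training.
embedding_table = [
    [random.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in vocab
]

def embed(tokens):
    """Look up each token's ID, then fetch that row of the embedding table."""
    ids = [token_to_id[tok] for tok in tokens]
    return ids, [embedding_table[i] for i in ids]

tokens = ["hello", ",", "how", "are", "you", "?"]  # tokenized by hand for the demo
ids, vectors = embed(tokens)
print(ids)                            # [0, 1, 2, 3, 4, 5]
print(len(vectors), len(vectors[0]))  # 6 4
```

Training then adjusts the numbers in those rows; the lookup itself never changes.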

🎯 What You Just Learned

Let's recap the two big ideas from this chapter:

Tokenization breaks text into small pieces called tokens. Modern models use Byte Pair Encoding (BPE) to create a vocabulary of ~50,000–100,000 subword tokens. This means they can handle any word — even ones they've never seen before — by splitting it into known pieces.
Embeddings turn each token into a vector (a list of hundreds or thousands of numbers) that captures its meaning. Words with similar meanings get similar vectors, and the directions between vectors capture relationships like gender, tense, or category.

So when you type "Hello, how are you?" into ChatGPT, the first thing that happens is: your text gets tokenized into pieces, and each piece gets converted into a long list of numbers via its embedding. Only then does the actual neural network start processing — which brings us to the next chapter, where we'll learn about the most important innovation in modern AI: attention.