ChatGPT doesn't see words — it sees numbers. Let's explore how text becomes vectors.
So far in this course, we've been feeding numbers into neural networks: inputs like pixel values, coordinates, or simple measurements. And the networks learned to do useful things with those numbers, such as classifying and predicting, as you saw in the playground you just tried in Chapter 4.
But here's the thing: ChatGPT works with text — words, sentences, paragraphs, entire conversations. You type a question in plain English (or Spanish, or Japanese, or any language), and it types back an answer. How on earth does a neural network, which only understands numbers, process human language?
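This chapter is where we bridge that gap. We're going to answer two big questions: how does text get broken into pieces a computer can work with (tokens), and how do those pieces become numbers that capture meaning (embeddings)?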
By the end of this chapter, you'll understand the very first thing that happens when you type a message into ChatGPT — before any "thinking" occurs, before any response is generated. It all starts here: turning your words into numbers.
Before a language model can do anything with your text, it needs to break that text into small, manageable chunks. These chunks are called tokens. Think of tokens as the "atoms" of language for a computer — the smallest meaningful pieces it works with.
Your first instinct might be: "Just split on spaces! Each word is one token." That sounds reasonable, but it falls apart quickly. Here's why:
Modern language models use a clever trick called Byte Pair Encoding, or BPE for short. Here's the basic idea, explained step by step:
The result? Common words like "the" and "and" become single tokens. But rare or long words get split into smaller, recognizable pieces. For example:
"unbelievable" → ["un", "##believ", "##able"]
"ChatGPT" → ["Chat", "##G", "##PT"]
"the" → ["the"] (it's common enough to be its own token)
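You'll notice some tokens start with ##. This marker means "I'm a continuation: I'm part of a larger word, not a word by myself." So when you see ["un", "##believ", "##able"], the computer knows that "##believ" and "##able" attach to "un" to form one word. Without the ## marker, the computer might think each piece is a separate word. (A note on notation: the ## convention comes from WordPiece-style tokenizers such as the one used by BERT, and this chapter borrows it because it's easy to read. GPT's own BPE tokenizer marks word boundaries differently, by attaching the leading space to the token that starts a word, but the idea of splitting words into reusable pieces is the same.)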
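If you'd like to see the merging idea behind BPE in code, here is a minimal sketch of how the merge rules might be learned. It is a toy under stated assumptions: it works on a tiny hand-made word list, starts from single characters rather than bytes, and the function name `learn_bpe_merges` is made up for this example. It is not the tokenizer ChatGPT actually uses.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE-style merge rules from a list of words (toy version)."""
    # Start with every word split into single characters, with its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols appears.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # Every word is already a single symbol.
        # Pick the most frequent pair (on ties, max keeps the first one seen).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, gluing each occurrence of the best pair together.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After just two merges, the frequent sequence "low" has become a single symbol, while the rarer endings "er" and "est" are still separate characters. Run more merges on more text and you get a vocabulary where common words are whole and rare words are built from pieces.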
GPT-4 has a vocabulary of roughly 100,000 tokens. That might sound like a lot, but remember — it covers every language, programming code, emoji, mathematical notation, and more. Each token gets assigned a unique ID number (like token #4821 = "hello", token #952 = "##ing"). These ID numbers are what actually get fed into the neural network.
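The ID lookup itself is nothing more than a dictionary. Here is a small sketch, assuming a toy vocabulary: the IDs for "hello" and "##ing" follow the examples above, and every other entry (including the `[UNK]` fallback for unknown tokens) is made up for illustration.

```python
# Toy vocabulary. The IDs for "hello" and "##ing" follow the examples in the
# text; every other entry is invented for this sketch.
vocab = {
    "[UNK]": 0,
    "the": 12,
    "un": 310,
    "##ing": 952,
    "##able": 1507,
    "##believ": 2980,
    "hello": 4821,
}
id_to_token = {token_id: token for token, token_id in vocab.items()}

def encode(tokens):
    """Map token strings to ID numbers; unknown tokens fall back to [UNK]."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]

def decode(token_ids):
    """Map ID numbers back to token strings."""
    return [id_to_token[token_id] for token_id in token_ids]

ids = encode(["un", "##believ", "##able"])
print(ids)          # [310, 2980, 1507]
print(decode(ids))  # ['un', '##believ', '##able']
```

The list of integers is all the network ever receives. The strings exist only on the way in and on the way out.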
Type something in the box below and watch it get tokenized. Notice how longer or unusual words get split into subword pieces (marked with ##), while short common words stay whole.
What just happened? The tokenizer examined your text and split it into pieces from its vocabulary. Each colored chip is one token — one entry that the neural network will process. Notice that common short words stay whole, while longer words get broken into familiar sub-pieces. This is exactly what happens inside ChatGPT before it even begins to "think" about your message.
Try typing something unusual — a made-up word, a long compound word like "antidisestablishmentarianism", or even some gibberish. Watch how the tokenizer still breaks it into recognizable pieces. That's the beauty of BPE: it can handle any text, even words it's never seen before, by splitting them into known sub-parts.
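To see why no input can ever "break" a subword tokenizer, here is a rough sketch of splitting a word into known pieces. Two assumptions to flag: it uses greedy longest-match splitting (the approach WordPiece-style tokenizers take, which fits the ## notation in this chapter) rather than replaying BPE merge rules, and the tiny vocabulary is invented for the example.

```python
import string

def tokenize_word(word, vocab):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining chunk first, then shrink it one character at a time.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # Mark pieces that continue a word.
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # Nothing in the vocabulary fits this character.
        pieces.append(match)
        start = end
    return pieces

# Made-up toy vocabulary: a few learned pieces plus every single lowercase letter,
# so any lowercase word can be spelled out as a last resort.
vocab = {"un", "##believ", "##able", "the"}
vocab |= set(string.ascii_lowercase)
vocab |= {"##" + letter for letter in string.ascii_lowercase}

print(tokenize_word("unbelievable", vocab))  # ['un', '##believ', '##able']
print(tokenize_word("zq", vocab))            # ['z', '##q']
```

Because single letters are in the vocabulary, even gibberish like "zq" gets tokenized. It just costs more tokens than a common word would.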
Okay, so we've chopped our text into tokens and assigned each one an ID number. But here's a problem: those ID numbers are arbitrary. Token #4821 ("hello") and token #4822 ("help") sit right next to each other in the list, yet their meanings have little in common. Meanwhile, two words that really do mean nearly the same thing, like "hello" and "hi," might have IDs thousands apart. Being numerically close says nothing about meaning. An ID is just a position in a list.
We need something better. We need a way to represent words as numbers where similar words have similar numbers. That's where embeddings come in.
An embedding turns each token into a vector. And a vector is literally just a list of numbers. That's it. No fancy math needed to understand the concept.
Here's an analogy: your house has a location on Earth, described by two numbers — latitude and longitude. Those two numbers are a "vector" that pinpoints where you live. Houses that are near each other have similar latitude and longitude values.
Word embeddings work the same way, except instead of physical location, they describe location in "meaning space." Instead of just 2 numbers (latitude and longitude), each token gets hundreds or even thousands of numbers, each one capturing a different aspect of its meaning. You can think of it as giving every word coordinates in a vast, multidimensional map of meaning.
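In code, an embedding is just a big table with one row of numbers per token ID. The sketch below assumes a made-up vocabulary of 6 tokens and only 4 numbers per token so it stays readable, and it fills the table with random values, which is roughly how such a table looks before any training has happened.

```python
import random

random.seed(0)   # Fixed seed so the sketch is repeatable.
VOCAB_SIZE = 6   # Toy size; real vocabularies have around 100,000 entries.
EMBED_DIM = 4    # Toy size; real models use hundreds or thousands of numbers.

# One row per token ID, each row a list of EMBED_DIM numbers.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Look up the vector (row) for each token ID."""
    return [embedding_table[token_id] for token_id in token_ids]

vectors = embed([2, 3, 4])
print(len(vectors), "tokens, each with", len(vectors[0]), "numbers")
```

Looking up an embedding involves no calculation at all. The token ID is simply a row number in the table.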
Here's what makes embeddings useful: words with similar meanings end up at similar coordinates. "Dog" and "cat" are close together. "King" and "queen" are close together. "Happy" and "joyful" are practically on top of each other.
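A common way to measure "closeness" between two vectors is cosine similarity, which scores how much they point in the same direction (1 means the same direction, 0 means unrelated). Here is a small sketch. The three vectors are hand-picked toy values, not taken from any trained model.

```python
import math

def cosine_similarity(a, b):
    """Score how closely two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-picked toy vectors, for illustration only.
toy_embeddings = {
    "dog":  [0.90, 0.80, 0.10],
    "cat":  [0.85, 0.75, 0.20],
    "king": [0.10, 0.20, 0.95],
}

dog_cat = cosine_similarity(toy_embeddings["dog"], toy_embeddings["cat"])
dog_king = cosine_similarity(toy_embeddings["dog"], toy_embeddings["king"])
print(f"dog vs cat:  {dog_cat:.2f}")
print(f"dog vs king: {dog_king:.2f}")
```

With these toy values, "dog" and "cat" score much higher than "dog" and "king," which is the pattern you'd expect from real embeddings too.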
But it gets even cooler. The directions in this space capture relationships. There's a well-known example that surprised researchers when they first found it:
king − man + woman ≈ queen
Yes, really. If you take the embedding vector for "king," subtract the vector for "man," and add the vector for "woman," the result is a point in space that's very close to "queen." The direction from "man" to "woman" captures something about gender, and you can apply that direction to other words. This has been observed in real trained embeddings such as word2vec, so it isn't just a teaching simplification. It is approximate, though: the result lands near "queen" rather than exactly on it, and not every analogy works this cleanly.
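The vector arithmetic can be sketched with a few toy vectors. Everything here is an assumption for illustration: the 2D vectors are hand-picked so that one number loosely means "royalty" and the other loosely means "gender," and the `analogy` helper is made up for this example. Like typical analogy tests on real embeddings, it skips the three input words when searching for the nearest match.

```python
import math

# Hand-picked 2D toy vectors (not from any trained model).
# First number: loosely "royalty". Second number: loosely "gender".
toy_embeddings = {
    "king":  [0.9, 0.1],
    "queen": [0.9, 0.9],
    "man":   [0.1, 0.1],
    "woman": [0.1, 0.9],
    "apple": [0.0, 0.5],
}

def analogy(a, b, c, embeddings):
    """Return the word nearest to a - b + c, skipping the three input words."""
    target = [
        x - y + z
        for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    candidates = [word for word in embeddings if word not in (a, b, c)]
    return min(candidates, key=lambda word: math.dist(embeddings[word], target))

print(analogy("king", "man", "woman", toy_embeddings))  # queen
```

In trained embeddings nobody hand-picks what each number means, and the "gender direction" is spread across many dimensions, but the arithmetic is the same.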
You might wonder: who decides that "king" should be at coordinates [0.7, 0.8, ...] and "queen" at [0.75, 0.85, ...]? The answer is: nobody. The network learns them during training.
Here's the rough idea: the embedding starts as random numbers. During training, the network reads billions of sentences. Every time it sees "dog" and "puppy" used in similar contexts (like "I walked my ___" or "The ___ barked"), it nudges their embedding vectors a tiny bit closer together. Over millions of examples, words that appear in similar contexts end up with similar embeddings. It's like the network is building its own dictionary of meaning — not from definitions, but from patterns of usage.
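Here is a tiny sketch of the "similar contexts lead to similar vectors" idea. To be clear about the swap: this is not how a neural network trains. Instead of nudging random vectors, it simply counts each word's neighbors in a handful of made-up sentences and compares those counts. It is a simpler, count-based stand-in for the same principle.

```python
import math
from collections import Counter, defaultdict

# Made-up mini corpus for illustration.
sentences = [
    "i walked my dog",
    "i walked my puppy",
    "the dog barked",
    "the puppy barked",
    "i ate my sandwich",
    "the sandwich tasted good",
]

def context_counts(sentences, window=1):
    """For each word, count the words that appear within `window` positions of it."""
    contexts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    contexts[word][words[j]] += 1
    return contexts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

contexts = context_counts(sentences)
dog_puppy = cosine(contexts["dog"], contexts["puppy"])
dog_sandwich = cosine(contexts["dog"], contexts["sandwich"])
print(f"dog vs puppy:    {dog_puppy:.2f}")
print(f"dog vs sandwich: {dog_sandwich:.2f}")
```

"Dog" and "puppy" share all of their neighbors in this mini corpus, so they score higher than "dog" and "sandwich." A trained network reaches a similar outcome by adjusting its embedding vectors over billions of sentences rather than by counting.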
The scatter plot below shows words positioned according to their embeddings, reduced to 2D so we can actually see them. Notice how words cluster by category: animals together, emotions together, royalty together. You can drag the words around to see what happens when you move them — but notice how the original positions already group similar words together.
What just happened? You're looking at a simplified version of what happens inside every language model. Each of those dots represents a word's embedding, its "coordinates" in meaning space. In this demo, we're showing just 2 dimensions so it fits on your screen. Real models use far more: the largest GPT-3 model uses 12,288 dimensions, which means 12,288 numbers for every single token! (OpenAI has not published the figure for GPT-4.) We obviously can't visualize 12,288-dimensional space, but the principle is the same: similar words cluster together, and the directions between words capture meaningful relationships.
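Let's recap the two big ideas from this chapter:
Tokenization: text is split into tokens drawn from a fixed vocabulary, and each token gets an ID number.
Embeddings: each token ID is mapped to a learned vector, so that tokens with similar meanings end up with similar vectors.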
So when you type "Hello, how are you?" into ChatGPT, the first thing that happens is: your text gets tokenized into pieces, and each piece gets converted into a long list of numbers via its embedding. Only then does the actual neural network start processing — which brings us to the next chapter, where we'll learn about the most important innovation in modern AI: attention.