How Large Language Models Work - Part 2

Last time, we trained a simple language model to predict the next character in a text. It wasn’t very good at creating Shakespeare, but it got the basics right. Now, let’s build on that foundation and explore what makes modern LLMs so much more powerful. We’ll uncover the secret of “self-attention,” see how bigger models lead to better performance, and touch on how cloud GPUs make all of this possible.

As always, the code for this post is available on GitHub.

The Power of Self-Attention

In the previous post, our simple model treated all input characters the same. However, in a sentence, not all words matter equally; some words are more relevant to the meaning than others. Let’s take our sentence from the previous post:

“The cat sat on the…”

In this sentence, the words “cat” and “sat” are more important for predicting the next word than “the” or “on”. Our previous model didn’t know this, so it treated all the words the same. Self-attention allows the model to put more weight on the words that are most relevant to the prediction.

The Self-Attention Mechanism

Each word (or in our example model, each character) has three properties:

  1. Key: This represents what the token “means” or its “identity”.
  2. Query: This represents what the token is looking for in other tokens.
  3. Value: This represents the actual information the token carries and can share with other tokens.

In our previous model, we essentially only had the “Value” property. Now, for the word “cat”, a simple example might look like this:

  1. Key: “I’m a subject/noun.”
  2. Query: “What’s the action/verb associated with me?”
  3. Value: “I’m a cat”

When the model is predicting the next word, it will compare the Query of “cat” with the Keys of all other tokens in the input sequence. This comparison generates a score that represents how closely related the tokens are. Higher scores mean the tokens are more relevant to each other. It will then use these scores to create a weighted combination of the Values from all tokens.

This may sound a bit complicated, but all we’re doing is allowing the model to focus on the most relevant words rather than treating them all the same. It’s the secret ingredient that lets models understand context and long-range relationships across entire paragraphs.
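
To make this concrete, here is a minimal sketch of a single self-attention head in PyTorch. The names and sizes (n_embd, head_size, block_size) are illustrative placeholders rather than the exact ones used in the accompanying code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionHead(nn.Module):
    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        # Each token projects its embedding into a key, a query and a value vector.
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Mask so a token can only attend to itself and earlier tokens.
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):  # x has shape (batch, time, n_embd)
        B, T, C = x.shape
        k = self.key(x)    # what each token "is"
        q = self.query(x)  # what each token is looking for
        v = self.value(x)  # what each token shares with the others
        # Compare every query with every key to get attention scores.
        scores = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (batch, time, time)
        scores = scores.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        weights = F.softmax(scores, dim=-1)  # higher score = more attention
        return weights @ v                   # weighted combination of the values

head = SelfAttentionHead(n_embd=32, head_size=16, block_size=16)
out = head(torch.randn(1, 16, 32))  # one batch of 16 tokens -> output shape (1, 16, 16)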

Training the Self-Attention Model

Just like in the previous post, we train the model by feeding it a lot of text and making adjustments to minimize the error between the predicted output and the actual output. But this time it will also be refining its keys and queries, learning how to better understand the relationships between words.
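
The training loop itself is the same idea as before. As a rough, self-contained sketch (the toy model, batch shapes and learning rate here are stand-ins, not the real code):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for the real model: an embedding table plus a linear layer that
# predicts the next character. The real model would also include self-attention.
vocab_size, n_embd = 65, 32
model = nn.Sequential(nn.Embedding(vocab_size, n_embd), nn.Linear(n_embd, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

xb = torch.randint(0, vocab_size, (4, 16))  # a batch of input characters
yb = torch.randint(0, vocab_size, (4, 16))  # the characters that actually come next

logits = model(xb)                                                # predictions
loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))  # how wrong were we?
optimizer.zero_grad(set_to_none=True)
loss.backward()   # in the real model this also produces gradients for the keys and queries
optimizer.step()  # nudge every parameter to reduce the error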

Our previous model achieved a perplexity score of 11.8. Let’s run an equivalent model with self-attention and see how it performs:

step 500: perplexity: 8.7 
step 1000: perplexity: 7.6 
step 1500: perplexity: 7.1 
step 2000: perplexity: 6.8 
step 2500: perplexity: 6.6 
step 3000: perplexity: 6.4 
step 3500: perplexity: 6.4 
step 4000: perplexity: 6.2

The perplexity score has dropped from 11.8 to 6.2. Roughly speaking, that means the model is now about as uncertain as if it were choosing between 6.2 equally likely characters at each step, down from 11.8. That’s a huge improvement!
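
Under the hood, perplexity is just the exponential of the average cross-entropy loss the model is trained to minimise. A tiny illustration (the loss value below is made up, not taken from the actual run):

import math

avg_cross_entropy_loss = 1.825  # hypothetical value for illustration
perplexity = math.exp(avg_cross_entropy_loss)
print(round(perplexity, 1))     # 6.2 - like guessing between ~6 equally likely characters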

Let’s see what text the model generates:

your bes, you as: nowful hear of nold; ward. bath hed lequieds a firt nobor, we have's can to fordids me forim stul of them grabless wind mons to rewind, gays, buthit gner upon
to til Roe sive ve thoughd pods.
That thols, and tame, you seas, fightr, dors soo stae woung, tet, is of com staiser.

Sher.'t Caren: out be het righs. your peaut wapes, in sie;
This to my, this, twer one and med fea kin, we balwans fight rove Rorand don-hil man
I word, seer no, awrad youghtrise:
Stto hy pove, and hond an

This is still far from perfect, but it’s a lot better than our previous model. There are some recognisable words and hints of sentence structure, but still some way to go. We now have the fundamental building blocks in place - the next step is to make things bigger.

Scaling Up

To make our language model smarter, we need to expand four key components: block size, embeddings, layers, and heads. These upgrades help the model understand more complex patterns in language.

Block Size: Longer Context Windows

Our simple model could only look at 16 characters at a time. Modern models can handle much longer sequences - GPT-4o can process up to 128,000 tokens at once. This longer “context window” (sketched in code after the list below) lets the model:

  • Understand relationships across paragraphs and pages
  • Remember details from earlier in a conversation
  • Generate longer, more coherent text
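
Here is a small sketch of what the block size actually controls during training: each example is simply a block_size-long slice of the text, with the target being that slice shifted by one character. The example text and variable names are illustrative:

import torch

text = "The cat sat on the mat and looked out of the window."
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}  # map each character to an integer
data = torch.tensor([stoi[c] for c in text])

block_size = 16
i = 0                                 # starting position of one training example
x = data[i : i + block_size]          # what the model is allowed to look at
y = data[i + 1 : i + block_size + 1]  # the characters it should predict next
print(x.shape, y.shape)               # torch.Size([16]) torch.Size([16])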

Embeddings: Richer Word Representations

In our simple model, each word or character was represented by a single number. But this doesn’t capture meaning very well. Modern language models use embeddings, which represent each token as a whole list of numbers (a vector) instead.

Think of embeddings as placing words on a map (there’s a short code sketch after the list below):

  • Words with similar meanings, like “cat” and “dog,” are closer together.
  • Words with different meanings, like “cat” and “table,” are further apart.
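
In code, an embedding is just a learned lookup table of vectors. A tiny sketch, with a made-up vocabulary and sizes (before any training the similarities will be essentially random):

import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, n_embd = 10, 8  # a 10-token vocabulary, 8 numbers per token
embedding = nn.Embedding(vocab_size, n_embd)

cat, dog, table = torch.tensor([0]), torch.tensor([1]), torch.tensor([2])
# After training, tokens with similar meanings end up with similar vectors, so the
# cosine similarity of "cat" and "dog" would be higher than that of "cat" and "table".
print(F.cosine_similarity(embedding(cat), embedding(dog)))
print(F.cosine_similarity(embedding(cat), embedding(table)))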

Layers: Deeper Understanding

Our simple model processed its input in a single pass. Modern models stack multiple layers on top of each other, and each layer refines the information from the previous one, allowing the model to capture more complex relationships (see the sketch after the list below).

  • Early layers find simple patterns, like grammar rules or nearby word connections.
  • Middle layers discover deeper relationships, like how words in a sentence relate to each other.
  • Later layers focus on the overall meaning or context of the input.
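
Stacking layers looks something like this in PyTorch. The block below is a simplified stand-in; real transformer blocks contain self-attention plus a feed-forward network:

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Simplified stand-in for a transformer block."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(nn.LayerNorm(n_embd), nn.Linear(n_embd, n_embd), nn.ReLU())

    def forward(self, x):
        return x + self.net(x)  # residual connection: refine the input, don't replace it

n_layer, n_embd = 4, 32
blocks = nn.Sequential(*[TransformerBlock(n_embd) for _ in range(n_layer)])
out = blocks(torch.randn(1, 16, n_embd))  # each block refines the previous one's output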

Heads: Looking at Language from Different Angles

Self-attention allows tokens to focus on the most relevant parts of a sentence. But there’s often more than one way to interpret text. That’s why modern models use multi-head attention.

Each head focuses on a specific type of relationship (a code sketch follows the list):

  • One head might track which nouns connect to which verbs.
  • Another might focus on the order of words.
  • Yet another might look at how ideas relate across longer distances.
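
PyTorch ships a ready-made module for multi-head attention, which gives a feel for the shapes involved; the sizes below are illustrative:

import torch
import torch.nn as nn

n_embd, n_head = 192, 3
# Three heads run in parallel, each with its own keys, queries and values,
# and their outputs are combined back into one 192-dimensional vector per token.
mha = nn.MultiheadAttention(embed_dim=n_embd, num_heads=n_head, batch_first=True)

x = torch.randn(1, 64, n_embd)        # 64 tokens, each a 192-dimensional embedding
out, attn_weights = mha(x, x, x)      # self-attention: queries, keys and values all come from x
print(out.shape, attn_weights.shape)  # (1, 64, 192) and (1, 64, 64)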

Why Scaling Works

By increasing the size of embeddings, the number of layers, and the number of heads, we give the model the ability to learn and process far more complex patterns in language. This is why modern LLMs are so much better at understanding context, answering questions, and generating coherent text than simpler models.

With these improvements, our model is no longer just a basic next-word predictor—it’s starting to understand language in ways that feel almost human.

Training the Model

We can now train the model with a longer block size and more embeddings, layers, and heads. Let’s try the following settings; a configuration sketch follows the list:

  • 64 block size
  • 192 embeddings
  • 4 layers
  • 3 heads
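
In code this is just a handful of hyperparameters, something along these lines (the variable names are assumptions about the accompanying code rather than its exact configuration):

block_size = 64  # how many characters of context the model sees at once
n_embd = 192     # numbers per token in each embedding vector
n_layer = 4      # how many transformer blocks are stacked
n_head = 3       # attention heads per block (192 / 3 = 64 dimensions per head)

Training with these settings gives:
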
step 500: perplexity: 7.6 
step 1000: perplexity: 6.7 
step 1500: perplexity: 6.2 
step 2000: perplexity: 6.0 
step 2500: perplexity: 5.8 
step 3000: perplexity: 5.7 
step 3500: perplexity: 5.5 
step 4000: perplexity: 5.5

Perplexity has dropped from 6.2 to 5.5. Let’s see what this means for the text it generates:

LADY CAPULET:
King Esell own so it not by
here to I will love thee.

SICINIUS:
Action; porder.
Who on the me proction.
Now you the words my mercest fornight
That a reconk lay to before-jasts a contern:
O grath?

COMINIUS:
No fro to'th with your gruerm of mror now
on yous: to't!' then't Cabsughtly senator
Your chage
Fausenue to the such in shive.
Will bord, my exaincin, and M

OK, now we’re getting somewhere! We’ve got recognisable words and a recognisable sentence structure, and scaling up the model has clearly helped it capture more complex patterns in language.

Can we do better?

This model is still tiny compared to the latest models, which use embedding sizes in the thousands, dozens of layers, and block sizes of well over 100,000 tokens! So why don’t we just bump up the numbers and see what happens?

The model above took about 4 minutes to train on my laptop. If we increase these hyperparameters much more, training is going to take a very long time. To take this to the next level, we’d need more powerful hardware: Graphics Processing Units (GPUs). And to get a good model, you need to use lots of them!

For example, the recently released DeepSeek-V3 was trained for a combined 2.66 million GPU hours on Nvidia H800 GPUs. These cost around $40k each to buy, or around $2 per hour to rent, putting the training cost at roughly $5.32 million. And this model amazed experts by being comparatively cheap to train! That’s why training state-of-the-art models is so expensive.

Conclusion

In this post, we’ve seen how self-attention allows the model to focus on the most relevant parts of a sentence, and how scaling up the model helps it capture more complex patterns in language. That’s it for now in this series. I might come back to this topic in the future to look at tokenisation and how to run distributed training on cloud GPUs.