How Large Language Models Work - Part 1
Every day millions of people interact with Large Language Models (LLMs) like ChatGPT, Claude, and Gemini. These models can answer questions, write documents, generate code, and even create art. But how do they actually work? Let’s break down these complex systems into simple concepts that anyone can understand.
The code behind this post is available on GitHub{:target="_blank"}. It's loosely based on Andrej Karpathy's Videos{:target="_blank"} and the Attention is All You Need{:target="_blank"} paper by Vaswani et al.
The Basic Idea: Predicting What Comes Next
Imagine I say “The cat sat on the…”. You’re probably saying “mat” in your head because you’ve heard the phrase lots of times before. You’ve learned over time that for those words, the most likely next word is “mat”. LLMs work in a surprisingly similar way, just on a much larger scale. The foundation of all LLMs is the ability to generate text based on what has come before.
To have this ability, LLMs need to learn the patterns in the text they’re given. This is done by training the model on a large dataset of text. The model then learns the patterns in this text, character by character.
In these posts we'll look at how to build an LLM. In part 1 we'll start by setting up and training a very basic model, and then in part 2 we'll increase the complexity and scale. Let's dive in.
Step 1 - Encoding - Converting Text to Numbers
To train a language model we need some language. For this test we'll use The Complete Works of William Shakespeare. The text has 1.1 million characters (about 200k words).
The first step is to convert the text into a numerical format. The simplest way to do this is to create a mapping between each character and a number. In our text there are 65 unique characters (26 uppercase, 26 lowercase, and 13 numbers and special characters). So we can simply convert each character to a number from 0 to 64.
In our case we'll sort alphabetically and then convert each character to a number. So numbers 0-12 are the special characters (yes, we start from 0 - it's confusing, but that's how it's done), A = 13, B = 14, etc. It doesn't matter what order we pick as long as we're consistent.
For example, the first sentence of the text is:
First Citizen:
Before we proceed any further, hear me speak.
We can convert this to a sequence of numbers like this:
[18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 14, 43, 44, 53, 56, 43, 1, 61, 43, 1, 54, 56, 53, 41, 43, 43, 42, 1, 39, 52, 63, 1, 44, 59, 56, 58, 46, 43, 56, 6, 1, 46, 43, 39, 56, 1, 51, 43, 1, 57, 54, 43, 39, 49, 8]
You can pick out 1s as spaces and 0s as new lines.
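Here's a minimal sketch of what this encoding step might look like in Python (the filename and variable names are illustrative, not necessarily those in the post's repo):

```python
# Read the text and build the vocabulary: every unique character, sorted
text = open("input.txt").read()
chars = sorted(set(text))  # 65 unique characters

# Map each character to a number, and back again
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    """Convert a string into a list of numbers."""
    return [stoi[c] for c in s]

def decode(nums):
    """Convert a list of numbers back into a string."""
    return "".join(itos[n] for n in nums)

print(encode("First Citizen:"))  # -> [18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10]
```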
In practice, modern models split words into pieces called "tokens", rather than single characters. This lets the model handle a much bigger vocabulary - GPT-4's tokenizer has around 100k tokens, which is a lot more than our 65! But the concept is the same and we can still build a model using this simple method.
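If you want to see tokenization in action, the tiktoken library exposes the token encodings used by OpenAI's models. A quick sketch (the printed values are indicative, assuming the cl100k_base encoding):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by GPT-4
print(enc.n_vocab)                   # roughly 100k tokens in the vocabulary
print(enc.encode("First Citizen:"))  # a handful of token ids instead of 14 characters
```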
Step 2 - Block and Batch Setup
Ok, so we now have a sequence of numbers. But how do we use this to train a model? The next step is to take chunks of the sequence. Let's start by taking a chunk (or "block") of 8 numbers, in this case the first 8 numbers:
[18, 47, 56, 57, 58, 1, 15, 47]
Remember this equates to:
First Ci
From these we can make a miniature training dataset where we feed in some characters as inputs and the next character as the output:
Input: [18]
Output: [47]
Input: [18, 47]
Output: [56]
Input: [18, 47, 56]
Output: [57]
Input: [18, 47, 56, 57]
Output: [58]
Input: [18, 47, 56, 57, 58]
Output: [1]
Input: [18, 47, 56, 57, 58, 1]
Output: [15]
Input: [18, 47, 56, 57, 58, 1, 15]
Output: [47]
Input: [18, 47, 56, 57, 58, 1, 15, 47]
Output: [58]
We now have 8 pieces of data to train the model. Essentially we’re saying to it “Given some characters in a row, what’s the next character?”.
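A quick sketch of how all 8 pairs fall out of one chunk (assuming the text has been encoded into a list called `data`, using `encode()` from the earlier sketch):

```python
data = encode(text)            # the whole text as numbers
block_size = 8
chunk = data[:block_size + 1]  # 9 numbers: 8 inputs plus the final target

for t in range(block_size):
    context = chunk[:t + 1]    # everything up to position t
    target = chunk[t + 1]      # the character that should come next
    print(f"Input: {context} -> Output: [{target}]")
```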
So there are 2 key settings we need, as shown in the sketch below:
- Block size: The number of characters in the input sequence. In the example above we used 8.
- Batch size: The number of input sequences we feed into the model in one go. The above example is 1 batch.
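Putting the two together, here's a minimal sketch of how random training batches might be sampled in PyTorch (loosely following Karpathy's approach; names are illustrative):

```python
import torch

# The whole text, encoded as numbers (using encode() from step 1)
data = torch.tensor(encode(text), dtype=torch.long)

def get_batch(data, block_size=16, batch_size=32):
    """Sample batch_size random chunks of block_size characters each."""
    # Random start positions, leaving room for the final target character
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])          # inputs
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])  # targets, shifted one place
    return x, y
```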
Step 3 - The Simplest Model
Inputs
We’re now ready to train a model. For our first model we’ll use:
Block size = 16
Batch size = 32
This means we’ll take 32 random chunks of 16 characters from the text.
The Model
Imagine our model as a tiny writing assistant with a short memory. It looks at a small chunk (in this case up to 16 letters) and tries to figure out which letter should come next. Because it's small, it can't remember long phrases or complicated sentence structures. Instead, it focuses on little snippets and learns patterns like:
After the letters “ca”, the next letter might be “t”.
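The simplest version of this idea is a bigram-style model: a lookup table where each character directly gives a score for every possible next character. Here's a minimal PyTorch sketch in the spirit of Karpathy's bigram example (an assumption about what the post's code does; the repo may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BigramModel(nn.Module):
    def __init__(self, vocab_size=65):
        super().__init__()
        # A 65x65 table: row i holds scores for which character follows character i
        self.logits_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, x, y=None):
        logits = self.logits_table(x)  # (batch, block, vocab_size)
        if y is None:
            return logits, None
        # Cross-entropy compares the predicted scores with the true next characters
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        return logits, loss
```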
Training
To teach this model, we show it lots of ‘question-answer’ examples of text. When the model answers correctly, we reinforce that response. When it’s wrong, we make a small adjustment inside the model so it’s less likely to repeat the mistake. So it does the following cycle:
- Input: Some sequences of numbers (characters). In our case 32 sequences (batch size) of 16 numbers (block size).
- Predict: A simple mathematical function tries to predict the next character.
- Compare: The model’s guess is compared to the real next character.
- Adjust: The model changes its internal parameters if it’s wrong.
That's one training cycle. It repeats lots of times, and the model gradually becomes better at guessing the next letter. To start with the model will be really bad, essentially predicting at random. After each step it adjusts its internal parameters to try to improve its performance.
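Putting the sketches together, one hedged version of that loop in PyTorch (the optimiser and learning rate are assumptions, not necessarily what the post's repo uses):

```python
model = BigramModel(vocab_size=65)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(4000):
    x, y = get_batch(data)      # Input: 32 random chunks of 16 characters
    logits, loss = model(x, y)  # Predict + Compare: cross-entropy loss
    optimizer.zero_grad()
    loss.backward()             # Work out how each parameter contributed to the error
    optimizer.step()            # Adjust: nudge parameters to reduce the loss
```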
Evaluation
We'll measure the performance of the model using a metric called "perplexity". Perplexity measures how well the model predicts the next character: roughly, it's the number of equally likely characters the model is effectively choosing between. A perplexity of 1 means the model predicts the next character perfectly. A perplexity of 2 means it's right about 50% of the time (like choosing between 2 characters), 4 would be about 25%, and so on.
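For the curious, perplexity is just the exponential of the average cross-entropy loss that the training loop above is already minimising:

```python
# loss is the average cross-entropy per character (natural log base)
perplexity = torch.exp(loss)
```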
Results
Here are the perplexity scores after the first 100 training steps:
step 0: perplexity: 113.7
step 20: perplexity: 104.2
step 40: perplexity: 96.2
step 60: perplexity: 89.9
step 80: perplexity: 83.9
step 100: perplexity: 76.9
The model is improving, but a score of 76.9 still means we're only predicting the right character about 1 time in every 77. The score is still falling though. Let's see what happens if we do a lot more training steps:
step 0: perplexity: 112.3
step 500: perplexity: 27.1
step 1000: perplexity: 15.6
step 1500: perplexity: 13.1
step 2000: perplexity: 12.4
step 2500: perplexity: 12.1
step 3000: perplexity: 11.9
step 3500: perplexity: 11.9
step 4000: perplexity: 11.8
The perplexity decreases with each training step but flatlines at around 11.8 after roughly 3000 steps. This means the model is now predicting the next character right about 1 time in every 12.
Generating Text
We can now generate text using the model: it picks each new character based on how likely it is to follow the characters generated so far. This won't reproduce exact Shakespeare, but it should generate text that looks like it could have been written by Shakespeare.
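Generation is a simple loop: feed in the current context, take the model's predicted distribution for the next character, sample from it, append the sample, and repeat. A minimal sketch using the model sketched above:

```python
def generate(model, max_new_tokens=200):
    # Start from a single character; index 0 is the newline in our vocabulary
    idx = torch.zeros((1, 1), dtype=torch.long)
    for _ in range(max_new_tokens):
        logits, _ = model(idx)
        probs = F.softmax(logits[:, -1, :], dim=-1)         # distribution over the next character
        next_idx = torch.multinomial(probs, num_samples=1)  # sample one character from it
        idx = torch.cat([idx, next_idx], dim=1)
    return decode(idx[0].tolist())
```

Drum roll…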
O:
Anele er fe co,
LLamer squsethaittthtr ayit tifod rer; y e ined guratosoulyequg.
BUEEd tavaperelee athavis u warray, n
We by bronond man, d cr miowivero agarlan
has,
Binksue; ain'lilavealeamy y t Isoup uge o'sth r.
What Beeethisunded orachigorsh kn, Ta cheneinhit we t,
Fr s ide Bus ithikee me;
Bul ake har apy ave I arillevVIO hineeo n:
TI ad by andulcavis, scld
Atlithe day;
Hmm, not quite there yet. Definitely not Shakespeare, but there are some recognisable words and structure. Remember, this is pretty much the simplest model possible. It was trained on my laptop in 7 seconds, while frontier models train on thousands of GPUs for several months!
Next Time
We’ve now seen how to train a simple LLM. In the next post we’ll look at how to take it to the next level:
- Modelling the input text in a more intelligent way
- Scaling up the model to make it more powerful