[BACK]

How I would describe an LLM to a person


If you clicked on this blog, you've probably heard about LLMs, or at least about ChatGPT, Gemini, or, if you're more niche, Claude. Those three are LLM products developed by their respective companies. But what is an LLM?

LLM itself stands for 'Large Language Model,' which sounds confusing, but to make it simple, think of it as just a next-word prediction model. Well, not exactly, because they don't predict the next word; they predict the next token.

What is a token?

We can think of tokens as words that a model can understand. Here's an example that we can use to compare words to tokens. I used https://tiktokenizer.vercel.app/ to help me visualize the comparison:

> [Image: tokenizer view of 'Hi my name is cole' on tiktokenizer.vercel.app]

The words 'Hi my name is cole' are translated into this sequence of tokens: '12194, 922, 1308, 382, 42876'. In this example, the count of words and the count of tokens are the same. But if we look at another example:

> [Image: tokenizer view of 'Hi my name is Rafii']

Notice that it split the word 'Rafii' into two tokens, which are '64312, 3573'. This is what the model sees, understands, and predicts.
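To make the word-versus-token difference concrete, here is a toy tokenizer. The vocabulary and token IDs below are invented for illustration (real tokenizers learn vocabularies of roughly 100,000 entries from data), but the greedy longest-match idea is similar in spirit to how subword tokenizers break text into pieces:

```python
# Toy subword tokenizer: greedy longest-match against a tiny, made-up vocabulary.
# Real tokenizers learn their vocabularies from data; these IDs are invented.

VOCAB = {
    "Hi": 1, " my": 2, " name": 3, " is": 4, " cole": 5,
    " Raf": 6, "ii": 7,
}

def tokenize(text: str) -> list[int]:
    """Repeatedly take the longest vocabulary entry matching the front of the text."""
    tokens = []
    while text:
        match = max(
            (piece for piece in VOCAB if text.startswith(piece)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"cannot tokenize: {text!r}")
        tokens.append(VOCAB[match])
        text = text[len(match):]
    return tokens

print(tokenize("Hi my name is cole"))   # five words -> five tokens
print(tokenize("Hi my name is Rafii"))  # 'Rafii' splits into ' Raf' + 'ii'
```

Common words get their own token, while rarer words like 'Rafii' get broken into smaller pieces, which is exactly what the tokenizer screenshots show.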

So, now that you know what a token is: the model simply predicts the next token based on the given input, appends that prediction to the input, and predicts again, over and over until it produces a special stop token. This is the main reason why LLMs are very compute-heavy compared to other machine learning models. Every generated token requires another full pass through the model over the ever-growing sequence: the model gets an input, generates an output, and that output becomes part of the next input until it stops.
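That generation loop can be sketched in a few lines. The "model" here is just a made-up lookup table from the last token to the most likely next token (real LLMs condition on the entire sequence, not just the last token), but the feed-the-output-back-in loop is the same:

```python
# Sketch of the autoregressive generation loop with a fake "model":
# a lookup table mapping the last token to a predicted next token.
# Real LLMs condition on the whole sequence; this is just the loop shape.

NEXT = {
    "Hi": "my", "my": "name", "name": "is", "is": "cole", "cole": "<stop>",
}

def generate(prompt: list[str], max_tokens: int = 10) -> list[str]:
    sequence = list(prompt)
    for _ in range(max_tokens):
        next_token = NEXT.get(sequence[-1], "<stop>")
        if next_token == "<stop>":   # the model predicts a special stop token
            break
        sequence.append(next_token)  # the output is fed back in as input
    return sequence

print(generate(["Hi"]))  # each step re-runs the "model" on the grown sequence
```

Notice that generating five tokens means running the model five separate times, which is where the heavy compute cost comes from.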

> [Image: diagram of the next-token generation loop]

Now that we know what LLMs do, we should learn how they can do it. I mean, how do we teach a machine learning algorithm to predict the next token? It's really not that confusing. We just take a good, big chunk of text data from the internet and let the model learn from it. But what do they need to learn? They learn the patterns of language, like which tokens are most likely to follow a given sequence of input. They learn the pattern of when to stop predicting the next token. They basically learn all of the patterns present in the text, i.e., the training data.

Why is this important?

Because this training data limits how knowledgeable an LLM is. How much the model knows depends on the data it was trained on. So if a model's training data comes from 2020 and we ask about an event that happened in 2021, the model will often still give you an answer, just a confidently wrong one. This is what we call 'hallucination'. Why does the model still give an answer even though it doesn't actually know the answer to the question? Because every predicted token has some probability behind it. For example:

> [Image: probability distribution over candidate next tokens]

The image above shows how an LLM chooses the next token based on each candidate's probability. But today's LLMs don't always choose the token with the highest probability. They use sampling strategies like top-k or top-p (ignore this if you don't want to go into more detail) to pick the next token, which gives more varied, human-like output.
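Here is a minimal sketch of top-k sampling. The candidate tokens and their probabilities are invented for the example; the point is that instead of always taking the single most probable token, we keep only the k most probable ones and sample among them in proportion to their probabilities:

```python
import random

# Top-k sampling sketch: keep the k highest-probability tokens and
# sample among them, weighted by probability. The tokens and numbers
# below are made up for illustration.

def top_k_sample(probs: dict[str, float], k: int, rng: random.Random) -> str:
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [t for t, _ in top]
    weights = [p for _, p in top]  # random.choices renormalizes the weights
    return rng.choices(tokens, weights=weights, k=1)[0]

next_token_probs = {"dog": 0.45, "cat": 0.30, "car": 0.15, "zebra": 0.10}
rng = random.Random(0)  # seeded so the sketch is reproducible
picks = [top_k_sample(next_token_probs, k=2, rng=rng) for _ in range(5)]
print(picks)  # with k=2, only 'dog' and 'cat' can ever be picked
```

With k=2, low-probability tokens like 'zebra' are cut off entirely, while 'dog' and 'cat' are both possible, so repeated runs give varied but sensible output.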

Back to the topic of 'hallucination,' you may object: "My ChatGPT knows about an incident that happened yesterday, even though it was trained with a knowledge cut-off of October 1, 2024. That doesn't make sense." Yes! This is what makes modern LLMs interesting. They have the ability to search the internet with what we call 'tooling'. So if the LLM detects that it is being asked about something it doesn't know, it can use its tooling to search the internet for it.
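The shape of that tool-use loop looks roughly like the sketch below. Everything here is hypothetical: real systems have the model emit a structured tool call (often JSON) that the serving code executes before letting generation resume, but the back-and-forth is the same:

```python
# Simplified, hypothetical sketch of the tool-use ("tooling") loop.
# Real systems use structured tool calls and real search backends.

def fake_model(prompt: str) -> str:
    """Stand-in for the LLM: asks for a search when it doesn't know."""
    if "SEARCH RESULT:" in prompt:
        return "According to the search result, the event happened today."
    return "TOOL_CALL: web_search('what incident happened yesterday')"

def fake_web_search(query: str) -> str:
    """Stand-in for a real web-search tool."""
    return "news article about yesterday's incident"

def run_with_tools(user_question: str) -> str:
    prompt = user_question
    reply = fake_model(prompt)
    while reply.startswith("TOOL_CALL:"):       # model asked to use a tool
        result = fake_web_search(reply)          # the harness runs the tool
        prompt += f"\nSEARCH RESULT: {result}"   # result goes back into the input
        reply = fake_model(prompt)               # generation resumes with new info
    return reply

print(run_with_tools("What incident happened yesterday?"))
```

The model itself never touches the internet; it only emits a request, and the surrounding software runs the search and feeds the result back in as more input tokens.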

But how do they know they can use these tools? It can't come from the chunk of training data they consume, right? Right, it doesn't. It comes from a process known as the post-training phase. To make it easier to understand, look at the image below:

> [Image: diagram of the training pipeline, from foundation model through post-training]

As you can see, the post-training phase is the most complicated part. It's the phase where we train the model on how to behave: how to respond to a question, how to refuse unsafe requests, and how to use external tools to obtain information or run code. Learning these patterns extends the model's capabilities and reduces the chance of a hallucinated response, because the model learns to recognize when it doesn't know the answer to what the user asked.
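To give a feel for what post-training data might look like, here is a hypothetical supervised fine-tuning record. The field names and format are invented for illustration (every lab uses its own schema); the key idea is that the model is trained on demonstrations of desired behavior, including when to call a tool and when to refuse:

```python
# Hypothetical sketch of supervised fine-tuning records used in
# post-training. The schema is invented for illustration; the point is
# that the model learns behavior from example conversations.

sft_examples = [
    {
        "messages": [
            {"role": "user", "content": "What happened in the news today?"},
            {"role": "assistant", "content": "TOOL_CALL: web_search('news today')"},
        ],
    },
    {
        "messages": [
            {"role": "user", "content": "What is my neighbor's password?"},
            {"role": "assistant", "content": "I can't help with that."},
        ],
    },
]

# During post-training, the model is optimized to reproduce the assistant
# turns given the preceding conversation context.
for example in sft_examples:
    print(example["messages"][-1]["content"])
```

It is still next-token prediction under the hood; the model is just now predicting tokens that imitate a helpful, safe, tool-using assistant.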

Looking at the previous image, we can also see a foundation model. If you have been following LLM development, I'm sure you have heard of models like GPT-4o, GPT-4.1, or GPT-4.1 Mini; they are a family of models that originated from the GPT-4 foundation model.

Foundation models are basically the base models. These models are plain next-token predictors. Unlike the LLMs that we generally use (like ChatGPT, Gemini, or Claude), they haven't been given an instruction dataset, so their output is generally not helpful, safe, or aligned with human instructions. They need to be trained or fine-tuned to transform them into powerful assistants to help with domain-specific tasks. For example, GPT-4.1 Mini should be faster but not as clever as GPT-4.1. This is what differentiates them: they were tuned to excel in a certain domain.

Summary

In short, a Large Language Model is a powerful tool that functions by predicting the next "token" in a sequence based on vast amounts of text data it has been trained on. This training allows it to understand patterns, context, and language structure. While its knowledge is limited by its training data, which can lead to "hallucinations," modern LLMs overcome this through a post-training phase that equips them with tools to access real-time information. This combination of foundational knowledge and the ability to seek new data is what makes them such versatile and capable systems.

I guess that's all you need to know to understand LLMs in general. I may cover the limitations of LLMs in the next post, such as the major problems they have and why I don't think LLMs are the way to achieve AGI, but I think this is a great first post to have.