# What is an LLM, really?

> A from-scratch explainer of how large language models actually work — tokens, attention, the inference loop, and what to make of it all — in 12 minutes.

- **Pillar:** Explainers
- **Author:** Aditya Marin Gasga (Founding Editor)
- **Published:** 2026-05-15T00:00:00.000Z
- **Updated:** 2026-05-20T00:00:00.000Z
- **Tags:** llm, transformer, attention, tokens, explainer

## TL;DR

A large language model is a function that, given a stretch of text, predicts what comes next — one token at a time, billions of weighted decisions per token, all the way down. Everything else (chat, code, agents, tool use) is built on top of that one trick.

## Key takeaways

1. An LLM doesn't 'know' things — it predicts text. The fact that this works so well is genuinely the surprising part.
2. Tokens are not words. They're the unit the model actually thinks in, and they cost money.
3. Attention is the mechanism that lets the model decide which parts of the input matter for the next token.
4. Training and inference are completely different operations — months of GPU time vs. milliseconds per token.
5. The 'context window' is just how much text the model can see at once. Bigger context isn't always better.

import PullQuote from '~/components/article/PullQuote.astro';
import Callout from '~/components/article/Callout.astro';

The phrase 'large language model' has shown up in roughly every business deck of the last three years, and almost none of them define it. So let's just do that. By the end of this piece you'll have a working mental model — not a research-paper-grade one, but one good enough to read the news and judge product claims.

## What a language model does

Strip everything else away, and a language model is a function. You hand it some text. It returns a probability distribution over the next *token* — meaning, a list of possible next tokens with a score for each.

That's the whole core operation. Repeat it in a loop, and you get text generation. Repeat it with a system prompt that says "you are a helpful assistant," and you get ChatGPT. Repeat it with a tool-use harness and you get an agent. The trick is that the loop has gotten extraordinarily good.

<PullQuote pillar="explainers">Repeat one operation — predict the next token — extraordinarily well, and you get text generation, ChatGPT, and agents alike.</PullQuote>

When you ask a model `What is the capital of France?`, what it actually does is:

1. Convert your question into tokens.
2. Run those tokens through a 200-billion-parameter neural network.
3. Get back a probability distribution: maybe `Paris` at 0.92, `Lyon` at 0.03, `the` at 0.02, and a long tail of other options.
4. Pick the most likely token (or sample, with some randomness).
5. Append that token to the input and go back to step 2 for the next one.

That's it. The model is doing one thing — predicting the next token — and what looks like answers, reasoning, and writing all emerge from doing that prediction extremely well, billions of times in sequence. The diagram at the top of this piece traces exactly that loop.

<Callout pillar="explainers" label="The whole trick">One operation — **predict the next token** — run billions of times in sequence. Everything that looks like reasoning or writing emerges from doing just that, extremely well.</Callout>

## Tokens are not words

This is the first thing most explanations skip. A *token* is whatever atomic unit of text the model was trained to work in. It can be a word (`apple`), a piece of a word (`unpre`, `dictable`), a single character, or even punctuation. Frontier models use *byte-pair encoding* — a compression scheme that lets the model assign single tokens to common patterns and multi-token sequences to rare ones.

A rough rule: in English, a token is about 3/4 of a word. So a 1,000-word essay is roughly 1,300 tokens. This matters because:

- **You're billed per token.** Not per word. Not per character. Per token.
- **Context windows are measured in tokens.** A "200K context window" means 200,000 tokens — about 150,000 words — and that includes the system prompt, tools schemas, and prior conversation.
- **Some inputs are dramatically more expensive than they look.** A wall of code might tokenize to 2× more tokens than equivalent English prose. Tables and JSON are often inefficient.

## What attention does

If a language model is just predicting the next token, *how* it does that is mostly a story about **attention**. Pre-transformer architectures struggled with long-range dependencies — by the time you'd processed a long paragraph, the model had effectively forgotten the start.

Attention solves this. When the model is predicting the next token, every other token in the input gets to "vote" on what should come next — weighted by how relevant that token is. The 2017 paper that introduced the modern transformer is literally titled *Attention Is All You Need*.

Concretely: when you ask `The boy who lost his hat looked sad. What was the boy looking for? He was looking for his ___`, the attention mechanism is what lets the model assign high weight to `hat` from twenty tokens earlier when picking the next word.

Modern models stack many layers of this attention machinery — each layer attending to the outputs of the previous one. With each layer, the model can capture progressively more abstract relationships in the input. Sixty layers deep, you get the kind of pattern-finding that produces working code or coherent essays.

## Training vs. inference

These are two completely different operations, and conflating them is one of the most common confusions in AI conversations.

**Training** is the months-long process of running text through the model and adjusting its hundreds of billions of internal parameters to better predict the next token. It uses petabytes of text, thousands of GPUs, and tens of millions of dollars in electricity. A new frontier model is trained roughly once every six months.

**Inference** is what happens when you use a trained model. The weights are frozen. You give it input; it gives you output; nothing changes. It's much cheaper per use — fractions of a cent per request — but it's the cost that scales with how many users you have. Anthropic, OpenAI, and Google all spend more on serving inference than on training new models.

When people say "the model learned from my conversation," that's usually wrong. The weights don't update from your chat. You're using the frozen model. (Models with "memory" features fake this by saving notes about you and re-injecting them into the system prompt — the model itself doesn't change.)

## Context windows and why bigger isn't always better

The **context window** is how much text the model can attend to in a single request. Five years ago, this was 2,000 tokens. Today, frontier models support 200K to 2M tokens — enough for an entire book.

The temptation is to throw everything at the big window. Sometimes that works. But three things make giant contexts a less obvious win than they sound:

1. **Cost.** You pay per input token. Cramming 500K tokens into every request is not free.
2. **Latency.** Bigger inputs mean slower responses. A 200K-token prompt typically adds 5–15 seconds before the first output token.
3. **Lost-in-the-middle.** Models reliably attend to the beginning and end of long contexts. The middle is genuinely harder for them. Benchmarks consistently show degraded recall for facts buried at the 60-70% mark of a long input.

For most workloads, careful retrieval (RAG) — surfacing the most relevant 5K tokens — beats dumping 500K tokens in and hoping. Use the giant context when you actually need it (whole-codebase analysis, long documents), not as the default.

## What an LLM is *not*

A few things that get conflated with LLMs that aren't them:

- **A search engine.** It doesn't look anything up. (Unless you give it a search tool.)
- **A database.** It doesn't store facts in a structured way. It has rough statistical knowledge of what was in its training data.
- **A calculator.** It can mimic arithmetic on common-looking sums and fail on uncommon ones in unpredictable ways. (Unless you give it a calculator tool.)
- **A reasoner.** Newer "thinking" or "reasoning" models do show genuine multi-step problem-solving, but the underlying mechanism is still next-token prediction — they're trained to produce a chain-of-thought before answering.

The pattern: an LLM by itself is a text predictor. The model is one component of a system, not the whole system.

<PullQuote pillar="explainers">An LLM by itself is a text predictor. Wrap it in tools, retrieval, and a control loop, and you get something that can act in the world.</PullQuote>

## The mental model to walk away with

Imagine an extremely well-read person who has read essentially the entire public internet, who only communicates by completing your sentences, who has no memory between conversations, who occasionally makes plausible things up because they're trained to sound confident, and whose entire mental world is just statistical patterns in text.

That's roughly what you're talking to. Use that mental model and most of the surprising-seeming behavior of LLMs makes sense: why they're great at writing, hit-or-miss at math, prone to confident-sounding errors, and dramatically improved when you let them step through their reasoning.

Everything you've heard about "agents," "AGI," "tool use," and the rest is built on top of this one core trick — predict the next token, then do it again. Genuinely. That's the whole pile.