Key takeaways

  1. An LLM doesn't 'know' things — it predicts text. The fact that this works so well is genuinely the surprising part.
  2. Tokens are not words. They're the unit the model actually thinks in, and they cost money.
  3. Attention is the mechanism that lets the model decide which parts of the input matter for the next token.
  4. Training and inference are completely different operations — months of GPU time vs. milliseconds per token.
  5. The 'context window' is just how much text the model can see at once. Bigger context isn't always better.

The phrase ‘large language model’ has shown up in roughly every business deck of the last three years, and almost none of them define it. So let’s just do that. By the end of this piece you’ll have a working mental model — not a research-paper-grade one, but one good enough to read the news and judge product claims.

What a language model does

Strip everything else away, and a language model is a function. You hand it some text. It returns a probability distribution over the next token — meaning, a list of possible next tokens with a score for each.

That’s the whole core operation. Repeat it in a loop, and you get text generation. Repeat it with a system prompt that says “you are a helpful assistant,” and you get ChatGPT. Repeat it with a tool-use harness and you get an agent. The trick is that the loop has gotten extraordinarily good.

Repeat one operation — predict the next token — extraordinarily well, and you get text generation, ChatGPT, and agents alike.

When you ask a model What is the capital of France?, what it actually does is:

  1. Convert your question into tokens.
  2. Run those tokens through a 200-billion-parameter neural network.
  3. Get back a probability distribution: maybe Paris at 0.92, Lyon at 0.03, the at 0.02, and a long tail of other options.
  4. Pick the most likely token (or sample, with some randomness).
  5. Append that token to the input and go back to step 2 for the next one.

That’s it. The model is doing one thing — predicting the next token — and what looks like answers, reasoning, and writing all emerge from doing that prediction extremely well, billions of times in sequence. The diagram at the top of this piece traces exactly that loop.

Tokens are not words

This is the first thing most explanations skip. A token is whatever atomic unit of text the model was trained to work in. It can be a word (apple), a piece of a word (unpre, dictable), a single character, or even punctuation. Frontier models use byte-pair encoding — a compression scheme that lets the model assign single tokens to common patterns and multi-token sequences to rare ones.

A rough rule: in English, a token is about 3/4 of a word. So a 1,000-word essay is roughly 1,300 tokens. This matters because:

  • You’re billed per token. Not per word. Not per character. Per token.
  • Context windows are measured in tokens. A “200K context window” means 200,000 tokens — about 150,000 words — and that includes the system prompt, tools schemas, and prior conversation.
  • Some inputs are dramatically more expensive than they look. A wall of code might tokenize to 2× more tokens than equivalent English prose. Tables and JSON are often inefficient.

What attention does

If a language model is just predicting the next token, how it does that is mostly a story about attention. Pre-transformer architectures struggled with long-range dependencies — by the time you’d processed a long paragraph, the model had effectively forgotten the start.

Attention solves this. When the model is predicting the next token, every other token in the input gets to “vote” on what should come next — weighted by how relevant that token is. The 2017 paper that introduced the modern transformer is literally titled Attention Is All You Need.

Concretely: when you ask The boy who lost his hat looked sad. What was the boy looking for? He was looking for his ___, the attention mechanism is what lets the model assign high weight to hat from twenty tokens earlier when picking the next word.

Modern models stack many layers of this attention machinery — each layer attending to the outputs of the previous one. With each layer, the model can capture progressively more abstract relationships in the input. Sixty layers deep, you get the kind of pattern-finding that produces working code or coherent essays.

Training vs. inference

These are two completely different operations, and conflating them is one of the most common confusions in AI conversations.

Training is the months-long process of running text through the model and adjusting its hundreds of billions of internal parameters to better predict the next token. It uses petabytes of text, thousands of GPUs, and tens of millions of dollars in electricity. A new frontier model is trained roughly once every six months.

Inference is what happens when you use a trained model. The weights are frozen. You give it input; it gives you output; nothing changes. It’s much cheaper per use — fractions of a cent per request — but it’s the cost that scales with how many users you have. Anthropic, OpenAI, and Google all spend more on serving inference than on training new models.

When people say “the model learned from my conversation,” that’s usually wrong. The weights don’t update from your chat. You’re using the frozen model. (Models with “memory” features fake this by saving notes about you and re-injecting them into the system prompt — the model itself doesn’t change.)

Context windows and why bigger isn’t always better

The context window is how much text the model can attend to in a single request. Five years ago, this was 2,000 tokens. Today, frontier models support 200K to 2M tokens — enough for an entire book.

The temptation is to throw everything at the big window. Sometimes that works. But three things make giant contexts a less obvious win than they sound:

  1. Cost. You pay per input token. Cramming 500K tokens into every request is not free.
  2. Latency. Bigger inputs mean slower responses. A 200K-token prompt typically adds 5–15 seconds before the first output token.
  3. Lost-in-the-middle. Models reliably attend to the beginning and end of long contexts. The middle is genuinely harder for them. Benchmarks consistently show degraded recall for facts buried at the 60-70% mark of a long input.

For most workloads, careful retrieval (RAG) — surfacing the most relevant 5K tokens — beats dumping 500K tokens in and hoping. Use the giant context when you actually need it (whole-codebase analysis, long documents), not as the default.

What an LLM is not

A few things that get conflated with LLMs that aren’t them:

  • A search engine. It doesn’t look anything up. (Unless you give it a search tool.)
  • A database. It doesn’t store facts in a structured way. It has rough statistical knowledge of what was in its training data.
  • A calculator. It can mimic arithmetic on common-looking sums and fail on uncommon ones in unpredictable ways. (Unless you give it a calculator tool.)
  • A reasoner. Newer “thinking” or “reasoning” models do show genuine multi-step problem-solving, but the underlying mechanism is still next-token prediction — they’re trained to produce a chain-of-thought before answering.

The pattern: an LLM by itself is a text predictor. The model is one component of a system, not the whole system.

An LLM by itself is a text predictor. Wrap it in tools, retrieval, and a control loop, and you get something that can act in the world.

The mental model to walk away with

Imagine an extremely well-read person who has read essentially the entire public internet, who only communicates by completing your sentences, who has no memory between conversations, who occasionally makes plausible things up because they’re trained to sound confident, and whose entire mental world is just statistical patterns in text.

That’s roughly what you’re talking to. Use that mental model and most of the surprising-seeming behavior of LLMs makes sense: why they’re great at writing, hit-or-miss at math, prone to confident-sounding errors, and dramatically improved when you let them step through their reasoning.

Everything you’ve heard about “agents,” “AGI,” “tool use,” and the rest is built on top of this one core trick — predict the next token, then do it again. Genuinely. That’s the whole pile.

Frequently asked questions

What does LLM stand for?

Large Language Model. 'Large' refers to the number of parameters — typically hundreds of billions for frontier models. 'Language' is the domain, and 'Model' is the math: a function that maps input text to output text.

Is an LLM the same as ChatGPT?

No. ChatGPT is a product. The LLM (GPT-5.5, in the current case) is the underlying model. ChatGPT wraps the LLM with a chat interface, memory features, web browsing, and other product layers.

Does an LLM think?

Not in any sense humans would recognize. It performs forward passes through a neural network — billions of multiplications producing probability distributions over tokens. The output can look like reasoning, but the mechanism is pattern completion at a very high level.

Why do LLMs hallucinate?

Because they're text predictors, not knowledge systems. When the model's training data doesn't cover a question — or when it does but ambiguously — the model still has to pick a next token, and picks the most statistically plausible one. That answer might not be true.

How much does it cost to run an LLM?

If you're using an API, you pay per token (typically $0.30–$15 per million tokens depending on model and cache hit rate). If you're hosting your own, you're paying for the GPUs to keep the weights resident — a single frontier model can run on a single 8-GPU node.

About Aditya Marin Gasga

Founding Editor

Aditya covers the whole AI surface area for Signal — frontier models, agent infrastructure, the economics of inference, and the policy decisions that quietly shape what everyone else can build. He writes for operators who need a calibrated view of what's actually shipping versus what's keynote theatre.

  • Founder of Signal; sets the publication's editorial line
  • A decade across product, growth, and AI tooling at venture-backed startups
  • Reads the model release notes, the system cards, and the benchmark papers — and tells you which ones matter
More from Aditya Marin Gasga →