Key takeaways

  1. An embedding is just a list of numbers — usually 256 to 3,072 of them — produced by a model that's been trained to put 'similar' things close together.
  2. Cosine similarity is how you compare two embeddings: basically the cosine of the angle between two vectors. A score of 1 means identical direction; 0 means unrelated; -1 means opposite.
  3. Embeddings power RAG, semantic search, recommendation systems, deduplication, and clustering — anything where 'find me things like this' is the underlying question.
  4. Dimensionality is a cost/quality knob — higher dimensions capture more nuance but cost more to store and search. Most modern models default to 768-3,072.
  5. OpenAI text-embedding-4, Voyage voyage-large-3, Cohere Embed v4, and the open-weights BGE-M3 are the practical 2026 picks; choose by language coverage, dimensionality, and price.

If you’ve spent any time with modern AI infrastructure, you’ve heard the word embedding thrown around constantly — “we use embeddings for retrieval,” “embedding-based search,” “vector databases store embeddings.” It’s the kind of word that gets used as if everyone understands it, even when nobody on the call quite does. So let’s just define it.

By the end of this piece you’ll have a working mental model — not the research-paper version, but one good enough to read system designs and judge product claims. Same shape as our explainer on LLMs.

What an embedding actually is

An embedding is a list of numbers. That’s the entire definition.

The list is usually long — somewhere between 256 and 3,072 numbers, depending on which embedding model produced it. Each number is a floating-point value, typically between roughly -1 and 1. So an embedding for the sentence “I love dogs” might look like:

[0.0234, -0.1782, 0.9034, ..., -0.0012, 0.4421, 0.8801]

That’s it. A vector. Just numbers. You couldn’t read that and figure out it represents “I love dogs” — there’s no human-meaningful relationship between any single number and any particular word or concept. The meaning is distributed across all 1,536 (or whatever) numbers together.

What makes it useful is that the model that produced this vector was trained on enormous amounts of text with a specific objective: put similar things close together. The embedding for “I love dogs” and the embedding for “I adore puppies” will be very close in this high-dimensional space, even though the only word they share is “I”. The embedding for “I love dogs” and the embedding for “the stock market crashed” will be very far apart. Shared vocabulary is not what drives the distance; meaning is.

So when people say “embeddings capture meaning,” what they mean is: the geometry of the embedding space corresponds to the semantics of the input. Things that mean similar things end up in the same neighborhood. That’s the whole trick — and it’s what the diagram at the top of this piece shows.

Meaning becomes geometry — things that mean similar things land in the same neighbourhood.

How you compare two embeddings

Once you have two embeddings, you need a way to ask “how similar are these?” The answer is almost always cosine similarity.

Cosine similarity is the cosine of the angle between two vectors. If the vectors point in exactly the same direction, the angle is 0° and the cosine is 1. If they’re perpendicular, cosine is 0. If they point in opposite directions, cosine is -1. So cosine similarity gives you a single number between -1 and 1 that says how aligned two embeddings are, regardless of their magnitude.
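To make the three reference points concrete, here is a small sketch on hand-made 2-D vectors. The vectors are toy values chosen for illustration, not output from any embedding model, and `cosine_similarity` is a helper name introduced here.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors: only the direction matters, not the length.
same = cosine_similarity([1.0, 0.0], [3.0, 0.0])            # same direction
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 2.0])   # 90 degrees apart
opposite = cosine_similarity([1.0, 0.0], [-5.0, 0.0])       # opposite direction

print(same, perpendicular, opposite)
```

The three results land at 1, 0, and -1 respectively, even though the vectors have different lengths.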

Why not just measure the straight-line distance between them? You can — that’s called Euclidean distance — and some systems do. But cosine is more common because it ignores magnitude. If embedding A is “I love dogs” said quietly and embedding B is “I LOVE DOGS” said loudly (i.e., longer text, more emphatic), their vectors might have different magnitudes but the same direction. Cosine treats them as the same; Euclidean would say they’re far apart. For semantic comparison, direction is what matters.
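A quick way to see the difference is to compare a vector with a scaled copy of itself. This is a sketch on a made-up 2-D vector; real embeddings behave the same way in more dimensions.

```python
import numpy as np

a = np.array([0.6, 0.8])   # toy "embedding" with magnitude 1
b = 3.0 * a                # same direction, three times the magnitude

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)

# Cosine says the two are perfectly aligned; Euclidean says they are 2 units apart.
print(round(float(cosine), 6), round(float(euclidean), 6))
```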

In production code, a similarity computation looks something like this:

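import numpy as np

# embed() is a placeholder for whatever embedding client you use;
# it is assumed to return a 1-D numpy array.
a = embed("I love dogs")        # → 1,536-dim numpy array
b = embed("I adore puppies")    # → 1,536-dim numpy array

# Cosine similarity: dot product divided by the product of the magnitudes
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# typically 0.85-0.95 for sentences this similar; exact values vary by model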

Most embedding models normalize their output so the vectors all have magnitude 1, in which case the cosine simplifies to just the dot product — multiply the two vectors element-by-element and sum the result. Very fast.
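The shortcut is easy to check on toy vectors. In this sketch `normalize` is a helper introduced for illustration; after normalization the full cosine formula and the plain dot product give the same number.

```python
import numpy as np

def normalize(v):
    """Scale a vector to magnitude 1 without changing its direction."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

a = normalize([3.0, 4.0])   # becomes [0.6, 0.8]
b = normalize([4.0, 3.0])   # becomes [0.8, 0.6]

full_cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot_only = np.dot(a, b)     # both norms are 1, so the division changes nothing

print(round(float(full_cosine), 6), round(float(dot_only), 6))
```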

Where embeddings come from

You don’t write embeddings by hand. You get them from a model that’s been trained to produce them.

The training objective for embedding models is roughly: take a huge corpus, pick pairs of texts that should be similar (a question and its answer, two paraphrases of the same fact, a query and a relevant document), and train the model so those pairs produce vectors close together — while pushing unrelated pairs apart. After hundreds of millions of training examples, the model learns to map any input text to a point in vector space whose location reflects its meaning.
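One common way to express "pull matching pairs together, push the rest apart" is an in-batch contrastive loss. The sketch below is an assumption for illustration: the article does not say which loss any particular model uses, and the `temperature` value and the toy 2-D vectors are made up.

```python
import numpy as np

def contrastive_loss(queries, docs, temperature=0.05):
    """In-batch contrastive loss: queries[i] should match docs[i]
    and score lower against every other docs[j]."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    logits = q @ d.T / temperature                        # scaled pairwise cosines
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # penalize low matching-pair probability

aligned = np.array([[1.0, 0.0], [0.0, 1.0]])
shuffled = aligned[::-1]                          # every query paired with the wrong doc

loss_good = contrastive_loss(aligned, aligned)    # matching pairs already close
loss_bad = contrastive_loss(aligned, shuffled)    # matching pairs far apart
print(loss_good < loss_bad)
```

Training nudges the model's weights to lower this kind of loss, which is what moves related texts toward each other in the vector space.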

Most modern embedding models are smaller and cheaper than chat models. OpenAI’s text-embedding-4 is dramatically smaller than GPT-5.5. That’s intentional — embedding is a simpler task than generation, so you don’t need the model to be as smart. You just need it to be consistent about which inputs map to which regions of the vector space.

This is also why embedding models are usually deterministic: the same input produces the same vector every time. No sampling, no temperature, no randomness. That makes them reliable building blocks for systems that need to compare and look things up.

What dimensionality buys you

The big knob on an embedding model is dimensionality — how many numbers are in each vector. More dimensions mean the model can capture more nuance about the input. Fewer dimensions mean less storage cost and faster search.

For example:

  • 256 dimensions is fine for short, narrow tasks like classifying customer support tickets into 5 categories. Vectors are small, search is fast, your vector database is cheap.
  • 768 dimensions is the historical sweet spot — used by most BERT-style models. Still very common.
  • 1,536 dimensions is what OpenAI’s text-embedding-3-small uses. Most production RAG systems built in the last two years are running at this dimensionality.
  • 3,072 dimensions is text-embedding-4’s high-quality option, and what Voyage and Cohere offer at their top tier. Better retrieval quality on hard queries; more expensive to store and search at scale.

A useful intuition: doubling dimensions doesn’t double quality. The relationship is steeply diminishing. Going from 256 → 1,536 typically gives you most of the quality you’d ever get; going from 1,536 → 3,072 might add 2-5% on retrieval benchmarks, depending on the task. The honest question for any production system is “is 5% better retrieval worth 2× the storage and search cost?” — and the answer is often no.
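The storage side of that trade-off is simple arithmetic. This sketch assumes 4-byte float32 values and a hypothetical corpus of 10 million vectors, and it ignores index overhead, which varies by database.

```python
def index_size_gb(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage in gigabytes (10^9 bytes), excluding index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9

# Hypothetical corpus of 10 million chunks at each dimensionality.
for dims in (256, 768, 1536, 3072):
    print(dims, index_size_gb(10_000_000, dims))
```

Under these assumptions, 1,536 dimensions needs about 61 GB of raw vectors and 3,072 needs about 123 GB, which is the 2× in the question above.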

Many modern models (OpenAI text-embedding-4, Voyage’s models) support Matryoshka representations, meaning you can truncate the vector to its first N dimensions, re-normalize it, and still get usable results. So you can store 3,072-dim vectors for offline analytics but truncate to 768-dim for fast lookup. This is one of the more useful infrastructure innovations of the last 18 months.
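Mechanically, truncation is just slicing and re-normalizing. In this sketch `truncate_embedding` is a helper name introduced here, and the random vector only stands in for a real embedding. The quality claim holds only for models actually trained with a Matryoshka objective; slicing an arbitrary model's vectors is not expected to work as well.

```python
import numpy as np

def truncate_embedding(vec, dims):
    """Keep the first `dims` values, then rescale to magnitude 1
    so dot-product similarity still behaves like cosine."""
    head = np.asarray(vec, dtype=float)[:dims]
    return head / np.linalg.norm(head)

rng = np.random.default_rng(0)
full = rng.normal(size=3072)        # placeholder for a stored 3,072-dim embedding
full /= np.linalg.norm(full)

short = truncate_embedding(full, 768)
print(short.shape, round(float(np.linalg.norm(short)), 6))
```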

Why embeddings make RAG possible

If embeddings just sat in a vector space looking pretty, none of this would matter. The reason they’re foundational to modern AI infrastructure is retrieval-augmented generation — RAG — and adjacent patterns like semantic search and recommendation.

The pattern, in three steps:

  1. Ingest. Take your corpus of documents — internal docs, knowledge base articles, past tickets, whatever. Chunk them into reasonable pieces (a paragraph, a page, a section). For each chunk, compute its embedding and store the vector + the chunk in a vector database (Pinecone, Weaviate, pgvector, etc.).
  2. Query. When a user asks a question, compute the embedding of the question and ask the database for the top-K chunks whose embeddings are closest to the question’s embedding (by cosine similarity). These are the chunks most likely to contain the answer.
  3. Generate. Pass those top-K chunks to an LLM as context, along with the user’s original question. The LLM answers the question grounded in the retrieved material.
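The three steps above can be sketched in a few lines. Everything here is a stand-in: `toy_embed` is a word-count vector, not a real embedding model (it only captures word overlap, not meaning), the in-memory array replaces a vector database, and the final LLM call is left out, so the sketch stops at building the prompt.

```python
import numpy as np

def build_vocab(texts):
    """Map each distinct word in the corpus to a vector position."""
    words = sorted({w for t in texts for w in t.lower().split()})
    return {w: i for i, w in enumerate(words)}

def toy_embed(text, vocab):
    """Stand-in for a real embedding model: unit-length word counts."""
    vec = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:
            vec[vocab[w]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# 1. Ingest: embed every chunk and keep the vectors next to the text.
chunks = [
    "To cancel your subscription open billing settings",
    "Our office is closed on public holidays",
    "Reset your password from the login page",
]
vocab = build_vocab(chunks)
index = np.stack([toy_embed(c, vocab) for c in chunks])

# 2. Query: embed the question and take the top-k chunks by cosine similarity.
def retrieve(question, k=1):
    scores = index @ toy_embed(question, vocab)   # vectors are unit length
    top = np.argsort(-scores)[:k]
    return [chunks[i] for i in top]

# 3. Generate: hand the retrieved chunks to an LLM as context (call omitted).
def build_prompt(question, k=1):
    context = "\n".join(f"- {c}" for c in retrieve(question, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("how do I cancel my subscription"))
```

Swapping `toy_embed` for a real model and the array for a vector database changes the quality and scale, not the shape of the loop.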

That’s the entire architecture. It’s why every customer-support bot, every internal docs assistant, every “chat with your PDF” tool you’ve used in the last two years exists.

The LLM provides language. The embedding system provides relevance.

The same pattern, slightly modified, powers:

  • Semantic search — query embedding vs. document embeddings, return the closest. (No LLM step.)
  • Recommendation — embed the user’s interaction history, find items with embeddings close to that profile.
  • Deduplication — embed each entry, cluster nearby ones, treat them as duplicates.
  • Anomaly detection — anything whose embedding is far from any cluster is unusual.
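As one example from the list, deduplication can be sketched as a similarity threshold over all pairs. The 0.95 threshold and the 3-D vectors are illustrative assumptions; a real system would tune the threshold on its own data and avoid the all-pairs comparison at scale.

```python
import numpy as np

def find_duplicates(vectors, threshold=0.95):
    """Return index pairs whose cosine similarity is at or above the threshold."""
    v = np.asarray(vectors, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    sims = v @ v.T
    n = len(v)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if sims[i, j] >= threshold]

entries = np.array([
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],   # nearly the same direction as entry 0
    [0.0, 1.0, 0.0],     # unrelated to entry 0
])

pairs = find_duplicates(entries)
print(pairs)
```

Only entries 0 and 1 are flagged here; entry 2 points in a different direction from both.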

If you’ve used a system prompt template for a RAG assistant, the strict citation discipline it enforces is built directly on top of this loop: the embeddings retrieve the chunks, the LLM cites them, and the answer is constrained to the material the retrieval surface returned.

How to choose an embedding model in 2026

Practical picks, all current as of mid-2026:

  • OpenAI text-embedding-4 — easiest integration with the OpenAI stack. Up to 3,072 dimensions, with Matryoshka support. Default choice if you’re already using GPT-5.5 for generation. ~$0.10 per million tokens.
  • Voyage voyage-large-3 — currently top of the MTEB retrieval benchmark for English. 1,024 dimensions standard. Slightly more expensive than OpenAI but consistently wins on hard retrieval tasks. ~$0.12 per million tokens.
  • Cohere Embed v4 — strongest multilingual coverage among the closed-source models, 100+ languages. Use when your corpus or your users aren’t English-default. ~$0.10 per million tokens.
  • BGE-M3 (BAAI, open-weights) — the strongest open-weights embedding model right now. Multilingual, supports dense + sparse + multi-vector retrieval in one model. Self-host on a single GPU or use via providers like Together. Zero per-token cost if you’re hosting; latency depends on your infrastructure.
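To put the per-token prices in perspective, here is a back-of-the-envelope estimate. The corpus size (50,000 documents of about 800 tokens each) is a hypothetical; the prices are the approximate figures quoted above.

```python
def embedding_cost_usd(num_tokens, price_per_million_tokens):
    """One-time cost of embedding a corpus at a given per-million-token price."""
    return num_tokens / 1_000_000 * price_per_million_tokens

corpus_tokens = 50_000 * 800   # hypothetical corpus: 40 million tokens

cost_at_10_cents = embedding_cost_usd(corpus_tokens, 0.10)
cost_at_12_cents = embedding_cost_usd(corpus_tokens, 0.12)
print(round(cost_at_10_cents, 2), round(cost_at_12_cents, 2))
```

At this scale the gap between vendors is under a dollar, so price tends to matter only for very large or frequently re-embedded corpora.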

How to pick:

  • If your application is English-dominant and you want best raw quality: Voyage.
  • If you’re already on OpenAI: text-embedding-4 (one fewer vendor).
  • If you need multilingual: Cohere v4 (managed) or BGE-M3 (self-host).
  • If you need on-prem / air-gapped deployment: BGE-M3 is the only real answer.

Don’t agonize. The quality differences between these models on most production tasks are smaller than the quality differences between good and bad chunking strategies upstream. Pick one that fits your stack and move on to evaluating your actual retrieval quality.

What embeddings aren’t

A few things that get conflated with embeddings:

  • Embeddings are not search. They’re an input to search. The search system itself is the vector database and the algorithm that finds nearest neighbors (typically an approximate nearest neighbor algorithm such as HNSW).
  • Embeddings are not understanding. A model that produces a useful embedding of a sentence doesn’t necessarily “understand” the sentence — it’s mapping the sentence to a region of vector space based on patterns in training data. Pieces of text can be near each other in embedding space for spurious reasons (shared style, shared topic that’s tangential, etc.).
  • Embeddings don’t replace LLMs. They power retrieval; the LLM generates the response. Most production systems use both.
  • One embedding model isn’t best for everything. A model trained on web data might be terrible at medical text or legal text or code. Specialized embedding models (Voyage’s code model, Cohere’s medical embeddings) exist for a reason.
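On the first point, the search step is a separate piece of code from the embedding step. The sketch below is the exact brute-force version that approximate methods such as HNSW trade a little accuracy to speed up; the random vectors are placeholders for stored embeddings and `top_k_exact` is a name introduced here.

```python
import numpy as np

def top_k_exact(query, index, k=3):
    """Exact nearest neighbors by scanning every stored vector.
    Assumes all vectors are unit length, so dot product equals cosine."""
    scores = index @ query
    top = np.argpartition(-scores, k - 1)[:k]   # k best, in no particular order
    return top[np.argsort(-scores[top])]        # sort those k by score

rng = np.random.default_rng(0)
index = rng.normal(size=(1000, 32))             # placeholder for stored embeddings
index /= np.linalg.norm(index, axis=1, keepdims=True)

nearest = top_k_exact(index[42], index, k=3)    # query with a vector already in the index
print(nearest[0])
```

The scan costs one dot product per stored vector, which is why large indexes switch to approximate structures.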

The mental model to walk away with

Imagine every piece of text you’ve ever encountered as a dot floating in a vast multi-dimensional cloud. The location of each dot is determined by what the text means — not by what it literally says. Texts about similar things sit in the same neighborhoods. Texts about wildly different things sit on opposite sides of the cloud.

An embedding model is the function that converts new text into a coordinate in that cloud. A vector database is a lookup system that finds the nearest neighbors of any new point. RAG is the pattern of asking the LLM to answer questions using only the neighbors the embedding-based lookup found.

Everything else — Pinecone, Weaviate, semantic search, chunking strategies, retrieval evaluation — is engineering on top of that one idea. The idea itself is genuinely small. The architecture it enables is most of modern AI infrastructure.

If you remember nothing else: an embedding is a list of numbers that captures meaning, and the only thing you ever do with two of them is ask “how aligned are they?” The answer is a cosine similarity, and everything we’ve built on top of that one operation is the reason the AI infrastructure category exists.

Frequently asked questions

What is an embedding in simple terms?

An embedding is a list of numbers that represents the meaning of a piece of text (or an image, or audio, etc.) in a way computers can do math on. Think of it as a coordinate in a very high-dimensional space — pieces of content with similar meanings end up near each other in that space.

What's the difference between an embedding and a vector?

Mathematically, an embedding IS a vector — they're the same thing. But the word 'embedding' specifically means a vector that was produced by a model trained to put semantically similar inputs near each other. So all embeddings are vectors; not all vectors are embeddings.

Why are embeddings used for search?

Traditional keyword search only finds documents that contain the exact words in your query. Embedding-based search (semantic search) compares the MEANING of your query to the meaning of each document, so it finds documents that answer your question even when they don't share any of your exact words. That's why a query like 'how do I cancel my subscription' can find a doc titled 'closing your account.'

How big are embeddings?

Modern embedding models produce vectors with somewhere between 256 and 3,072 dimensions (numbers per vector). OpenAI's text-embedding-3-small uses 1,536 dimensions; text-embedding-4 uses up to 3,072; Voyage's large model uses 1,024; the open-weights BGE-M3 uses 1,024. Higher dimensionality captures more nuance but costs more to store and search.

Which embedding model should I use in 2026?

For most production use cases: OpenAI text-embedding-4 if you want the easiest integration with the OpenAI stack, Voyage voyage-large-3 if you want the best raw retrieval quality on the MTEB leaderboard, Cohere Embed v4 if you need strong multilingual coverage at a major-vendor price, or BGE-M3 if you want open-weights so you can self-host or run on-prem.

Do embeddings hallucinate?

Not in the sense LLMs do. An embedding model produces a deterministic numerical representation of its input — it isn't generating text, so there's nothing to invent. But embedding-based SEARCH can return wrong-looking results: if the model was trained on data unlike yours, it can place obviously unrelated things close in vector space. The fix is to evaluate retrieval quality on your actual data before trusting it.

About Aditya Marin Gasga

Founding Editor

Aditya covers the whole AI surface area for Signal — frontier models, agent infrastructure, the economics of inference, and the policy decisions that quietly shape what everyone else can build. He writes for operators who need a calibrated view of what's actually shipping versus what's keynote theatre.

  • Founder of Signal; sets the publication's editorial line
  • A decade across product, growth, and AI tooling at venture-backed startups
  • Reads the model release notes, the system cards, and the benchmark papers — and tells you which ones matter