The first version of Signal's model cost calculator only included Anthropic, OpenAI, Google, and xAI, a US-centric view that understated what 'cheap' actually means in 2026. DeepSeek V4 is 5-10× cheaper than Sonnet on comparable workloads; Qwen 3.5 Max matches Sonnet quality at roughly one-sixth the price; Llama 4 is the open-weights default; Mistral Large 3 is the European pick for data residency. All four are now in the calculator and picker.
Key takeaways
- Signal's calculator shipped missing DeepSeek, Qwen, Llama, and Mistral — a real gap, not a small one.
- DeepSeek V4 at roughly $0.20/$0.80/Mtok is the cheapest viable frontier model on any honest chart, and disrupts pricing assumptions even at the Opus tier.
- Qwen 3.5 Max matches Sonnet on most reasoning + writing workloads at roughly one-sixth the cost; APAC-facing or multilingual teams should default to it.
- Llama 4 405B is structurally different — open-weights — and that matters more than the price gap when you need on-prem, air-gapped, or provider-portable deployment.
- Mistral Large 3 is the answer for EU data residency requirements or for any team that explicitly wants a non-US model in their stack.
A reader pointed out that the model cost calculator and the picker we shipped last week only included American frontier models — Anthropic, OpenAI, Google, and xAI. They asked the obvious question: why no DeepSeek? No Qwen? No Llama? No Mistral?
The honest answer is that it was an oversight, not a deliberate choice. The initial pricing data came together around the models I’d already written articles about, and that list ran US-only. When you build something fast, the defaults you don’t question tend to be the ones that show up in the output. In a publication that explicitly claims to cover AI “as a whole,” shipping a frontier comparison missing four of the most important vendors is a real miss.
Both tools are updated now. DeepSeek V4, Qwen 3.5 Max, Llama 4 405B, and Mistral Large 3 are in the calculator and the picker as of this morning. Here’s what the corrected chart actually shows, and when each of them is the right pick.
What the calculator looked like before vs. after
Before, the cheapest line on the chart was Gemini 3.5 Flash at roughly $0.15 / $0.60 per Mtok input/output. On a representative monthly workload of 10M input and 2M output tokens at a 60% cache hit rate, that came to about $2 per month, the low-end anchor against which Sonnet ($44) and Opus ($73) looked expensive.
After adding the four new models, the same workload reshuffles the whole field — the chart at the top of this piece has the full ranking.
Two things jump out. First: DeepSeek V4 is essentially tied with Gemini 3.5 Flash at the bottom of the chart, which is the single most important fact about 2026 model pricing that wasn’t visible in the original chart. Second: Opus 4.8 isn’t the runaway outlier its headline price implies. At a realistic 60% cache hit rate its effective input rate is about $2.30/Mtok, which puts it alongside GPT-5.5 and Gemini 2.5 Ultra at the high end of the chart (about 36× the cheapest option) rather than far above them.
DeepSeek V4: the cost-disruptive frontier
DeepSeek’s pricing isn’t just “cheap”; it’s structurally cheap in a way that resets baseline expectations. At roughly $0.20/Mtok input and $0.80/Mtok output (with cached input around $0.05/Mtok), DeepSeek V4 costs 5-10× less than Sonnet for comparable quality on coding and math benchmarks. On problems where Opus is genuinely the right call (multi-step agentic work with hard correctness requirements), DeepSeek still trails. But the quality gap is much smaller than the price gap suggests, and for the large category of workloads that are “code-shaped, reasoning-shaped, or math-shaped” without requiring frontier-level performance, DeepSeek is now the rational default.
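For anyone who wants to check the arithmetic, here is a minimal Python sketch of how a workload cost can be computed from the three DeepSeek V4 rates quoted above. It is not the calculator's actual code: the `Pricing` class and both function names are hypothetical, and it assumes cached input tokens are billed at the cached rate while the rest of the input is billed at the full rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token (Mtok) rates in USD. Hypothetical schema."""
    input_per_mtok: float
    cached_input_per_mtok: float
    output_per_mtok: float


def effective_input_rate(p: Pricing, cache_hit_rate: float) -> float:
    """Blend the cached and uncached input rates by the cache hit rate."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    uncached_share = 1.0 - cache_hit_rate
    return (uncached_share * p.input_per_mtok
            + cache_hit_rate * p.cached_input_per_mtok)


def workload_cost(p: Pricing, input_mtok: float, output_mtok: float,
                  cache_hit_rate: float) -> float:
    """Total USD cost for a workload measured in millions of tokens."""
    input_cost = input_mtok * effective_input_rate(p, cache_hit_rate)
    output_cost = output_mtok * p.output_per_mtok
    return input_cost + output_cost


# Approximate DeepSeek V4 rates as quoted in this piece.
DEEPSEEK_V4 = Pricing(input_per_mtok=0.20,
                      cached_input_per_mtok=0.05,
                      output_per_mtok=0.80)

# The representative workload: 10M input + 2M output at a 60% cache hit rate.
cost = workload_cost(DEEPSEEK_V4, input_mtok=10, output_mtok=2,
                     cache_hit_rate=0.60)
print(f"${cost:.2f}")  # prints $2.70
```

Under those assumptions the representative workload comes to about $2.70, with an effective input rate of about $0.11/Mtok. The same blending of cached and uncached rates is what "effective input rate" refers to elsewhere in this piece.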
The political question (using a Chinese model in a US product) is real but increasingly bounded. For most production workloads — internal tools, RAG over public docs, content generation, code review — there’s no defensible argument that the model’s training origin matters. For workloads involving sensitive data, regulated industries, or government customers, it’s a different conversation. DeepSeek doesn’t disappear from your stack; it sits in the appropriate slot.
Qwen 3.5 Max: the Sonnet alternative at one-sixth the price
If DeepSeek is “shockingly cheap for what you get,” Qwen 3.5 Max is “Sonnet-class quality for a fraction of what Sonnet costs.” At roughly $0.50/$2.00 per Mtok, it sits in a price range with very few peers — the only US model anywhere close on a quality-adjusted basis is Sonnet itself, which is 4-6× more expensive.
Two specific cases where Qwen is the right default. First: any application with a meaningful APAC user base — Qwen’s multilingual handling, especially across CJK languages, is best-in-class. Second: workloads that previously made the obvious “Sonnet” call but are cost-sensitive at the margin. If you’re running Sonnet at $30K/month and the use case is general-purpose writing or reasoning (not the hardest agentic coding), Qwen 3.5 Max would cut that to roughly $5K with quality differences most users wouldn’t notice in production.
The honest weakness: Qwen’s tool-use and function-calling support is good but not Anthropic-level polished. For complex agent workloads, the integration cost may eat the savings. For everything else, the savings are real.
Llama 4 405B: structurally different, priced as one option among many
Llama is the only model on the updated chart that’s not a hosted API in the same sense as the others. The 405B model is open-weights — Meta releases the model, providers like Together / Fireworks / Replicate host inference, you pay them, and pricing varies by provider. The chart shows Together’s flat $2.50/$2.50 pricing because it’s the most common entry point, but it’s reasonable to mentally adjust that figure based on your deployment.
Llama’s strongest case is the case that the chart can’t really capture: it’s the only frontier model you can pick up and run yourself. On-prem, air-gapped, behind your own VPC, on hardware you control. For regulated industries (healthcare, finance, defense), that capability often trumps every per-token comparison on the chart. The price the chart shows is the price if you’re using it like a hosted API. The price if you’re self-hosting at meaningful scale is dominated by GPU economics, not by Together’s API pricing.
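One rough way to reason about where self-hosting starts to pay off is a break-even volume against the hosted rate. The sketch below is illustrative only. The $2.50/Mtok figure is the Together rate quoted above, but the monthly self-hosting cost is a made-up placeholder, not a measured number, and the function names are hypothetical. It also ignores utilization, engineering time, and throughput limits, all of which matter in practice.

```python
# Flat hosted rate quoted in this piece (same for input and output tokens).
TOGETHER_FLAT_RATE_PER_MTOK = 2.50

# HYPOTHETICAL placeholder: all-in monthly cost of running your own GPUs.
# Replace with your own figure; this is not real pricing data.
HYPOTHETICAL_SELF_HOST_COST_USD = 50_000.0


def hosted_monthly_cost(input_mtok: float, output_mtok: float,
                        rate_per_mtok: float = TOGETHER_FLAT_RATE_PER_MTOK) -> float:
    """Hosted cost when input and output tokens share one flat rate."""
    return (input_mtok + output_mtok) * rate_per_mtok


def breakeven_mtok(monthly_self_host_cost_usd: float,
                   rate_per_mtok: float = TOGETHER_FLAT_RATE_PER_MTOK) -> float:
    """Monthly volume (in Mtok) above which self-hosting is cheaper."""
    if rate_per_mtok <= 0:
        raise ValueError("rate_per_mtok must be positive")
    return monthly_self_host_cost_usd / rate_per_mtok


print(hosted_monthly_cost(input_mtok=10, output_mtok=2))  # 30.0
print(breakeven_mtok(HYPOTHETICAL_SELF_HOST_COST_USD))    # 20000.0
```

Because the hosted rate is flat, the input/output mix drops out of the comparison and only total token volume matters. With the placeholder cost above, the break-even sits at 20,000 Mtok (20B tokens) per month, which is why the per-token price on the chart says little about the self-hosting decision.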
For most readers of this piece, the right way to think about Llama 4 is: it’s the option you have when the other nine on the chart don’t fit because of where the data has to live, not because of how much the tokens cost.
Mistral Large 3: the European pick
Mistral Large 3 sits in the middle of the chart on every dimension. It’s not the cheapest, not the smartest, not the fastest. It’s in the picker not because it wins on any single axis, but because for teams subject to EU data residency rules, or for teams that explicitly want a non-US model in their stack for risk-diversification or political reasons, it’s often the only option that clears those constraints.
If neither of those constraints applies to you, Mistral Large 3 isn’t usually the answer. The US models already on the chart almost always win on raw quality-per-dollar. But the constraint comes up more often than US-centric writing tends to acknowledge.
What this means for the picker’s recommendations
The picker’s scoring rules were retuned to incorporate the new models. The changes that matter:
- For coding-agentic + cheapest-viable: previously Gemini 3.5 Flash won. Now DeepSeek V4 is the strong primary, with Flash as the alternate. DeepSeek’s coding strength at its price point genuinely changes that recommendation.
- For writing-reasoning + balanced: previously Sonnet 4.6 won decisively. It still wins, but Qwen 3.5 Max is now the alternate at a much cheaper price point, and for cost-sensitive teams the gap may justify switching.
- For any answer + special: massive-context: Gemini 2.5 Ultra still wins; nothing dethrones it for 500K+ context workloads.
- For any answer + special: multimodal: GPT-5.5 still leads but Qwen 3.5 Max is now a strong alternate, especially when the multimodal content involves non-English text.
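The four recommendations above can be read as a small lookup, sketched below in Python. This is a simplification for illustration, not the picker's real implementation: the picker uses scoring rules, while this sketch hard-codes only the outcomes listed above. The `Recommendation` type, the `recommend` function, and the rule that a special requirement overrides the workload answer are assumptions.

```python
from typing import NamedTuple, Optional


class Recommendation(NamedTuple):
    primary: str
    alternate: Optional[str]


# Special requirements apply regardless of the other answers.
SPECIAL_RULES = {
    "massive-context": Recommendation("Gemini 2.5 Ultra", None),
    "multimodal": Recommendation("GPT-5.5", "Qwen 3.5 Max"),
}

# (workload, priority) pairs covered by the list above; others are omitted.
BASE_RULES = {
    ("coding-agentic", "cheapest-viable"):
        Recommendation("DeepSeek V4", "Gemini 3.5 Flash"),
    ("writing-reasoning", "balanced"):
        Recommendation("Sonnet 4.6", "Qwen 3.5 Max"),
}


def recommend(workload: str, priority: str,
              special: Optional[str] = None) -> Optional[Recommendation]:
    """Return the listed recommendation, or None if this sketch has no rule."""
    if special in SPECIAL_RULES:
        return SPECIAL_RULES[special]
    return BASE_RULES.get((workload, priority))


print(recommend("coding-agentic", "cheapest-viable"))
print(recommend("writing-reasoning", "balanced", special="multimodal"))
```

Returning `None` for combinations the list doesn't mention keeps the sketch from implying recommendations this piece never made.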
Run the picker on your actual workload to see how the addition of these models affects your specific recommendation. For many workloads I’ve already audited internally, the new winner is either DeepSeek or Qwen — and the previous winner was rarely wrong, just more expensive than it needed to be.
What we still don’t cover
Two notable absences I want to be explicit about. The calculator and picker still don’t include:
- AI agents (Manus, Devin, Operator, Replit Agent). These are products built on top of base models, not base models themselves. They have their own pricing structures (usually per-task or per-month, not per-token), and a direct cost comparison with raw API pricing is apples-to-oranges. They belong in a separate piece. Coverage is on the roadmap.
- Specialized models (Cohere Command R, Mistral Codestral, Voyage embeddings, ElevenLabs). Each has a specific niche where it’s the right call (Codestral for code completion specifically, Voyage for embeddings, etc.), but they don’t slot into “default chat / agent / RAG” workloads the same way the ten models on the chart do. We may add a “specialized models” companion tool if reader demand suggests it.
If there’s another model you think should be on the chart that isn’t — say so. Both tools are designed to be extended; updates are five-minute changes to a typed data file. The original miss happened because nobody outside told us early enough, and once you stop iterating with critical readers, your defaults calcify. So: tell us what’s wrong.
The chart is now what it should have been at launch. Thank you to the reader who pushed back.