The Small Model Squeeze
Why budget LLMs face an existential competitive threat
"One big problem with small models is that the cheaper BIG models are just unbelievably inexpensive - realistically the competition is things like Gemini 2.5 Flash-Lite, and that model will process 1 billion tokens for $100"
- Gemini 2.5 Flash-Lite: $75 per 1 billion input tokens
- True "small" models: $35-50 per 1 billion input tokens (ministral-3b, llama-3.1-8b, nova-micro)
- Price advantage of going "small": only 1.5-2x cheaper than a capable model
- Capability gap: 10-50x parameter-count difference (3B vs 100B+)
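The per-token prices above scale linearly, so converting a per-1M-token rate to a cost per billion tokens is a single multiplication. A minimal sketch, using the $0.075/1M input rate for Gemini 2.5 Flash-Lite cited in this analysis:

```python
# Unit sanity check: per-1M-token pricing scales linearly to a per-1B cost.
price_per_million = 0.075       # $/1M input tokens (Gemini 2.5 Flash-Lite)
tokens = 1_000_000_000          # 1 billion input tokens
cost = price_per_million * tokens / 1_000_000
print(f"${cost:.0f}")  # $75
```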
📊 The Competitive Landscape (1B input + 100M output tokens)

| Model | Tier | Total Cost |
| --- | --- | --- |
| ministral-3b | Tiny (3B) | $44 |
| gemini-1.5-flash-8b | Tiny (8B) | $53 |
| llama-3.1-8b | Tiny (8B) | $58 |
| gemini-2.5-flash-lite ⭐ | Capable (~100B+) | $105 |
| gpt-4.1-nano | Small | $140 |
| gemini-2.5-flash ⭐ | Capable (~100B+) | $210 |
| claude-3-haiku | Small (~20B) | $375 |
| deepseek-v3 | Capable (MoE) | $380 |
| gpt-4.1-mini | Small-Mid | $560 |
| claude-haiku-4 | Small-Mid | $1,200 |
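The totals in the table are blended costs: input tokens and output tokens billed at different per-million rates. A minimal sketch of that arithmetic for Gemini 2.5 Flash-Lite, using the $0.075/1M input rate stated in this analysis and an output rate back-solved from the table's $105 total (($105 - $75) / 100M = $0.30/1M, an inferred figure, not a quoted price):

```python
def total_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Blended workload cost in dollars, given per-million-token prices."""
    return (input_tokens / 1e6) * in_price_per_m + (output_tokens / 1e6) * out_price_per_m

# 1B input + 100M output tokens at Flash-Lite rates
cost = total_cost(1_000_000_000, 100_000_000, 0.075, 0.30)
print(f"${cost:.0f}")  # $105
```

Running the same function over each row's rates reproduces the table; the output rate matters less here because the workload is 10:1 input-heavy.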
Tiny models (3-8B params) at $44-58 (ministral-3b, llama-8b, flash-8b) ← 💀 → capable "lite" models at $105 (gemini-2.5-flash-lite): the narrow gap between them is the dead zone where mid-priced "small" models get squeezed.
🎯 Key Implications
- The "Small Model Tax" is tiny: You save only $50-60 going from a capable model (Gemini 2.5 Flash-Lite at $105) to a truly tiny model (ministral-3b at $44). That's roughly a 2x price difference for potentially a 10-50x capability loss.
- Mid-tier small models are squeezed out: Claude 3 Haiku at $375 and GPT-4.1-mini at $560 are now awkwardly positioned — they cost 3-5x more than Flash-Lite but aren't meaningfully more capable.
- The viable niches are shrinking: Small models only make sense for extreme cost sensitivity (sub-$50 budgets) or specific deployment constraints (edge, latency, privacy).
- Google's pricing is market-defining: Gemini 2.5 Flash-Lite at $0.075/1M input creates a ceiling that other "small" models must beat — and most can't while maintaining profitability.
- Open-source pressure compounds this: Llama 3.1 8B at $58 (hosted) can be run for near-zero marginal cost on-premise, further eroding the small model market.
🔮 Strategic Takeaways
- For builders: Default to capable "lite" models unless you have specific constraints. The quality-per-dollar is unbeatable.
- For small model providers: Compete on latency, deployment flexibility, or specialization — not price. The price war is unwinnable.
- For enterprises: The cost difference between "cheap" and "capable" at scale (billions of tokens) is now measured in tens of dollars, not thousands.
Analysis generated using LLM Cost Analysis Skill • Prices as of 2025-06