ConstellationBench · 7 Benchmarks · March 2026
We ran 5,800 LLM calls across 14 models and 7 benchmarks.
ConstellationBench measures what existing benchmarks don't — can an AI sustain a consistent behavioral persona, enforce governance policies in character, and recall session context without hallucinating? We tested every model we could find. Budget models won.
7 benchmarks · 5,800+ LLM calls · 14 models tested · $22.95 total cost
The quality-cost tradeoff
Each dot is a model. Y-axis is quality. X-axis is cost. Dot size is latency.
8% quality spread. 857x cost spread.
Key findings
Quality is nearly flat across the price spectrum
The gap between the best model (0.589) and the 14th-ranked model (0.541) is just 8.1%. But cost spans 857x. You're paying exponentially more for marginal quality gains.
9 paid models run a council for under a penny
At $0.0002 to $0.0095 per council run, multi-agent deliberation is no longer a luxury feature. It's infrastructure-cheap — comparable to a database query, not a consulting engagement.
Gemini Flash changed everything
Scoring 0.577 at $0.0036 and 3.4 seconds, it delivers 97.9% of the best model's quality at 2.1% of the cost. This single model makes free-tier multi-agent AI economically viable.
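These ratios come straight from the leaderboard table below. A minimal sanity-check sketch (all inputs are copied from that table; nothing is re-measured here):

```python
# Headline ratios, recomputed from leaderboard figures.
best_q, last_q = 0.589, 0.541               # Opus 4.6 vs GPT-4o (quality)
max_cost, min_paid_cost = 0.1714, 0.0002    # Opus 4.6 vs Qwen3 235B (USD per council)
flash_q, flash_cost = 0.577, 0.0036         # Gemini 2.5 Flash

print(f"quality spread: {(best_q - last_q) / best_q:.1%}")   # 8.1%
print(f"cost spread:    {max_cost / min_paid_cost:.0f}x")    # 857x
print(f"Flash vs best:  {flash_q / best_q:.1%} of the quality "
      f"at {flash_cost / max_cost:.1%} of the cost")          # ~98% at 2.1%
```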
Not all models can roleplay
GPT-4o and DeepSeek V3 produce uniform conviction scores (all personas agree). Anthropic models and Grok produce the widest conviction ranges — genuine disagreement between personas. Multi-agent systems need models that differentiate, not just generate.
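To make "differentiate" concrete, the simplest version of the idea is the spread between the most and least convinced persona in a single council run. The sketch below is illustrative only: the persona names and numbers are invented, and the benchmark's actual conviction metric may be defined differently.

```python
# Illustrative "conviction range" for one council run (hypothetical data).
def conviction_range(scores: dict[str, float]) -> float:
    """Spread between the most and least convinced persona (0 = everyone agrees)."""
    return max(scores.values()) - min(scores.values())

uniform = {"skeptic": 0.70, "optimist": 0.70, "pragmatist": 0.70}
differentiated = {"skeptic": 0.25, "optimist": 0.90, "pragmatist": 0.55}

print(conviction_range(uniform))         # 0.0  -> personas collapse into one voice
print(conviction_range(differentiated))  # 0.65 -> genuine disagreement
```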
8 of 14 models hit 100% structured output
JSON compliance across 120 perspectives each. The structured output problem is effectively solved for the majority of frontier models. Multi-agent orchestration is reliable.
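For concreteness, a JSON-compliance check is roughly the sketch below: a perspective passes if the raw response parses as JSON and carries the fields the orchestrator needs. The required field names here are assumptions for the example, not the benchmark's actual schema.

```python
import json

REQUIRED_FIELDS = {"persona", "position", "conviction"}  # assumed fields for this example

def is_compliant(raw_response: str) -> bool:
    """True if the response is valid JSON with every required field present."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_FIELDS <= data.keys()

responses = [
    '{"persona": "skeptic", "position": "defer", "conviction": 0.4}',
    'Sure! Here is my answer: {"persona": "optimist"}',  # prose wrapper -> fails to parse
]
print(f"compliance: {sum(is_compliant(r) for r in responses) / len(responses):.0%}")  # 50%
```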
The full leaderboard
All data is from actual benchmark runs, not synthetic tests.
| # | Model | Quality | Cost/Council | Latency | JSON | Provider |
|---|---|---|---|---|---|---|
| 1 | Opus 4.6 (The Heavyweight) | 0.589 | $0.1714 | 28.2s | 99.2% | Anthropic |
| 2 | Sonnet 4.6 (The Professional) | 0.578 | $0.0298 | 12.6s | 100% | Anthropic |
| 3 | Gemini 2.5 Pro (The Essayist) | 0.578 | $0.0741 | 22.8s | 100% | Google |
| 4 | Gemini 2.5 Flash (The Flash ⚡) | 0.577 | $0.0036 | 3.4s | 100% | Google |
| 5 | Kimi K2.5 (The Scholar) | 0.575 | $0.0162 | 98.1s | 99.2% | Moonshot |
| 6 | Haiku 4.5 (The Prodigy) | 0.570 | $0.0074 | 6.5s | 96.7% | Anthropic |
| 7 | Grok 4.1 Fast (The Maverick) | 0.569 | $0.0023 | 14.3s | 100% | xAI |
| 8 | Mistral Large (The Diplomat) | 0.569 | $0.0023 | 6.4s | 98.3% | Mistral |
| 9 | Nemotron 120B (The Phantom) | 0.566 | FREE | 51.3s | 99.2% | NVIDIA |
| 10 | Grok 3 Mini (The Scrapper) | 0.565 | $0.0026 | 15.6s | 100% | xAI |
| 11 | Qwen3 235B (The Ghost) | 0.558 | $0.0002 | 18.2s | 97.5% | Alibaba |
| 12 | DeepSeek R1 (The Thinker) | 0.552 | $0.0078 | 25.7s | 100% | DeepSeek |
| 13 | DeepSeek V3 (The Intern) | 0.543 | $0.0011 | 6.9s | 100% | DeepSeek |
| 14 | GPT-4o (The Speed Demon) | 0.541 | $0.0095 | 2.7s | 100% | OpenAI |
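To re-cut the same rows by value instead of raw quality, a small sketch (numbers transcribed from the table above; the free tier is left out of the ratio):

```python
# Rank paid models by quality per dollar rather than raw quality.
leaderboard = [
    ("Opus 4.6", 0.589, 0.1714), ("Sonnet 4.6", 0.578, 0.0298),
    ("Gemini 2.5 Pro", 0.578, 0.0741), ("Gemini 2.5 Flash", 0.577, 0.0036),
    ("Kimi K2.5", 0.575, 0.0162), ("Haiku 4.5", 0.570, 0.0074),
    ("Grok 4.1 Fast", 0.569, 0.0023), ("Mistral Large", 0.569, 0.0023),
    ("Nemotron 120B", 0.566, None),  # free: no meaningful ratio
    ("Grok 3 Mini", 0.565, 0.0026), ("Qwen3 235B", 0.558, 0.0002),
    ("DeepSeek R1", 0.552, 0.0078), ("DeepSeek V3", 0.543, 0.0011),
    ("GPT-4o", 0.541, 0.0095),
]

paid = [(name, quality / cost) for name, quality, cost in leaderboard if cost]
for name, value in sorted(paid, key=lambda row: row[1], reverse=True)[:3]:
    print(f"{name}: {value:,.0f} quality points per dollar")
```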
7 Benchmarks · 14 Models · 5,800+ LLM Calls
Choose your champion
Every model has strengths. The budget heroes outperform the expensive ones on 6 of 7 benchmarks. Radar charts show performance across OttoTau (policy enforcement), Persona Fidelity, Session Recall, Cold Read, Voice Drift, and Bench Core (deliberation quality).
Total benchmark cost: $22.95 — less than a single Devin session.
The Scholar
Kimi K2.5 · Moonshot
“Wins 2 of 7 benchmarks. Best policy enforcement of any model tested. The quiet overachiever.”
The Scrapper
Grok 3 Mini · xAI
“Best persona stability over 10-turn conversations. Holds character when others crack. 769 tasks per dollar.”
The Intern
DeepSeek V3 · DeepSeek
“2,500 tasks per dollar. Third overall. No benchmark wins but no weaknesses either. The workhorse.”
The Flash
Gemini 2.5 Flash · Google
“Best persona differentiation. Produces genuinely different voices for each profile. Fast and cheap.”
The Ghost
Qwen3 235B · Alibaba
“16,667 tasks per dollar. Cheapest model that still performs. The invisible infrastructure play.”
The Prodigy
Haiku 4.5 · Anthropic
“Best memory. Highest session recall of any model. Never hallucinated. The reliable teammate.”
The Heavyweight
Opus 4.6 · Anthropic
“Highest raw deliberation quality. But costs 23x more than the #1 model and wins only 1 of 7 benchmarks.”
The Default
GPT-4o · OpenAI
“The market default is dead last. Wins zero benchmarks. Near-bottom in 4 of 7. The emperor has no clothes.”
All 14 models
Core bench scores for every model tested.
The Flash
Gemini 2.5 Flash · Google
“Fast and smart. The new default for production councils.”
The Professional
Sonnet 4.6 · Anthropic
“Never fails. 100% JSON. Tight latency. The professional choice.”
The Heavyweight
Opus 4.6 · Anthropic
“When the decision matters most. Deepest deliberation quality.”
Methodology
Try it yourself
ConstellationBench is open. Run the same queries against any model on OpenRouter and compare your results.
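A minimal way to do that from Python, using OpenRouter's OpenAI-compatible chat endpoint. The model slug and the prompt below are placeholders for illustration, not the benchmark's actual harness:

```python
import os
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "google/gemini-2.5-flash",  # placeholder slug; check OpenRouter's model list
        "messages": [
            {"role": "system", "content": "You are the Skeptic persona on a review council. Reply in JSON."},
            {"role": "user", "content": "Should we ship this release? Give persona, position, and conviction."},
        ],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```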
ConstellationBench data last updated March 12, 2026. All costs reflect OpenRouter API pricing at time of benchmark. Scores are composite measures of persona adherence, deliberation diversity, response quality, and JSON compliance.
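For readers who want a concrete picture of how a composite of that shape combines, here is an equal-weight sketch. The component names follow the note above, but the weights are an assumption made for illustration, not the benchmark's published weighting.

```python
# Hypothetical equal weights; the real weighting may differ.
WEIGHTS = {
    "persona_adherence": 0.25,
    "deliberation_diversity": 0.25,
    "response_quality": 0.25,
    "json_compliance": 0.25,
}

def composite(components: dict[str, float]) -> float:
    """Weighted mean of the four component scores."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)

print(composite({
    "persona_adherence": 0.62,
    "deliberation_diversity": 0.48,
    "response_quality": 0.71,
    "json_compliance": 1.00,
}))  # 0.7025
```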