ConstellationBench · 7 Benchmarks · March 2026

We ran 5,800 LLM calls across 15 models and 7 benchmarks.

ConstellationBench measures what existing benchmarks don't — can an AI sustain a consistent behavioral persona, enforce governance policies in character, and recall session context without hallucinating? We tested every model we could find. Budget models won.

7 benchmarks · 5,800+ LLM calls · 15 models tested · $22.95 total cost

The quality-cost tradeoff

Each dot is a model: the y-axis is quality, the x-axis is cost, and dot size is latency.

[Scatter plot: quality score (0.54–0.59, y-axis) vs. cost per council ($0.00–$0.17, x-axis)]

8% quality spread. 857x cost spread.
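The interactive chart doesn't survive in text, but it's easy to reproduce. Here is a minimal matplotlib sketch built from the leaderboard data below (the numbers come straight from the table; only the styling is improvised):

```python
# Reproduce the quality-vs-cost scatter: y = quality, x = cost, dot size = latency.
import matplotlib.pyplot as plt

models = [
    # (name, quality, cost per council in USD, latency in seconds)
    ("Opus 4.6",         0.589, 0.1714, 28.2),
    ("Sonnet 4.6",       0.578, 0.0298, 12.6),
    ("Gemini 2.5 Pro",   0.578, 0.0741, 22.8),
    ("Gemini 2.5 Flash", 0.577, 0.0036,  3.4),
    ("Kimi K2.5",        0.575, 0.0162, 98.1),
    ("Haiku 4.5",        0.570, 0.0074,  6.5),
    ("Grok 4.1 Fast",    0.569, 0.0023, 14.3),
    ("Mistral Large",    0.569, 0.0023,  6.4),
    ("Nemotron 120B",    0.566, 0.0,    51.3),  # listed as FREE
    ("Grok 3 Mini",      0.565, 0.0026, 15.6),
    ("Qwen3 235B",       0.558, 0.0002, 18.2),
    ("DeepSeek R1",      0.552, 0.0078, 25.7),
    ("DeepSeek V3",      0.543, 0.0011,  6.9),
    ("GPT-4o",           0.541, 0.0095,  2.7),
]

names, quality, cost, latency = zip(*models)
plt.scatter(cost, quality, s=[sec * 10 for sec in latency], alpha=0.6)
for name, x, y in zip(names, cost, quality):
    plt.annotate(name, (x, y), fontsize=7)
plt.xlabel("Cost per council (USD)")
plt.ylabel("Quality score")
plt.title("8% quality spread, 857x cost spread")
plt.show()
```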

Key findings

Quality is nearly flat across the price spectrum

The gap between the best model (0.589) and the 14th-ranked model (0.541) is just 8.1%. But cost spans 857x. You're paying orders of magnitude more for marginal quality gains.

9 models run a council for under a penny

At $0.0002 to $0.0095 per council run, multi-agent deliberation is no longer a luxury feature. It's infrastructure-cheap — comparable to a database query, not a consulting engagement.

Gemini Flash changed everything

Scoring 0.577 at $0.0036 and 3.4 seconds, it delivers 97.9% of the best model's quality at 2.1% of the cost. This single model makes free-tier multi-agent AI economically viable.

Not all models can roleplay

GPT-4o and DeepSeek V3 produce uniform conviction scores (all personas agree). Anthropic models and Grok produce the widest conviction ranges — genuine disagreement between personas. Multi-agent systems need models that differentiate, not just generate.
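ConstellationBench doesn't publish its exact differentiation metric, but the idea reduces to a simple statistic: the spread of conviction scores within a single council run. A minimal sketch, assuming (hypothetically) that each perspective carries a `conviction` value in [0, 1]:

```python
# Hypothetical structure: the benchmark's real field names aren't published.
def conviction_spread(perspectives: list[dict]) -> float:
    """Max-minus-min conviction across one council's personas."""
    scores = [p["conviction"] for p in perspectives]
    return max(scores) - min(scores)

# Uniform conviction (every persona at 0.8) yields 0.0: no real disagreement.
# A differentiated council yields a wide spread:
council = [
    {"persona": "skeptic",  "conviction": 0.35},
    {"persona": "champion", "conviction": 0.90},
    {"persona": "analyst",  "conviction": 0.60},
    {"persona": "operator", "conviction": 0.55},
]
print(conviction_spread(council))  # 0.55: personas genuinely disagree
```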

8 of 14 models hit 100% structured output

Each maintained perfect JSON compliance across all 120 of its perspective responses. The structured output problem is effectively solved for the majority of frontier models; multi-agent orchestration is reliable.
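For anyone reproducing the measurement, a compliance check of this kind reduces to parse-and-validate per response. A minimal sketch; the required keys are hypothetical, since the benchmark's real schema isn't published:

```python
import json

REQUIRED_KEYS = {"persona", "position", "conviction"}  # assumed schema

def is_compliant(raw: str) -> bool:
    """True if the raw model output parses as a JSON object with the expected keys."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys()

def compliance_rate(outputs: list[str]) -> float:
    """Fraction of a model's responses (e.g. its 120 perspectives) that comply."""
    return sum(is_compliant(o) for o in outputs) / len(outputs)
```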

The full leaderboard

All data comes from actual benchmark runs, not synthetic tests.

| # | Model | Quality | Cost/Council | Latency | JSON | Provider |
|---|-------|---------|--------------|---------|------|----------|
| 1 | Opus 4.6 (The Heavyweight) | 0.589 | $0.1714 | 28.2s | 99.2% | Anthropic |
| 2 | Sonnet 4.6 (The Professional) | 0.578 | $0.0298 | 12.6s | 100% | Anthropic |
| 3 | Gemini 2.5 Pro (The Essayist) | 0.578 | $0.0741 | 22.8s | 100% | Google |
| 4 | Gemini 2.5 Flash (The Flash) | 0.577 | $0.0036 | 3.4s | 100% | Google |
| 5 | Kimi K2.5 (The Scholar) | 0.575 | $0.0162 | 98.1s | 99.2% | Moonshot |
| 6 | Haiku 4.5 (The Prodigy) | 0.570 | $0.0074 | 6.5s | 96.7% | Anthropic |
| 7 | Grok 4.1 Fast (The Maverick) | 0.569 | $0.0023 | 14.3s | 100% | xAI |
| 8 | Mistral Large (The Diplomat) | 0.569 | $0.0023 | 6.4s | 98.3% | Mistral |
| 9 | Nemotron 120B (The Phantom) | 0.566 | FREE | 51.3s | 99.2% | NVIDIA |
| 10 | Grok 3 Mini (The Scrapper) | 0.565 | $0.0026 | 15.6s | 100% | xAI |
| 11 | Qwen3 235B (The Ghost) | 0.558 | $0.0002 | 18.2s | 97.5% | Alibaba |
| 12 | DeepSeek R1 (The Thinker) | 0.552 | $0.0078 | 25.7s | 100% | DeepSeek |
| 13 | DeepSeek V3 (The Intern) | 0.543 | $0.0011 | 6.9s | 100% | DeepSeek |
| 14 | GPT-4o (The Speed Demon) | 0.541 | $0.0095 | 2.7s | 100% | OpenAI |

7 Benchmarks · 15 Models · 5,800+ LLM Calls

Choose your champion

Every model has strengths. The budget heroes outperform the expensive ones on 6 of 7 benchmarks. Radar charts show performance across OttoTau (policy enforcement), Persona Fidelity, Session Recall, Cold Read, Voice Drift, and Bench Core (deliberation quality).

Total benchmark cost: $22.95 — less than a single Devin session.

TOP PICK

The Scholar

Kimi K2.5 · Moonshot

BUDGET · All-Rounder · $0.0047/task · 213 tasks/$1
Benchmark wins: OttoTau, Cold Read

Wins 2 of 7 benchmarks. Best policy enforcement of any model tested. The quiet overachiever.

BEST VALUE

The Scrapper

Grok 3 Mini · xAI

BUDGET · All-Rounder · $0.0013/task · 769 tasks/$1
Benchmark win: Voice Drift

Best persona stability over 10-turn conversations. Holds character when others crack. 769 tasks per dollar.

2,500 TASKS/$1

The Phantom

DeepSeek V3 · DeepSeek

BUDGET · Budget Hero · $0.00040/task · 2,500 tasks/$1

2,500 tasks per dollar, third-best cost efficiency of any model tested. No benchmark wins, but no weaknesses either. The workhorse.

PERSONA KING

The Flash

Gemini 2.5 Flash · Google

BUDGET · Specialist · $0.00050/task · 2,000 tasks/$1
Benchmark win: Persona Fidelity

Best persona differentiation. Produces genuinely different voices for each profile. Fast and cheap.

The Ghost

Qwen3 235B · Alibaba

BUDGET · Budget Hero · $0.00006/task · 17K tasks/$1
Benchmark win: Cost per Lifecycle

16,667 tasks per dollar. Cheapest model that still performs. The invisible infrastructure play.

The Prodigy

Haiku 4.5 · Anthropic

MID-TIER · Support · $0.0036/task · 278 tasks/$1
Benchmark win: Session Recall

Best memory. Highest session recall of any model. Never hallucinated. The reliable teammate.

The Heavyweight

Opus 4.6 · Anthropic

FRONTIER · Tank · $0.1109/task · 9 tasks/$1
Benchmark win: Bench Core

Highest raw deliberation quality. But costs 23x more than the #1 model and wins only 1 of 7 benchmarks.

The Default

GPT-4o · OpenAI

MID-TIER · Glass Cannon · $0.0045/task · 222 tasks/$1

The market default is dead last. Wins zero benchmarks. Near-bottom in 4 of 7. The emperor has no clothes.

All 14 models

Core bench scores for every model tested.

BEST VALUE

The Flash

Gemini 2.5 Flash · Google

Quality: 0.577
Cost/Council: $0.0036
Latency: 3.4s
JSON: 100%

Fast and smart. The new default for production councils.

MOST RELIABLE

The Professional

Sonnet 4.6 · Anthropic

Quality: 0.578
Cost/Council: $0.0298
Latency: 12.6s
JSON: 100%

Never fails. 100% JSON. Tight latency. The professional choice.

HIGHEST QUALITY

The Heavyweight

Opus 4.6 · Anthropic

Quality: 0.589
Cost/Council: $0.1714
Latency: 28.2s
JSON: 99.2%

When the decision matters most. Deepest deliberation quality.

Methodology

30

Queries

8 discover, 8 build, 7 ship, 7 audit — spanning enterprise data-ops scenarios

15

Models

From 9 providers: Anthropic, Google, OpenAI, xAI, DeepSeek, Mistral, Moonshot, Alibaba (Qwen), NVIDIA

4

Council size

Personas per council, each with distinct DECF behavioral profiles (Dominance, Extraversion, Patience, Formality)

1,800

Total perspectives

Individual AI perspective responses scored across 4 dimensions

4

Scoring dimensions

Persona adherence (30%), deliberation diversity (25%), response quality (25%), JSON compliance (20%)

$22.95

Total cost

For the entire 450-run benchmark suite
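The composite quality score follows directly from the weights above. A sketch, assuming each component scorer is normalized to [0, 1] (the component scorers themselves aren't published):

```python
# Weights as stated in the methodology; components assumed to be in [0, 1].
WEIGHTS = {
    "persona_adherence":      0.30,
    "deliberation_diversity": 0.25,
    "response_quality":       0.25,
    "json_compliance":        0.20,
}

def composite(scores: dict[str, float]) -> float:
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# Worked example: 0.30*0.60 + 0.25*0.55 + 0.25*0.58 + 0.20*1.00 = 0.6625
print(composite({
    "persona_adherence": 0.60,
    "deliberation_diversity": 0.55,
    "response_quality": 0.58,
    "json_compliance": 1.00,
}))
```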

Try it yourself

ConstellationBench is open. Run the same queries against any model on OpenRouter and compare your results.
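Every model in the leaderboard is reachable through OpenRouter's OpenAI-compatible endpoint, so a rerun takes only a few lines. A minimal sketch; the model slug and prompt are illustrative:

```python
from openai import OpenAI

# OpenRouter speaks the OpenAI chat-completions protocol at this base URL.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",
)

resp = client.chat.completions.create(
    model="google/gemini-2.5-flash",  # swap in any slug from openrouter.ai/models
    messages=[
        {"role": "system",
         "content": "You are one council persona: high dominance, low formality. Respond in JSON."},
        {"role": "user",
         "content": "Should we migrate the warehouse pipeline this quarter?"},
    ],
)
print(resp.choices[0].message.content)
```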

ConstellationBench data last updated March 12, 2026. All costs reflect OpenRouter API pricing at time of benchmark. Scores are composite measures of persona adherence, deliberation diversity, response quality, and JSON compliance.