ConstellationBench · 7 Benchmarks · March 2026
We ran 5,800 LLM calls across 14 models and 7 benchmarks.
ConstellationBench measures what existing benchmarks don't — can an AI sustain a consistent behavioral persona, enforce governance policies in character, and recall session context without hallucinating? We tested every model we could find. Budget models won.
7 benchmarks · 5,800+ LLM calls · 14 models tested · $22.95 total cost
The quality-cost tradeoff
Each dot is a model. Y-axis is quality. X-axis is cost. Dot size is latency.
8% quality spread. 857x cost spread.
Key findings
Quality is nearly flat across the price spectrum
The gap between the best model (0.589) and the 14th-ranked model (0.541) is just 8.1%. But cost spans 857x. You're paying exponentially more for marginal quality gains.
9 paid models run a council for under a penny
At $0.0002 to $0.0095 per council run, multi-agent deliberation is no longer a luxury feature. It's infrastructure-cheap — comparable to a database query, not a consulting engagement.
Gemini Flash changed everything
Scoring 0.577 at $0.0036 and 3.4 seconds, it delivers 97.9% of the best model's quality at 2.1% of the cost. This single model makes free-tier multi-agent AI economically viable.
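These ratios come straight from the leaderboard table below. A minimal sanity-check sketch (all inputs are copied from that table; nothing is re-measured here):

```python
# Headline ratios, recomputed from leaderboard figures.
best_q, last_q = 0.589, 0.541               # Opus 4.6 vs GPT-4o (quality)
max_cost, min_paid_cost = 0.1714, 0.0002    # Opus 4.6 vs Qwen3 235B (USD per council)
flash_q, flash_cost = 0.577, 0.0036         # Gemini 2.5 Flash

print(f"quality spread: {(best_q - last_q) / best_q:.1%}")   # 8.1%
print(f"cost spread:    {max_cost / min_paid_cost:.0f}x")    # 857x
print(f"Flash vs best:  {flash_q / best_q:.1%} of the quality "
      f"at {flash_cost / max_cost:.1%} of the cost")          # ~98% at 2.1%
```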
Not all models can roleplay
GPT-4o and DeepSeek V3 produce uniform conviction scores (all personas agree). Anthropic models and Grok produce the widest conviction ranges — genuine disagreement between personas. Multi-agent systems need models that differentiate, not just generate.
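To make "differentiate" concrete, the simplest version of the idea is the spread between the most and least convinced persona in a single council run. The sketch below is illustrative only: the persona names and numbers are invented, and the benchmark's actual conviction metric may be defined differently.

```python
# Illustrative "conviction range" for one council run (hypothetical data).
def conviction_range(scores: dict[str, float]) -> float:
    """Spread between the most and least convinced persona (0 = everyone agrees)."""
    return max(scores.values()) - min(scores.values())

uniform = {"skeptic": 0.70, "optimist": 0.70, "pragmatist": 0.70}
differentiated = {"skeptic": 0.25, "optimist": 0.90, "pragmatist": 0.55}

print(conviction_range(uniform))         # 0.0  -> personas collapse into one voice
print(conviction_range(differentiated))  # 0.65 -> genuine disagreement
```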
8 of 14 models hit 100% structured output
JSON compliance across 120 perspectives each. The structured output problem is effectively solved for the majority of frontier models. Multi-agent orchestration is reliable.
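For concreteness, a JSON-compliance check is roughly the sketch below: a perspective passes if the raw response parses as JSON and carries the fields the orchestrator needs. The required field names here are assumptions for the example, not the benchmark's actual schema.

```python
import json

REQUIRED_FIELDS = {"persona", "position", "conviction"}  # assumed fields for this example

def is_compliant(raw_response: str) -> bool:
    """True if the response is valid JSON with every required field present."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_FIELDS <= data.keys()

responses = [
    '{"persona": "skeptic", "position": "defer", "conviction": 0.4}',
    'Sure! Here is my answer: {"persona": "optimist"}',  # prose wrapper -> fails to parse
]
print(f"compliance: {sum(is_compliant(r) for r in responses) / len(responses):.0%}")  # 50%
```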
The full leaderboard
All data is from actual benchmark runs, not synthetic tests.
| # | Model | Quality | Cost/Council | Latency | JSON | Provider |
|---|---|---|---|---|---|---|
| 1 | Opus 4.6 (The Heavyweight) | 0.589 | $0.1714 | 28.2s | 99.2% | Anthropic |
| 2 | Sonnet 4.6 (The Professional) | 0.578 | $0.0298 | 12.6s | 100% | Anthropic |
| 3 | Gemini 2.5 Pro (The Essayist) | 0.578 | $0.0741 | 22.8s | 100% | Google |
| 4 | Gemini 2.5 Flash (The Flash ⚡) | 0.577 | $0.0036 | 3.4s | 100% | Google |
| 5 | Kimi K2.5 (The Scholar) | 0.575 | $0.0162 | 98.1s | 99.2% | Moonshot |
| 6 | Haiku 4.5 (The Prodigy) | 0.570 | $0.0074 | 6.5s | 96.7% | Anthropic |
| 7 | Grok 4.1 Fast (The Maverick) | 0.569 | $0.0023 | 14.3s | 100% | xAI |
| 8 | Mistral Large (The Diplomat) | 0.569 | $0.0023 | 6.4s | 98.3% | Mistral |
| 9 | Nemotron 120B (The Phantom) | 0.566 | FREE | 51.3s | 99.2% | NVIDIA |
| 10 | Grok 3 Mini (The Scrapper) | 0.565 | $0.0026 | 15.6s | 100% | xAI |
| 11 | Qwen3 235B (The Ghost) | 0.558 | $0.0002 | 18.2s | 97.5% | Alibaba |
| 12 | DeepSeek R1 (The Thinker) | 0.552 | $0.0078 | 25.7s | 100% | DeepSeek |
| 13 | DeepSeek V3 (The Intern) | 0.543 | $0.0011 | 6.9s | 100% | DeepSeek |
| 14 | GPT-4o (The Speed Demon) | 0.541 | $0.0095 | 2.7s | 100% | OpenAI |
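To re-cut the same rows by value instead of raw quality, a small sketch (numbers transcribed from the table above; the free tier is left out of the ratio):

```python
# Rank paid models by quality per dollar rather than raw quality.
leaderboard = [
    ("Opus 4.6", 0.589, 0.1714), ("Sonnet 4.6", 0.578, 0.0298),
    ("Gemini 2.5 Pro", 0.578, 0.0741), ("Gemini 2.5 Flash", 0.577, 0.0036),
    ("Kimi K2.5", 0.575, 0.0162), ("Haiku 4.5", 0.570, 0.0074),
    ("Grok 4.1 Fast", 0.569, 0.0023), ("Mistral Large", 0.569, 0.0023),
    ("Nemotron 120B", 0.566, None),  # free: no meaningful ratio
    ("Grok 3 Mini", 0.565, 0.0026), ("Qwen3 235B", 0.558, 0.0002),
    ("DeepSeek R1", 0.552, 0.0078), ("DeepSeek V3", 0.543, 0.0011),
    ("GPT-4o", 0.541, 0.0095),
]

paid = [(name, quality / cost) for name, quality, cost in leaderboard if cost]
for name, value in sorted(paid, key=lambda row: row[1], reverse=True)[:3]:
    print(f"{name}: {value:,.0f} quality points per dollar")
```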
7 Benchmarks · 14 Models · 5,800+ LLM Calls
Choose your champion
Every model has strengths. The budget heroes outperform the expensive ones on 6 of 7 benchmarks. Radar charts show performance across OttoTau (policy enforcement), Persona Fidelity, Session Recall, Cold Read, Voice Drift, and Bench Core (deliberation quality).
Total benchmark cost: $22.95 — less than a single Devin session.
The Scholar
Kimi K2.5 · Moonshot
“Wins 2 of 7 benchmarks. Best policy enforcement of any model tested. The quiet overachiever.”
The Scrapper
Grok 3 Mini · xAI
“Best persona stability over 10-turn conversations. Holds character when others crack. 769 tasks per dollar.”
The Intern
DeepSeek V3 · DeepSeek
“2,500 tasks per dollar. Third overall. No benchmark wins but no weaknesses either. The workhorse.”
The Flash
Gemini 2.5 Flash · Google
“Best persona differentiation. Produces genuinely different voices for each profile. Fast and cheap.”
The Ghost
Qwen3 235B · Alibaba
“16,667 tasks per dollar. Cheapest model that still performs. The invisible infrastructure play.”
The Prodigy
Haiku 4.5 · Anthropic
“Best memory. Highest session recall of any model. Never hallucinated. The reliable teammate.”
The Heavyweight
Opus 4.6 · Anthropic
“Highest raw deliberation quality. But costs 23x more than the #1 model and wins only 1 of 7 benchmarks.”
The Default
GPT-4o · OpenAI
“The market default is dead last. Wins zero benchmarks. Near-bottom in 4 of 7. The emperor has no clothes.”
All 14 models
Core bench scores for every model tested.
The Flash
Gemini 2.5 Flash · Google
“Fast and smart. The new default for production councils.”
The Professional
Sonnet 4.6 · Anthropic
“Never fails. 100% JSON. Tight latency. The professional choice.”
The Heavyweight
Opus 4.6 · Anthropic
“When the decision matters most. Deepest deliberation quality.”
Methodology
Try it yourself
ConstellationBench is open. Run the same queries against any model on OpenRouter and compare your results.
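A minimal way to do that from Python, using OpenRouter's OpenAI-compatible chat endpoint. The model slug and the prompt below are placeholders for illustration, not the benchmark's actual harness:

```python
import os
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "google/gemini-2.5-flash",  # placeholder slug; check OpenRouter's model list
        "messages": [
            {"role": "system", "content": "You are the Skeptic persona on a review council. Reply in JSON."},
            {"role": "user", "content": "Should we ship this release? Give persona, position, and conviction."},
        ],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```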
ConstellationBench data last updated March 12, 2026. All costs reflect OpenRouter API pricing at time of benchmark. Scores are composite measures of persona adherence, deliberation diversity, response quality, and JSON compliance.
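For readers who want a concrete picture of how a composite of that shape combines, here is an equal-weight sketch. The component names follow the note above, but the weights are an assumption made for illustration, not the benchmark's published weighting.

```python
# Hypothetical equal weights; the real weighting may differ.
WEIGHTS = {
    "persona_adherence": 0.25,
    "deliberation_diversity": 0.25,
    "response_quality": 0.25,
    "json_compliance": 0.25,
}

def composite(components: dict[str, float]) -> float:
    """Weighted mean of the four component scores."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)

print(composite({
    "persona_adherence": 0.62,
    "deliberation_diversity": 0.48,
    "response_quality": 0.71,
    "json_compliance": 1.00,
}))  # 0.7025
```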