Model Comparisons
Head-to-head benchmark data for AI models on agentic-core-v1: ten software engineering tasks, three runs each, 30 runs per model in total. Every score links back to a logged campaign run.
A 1-point score gap and a 16× cost gap — is Sonnet worth it?
Same agentic-core-v1 score, 36× cost difference — what does Sonnet actually buy you?
The two most cost-efficient models in the top tier — one passes more, the other costs less.
Both mid-tier flagships at 90%+ — OpenAI vs Anthropic on real agentic tasks.
Both 30/30 on agentic-core-v1. One costs $7.34. The other costs $0.12.
Mistral's code specialist vs its latest small model — which is better for agentic workloads?
Same pass rate, similar costs — Mistral mid-tier vs Anthropic budget tier.
About these comparisons
Every number on this page comes from a deterministic benchmark run. The harness is agentic-core-v1: ten tasks drawn from real engineering work — fix a failing test, refactor duplicated code, investigate a production log, add a null guard. Each task runs three times per model. Pass/fail is binary: the checker accepts or rejects, no partial credit.
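As a rough illustration of the scoring described above, here is a minimal Python sketch of how a 30-run campaign could be tallied. The `RunResult` type, its field names, and the `pass_rate` helper are hypothetical and are not taken from the harness itself; only the 10 × 3 structure and the binary pass/fail rule come from the text.

```python
from dataclasses import dataclass

TASKS = 10          # tasks in agentic-core-v1 (from the text)
RUNS_PER_TASK = 3   # runs per task, per model (from the text)


@dataclass(frozen=True)
class RunResult:
    """One run of one task. Field names are illustrative assumptions."""
    task_id: str
    attempt: int
    passed: bool    # binary: the checker accepts or rejects, no partial credit


def pass_rate(results):
    """Return (passed, total, fraction passed) for a list of RunResult."""
    if not results:
        raise ValueError("no runs to score")
    passed = sum(1 for r in results if r.passed)
    total = len(results)
    return passed, total, passed / total


# Hypothetical campaign: every run passes except one attempt at one task.
results = [
    RunResult(f"task-{t}", a, passed=not (t == 4 and a == 2))
    for t in range(TASKS)
    for a in range(RUNS_PER_TASK)
]
passed, total, rate = pass_rate(results)
print(f"{passed}/{total} passed ({rate:.1%})")
```

Because each run is scored independently, a model that passes a task on two of three attempts contributes two passes and one failure to the total rather than a fractional score for the task.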
We report three cost metrics: total campaign cost, cost per run, and cost per passing run. The last one is the number that matters for production planning, because a failed run still costs money and delivers nothing.
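The three metrics can be sketched in a few lines of Python. This assumes cost per passing run is total campaign cost divided by the number of passing runs, which is our reading of the description and not a confirmed formula; the `cost_metrics` function and its parameter names are hypothetical. The $7.34 and $0.12 totals are the 30/30 campaign figures quoted on this page, while the $3.00 campaign with 27 passes is invented purely for illustration.

```python
import math


def cost_metrics(total_cost_usd, runs, passing_runs):
    """Compute the three cost metrics for one campaign.

    Assumption: cost per passing run = total cost / passing runs, so the
    cost of failed runs is spread across the runs that succeeded.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= passing_runs <= runs:
        raise ValueError("passing_runs must be between 0 and runs")
    return {
        "total_cost": total_cost_usd,
        "cost_per_run": total_cost_usd / runs,
        # No passing runs means no finite cost per success.
        "cost_per_passing_run": (
            total_cost_usd / passing_runs if passing_runs else math.inf
        ),
    }


# Totals quoted on this page, both at 30/30: the two per-run metrics coincide.
for total in (7.34, 0.12):
    m = cost_metrics(total, runs=30, passing_runs=30)
    print(f"${total:.2f}: {m['cost_per_run']:.4f} per run, "
          f"{m['cost_per_passing_run']:.4f} per passing run")

# Hypothetical campaign with failures: cost per passing run rises above cost per run.
m = cost_metrics(3.00, runs=30, passing_runs=27)
print(f"{m['cost_per_run']:.4f} per run vs {m['cost_per_passing_run']:.4f} per passing run")
```

The two per-run figures only diverge when a model fails some runs, so the gap between them is a quick read on how much of the spend went to runs that produced nothing.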
For the full ranked leaderboard, see the agentic-core-v1 score breakdown. For methodology, see what we actually measure.