Model Comparisons

Head-to-head benchmark data for AI models on agentic-core-v1 — ten software engineering tasks, three runs each, 30 total. Every score links back to a logged campaign run.

Claude Sonnet 4.6 vs Claude Haiku 4.5

A 1-point score gap and a 16× cost gap — is Sonnet worth it?

Claude Sonnet 4.6 28/30 $0.0514/pass

Claude Haiku 4.5 27/30 $0.0033/pass

See full comparison →

DeepSeek V4 Flash vs Claude Sonnet 4.6

Same agentic-core-v1 score, 36× cost difference — what does Sonnet actually buy you?

DeepSeek V4 Flash 28/30 $0.0014/pass

Claude Sonnet 4.6 28/30 $0.0514/pass

See full comparison →

Mistral Small 4 vs Claude Haiku 4.5

The two most cost-efficient models in the top tier — one passes more, the other costs less.

Mistral Small 4 29/30 $0.0010/pass

Claude Haiku 4.5 27/30 $0.0033/pass

See full comparison →

GPT-5.5 Instant vs Claude Sonnet 4.6

Both mid-tier flagships at 90%+ — OpenAI vs Anthropic on real agentic tasks.

GPT-5.5 Instant 27/30 $0.0700/pass

Claude Sonnet 4.6 28/30 $0.0514/pass

See full comparison →

Claude Opus 4.8 vs DeepSeek V4 Pro Thinking

Both 30/30 on agentic-core-v1. One costs $7.34. The other costs $0.12.

Claude Opus 4.8 30/30 $0.2450/pass

DeepSeek V4 Pro Thinking 30/30 $0.0040/pass

See full comparison →

Devstral 2 123B vs Mistral Small 4

Mistral's code specialist vs its latest small model — which is better for agentic workloads?

Devstral 2 123B 27/30 $0.0019/pass

Mistral Small 4 29/30 $0.0010/pass

See full comparison →

Mistral Large 3 vs Claude Haiku 4.5

Same pass rate, similar costs — Mistral mid-tier vs Anthropic budget tier.

Mistral Large 3 27/30 $0.0022/pass

Claude Haiku 4.5 27/30 $0.0033/pass

See full comparison →

About these comparisons

Every number on this page comes from a deterministic benchmark run. The harness is agentic-core-v1: ten tasks drawn from real engineering work — fix a failing test, refactor duplicated code, investigate a production log, add a null guard. Each task runs three times per model. Pass/fail is binary: the checker accepts or rejects, no partial credit.

We report three cost metrics: total campaign cost, cost per run, and cost per passing run. The last one is the number that matters for production planning — if a run fails, you paid for nothing.

For the full ranked leaderboard see the agentic-core-v1 score breakdown . For methodology, see what we actually measure .

Model Comparisons

About these comparisons

ClawWorks Weekly