Claude Sonnet 4.6 vs Claude Haiku 4.5

A 1-point score gap and a 16× cost gap — is Sonnet worth it?

Head-to-head: agentic-core-v1

Harness: agentic-core-v1 — 10 tasks × 3 runs = 30 total. Binary pass/fail per run. Harness version: openclaw@2026.4.22. Full methodology: what we measure and why.

Metric	Claude Sonnet 4.6	Claude Haiku 4.5
Runs passed / total	28 / 30	27 / 30
Pass rate	93.3%	90.0%
Campaign cost (30 runs)	$1.44	$0.09
Cost per run	$0.0480	$0.00300
Cost per passing run	$0.0514	$0.00333
Provider	Anthropic	Anthropic
Campaign date	2026-05-04	2026-05-24

Claude Sonnet 4.6

Strength

Consistent 9/10 task categories at 3/3; task_09 single-run pass via error-forced fallback

Weakness

$0.051/pass — 16× more expensive than Claude Haiku 4.5 for a 1-point score advantage

Full campaign report →

Claude Haiku 4.5

Strength

1-point gap vs Sonnet 4.6 at 16× lower cost; matches Sonnet on 9/10 task categories

Weakness

task_09 exhaustion failure — loops through 13 turns trying to resolve ambiguous data, never exits

Full campaign report →

Cost ratio: Claude Haiku 4.5 costs 15.4× less per passing run ($0.00333 vs $0.0514). At 10,000 tasks, that gap is $481 .

Verdict

Claude Sonnet 4.6 scores 1 run higher but costs 15× more per passing run. Whether that gap justifies the price depends on your tolerance for failure: at scale, 1 extra passing run per 30 compounds. For high-volume low-margin workloads, Claude Haiku 4.5 at $0.00333/pass is the rational default.

About agentic-core-v1

agentic-core-v1 is modelbattles' flagship benchmark harness. Ten tasks drawn from real engineering work: fix a failing test, refactor duplicated code, investigate a production log, add a null guard, trace through a codebase. Each task runs three times per model. A run passes if and only if the checker accepts the output — no partial credit, no manual grading.

The harness is deterministic: same task, same environment, same checker across all models. Scores are comparable. task_09 is the persistent difficulty point — it requires the model to recognise a structurally impossible calculation and refuse to produce a wrong answer instead of looping. Most models fail it at least once.

Read: What we actually measure and why · How to read an agentic-core-v1 score

More comparisons

DeepSeek V4 Flash vs Claude Sonnet 4.6 — Same agentic-core-v1 score, 36× cost difference — what does Sonnet actually buy you?
Mistral Small 4 vs Claude Haiku 4.5 — The two most cost-efficient models in the top tier — one passes more, the other costs less.
GPT-5.5 Instant vs Claude Sonnet 4.6 — Both mid-tier flagships at 90%+ — OpenAI vs Anthropic on real agentic tasks.
Claude Opus 4.8 vs DeepSeek V4 Pro Thinking — Both 30/30 on agentic-core-v1. One costs $7.34. The other costs $0.12.
Devstral 2 123B vs Mistral Small 4 — Mistral's code specialist vs its latest small model — which is better for agentic workloads?

Claude Sonnet 4.6 vs Claude Haiku 4.5

Head-to-head: agentic-core-v1

Claude Sonnet 4.6

Claude Haiku 4.5

Verdict

About agentic-core-v1

More comparisons

ClawWorks Weekly