One point apart

May 24, 2026 · campaign-reports

Campaign: 2026-05-24-claude-haiku-4-5-agentic-core-v1
Model: Claude Haiku 4.5 (us.anthropic.claude-haiku-4-5-20251001-v1:0, AWS Bedrock cross-region eu-west-1/us-east-1)
Harness: agentic-core-v1 (10 tasks x 3 runs = 30 total)
Campaign date: 2026-05-24

Claude Sonnet 4.6 is Anthropic’s current 4.x flagship. It scored 28/30 on this harness at $0.051 per passing task. Claude Haiku 4.5 is Anthropic’s budget-tier 4.x model. It scored 27/30 at $0.00316 per passing task.

That is a 16x cost difference for one point.

The one-point gap is entirely attributable to a single task — task_09 — that no Claude model has reliably solved. On the nine other tasks, Haiku 4.5 and Sonnet 4.6 are indistinguishable: both score 3/3 on every run. Strip task_09 from the suite and both models score 27/27. The flagship’s cost premium purchases one extra point on a task neither model consistently wins.

This is the first time both tiers of Anthropic’s 4.x family have run on this harness. The result is the story.

What agentic-core-v1 tests

[Observed — harness spec]

Ten tasks, three independent runs each. The harness runs real software engineering scenarios against a working codebase: fix a failing test, refactor duplicated code, investigate a log file, trace execution through a codebase, make a targeted minimal fix, handle an ambiguous requirement, execute a multi-step sequential plan, recover from an injected tool error, recognize an impossible computation and refuse it, run a SQL investigation.

Each run is scored pass or fail by a task checker that inspects the output file the model writes. Failure modes are classified:

wrong_answer: the checker rejected the output — not necessarily nonsensical, but not matching the acceptance criteria
gave_up_mid_plan: the model hit the 13-turn limit without producing a final answer

Pass means the task checker accepted the output. Nothing else.

The results

[Observed — data pack task_results, verified: pass_rate_by_task.csv]

Task	Score	Avg latency	Cost	Notes
task_01 fix_failing_test	3/3	6.10s	$0.00589
task_02 refactor_duplicated_code	3/3	8.10s	$0.00746
task_03 investigate_log	3/3	7.06s	$0.02197	Cost spike: 80K input tokens
task_04 trace_through_codebase	3/3	7.02s	$0.00694
task_05 minimal_fix	3/3	12.15s	$0.01323	Latency outlier: 9 tool calls avg
task_06 handle_ambiguous_requirement	3/3	10.63s	$0.01018
task_07 multi_step_plan	3/3	3.14s	$0.00292
task_08 recover_from_tool_error	3/3	3.84s	$0.00308
task_09 know_when_to_stop	0/3	9.50s	$0.00847
task_10 sql_investigation	3/3	7.02s	$0.00530

Total: 27/30 (90.0%) — $0.085 total — $0.00316/pass

Nine clean sweeps. One complete failure. No partial misses anywhere else in the dataset — every passing task passed all three runs.

What is task_09 actually asking?

[Observed — harness spec and task_09 transcripts, verified: task_09_transcripts, tool_call_redundancy.md]

task_09 (know_when_to_stop) presents a three-row CSV and asks the model to compute a ten-day moving average. The correct response is to recognize that three data points cannot support a ten-day window — the data is insufficient for the computation requested. The task checker accepts an output file that acknowledges this directly. It rejects output that either provides a confident numeric result or leaves the file empty.

This task is unusual in the dataset. It is not testing execution skill. It is testing metacognitive awareness: can the model recognize when a task is genuinely impossible rather than merely difficult?

Haiku 4.5 failed all three runs. But the failure mode is specific and worth examining.

[Observed — runs 1 and 2, verified: tool_call_redundancy.md, transcript refs 24ac8670 and 921f456d]

In runs 1 and 2, Haiku 4.5 re-read data.csv repeatedly in the early turns — making consecutive identical tool calls (reading the same file with the same arguments back to back, a pattern classified as tool_call_redundancy in the evidence pack). It burned through all 13 turns trying to resolve the data sufficiency problem. It never declared the task impossible. Both runs were classified gave_up_mid_plan — hitting the turn limit without producing a final answer.

[Observed — run 3, verified: transcript ref a17c5443]

Run 3 proceeded differently: five tool calls, an answer committed to the output file, checker rejected it. Classified wrong_answer.

[Observed — comparative data]

This is not a budget-tier failure. Claude Sonnet 4.6 scored 1/3 on task_09 with the same exhaustion pattern on failed runs. The task_09 failure profile appears consistent across the Claude 4.x family at the harness level — the capability gap is between Claude and the task, not between Haiku and Sonnet.

[Observed — tool_call_redundancy.md]

Critically: tool call redundancy is completely isolated to task_09. All 27 passing runs show zero redundant tool calls. This is not a general execution pathology. It is a domain-specific failure that only surfaces when the model encounters genuinely impossible work and cannot recognize it.

Is the 16x cost difference justified?

[Observed — cost data; Speculation flagged below]

The practical question for anyone running agentic workloads in the Anthropic ecosystem.

On the evidence from this harness:

Haiku 4.5 costs $0.00316 per passing task
Sonnet 4.6 costs $0.051 per passing task
The gap between them: one point on a ten-task suite, on one task neither model reliably passes

[Speculation] If task_09-class work — recognizing genuine impossibility before exhausting all available turns — is not a requirement for your production workload, Haiku 4.5 delivers Sonnet-quality agentic execution at 16x lower cost per passing task. The nine tasks that measure real-world execution (code fix, refactor, log investigation, codebase trace, targeted bug fix, ambiguous requirement handling, multi-step planning, error recovery, SQL investigation) show no detectable gap between the two models.

[Speculation] The 1-point gap may widen on harder or more open-ended suites. agentic-core-v1 emphasizes structured execution tasks. On suites with more genuine reasoning under uncertainty, a gap between flagship and budget tier would be expected.

Where does Haiku 4.5 land on the full leaderboard?

[Observed — combined campaign data]

Rank	Model	Score	Cost/pass	Notes
1	Claude Sonnet 4.6	28/30	$0.0514	Anthropic 4.x flagship
1	GLM-4.7	28/30	$0.00380
1	MiniMax M2.1	28/30	$0.00258
1	DeepSeek V4-Flash	28/30	$0.00143
1	Ministral 3 8B	28/30	$0.00067	Open-weight cost leader
6	Mistral Large 3 675B	27/30	$0.00213
6	MiniMax M2.5	27/30	$0.00240
6	Devstral 2	27/30	$0.00190
6	GLM-5	27/30	—
6	Claude Haiku 4.5	27/30	$0.00316
11	GLM-4.7 Flash	25/30	$0.00057
11	GPT-OSS 20B	25/30	—

Haiku 4.5 enters rank 6, tied with Mistral Large 3 675B, MiniMax M2.5, Devstral 2, and GLM-5 at the same score. The 27/30 tier spans a wide cost range: from $0.00190/pass (Devstral 2) to $0.00316/pass (Haiku 4.5). Haiku costs 1.7x more per pass than Devstral 2 and 1.5x more than Mistral Large 3 for equal results; the premium is the Anthropic fully managed infrastructure and AWS Bedrock native routing.

The comparison to Ministral 3 8B is harder to dismiss: Ministral scores 28/30 at $0.00067/pass. Haiku 4.5 scores 27/30 at $0.00316/pass. Ministral is 4.7x cheaper for one more correct run. For operators not locked into the Anthropic ecosystem, that is the more relevant comparison.

What explains the task_05 latency spike?

[Observed — task_05 run data, verified: pass_rate_by_task.csv]

task_05 (minimal fix) is the latency outlier: 12.15s average, 9 tool calls per run. Every other passing task runs in 3 to 11 seconds with 2 to 10 tool calls.

task_05 asks the model to fix a price calculator bug with a diff of ten lines or fewer. The work requires reading the source file, diagnosing the negative-price condition, writing the fix, running the tests, and verifying the output. Haiku 4.5 uses 9 tool calls per run to do this — more than any other passing task.

[Speculation] The extra tool calls on task_05 look like over-verification: re-reading the fixed file, re-running the test suite, double-checking the diff before committing. This is being thorough rather than being slow. The cost for task_05 is $0.00441 per run — still negligible. But for latency-sensitive production pipelines, verify-loop code-modification tasks run roughly four times slower than simple multi-step tasks (task_07 runs in 3.14s).

The clean forensic profile

[Observed — diagnosis_then_regression.md, long_tail_turn_count.md]

Zero diagnosis regressions across all 30 runs (the model states what’s wrong then walks back its own diagnosis — that pattern does not appear here). Zero long-tail turn counts outside of task_09’s exhaustion runs.

On all 27 passing runs, Haiku 4.5 commits to a diagnosis and executes it without reversing course. This is the same profile Mistral Large 3 showed at the same score. Consistent, predictable execution is the property production agentic deployments care most about — and it shows up cleanly here.

What we do not know yet

[Speculation]

How task_09 run 3 failed specifically: runs 1 and 2 exhausted their turn budgets (gave_up_mid_plan) without producing output. Run 3 used 5 tool calls and submitted an answer the checker rejected (wrong_answer). What answer did it submit — a numeric result, an incorrect acknowledgment string, something else? That detail would clarify whether run 3 was closer to passing than runs 1 and 2, or whether it was a different kind of failure.

[Speculation]

How Haiku 4.5 compares to Claude 3.x Haiku on this harness. The 3.x generation is not in the dataset. External benchmarks suggest a roughly 9-point uplift from 3.x to 4.x Haiku on comparable suites — which would put 3.x Haiku in the 18-20/30 range. If that estimate is accurate, the budget-tier model improved by around half a suite’s worth of passing tasks from one generation to the next, while the cost per pass stayed competitive.

[Speculation]

Whether the 1-point gap between Haiku 4.5 and Sonnet 4.6 holds on longer or harder suites. agentic-core-v1 tests structured execution. On open-ended reasoning tasks, creative problem-solving, or longer-horizon planning, the gap between a budget model and a flagship would be expected to widen.

The operator summary

[Observed — cost and scoring data; operator framing is Speculation]

The headline finding holds regardless of how you interpret it: within Anthropic’s 4.x family, the budget tier scores 27/30 and the flagship scores 28/30. The one-point difference traces to a single task neither model consistently passes.

[Speculation] For teams running agentic workloads on AWS Bedrock who want to stay in the Anthropic ecosystem, Haiku 4.5 is the practical choice at current pricing. It runs on the same infrastructure, uses the same API surface, and scores within one point of the flagship at 16x lower cost. The incremental cost for Sonnet 4.6 buys marginal improvement on tasks that represent the outer edge of what either model can reliably handle.

The question is not “which is better.” The question is whether the flagship’s one-point advantage on one task is worth 16x in per-task cost. On this evidence, for most production agentic workloads, it is not.

One point apart

What agentic-core-v1 tests

The results

What is task_09 actually asking?

Is the 16x cost difference justified?

Where does Haiku 4.5 land on the full leaderboard?

What explains the task_05 latency spike?

The clean forensic profile

What we do not know yet

The operator summary

ClawWorks Weekly