Anthropic's most expensive model scored worst in its own family

May 25, 2026 · campaign-reports

Campaign: 2026-05-24-claude-opus-4-7-agentic-core-v1
Model: Claude Opus 4.7 (Anthropic, via AWS Bedrock cross-region inference, us-east-1)
Harness: agentic-core-v1 (openclaw@2026.4.22)
Runs: 30 (10 tasks × 3 runs each)
Date: 2026-05-25T00:29–00:37Z

We went into this campaign expecting an easy result. Haiku 4.5 scored 27/30. Sonnet 4.6 scored 28/30. The obvious prediction: Opus 4.7, Anthropic’s flagship, would land at or above Sonnet — maybe 29 or 30, at higher cost.

It scored 21/30.

That is not a rounding error. It is a 7-point gap below Sonnet and a 6-point gap below the cheap model in the same family. The premium tier of the Anthropic lineup posted the worst result of any Anthropic model we have tested.

We were wrong about this, and the predictions are on record and scored.

What the harness actually tests

agentic-core-v1 gives a model 10 tasks spread across skills a real software engineer would use: fixing a failing test, refactoring code, investigating logs, tracing through a codebase, handling ambiguous requirements, multi-step planning, recovering from tool errors, and knowing when to stop. Each task runs 3 times. A run passes if the model writes the correct answer to an output file before hitting the turn limit.

Failure modes are labelled by the harness. gave_up_mid_plan means the model stopped without producing a final answer, and did so before the turn limit fired: it started the task, got partway through, and quit. wrong_answer means the model wrote something to the output file, but what it wrote did not satisfy the acceptance criteria. Those two labels cover almost everything that can go wrong.
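The labelling rules described above can be sketched as a small decision function. This is an illustration only, not the harness's code: the RunRecord fields are hypothetical names, and "other" is a placeholder because the report does not name the remaining labels.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    """Hypothetical summary of one run; field names are illustrative only."""
    wrote_output: bool      # did the model write anything to the output file?
    output_accepted: bool   # did that output satisfy the acceptance criteria?
    hit_turn_limit: bool    # did the harness cut the run off at the turn limit?


def label_run(run: RunRecord) -> Optional[str]:
    """Return None for a pass, otherwise a failure label."""
    if run.wrote_output and run.output_accepted:
        return None
    if run.wrote_output:
        # Something was written, but it failed the acceptance check.
        return "wrong_answer"
    if not run.hit_turn_limit:
        # No answer, and the model stopped on its own before the limit.
        return "gave_up_mid_plan"
    return "other"  # placeholder: the report does not name the remaining labels


print(label_run(RunRecord(wrote_output=False, output_accepted=False, hit_turn_limit=False)))
```

Under these assumptions, a run that reads the input, never writes the output file, and stops on its own is labelled gave_up_mid_plan.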

Passing a task cleanly 3/3 is the baseline expectation for any model calling itself capable of agentic work.

The score

[Observed]

21 of 30 runs passed. Pass rate: 70.0% (verified: pass_rate_by_task.csv). Total cost: $11.08 across 30 runs. Cost per passing task: $0.528 (verified: cost_breakdown.csv). The campaign ran for 8 minutes 42 seconds.
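The two headline figures follow directly from the run count and total cost. A minimal check, using only the numbers quoted above (the variable names are ours, not columns from the CSV files):

```python
runs_total = 30
runs_passed = 21
total_cost_usd = 11.08

pass_rate = runs_passed / runs_total
cost_per_pass = total_cost_usd / runs_passed

print(f"pass rate: {pass_rate:.1%}")           # pass rate: 70.0%
print(f"cost per pass: ${cost_per_pass:.3f}")  # cost per pass: $0.528
```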

Six tasks were 3/3 clean (task_01, task_02, task_04, task_05, task_07, task_10). Four tasks had failures. Two tasks, task_03 and task_09, scored 0/3.

Here is the full picture for the Anthropic 4.x family, now complete (verified: pass_rate_by_task.csv for all three campaigns):

Model	Score	Pass rate	Cost per pass	task_09
Haiku 4.5	27/30	90.0%	$0.00316	0/3
Sonnet 4.6	28/30	93.3%	$1.44	3/3
Opus 4.7	21/30	70.0%	$0.528	0/3

The premium model is not the best model. It is not even the second-best model.

What went wrong?

[Observed]

9 of the 30 runs failed. 7 of those 9 failures (78%) are classified gave_up_mid_plan. That means the model started the task, made some progress, and stopped before writing a complete answer. The harness did not time the run out; the model simply stopped issuing further work.

This pattern appeared across four distinct tasks: task_03, task_06, task_08, and task_09. That is not a quirk of one difficult task. It is consistent behavior across different problem types (log investigation, ambiguous requirement handling, tool error recovery, and knowing when to stop).

The two wrong_answer failures both belong to task_03.
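The breakdown can be tallied from the per-task counts in this report. In the sketch below, the labels for task_03, task_08, and task_09 come from the task sections; the single task_06 failure is inferred as gave_up_mid_plan, since the other three tasks account for six of the seven.

```python
from collections import Counter

# Failure labels per task, reconstructed from the counts in this report.
# task_06's single failure is an inference (7 gave_up_mid_plan total, 6 elsewhere).
failures = {
    "task_03": ["wrong_answer", "wrong_answer", "gave_up_mid_plan"],
    "task_06": ["gave_up_mid_plan"],
    "task_08": ["gave_up_mid_plan", "gave_up_mid_plan"],
    "task_09": ["gave_up_mid_plan"] * 3,
}

by_label = Counter(label for labels in failures.values() for label in labels)
total_failures = sum(by_label.values())
share = by_label["gave_up_mid_plan"] / total_failures

print(total_failures, dict(by_label))
print(f"gave_up_mid_plan share: {share:.0%}")  # gave_up_mid_plan share: 78%
```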

task_03: 0 of 3 passed

[Observed]

task_03 (investigate_log) asks the model to read a log file, identify the root cause of an error, and write a finding. All three runs failed (verified: pass_rate_by_task.csv). Two runs produced a wrong answer; one abandoned the task mid-plan.

The two wrong_answer failures mean the model drew a conclusion from the log that did not match what the task checker expected. The gave_up_mid_plan failure means the model started the investigation and stopped before committing a finding to the output file.

task_08: 1 of 3 passed

[Observed]

task_08 (recover_from_tool_error) asks the model to correct a file path and retry after receiving a tool error. The path the model starts with is wrong; the model needs to work out the right one.

One run passed. Two failed with gave_up_mid_plan. In the two failing runs, the model called fs_read on the incorrect path, received an error, then called fs_read on the same incorrect path again. And again. After several iterations of reading the wrong path and receiving the same error, it stopped executing (verified: tool_call_redundancy.md).

This is a specific problem: the model detected it was in a loop but responded to that detection by abandoning the task rather than correcting the path. The run that passed took a different approach in its first few steps.

task_09: 0 of 3 passed

[Observed]

task_09 (know_when_to_stop) asks the model to compute a 10-day moving average on a CSV file that contains only 3 rows of data. A 10-day window cannot be filled from 3 data points, so the requested average cannot be computed. The task is designed to test whether the model recognizes the constraint and writes a clean “insufficient data” finding rather than producing garbage output.
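The behavior the task is looking for can be sketched as a guard in front of the calculation. This is a minimal illustration, not the task's reference solution: the CSV contents, the column name "value", and the wording of the finding are all hypothetical, since the report shows neither the task file nor the checker.

```python
import csv
import io
from typing import List, Union


def moving_average(values: List[float], window: int) -> Union[List[float], str]:
    """Return simple moving averages, or a plain finding if the window cannot be filled."""
    if len(values) < window:
        return f"insufficient data: {len(values)} rows, window of {window} required"
    return [sum(values[i - window:i]) / window for i in range(window, len(values) + 1)]


# Hypothetical 3-row CSV; the real task file and its columns are not shown in this report.
sample = "date,value\n2026-01-01,10\n2026-01-02,12\n2026-01-03,11\n"
values = [float(row["value"]) for row in csv.DictReader(io.StringIO(sample))]

print(moving_average(values, window=10))
# insufficient data: 3 rows, window of 10 required
```

The point of the guard is that the finding is returned as the answer. Detecting the shortfall and then writing nothing is what gets labelled gave_up_mid_plan.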

Opus 4.7 failed all three runs with gave_up_mid_plan.

This failure appears to differ from Haiku 4.5’s failure on the same task. Haiku also failed 0/3, but likely because it attempted the calculation and produced output that didn’t satisfy the acceptance criteria. Opus 4.7 recognized the problem: in each run, it read the CSV, identified that 3 rows were insufficient for a 10-day window, and then stopped executing without writing an answer to the output file.

Sonnet 4.6 passed all three runs. It is the only Anthropic model that converts “I know this is impossible” into the correct output: writing a clean stop-and-report answer.

[Speculation] The task_09 failure mode gap between Sonnet 4.6 and Opus 4.7 may reflect a specific RLHF signal in Sonnet 4.6’s training around the relationship between ambiguity detection and answer commitment. Opus detects the ambiguity. It does not follow through.

Is over-deliberation the pattern?

[Observed]

The failure data suggests something specific about how Opus 4.7 handles uncertainty. On tasks with clear success criteria and structured steps, it performs well. task_01, task_02, task_04, task_05, task_07, and task_10 all scored 3/3 without failures. On tasks where the model needs to commit to an interpretation despite ambiguity (task_03, task_06, task_08, task_09) it fails 9 of 12 runs (75%), and 7 of those 9 failures are labelled gave_up_mid_plan.
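The split between the two groups can be recomputed from the per-task scores. In this sketch the pass counts come from this report, with task_06 taken as 2/3 because that is the only value consistent with the 21/30 total; the grouping into "ambiguous" tasks is the one used in the paragraph above.

```python
# Passes out of 3 runs per task, taken from the scores reported in this article.
# task_06 = 2 is inferred from the 21/30 total.
passes = {
    "task_01": 3, "task_02": 3, "task_04": 3, "task_05": 3, "task_07": 3, "task_10": 3,
    "task_03": 0, "task_06": 2, "task_08": 1, "task_09": 0,
}
RUNS_PER_TASK = 3
ambiguous = {"task_03", "task_06", "task_08", "task_09"}
structured = set(passes) - ambiguous


def failure_count(tasks):
    """Return (failed runs, total runs) for a group of tasks."""
    runs = len(tasks) * RUNS_PER_TASK
    failed = runs - sum(passes[t] for t in tasks)
    return failed, runs


print(failure_count(ambiguous))   # (9, 12)
print(failure_count(structured))  # (0, 18)
```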

[Observed] task_07 (multi_step_plan) is a useful contrast. It asks the model to create 4 files with specified content in sequence. Opus passed 3/3 — but averaged 40 tool calls per run (verified: tool_calls_by_task.csv). For a task that needs around 4 tool calls to complete, 40 is a 10× overshoot. The runs show Opus verifying, re-reading, and re-checking each step before moving to the next. On a structured task with unambiguous criteria, this thoroughness still lands a pass.

On task_08, the same thoroughness manifested as re-reading the same wrong file path repeatedly, then stopping.

[Speculation] The model may be more sensitive than its siblings to mid-task uncertainty signals. When it detects ambiguity or apparent contradiction during execution, it slows down to re-examine rather than committing to a resolution. On structured tasks, that extra examination does no harm. On ambiguous tasks, it produces paralysis.

What did we predict?

[Observed]

Four predictions were made before the campaign ran (verified: predictions/claude-opus-4-7-agentic-core-v1.md):

Prediction	Result	Verdict
P1: Score ≥ 27/30	21/30	FAIL
P2: task_09 3/3	0/3	FAIL
P3: Score ≥ 28/30	21/30	FAIL
P4: $/pass > $0.10	$0.528	PASS

1 of 4. The only prediction that held was that it would cost more than ten cents per passing task.
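Scoring the four predictions is mechanical once the thresholds are written down. A minimal sketch, using the thresholds from the table above and the observed results from this campaign:

```python
# Observed results from this campaign.
score, task_09_passes, cost_per_pass = 21, 0, 0.528

# Each prediction is expressed as a boolean check against the observed values.
predictions = {
    "P1: score >= 27/30": score >= 27,
    "P2: task_09 3/3": task_09_passes == 3,
    "P3: score >= 28/30": score >= 28,
    "P4: $/pass > $0.10": cost_per_pass > 0.10,
}

for name, held in predictions.items():
    print(f"{name}: {'PASS' if held else 'FAIL'}")

held_count = sum(predictions.values())
print(f"{held_count} of {len(predictions)} held")  # 1 of 4 held
```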

The directional assumption behind P1, P2, and P3 was that capability scales monotonically across the Anthropic tier. That assumption was wrong. This is the worst prediction accuracy in the current dataset.

The tool call redundancy finding

[Observed]

The forensics pass flagged 36 tool-call redundancy events across 10 of the 30 runs (33%). A tool-call redundancy means the model called the same tool with the same arguments twice in a row (verified: tool_call_redundancy.md). Two examples worth noting:

In task_05 run 2, fs_write was called five consecutive times with identical content. The file was already written correctly after the first call; the remaining four calls re-wrote the same content.

In task_08 run 2, fs_read('data.txt') was called three times across six turns. Each call returned the same error — the file path was wrong. The model did not change the path between calls.

The task_05 redundancy is harmless: the task passed. The task_08 redundancy is the mechanism of failure: the model detected it was getting the same error, kept retrying without modifying the path, and eventually stopped.
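A redundancy check of the kind described above can be sketched in a few lines. This is not the forensics pass itself: the (tool name, arguments) log shape and the example path and content are hypothetical, and the definition used is the one given above (same tool, same arguments, immediately repeated).

```python
from typing import Any, Dict, List, Tuple

ToolCall = Tuple[str, Dict[str, Any]]  # (tool name, arguments); a hypothetical log shape


def count_redundant_calls(calls: List[ToolCall]) -> int:
    """Count calls that exactly repeat the immediately preceding call."""
    return sum(1 for prev, cur in zip(calls, calls[1:]) if prev == cur)


# Illustrative trace only, loosely modelled on the task_05 example above:
# five identical writes in a row, so four of them repeat the call before.
trace = [("fs_write", {"path": "out.txt", "content": "done"})] * 5
print(count_redundant_calls(trace))  # 4
```

A count like this flags both the harmless case and the failing one. Telling them apart needs the tool results as well: a repeated call that keeps returning the same error is the pattern seen on task_08.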

What we don’t know

The two wrong_answer failures on task_03 are not fully explained by the brief. The forensics data shows they produced incorrect output, but not the specific content of what was written versus what the checker expected. Whether the model was close (e.g., right root cause, wrong format) or entirely off-track is unobserved.

The task_09 failure for Haiku 4.5 is characterized above as likely wrong_answer, but the Haiku 4.5 brief (campaign: 2026-05-24-claude-haiku-4-5-agentic-core-v1) would need to be checked to confirm this against Haiku’s actual failure mode labels. The claim is based on the expected behavioral difference, not a direct transcript comparison.

We also have not tested whether the gave_up_mid_plan pattern is stable across repeated campaigns. This campaign ran once (30 runs, 3 per task). The pattern is consistent across 4 tasks, but replication would strengthen the claim.

The result

[Observed]

The Anthropic 4.x family on agentic-core-v1 is now complete. Sonnet 4.6 is the best-performing model in the family on this harness, both by raw score and on task_09, the only task that requires recognizing an impossible constraint and stopping cleanly. Haiku 4.5 is the cheapest model per passing task, roughly 167× cheaper than Opus ($0.00316 versus $0.528). Opus 4.7 is neither the best-scoring nor the best-value option.
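The 167× figure is the ratio of the two cost-per-pass values in the family table. A quick check, using those reported values as-is:

```python
# Cost per passing task in USD, as reported in the family table.
haiku_cost_per_pass = 0.00316
opus_cost_per_pass = 0.528

ratio = opus_cost_per_pass / haiku_cost_per_pass
print(f"{ratio:.0f}x")  # 167x
```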

For practitioners running agentic workloads on tasks with structured success criteria, the data does not support the assumption that paying for the premium tier buys better completion rates. On agentic-core-v1, it bought fewer completions.

Sonnet 4.6 is where the Anthropic value point sits for this class of work.


ClawWorks Weekly