Claude's worst model became its best. One version bump, 30/30.

June 7, 2026 · campaign-reports

Campaign: 2026-06-06-claude-opus-4-8-agentic-core-v1
Model: Claude Opus 4.8 (Anthropic, via AWS Bedrock cross-region inference, us-east-1)
Harness: agentic-core-v1 (openclaw@2026.4.22)
Runs: 30 (10 tasks × 3 runs each)
Date: 2026-06-07T00:01–00:09Z

Two weeks ago we published the Opus 4.7 result: 21/30 (70%), $0.528 per passing task. Haiku 4.5, the cheap model in the same family, scored 27/30. Sonnet 4.6 scored 28/30. The premium tier posted the worst score of any Anthropic model we had tested.

We ran Opus 4.8 on the same harness. It scored 30/30.

That is the first perfect score in the ModelClaw agentic-core-v1 dataset. Every task that Opus 4.7 failed (task_03, task_06, task_08, task_09) is now 3/3 clean. The gave_up_mid_plan failure mode, where the model starts a task, makes progress, and then stops before committing a final answer, accounted for 7 of 9 failures in 4.7. It appears zero times in 4.8 across all 30 runs.

One of our three pre-campaign predictions said task_09 would stay broken, at 0/3 or 1/3. It was wrong. Opus 4.8 passed task_09 cleanly, all three runs. The predictions are scored below.

What the harness actually tests

agentic-core-v1 runs a model through 10 tasks that map to things working software engineers do, including fixing a failing test, refactoring duplicated code, investigating logs, tracing through a codebase, handling an ambiguous requirement, executing a multi-step plan, recovering from a tool error, and knowing when to stop. Each task runs 3 times independently. A run passes when the model writes the correct answer to an output file before hitting the turn limit.

Every failure gets a label. gave_up_mid_plan means the model stopped executing before the turn limit and before writing a complete answer — it started the work and quit. wrong_answer means the model did commit something to the output file, but what it wrote did not satisfy the acceptance criteria.
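The pass rule and the two labels above can be sketched as a small decision function. This is an illustration only: `RunRecord`, its fields, and `classify_run` are hypothetical names, not the harness's actual schema. The article does not say how a run that reaches the turn limit without an answer is labeled, so the sketch returns `None` for that case rather than inventing a label.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    """Hypothetical summary of one run; the field names are assumptions."""
    wrote_answer: bool     # a complete answer was committed to the output file
    answer_correct: bool   # that answer satisfied the acceptance criteria
    hit_turn_limit: bool   # the run ended because the turn limit was reached


def classify_run(run: RunRecord) -> Optional[str]:
    """Return 'pass', a failure label, or None where the article defines no label."""
    if run.wrote_answer and run.answer_correct:
        return "pass"
    if run.wrote_answer:
        # Something was committed, but it did not meet the criteria.
        return "wrong_answer"
    if not run.hit_turn_limit:
        # Stopped early without committing a complete answer.
        return "gave_up_mid_plan"
    return None  # ran out of turns with no answer: not described in the article
```

Under this reading, the Opus 4.7 task_09 runs (constraint detected, nothing written, turn limit not reached) fall in the `gave_up_mid_plan` branch.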

On this harness, passing all 10 tasks 3/3 is the baseline for calling a model a “capable agentic model”.

The score

[Observed]

30 of 30 runs passed. Pass rate: 100.0% (verified: pass_rate_by_task.csv). Total cost: $7.34. Cost per passing task: $0.245 (verified: cost_breakdown.csv). The campaign ran for 8 minutes 37 seconds.

Zero failures of any kind. Zero gave_up_mid_plan. Zero wrong_answer.

The complete Anthropic 4.x family picture is now available (verified: pass_rate_by_task.csv for all four campaigns):

Model	Score	Pass rate	Total cost	$/pass	task_03	task_09
Haiku 4.5	27/30	90.0%	$0.085	$0.003	3/3	0/3
Sonnet 4.6	28/30	93.3%	$1.44	$0.051	3/3	3/3
Opus 4.7	21/30	70.0%	$11.08	$0.528	0/3	0/3
Opus 4.8	30/30	100.0%	$7.34	$0.245	3/3	3/3

The inversion that defined the prior campaign (premium model, worst score) is gone. Opus 4.8 sits at the top of the family by score and, with Opus 4.7 superseded, at the top of the current lineup by cost. That is the expected ordering. Against Sonnet 4.6, it is 4.8× more expensive per passing task ($0.245 vs $0.051) and delivers two additional passing runs per 30-run campaign.
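The derived columns in the family table can be reproduced from the pass counts and total costs alone. The sketch below copies those inputs from the table; the variable and function names are illustrative. It also shows that the per-pass cost ratio against Sonnet 4.6 rounds to 4.8×, while the ratio of total campaign cost rounds to 5.1×.

```python
# (passes, runs, total cost in USD), copied from the family table.
campaigns = {
    "Haiku 4.5": (27, 30, 0.085),
    "Sonnet 4.6": (28, 30, 1.44),
    "Opus 4.7": (21, 30, 11.08),
    "Opus 4.8": (30, 30, 7.34),
}


def pass_rate(passes, runs):
    """Pass rate as a percentage, to one decimal place."""
    return round(100 * passes / runs, 1)


def cost_per_pass(passes, total_cost):
    """Total campaign cost divided by the number of passing runs."""
    return round(total_cost / passes, 3)


summary = {
    name: (pass_rate(p, r), cost_per_pass(p, c))
    for name, (p, r, c) in campaigns.items()
}

# Two different ratios against Sonnet 4.6: per passing run, and total cost.
per_pass_ratio = summary["Opus 4.8"][1] / summary["Sonnet 4.6"][1]
total_cost_ratio = campaigns["Opus 4.8"][2] / campaigns["Sonnet 4.6"][2]

for name, (rate, per_pass) in summary.items():
    print(f"{name}: {rate}% pass rate, ${per_pass:.3f} per pass")
print(f"Opus 4.8 vs Sonnet 4.6: {per_pass_ratio:.1f}x per pass, "
      f"{total_cost_ratio:.1f}x total cost")
```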

What changed between versions?

[Observed]

The most significant shift is not the score. It is the disappearance of gave_up_mid_plan.

In Opus 4.7, that label showed up in 7 of 9 failures (verified: failure_mode_histogram.csv for both campaigns). The pattern was consistent across four different task types: the model would begin execution, encounter something ambiguous or self-contradictory in the task, and stop before writing an answer. It detected the difficulty. It did not resolve it.

Opus 4.8 shows this pattern zero times in 30 runs. [Speculation] Something in the version bump recalibrated the relationship between ambiguity detection and execution commitment — the model that used to stop at the hard part now finishes it. Whether that reflects a targeted training change or an emergent effect of broader capability improvements is unobserved.

What is observed: the specific tasks that exposed this in 4.7 are all 3/3 in 4.8.

Why does task_09 show this most clearly?

task_09 (know_when_to_stop) asks the model to compute a 10-day moving average on a CSV file that contains only 3 rows of data. A 10-day window applied to 3 data points is impossible — the task is designed to test whether the model recognizes the constraint and writes a clean “I cannot compute this” finding rather than producing garbage output.
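The constraint is easy to state in code. The following is a minimal sketch of the behavior the task rewards, not the harness's acceptance check: `moving_average` and `answer_for` are hypothetical helpers, and the three sample values are invented for illustration.

```python
def moving_average(values, window):
    """Simple moving average; refuses to run when there are fewer rows than the window."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        raise ValueError(f"need at least {window} rows, got {len(values)}")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def answer_for(values, window):
    """Return the text to commit: the averages, or an explicit finding."""
    try:
        averages = moving_average(values, window)
    except ValueError as exc:
        # Committing a finding still counts as writing an answer.
        return f"Cannot compute this: {exc}"
    return ", ".join(f"{a:.2f}" for a in averages)


# Three invented rows against a 10-day window, mirroring the task's shape.
print(answer_for([101.0, 103.5, 99.8], window=10))
```

The point of the `except` branch is that detecting the impossibility and writing it down are separate steps. Opus 4.7 did the first and skipped the second.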

Opus 4.7 failed task_09 all three times with gave_up_mid_plan. It read the data, identified that 3 rows were insufficient for a 10-day window, and then stopped executing without writing anything to the output file. The detection capability was there. The commitment to an answer was not.

Opus 4.8 passed task_09 all three times (verified: pass_rate_by_task.csv). Average: 5.7 tool calls per run, 16.7 seconds (verified: tool_calls_by_task.csv, latency_distribution.csv). It reads the same data, identifies the same impossibility, and writes the finding. Same detection. Different ending.

Sonnet 4.6 passed task_09 cleanly in its campaign as well. Haiku 4.5 failed it 0/3. This task now splits the Anthropic family three ways: the two models that can complete it (Sonnet 4.6 and Opus 4.8), the budget model that cannot (Haiku 4.5), and the prior premium model that detected the problem but would not commit an answer (Opus 4.7, now superseded).

What about task_07 and the tool-call count?

[Observed]

task_07 (multi_step_plan) averaged 36.0 tool calls per run in Opus 4.8 — down from 40.0 in Opus 4.7, but still the highest of any task in this campaign by a wide margin (verified: tool_calls_by_task.csv). All three runs passed.

Opus 4.7 showed a similar over-verification pattern on this task: structured criteria, unambiguous success conditions, and the model checking its own work multiple times before advancing to the next step. Opus 4.8 has moderated that behavior (36 calls rather than 40) but has not eliminated it. For practitioners running Opus 4.8 on structured multi-step tasks: the model is thorough. On task_07, that thoroughness produces 3/3 passes. How it behaves on real workloads with similar structures was not measured in this campaign.

Were our predictions right?

[Observed]

Three predictions were committed before the campaign ran (verified: campaigns/claude-opus-4-8-agentic-core-v1.notes.md):

Label	Prediction	Verdict	Detail
A	task_03 ≥ 1/3	✅ EXCEEDED	3/3 — full reversal from 4.7’s 0/3
B	task_09 stays 0/3 or 1/3	❌ MISS	3/3 — complete fix
C	Overall ≥ 26/30	✅ EXCEEDED	30/30 — first perfect score

2 of 3. The miss on B is the interesting one. The prediction assumed that whatever caused the gave_up_mid_plan behavior on task_09 in Opus 4.7 was a stable characteristic of the Opus tier, and that the detection-without-commitment pattern would carry forward at least partially. It did not. The fix was more complete than expected, and the prediction was wrong in the better direction.

Prediction A assumed partial recovery on task_03. It got a full reversal. Prediction C was conservative, and Opus 4.8 exceeded it by 4 runs.
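The scoring of the three predictions is mechanical once each one is written as a check against the observed results. A minimal sketch, with the thresholds taken from the table above and all names chosen for illustration:

```python
# Observed Opus 4.8 results from this campaign (passes out of 3, and out of 30).
observed = {"task_03": 3, "task_09": 3, "overall": 30}

# Each pre-campaign prediction expressed as a check on the observed results.
predictions = {
    "A": lambda r: r["task_03"] >= 1,        # task_03 at least 1/3
    "B": lambda r: r["task_09"] in (0, 1),   # task_09 stays at 0/3 or 1/3
    "C": lambda r: r["overall"] >= 26,       # overall at least 26/30
}

verdicts = {label: check(observed) for label, check in predictions.items()}
hits = sum(verdicts.values())

print(verdicts)
print(f"{hits} of {len(verdicts)}")
```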

The cost picture

[Observed]

Opus 4.8 is cheaper than Opus 4.7 despite the better result. $7.34 total versus $11.08 — a 33.8% cost reduction alongside a 43% improvement in absolute pass count (verified: cost_breakdown.csv).

The task_03 numbers show what the money now buys. In Opus 4.7, task_03 cost $2.47 across three failed runs: long, expensive attempts that produced no passing output. In Opus 4.8, task_03 costs $2.51 for three passing runs (verified: cost_breakdown.csv). The model is spending similar compute on this task but producing correct answers instead of failures, so the overall saving comes from the other tasks, not from task_03. A failed run is still an expensive run.
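The percentages in the Opus 4.7 comparison can be checked with a few lines of arithmetic. The inputs below are the figures quoted above; the dictionary layout and the `pct_change` helper are illustrative only.

```python
# Figures quoted in the article for the two Opus campaigns.
opus_47 = {"passes": 21, "total_cost": 11.08, "task_03_cost": 2.47}
opus_48 = {"passes": 30, "total_cost": 7.34, "task_03_cost": 2.51}


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return 100 * (new - old) / old


cost_change = pct_change(opus_47["total_cost"], opus_48["total_cost"])
pass_change = pct_change(opus_47["passes"], opus_48["passes"])
task_03_change = pct_change(opus_47["task_03_cost"], opus_48["task_03_cost"])

print(f"total cost:   {cost_change:+.1f}%")    # -33.8%
print(f"pass count:   {pass_change:+.1f}%")    # +42.9%, quoted as 43%
print(f"task_03 cost: {task_03_change:+.1f}%") # +1.6%, essentially flat
```

The near-zero change on task_03 alongside a roughly one-third drop in total cost is what indicates the saving was made on other tasks.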

Against Sonnet 4.6 the picture is less favourable for Opus. Sonnet 4.6 costs $1.44 total for 28 passing runs ($0.051 per passing task). Opus 4.8 costs $7.34 for 30 passing runs ($0.245 per passing task). The two runs Sonnet 4.6 failed were probabilistic misses on tasks it has historically stumbled on, not consistent hard limits. At volume, the $7.34 vs $1.44 gap scales linearly.

What we don’t know

The forensics pass flagged 2 runs with redundant tool calls in Opus 4.8: one in task_09 run 3 (duplicate fs_read) and one in task_10 run 1 (duplicate fs_write) (verified: tool_call_redundancy.md). Both runs passed. In Opus 4.7, 10 of 30 runs showed this pattern. The drop from 33.3% to 6.7% of runs tracks the overall improvement (fewer mid-task hesitations, fewer redundant re-reads), but the mechanism connecting the two observations is unobserved.
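As a rough illustration of what a redundancy check could look like, the sketch below flags any exact repeat of a tool call within one run and computes the share of flagged runs. The article does not describe how the forensics pass defines redundancy, so the "same tool, same argument" rule, the trace format, and the file names are all assumptions.

```python
from collections import Counter


def duplicate_calls(calls):
    """Return the (tool, argument) pairs that occur more than once in a run."""
    counts = Counter(calls)
    return [call for call, n in counts.items() if n > 1]


def flagged_rate(flagged_runs, total_runs):
    """Share of runs with at least one redundant call, in percent."""
    return round(100 * flagged_runs / total_runs, 1)


# Hypothetical trace: tool names match the article, file names are invented.
trace = [
    ("fs_read", "data.csv"),
    ("fs_read", "data.csv"),
    ("fs_write", "answer.txt"),
]

print(duplicate_calls(trace))
print(flagged_rate(10, 30), flagged_rate(2, 30))  # Opus 4.7 vs Opus 4.8
```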

[Unobserved] Whether the gave_up_mid_plan elimination is stable under replication. This campaign ran once (30 runs, 3 per task). The absence of the pattern across all 30 runs is meaningful, but a single campaign does not confirm it is gone permanently. A replication campaign would strengthen that claim.

The Anthropic 4.x family is now complete on agentic-core-v1. Whether Opus 4.8’s perfect score holds as the harness adds harder tasks or new task types is not tested.

The result

[Observed]

Opus 4.7 was the anomaly in the Anthropic family: the premium model that performed worst on a harness its siblings handled well. Opus 4.8 corrects that anomaly completely — 30/30, zero failure modes observed, first perfect score in the agentic-core-v1 dataset.

The behavioral story is more interesting than the score. The gave_up_mid_plan signature that defined Opus 4.7 is gone. The model that used to detect hard tasks and stop now detects them and finishes. task_09 is the clearest evidence: same harness, same impossible-task design, same detection. Different ending.

For practitioners: Opus 4.8 is the right choice for high-stakes, low-volume agentic work where the extra reliability margin matters. Sonnet 4.6 remains the practical default at volume: 28/30 at 4.8× lower cost per passing task (5.1× lower total campaign cost).

The Anthropic premium tier is now properly priced. Expensive, and it earns it.


ClawWorks Weekly