59 tool calls. Zero mid-plan reversals. Opus 4.8's first run on frontier-eval-v1.

June 13, 2026 · campaign-reports

Campaign: 2026-06-13-claude-opus-4-8-frontier-eval-v1
Model: Claude Opus 4.8 (eu.anthropic.claude-opus-4-8, AWS Bedrock cross-region inference, eu-west-1)
Harness: frontier-eval-v1
Runs: 35 (7 tasks × 5 runs each)
Date: 2026-06-13, approx. 10:00–11:30Z

Why Opus 4.8 instead of Fable 5

We built frontier-eval-v1 to evaluate Fable 5. On 2026-06-12, the US government issued an export control directive suspending all Fable 5 and Mythos 5 access. Bedrock eu-west-1 confirmed blocked the same day — InternalServerException on every test call. Opus 4.8 was the highest available Anthropic frontier model on the inference profile, confirmed active at 01:42Z June 13.

This campaign covers Opus 4.8’s performance on frontier-eval-v1. It is also the harness baseline: the first time any model has run these seven tasks.

One transparency note upfront: no pre-campaign predictions were filed. Per SPEC §13.5, predictions are required before campaign submission. The timeline did not allow it — Fable 5 was blocked two days before the scaffolds were ready, Opus 4.8 was confirmed as fallback at 01:42Z June 13, and the runner executed the same morning. There is no predictions-versus-reality scoring in this article because there are no predictions to score. They will be on-branch before the next frontier-eval-v1 campaign runs.

What frontier-eval-v1 tests

agentic-core-v1, our existing harness, runs 10 tasks at 3 runs each. The tasks map to routine agentic work: fix a failing test, investigate logs, trace through a codebase, execute a multi-step plan. frontier-eval-v1 is harder in a specific way: longer horizons, deeper context, higher tool-call counts per passing run.

The tasks here include migrating an 18-file Python 2.7 codebase to Python 3.11 while passing 13 tests (task_01), autonomously debugging 8 bugs across 6 files in a codebase the model has not seen before (task_04), and extracting utilities from a cross-file system without breaking any consumers (task_06). These are not “can the model call a tool” tasks. They are “can the model sustain a coherent plan across dozens of steps” tasks.

A run passes when the model writes a correct output before hitting the turn limit. A run fails as budget_exhausted when the cost ceiling is hit mid-task. A run fails as infrastructure_error when the harness terminates before the model receives the full task.

budget_exhausted and infrastructure_error are failure signals about the harness configuration, not about the model. Both appeared in this campaign. Both are documented in detail below.

The finding that matters: zero diagnosis-then-regression across 35 runs

[Observed — evidence/diagnosis_then_regression.md]

The diagnosis_then_regression pattern is what happens when a model commits to a plan, encounters difficulty mid-task, and reverses its own earlier diagnosis — “actually, let me reconsider” — before retrying. On longer tasks, this tends to produce runs that spiral rather than converge.

Across all 35 runs of frontier-eval-v1, the runner found this pattern zero times. The scanner ran against every transcript and logged 10 counter-examples confirming it executed correctly. The result is not an absence of analysis — it is an explicit null from a system that looked and found nothing.

For practitioners choosing between models for long-horizon agentic work: Opus 4.8, on tasks that require 15 to 28 turns, commits to a plan and follows it through. The model that used to stop at the hard part on agentic-core-v1 — Opus 4.7, with gave_up_mid_plan in 7 of 9 failures — is not the model in these transcripts.

The gave_up_mid_plan pattern was eliminated on agentic-core-v1 (30/30, zero occurrences). The diagnosis-then-regression null here is a separate behavioral marker on a harder harness, covering multi-file migrations and autonomous debugging campaigns. The result is consistent across both harnesses.

[Unobserved] Whether this null holds under replication. One campaign, 35 runs. The pattern did not appear. A replication on different scaffolds would strengthen the claim.

The score

[Observed — verification/pass_rate_by_task.csv, verification/tool_calls_by_task.csv, verification/latency_distribution.csv]

Task	Pass	Rate	Avg tool calls	Avg latency	Avg cost/run	Failure mode
task_01: Py2→Py3 migration	5/5	100%	59.4	186.6s	$5.55	—
task_02: Large codebase comprehension	5/5	100%	10.2	38.1s	$0.91	—
task_03: Chart analysis pack	0/5	0%	2.6	2.6s	$0.006	`infrastructure_error`
task_04: Autonomous debug campaign	5/5	100%	45.6	168.3s	$4.62	—
task_05: Express 4→5 upgrade	0/5	0%	14.4	30.2s	$15.70	`budget_exhausted`
task_06: Cross-file utility extraction	5/5	100%	47.0	153.1s	$4.05	—
task_07: Pipeline with self-monitoring	5/5	100%	16.6	88.6s	$2.40	—

Official pass rate: 25/35 (71.43%). Passes on non-broken tasks: 25/25 (100%). Actual total cost: $166.26 (verified: verification/cost_breakdown.csv). Of that, $78.49 is task_05 alone — five runs hitting the budget ceiling on a 25-file Node.js scaffold with a cost estimate that was wrong by a factor of 50.

The two failures are covered below. Neither is a behavioral finding about Opus 4.8.

What Opus 4.8 did on the passing tasks

[Observed — evidence/long_tail_turn_count.md, verification/tool_calls_by_task.csv]

task_01: Python 2→3 migration

18 files. 13 tests. All 62 public API names preserved. Five passes.

The model averaged 59.4 tool calls and 186.6 seconds per run. Turn counts ranged 15 to 26 across the five runs (verified: evidence/long_tail_turn_count.md, 5 entries with transcript refs). All five passed. The diagnosis-then-regression scanner found zero matches across these runs.

[Speculation] The spread between 15 and 26 turns on passing runs is notable — both are correct outcomes. Whether the higher-turn runs hit harder migration issues or simply reflect variance in the model’s path through the codebase is not distinguishable from turn count alone. We did not instrument inter-turn wait time or partial completion states.

task_01 is where the zero-reversal finding is most meaningful. A 59-tool-call migration task across 15 to 26 turns, and the model never walked back a diagnosis it had already committed to.

task_04: autonomous debug campaign

8 bugs. 6 files. Five passes.

Averaged 45.6 tool calls, 168.3 seconds, $4.62/run. Turn counts ranged 13 to 28 (verified: evidence/long_tail_turn_count.md, 5 entries).

One redundancy was detected. In task_04 run1, turn 8, the model read the same path twice consecutively:

tool=fs_read args={'path': 'tests/test_text_processing.py'} (repeated)
— data/transcripts/e3ad6fd9-1917-401f-9bff-5d366aec804c.jsonl#turn=8

The run passed. Across all 35 runs, that is the only consecutive same-call pair the forensics scanner found (verified: evidence/tool_call_redundancy.md). 1 of 35 runs.

task_02, task_06, task_07

task_02 (large codebase comprehension): 5/5, 10.2 tool calls average, $0.91/run. Read-heavy, low output — the model holds a large codebase in context without iterating aggressively. Cheapest non-trivial task in the campaign.

task_06 (cross-file utility extraction): 5/5, 47.0 tool calls average, $4.05/run. Five runs, turns 14–18 each (verified: evidence/long_tail_turn_count.md).

task_07 (pipeline with self-monitoring): 5/5, 16.6 tool calls average, $2.40/run. Well-defined success conditions, less iterative re-reading than the migration and debug tasks.

The two failures: what they are and what they are not

task_03: chart analysis pack

[Observed]

All five runs terminated within 2.6 seconds on average. Average cost: $0.006/run. Total cost: $0.03. The harness runner has a defect in how it delivers image payloads to the Bedrock multimodal API. The five chart PNGs exist in the task scaffold. The runner never surfaced them correctly.

[Unobserved] Opus 4.8’s multimodal capability on chart analysis. We have no data from this campaign — the task never reached the model. The fix was merged 2026-06-13 (TASK-636, PR #175).

task_05: Express 4→5 upgrade

[Observed]

The runner stopped each run at the per-run budget ceiling. The model was not stuck — it completed an average of 14.4 tool calls per run at 30.2 seconds before hitting the ceiling. The problem is the cost estimate.

The Node.js/Express scaffold has approximately 25 files. Across five runs, total input tokens: 5,197,226. That averages to 1,039,445 tokens per run in input alone (verified: verification/cost_breakdown.csv). At $5/1M input tokens, that is $5.20 in input per run before any output. The per-run cost lands at $15.70 — roughly 50 times what the standard task estimate assumed.

Two options for fixing this task: reduce the scaffold context, or set a task_05-specific budget ceiling override in the campaign spec. Both approaches were evaluated; the fix was merged 2026-06-13 (TASK-637, PR #176).

[Unobserved] Whether Opus 4.8 would pass task_05 with a corrected budget ceiling. With 14.4 tool calls per run at 30 seconds, the model was working. Whether that work converged to a passing state is unobservable from this data.

The cost picture

[Observed — verification/cost_breakdown.csv]

Task	Total cost	Avg $/run	Input tokens total
task_01: Py2→Py3 migration	$27.74	$5.55	1,355,301
task_02: Codebase Q&A	$4.53	$0.91	220,984
task_03: Chart analysis	$0.03	$0.006	— (infra failure)
task_04: Debug campaign	$23.10	$4.62	1,091,660
task_05: Express upgrade	$78.49	$15.70	5,197,226
task_06: Utility extraction	$20.24	$4.05	943,288
task_07: Pipeline monitoring	$12.00	$2.40	605,125

Total: $166.26. Of that, $78.49 is task_05 alone.

Excluding task_05, the campaign costs $87.77 — close to what a corrected-budget run would look like if task_05 is fixed by reducing the scaffold context. At $15.70/run and 5 runs, task_05 currently costs $78.49 within a $200 hard campaign cap; fixing the estimate (not the model) makes this workable.

task_01 at $5.55/run is the most expensive clean task — the 1.35M-token input budget makes sense for iterative 18-file migration with test execution at each step. task_02 at $0.91/run is the cheapest non-trivial task in the campaign: read-heavy, low output, no iteration penalty.

One operational data point for practitioners running Opus 4.8 against large TypeScript/Node.js repos: a 25-file scaffold can produce 1M-token-per-run input loads. The standard per-token estimate does not account for this. task_05 is an extreme case, but the signal is real.

What we don’t know

No predictions were filed. No scoring against pre-run predictions is possible here.

On task_03 vision capability: entirely unobserved. This article makes no claims about how Opus 4.8 handles chart analysis, because the model never received the task.

On task_05 at a sufficient budget: unobserved. The model was making progress at 14.4 tool calls per run. Whether that progress converged is unknown.

On cross-model comparison: not yet possible. Opus 4.8 is the first model we have run on frontier-eval-v1. Fable 5 was the original target. Pending the task_03 and task_05 fixes, the next frontier-eval-v1 campaign will produce a comparison.

One harness housekeeping item: the task_01 scaffold has a metadata mismatch — initial_test_state.total_tests = 12, but the scaffold has 13 tests. The harness uses pytest output as authoritative, so no run was affected. A cleanup commit is queued.

18 of 35 runs used more than 12 turns — the harness threshold for long-tail behavior. Whether the higher-turn passing runs on task_01 (up to 26 turns) and task_04 (up to 28 turns) reflect deeper task difficulty or variance in the model’s planning path is not distinguishable from transcript data alone. We did not instrument partial completion states or inter-turn wait times.

The baseline

This is the first frontier-eval-v1 campaign. The result is not a comparison — it is a starting point.

What Opus 4.8 established on the scoreable tasks: 25/25 passes, zero diagnosis-then-regression across 35 runs, one redundant tool call in 35 runs. The model’s behavior on long-horizon agentic work is now documented with transcript evidence — 59.4 average tool calls on a Python 2→3 migration, 45.6 on an autonomous debug campaign, consistent plan commitment across turn counts ranging to 28.

What this campaign also established: two harness gaps that needed fixing before frontier-eval-v1 can reliably compare models. task_03 vision delivery was broken. task_05 cost model was wrong by a factor of 50. Both fixes are merged (TASK-636 PR #175, TASK-637 PR #176, 2026-06-13).

The Fable 5 comparison follows once the export control lifts. The task_03 and task_05 fixes are in; the next campaign runs on a clean harness.