Flagship pricing, fourth-place finish. Claude Fable 5 on agentic-core-v1.

June 10, 2026 · campaign-reports

Campaign: 2026-06-10-claude-fable-5-agentic-core-v1-rerun
Model: Claude Fable 5 (Anthropic, via AWS Bedrock cross-region inference, eu-west-1)
Harness: agentic-core-v1 (openclaw@2026.4.22)
Runs: 30 (10 tasks × 3 runs each)
Date: 2026-06-10

Anthropic released Claude Fable 5 on June 9th. It is the new Mythos-class flagship: $10 per million input tokens, $50 per million output tokens. Fable 5 ships with a safety layer that routes a small fraction of sessions to Opus 4.8 for screening.

We ran it on agentic-core-v1 the following day. It scored 25/30.

That puts Fable 5 fourth in the Anthropic family on this harness. Haiku 4.5 scored 27/30 at $0.085 total. Sonnet 4.6 scored 28/30 at $1.44 total. Opus 4.8 scored 30/30 at $7.34 total. Fable 5 scored 25/30 at $1.97.

One of our three pre-campaign predictions said the total cost would exceed Opus 4.8’s $7.34, based on Fable 5’s output pricing relative to estimates for Opus 4.8’s token usage. The actual cost was $1.97. We overestimated by a factor of 3.7, which means we were wrong in the better direction. The data explains why.

What the harness tests

agentic-core-v1 runs a model through 10 tasks mapped to things working software engineers do: fixing a failing test, refactoring duplicated code, investigating logs, tracing through a codebase, handling an ambiguous requirement, executing a multi-step plan, recovering from a tool error, and knowing when to stop. Each task runs 3 times independently. A run passes when the model writes the correct answer to an output file before hitting the turn limit.

Every failure gets a label. wrong_answer means the model committed something to the output file, but what it wrote did not satisfy the acceptance criteria. gave_up_mid_plan means the model stopped executing without writing a complete answer. infrastructure_error means the harness itself failed, not the model.
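The labelling rule described above can be sketched as a small decision function. This is an illustration only: `Outcome`, `RunRecord`, and `label_run` are hypothetical names, and the three boolean fields are assumptions about what a harness would need to record, not the actual agentic-core-v1 schema.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    PASSED = "passed"
    WRONG_ANSWER = "wrong_answer"
    GAVE_UP_MID_PLAN = "gave_up_mid_plan"
    INFRASTRUCTURE_ERROR = "infrastructure_error"


@dataclass
class RunRecord:
    """Hypothetical per-run record; field names are assumptions."""
    harness_failed: bool    # the harness itself broke, not the model
    answer_written: bool    # a complete answer reached the output file
    answer_accepted: bool   # the answer satisfied the acceptance criteria


def label_run(run: RunRecord) -> Outcome:
    # Harness failures are checked first so they are never blamed on the model.
    if run.harness_failed:
        return Outcome.INFRASTRUCTURE_ERROR
    # The model stopped without committing a complete answer.
    if not run.answer_written:
        return Outcome.GAVE_UP_MID_PLAN
    # An answer was committed: it either meets the criteria or it does not.
    if run.answer_accepted:
        return Outcome.PASSED
    return Outcome.WRONG_ANSWER


print(label_run(RunRecord(False, True, False)).value)
```

Checking the harness failure first mirrors the distinction the report draws between model failures and infrastructure failures.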

This run used only clean data from the confirmed re-run. An earlier attempt on the same day produced 8 usable runs out of 30, with 22 infrastructure errors traced to account data retention mode instability.¹ That contaminated run is archived. Results here are from the clean re-run only.

The score

[Observed]

25 of 30 runs passed. Pass rate: 83.3% (verified: pass_rate_by_task.csv). Total cost: $1.97. Cost per passing run: $0.079 (verified: cost_breakdown.csv).

All 5 failures were wrong_answer. Zero gave_up_mid_plan. Zero infrastructure_error. Zero tool_call_error. Zero timeouts. Fable 5 finished every run and committed an answer every time. The answers just were not always right.

The full Anthropic 4.x family, now complete (verified: pass_rate_by_task.csv for all five campaigns):

Model	Score	Pass rate	Total cost	$/pass	task_03	task_09
Opus 4.8	30/30	100.0%	$7.34	$0.245	3/3	3/3
Sonnet 4.6	28/30	93.3%	$1.44	$0.051	3/3	3/3
Haiku 4.5	27/30	90.0%	$0.085	$0.003	3/3	0/3
Fable 5	25/30	83.3%	$1.97	$0.079	0/3	2/3
Opus 4.7	21/30	70.0%	$11.08	$0.528	0/3	0/3
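The pass-rate and $/pass columns follow directly from the score and total-cost columns. A minimal sketch that recomputes them from the figures in the table; the function names are ours, not part of the harness:

```python
# Figures copied from the table above: (passing runs, total cost in USD).
RESULTS = {
    "Opus 4.8": (30, 7.34),
    "Sonnet 4.6": (28, 1.44),
    "Haiku 4.5": (27, 0.085),
    "Fable 5": (25, 1.97),
    "Opus 4.7": (21, 11.08),
}
TOTAL_RUNS = 30  # 10 tasks x 3 runs each


def pass_rate(passed: int, total: int = TOTAL_RUNS) -> float:
    """Percentage of runs that passed."""
    return 100.0 * passed / total


def cost_per_pass(total_cost: float, passed: int) -> float:
    """Total campaign cost divided by the number of passing runs."""
    return total_cost / passed


for model, (passed, cost) in RESULTS.items():
    print(f"{model:<11} {pass_rate(passed):5.1f}%  ${cost_per_pass(cost, passed):.3f}/pass")
```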

Against Haiku 4.5, the comparison is stark. Haiku 4.5 scored 27/30 at $0.085 total. Fable 5 scored 25/30 at $1.97. That is two fewer passing runs at roughly 23 times the total cost and about 25 times the cost per passing run ($1.97 ÷ 25 against $0.085 ÷ 27). On this specific harness, the budget model wins on both counts.

Against Opus 4.7, the comparison runs the other way. Fable 5 logged 4 more passing runs at roughly one-sixth the cost ($1.97 vs $11.08). The old premium tier delivered poor agentic performance at high cost. Fable 5 delivers better agentic performance at much lower cost. Whether that counts as an improvement depends on what the buyer expected from a flagship.
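The ratios in these two comparisons can be checked from the totals. A short sketch, using only figures from the table. Note that dividing the unrounded per-pass costs gives about 25, while dividing the rounded table values ($0.079 by $0.003) gives about 26.

```python
# Totals and passing-run counts from the family table.
fable_cost, fable_passed = 1.97, 25
haiku_cost, haiku_passed = 0.085, 27
opus47_cost = 11.08

# Both campaigns ran 30 runs, so the total-cost ratio is also the per-run ratio.
total_ratio = fable_cost / haiku_cost

# Per-pass ratio computed from unrounded per-pass costs.
per_pass_ratio = (fable_cost / fable_passed) / (haiku_cost / haiku_passed)

# Fable 5's total cost as a fraction of Opus 4.7's.
opus47_fraction = fable_cost / opus47_cost

print(f"Fable 5 vs Haiku 4.5, total cost: {total_ratio:.1f}x")
print(f"Fable 5 vs Haiku 4.5, per pass:   {per_pass_ratio:.1f}x")
print(f"Fable 5 as share of Opus 4.7:     {opus47_fraction:.2f}")
```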

Where Fable 5 failed

[Observed]

Three tasks produced failures. Two are partial (2/3). One is complete (0/3).

The log investigation problem (task_03: 0/3)

task_03_investigate_log requires extracting a specific multi-hop insight from a structured access log. The task is not about tool use or planning. It tests whether the model can correlate multiple entry types across a log file and identify a specific pattern.

Fable 5 failed all three runs. The failure mode is wrong_answer each time. Avg latency: 15.5 seconds per run. That is the lowest avg latency of any failing task in this campaign. The model was not slow. It was fast, methodical, and consistently wrong.

Compare that to Opus 4.7, which also failed task_03 all three times. Opus 4.7 cost $0.82 per run across those failures ($2.46 total, verified: cost_breakdown.csv from Opus 4.7 campaign). Fable 5 cost $0.07 per run ($0.21 total, verified: cost_breakdown.csv). Different failure shapes: Opus 4.7 spent heavily trying to reach an answer; Fable 5 reached its wrong answers cheaply and quickly. The shared failure mode is that neither model got task_03 right. The behavioral profiles behind those failures are different.

Haiku 4.5, Sonnet 4.6, and Opus 4.8 all passed task_03 at 3/3. Fable 5 joins Opus 4.7 as the only models in the Anthropic family to fail this task outright (verified: pass_rate_by_task.csv, Haiku 4.5 and Sonnet 4.6 campaign data packs).

When to stop and when to commit (task_09: 2/3, task_10: 2/3)

task_09_know_when_to_stop asks the model to compute a 10-day moving average on a CSV file with only 3 rows. The window is impossible. A passing run writes a clean “this cannot be computed” finding. A failing run either produces garbage output or over-commits beyond the sensible stopping point.
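To make the impossibility concrete, here is a minimal sketch of the check a passing run has to make. The CSV contents and column names are invented stand-ins, not the task's actual fixture, and `moving_average` is our own helper.

```python
import csv
import io
from typing import List, Optional


def moving_average(values: List[float], window: int) -> Optional[List[float]]:
    """Return simple moving averages, or None when no full window exists."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        return None  # not enough rows to fill even one window
    return [
        sum(values[i - window:i]) / window
        for i in range(window, len(values) + 1)
    ]


# Hypothetical 3-row file standing in for the task input.
SAMPLE_CSV = "date,value\n2026-01-01,10\n2026-01-02,12\n2026-01-03,11\n"
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
values = [float(row["value"]) for row in rows]

result = moving_average(values, window=10)
if result is None:
    print(f"Cannot be computed: {len(values)} rows, window of 10 required.")
else:
    print(result)
```

Returning `None` rather than padding or shrinking the window is the design choice that matches the task's passing condition: report that the computation is impossible instead of producing a number.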

Fable 5 passed two of three runs on task_09. The failure was wrong_answer on run 2 at 42.6 seconds, the longest latency in the campaign (verified: latency_distribution.csv). The model did not stop at the right point on that run. It continued evaluating past where the correct stopping signal occurs, then committed an incorrect finding. The two passing runs both completed in under 36 seconds.

task_10_sql_investigation produced a different failure pattern. One run failed at a cost of $0.03, which is the lowest per-run cost in the entire campaign (verified: cost_breakdown.csv). The model anchored on an early read of the query space and committed a wrong answer without exploring further. The other two runs were thorough and correct, costing $0.11 and $0.14 respectively. One run cut short, two ran clean. The same task, three different execution paths, two outcomes.

What we did not find

[Unobserved]

The runner’s diagnosis_then_regression pattern tracks runs where a model diagnoses what needs to happen, then reverses that diagnosis mid-task before committing a wrong answer. This pattern appeared in prior Opus campaigns and contributes to extended, expensive failures.

Fable 5 produced zero diagnosis_then_regression hits across all 30 runs (verified: failure_mode_histogram.csv). The model does not second-guess itself once it commits to a reasoning path. That is consistent with the task_03 and task_10 failure profiles: on task_03, three methodical runs at low latency, all wrong; on task_10, one early-anchor failure and two thorough passes. The model commits and does not walk it back.

The no-self-correction pattern helps on the structured tasks where Fable 5’s initial read is correct, and hurts on the ones where that read is wrong. task_03 is the clearest case of the latter: nothing in those three runs prompted the model to question its first read.

Why the cost estimate was wrong

[Observed, with speculation on mechanism]

Pre-campaign prediction 3 was that Fable 5’s total cost would exceed Opus 4.8’s $7.34, based on Fable 5’s output pricing ($50/M) relative to estimates of Opus 4.8’s token output per run.

The actual cost was $1.97. The prediction was off by $5.37.

The mechanism is measurable: Fable 5 generates substantially less output per run than Opus 4.8 does. Opus 4.8 averaged 36.0 tool calls on task_07 alone (verified: tool_calls_by_task.csv). Fable 5’s most expensive task in total cost was task_09 at $0.32 across three runs. Fable 5 runs lean on output volume. The $50/M output rate matters less than expected when the model keeps its token counts low.
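The arithmetic behind that point is simple: per-run cost is token volume times price, so a high output rate only bites when output volume is high. A minimal sketch using Fable 5's published rates; the token counts are hypothetical, chosen only to show the effect, and are not measurements from this campaign.

```python
INPUT_PRICE_PER_M = 10.0   # USD per million input tokens (Fable 5 list price)
OUTPUT_PRICE_PER_M = 50.0  # USD per million output tokens (Fable 5 list price)


def run_cost(
    input_tokens: int,
    output_tokens: int,
    input_price: float = INPUT_PRICE_PER_M,
    output_price: float = OUTPUT_PRICE_PER_M,
) -> float:
    """Cost in USD of one run, given its token counts and per-million prices."""
    return (input_tokens / 1e6) * input_price + (output_tokens / 1e6) * output_price


# Hypothetical token counts: same input, lean versus verbose output.
lean = run_cost(input_tokens=3_000, output_tokens=500)
verbose = run_cost(input_tokens=3_000, output_tokens=4_000)

print(f"lean run:    ${lean:.3f}")
print(f"verbose run: ${verbose:.3f}")

# Observed campaign average, from the report's totals: $1.97 over 30 runs.
print(f"observed avg per run: ${1.97 / 30:.3f}")
```

With the price held fixed, the eightfold difference in output tokens is what separates the two hypothetical runs, which is the same effect the prediction missed.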

[Speculation] Whether Fable 5’s low output volume reflects a training objective (conciseness) or a task-specific property of agentic-core-v1’s structure is unobserved. The pattern is consistent across all 10 tasks, which suggests it is not task-specific, but a single harness is limited evidence.

Were our predictions right?

[Observed]

Three predictions were committed before the campaign ran (verified: campaigns/2026-06-10-claude-fable-5-agentic-core-v1-rerun.notes.md):

Label	Prediction	Verdict	Detail
1	task_09 ≤ 2/3	✅ CONFIRMED	Exactly 2/3, over-extended on ambiguous stop signal
2	Pass rate < 30/30	✅ CONFIRMED	25/30
3	Total cost > Opus 4.8 ($7.34)	❌ WRONG	$1.97. Fable 5 runs lean on output volume

2 of 3. The miss on prediction 3 is the meaningful one: the cost model was built on output pricing, not on actual output volume. Fable 5’s per-run token count is much lower than Opus 4.8’s, which makes the output rate less decisive. The lesson is that price per token is not a reliable proxy for total cost without a model-specific estimate of typical output volume.

What we don’t know

Fable 5 ships with a routing layer that sends a small fraction of sessions to Opus 4.8 for safety screening. The agentic-core-v1 harness did not detect any routing events in this campaign, but the harness does not instrument that layer directly. Whether the 25/30 result includes any Opus 4.8-routed runs is unobserved.

[Unobserved] How Fable 5 performs on task types outside agentic-core-v1’s scope. The harness is coding and investigation tasks. Fable 5’s pricing tier ($10/$50/M) implies a different use case profile: extended context, vision, broad instruction-following. The 83.3% on this harness is not a verdict on those use cases. It is a data point on these specific tasks.

The task_10 bimodal behavior, where one run anchors early and two runs complete cleanly, suggests variance rather than a consistent failure mode. A larger sample (more than 3 runs) would clarify whether early-anchor exits are common or rare on SQL investigation tasks for Fable 5.
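One way to see how little 3 runs can say is to put a confidence interval on the per-task pass rate. The sketch below uses the Wilson score interval, which is our choice of method and not something the harness computes; the 20-of-30 case is a hypothetical larger sample at the same observed rate.

```python
import math
from typing import Tuple


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 for roughly 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, center - half), min(1.0, center + half)


low3, high3 = wilson_interval(2, 3)      # the observed task_10 result
low30, high30 = wilson_interval(20, 30)  # hypothetical: same rate, 30 runs

print(f"2/3 runs:   {low3:.2f} to {high3:.2f}")
print(f"20/30 runs: {low30:.2f} to {high30:.2f}")
```

At 2 of 3, the interval spans roughly 0.21 to 0.94, so the observed result is compatible with early-anchor exits being either rare or common.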

The result

[Observed]

Fable 5 slots in at 83.3% on agentic-core-v1, fourth in the Anthropic family behind Opus 4.8, Sonnet 4.6, and Haiku 4.5. The failure profile is clean in one sense: every run reached a conclusion and committed an answer. The failures are wrong_answer, not gave_up_mid_plan. Fable 5 does not hesitate; it just does not always get task_03 right, and it over-extends on task_09 one time in three.

The cost story is more interesting than the score story. At $1.97 for 25 passing runs, Fable 5 is not obviously over-priced for what it delivers on agentic work. It is not cheap compared to Haiku 4.5 or Sonnet 4.6. But it is not the cost outlier that Opus 4.7 was.

The positioning question is harder to answer from this data alone. Fable 5 is built for something broader than what agentic-core-v1 measures. Whether it earns its price on vision tasks, extended context, or broad instruction-following is not observable here. On this harness, on these tasks, the mid-tier models it was built to supersede simply perform better.

¹ Contaminated run archived as campaign 2026-06-10-claude-fable-5-agentic-core-v1. Failure breakdown from failure_mode_histogram.csv: infrastructure_error 22, passed 8.

ClawWorks Weekly