The model you expected to win didn't

June 3, 2026 · cost-analysis

agentic-core-v1 is a ten-task software engineering benchmark run against 35 model campaigns in 2025–26. Each task uses a deterministic checker: pass or fail, no partial credit, no rubric scoring. Every model runs three times. The harness was designed to test agentic behavior on work that looks like real engineering: debugging, refactoring, codebase navigation, error recovery, and recognizing when a problem has no valid solution.
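To make the scoring scheme concrete, here is a minimal sketch of how a campaign score out of 30 could be tallied from binary checker outcomes. The task IDs, the `results` layout, and the `campaign_score` helper are assumptions for illustration, not the harness's actual code.

```python
from typing import Dict, List

TASKS = 10          # tasks in the benchmark, per the description above
RUNS_PER_TASK = 3   # each task is attempted three times per model


def campaign_score(results: Dict[str, List[bool]]) -> int:
    """Count passing runs; each run is strictly pass (True) or fail (False)."""
    for task_id, runs in results.items():
        if len(runs) != RUNS_PER_TASK:
            raise ValueError(
                f"{task_id}: expected {RUNS_PER_TASK} runs, got {len(runs)}"
            )
    return sum(sum(runs) for runs in results.values())


# Hypothetical campaign: nine tasks pass every run, one task fails every run.
example = {f"task_{i:02d}": [True] * RUNS_PER_TASK for i in range(1, 10)}
example["task_10"] = [False] * RUNS_PER_TASK

print(f"{campaign_score(example)}/{TASKS * RUNS_PER_TASK}")  # prints 27/30
```

Because there is no partial credit, a task that fails all three runs costs three points at once, which is why a single weak task can move a model several places on the leaderboard.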

The going-in assumption for most buyers of large models is that more capability costs more, and the premium buys you better results. That assumption holds in some domains. agentic-core-v1 is a place where it broke down.

Three case studies from the dataset follow. In the first two, a larger and more expensive model lost to a smaller one from the same lab, and the cheaper model scored higher outright. In the third, the leaderboard leader turns out to be a model with 6.5B active parameters.

This isn’t a general argument against large models. On some tasks, scale matters. But on agentic-core-v1 (ten software engineering tasks, deterministic checkers, real tool calls), the correlation between model size and score is weaker than most people expect going in.

Case 1: Anthropic’s flagship scored worst in its own family

[Observed — campaign data, verified: cost_breakdown.csv for each campaign]

Three Anthropic models ran through agentic-core-v1 this year. The results, in order of score:

Model	Score	Cost/pass
Claude Sonnet 4.6	28/30	$0.051
Claude Haiku 4.5	27/30	$0.00316
Claude Opus 4.7	21/30	$0.528

Opus 4.7 is Anthropic’s most expensive and most capable model. It scored 21/30, seven points below Sonnet 4.6 and six below Haiku 4.5. Cost per passing task: $0.528. Haiku 4.5 is 167x cheaper per pass, at higher quality.
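The ratios in this paragraph can be reproduced directly from the table. The sketch below uses only the scores and cost-per-pass figures quoted above; the dictionary layout and helper names are assumptions for illustration.

```python
# Figures copied from the table above: (passes out of 30, USD per passing run).
anthropic = {
    "Claude Sonnet 4.6": (28, 0.051),
    "Claude Haiku 4.5": (27, 0.00316),
    "Claude Opus 4.7": (21, 0.528),
}


def cost_ratio(expensive: str, cheap: str) -> float:
    """How many times more `expensive` costs per passing run than `cheap`."""
    return anthropic[expensive][1] / anthropic[cheap][1]


def score_gap(a: str, b: str) -> int:
    """Score of `a` minus score of `b`, in points out of 30."""
    return anthropic[a][0] - anthropic[b][0]


opus = "Claude Opus 4.7"
for other in ("Claude Sonnet 4.6", "Claude Haiku 4.5"):
    print(
        f"Opus vs {other}: {cost_ratio(opus, other):.0f}x the cost per pass, "
        f"{score_gap(opus, other)} points"
    )
```

For Haiku 4.5 this gives a ratio of about 167 and a gap of -6 points, matching the figures in the text.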

Where did Opus fail? Four tasks had failures. task_03 (log investigation) and task_09 (the impossible-computation task) both scored 0/3. task_06 (handle ambiguous requirement) and task_08 (recover from tool error) each dropped one run.

The pattern in the failures: Opus struggles with ambiguity. On tasks with clear acceptance criteria and structured steps (fix a failing test, refactor code, trace through a codebase), it scored 3/3 on all six. On tasks requiring the model to commit to an interpretation when the problem isn’t fully specified, it failed at a rate of 7 in 12. The tool error recovery task (task_08) is a useful example: recovering from an injected file-not-found error requires the model to detect it, locate the correct path, and keep going. Opus dropped one of three runs there. Haiku 4.5 passed all three.

[Speculation] The 0/3 on task_03 was the unexpected failure. Log investigation requires reading a 500-line file, identifying what went wrong, and producing a diagnosis. It’s the task with the highest input token count ($0.013/run average for log context). Opus’s per-run cost for task_03 was substantially higher than that of the other models: it processed more tokens at a higher per-token rate, but the additional compute didn’t produce better output. Whatever drove the task_03 failure, spending more did not fix it.

Case 2: GPT-OSS 120B lost to its own 20B sibling

[Observed — verified: run_summary.cost_per_pass_usd for both campaigns]

OpenAI’s open-weight release shipped two models: GPT-OSS 120B and GPT-OSS 20B. The 120B ran first, scored 23/30, and set what was then the dataset’s cost floor at $0.0013 per passing run. The 20B ran later and scored 25/30 at $0.000481 per pass.

The larger model scored 2 points lower and cost 2.7x more per pass.

The 120B’s specific failure was task_07: multi-step sequential planning. Both failed runs on task_07 were wrong_answer: the model created the required files but with incorrect content. The 20B passed task_07 at 3/3, and GPT-5.5 Instant also scored 3/3 on it, which suggests the failure may be a fine-tuning artifact of the 120B’s post-training rather than an architectural limit on planning capability.

This is not a cherry-picked result. We ran the 120B twice: once initially (23/30) and once as a rerun after fixing a confound (also 23/30). The 20B was run once, scored 25/30, and beat the 120B both times. The replication data is in the leaderboard.

[Speculation] The 20B’s advantage likely has two components. First, the post-training for the 20B may have been better calibrated to the instruction-following signals that matter for this harness. The task_07 contrast makes the fine-tuning hypothesis for the 120B plausible. Second, smaller models sometimes show sharper tool-use precision because the shorter attention patterns make the relationship between tool call and response more predictable. Whether either of these is the actual mechanism can’t be determined from 30 runs per model.

Case 3: The leaderboard leader is a 6.5B-active-parameter model

[Observed — verified: pass_rate_by_task.csv, cost_breakdown.csv]

The current agentic-core-v1 leaderboard leader is Mistral Small 4. It scored 29/30, the highest score in the dataset, at $0.03 total and $0.001 per passing run.
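The per-pass figure is consistent with dividing the campaign total by the number of passing runs. The sketch below assumes that definition (total spend, failed runs included, over passes) and uses the rounded $0.03 total quoted above, so the result is approximate.

```python
total_cost_usd = 0.03   # rounded campaign total quoted above
passing_runs = 29       # 29 of 30 runs passed

# Assumed definition: cost per pass = total campaign spend / passing runs.
cost_per_pass = total_cost_usd / passing_runs

print(f"${cost_per_pass:.4f} per passing run")  # prints $0.0010 per passing run
```

Under this definition a failed run still adds to the numerator, so a model that fails often pays twice: fewer passes and a higher cost for each one.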

Mistral Small 4 is a 119B-parameter MoE model with 6.5B active parameters per token. It costs ~50x less per passing run than Claude Sonnet 4.6, which scored 28/30.

Below it, several models tie at 28/30: Claude Sonnet 4.6, GLM-4.7 Flash, DeepSeek V4 Flash, and Ministral 3 8B. Ministral 3 8B costs $0.00067 per passing run: tied with Sonnet on score, it costs 76x less per pass.

The data doesn’t show a clean cost-quality frontier. What it shows is a cluster of high-performing models with costs spread across two orders of magnitude. The models that are 100x more expensive per pass don’t score 100x better. In most cases they don’t score better at all.
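One way to check the frontier claim is to compute which models are undominated on cost and score. The sketch below covers only the seven models whose score and cost per pass both appear in this article, not the full set of 35 campaigns; the `pareto_frontier` helper is an illustrative assumption, not part of the benchmark tooling.

```python
from typing import List, Tuple

Model = Tuple[str, int, float]  # (name, passes out of 30, USD per passing run)

# Figures as quoted in this article. GLM-4.7 Flash and DeepSeek V4 Flash are
# omitted because no cost per pass is given for them here.
MODELS: List[Model] = [
    ("Claude Sonnet 4.6", 28, 0.051),
    ("Claude Haiku 4.5", 27, 0.00316),
    ("Claude Opus 4.7", 21, 0.528),
    ("GPT-OSS 120B", 23, 0.0013),
    ("GPT-OSS 20B", 25, 0.000481),
    ("Mistral Small 4", 29, 0.001),
    ("Ministral 3 8B", 28, 0.00067),
]


def pareto_frontier(models: List[Model]) -> List[Model]:
    """Keep a model only if it scores strictly higher than every cheaper model."""
    frontier: List[Model] = []
    best_score = -1
    # Walk from cheapest to most expensive; on a cost tie, higher score first.
    for name, score, cost in sorted(models, key=lambda m: (m[2], -m[1])):
        if score > best_score:
            frontier.append((name, score, cost))
            best_score = score
    return frontier


for name, score, cost in pareto_frontier(MODELS):
    print(f"{name}: {score}/30 at ${cost} per pass")
```

On this subset the frontier is GPT-OSS 20B, Ministral 3 8B, and Mistral Small 4, all at or below $0.001 per pass. Every model that costs more is matched or beaten on score by something cheaper.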

What we got wrong

[Speculation]

We predicted before running Opus 4.7 that it would lead the Anthropic family. The prediction was wrong. Opus scored last in its own family, seven points below Sonnet 4.6 and six below Haiku 4.5.

We also predicted that the GPT-OSS 120B would beat the 20B on overall score. It didn’t. The 20B scored two points higher and cost 2.7x less per passing run.

We don’t know what drives the Opus result specifically. Ambiguity handling is the best candidate from the task-level breakdown: Opus fails tasks that require committing to an interpretation when the problem isn’t fully specified, at a rate that neither Haiku nor Sonnet matches. Whether that’s a training distribution gap or a deliberate calibration choice, we can’t say. The task-level patterns are clear. The mechanism isn’t.

Does scale predict score?

[Speculation]

agentic-core-v1 tests specific capabilities: reading and navigating codebases, making targeted edits, following multi-step plans, recovering from errors, and recognizing impossible requests. These capabilities appear to be trainable at smaller scales when the training data and post-training are well-matched to the task type. Mistral Small 4’s pre-training included Devstral’s agentic-coding signal, and that transferred intact through the model merge. Ministral 3 8B achieved 28/30 without the MoE efficiency advantages at all: 8 billion dense parameters.

What the data does not support is a simple “more parameters = better agentic performance” thesis. The three case studies above document the same pattern three different ways: larger doesn’t win. On this harness, at this time, the relationship between scale and score is weak enough to treat as noise for most comparisons.

The individual campaign articles have the per-task breakdown for each model. The methodology is in “agentic-core-v1: What We Actually Measure and Why.”

ClawWorks Weekly