Two models at 27/30 are not the same model

agentic-core-v1 is a ten-task software engineering benchmark. Ten tasks, three runs each, deterministic pass/fail checkers. The leaderboard shows a headline score for each model: 29/30, 28/30, 27/30. It is tempting to read that number as a ranking. In most cases it is, roughly. But two models with the same score can have completely different capability profiles, and the headline collapses the difference.

Three comparisons from the dataset illustrate this.


Claude Haiku 4.5 and Devstral 2 123B, both at 27/30

[Observed — verified: pass_rate_by_task.csv for both campaigns]

Claude Haiku 4.5 scored 27/30. Nine of ten task types: 3/3. One task type: 0/3. The 27/30 comes entirely from a categorical failure on task_09 (know_when_to_stop). On all 27 other runs, Haiku passed. No other task dropped a single run.

Devstral 2 123B also scored 27/30. But the profile looks different: eight of ten task types at 3/3, task_08 (recover_from_tool_error) at 2/3, task_09 at 1/3. The 27/30 comes from partial failures on two tasks rather than a categorical failure on one.

The scores are identical. The failure modes are not.

| Task | Haiku 4.5 | Devstral 2 |
|---|---|---|
| tasks 01-07 (fix, refactor, log, trace, minimal fix, ambiguous req, plan) | 3/3 each | 3/3 each |
| task_08 recover from tool error | 3/3 | 2/3 |
| task_09 know when to stop | 0/3 | 1/3 |
| task_10 SQL investigation | 3/3 | 3/3 |
| Total | 27/30 | 27/30 |

If you are selecting a model for a pipeline that needs to handle tool errors gracefully, Haiku passes task_08 every time. Devstral passes it two out of three. The 27/30 score does not reveal this. The per-task breakdown does.
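A minimal sketch of that comparison, using only the per-task pass counts reported above. The dictionary layout and the function names `headline` and `failures` are illustrative assumptions, not the harness's own data format.

```python
RUNS_PER_TASK = 3

# Passes per task (out of 3), transcribed from the per-task numbers above.
haiku = {f"task_{i:02d}": 3 for i in range(1, 11)}
haiku["task_09"] = 0

devstral = {f"task_{i:02d}": 3 for i in range(1, 11)}
devstral["task_08"] = 2
devstral["task_09"] = 1


def headline(passes):
    """Total passes over total runs: the leaderboard number."""
    return sum(passes.values()), RUNS_PER_TASK * len(passes)


def failures(passes):
    """Tasks that dropped at least one run, mapped to the number of failed runs."""
    return {t: RUNS_PER_TASK - p for t, p in passes.items() if p < RUNS_PER_TASK}


for name, passes in [("Haiku 4.5", haiku), ("Devstral 2", devstral)]:
    got, total = headline(passes)
    print(f"{name}: {got}/{total}, failed runs by task: {failures(passes)}")
```

Both models print 27/30, but the failure maps differ: Haiku's three failed runs all sit on task_09, while Devstral's are split between task_08 and task_09.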


Why Haiku fails task_09 categorically

[Observed — Claude Haiku 4.5 task_09 transcripts, verified: task_09_transcripts, tool_call_redundancy.md]

The Haiku task_09 failures are not bad-luck misses. All three runs follow the same pattern: the model reads the CSV, recognizes the data constraint (three rows, ten-day window), writes output using the strict NaN interpretation, then re-reads the same file. The loop repeats until the turn budget expires. The harness labels all three runs wrong_answer: either the output did not acknowledge the impossibility, or the model exhausted the turn budget before producing a compliant file.
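The read, write, re-read loop described above can be counted mechanically from a tool-call log. The sketch below assumes a hypothetical transcript format (a list of dicts with `tool` and `path` keys, and tool names `read_file` and `write_file`); the real harness transcripts may be structured differently, and the file names are made up for illustration.

```python
def count_redundant_reads(calls):
    """Count reads of a file that was already read and not modified since.

    `calls` is a list of {"tool": ..., "path": ...} dicts (hypothetical format).
    """
    fresh = set()  # paths read since their last write
    redundant = 0
    for call in calls:
        tool, path = call["tool"], call["path"]
        if tool == "read_file":
            if path in fresh:
                redundant += 1  # nothing changed since the previous read
            fresh.add(path)
        elif tool == "write_file":
            fresh.discard(path)  # a write invalidates the earlier read
    return redundant


# Hypothetical log shaped like the loop: read input, write output, re-read input.
example_calls = [
    {"tool": "read_file", "path": "data.csv"},
    {"tool": "write_file", "path": "out.csv"},
    {"tool": "read_file", "path": "data.csv"},
    {"tool": "write_file", "path": "out.csv"},
    {"tool": "read_file", "path": "data.csv"},
]

print(count_redundant_reads(example_calls))
```

In this example the input file is never written to, so the second and third reads both count as redundant.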

The failure pattern is identical to Claude Sonnet 4.6’s failed task_09 runs. The Haiku article states this directly: “The task_09 failure profile appears consistent across the Claude 4.x family at the harness level.” This is not a model-tier issue. Both Haiku and Sonnet fail task_09 the same way.


Why the 28/30 vs 27/30 gap is narrower than it looks

[Observed — Claude Sonnet 4.6 task_09 transcripts, verified: task_09_transcripts]

Claude Sonnet 4.6 scored 28/30 — one point above Haiku’s 27/30. That one point is task_09.

Sonnet’s run 1 passed. But not because Sonnet solved the problem. Pandas was not installed in the task environment during that run. The model fell back to plain Python and, without the NaN behavior pandas would have produced, wrote output that included a documented acknowledgment of the data limitation. The checker accepted it. Runs 2 and 3 had pandas available. Both failed with the same pattern as Haiku’s failures.

The Haiku article summarizes this directly: “Strip task_09 from the suite and both models score 27/27. The flagship’s cost premium purchases one extra point on a task neither model consistently wins.”

The one-point gap between 28/30 and 27/30 is not evidence of a capability difference on task_09. It is evidence of a single environmental accident on a single run. A buyer comparing these two models on the headline score is seeing resolution that isn’t there.
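The arithmetic behind "strip task_09 and both score 27/27" can be checked directly. This is a sketch using the pass counts stated in this article; the `score` helper and its `exclude` parameter are illustrative names, not part of the harness.

```python
RUNS_PER_TASK = 3


def score(passes, exclude=()):
    """Return (passes, total runs), optionally leaving some tasks out."""
    kept = {t: p for t, p in passes.items() if t not in exclude}
    return sum(kept.values()), RUNS_PER_TASK * len(kept)


# Passes per task as reported: both models are 3/3 everywhere except task_09.
sonnet = {f"task_{i:02d}": 3 for i in range(1, 11)}
sonnet["task_09"] = 1  # run 1 passed (pandas missing), runs 2 and 3 failed

haiku = {f"task_{i:02d}": 3 for i in range(1, 11)}
haiku["task_09"] = 0

for name, passes in [("Sonnet 4.6", sonnet), ("Haiku 4.5", haiku)]:
    full = score(passes)
    stripped = score(passes, exclude={"task_09"})
    print(f"{name}: {full[0]}/{full[1]} overall, "
          f"{stripped[0]}/{stripped[1]} without task_09")
```

With task_09 excluded, the two models are indistinguishable on this suite, so the entire headline gap rests on a single run.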


Mistral Small 4’s task_09 failure is different in kind

[Observed — Mistral Small 4 task_09 run evidence, verified: pass_rate_by_task.csv, task_09_transcripts]

Mistral Small 4 scored 29/30. One failed run: task_09, run 2. The failure mode: redundant reads consumed enough turns that the model committed to a wrong answer before the impossibility gate fired (wrong_answer).

Runs 1 and 3 passed task_09. Both correct: partial averages with an explicit written acknowledgment of the data constraint.

This is categorically different from Haiku 4.5’s 0/3. In three runs under normal conditions, Haiku never produced a compliant task_09 output. Mistral Small 4 failed once because a specific run-level resource pattern (redundant file reads eating into the turn budget) pushed it past the decision point. The underlying strategy (compute what you can, document why the full window doesn’t work) was present in 2 of 3 runs.

[Speculation] Mistral Small 4’s 2/3 on task_09, the only blemish in its 29/30, is a better representation of its actual task_09 capability than the 0/3 is of Haiku’s. A 0/3 across three independent runs with the same failure mode says something about systematic behavior. A single miss alongside two passes says something about edge-case robustness.
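The distinction between a categorical and a partial failure can be written down as a simple rule. The sketch below is an assumption-laden illustration: the labels borrow this article's wording, the thresholds (0 passes versus anything in between) are a choice rather than part of the harness, and three runs is a small sample.

```python
def classify(passes, runs=3):
    """Label a per-task result using the article's vocabulary (illustrative rule)."""
    if passes == runs:
        return "clean"
    if passes == 0:
        return "categorical"  # failed every run
    return "partial"          # failed some runs, passed others


# task_09 passes out of 3, as reported in this article.
task_09_passes = {
    "Claude Haiku 4.5": 0,
    "Claude Sonnet 4.6": 1,
    "Devstral 2 123B": 1,
    "Mistral Small 4": 2,
}

profiles = {model: classify(p) for model, p in task_09_passes.items()}
for model, label in profiles.items():
    print(f"{model}: {task_09_passes[model]}/3 -> {label}")
```

A rule like this only sees counts. It labels Sonnet's 1/3 as partial even though the transcripts show the single pass came from a missing pandas install, which is why the counts need to be read alongside the run evidence.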


What we don’t know yet

[Speculation]

The task_09 pattern across the Claude 4.x family is consistent enough to look systematic. We don’t know whether it’s fixable with prompting or requires retraining. The harness doesn’t test prompt variations, so we can’t distinguish “needs different instructions” from “needs different training data.”

We also predicted going in that models at the same score tier would have similar failure profiles. They don’t. Haiku’s categorical 0/3 failure and Devstral’s distributed partial failures both land at 27/30, but they represent different underlying capability gaps. Whether a future training run that fixed Haiku’s task_09 loop would also improve Devstral’s task_08 reliability is outside what this dataset can answer.

The headline score is a useful first filter. The per-task breakdown is a more honest second filter. What neither reveals is whether the failures are systematic across models from the same lab or idiosyncratic to individual training runs. For that, we’d need more campaigns from the same base model with different post-training.


How to use this

[Observed]

The per-task breakdown is in each model’s campaign article. The methodology article explains what each task tests. The full task_09 picture is in “35 models. One solved it.”

If you are choosing between models at similar headline scores, the breakdown is the right place to start. A 27/30 from categorical failure on one task tells you something different than a 27/30 from partial failures on two tasks. Whether that difference matters depends on what you are building.

The headline score is a starting point. Treat it as one.


Frequently Asked Questions

What does an agentic-core-v1 score of 27/30 mean?

The agentic-core-v1 benchmark has 10 software engineering task types, 3 runs each, 30 total. A score of 27/30 means three runs failed somewhere across those 10 task types. Critically, two models can both score 27/30 with entirely different failure profiles: one might have a categorical 0/3 on a single task (like task_09 for Claude Haiku 4.5) while another passes only 2/3 and 1/3 on two different tasks (like Devstral 2 on task_08 and task_09). The headline score is a first filter; the per-task breakdown reveals the actual capability gap.

What is task_09 on agentic-core-v1?

task_09 is “know when to stop”: a multi-turn scenario where the model must recognise that completing the assigned task is impossible given a data constraint, document the reason, and stop rather than produce incorrect output. It tests the agent’s ability to abandon a plan mid-execution when evidence accumulates that the plan cannot succeed. As of the most recent modelbattles campaigns, 34 of 35 models did not pass task_09 reliably; only one solved it.

Why do same-score models have different per-task breakdowns?

Headline scores aggregate across tasks and runs. A categorical failure on one task (0/3, three failed runs) and partial failures on two tasks (2/3 and 1/3, also three failed runs) produce the same total. The underlying capabilities they represent are different: a categorical failure suggests a consistent reasoning pattern that blocks the task entirely, while distributed partial failures suggest stochastic errors that might be mitigated with prompt variation or temperature tuning.
