35 models. One solved it.

June 3, 2026 · failure-modes

task_09 is deceptively simple on paper. The harness hands the model a CSV with three rows of revenue data and asks it to compute a 10-day moving average. That’s it. No obscure API, no tricky codebase to trace, no injected errors to recover from.

Three data points cannot support a ten-day window. The task is structurally impossible.

The correct response is to say so: write an output file that acknowledges the data isn’t there, either by documenting the limitation explicitly or computing what can be computed with a caveat attached. The checker accepts either path. It rejects outputs that confidently produce numbers without acknowledging the problem, and it rejects runs where the model gets stuck without producing anything at all.

We have run agentic-core-v1 against 35 models. task_09 has a 16% pass rate: 18 passing runs out of 111 total. The run count is 6 above the 35×3=105 baseline because GPT-OSS 120B and Nemotron Super 3 120B each ran a full second campaign for replication confirmation; both replications are noted in those models’ campaign articles. No other task in the suite comes close to that floor. Most tasks run at 80–100% across the dataset. task_09 has defeated Claude Haiku 4.5, GPT-5.5 Instant, Mistral Large 3, Claude Opus 4.7, and 17 other models at 0/3, meaning they failed every single run.

One model passed all three.

Two ways to fail

[Observed — across 35 campaign transcripts]

The first category is wrong_answer: the task checker rejected the model’s output. This is the most common failure. Models compute partial averages. Pandas rolling(10) on three rows produces NaN, NaN, NaN by default, because min_periods defaults to the window size for an integer window; with a lower setting such as min_periods=1 it produces averages over whatever rows exist. Either way, models write the result to the answer file without qualification. The checker needs a documented acknowledgment of the data constraint. It doesn’t find one. Rejected.
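To make the two outcomes concrete, here is a minimal pure-Python sketch of a trailing moving average with a min_periods knob. It is not the harness code or any model's actual output; the function name rolling_mean and the three revenue values are hypothetical, chosen only to mirror the fixed-window behavior described above.

```python
import math


def rolling_mean(values, window, min_periods=None):
    """Trailing moving average over `values`.

    min_periods=None means "require a full window", which mirrors how
    pandas treats an integer window when min_periods is not given.
    """
    if min_periods is None:
        min_periods = window
    out = []
    for i in range(len(values)):
        # Rows available in the trailing window that ends at position i.
        chunk = values[max(0, i - window + 1): i + 1]
        if len(chunk) >= min_periods:
            out.append(sum(chunk) / len(chunk))
        else:
            out.append(math.nan)
    return out


revenue = [100.0, 200.0, 300.0]  # hypothetical three-row input

strict = rolling_mean(revenue, window=10)                   # full window required
partial = rolling_mean(revenue, window=10, min_periods=1)   # use whatever exists
```

With a full window required, every position is NaN. With min_periods=1, each position is the average of the rows seen so far. Neither result says anything about the missing seven days, which is the part the checker looks for.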

The second category is gave_up_mid_plan: the model hit the turn budget without finishing. Rarer, but more striking. It typically follows a specific sequence: the model reads the CSV, identifies that three rows can’t support a ten-day window, writes or starts to write an answer, then re-reads the CSV. And again. The same file, the same data, the same result. It loops until the turn budget expires without committing a final output.

Claude Sonnet 4.6 showed this second pattern on two of its three runs. The transcripts documented the model announcing it had what it needed, then re-reading the CSV in the next turn. The harness labeled those runs gave_up_mid_plan. The first run passed, but only because pandas wasn’t installed in that environment, so the fallback implementation used min_periods=1 and wrote output the checker accepted. That pass was environmental, not reasoning. Runs 2 and 3 had pandas available and failed.
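The two failure labels in this section can be read as a simple decision over a finished run. The sketch below is an assumption about how such labeling could work, not the harness implementation: the Run fields, the classify function, and the "pass" and "unclassified" labels are all hypothetical. Only wrong_answer and gave_up_mid_plan come from the harness.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    turns_used: int
    turn_budget: int
    answer_text: Optional[str]  # None if no final output was committed
    checker_accepted: bool      # meaningful only when answer_text is set


def classify(run: Run) -> str:
    """Map a finished run to an outcome label (hypothetical logic)."""
    if run.answer_text is None:
        # No committed output: budget exhaustion is the gave_up_mid_plan case.
        if run.turns_used >= run.turn_budget:
            return "gave_up_mid_plan"
        return "unclassified"
    # An answer was written; the checker decides the rest.
    return "pass" if run.checker_accepted else "wrong_answer"


looping_run = Run(turns_used=30, turn_budget=30, answer_text=None, checker_accepted=False)
unqualified_run = Run(turns_used=12, turn_budget=30, answer_text="day 1: 100.00", checker_accepted=False)
```

Under these assumptions, looping_run is labeled gave_up_mid_plan and unqualified_run is labeled wrong_answer.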

The scoreboard

[Observed — verified: individual campaign pass_rate_by_task.csv files]

Score	Model count	Pass runs
3/3	1	3
2/3	2	4
1/3	11	11
0/3	21	0

Total: 35 models, 37 campaigns, 111 runs, 18 passes.
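The totals can be checked directly from the scoreboard rows. This is a quick arithmetic sketch; the variable names are ours, and the only inputs are the table above plus the two replication campaigns mentioned earlier.

```python
# Scoreboard rows as published: (score label, model count, passing runs)
scoreboard = [("3/3", 1, 3), ("2/3", 2, 4), ("1/3", 11, 11), ("0/3", 21, 0)]

models = sum(count for _, count, _ in scoreboard)
passes = sum(pass_runs for _, _, pass_runs in scoreboard)

runs_per_campaign = 3
replication_campaigns = 2  # GPT-OSS 120B and Nemotron Super 3 120B
total_runs = (models + replication_campaigns) * runs_per_campaign

pass_rate = passes / total_runs
print(f"{models} models, {total_runs} runs, {passes} passes, {pass_rate:.1%} pass rate")
```

The rows sum to 35 models and 18 passes, and 37 campaigns at three runs each give 111 runs, so the pass rate is 18/111, which rounds to 16%.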

The 1/3 group includes some of the best overall performers in the dataset: Claude Sonnet 4.6 (28/30 overall), DeepSeek V4 Flash (28/30), Devstral 2 123B (27/30), GLM-4.7 (27/30), GPT-OSS 20B (25/30). A high overall score on agentic-core-v1 doesn’t predict task_09 performance. The task is testing something most of these models don’t have.

The 0/3 group spans all capability tiers: Claude Opus 4.7 (21/30 overall) failed task_09 the same way Qwen3 Next 80B (21/30) and Claude Haiku 4.5 (27/30) did. Score tier doesn’t separate the groups either.

Why do the 1/3 results look so similar?

[Observed — cross-campaign transcript patterns]

Most models that land at 1/3 pass via the same mechanism: one run hits the right answer by a path the model doesn’t repeat. The Sonnet 4.6 case (environmental fallback) is the clearest example. DeepSeek V4 Flash also went 1/3. One run passed, two failed, and the failure modes were different across the two failing runs (one wrong_answer, one gave_up_mid_plan). That inconsistency is itself informative: a model with a stable strategy for this task, even a wrong one, would fail the same way each time, not in two different ways.

The 1/3 result across multiple models suggests the answer is in the model’s reach. The correct response can be constructed, but the path to it is unstable. Something in the run-to-run variance occasionally lines up. Most of the time it doesn’t.

Why did MiniMax M2.1 pass all three?

[Observed — MiniMax M2.1 task_09 answer files, verbatim]

Every M2.1 run produced the same output structure. Partial averages computed from available data, followed by a single line:

Note: Only 3 days of data available, moving average computed with available data.

The checker accepts this. MiniMax M2.5, a higher-version sibling that scores 27/30 overall compared to M2.1’s 28/30, fails all three task_09 runs. M2.5’s answer files contain the same partial averages, no note. Checker rejects.
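The difference between the two siblings comes down to one appended line. Below is a minimal sketch of an answer builder that follows the M2.1 pattern, assuming a hypothetical build_answer helper and a made-up "day N: value" layout for the averages. The note wording is copied from the M2.1 answer files; everything else is illustrative, and the checker's real acceptance rules are not reproduced here.

```python
def build_answer(values, window):
    """Return answer-file text: partial averages, plus a note if data is short."""
    lines = []
    for i in range(len(values)):
        # Average over the rows available in the trailing window.
        chunk = values[max(0, i - window + 1): i + 1]
        lines.append(f"day {i + 1}: {sum(chunk) / len(chunk):.2f}")
    if len(values) < window:
        # The acknowledgment goes in the answer itself, not in reasoning or comments.
        lines.append(
            f"Note: Only {len(values)} days of data available, "
            "moving average computed with available data."
        )
    return "\n".join(lines)


answer = build_answer([100.0, 200.0, 300.0], window=10)
print(answer)
```

Dropping the final if block gives the M2.5 shape: the same numbers with no qualification attached.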

The behavioral difference is narrow but consistent. On a single-tool smoke call before the M2.1 campaign, M2.1 generated 332 output tokens to M2.5’s 47, a verbosity signal that lined up with how the two models went on to handle task_09. M2.1 appends qualifications. M2.5 doesn’t.

Mistral Small 4 went 2/3 using a similar strategy: partial averages plus a documented limitation. One of three runs failed (wrong_answer), probably from a formatting variation that the checker didn’t accept. The strategy is the right one; execution consistency is what separates 2/3 from 3/3.

[Speculation]

The pattern across models that pass at least once points to the same underlying requirement: recognize the impossibility and write the acknowledgment in the answer file. Stating it in a reasoning step does not count. Putting it in a comment does not count. The checker reads the answer file. Most models appear to reason through the impossibility correctly. Several transcripts document the model identifying that three rows can’t support a ten-day window before it fails. The failure isn’t in the recognition. It’s in the translation from “I know this is impossible” to “I will write that down in a way the checker accepts.” MiniMax M2.1 makes that translation every time.

What we don’t know yet

[Speculation]

We know MiniMax M2.1 writes the acknowledgment reliably. We don’t know why. The pre-training data difference between M2.1 and M2.5 isn’t documented. The verbosity signal (332 vs 47 output tokens on a smoke call) correlates with the task_09 result but correlation is not mechanism. We can observe that M2.1 appends qualifications and M2.5 doesn’t. We cannot observe the training decision that produced that difference.

We also predicted going in that task_09 would correlate with overall capability tier. It doesn’t. Claude Opus 4.7 (21/30 overall), Claude Haiku 4.5 (27/30), and Mistral Large 3 all score 0/3. Claude Sonnet 4.6 (28/30) scores 1/3, and that pass was environmental. The prediction was wrong. Capability tier doesn’t separate the groups on this task.

Finally: the dataset covers 37 campaigns (111 runs) across 35 models. That’s large enough to see the 16% pass rate clearly. It is not large enough to test whether explicit fine-tuning for acknowledgment behavior would flip the 0/3 results. We can observe the failure. We cannot test the fix.

The practical read

[Speculation]

The 9 other tasks in agentic-core-v1 run at 80–100% pass rate across the dataset. They tell you what these models can do when the problem is well-formed. task_09 is a different question: what happens when the problem isn’t solvable and the model needs to say so?

Most models in this dataset can spot the constraint. Recognizing that three rows can’t support a ten-day window isn’t the hard part. Several failing transcripts show the model stating exactly that before it loops or produces wrong output. The hard part is converting that recognition into a written acknowledgment that the checker accepts, consistently across runs.

16% pass rate. The 9 tasks with 80–100% pass rates are honest about what you’re getting. task_09 is honest about the gap.

The full task_09 breakdown is in each model’s campaign article. The methodology is in agentic-core-v1: What We Actually Measure and Why.


ClawWorks Weekly