The legacy format that couldn't shortcut became the thing that made it work.
Campaign: 2026-05-23-magistral-small-2509-agentic-core-v1
Model: Mistral Magistral Small 2509 (mistral.magistral-small-2509, AWS Bedrock, us-east-1, ON_DEMAND)
Architecture: Reasoning-format model
Harness: agentic-core-v1 (10 tasks × 3 runs = 30 total)
Campaign date: 2026-05-23
Before Magistral Small 2509 could run on agentic-core-v1, it needed a custom adapter. Bedrock’s standard toolUse blocks don’t work with Magistral Small — the model emits tool invocations as [TOOL_CALLS]funcName{json} text tokens in the output stream. The harness normally reads native toolUse; for this model, a dedicated parser (BedrockAdapter._parse_magistral_tool_calls()) was built and committed before the campaign started.
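To make the text format concrete, here is a minimal sketch of how [TOOL_CALLS]funcName{json} tokens could be pulled out of an output stream. It is not the committed BedrockAdapter._parse_magistral_tool_calls() implementation, which we are not reproducing here; the function name parse_tool_calls, the regular expression, and the choice to skip malformed calls are all assumptions made for illustration.

```python
import json
import re

# Hypothetical sketch, not the harness's actual parser.
# Matches the marker plus a function name, stopping just before the "{".
_MARKER = re.compile(r"\[TOOL_CALLS\]\s*([A-Za-z_][A-Za-z0-9_]*)\s*(?=\{)")


def parse_tool_calls(text):
    """Return a list of (name, arguments) pairs found in model output text."""
    decoder = json.JSONDecoder()
    calls = []
    pos = 0
    while True:
        match = _MARKER.search(text, pos)
        if match is None:
            break
        try:
            # raw_decode parses one JSON value starting at the given index and
            # returns (value, end_index), so nested braces are handled.
            args, end = decoder.raw_decode(text, match.end())
        except json.JSONDecodeError:
            # Assumption: malformed arguments are skipped, not raised.
            pos = match.end()
            continue
        if isinstance(args, dict):
            calls.append((match.group(1), args))
        pos = end
    return calls


print(parse_tool_calls('[TOOL_CALLS]fs_write{"path": "steps/step1.txt", "content": "ready"}'))
```

Using json.JSONDecoder.raw_decode rather than a brace-matching regex keeps the sketch correct when argument values themselves contain braces or nested objects.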
The campaign ran anyway because the open question was whether Magistral Small — Mistral’s first reasoning model in this dataset — would show the same regression pattern that destroyed Kimi K2 Thinking (12/30, 40%). Reasoning models have a specific failure mode on agentic-core-v1: the reasoning trace satisfies the planning goal internally, so the model terminates without dispatching the tool calls that would do the actual work. Kimi K2 Thinking hit this on task_07 (multi-step sequential writes): it planned four steps, then stopped. Zero files written.
Magistral Small 2509 scored 23/30 (76.67%). It passed task_07 3/3.
The hypothesis for why: the [TOOL_CALLS] text format, which required that adapter to run at all, may be the structural reason task_07 succeeded.
What agentic-core-v1 asks
[Observed — harness spec]
Ten tasks, three independent runs each. Software engineering work: fix a failing test, refactor duplicated code, investigate an access log, trace execution through a codebase, apply a targeted minimal fix, handle an ambiguous requirement, execute a multi-step sequential plan, recover from an injected tool error, recognize an impossible computation, run a SQL investigation.
Two tasks carry special weight for this campaign.
task_07 asks the model to create four files (step1.txt through step4.txt) under steps/, using only fs_write, in order. No reading existing code, no error recovery. The task prompt specifies exactly what to do. Completing it requires following directions, not reasoning.
task_09 presents a seven-row dataset and asks for a ten-day moving average. The correct answer is to refuse, because seven data points cannot support a ten-day window. Passes on it are rare across the dataset (Claude Sonnet 4.6 managed 1/3).
A pass is correct task completion. Failure modes: wrong_answer, gave_up_mid_plan, tool_call_hallucinated, tool_call_malformed.
What happened
[Observed — data pack per_task_results, verified: pass_rate_by_task.sql]
| Task | Score | Avg latency | Notes |
|---|---|---|---|
| task_01 fix_failing_test | 2/3 | 4.84s | 1 wrong_answer (run3 low token count) |
| task_02 refactor_duplicated_code | 3/3 | 4.56s | Clean |
| task_03 investigate_log | 3/3 | 3.56s | 38,592 input tokens total; 32% of cost |
| task_04 trace_through_codebase | 3/3 | 7.87s | |
| task_05 minimal_fix | 3/3 | 5.66s | |
| task_06 handle_ambiguous_requirement | 3/3 | 6.29s | |
| task_07 multi_step_plan | 3/3 | 2.45s | 4 tool calls per run, zero variation |
| task_08 recover_from_tool_error | 0/3 | 1.81s | All wrong_answer |
| task_09 know_when_to_stop | 0/3 | 5.86s | Expected |
| task_10 sql_investigation | 3/3 | 5.14s | |
Total: 23/30. $0.062 campaign cost. $0.0027/pass.
Failure mode histogram (verified: failure_mode_histogram.sql): wrong_answer 7, tool_call_hallucinated 0, gave_up_mid_plan 0, tool_call_malformed 0, infrastructure_error 0.
Seven wrong_answer failures across three task slots. Zero format failures across all 30 runs. The adapter held.
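The headline numbers can be re-derived from the per-task rows. The sketch below copies the pass counts and the reported campaign cost from the table and summary line; the variable names are ours, not the data pack's.

```python
# Per-task pass counts copied from the results table (3 runs per task).
passes = {
    "task_01": 2, "task_02": 3, "task_03": 3, "task_04": 3, "task_05": 3,
    "task_06": 3, "task_07": 3, "task_08": 0, "task_09": 0, "task_10": 3,
}
RUNS_PER_TASK = 3
CAMPAIGN_COST_USD = 0.062  # reported total

total_runs = RUNS_PER_TASK * len(passes)
total_passes = sum(passes.values())
failures = total_runs - total_passes
pass_rate = 100 * total_passes / total_runs
cost_per_pass = CAMPAIGN_COST_USD / total_passes

print(f"{total_passes}/{total_runs} passed ({pass_rate:.2f}%), {failures} failures")
print(f"${cost_per_pass:.4f} per pass")
```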
task_07: the [TOOL_CALLS] format as a structural constraint
[Observed — transcripts 79b5f116, 06328a93, 1067cf8b; verified: tool_calls_by_task.sql]
All three runs passed task_07. Tool call count per run: 4/4/4. Files produced: 4/4/4. Average latency: 2.45 seconds. No variation between runs.
Here is run 1 (transcript 79b5f116), turns 1 through 4 verbatim:
Turn 1: [TOOL_CALLS]fs_write{"path": "steps/step1.txt", "content": "ready"}
Turn 2: [TOOL_CALLS]fs_write{"path": "steps/step2.txt", "content": "processing"}
Turn 3: [TOOL_CALLS]fs_write{"path": "steps/step3.txt", "content": "verified"}
Turn 4: [TOOL_CALLS]fs_write{"path": "steps/step4.txt", "content": "done"}
The model read the instructions, dispatched four sequential writes, finished. No reasoning preamble before the first write. No plan emitted in text. Just the execution.
[Speculation]
The contrast with Kimi K2 Thinking is the finding. Kimi K2 Thinking uses native Bedrock toolUse blocks, where reasoning content and tool invocations are structurally separate channels. A reasoning trace can run to completion in one channel while the tool invocation channel stays empty. Magistral Small uses [TOOL_CALLS] text tokens: to invoke a tool, the model must emit the invocation as text output in the same buffer as any reasoning. There is no separate channel. You cannot plan without producing the plan in a form the parser can act on.
Whether this is the causal mechanism or a correlated difference is untestable with one data point. But the prediction that Magistral Small would show reasoning regression on task_07 was wrong, and the [TOOL_CALLS] format is the most plausible structural explanation for why.
The irony: the adapter we had to build to make this model run at all may be the reason it outperformed the reasoning-regression prior.
task_08: counting is harder than it looks
[Observed — work dirs task_08_run1/2/3; verified: failure_mode_histogram.sql]
task_08 scored 0/3. All wrong_answer. The task: read data.txt, count its character length, write the count to length.txt. Tool calls were dispatched on every run (2, 2, and 3 calls across the three runs). The model read the file, computed a length, wrote the result. Wrong every time.
Values written across the three runs: 27, 23, 31. Correct answer: 35 (confirmed: wc -c data.txt).
The variation is informative. 27, 23, 31 — three different wrong numbers. The model is not locked into one wrong strategy. It is computing something each time; just not the character count. The brief suggests the model may be counting non-whitespace characters, word count, or byte length by a different encoding — the numbers are consistent with several interpretations.
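To show why one file can yield several defensible "lengths", here is a small sketch that computes the candidate interpretations listed above. The sample string is hypothetical and is not the contents of data.txt, which we have not reproduced; the function name count_candidates is ours. Note also that wc -c reports bytes, which equals the character count only when the file is single-byte text.

```python
def count_candidates(text):
    """Several plausible readings of "length" for the same text."""
    return {
        "characters": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "non_whitespace": sum(1 for ch in text if not ch.isspace()),
        "words": len(text.split()),
        "without_trailing_newline": len(text.rstrip("\n")),
    }


# Hypothetical sample, NOT the real data.txt.
sample = "caf\u00e9 au lait\n"
for name, value in count_candidates(sample).items():
    print(f"{name}: {value}")
```

Running the same function over the real data.txt and comparing against 27, 23, and 31 would be one way to narrow down which interpretation, if any, the model applied.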
[Unobserved]
We did not see the transcripts for task_08 runs 1-3 (transcript_id fields empty in the data pack). The character-counting method Magistral Small used is not confirmed from source. The “27/23/31 vs 35” observation is from the work directory files, not from direct transcript inspection.
task_09: reasoning did not help here
[Observed — data pack task_09_results]
task_09 scored 0/3. The task asks the model to compute a 10-day moving average from seven data points and output the result. Correct behaviour: refuse, because seven points cannot support a 10-day window.
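The feasibility check the task is probing for is a one-line guard. The sketch below is our illustration, not harness code: the function name moving_average and the seven sample values are hypothetical.

```python
def moving_average(values, window):
    """Simple moving average that refuses windows the data cannot support."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        raise ValueError(f"need at least {window} points, got {len(values)}")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical seven-row dataset, not the task's actual values.
seven_points = [10, 12, 11, 13, 15, 14, 16]
try:
    moving_average(seven_points, window=10)
except ValueError as exc:
    print(f"refused: {exc}")
```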
The average cost per task_09 run was $0.0015 — the highest individual cost in the campaign alongside task_03. Average output tokens: 489. The model generated verbose wrong answers, not terse refusals.
Kimi K2 Thinking scored 1/3 on task_09 — the only reasoning model to have done so. Magistral Small scored 0/3. The reasoning-format advantage Kimi K2 Thinking showed on task_09 did not transfer.
[Speculation]
task_09 requires the model to recognize an impossible request and refuse rather than comply. That is a different kind of reasoning than sequential execution — it requires metacognition about task feasibility. The [TOOL_CALLS] format may not help here because there is no execution to force. The model has to arrive at the right refusal through inference, and on this specific framing it does not.
task_06: ambiguity handling above prediction
[Observed — data pack task_06_results]
task_06 scored 3/3. The task presents an ambiguous engineering request and asks the model to produce a clarification response rather than a direct implementation. Pre-run prediction had this at roughly 2/3.
Three clean passes, average latency 6.29 seconds. The model identified the ambiguity, asked the right question, and did not charge ahead with an assumption.
This matches Ministral 14B’s result on the same task (3/3). Whether Mistral’s post-training emphasizes ambiguity-handling, or whether the specific prompt is especially legible, we cannot determine from two data points in the same family.
The failure profile
[Observed — data pack failure_mode_histogram, verified: failure_mode_histogram.sql]
Seven failures, all wrong_answer. Zero format failures. Zero infrastructure errors. Zero gave_up_mid_plan.
The cross_task_consistency evidence bundle flags one run (b836cb19, task_01 run3) because wrong_answer appeared across three task_ids in the same campaign: task_01, task_08, and task_09. That is structural: these three are the consistently hard tasks. The flag is not a model-level alarm; it is a confirmation that failures cluster where expected.
For a reasoning-format model on agentic workloads, a clean format-failure profile is notable. The [TOOL_CALLS] adapter introduced zero parse failures, and the model never stalled or refused a request it should have handled.
Cost
[Observed — data pack summary, verified: cost_breakdown.sql]
$0.062 total. $0.0027 per passing task. Pricing: $0.50 input / $1.50 output per 1M tokens, the same rate as Mistral Large 3 675B.
task_03 (log investigation) cost $0.020 of the $0.062 total (32%), driven by 38,592 input tokens reading the large access.log. This is consistent with every prior log-investigation campaign.
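As a sanity check on the task_03 figure, the input-token share can be computed from the stated price. This is a sketch using only the numbers quoted above; the gap between the result and the reported $0.020 is presumably output tokens, which is our assumption since the per-task output count is not given here.

```python
INPUT_PRICE_PER_M = 0.50      # USD per 1M input tokens, as stated
TASK03_INPUT_TOKENS = 38_592  # total across the three runs
CAMPAIGN_COST_USD = 0.062

input_cost = TASK03_INPUT_TOKENS * INPUT_PRICE_PER_M / 1_000_000
share = input_cost / CAMPAIGN_COST_USD

print(f"task_03 input cost: ${input_cost:.4f}")
print(f"share of campaign cost from task_03 input alone: {share:.1%}")
```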
Nearest comparators:
| Model | Score | $/pass | Architecture |
|---|---|---|---|
| Mistral Large 3 | 27/30 | $0.00213 | Dense 675B |
| Magistral Small 2509 | 23/30 | $0.00270 | Reasoning (small) |
| Ministral 14B | 23/30 | $0.00103 | Dense 14B |
Magistral Small costs 2.6x as much per pass as Ministral 14B for the same verified score. It also costs 27% more per pass than Mistral Large 3 while scoring 4 fewer passes.
The cost premium is driven by output pricing ($1.50/1M) combined with reasoning model verbosity — longer outputs even on tasks where Ministral 14B is terse.
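The two comparisons above follow directly from the $/pass column. A short sketch of the arithmetic, with values copied from the comparator table:

```python
# $/pass values copied from the comparator table.
cost_per_pass = {
    "Mistral Large 3": 0.00213,
    "Magistral Small 2509": 0.00270,
    "Ministral 14B": 0.00103,
}

magistral = cost_per_pass["Magistral Small 2509"]
ratio_vs_ministral = magistral / cost_per_pass["Ministral 14B"]
premium_vs_large_pct = (magistral / cost_per_pass["Mistral Large 3"] - 1) * 100

print(f"vs Ministral 14B: {ratio_vs_ministral:.1f}x")
print(f"vs Mistral Large 3: +{premium_vs_large_pct:.0f}%")
```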
Predictions
[Observed — brief predictions section]
| Prediction | Claim | Result |
|---|---|---|
| P1 | Score ≤ 23/30 (reasoning regression vs Ministral 14B) | CORRECT — scored exactly 23/30 |
| P2 | task_07 ≤ 1/3 (reasoning suppression) | WRONG — 3/3 |
| P3 | ≥1 [TOOL_CALLS] format failure | WRONG — 0 format failures across 30 runs |
1/3 correct. P1 landed at the bound exactly — technically correct, misleading as a prediction. The headline miss is P2: the reasoning-regression prior was wrong, and the [TOOL_CALLS] format is the most coherent explanation for why. P3 was the safety hedge on the adapter work; it did not materialise.
The underprediction of 6 points from the headline estimate (17/30 predicted, 23/30 actual) flows from the same source as P2. If task_07 had failed as expected (0/3), the score would have been 20/30, closer to the prediction. task_07 passing 3/3 is where the gap opened.
Leaderboard
[Observed — cross-campaign data]
| Model | Score | $/pass | Lab |
|---|---|---|---|
| Claude Sonnet 4.6 | 28/30 | $0.0514 | Anthropic |
| GLM-4.7 | 28/30 | $0.0038 | Zhipu AI |
| Mistral Large 3 | 27/30 | $0.00213 | Mistral |
| Devstral 2 | 27/30 | $0.0020 | Mistral |
| GPT-OSS 20B | 25/30 | $0.00048 | OpenAI |
| GLM-4.7-Flash | 25/30 | $0.00057 | Zhipu AI |
| Kimi K2.5 | 24/30 | $0.0044 | Moonshot AI |
| Magistral Small 2509 | 23/30 | $0.00270 | Mistral |
| Ministral 14B | 23/30 | $0.00103 | Mistral |
| Qwen3 32B | 23/30 | — | Alibaba |
| GPT-OSS 120B | 23/30 | $0.0013 | OpenAI |
| Amazon Nova Pro | 20/30 | $0.0068 | Amazon |
| Llama 3.3 70B | 14/30 | $0.0047 | Meta |
| Nemotron Super 3 120B | 12/30 | $0.0016 | NVIDIA |
| Kimi K2 Thinking | 12/30 | $0.00793 | Moonshot AI |
| Jamba 1.5 Large | 8/30 | $0.0044 | AI21 |
Magistral Small enters the 23/30 cluster alongside Ministral 14B, Qwen3 32B, and GPT-OSS 120B. Among the models in that group with a reported $/pass, it is the most expensive. It is also the only reasoning-format model in the cluster.
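The cluster comparison can be reproduced by grouping leaderboard rows on score. The sketch below copies a subset of rows from the table; Qwen3 32B has no reported $/pass, so it is represented as None and left out of the cost ranking.

```python
from collections import defaultdict

# (model, passes out of 30, $/pass or None when not reported).
# A subset of leaderboard rows, copied from the table.
leaderboard = [
    ("Kimi K2.5", 24, 0.0044),
    ("Magistral Small 2509", 23, 0.00270),
    ("Ministral 14B", 23, 0.00103),
    ("Qwen3 32B", 23, None),
    ("GPT-OSS 120B", 23, 0.0013),
    ("Amazon Nova Pro", 20, 0.0068),
]

clusters = defaultdict(list)
for model, score, cost in leaderboard:
    clusters[score].append((model, cost))

# Rank only the rows that have a reported cost.
priced = [(model, cost) for model, cost in clusters[23] if cost is not None]
most_expensive = max(priced, key=lambda row: row[1])

print(f"23/30 cluster size: {len(clusters[23])}")
print(f"most expensive with reported cost: {most_expensive[0]}")
```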
The Mistral family now has three data points: Large 3 (27/30, $0.00213), Magistral Small reasoning (23/30, $0.00270), Ministral 14B non-reasoning (23/30, $0.00103). Larger gets more passes. Reasoning format does not buy extra passes over the non-reasoning Ministral 14B: it reaches the same score at higher cost, via a different path. Whether that different path generalises to tasks outside agentic-core-v1 is a question the harness cannot answer.
What we don’t know
[Speculation]
The [TOOL_CALLS] hypothesis (that text-format tool invocation prevents reasoning-trace short-circuiting) is the most plausible explanation for the task_07 result. Testing it properly would require running Magistral Small on a harness version with native toolUse, if Mistral ships that variant. If scores drop with native toolUse on task_07, the hypothesis is strengthened. If they don’t, the explanation is wrong and we have a different question to answer.
task_08’s character-count failures (27, 23, 31 against a correct 35) don’t have a confirmed mechanism. The transcript_id fields are empty in the data pack for these runs, so we cannot inspect what encoding or counting method the model used. The non-zero run-to-run variation suggests the model is not repeating a cached wrong answer — but we cannot confirm what it is computing.
Ministral 3 8B is next in the campaign queue. That will give us a same-family comparison at smaller scale: if 8B also passes task_07 cleanly, the Mistral family is consistently strong on sequential execution independent of format. If it fails, the [TOOL_CALLS] format effect becomes a sharper candidate for what Magistral Small specifically had going for it.