The legacy format that couldn't shortcut became the thing that made it work.

Campaign: 2026-05-23-magistral-small-2509-agentic-core-v1
Model: Mistral Magistral Small 2509 (mistral.magistral-small-2509, AWS Bedrock, us-east-1, ON_DEMAND)
Architecture: Reasoning-format model
Harness: agentic-core-v1 (10 tasks × 3 runs = 30 total)
Campaign date: 2026-05-23


Before Magistral Small 2509 could run on agentic-core-v1, it needed a custom adapter. Bedrock’s standard toolUse blocks don’t work with Magistral Small — the model emits tool invocations as [TOOL_CALLS]funcName{json} text tokens in the output stream. The harness normally reads native toolUse; for this model, a dedicated parser (BedrockAdapter._parse_magistral_tool_calls()) was built and committed before the campaign started.
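To make the format concrete, here is a minimal sketch of how [TOOL_CALLS]funcName{json} segments could be pulled out of a text stream. It is not the committed BedrockAdapter._parse_magistral_tool_calls() implementation, which this write-up does not reproduce; the function name parse_tool_calls and the returned dict shape are assumptions made for illustration.

```python
import json
from typing import Any

MARKER = "[TOOL_CALLS]"


def parse_tool_calls(text: str) -> list[dict[str, Any]]:
    """Hypothetical parser: collect every [TOOL_CALLS]name{json} segment."""
    decoder = json.JSONDecoder()
    calls: list[dict[str, Any]] = []
    pos = text.find(MARKER)
    while pos != -1:
        name_start = pos + len(MARKER)
        next_marker = text.find(MARKER, name_start)
        brace = text.find("{", name_start)
        # A marker with no JSON body before the next marker is skipped.
        if brace == -1 or (next_marker != -1 and brace > next_marker):
            pos = next_marker
            continue
        name = text[name_start:brace].strip()
        try:
            # raw_decode parses one JSON value and reports where it ended.
            args, end = decoder.raw_decode(text, brace)
        except json.JSONDecodeError:
            pos = next_marker  # malformed arguments: skip this segment
            continue
        calls.append({"name": name, "arguments": args})
        pos = text.find(MARKER, end)
    return calls


sample = '[TOOL_CALLS]fs_write{"path": "steps/step1.txt", "content": "ready"}'
print(parse_tool_calls(sample))
```

Using raw_decode rather than a regular expression means nested braces inside the JSON arguments are handled by the JSON parser itself. A real adapter would also need to decide how to report malformed segments; this sketch silently drops them.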

The campaign ran anyway because the open question was whether Magistral Small — Mistral’s first reasoning model in this dataset — would show the same regression pattern that destroyed Kimi K2 Thinking (12/30, 40%). Reasoning models have a specific failure mode on agentic-core-v1: the reasoning trace satisfies the planning goal internally, so the model terminates without dispatching the tool calls that would do the actual work. Kimi K2 Thinking hit this on task_07 (multi-step sequential writes): it planned four steps, then stopped. Zero files written.

Magistral Small 2509 scored 23/30 (76.67%). It passed task_07 3/3.

The hypothesis for why: the [TOOL_CALLS] text format, which required that adapter to run at all, may be the structural reason task_07 succeeded.


What agentic-core-v1 asks

[Observed — harness spec]

Ten tasks, three independent runs each. Software engineering work: fix a failing test, refactor duplicated code, investigate an access log, trace execution through a codebase, apply a targeted minimal fix, handle an ambiguous requirement, execute a multi-step sequential plan, recover from an injected tool error, recognize an impossible computation, run a SQL investigation.

Two tasks carry special weight for this campaign.

task_07 asks the model to create four files (step1.txt through step4.txt) under steps/, using only fs_write, in order. No reading existing code, no error recovery. The task prompt specifies exactly what to do. Completing it requires following directions, not reasoning.

task_09 presents a seven-row dataset and asks for a ten-day moving average. The correct answer is to refuse — seven data points cannot support a ten-day window. Only one model in the dataset has ever passed it (Claude Sonnet 4.6, 1/3).
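The feasibility check behind task_09 is small enough to sketch. The function below is illustrative only: the name moving_average and the sample values are assumptions, not the harness's grader or the task's actual dataset.

```python
def moving_average(values: list[float], window: int) -> list[float]:
    """Simple moving average that refuses when the window cannot be filled."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        # This is the task_09 situation: 7 points, 10-day window.
        raise ValueError(
            f"cannot compute a {window}-point average from {len(values)} points"
        )
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]


seven_days = [10.0, 12.0, 11.0, 13.0, 14.0, 12.0, 15.0]  # hypothetical data
try:
    moving_average(seven_days, window=10)
except ValueError as err:
    print(f"refused: {err}")
```

A passing response on task_09 amounts to taking the second branch: noticing that the window is larger than the data and saying so instead of producing a number.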

A pass is correct task completion. Failure modes: wrong_answer, gave_up_mid_plan, tool_call_hallucinated, tool_call_malformed.


What happened

[Observed — data pack per_task_results, verified: pass_rate_by_task.sql]

| Task | Score | Avg latency | Notes |
| --- | --- | --- | --- |
| task_01 fix_failing_test | 2/3 | 4.84s | 1 wrong_answer (run3 low token count) |
| task_02 refactor_duplicated_code | 3/3 | 4.56s | Clean |
| task_03 investigate_log | 3/3 | 3.56s | 38,592 input tokens total; 32% of cost |
| task_04 trace_through_codebase | 3/3 | 7.87s | |
| task_05 minimal_fix | 3/3 | 5.66s | |
| task_06 handle_ambiguous_requirement | 3/3 | 6.29s | |
| task_07 multi_step_plan | 3/3 | 2.45s | 4 tool calls per run, zero variation |
| task_08 recover_from_tool_error | 0/3 | 1.81s | All wrong_answer |
| task_09 know_when_to_stop | 0/3 | 5.86s | Expected |
| task_10 sql_investigation | 3/3 | 5.14s | |

Total: 23/30. $0.062 campaign cost. $0.0027/pass.
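The headline figures can be re-derived from the per-task scores. This is a quick arithmetic check using the numbers reported above, not a query against the data pack; the variable names are ours.

```python
# Passes per task, out of 3 runs each, as reported in the results table.
passes_by_task = {
    "task_01": 2, "task_02": 3, "task_03": 3, "task_04": 3, "task_05": 3,
    "task_06": 3, "task_07": 3, "task_08": 0, "task_09": 0, "task_10": 3,
}
RUNS_PER_TASK = 3
CAMPAIGN_COST_USD = 0.062

total_passes = sum(passes_by_task.values())
total_runs = len(passes_by_task) * RUNS_PER_TASK
pass_rate = total_passes / total_runs
cost_per_pass = CAMPAIGN_COST_USD / total_passes
failures = total_runs - total_passes

print(f"{total_passes}/{total_runs} ({pass_rate:.2%}), ${cost_per_pass:.4f}/pass")
# prints: 23/30 (76.67%), $0.0027/pass
print(f"failures: {failures}")
# prints: failures: 7
```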

Failure mode histogram (verified: failure_mode_histogram.sql): wrong_answer 7, tool_call_hallucinated 0, gave_up_mid_plan 0, tool_call_malformed 0, infrastructure_error 0.

Seven wrong_answer failures across three task slots. Zero format failures across all 30 runs. The adapter held.


task_07: the [TOOL_CALLS] format as a structural constraint

[Observed — transcripts 79b5f116, 06328a93, 1067cf8b; verified: tool_calls_by_task.sql]

All three runs passed task_07. Tool call count per run: 4/4/4. Files produced: 4/4/4. Average latency: 2.45 seconds. No variation between runs.

Here is run 1 (transcript 79b5f116), turns 1 through 4 verbatim:

Turn 1: [TOOL_CALLS]fs_write{"path": "steps/step1.txt", "content": "ready"}
Turn 2: [TOOL_CALLS]fs_write{"path": "steps/step2.txt", "content": "processing"}
Turn 3: [TOOL_CALLS]fs_write{"path": "steps/step3.txt", "content": "verified"}
Turn 4: [TOOL_CALLS]fs_write{"path": "steps/step4.txt", "content": "done"}

The model read the instructions, dispatched four sequential writes, finished. No reasoning preamble before the first write. No plan emitted in text. Just the execution.

[Speculation]

The contrast with Kimi K2 Thinking is the finding. Kimi K2 Thinking uses native Bedrock toolUse blocks, where reasoning content and tool invocations are structurally separate channels. A reasoning-trace can run to completion in one channel while the tool invocation channel stays empty. Magistral Small uses [TOOL_CALLS] text tokens: to invoke a tool, the model must emit the invocation as text output in the same buffer as any reasoning. There is no separate channel. You cannot plan without producing the plan in a form the parser can act on.

Whether this is the causal mechanism or a correlated difference is untestable with one data point. But the prediction that Magistral Small would show reasoning regression on task_07 was wrong, and the [TOOL_CALLS] format is the most plausible structural explanation for why.

The irony: the adapter we had to build to make this model run at all may be the reason it outperformed the reasoning-regression prior.


task_08: counting is harder than it looks

[Observed — work dirs task_08_run1/2/3; verified: failure_mode_histogram.sql]

task_08 scored 0/3. All wrong_answer. The task: read data.txt, count its character length, write the count to length.txt. Tool calls were dispatched on every run (2, 2, and 3 calls across the three runs). The model read the file, computed a length, and wrote the result. It was wrong every time.

Values written across the three runs: 27, 23, 31. Correct answer: 35 (confirmed: wc -c data.txt; wc -c counts bytes, which equals the character count only if the file is single-byte text).

The variation is informative. 27, 23, 31 — three different wrong numbers. The model is not locked into one wrong strategy. It is computing something each time; just not the character count. The brief suggests the model may be counting non-whitespace characters, word count, or byte length by a different encoding — the numbers are consistent with several interpretations.
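To show how those interpretations diverge, here is a small sketch that applies four counting rules to the same string. The sample text is hypothetical: we do not have the contents of data.txt in this write-up, so this does not reproduce 27, 23, or 31, and it does not claim to be what the model did.

```python
def count_variants(text: str) -> dict[str, int]:
    """Four plausible readings of 'length' for the same text."""
    return {
        "characters": len(text),                   # Unicode code points
        "utf8_bytes": len(text.encode("utf-8")),   # bytes when stored as UTF-8
        "non_whitespace": sum(1 for ch in text if not ch.isspace()),
        "words": len(text.split()),
    }


# Hypothetical sample, NOT the real data.txt: two accented letters, one
# space, and a trailing newline.
sample = "h\u00e9llo w\u00f6rld\n"
print(count_variants(sample))
# prints: {'characters': 12, 'utf8_bytes': 14, 'non_whitespace': 10, 'words': 2}
```

Even on a twelve-character string the four rules give four different numbers, so three different wrong answers across three runs is compatible with the model switching rules between runs. Without the transcripts that remains a guess.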

[Unobserved]

We did not see the transcripts for task_08 runs 1-3 (transcript_id fields empty in the data pack). The character-counting method Magistral Small used is not confirmed from source. The “27/23/31 vs 35” observation is from the work directory files, not from direct transcript inspection.


task_09: reasoning did not help here

[Observed — data pack task_09_results]

task_09 scored 0/3. The task asks the model to compute a 10-day moving average from seven data points and output the result. Correct behaviour: refuse, because seven points cannot support a 10-day window.

The average cost per task_09 run was $0.0015. (task_03 remains the costliest task, at $0.020 across its three runs.) Average output tokens: 489. The model generated verbose wrong answers, not terse refusals.

Kimi K2 Thinking scored 1/3 on task_09 — the only reasoning model to have done so. Magistral Small scored 0/3. The reasoning-format advantage Kimi K2 Thinking showed on task_09 did not transfer.

[Speculation]

task_09 requires the model to recognize an impossible request and refuse rather than comply. That is a different kind of reasoning than sequential execution — it requires metacognition about task feasibility. The [TOOL_CALLS] format may not help here because there is no execution to force. The model has to arrive at the right refusal through inference, and on this specific framing it does not.


task_06: ambiguity handling above prediction

[Observed — data pack task_06_results]

task_06 scored 3/3. The task presents an ambiguous engineering request and asks the model to produce a clarification response rather than a direct implementation. Pre-run prediction had this at roughly 2/3.

Three clean passes, average latency 6.29 seconds. The model identified the ambiguity, asked the right question, and did not charge ahead with an assumption.

This matches Ministral 14B’s result on the same task (3/3). Whether Mistral’s post-training emphasizes ambiguity-handling, or whether the specific prompt is especially legible, we cannot determine from two data points in the same family.


The failure profile

[Observed — data pack failure_mode_histogram, verified: failure_mode_histogram.sql]

Seven failures, all wrong_answer. Zero format failures. Zero infrastructure errors. Zero gave_up_mid_plan.

The cross_task_consistency evidence bundle flags one run (b836cb19, task_01 run3) as part of a pattern in which wrong_answer appeared across three task_ids in the same campaign: task_01, task_08, task_09. That is structural: these three are the consistently hard tasks. The flag is not a model-level alarm; it is a confirmation that failures cluster where expected.

For a reasoning-format model on agentic workloads, a clean format-failure profile is notable. The [TOOL_CALLS] adapter introduced zero parse failures, and the model never stalled or refused a request it should have handled.


Cost

[Observed — data pack summary, verified: cost_breakdown.sql]

$0.062 total. $0.0027 per pass. Pricing: $0.50 input / $1.50 output per 1M tokens, the same rate as Mistral Large 3 675B.

task_03 (log investigation) cost $0.020 of the $0.062 total (32%), driven by 38,592 input tokens reading the large access.log. This is consistent with every prior log-investigation campaign.
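The task_03 figure can be sanity-checked from the input rate alone. This sketch uses only numbers quoted in this section and ignores output tokens, which is why it lands slightly under the reported $0.020.

```python
INPUT_USD_PER_MTOK = 0.50       # input rate quoted above
TASK_03_INPUT_TOKENS = 38_592   # total across the three task_03 runs
TASK_03_COST_USD = 0.020        # reported task_03 cost
CAMPAIGN_COST_USD = 0.062       # reported campaign cost

input_cost = TASK_03_INPUT_TOKENS * INPUT_USD_PER_MTOK / 1_000_000
share = TASK_03_COST_USD / CAMPAIGN_COST_USD

print(f"input-only cost: ${input_cost:.4f}")
# prints: input-only cost: $0.0193
print(f"share of campaign: {share:.0%}")
# prints: share of campaign: 32%
```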

Nearest comparators:

| Model | Score | $/pass | Architecture |
| --- | --- | --- | --- |
| Mistral Large 3 | 27/30 | $0.00213 | Dense 675B |
| Magistral Small 2509 | 23/30 | $0.00270 | Reasoning (small) |
| Ministral 14B | 23/30 | $0.00103 | Dense 14B |

Magistral Small costs about 2.6x as much per pass as Ministral 14B for the same verified score. It also costs 27% more per pass than Mistral Large 3 while scoring 4 fewer passes.
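Both comparisons come straight from the $/pass column. A short check, with the figures copied from the comparator table:

```python
cost_per_pass = {
    "Mistral Large 3": 0.00213,
    "Magistral Small 2509": 0.00270,
    "Ministral 14B": 0.00103,
}
magistral = cost_per_pass["Magistral Small 2509"]

ratio_vs_ministral = magistral / cost_per_pass["Ministral 14B"]
premium_vs_large = magistral / cost_per_pass["Mistral Large 3"] - 1

print(f"vs Ministral 14B: {ratio_vs_ministral:.1f}x")
# prints: vs Ministral 14B: 2.6x
print(f"vs Mistral Large 3: +{premium_vs_large:.0%}")
# prints: vs Mistral Large 3: +27%
```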

The cost premium is driven by output pricing ($1.50/1M) combined with reasoning model verbosity — longer outputs even on tasks where Ministral 14B is terse.


Predictions

[Observed — brief predictions section]

| Prediction | Claim | Result |
| --- | --- | --- |
| P1 | Score ≤ 23/30 (reasoning regression vs Ministral 14B) | CORRECT (scored exactly 23/30) |
| P2 | task_07 ≤ 1/3 (reasoning suppression) | WRONG (3/3) |
| P3 | ≥1 [TOOL_CALLS] format failure | WRONG (0 format failures across 30 runs) |

1/3 correct. P1 landed at the bound exactly — technically correct, misleading as a prediction. The headline miss is P2: the reasoning-regression prior was wrong, and the [TOOL_CALLS] format is the most coherent explanation for why. P3 was the safety hedge on the adapter work; it did not materialise.

The underprediction of 6 points from the headline estimate (17/30 predicted, 23/30 actual) flows from the same source as P2. If task_07 had failed as expected (0/3), the score would have been 20/30, close to the prediction. task_07 passing 3/3 is where the gap opened.


Leaderboard

[Observed — cross-campaign data]

| Model | Score | $/pass | Lab |
| --- | --- | --- | --- |
| Claude Sonnet 4.6 | 28/30 | $0.0514 | Anthropic |
| GLM-4.7 | 28/30 | $0.0038 | Zhipu AI |
| Mistral Large 3 | 27/30 | $0.00213 | Mistral |
| Devstral 2 | 27/30 | $0.0020 | Mistral |
| GPT-OSS 20B | 25/30 | $0.00048 | OpenAI |
| GLM-4.7-Flash | 25/30 | $0.00057 | Zhipu AI |
| Kimi K2.5 | 24/30 | $0.0044 | Moonshot AI |
| Magistral Small 2509 | 23/30 | $0.00270 | Mistral |
| Ministral 14B | 23/30 | $0.00103 | Mistral |
| Qwen3 32B | 23/30 | | Alibaba |
| GPT-OSS 120B | 23/30 | $0.0013 | OpenAI |
| Amazon Nova Pro | 20/30 | $0.0068 | Amazon |
| Llama 3.3 70B | 14/30 | $0.0047 | Meta |
| Nemotron Super 3 120B | 12/30 | $0.0016 | NVIDIA |
| Kimi K2 Thinking | 12/30 | $0.00793 | Moonshot AI |
| Jamba 1.5 Large | 8/30 | $0.0044 | AI21 |

Magistral Small enters the 23/30 cluster alongside Ministral 14B, Qwen3 32B, and GPT-OSS 120B. Within that group it is the most expensive per pass among the models with a listed figure (Qwen3 32B has none). It is also the only reasoning-format model in the cluster.

The Mistral family now has three data points: Large 3 (27/30, $0.00213), Magistral Small reasoning (23/30, $0.00270), Ministral 14B non-reasoning (23/30, $0.00103). Larger gets more passes. Against Ministral 14B, the reasoning format did not buy additional passes: it reached the same score at a higher cost and by a different path. Whether that different path generalises to tasks outside agentic-core-v1 is a question the harness cannot answer.


What we don’t know

[Speculation]

The [TOOL_CALLS] hypothesis — that text-format tool invocation prevents reasoning-trace shortcircuiting — is the most plausible explanation for the task_07 result. Testing it properly would require running Magistral Small on a harness version with native toolUse, if Mistral ships that variant. If scores drop with native toolUse on task_07, the hypothesis is strengthened. If they don’t, the explanation is wrong and we have a different question to answer.

task_08’s character-count failures (27, 23, 31 against a correct 35) don’t have a confirmed mechanism. The transcript_id fields are empty in the data pack for these runs, so we cannot inspect what encoding or counting method the model used. The non-zero run-to-run variation suggests the model is not repeating a cached wrong answer — but we cannot confirm what it is computing.

Ministral 3 8B is next in the campaign queue. That will give us a same-family comparison at smaller scale: if 8B also passes task_07 cleanly, the Mistral family is consistently strong on sequential execution independent of format. If it fails, the [TOOL_CALLS] format effect becomes a sharper candidate for what Magistral Small specifically had going for it.

ClawWorks Weekly

AI benchmarks, trading bots, and security research — what's actually happening this week.