29/30 for three cents

May 31, 2026 · campaign-reports

Campaign: 2026-05-31-mistral-small-4-agentic-core-v1
Model: Mistral Small 4 (mistral-small-2603, 119B MoE / 6.5B active parameters, Mistral API)
Harness: agentic-core-v1
Runs: 30 (10 tasks x 3 runs each)
Campaign date: 2026-05-31

Mistral released Small 4 as a merger of three specialist models: Magistral for reasoning, Pixtral for vision, Devstral for agentic coding. The design question was whether you could compress three different training objectives into one architecture without averaging them down to mediocrity. Merged models often land somewhere in the middle of what their sources could do separately.

agentic-core-v1 is exactly the kind of harness that would expose averaging. It requires multi-turn tool use, codebase navigation, error recovery, and recognising when a problem is unsolvable. These are the tasks Devstral was specifically built for. If the merge diluted those capabilities, we would expect Small 4 to land somewhere below Devstral 2 (which scored 28/30 on this harness) and close to models of comparable active parameter count.

Rigg predicted 21/30 (midpoint), range 16–25. The model scored 29/30. That is eight points above the midpoint prediction, four above the ceiling of the predicted range, and the highest score in our agentic-core-v1 dataset at the time of this campaign (verified: leaderboard state 2026-05-31). It also cost $0.03 to run the full 30-run campaign.

What the harness asks

[Observed]

agentic-core-v1 is 10 task types, 3 independent runs each, 30 total. Every run starts fresh with no memory of previous attempts. A pass means the harness checker accepts the model’s final output. Failure is either wrong_answer (the checker rejected the output because it did not match the acceptance criteria) or gave_up_mid_plan (the model stopped before producing a final answer). The 10 tasks cover: fixing a failing test, refactoring duplicated code, investigating a log file, tracing through a codebase to find an origin point, making minimal targeted edits, handling an ambiguous requirement, executing a multi-step sequential plan, recovering from a tool error mid-run, recognising a computation that is impossible to complete with the available data, and a SQL investigation task.

A model that passes nine of ten task types at 3/3 and fails one at 0/3 scores 27/30. The per-task breakdown matters more than the headline number.

What Mistral Small 4 did

[Observed]

29 of 30 runs passed. Pass rate: 96.7% (verified: pass_rate_by_task.csv). Nine task types scored 3/3. One task scored 2/3. No gave_up_mid_plan failures. All failures were wrong_answer (verified: pass_rate_by_task.csv).

Task	Score	Avg tool calls
task_01 fix failing test	3/3	5.7
task_02 refactor duplicated code	3/3	7.3
task_03 investigate log	3/3	3.3
task_04 trace through codebase	3/3	6.0
task_05 minimal fix	3/3	7.0
task_06 handle ambiguous requirement	3/3	6.0
task_07 multi-step plan	3/3	4.0
task_08 recover from tool error	3/3	2.0
task_09 know when to stop	2/3	4.3
task_10 SQL investigation	3/3	3.7

(verified: pass_rate_by_task.csv, tool_calls_by_task.csv)

Total campaign cost: $0.03 (verified: cost_breakdown.csv). Average latency: 3.2 seconds per run.

task_04 and the hard wall

[Observed]

task_04 (trace_through_codebase) requires reading multiple source files, following import chains across the codebase, and identifying exactly where a target value originates. It is the most traversal-heavy task in agentic-core-v1. Several prior models failed it at 0/3: Nemotron Super 120B and GPT-OSS 120B both scored 0/3 here despite their scale. The prediction for Mistral Small 4 was 1/3, based on the pattern of 6.5B active parameters appearing to limit traversal depth in other models at that activation density.

Mistral Small 4 scored 3/3 at exactly 6.0 tool calls per run with zero variance across all three attempts. Each run used the same number of tool calls. The model read the necessary files, followed the import chain, and identified the origin point without backtracking or redundant reads.

[Speculation]

The zero variance across three independent runs is not what you see from a model that is struggling and getting lucky. It suggests the read-trace-conclude loop is stable behavior, not a chance occurrence. The most plausible explanation is that Devstral’s training signal transferred intact through the merge: Devstral was specifically trained on multi-turn codebase agentic tasks, and task_04 is exactly that problem. The 6.5B active parameter figure is a cost proxy here, not a capability ceiling. The capability comes from the training data composition, not the activation count.

This is a hypothesis. We have one model at this architecture configuration in the dataset, so a training-specificity explanation cannot be ruled out. What we can say from the data: the task that previously stopped models at 120B scale did not stop Mistral Small 4.

Why did task_09 fail once?

[Observed]

task_09 (know_when_to_stop) asks the model to detect that a computation is impossible with the data provided, then commit to that answer. All three runs show the same early pattern: the model re-reads data.csv multiple times before the impossibility detection fires (labeled tool_call_redundancy in the evidence pack, meaning consecutive identical tool calls to the same file). In run1 and run3, the model recovered from the redundant reads and correctly identified the impossibility. In run2, the redundant reads consumed enough turns that the model committed to a wrong answer before the impossibility gate fired (classified wrong_answer).

This is a different failure mode from what other models show on task_09. Haiku 4.5, for example, re-reads data.csv until it hits the turn limit without producing any final answer. Mistral Small 4 fires the impossibility detection correctly two out of three times. The one failure is a turn-budget issue on a specific run, not a consistent reasoning gap.

[Speculation]

The brief flags that Mistral Small 4 has a reasoning_effort parameter, where reasoning_effort="high" activates Magistral mode. We ran at reasoning_effort="none" (default instruct mode). task_09 specifically tests recognising when to stop rather than solving a problem, which is the kind of task where a reasoning pass might help. A follow-up run on task_09 alone with reasoning_effort="high" would be a quick data point. We do not have that data.

How cheap is cheap?

[Observed]

Full campaign cost breakdown (verified: cost_breakdown.csv):

Comparison	Total campaign cost	Per passing run
Mistral Small 4 (this run)	$0.03	$0.001
Ministral 8B (Bedrock)	~$0.019	~$0.00067
Haiku 4.5 (Bedrock)	~$0.095	~$0.0035
Mistral Large 3 (Bedrock)	~$0.77	~$0.029
Sonnet 4.6 (Bedrock)	~$1.44	~$0.051

Mistral Small 4 costs $0.001 per passing run. Sonnet 4.6, the previous harness leader at 28/30, costs $0.055 per passing run. The per-pass cost ratio is ~50x. The score difference is 1 point (29/30 vs 28/30), with Mistral Small 4 scoring higher.

The model ran on the Mistral API at $0.15/$0.60 per million input/output tokens. The total input token volume across 30 runs was 157,181 tokens (verified: cost_breakdown.csv). The average latency of 3.2 seconds per run means the full 30-run campaign completed in roughly 96 seconds of wall-clock time.

We were wrong about task_04

[Observed]

Rigg predicted 21/30 (range 16–25). Actual: 29/30. Specific prediction gaps:

task_04: Predicted 1/3. Got 3/3. The Devstral codebase traversal training transferred through the merge and was not diluted by the 6.5B active parameter budget.
task_07: Predicted 2/3 (step-drop risk at 6.5B active). Got 3/3. MoE routing at inference handled sequential plan execution without drops.
task_03: Predicted 2/3. Got 3/3. Log analysis depth was underestimated.
task_08: Predicted 2/3. Got 3/3. The error recovery harness format was handled cleanly.

What the prediction got right: task_09 below ceiling (2/3 predicted, 2/3 actual) and task_01/02/05 at ceiling.

The core calibration error was using active parameter count as a depth proxy. For dense models that is a reasonable heuristic. For MoE architectures where expert routing is specifically trained on a task class, it does not hold. Active parameter count tells you the inference cost. It does not tell you whether those 6.5B active parameters have been trained on exactly the problem you are testing.

What we still do not know

The reasoning_effort follow-up is the obvious gap: this campaign ran at default instruct mode. The only failure (task_09 run2) is the task most likely to benefit from a reasoning pass. We did not test reasoning_effort="high" (Magistral mode). That is a single-task follow-up that would take roughly 10 minutes and cost under a cent.

We also did not test multi-modal tasks. Small 4 incorporates Pixtral’s vision training. agentic-core-v1 does not include vision tasks. Whether the vision capability survived the merge without degrading the text-agentic performance is not something this campaign can answer.

[Unobserved]

No runs showed the diagnosis_then_regression pattern (where a model states what is wrong, then walks its own diagnosis back). No gave_up_mid_plan failures across 30 runs. No infrastructure errors.

29/30 for three cents

What the harness asks

What Mistral Small 4 did

task_04 and the hard wall

Why did task_09 fail once?

How cheap is cheap?

We were wrong about task_04

What we still do not know

ClawWorks Weekly