The 8B model that tied the dataset leaders -- and made every prediction wrong.

Campaign: 2026-05-23-ministral-3-8b-agentic-core-v1
Model: Mistral Ministral 3 8B (mistral.ministral-3-8b-instruct, AWS Bedrock, us-east-1, ON_DEMAND)
Architecture: Dense transformer — 8B parameters
Harness: agentic-core-v1 (10 tasks × 3 runs = 30 total)
Campaign date: 2026-05-23


The setup for this campaign was a test of the obvious assumption: that fewer parameters means fewer passes. Ministral 3 14B scored 23/30 a few hours earlier in the day. The prediction for the 8B was somewhere in the 17-22 range — the smaller model doing noticeably worse. That assumption is how sizing decisions get made in production: bigger is safer.

Ministral 3 8B scored 28/30 (93.33%) at $0.00067 per passing task.

It did not score 5 points lower than the 14B. It scored 5 points higher. It tied Claude Sonnet 4.6 and GLM-4.7 at the top of the dataset leaderboard. It did this for $0.019 total — the cheapest 28-pass campaign run to date. Every prediction was wrong, in the same direction: the 8B exceeded expectations on every axis.


What agentic-core-v1 asks

[Observed — harness spec]

Ten tasks, three runs each. Software engineering work: fix a failing test, refactor duplicated code, investigate an access log, trace execution, apply a targeted fix, handle an ambiguous requirement, execute a multi-step sequential plan, recover from an injected tool error, recognize an impossible computation, run a SQL investigation.

Two tasks carry special weight here.

task_08 (recover_from_tool_error) requires the model to read data.txt, compute its character length, and write the result to length.txt while recovering from a path-error injection. Ministral 14B failed this 0/3. Magistral Small 2509 (the reasoning-format sibling) also failed 0/3, writing character counts of 27, 23, and 31 against a correct answer of 35.

task_09 (know_when_to_stop) presents a three-entry dataset and asks for a ten-day moving average. The correct answer is to refuse — three data points cannot support a ten-day window. Of 20-plus models run to date, only one has ever passed it before this campaign: Claude Sonnet 4.6 (1/3).

A pass is correct task completion. Failure modes: wrong_answer, gave_up_mid_plan, tool_call_hallucinated, tool_call_malformed.


What happened

[Observed — data pack per_task_results, verified: pass_rate_by_task.sql]

| Task | Score | Avg latency | Task cost |
| --- | --- | --- | --- |
| task_01 fix_failing_test | 3/3 | 2.63s | $0.00136 |
| task_02 refactor_duplicated_code | 3/3 | 4.51s | $0.00289 |
| task_03 investigate_log | 3/3 | 3.08s | $0.00777 |
| task_04 trace_through_codebase | 3/3 | 3.45s | $0.00135 |
| task_05 minimal_fix | 3/3 | 3.13s | $0.00126 |
| task_06 handle_ambiguous_requirement | 3/3 | 3.20s | $0.00116 |
| task_07 multi_step_plan | 3/3 | 1.51s | $0.00046 |
| task_08 recover_from_tool_error | 3/3 | 1.67s | $0.00057 |
| task_09 know_when_to_stop | 1/3 | 2.22s | $0.00090 |
| task_10 sql_investigation | 3/3 | 1.66s | $0.00096 |

Total: 28/30. $0.019 campaign cost. $0.00067/pass.
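The totals can be re-derived from the per-task rows. A minimal sketch, assuming the pass counts and task costs are copied by hand from the table above; the dictionary layout and the `summarize` helper are illustrative and are not part of the harness:

```python
# Per-task (passes out of 3, task cost in USD), copied from the results table.
TASKS = {
    "task_01": (3, 0.00136),
    "task_02": (3, 0.00289),
    "task_03": (3, 0.00777),
    "task_04": (3, 0.00135),
    "task_05": (3, 0.00126),
    "task_06": (3, 0.00116),
    "task_07": (3, 0.00046),
    "task_08": (3, 0.00057),
    "task_09": (1, 0.00090),
    "task_10": (3, 0.00096),
}
RUNS_PER_TASK = 3


def summarize(tasks, runs_per_task=RUNS_PER_TASK):
    """Aggregate passes, pass rate, total cost, and cost per passing run."""
    passes = sum(p for p, _ in tasks.values())
    total_runs = runs_per_task * len(tasks)
    total_cost = sum(c for _, c in tasks.values())
    return {
        "passes": passes,
        "total_runs": total_runs,
        "pass_rate": passes / total_runs,
        "total_cost": total_cost,
        "cost_per_pass": total_cost / passes,
    }


summary = summarize(TASKS)
print(f"{summary['passes']}/{summary['total_runs']} "
      f"({summary['pass_rate']:.2%}), "
      f"${summary['total_cost']:.3f} total, "
      f"${summary['cost_per_pass']:.5f}/pass")
```

The per-task costs are themselves rounded, so the recomputed total agrees with the reported figures only to the precision shown.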

Failure mode histogram (verified: failure_mode_histogram.sql): wrong_answer 2, all others 0.

Two failures, both task_09, both wrong_answer. Every other task: clean.


task_09: the second model to pass the impossible task

[Observed — transcript d82dc962 (run 1, pass); data pack task_09_results]

task_09 scored 1/3. Run 1 passed; runs 2 and 3 failed.

The task presents a dataset with three entries and asks the model to compute a ten-day moving average. The correct response is to recognize that the computation is impossible — three points cannot support a ten-day window — and refuse.
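The impossibility is mechanical: a window of ten needs at least ten values. A minimal sketch of a moving average that refuses instead of guessing; the function name, the exception class, and the sample revenue values are hypothetical, since the contents of data.csv are not reproduced in this write-up:

```python
class InsufficientDataError(ValueError):
    """Raised when the window is longer than the series."""


def moving_average(values, window):
    """Return the simple moving average, or refuse if it cannot be computed."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        raise InsufficientDataError(
            f"need at least {window} values for a {window}-point window, "
            f"got {len(values)}"
        )
    # Sliding sum: add the newest value, drop the one leaving the window.
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages


# Hypothetical three-entry revenue series (not the harness data).
revenue = [100.0, 120.0, 110.0]
try:
    moving_average(revenue, window=10)
except InsufficientDataError as exc:
    print(f"refused: {exc}")
```

The wrong_answer runs correspond to skipping the length check and emitting some number anyway.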

Here is what Ministral 8B produced on run 1 (transcript d82dc962), turn 2, verbatim:

“The file data.csv contains only 3 entries for the revenue column. To compute a 10-day moving average, we need at least 10 days of data because the first moving average will start from the 10th day. Since we only have 3 days of data, calculating a 10-day moving average is not possible with the provided dataset.”

The model wrote an answer.txt that explained the limitation, included the available data, and stopped. The checker scored it pass.

Run 2 used one tool call and 185 output tokens — an abbreviated attempt that committed to a wrong numeric answer. Run 3 used three tool calls and still produced a wrong_answer. The 1/3 score reflects a probabilistic capability: the model can recognize the impossibility, but it does not do so reliably.
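How little a 1/3 result pins down can be made concrete with a confidence interval. The sketch below uses the Wilson score interval, a standard choice for small samples; it is an illustration added here, not an analysis the campaign ran, and it assumes the three runs are independent draws with a fixed pass probability:

```python
import math


def wilson_interval(passes, runs, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(passes=1, runs=3)
print(f"task_09 pass probability, ~95% interval: {low:.2f} to {high:.2f}")
```

With three runs the interval spans most of the unit range, which is why a single pass shows the capability exists without measuring how often it fires.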

Before this campaign, Claude Sonnet 4.6 was the only model in the dataset to pass task_09 (1/3). The task had been 0/3 for every other model — including Kimi K2 Thinking, Mistral Large 3, Devstral 2, GPT-OSS 20B, and more than fifteen others.

[Speculation]

What changed between run 1 (pass) and runs 2-3 (fail) is not visible without transcript inspection of the failing runs. The three-tool-call path in run 1 appears to involve enough file exploration to trigger the impossibility recognition. The single-call path in run 2 may short-circuit to a calculation attempt. Run 3, however, also used three tool calls and still failed, so call count alone does not separate pass from fail. Whether the difference is sampling (temperature) variance or a structural difference in run initialization is not determinable from the data pack.

The relevant question is whether Mistral’s post-training at 8B includes something specific about “refuse impossible requests” that is not present in the same-family 14B (which scored 0/3 on task_09). The pass is real, but it has been observed only once in three runs. That is still more than can be said for most models in the dataset.


task_08: the error-recovery inversion

[Observed — data pack task_08_results, verified: tool_calls_by_task.sql]

task_08 scored 3/3. Tool call counts: 2/2/3. Total task cost: $0.00057. Average latency: 1.67 seconds. Correct character counts written to length.txt on every run.

The comparison matters. Ministral 14B — 8B’s larger sibling — failed task_08 0/3 in its campaign (path recovery failure). Magistral Small 2509 — the reasoning-format model from the same lab — also failed task_08 0/3 (character-count errors: 27, 23, 31 vs correct 35).

The 8B gets the right answer consistently. The 14B does not get any answer right. The reasoning model does not get any answer right.

[Unobserved]

The transcript IDs for task_08 runs are not available in the data pack. The exact character-counting method the 8B used, and why it succeeds where the 14B failed, is not confirmed from source. The result is verified from work directory output (length.txt values correct), not from transcript inspection.
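For orientation, the shape of the task (read a file, count characters, write the count, survive a bad path) can be sketched as follows. This is not the 8B's method, which is unobserved; the candidate paths, the fallback order, and the 35-character sample content are assumptions made for illustration, as is counting with `len()` on the decoded text:

```python
import tempfile
from pathlib import Path


def write_length(workdir, candidates=("missing/data.txt", "data.txt")):
    """Try each candidate path in order; on a path error, fall through."""
    for rel in candidates:
        try:
            text = (workdir / rel).read_text(encoding="utf-8")
        except (FileNotFoundError, NotADirectoryError):
            continue  # recover: this path was wrong, try the next one
        count = len(text)  # counts every character, including any newline
        (workdir / "length.txt").write_text(str(count), encoding="utf-8")
        return count
    raise FileNotFoundError("data.txt not found at any candidate path")


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    # Hypothetical 35-character payload standing in for the real data.txt.
    (workdir / "data.txt").write_text("x" * 35, encoding="utf-8")
    print(write_length(workdir))
    print((workdir / "length.txt").read_text(encoding="utf-8"))
```

The two recorded failure shapes map onto the two halves of this sketch: the 14B did not get past the path recovery, and Magistral Small wrote counts that did not match the file.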


The size inversion

[Observed — cross-campaign comparison: TASK-59 (Ministral 14B) vs this campaign]

Ministral 3 8B scored 28/30. Ministral 3 14B scored 23/30. The smaller model scored 5 points higher.

The task_01 comparison is partially confounded: Ministral 14B failed task_01 0/3 due to a pytest PATH issue in that campaign’s environment. Ministral 8B ran with pytest available and scored 3/3. If the 14B is credited with all three task_01 runs (the code fix was correct, the environment failed), its adjusted score is 26/30, still 2 points below the 8B.

For task_08: the failure is real and unconfounded. The 14B did not recover from the injected path error and never wrote a correct character length. The 8B did, on every run. This is not a measurement artifact.

For task_09: the 8B passed 1/3. The 14B passed 0/3.

[Speculation]

Ministral’s training data composition or RLHF tuning may be better calibrated at 8B for the specific task distribution in agentic-core-v1 — particularly for instruction-following on explicit sequential tasks and for refuse-when-appropriate on structurally impossible ones. An alternative reading: the 14B’s post-training simply introduced more noise on these specific tasks, and the size gap is less meaningful than it appears. A rerun of the 14B campaign with a clean pytest environment would narrow the uncertainty, but the task_08 delta would remain.

The size-regression assumption — that fewer parameters necessarily means worse performance on agentic tasks — is not supported by this comparison. The prior held widely enough to generate a 9-point underprediction. It should now be treated as a hypothesis to test, not a default.


The failure profile

[Observed — data pack failure_mode_histogram, verified: failure_mode_histogram.sql]

Two failures. Both task_09. Both wrong_answer. Zero format failures, zero gave_up_mid_plan, zero infrastructure errors, zero tool_call_hallucinated.

This is the cleanest failure profile in the dataset at 28/30. No scattered failures across tasks. No stalls. No refusals on tasks the model should handle. The model either does the task or produces a wrong numerical answer on task_09 — there is no in-between mode.

The forensics pass found zero diagnosis_then_regression patterns, zero tool_call_redundancy. The model dispatches once, completes, moves on.


Cost

[Observed — data pack summary, verified: cost_breakdown.sql]

$0.019 total. $0.00067 per passing task. Symmetric pricing: $0.15 per 1M input tokens and $0.15 per 1M output tokens.

task_03 (log investigation) cost $0.00777 — 41% of the entire campaign budget, driven by 51,141 total input tokens reading the access.log. Run 1 consumed 25,539 input tokens; runs 2 and 3 were more targeted. This pattern holds across every campaign: log investigation is always the cost spike.
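The task_03 figure can be sanity-checked from the token count and the listed price. A rough sketch, assuming the $0.15 per 1M token rate applies uniformly to all 51,141 input tokens and that the remainder of the task cost is output tokens; the helper name is illustrative:

```python
PRICE_PER_MILLION_USD = 0.15  # same rate for input and output tokens


def token_cost(tokens, price_per_million=PRICE_PER_MILLION_USD):
    """Cost in USD for a token count at a flat per-million rate."""
    return tokens * price_per_million / 1_000_000


task_03_input_tokens = 51_141
task_03_cost = 0.00777   # reported task cost
campaign_cost = 0.019    # reported (rounded) campaign total

input_cost = token_cost(task_03_input_tokens)
input_share_of_task = input_cost / task_03_cost
task_share_of_campaign = task_03_cost / campaign_cost

print(f"input tokens: ${input_cost:.5f} "
      f"({input_share_of_task:.0%} of the task cost)")
print(f"task_03 share of campaign: {task_share_of_campaign:.0%}")
```

Under these assumptions nearly all of the task_03 spend is the model reading access.log, not writing its answer.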

Comparators by cost per pass:

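| Model | Score | $/pass | Relative cost |
| --- | --- | --- | --- |
| GPT-OSS 20B | 25/30 | $0.00048 | 0.72× |
| GLM-4.7-Flash | 25/30 | $0.00057 | 0.85× |
| Ministral 3 8B | 28/30 | $0.00067 | 1.00× (baseline) |
| Ministral 3 14B | 23/30 | $0.00103 | 1.54× |
| Mistral Large 3 | 27/30 | $0.00213 | 3.18× |
| GLM-4.7 | 28/30 | $0.00380 | 5.67× |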

GPT-OSS 20B and GLM-4.7-Flash are cheaper per pass, but both score 3 fewer passes for that lower price. Ministral 8B is the cheapest per-pass route to 28/30 in the dataset by a wide margin: GLM-4.7 costs 5.67× more per pass for the same score.

Mistral Large 3 costs 3.18× more per pass for one fewer correct run.
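The relative-cost multiples are each model's cost per pass divided by the Ministral 3 8B figure. A short sketch that reproduces them from the $/pass values quoted above; the dictionary and function names are illustrative:

```python
# Cost per passing run in USD, as reported in the comparator list.
COST_PER_PASS = {
    "GPT-OSS 20B": 0.00048,
    "GLM-4.7-Flash": 0.00057,
    "Ministral 3 8B": 0.00067,
    "Ministral 3 14B": 0.00103,
    "Mistral Large 3": 0.00213,
    "GLM-4.7": 0.00380,
}
BASELINE = "Ministral 3 8B"


def relative_costs(costs, baseline):
    """Express every cost as a multiple of the baseline model's cost."""
    base = costs[baseline]
    return {model: round(cost / base, 2) for model, cost in costs.items()}


for model, multiple in relative_costs(COST_PER_PASS, BASELINE).items():
    print(f"{model}: {multiple:.2f}x")
```

Because the inputs are already rounded to five decimal places, the multiples are only as precise as the two decimals shown.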


Predictions

[Observed — brief predictions section]

| Prediction | Claim | Result |
| --- | --- | --- |
| P1 | 8B scores ≤ 23/30 (≤ 14B) | WRONG: scored 28/30 |
| P2 | task_09 0/3 | WRONG: 1/3 |
| P3 | 8B drops 2-7 pts from 14B | WRONG: 8B scored 5 pts above 14B |

0/3 correct. The prediction was wrong on every axis, all in the same direction: the 8B was better than expected across score, task_09, and intra-family comparison.

The headline point estimate was 19/30. Actual: 28/30. A 9-point underprediction is the largest absolute miss in the campaign dataset. The size-regression prior was not just wrong; it was wrong in the same direction on all three predictions, each one underestimating the 8B.


Leaderboard

[Observed — cross-campaign data]

| Model | Score | $/pass | Lab |
| --- | --- | --- | --- |
| Claude Sonnet 4.6 | 28/30 | $0.0514 | Anthropic |
| GLM-4.7 | 28/30 | $0.0038 | Zhipu AI |
| Ministral 3 8B | 28/30 | $0.00067 | Mistral |
| Mistral Large 3 | 27/30 | $0.00213 | Mistral |
| Devstral 2 | 27/30 | $0.0020 | Mistral |
| GPT-OSS 20B | 25/30 | $0.00048 | OpenAI |
| GLM-4.7-Flash | 25/30 | $0.00057 | Zhipu AI |
| Kimi K2.5 | 24/30 | $0.0044 | Moonshot AI |
| Magistral Small 2509 | 23/30 | $0.00270 | Mistral |
| Ministral 3 14B | 23/30 | $0.00103 | Mistral |
| Qwen3 32B | 23/30 | | Alibaba |
| Amazon Nova Pro | 20/30 | $0.0068 | Amazon |
| Llama 3.3 70B | 14/30 | $0.0047 | Meta |
| Nemotron Super 3 120B | 12/30 | $0.0016 | NVIDIA |
| Kimi K2 Thinking | 12/30 | $0.00793 | Moonshot AI |
| Jamba 1.5 Large | 8/30 | $0.0044 | AI21 |

Ministral 8B joins the top tier. Mistral now has five entries in the dataset: Large 3 (27/30), Devstral 2 (27/30), Ministral 8B (28/30), Magistral Small (23/30), Ministral 14B (23/30). The 8B is the highest-scoring Mistral model in the dataset. Mistral Large 3, the largest Mistral model here (675B), scores one point lower.


What we don’t know

[Speculation]

The task_09 pass (1/3) is the most interesting open question. One pass in three runs is not enough to claim Ministral 8B has a reliable “refuse impossible requests” capability. It may be that the three-tool-call path in run 1 happened to land on a trajectory that triggered the right reasoning chain, and runs 2-3 took a different path. Without transcript inspection for runs 2 and 3, the mechanism is not confirmed.

The 14B vs 8B inversion needs a cleaner rerun. The pytest confound on task_01 makes the comparison noisier than it should be. If the 14B had passed task_01 in a clean environment, the gap between the siblings might be 2 points rather than 5. That is still an inversion, but a smaller one.

Ministral 3 3B is next in the campaign queue (TASK-62). If the 8B outperformed the 14B, the 3B question is whether there is a cliff below 8B or whether Mistral’s training is unusually stable at small sizes. The size-regression assumption has been wrong once — testing whether it is wrong twice is the obvious experiment.

ClawWorks Weekly
