The model fixed the bug. The test runner didn't have pytest.

Campaign: 2026-05-23-ministral-3-14b-agentic-core-v1
Model: Mistral Ministral 3 14B (mistral.ministral-3-14b-instruct, AWS Bedrock, us-east-1, ON_DEMAND)
Architecture: Dense transformer — 14B parameters
Harness: agentic-core-v1 (10 tasks × 3 runs = 30 total)
Campaign date: 2026-05-23


Mistral’s Ministral 3 family is designed for efficient, small-footprint deployment. The 14B model at the top of that family sits 48 times smaller than Mistral Large 3 (675B), which scored 27/30 on the same harness. The pre-run expectation was something modest: a score somewhere in the 17–22 range, a clean confirmation that 14B is viable at all, and a cost floor worth reporting.

Ministral 3 14B scored 23/30 (76.67%) at $0.00103 per passing task. The prediction was wrong by one point on the verified score, and by four if the three task_01 runs are credited for a correct fix. The model passed 7 of 10 task types at 3/3, including task_07 (multi-step sequential writes), where a model more than eight times its size failed every attempt, and it took 2 of 3 on task_08. The only tasks it failed outright were task_09, the impossible-computation task that requires refusing to answer and that almost every model in the dataset fails, and task_01, which failed for a reason unrelated to the model.


What the harness asks

[Observed — harness spec]

Ten tasks, three independent runs each. agentic-core-v1 covers software engineering work: fix a failing test, refactor duplicated code, investigate a log, trace execution through a codebase, make a targeted minimal fix, handle an ambiguous requirement, execute a multi-step sequential plan, recover from an injected tool error, recognize an impossible computation and refuse, run a SQL investigation.

Two tasks matter for context. task_07 asks the model to create four files in sequence (step1.txt through step4.txt), using fs_write only, in the prescribed order. No reading existing code, no error recovery. Pure sequential execution. task_09 presents a 7-row dataset and asks for a 10-day moving average; the correct answer is to recognize that seven data points cannot support a ten-day window and refuse. Claude Sonnet 4.6 is the only model in the dataset to have passed it, and only in 1 of 3 runs.
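The refusal that task_09 is looking for amounts to a guard on window length. A minimal Python sketch of that guard, assuming a simple (unweighted) moving average; the function name, the exception class, and the sample values are illustrative and not taken from the harness:

```python
from typing import Sequence


class InsufficientDataError(ValueError):
    """Raised when the requested window is longer than the series."""


def moving_average(values: Sequence[float], window: int) -> list[float]:
    """Return the simple moving average, refusing windows the data cannot support."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        raise InsufficientDataError(
            f"need at least {window} data points, got {len(values)}"
        )
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical 7-row series standing in for the task_09 dataset.
seven_rows = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]

try:
    moving_average(seven_rows, window=10)
    refusal = None
except InsufficientDataError as exc:
    refusal = str(exc)  # the "answer" is the refusal itself

print(refusal)
```

A model that passes task_09 has to reach the same conclusion as the guard: the right output is the refusal, not a number computed from a shortened window.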

A pass is correct task completion. Each failure is classified as one of: wrong_answer, gave_up_mid_plan, tool_call_hallucinated, tool_call_malformed, or infrastructure_error.


What happened

[Observed — data pack per_task_results, verified: pass_rate_by_task.sql]

| Task | Score | Avg latency | Notes |
| --- | --- | --- | --- |
| task_01 fix_failing_test | 0/3 | 4.1s | Env confound (see below) |
| task_02 refactor_duplicated_code | 3/3 | 5.5s | Clean |
| task_03 investigate_log | 3/3 | 1.9s | |
| task_04 trace_through_codebase | 3/3 | 4.2s | |
| task_05 minimal_fix | 3/3 | 3.9s | |
| task_06 handle_ambiguous_requirement | 3/3 | 3.2s | Above prediction |
| task_07 multi_step_plan | 3/3 | 1.4s | 4 tool calls per run, no variation |
| task_08 recover_from_tool_error | 2/3 | 1.2s | 1 wrong_answer |
| task_09 know_when_to_stop | 0/3 | 1.5s | Expected |
| task_10 sql_investigation | 3/3 | 1.6s | |

Total: 23/30. $0.0238 campaign cost. $0.00103/pass.
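The headline figures follow directly from the per-task counts. The sketch below recomputes them; the dictionary simply restates the results table, and the variable names are illustrative rather than part of the data pack.

```python
# Per-task pass counts copied from the results table (3 runs each).
passes_by_task = {
    "task_01": 0, "task_02": 3, "task_03": 3, "task_04": 3, "task_05": 3,
    "task_06": 3, "task_07": 3, "task_08": 2, "task_09": 0, "task_10": 3,
}
RUNS_PER_TASK = 3
CAMPAIGN_COST_USD = 0.0238  # reported campaign total

total_passes = sum(passes_by_task.values())
total_runs = RUNS_PER_TASK * len(passes_by_task)
pass_rate = total_passes / total_runs
cost_per_pass = CAMPAIGN_COST_USD / total_passes
clean_tasks = [t for t, p in passes_by_task.items() if p == RUNS_PER_TASK]

print(f"{total_passes}/{total_runs} = {pass_rate:.2%}")  # 23/30 = 76.67%
print(f"${cost_per_pass:.5f} per pass")                  # $0.00103 per pass
print(f"{len(clean_tasks)} tasks at 3/3")                # 7 tasks at 3/3
```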

Failure mode histogram (verified: failure_mode_histogram.sql): wrong_answer 6, tool_call_hallucinated 1, gave_up_mid_plan 0, tool_call_malformed 0, infrastructure_error 0.


task_01: the environment was wrong, not the model

[Observed — transcript refs: 3c74cd7e, 1b88788e, 2a33baf3]

task_01 scored 0/3. In the evidence index this shows as tool_call_hallucinated (run 1) and wrong_answer (runs 2 and 3). Neither classification reflects a model capability failure.

In run 1 (transcript 3c74cd7e), the model correctly identified the bug, correctly wrote the fix, and then hit a test runner that could not execute. Runs 2 and 3 followed the same diagnostic and fix path before failing on the same “No module named pytest” wall.

Post-run filesystem inspection confirms the edit was applied correctly in all three runs. src/add.py reads return a + b after each run. The checker rejected the result because bash run_tests.sh returned non-zero.

This is a harness environment hygiene failure. The python3 resolved by the shell tool did not have pytest installed. This is the first time this confound has appeared in the campaign dataset; prior campaigns ran with system-level pytest available. The latent risk was already there; this run surfaced it.
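One way to surface this kind of confound before a campaign starts is a preflight check that asks the interpreter the shell tool will actually use whether it can import what the checker needs. The following is a hypothetical sketch, not part of agentic-core-v1; the function names and the choice of python3 as the executable are assumptions.

```python
import subprocess


def module_available(python_exe: str, module: str) -> bool:
    """Return True if `python_exe` can import `module`."""
    try:
        result = subprocess.run(
            [python_exe, "-c", f"import {module}"],
            capture_output=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Interpreter missing, not executable, or hung: treat as unavailable.
        return False
    return result.returncode == 0


def preflight(python_exe: str, required: list[str]) -> list[str]:
    """Return the required modules that `python_exe` cannot import."""
    return [m for m in required if not module_available(python_exe, m)]


if __name__ == "__main__":
    # Assumption: the shell tool resolves plain "python3" from PATH.
    missing = preflight("python3", ["pytest"])
    if missing:
        print(f"preflight failed, missing modules: {missing}")
    else:
        print("preflight ok")
```

Running the check through a subprocess rather than importing pytest in the harness process matters here: the confound was about which python3 the shell resolved, which may not be the interpreter running the harness itself.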

If task_01 is credited for code correctness rather than checker pass, the adjusted score is 26/30. That would place Ministral 14B between the 25/30 cluster (GPT-OSS 20B, GLM-4.7-Flash, DeepSeek V4-Flash) and the 27/30 tier (Devstral 2, Mistral Large 3). A 14B model sitting in that band is a data point worth noting even if the verified score stays at 23.


task_07: 14B dense passes where 120B MoE fails

[Observed — data pack task_07_results, verified: tool_calls_by_task.sql; cross-campaign comparison]

task_07 scored 3/3. Four fs_write calls per run, average latency 1.44 seconds, total cost $0.00061 across all three runs. No variation between runs. The model read the instructions, dispatched the writes in sequence, and finished.

The comparison that matters here: Nemotron Super 3 120B, a MoE model with 12B active parameters, scored 0/3 on task_07 in its campaign. It is one of two models in the dataset to fail this task completely; the other is Kimi K2 Thinking, which also scored 0/3. Ministral 14B (dense) passes it without any apparent difficulty.

This is not a subtle difference. task_07 is explicitly specified: create four files in order, use only fs_write. The instruction tells the model exactly what to do. Completing it requires following directions, not reasoning. The most plausible explanation is post-training: Nemotron Super 3 120B's did not reliably produce explicit sequential dispatch in this context, and Ministral 14B's did.
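To make the pass condition concrete, here is a hypothetical checker for a task shaped like task_07. The ToolCall record and the follows_plan function are assumptions for illustration; the harness's real checker is not reproduced here and may also inspect file contents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str  # tool name, e.g. "fs_write"
    path: str  # target file path


# The four files named in the task description, in the prescribed order.
EXPECTED_PATHS = ["step1.txt", "step2.txt", "step3.txt", "step4.txt"]


def follows_plan(calls: list[ToolCall]) -> bool:
    """True only if the run is exactly four fs_write calls in the prescribed order."""
    if any(call.tool != "fs_write" for call in calls):
        return False
    return [call.path for call in calls] == EXPECTED_PATHS


good_run = [ToolCall("fs_write", p) for p in EXPECTED_PATHS]
out_of_order = [good_run[1], good_run[0], good_run[2], good_run[3]]

print(follows_plan(good_run))      # True
print(follows_plan(out_of_order))  # False
```

Nothing in a check like this rewards reasoning. A run passes by doing exactly what the prompt lists, which is why the task isolates instruction-following from problem-solving.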

[Speculation]

The usual framing that “larger models perform better on agentic tasks” holds when tasks require inference under uncertainty. It does not appear to hold when tasks require clean instruction-following. On explicit execution tasks (tasks where the full plan is in the prompt and the model just needs to execute it), post-training depth seems to matter more than active parameter count. The Nemotron vs. Ministral comparison is a clean illustration.

Whether this generalizes beyond task_07 is unclear. We have one data point. But the data point is stark enough to flag.


task_06: the surprising pass

[Observed — data pack task_06_results]

The pre-run prediction had task_06 (handle ambiguous requirement) at around 2/3. It passed 3/3.

task_06 presents an ambiguous engineering request and asks the model to produce a clarification response: recognize the ambiguity and ask the right question rather than charge ahead with an assumption. At 14B, the expectation was that the model would sometimes miss the ambiguity and produce a direct implementation attempt instead.

It did not. Three clean passes, average latency 3.2 seconds.

[Unobserved]

We did not see any runs where the model correctly identified the ambiguity but then produced a follow-up clarification that was malformed or insufficiently specific. All three passes were scored by the checker as complete. Whether the quality of the clarification varied across runs is not captured by the binary pass/fail.


The failure profile

[Observed — data pack failure_mode_histogram, verified: failure_mode_histogram.sql]

Six wrong_answer failures. One tool_call_hallucinated (an empty-name tool call in task_01 run 1, a secondary effect of the pytest environment failure). Zero gave_up_mid_plan. Zero tool_call_malformed. Zero infrastructure errors.

The clean failure profile is more informative than the failures themselves. The model never stalled mid-task, never hit a context overflow, never refused a request it should have handled. Failures are concentrated in two specific structural locations: the task_01 env confound (all three) and task_09 (all three). The task_08 wrong_answer (1 run) is the only failure not explained by a known structural cause.
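The histogram can be rebuilt from the failing runs themselves. In the sketch below, the task_01 and task_08 modes are as reported above, and the three task_09 failures are taken to be wrong_answer because that is the only assignment consistent with the reported total of six. The list layout is illustrative, not the data pack's schema.

```python
from collections import Counter

# One (task, failure_mode) pair per failing run.
failures = [
    ("task_01", "tool_call_hallucinated"),
    ("task_01", "wrong_answer"),
    ("task_01", "wrong_answer"),
    ("task_08", "wrong_answer"),
    ("task_09", "wrong_answer"),  # inferred from the histogram total
    ("task_09", "wrong_answer"),  # inferred from the histogram total
    ("task_09", "wrong_answer"),  # inferred from the histogram total
]

by_mode = Counter(mode for _, mode in failures)
by_task = Counter(task for task, _ in failures)

# Failures outside the two known structural locations.
STRUCTURAL = {"task_01", "task_09"}
unexplained = [f for f in failures if f[0] not in STRUCTURAL]

print(dict(by_mode))
print(dict(by_task))
print(unexplained)
```

Grouping by task rather than by mode is what makes the concentration visible: six of the seven failures sit in two tasks, and only the task_08 run is left over.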

For a 14B model on agentic workloads, this is a consistent profile. It fails where expected and does not scatter random failures across the run.


Cost

[Observed — data pack summary, verified: cost_breakdown.sql]

$0.0238 total. $0.00103 per passing task.

Ministral 14B uses symmetric pricing, $0.20 per 1M tokens for both input and output, which is unusual in this dataset. Most models charge more for output tokens than for input tokens. On tasks with significant output (task_02 refactoring ran 2,121 output tokens on average), the symmetric pricing provides some savings.
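To see what symmetric pricing means per run, the sketch below prices the task_02 average of 2,121 output tokens at the $0.20 per 1M rate and, for contrast, at a hypothetical $0.60 per 1M output rate. The $0.60 figure is an assumption chosen only to illustrate an asymmetric price; it is not any listed model's rate. Input tokens are left at zero because per-task input counts are not given here.

```python
def token_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_m: float,
    output_rate_per_m: float,
) -> float:
    """Cost in USD given token counts and per-million-token rates."""
    return (
        input_tokens * input_rate_per_m + output_tokens * output_rate_per_m
    ) / 1_000_000


TASK_02_AVG_OUTPUT_TOKENS = 2121

# Symmetric pricing as reported: $0.20 per 1M tokens in both directions.
symmetric = token_cost_usd(0, TASK_02_AVG_OUTPUT_TOKENS, 0.20, 0.20)

# Hypothetical asymmetric pricing, for comparison only.
asymmetric = token_cost_usd(0, TASK_02_AVG_OUTPUT_TOKENS, 0.20, 0.60)

print(f"symmetric output cost:  ${symmetric:.7f}")
print(f"asymmetric output cost: ${asymmetric:.7f}")
```

The absolute numbers are tiny either way. The relevant point is the ratio: on output-heavy tasks, the output rate dominates the bill, so a flat rate narrows the gap with models that are nominally cheaper on input.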

The nearest comparators:

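| Model | Score | $/pass | Params |
| --- | --- | --- | --- |
| GPT-OSS 20B | 25/30 | $0.00048 | 20B |
| GLM-4.7-Flash | 25/30 | $0.00057 | |
| Ministral 14B | 23/30 | $0.00103 | 14B |
| GLM-4.7 | 28/30 | $0.0038 | |
| Kimi K2.5 | 24/30 | $0.0044 | |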

GPT-OSS 20B is cheaper and scores 2 points higher. The cost gap is real. Whether it matters depends on deployment context: symmetric pricing, Bedrock availability, and Mistral’s enterprise tooling are all factors that can shift the comparison.


Predictions

[Observed — brief predictions section]

| Prediction | Claim | Result |
| --- | --- | --- |
| P1 | Score ≤ 22/30 | WRONG: scored 23/30 |
| P2 | task_09 0/3 | CORRECT: 0/3 |
| P3 | Score ≥ 15/30, $/pass ≤ $0.0040 | CORRECT: 23/30, $0.00103/pass |

2/3 correct. P1 missed by the narrowest possible margin on the verified score (one point). On the adjusted score (task_01 env confound credited), the miss grows to four points. The model simply performed better than expected on task_06 and task_07, tasks where 14B-sized models had shown degradation in prior campaigns.

P2 was the safe call. task_09 has been 0/3 for every model in the dataset except Claude Sonnet 4.6 (1/3). It held.
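The scoring of the three predictions is mechanical, so it can be written down directly. This sketch uses the thresholds from the predictions table and the verified results; the variable names are illustrative.

```python
verified_score = 23
task_01_runs_credited = 3  # the env-confounded runs, if credited
adjusted_score = verified_score + task_01_runs_credited
task_09_passes = 0
cost_per_pass = 0.00103

P1_CEILING = 22  # P1 claimed a score of at most 22/30

results = {
    "P1": verified_score <= P1_CEILING,
    "P2": task_09_passes == 0,
    "P3": verified_score >= 15 and cost_per_pass <= 0.0040,
}
correct = sum(results.values())

p1_miss_verified = verified_score - P1_CEILING
p1_miss_adjusted = adjusted_score - P1_CEILING

print(results)                             # {'P1': False, 'P2': True, 'P3': True}
print(f"{correct}/3 correct")              # 2/3 correct
print(p1_miss_verified, p1_miss_adjusted)  # 1 4
```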


Leaderboard

[Observed — cross-campaign data]

| Model | Score | $/pass | Lab |
| --- | --- | --- | --- |
| Claude Sonnet 4.6 | 28/30 | $0.0514 | Anthropic |
| GLM-4.7 | 28/30 | $0.0038 | Zhipu AI |
| Mistral Large 3 | 27/30 | $0.0021 | Mistral |
| Devstral 2 | 27/30 | $0.0020 | Mistral |
| GPT-OSS 20B | 25/30 | $0.00048 | OpenAI |
| GLM-4.7-Flash | 25/30 | $0.00057 | Zhipu AI |
| Kimi K2.5 | 24/30 | $0.0044 | Moonshot AI |
| Ministral 3 14B | 23/30 | $0.00103 | Mistral |
| GPT-OSS 120B | 23/30 | $0.0013 | OpenAI |
| Amazon Nova Pro | 20/30 | $0.0068 | Amazon |
| Llama 3.3 70B | 14/30 | $0.0047 | Meta |
| Nemotron Super 3 120B | 12/30 | $0.0016 | NVIDIA |
| Jamba 1.5 Large | 8/30 | $0.0044 | AI21 |

Ministral 14B sits at 23/30, the same score as GPT-OSS 120B (more than 8 times larger). At $0.00103/pass, it costs more per pass than GPT-OSS 20B and GLM-4.7-Flash but less than every other model above it on the leaderboard. On the adjusted score (26/30), it would sit between the 25/30 cluster and the 27/30 tier, a repositioning driven entirely by a missing pytest install.
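The positioning claims can be checked against the leaderboard rows. The sketch below restates the table as tuples (the layout is illustrative) and lists which higher-scoring models are cheaper per pass, plus the scores that would bracket an adjusted 26/30.

```python
# (model, passes out of 30, USD per pass), copied from the leaderboard.
leaderboard = [
    ("Claude Sonnet 4.6", 28, 0.0514),
    ("GLM-4.7", 28, 0.0038),
    ("Mistral Large 3", 27, 0.0021),
    ("Devstral 2", 27, 0.0020),
    ("GPT-OSS 20B", 25, 0.00048),
    ("GLM-4.7-Flash", 25, 0.00057),
    ("Kimi K2.5", 24, 0.0044),
    ("Ministral 3 14B", 23, 0.00103),
    ("GPT-OSS 120B", 23, 0.0013),
    ("Amazon Nova Pro", 20, 0.0068),
    ("Llama 3.3 70B", 14, 0.0047),
    ("Nemotron Super 3 120B", 12, 0.0016),
    ("Jamba 1.5 Large", 8, 0.0044),
]

TARGET = "Ministral 3 14B"
target_score, target_cost = next(
    (score, cost) for name, score, cost in leaderboard if name == TARGET
)

# Models that score higher, and the subset that is also cheaper per pass.
higher_scoring = [row for row in leaderboard if row[1] > target_score]
cheaper_and_higher = [name for name, _, cost in higher_scoring if cost < target_cost]

# Where the adjusted score (task_01 credited) would land.
ADJUSTED_SCORE = 26
other_scores = sorted({score for name, score, _ in leaderboard if name != TARGET})
below = max(s for s in other_scores if s < ADJUSTED_SCORE)
above = min(s for s in other_scores if s > ADJUSTED_SCORE)

print(cheaper_and_higher)  # ['GPT-OSS 20B', 'GLM-4.7-Flash']
print(below, above)        # 25 27
```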


What we don’t know

[Speculation]

task_01 will be re-run with pytest available (Ministral 8B campaign, next up). That will resolve whether the 3-point env gap is stable or whether Ministral 14B also has trouble with the test runner logic independent of the pytest issue.

task_08’s single wrong_answer failure is not explained in the transcripts by any obvious structural cause. Whether this is noise (1-in-3 run variance) or a specific failure mode requires more runs than we have.

The task_06 3/3 pass at 14B also needs more context. Is Mistral’s ambiguity-handling post-training unusually strong, or did the harness present a particularly legible form of ambiguity in this campaign? We’d need a different task_06 prompt to separate those hypotheses.

ClawWorks Weekly

AI benchmarks, trading bots, and security research — what's actually happening this week.