The model fixed the bug. The test runner didn't have pytest.

May 23, 2026 · campaign-reports

Campaign: 2026-05-23-ministral-3-14b-agentic-core-v1
Model: Mistral Ministral 3 14B (mistral.ministral-3-14b-instruct, AWS Bedrock, us-east-1, ON_DEMAND)
Architecture: Dense transformer — 14B parameters
Harness: agentic-core-v1 (10 tasks × 3 runs = 30 total)
Campaign date: 2026-05-23

Mistral’s Ministral 3 family is designed for efficient, small-footprint deployment. The 14B model at the top of that family is roughly 48 times smaller than Mistral Large 3 (675B), which scored 27/30 on the same harness. The pre-run expectation was modest: a score somewhere in the 17–22 range, a clean confirmation that 14B is viable at all, and a cost floor worth reporting.

Ministral 3 14B scored 23/30 (76.67%) at $0.00103 per passing task. The prediction was wrong by one point on the verified score. The model passed 7 of 10 tasks cleanly at 3/3 and an eighth at 2/3. The clean passes include task_07 (multi-step sequential writes), where a model more than eight times its size failed every attempt. The only tasks it failed outright were task_09, the impossible-computation task that requires refusing to answer and that almost every model in the dataset fails, and task_01, which failed for a reason unrelated to the model.

What the harness asks

[Observed — harness spec]

Ten tasks, three independent runs each. agentic-core-v1 covers software engineering work: fix a failing test, refactor duplicated code, investigate a log, trace execution through a codebase, make a targeted minimal fix, handle an ambiguous requirement, execute a multi-step sequential plan, recover from an injected tool error, recognize an impossible computation and refuse, run a SQL investigation.

Two tasks matter for context. task_07 asks the model to create four files in sequence (step1.txt through step4.txt), using fs_write only, in the prescribed order. No reading existing code, no error recovery. Pure sequential execution. task_09 presents a 7-row dataset and asks for a 10-day moving average; the correct answer is to recognize that seven data points cannot support a ten-day window and refuse. Claude Sonnet 4.6 is the only model in the dataset to have passed it, and only in one run of three.
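The task_09 condition is easy to state in code. The following is a minimal sketch, not the harness's checker: the function name, the exception type, and the sample values are all hypothetical, and only the 7-row and 10-day figures come from the task description.

```python
from typing import Sequence


class InsufficientDataError(ValueError):
    """Raised when the requested window is longer than the series."""


def moving_average(values: Sequence[float], window: int) -> list[float]:
    """Return the simple moving average, one value per full window."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        # The task_09 case: 7 rows cannot fill a 10-day window.
        raise InsufficientDataError(
            f"need at least {window} data points, got {len(values)}"
        )
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical 7-row dataset; the real task data is not reproduced here.
rows = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]
try:
    moving_average(rows, window=10)
except InsufficientDataError as exc:
    print(f"refused: {exc}")
```

A model that returns any number here, for example by silently shrinking the window to seven, has answered a different question from the one asked.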

A pass is correct task completion. Failure modes are classified: wrong_answer, gave_up_mid_plan, tool_call_hallucinated, tool_call_malformed.

What happened

[Observed — data pack per_task_results, verified: pass_rate_by_task.sql]

Task	Score	Avg latency	Notes
task_01 fix_failing_test	0/3	4.1s	Env confound — see below
task_02 refactor_duplicated_code	3/3	5.5s	Clean
task_03 investigate_log	3/3	1.9s
task_04 trace_through_codebase	3/3	4.2s
task_05 minimal_fix	3/3	3.9s
task_06 handle_ambiguous_requirement	3/3	3.2s	Above prediction
task_07 multi_step_plan	3/3	1.4s	4 tool calls per run, no variation
task_08 recover_from_tool_error	2/3	1.2s	1 wrong_answer
task_09 know_when_to_stop	0/3	1.5s	Expected
task_10 sql_investigation	3/3	1.6s

Total: 23/30. $0.0238 campaign cost. $0.00103/pass.
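The totals follow directly from the per-task rows. A short sketch of the arithmetic, with the scores transcribed by hand from the table above (the variable names are illustrative, not the data pack's schema):

```python
# (passes, runs) per task, copied from the results table.
scores = {
    "task_01": (0, 3),
    "task_02": (3, 3),
    "task_03": (3, 3),
    "task_04": (3, 3),
    "task_05": (3, 3),
    "task_06": (3, 3),
    "task_07": (3, 3),
    "task_08": (2, 3),
    "task_09": (0, 3),
    "task_10": (3, 3),
}
campaign_cost_usd = 0.0238

passes = sum(p for p, _ in scores.values())
runs = sum(r for _, r in scores.values())
clean_tasks = sum(1 for p, r in scores.values() if p == r)
cost_per_pass = campaign_cost_usd / passes

print(f"{passes}/{runs} = {passes / runs:.2%}")   # 23/30 = 76.67%
print(f"tasks passed 3/3: {clean_tasks}")         # 7
print(f"${cost_per_pass:.5f} per pass")           # $0.00103 per pass
```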

Failure mode histogram (verified: failure_mode_histogram.sql): wrong_answer 6, tool_call_hallucinated 1, gave_up_mid_plan 0, tool_call_malformed 0, infrastructure_error 0.

task_01: the environment was wrong, not the model

[Observed — transcript refs: 3c74cd7e, 1b88788e, 2a33baf3]

task_01 scored 0/3. In the evidence index this shows as tool_call_hallucinated (run 1) and wrong_answer (runs 2 and 3). Neither classification reflects a model capability failure.

Here is what happened in run 1 (transcript 3c74cd7e):

Turn 1: model calls fs_read(tests/test_add.py) and fs_read(src/add.py) simultaneously
Turn 3: model receives the source file — return a - b # BUG: should be +
Turn 4: “The bug is in the add.py file. The function add is incorrectly subtracting b from a instead of adding them (i.e., return a - b instead of return a + b). ### Fix Replace return a - b with return a + b.”
Turn 5: model calls fs_write, gets OK: wrote 68 bytes to src/add.py
Turn 6: model calls shell(bash run_tests.sh)
Turn 7: result — /usr/bin/python3: No module named pytest
Turn 8: model attempts to inspect run_tests.sh to find an alternative, emits a tool call with an empty tool name (the hallucination)
Turn 9: ERROR: tool '' not in inventory

The model correctly identified the bug, correctly wrote the fix, and then hit a test runner that could not execute. Runs 2 and 3 followed the same diagnostic and fix path before failing on the same No module named pytest wall.

Post-run filesystem inspection confirms the edit was applied correctly in all three runs. src/add.py reads return a + b after each run. The checker rejected the result because bash run_tests.sh returned non-zero.

This is a harness environment hygiene failure. The python3 resolved by the shell tool did not have pytest installed. This is the first time this confound has appeared in the campaign dataset; prior campaigns ran with system-level pytest available. The latent risk was already there; this run surfaced it.
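One way to surface this class of confound before a campaign starts is a preflight check against the interpreter the shell tool will use. The sketch below is an assumption about how such a check could look, not part of agentic-core-v1: the function name is hypothetical, and it assumes the harness can name the interpreter path it will hand to the shell tool.

```python
import subprocess
import sys


def interpreter_has_module(python: str, module: str) -> bool:
    """Return True if the given interpreter can locate a top-level module."""
    probe = (
        "import importlib.util, sys; "
        f"sys.exit(0 if importlib.util.find_spec({module!r}) else 1)"
    )
    result = subprocess.run([python, "-c", probe], capture_output=True)
    return result.returncode == 0


# Assumption: sys.executable stands in for whatever python3 the shell tool resolves.
if interpreter_has_module(sys.executable, "pytest"):
    print("preflight ok: pytest is importable")
else:
    print("preflight failed: pytest missing, task_01 results would be confounded")
```

Running the probe in a subprocess matters: the question is whether the task's interpreter has pytest, not whether the harness process does.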

If task_01 is credited for code correctness rather than checker pass, the adjusted score is 26/30. That would place Ministral 14B between the 25/30 cluster (GPT-OSS 20B, GLM-4.7-Flash, DeepSeek V4-Flash) and the 27/30 tier (Devstral 2, Mistral Large 3). A 14B model sitting in that band is a data point worth noting even if the verified score stays 23.

task_07: 14B dense passes where 120B MoE fails

[Observed — data pack task_07_results, verified: tool_calls_by_task.sql; cross-campaign comparison]

task_07 scored 3/3. Four fs_write calls per run, average 1.44 seconds, total cost $0.00061 across all three runs. No variation between runs. The model read the instructions, dispatched the writes in sequence, and finished.

The comparison that matters here: Nemotron Super 3 120B, with 12B active parameters (MoE architecture), failed task_07 0/3 in its campaign. It is one of two models in the dataset to completely fail this task. Kimi K2 Thinking also scored 0/3 on task_07 in its campaign. Ministral 14B (dense) passes it without any apparent difficulty.

This is not a subtle difference. task_07 is explicitly specified: create four files in order, use only fs_write. The instruction states exactly what to do, so completing it requires following directions, not reasoning. The results alone do not establish a cause, but the most plausible reading is that Nemotron Super 3 120B’s post-training did not reliably produce explicit sequential dispatch in this context and Ministral 14B’s did.
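To make concrete how little task_07 demands, here is a sketch of what a pass condition could look like. This is a hypothetical checker, not the harness's own: the function name and the (tool, path) representation are assumptions, and the step2.txt and step3.txt names are inferred from "step1.txt through step4.txt".

```python
# Assumed file order; only the first and last names are stated in the task description.
EXPECTED_FILES = ["step1.txt", "step2.txt", "step3.txt", "step4.txt"]


def follows_plan(tool_calls: list[tuple[str, str]]) -> bool:
    """Check a run's tool calls, given as (tool_name, path) in emission order."""
    # Any tool other than fs_write breaks the "fs_write only" rule.
    if any(tool != "fs_write" for tool, _ in tool_calls):
        return False
    # Exactly four writes, in the prescribed order.
    return [path for _, path in tool_calls] == EXPECTED_FILES


clean_run = [("fs_write", name) for name in EXPECTED_FILES]
print(follows_plan(clean_run))  # True
```

A run fails this check by reordering the writes, skipping one, or reaching for a tool it was told not to use; none of those require the model to misunderstand anything.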

[Speculation]

The usual framing that “larger models perform better on agentic tasks” holds when tasks require inference under uncertainty. It may not hold when tasks require clean instruction-following. On explicit execution tasks (tasks where the full plan is in the prompt and the model just needs to execute it), post-training depth appears to matter more than active parameter count. The Nemotron vs. Ministral comparison is a clean illustration.

Whether this generalizes beyond task_07 is unclear. We have one data point. But the data point is stark enough to flag.

task_06: the surprising pass

[Observed — data pack task_06_results]

The pre-run prediction had task_06 (handle ambiguous requirement) at around 2/3. It passed 3/3.

task_06 presents an ambiguous engineering request and asks the model to produce a clarification response: recognize the ambiguity and ask the right question rather than charge ahead with an assumption. At 14B, the expectation was that the model would sometimes miss the ambiguity and produce a direct implementation attempt instead.

It did not. Three clean passes, average latency 3.2 seconds.

[Unobserved]

We did not see any runs where the model correctly identified the ambiguity but then produced a follow-up clarification that was malformed or insufficiently specific. All three passes scored by the checker as complete. Whether the quality of the clarification varied across runs is not captured by the binary pass/fail.

The failure profile

[Observed — data pack failure_mode_histogram, verified: failure_mode_histogram.sql]

Six wrong_answer failures. One tool_call_hallucinated (the empty-name tool call from task_01 run 1 described above, a secondary effect of the pytest environment failure). Zero gave_up_mid_plan. Zero infrastructure errors.

The clean failure profile is more informative than the failures themselves. The model never stalled mid-task, never hit a context overflow, never refused a request it should have handled. Failures are concentrated in two specific structural locations: the task_01 env confound (all three) and task_09 (all three). The task_08 wrong_answer (1 run) is the only failure not explained by a known structural cause.
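The histogram can be reconciled with the per-task scores. In this sketch the task_01 and task_08 modes come from the text above; the task_09 modes are inferred, since six wrong_answer failures minus two from task_01 and one from task_08 leaves three.

```python
from collections import Counter

# (task, failure_mode) for each of the seven failed runs.
failures = [
    ("task_01", "tool_call_hallucinated"),  # run 1, empty tool name
    ("task_01", "wrong_answer"),            # run 2
    ("task_01", "wrong_answer"),            # run 3
    ("task_08", "wrong_answer"),
    ("task_09", "wrong_answer"),            # inferred from the histogram total
    ("task_09", "wrong_answer"),            # inferred
    ("task_09", "wrong_answer"),            # inferred
]

by_mode = Counter(mode for _, mode in failures)
by_task = Counter(task for task, _ in failures)

print(dict(by_mode))  # {'tool_call_hallucinated': 1, 'wrong_answer': 6}
print(dict(by_task))  # {'task_01': 3, 'task_08': 1, 'task_09': 3}
print(len(failures) == 30 - 23)  # True
```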

For a 14B model on agentic workloads, this is a consistent profile. It fails where expected and does not scatter random failures across the run.

Cost

[Observed — data pack summary, verified: cost_breakdown.sql]

$0.0238 total. $0.00103 per passing task.

Ministral 14B uses symmetric pricing: $0.20 per 1M input tokens and $0.20 per 1M output tokens. That is unusual in this dataset, where most models charge a higher rate for output tokens. On output-heavy tasks (task_02 refactoring averaged 2,121 output tokens), the symmetric pricing provides some savings.
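The effect of symmetric pricing on an output-heavy task can be sketched as follows. Only the $0.20 rate and the 2,121-token average come from this report; the $0.60 output rate is a hypothetical stand-in for an asymmetric price list, and input tokens are set to zero to isolate the output side.

```python
def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_m: float,
    output_rate_per_m: float,
) -> float:
    """Token cost in USD, given per-million-token rates."""
    return (
        input_tokens * input_rate_per_m + output_tokens * output_rate_per_m
    ) / 1_000_000


OUTPUT_TOKENS = 2121  # task_02 average from this campaign

symmetric = request_cost_usd(0, OUTPUT_TOKENS, 0.20, 0.20)
asymmetric = request_cost_usd(0, OUTPUT_TOKENS, 0.20, 0.60)  # hypothetical rate

print(f"symmetric output cost:  ${symmetric:.7f}")   # $0.0004242
print(f"asymmetric output cost: ${asymmetric:.7f}")  # $0.0012726
```

At these volumes the absolute difference is a fraction of a cent per run; it only becomes material at deployment scale.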

The nearest comparators:

Model	Score	$/pass	Params
GPT-OSS 20B	25/30	$0.00048	20B
GLM-4.7-Flash	25/30	$0.00057	—
Ministral 14B	23/30	$0.00103	14B
GLM-4.7	28/30	$0.0038	—
Kimi K2.5	24/30	$0.0044	—

GPT-OSS 20B and GLM-4.7-Flash are both cheaper per pass and score 2 points higher. The cost gap is real. Whether it matters depends on deployment context: symmetric pricing, Bedrock availability, and Mistral’s enterprise tooling are all factors that can shift the comparison.

Predictions

[Observed — brief predictions section]

Prediction	Claim	Result
P1	Score ≤ 22/30	WRONG — scored 23/30
P2	task_09 0/3	CORRECT — 0/3
P3	Score ≥ 15/30, $/pass ≤ $0.0040	CORRECT — 23/30, $0.00103/pass

2/3 correct. P1 missed by the narrowest possible margin on the verified score (one point). On the adjusted score (task_01 env confound credited), the miss grows to four points. The model simply performed better than expected on task_06 and task_07, tasks where 14B-sized models had shown degradation in prior campaigns.

P2 was the safe call. task_09 has been 0/3 for every model in the dataset except Claude Sonnet 4.6 (1/3). It held.
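The prediction outcomes and the two P1 margins can be re-derived from the numbers in this report. A minimal sketch, with the thresholds copied from the table above and variable names chosen for illustration:

```python
verified_score = 23
adjusted_score = verified_score + 3  # crediting the three task_01 runs
task_09_passes = 0
cost_per_pass = 0.00103

predictions = {
    "P1": verified_score <= 22,
    "P2": task_09_passes == 0,
    "P3": verified_score >= 15 and cost_per_pass <= 0.0040,
}

correct = sum(predictions.values())
p1_miss_verified = verified_score - 22
p1_miss_adjusted = adjusted_score - 22

print(predictions)                    # {'P1': False, 'P2': True, 'P3': True}
print(f"{correct}/3 correct")         # 2/3 correct
print(p1_miss_verified, p1_miss_adjusted)  # 1 4
```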

Leaderboard

[Observed — cross-campaign data]

Model	Score	$/pass	Lab
Claude Sonnet 4.6	28/30	$0.0514	Anthropic
GLM-4.7	28/30	$0.0038	Zhipu AI
Mistral Large 3	27/30	$0.0021	Mistral
Devstral 2	27/30	$0.0020	Mistral
GPT-OSS 20B	25/30	$0.00048	OpenAI
GLM-4.7-Flash	25/30	$0.00057	Zhipu AI
Kimi K2.5	24/30	$0.0044	Moonshot AI
Ministral 3 14B	23/30	$0.00103	Mistral
GPT-OSS 120B	23/30	$0.0013	OpenAI
Amazon Nova Pro	20/30	$0.0068	Amazon
Llama 3.3 70B	14/30	$0.0047	Meta
Nemotron Super 3 120B	12/30	$0.0016	NVIDIA
Jamba 1.5 Large	8/30	$0.0044	AI21

Ministral 14B sits at 23/30, the same score as GPT-OSS 120B (more than eight times larger). At $0.00103/pass, it costs more per pass than GPT-OSS 20B and GLM-4.7-Flash but less than every other model above it on the leaderboard. On the adjusted score (26/30), it would sit between the 25/30 cluster and the 27/30 tier, a repositioning driven entirely by a missing pytest install.
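The cost position can be checked against the leaderboard rows directly. A small sketch, with the table transcribed by hand as (model, score, $/pass) tuples:

```python
leaderboard = [
    ("Claude Sonnet 4.6", 28, 0.0514),
    ("GLM-4.7", 28, 0.0038),
    ("Mistral Large 3", 27, 0.0021),
    ("Devstral 2", 27, 0.0020),
    ("GPT-OSS 20B", 25, 0.00048),
    ("GLM-4.7-Flash", 25, 0.00057),
    ("Kimi K2.5", 24, 0.0044),
    ("Ministral 3 14B", 23, 0.00103),
    ("GPT-OSS 120B", 23, 0.0013),
    ("Amazon Nova Pro", 20, 0.0068),
    ("Llama 3.3 70B", 14, 0.0047),
    ("Nemotron Super 3 120B", 12, 0.0016),
    ("Jamba 1.5 Large", 8, 0.0044),
]

_, ref_score, ref_cost = next(r for r in leaderboard if r[0] == "Ministral 3 14B")

# Models that score at least as well as Ministral 3 14B and cost less per pass.
cheaper_and_no_worse = [
    name for name, score, cost in leaderboard
    if score >= ref_score and cost < ref_cost
]
print(cheaper_and_no_worse)  # ['GPT-OSS 20B', 'GLM-4.7-Flash']
```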

What we don’t know

[Speculation]

task_01 will be re-run with pytest available (Ministral 8B campaign, next up). That will resolve whether the 3-point env gap is stable or whether Ministral 14B also has trouble with the test runner logic independent of the pytest issue.

task_08’s single wrong_answer failure is not explained in the transcripts by any obvious structural cause. Whether this is noise (1-in-3 run variance) or a specific failure mode requires more runs than we have.

The task_06 3/3 pass at 14B also needs more context. Is Mistral’s ambiguity-handling post-training unusually strong, or did the harness present a particularly legible form of ambiguity in this campaign? We’d need a different task_06 prompt to separate those hypotheses.

ClawWorks Weekly