The 3B matches the 14B. One task apart.

May 23, 2026 · campaign-reports

Campaign: 2026-05-23-ministral-3-3b-agentic-core-v1
Model: Mistral Ministral 3 3B (mistral.ministral-3-3b-instruct, AWS Bedrock, us-east-1, ON_DEMAND)
Architecture: Dense transformer, 3B parameters
Harness: agentic-core-v1 (10 tasks × 3 runs = 30 total)
Campaign date: 2026-05-23

The research question going in was simple: does 3 billion parameters represent an absolute capability floor for agentic execution, or does the Ministral family’s training hold at small scale?

Rigg’s point estimate was 12/30. He was off by 10.

Ministral 3 3B scored 22/30 (73.33%) at $0.000787 per passing task. That puts it one point behind the 14B sibling. With this result, the Ministral 3 family is now fully mapped across all four sizes. The 3B and 14B sit within one point of each other, and the 675B flagship is four points above the 14B. The 8B tops the family at 28/30, six points above the 3B, and remains the anomaly the family has no good explanation for.

What the harness asks

[Observed — harness spec] (Tags used throughout: Observed = directly in the data; Unobserved = not confirmed from source; Speculation = inferred.)

Ten tasks, three independent runs each. agentic-core-v1 covers software engineering work: fix a failing test, refactor duplicated code, investigate a log, trace execution through a codebase, make a targeted minimal fix, handle an ambiguous requirement, execute a multi-step sequential plan, recover from an injected tool error, recognize an impossible computation and refuse, run a SQL investigation.

Two tasks anchor the dataset. task_07 asks the model to create four files in sequence (step1.txt through step4.txt), using fs_write only, in prescribed order. No error recovery, no reading. Pure sequential execution. task_09 presents a three-entry dataset and asks for a 10-day moving average; the correct answer is to recognize that three data points cannot support a ten-day window and refuse. Only three models in the full dataset have ever passed a single run of task_09: Claude Sonnet 4.6, Ministral 3 8B, and Kimi K2 Thinking, each at 1/3.
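The behavior task_09 looks for can be sketched in a few lines. This is an illustrative guard, not the harness's checker: the function name, the choice of ValueError as the "refusal", and the sample values are all assumptions.

```python
def moving_average(values, window):
    """Return simple moving averages of `values` over `window` points.

    Raises ValueError when the series is shorter than the window, which
    stands in here for the "refuse" outcome task_09 expects.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        raise ValueError(
            f"cannot compute a {window}-point moving average "
            f"from {len(values)} data points"
        )
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical three-entry dataset, standing in for the task's input.
three_entries = [10.0, 12.0, 11.0]

try:
    moving_average(three_entries, window=10)
    refused = False
except ValueError:
    refused = True  # three points cannot fill a ten-point window
```

A model that passes task_09 is doing the equivalent of the length check before it reaches for a calculation tool.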

A pass is correct task completion. Failure modes are labeled: wrong_answer (the checker rejected the output), gave_up_mid_plan (model stopped mid-execution without finishing), tool_call_hallucinated (invoked a non-existent tool), tool_call_malformed (sent a syntactically invalid tool call).

What happened

[Observed — data pack tasks, verified: pass_rate_by_task.sql]

Task	Score	Avg latency	Cost	Notes
task_01 fix_failing_test	3/3	2.1s	$0.0008	14B scored 0/3 (env confound, not a 3B win)
task_02 refactor_duplicated_code	2/3	2.9s	$0.0014	1 wrong_answer (run 1)
task_03 investigate_log	3/3	1.8s	$0.0083	48% of total budget (see below)
task_04 trace_through_codebase	2/3	2.5s	$0.0008	1 wrong_answer (run 1)
task_05 minimal_fix	2/3	3.9s	$0.0017	1 tool_call_hallucinated (run 3)
task_06 handle_ambiguous_requirement	1/3	2.6s	$0.0008	Clear weak point: 2 wrong_answer
task_07 multi_step_plan	3/3	1.1s	$0.0003	Clean
task_08 recover_from_tool_error	3/3	1.1s	$0.0003	Clean
task_09 know_when_to_stop	0/3	5.3s	$0.0022	Three distinct failure modes
task_10 sql_investigation	3/3	1.4s	$0.0006	Clean

Total: 22/30. $0.0173 campaign cost. $0.000787/pass.

Failure mode histogram (verified: failure_mode_histogram.sql): wrong_answer 5, tool_call_hallucinated 1, tool_call_malformed 1, gave_up_mid_plan 1. Zero context_overflow, zero infrastructure_error.
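The totals can be cross-checked against the per-task table with a short tally. This is a sketch over the rounded figures printed above, not the output of pass_rate_by_task.sql; the variable names are illustrative. Because the per-task costs are rounded to four decimals, they sum to about $0.0172, so the reported $0.0173 campaign total is used for the cost-per-pass figure, which lands within rounding of the reported $0.000787.

```python
# Per-task results copied from the table above: (passes out of 3, cost in USD).
results = {
    "task_01": (3, 0.0008),
    "task_02": (2, 0.0014),
    "task_03": (3, 0.0083),
    "task_04": (2, 0.0008),
    "task_05": (2, 0.0017),
    "task_06": (1, 0.0008),
    "task_07": (3, 0.0003),
    "task_08": (3, 0.0003),
    "task_09": (0, 0.0022),
    "task_10": (3, 0.0006),
}

# Failure mode histogram as reported.
failure_modes = {
    "wrong_answer": 5,
    "tool_call_hallucinated": 1,
    "tool_call_malformed": 1,
    "gave_up_mid_plan": 1,
}

total_runs = 3 * len(results)
total_passes = sum(passes for passes, _ in results.values())
total_failures = total_runs - total_passes

# Sum of the rounded per-task costs (differs slightly from the reported total).
rounded_cost_sum = sum(cost for _, cost in results.values())

reported_campaign_cost = 0.0173
cost_per_pass = reported_campaign_cost / total_passes
```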

Why does the 3B nearly match the 14B?

[Observed — data pack summary]

Ministral 3 3B scored 22/30. Ministral 3 14B scored 23/30. One task apart across 30 runs per model: statistically indistinguishable at these sample sizes.
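One way to make "statistically indistinguishable" concrete is a two-sided Fisher exact test on the two pass counts. The sketch below is a plain-Python implementation added for illustration; it is not part of the campaign's tooling, and it treats all 30 runs per model as independent trials, which is a simplifying assumption given that runs are grouped by task.

```python
from math import comb


def fisher_exact_two_sided(pass_a, n_a, pass_b, n_b):
    """Two-sided Fisher exact p-value for two pass counts.

    Sums the probability of every 2x2 table (with the same margins) that is
    no more likely than the observed one.
    """
    total_pass = pass_a + pass_b
    lowest = max(0, total_pass - n_b)
    highest = min(n_a, total_pass)
    # Unnormalized hypergeometric weight for each possible pass count in group A.
    weights = {
        k: comb(n_a, k) * comb(n_b, total_pass - k)
        for k in range(lowest, highest + 1)
    }
    observed = weights[pass_a]
    as_or_more_extreme = sum(w for w in weights.values() if w <= observed)
    return as_or_more_extreme / sum(weights.values())


# 3B: 22/30, 14B: 23/30
p_value = fisher_exact_two_sided(22, 30, 23, 30)
```

For 22/30 against 23/30 this gives a p-value of 1.0: with equal group sizes, the observed split is one of the two most likely tables, so no table counts as less extreme. A one-run gap carries no evidence of a difference at n = 30.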

The family picture is now complete:

Model	Score	$/pass	Params
Ministral 3 8B	28/30 (93%)	$0.00067	8B
Mistral Large 3 675B	27/30 (90%)	$0.00210	675B
Ministral 3 14B	23/30 (77%)	$0.00103	14B
Ministral 3 3B	22/30 (73%)	$0.00079	3B

(verified: cost_breakdown.sql)

The 3B is 4.7× smaller than the 14B by parameter count. It costs 24% less per pass. The quality difference is one task. Conventional scaling intuition says that cutting parameters 4.7× should cost a proportional share of capability. On this harness, it does not.
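The ratios in that paragraph follow directly from the table. A quick check, using nominal parameter counts of 3B and 14B (an assumption; exact counts are not given here) and the per-pass costs reported above:

```python
# Nominal parameter counts (assumed from the model names).
params_3b = 3e9
params_14b = 14e9

# Cost per passing run, from the report.
cost_per_pass_3b = 0.000787
cost_per_pass_14b = 0.00103

# Scores out of 30.
score_3b, score_14b = 22, 23

size_ratio = params_14b / params_3b                    # about 4.67
cost_saving = 1 - cost_per_pass_3b / cost_per_pass_14b  # about 0.236
score_gap = score_14b - score_3b                       # 1 passing run
```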

The 8B continues to be the anomaly. It outscores the 14B by five points and the 3B by six, and at $0.00067 per pass it is cheaper than both, including the smaller 3B. None of the straightforward training-efficiency narratives cleanly explain it. The 3B and 14B cluster together; the 8B sits above both; the 675B flagship slots between the 8B and that cluster. The family does not form a clean ladder.

[Speculation]

The durable performance at 3B is consistent with a training distribution heavily weighted toward explicit, execution-oriented tasks rather than open-ended generation. At 3B, the model appears to retain the signal that matters for agentic work: read, identify, write, verify. The capacity that degrades is interpretation under uncertainty. That tradeoff is visible in the task breakdown: task_06 and task_09 are where the 3B parameter budget runs out.

task_01: a passing run in a clean environment

[Observed — transcript ref: 71b91b19-879b-47ef-85af-77ce9665dfaa]

The 14B campaign recorded 0/3 on task_01 because the test environment had python3 mapped to a system install without pytest, causing all post-fix verification to fail even though the code edits were correct. The 3B ran in a subsequent session where the environment was clean. The 3B scored 3/3 on task_01.

Run 1 (transcript 71b91b19, turn 5):

“The issue has been identified: In the file src/add.py, the function add incorrectly performs subtraction instead of addition. It should simply return the sum of a and b, changing a - b to a + b.”

The model read both files in turns 1 and 3, diagnosed the bug, wrote the fix in turn 5 with fs_write, ran bash run_tests.sh in turn 7, and received exit=0 with 2 passed tests. Four tool calls, no re-reads, no hedging. Total latency 2.1s.

This is not a case where 3B outperforms 14B on this task. The right reading is that both models are likely 3/3-capable on task_01, and the 14B result is contaminated by a harness environment failure. The apparent 3B > 14B comparison on this task is an artifact.

[Unobserved]

The 3B showed no diagnosis-then-regression pattern in any of its 30 runs (verified: diagnosis_then_regression.md evidence bundle, 0/30). In no run did the model state a diagnosis and then walk it back.

task_03: 82,000 tokens and 48% of the budget

[Observed — data pack task breakdown, verified: cost_breakdown.sql]

task_03 (investigate_log) consumed 82,460 input tokens across three runs at a total cost of $0.0083, which is 48% of the entire $0.0173 campaign budget for a single task. Per-run average: approximately 27,487 input tokens.

At $0.10/1M input pricing, $0.0028/run is still cheap in absolute terms. But the token volume reveals a pattern: the model re-reads the entire log file rather than working from its first read. Transcript e6f4b513 (run 3) shows the model issuing fs_read("access.log") at turn 1 and again at turn 3, then issuing a targeted grep in turn 5. The second full-file read returned the same content as the first.
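The task_03 figures can be reproduced from the reported token count and price. This is a sketch using only numbers stated in this report; the variable names are illustrative.

```python
input_tokens_total = 82_460      # task_03 input tokens across three runs
runs = 3
price_per_million_input = 0.10   # USD per 1M input tokens
task_cost_total = 0.0083         # reported task_03 cost, input plus output
campaign_cost = 0.0173           # reported campaign total

tokens_per_run = input_tokens_total / runs

# Cost attributable to input tokens alone.
input_cost_total = input_tokens_total * price_per_million_input / 1_000_000

# Per-run cost including output, which is what rounds to $0.0028.
total_cost_per_run = task_cost_total / runs

# Share of the whole campaign budget spent on this one task.
budget_share = task_cost_total / campaign_cost
```

Input tokens alone account for roughly $0.0082 of the $0.0083 spent on the task, which is why the read pattern matters more than anything the model writes.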

This is consistent across the Ministral family. The 14B also dominates its campaign budget on task_03. The read-then-re-read approach is not a 3B-specific behavior.

For deployers with log-investigation workflows: the cost profile is dominated by input volume, not output. Input and output are priced identically ($0.10/1M each), so cost tracks token count, and almost all of the tokens here are input. If the log file is large, task_03-style workloads will concentrate budget there regardless of model size within this family.

task_09: three runs, three failure modes

[Observed — brief failure mode classification, verified: failure_mode_histogram.sql]

task_09 asks the model to compute a 10-day moving average from a dataset with three entries. The correct answer is to recognize that three data points cannot support a ten-day window and refuse. The 3B failed all three runs, but with unusual variety:

Run 1: wrong_answer. Confident numeric output, no acknowledgment of the constraint.
Run 2: tool_call_malformed. Attempted to dispatch a calculation tool with invalid arguments.
Run 3: gave_up_mid_plan. Started the computation, hit an intermediate failure, stopped without returning an answer.

The 14B failed all three with wrong_answer consistently. The 3B shows higher run-to-run instability on this task. It does not reliably commit to a single failure strategy.
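The instability claim can be expressed as a count of distinct failure labels per model on this task. A minimal sketch over the outcomes listed above; the data structure and function name are illustrative, not the harness's schema.

```python
from collections import Counter

# task_09 outcomes per run, as reported above.
task_09_outcomes = {
    "Ministral 3 3B": ["wrong_answer", "tool_call_malformed", "gave_up_mid_plan"],
    "Ministral 3 14B": ["wrong_answer", "wrong_answer", "wrong_answer"],
}


def distinct_failure_modes(outcomes):
    """Count how many different failure labels appear, ignoring passes."""
    return len(Counter(label for label in outcomes if label != "pass"))


variety = {
    model: distinct_failure_modes(outcomes)
    for model, outcomes in task_09_outcomes.items()
}
```

Both models score 0/3, so the pass rate hides the difference; the label count (3 for the 3B, 1 for the 14B) is what separates them.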

The task_09 pass list remains short. Three models have ever produced a single passing run in the dataset: Claude Sonnet 4.6 (1/3), Ministral 3 8B (1/3), and Kimi K2 Thinking (1/3). The 3B is not among them.

[Unobserved]

No long-tail turn patterns appeared in task_09 or anywhere in the campaign. All 30 runs completed within normal turn budgets (verified: long_tail_turn_count.md evidence bundle, 0/30).

task_06: where semantic interpretation fails

[Observed — data pack task breakdown, verified: pass_rate_by_task.sql]

task_06 (handle_ambiguous_requirement) scored 1/3, the 3B’s single clear weak point. Two runs returned wrong_answer: the model chose an interpretation of the underspecified contract and committed to it, but the implicit constraint it picked was not the one the task checker expected.

The larger Ministral models handle this task cleanly, as does the related Magistral Small 2509: 8B (3/3), 14B (3/3), Magistral Small 2509 (3/3). The 3B is the only Ministral model that fails it more often than it passes.

The pattern is consistent with a model that has learned the syntactic form of ambiguity resolution: pick an interpretation, commit, produce output. But it lacks the semantic grounding to pick the right one at 3B scale. Explicit structured execution (task_07, task_08, task_10: all 3/3) is robust. Interpretation under uncertainty (task_06, task_09) is where the 3B parameter count shows.

[Speculation]

This boundary (structured execution holds, semantic interpretation degrades) may be the distinguishing feature of small-scale models that have been fine-tuned specifically for agentic work. The fine-tuning can preserve execution fidelity while the underlying semantic capacity remains parameter-bound. Checking whether Ministral 3 3B’s task_06 performance improves with a system prompt that narrows the interpretation space would test this directly.

Predictions vs reality

[Observed — brief predictions section]

Rigg’s pre-run predictions:

Prediction	Result
Point estimate: 12/30	Actual: 22/30, off by 10 ❌
P1: Score ≤ 20/30	Actual 22, wrong ❌
P2: task_07 ≥ 2/3	Actual 3/3, correct ✅
P3: task_09 = 0/3	Actual 0/3, correct ✅

2 of 3 predictions correct. Point estimate off by 10 points.
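The scorecard reduces to three boolean checks and one subtraction. A sketch using the values from the table above; the dictionary keys are illustrative labels.

```python
actual_total = 22
actual_by_task = {"task_07": 3, "task_09": 0}
point_estimate = 12

# Each prediction evaluated against the observed result.
predictions = {
    "P1: score <= 20/30": actual_total <= 20,
    "P2: task_07 >= 2/3": actual_by_task["task_07"] >= 2,
    "P3: task_09 == 0/3": actual_by_task["task_09"] == 0,
}

correct = sum(predictions.values())                 # number of predictions that held
point_error = abs(actual_total - point_estimate)    # miss on the point estimate
```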

The PE failure is the important one. The prior used Gemma 3 4B (11/30) as the reference point for performance at 4B parameters and below. That prior was wrong, most plausibly because the Ministral 3 family’s training distribution is materially different from Gemma’s. Gemma 3 4B at 11/30 is not predictive of Ministral 3 3B’s capability. Parameter count is at best a within-family signal; cross-family comparisons at similar sizes do not hold.

Where does 22/30 land in the dataset?

[Observed — data pack positioning table]

Model	Score	$/pass	Notes
Claude Sonnet 4.6	28/30	$0.00130
GLM-4.7	28/30	$0.00028
Ministral 3 8B	28/30	$0.00067	Family anomaly
Mistral Large 3 675B	27/30	$0.00210
GLM-5	27/30	$0.00058
GPT-OSS 20B	25/30	$0.00048
Kimi K2.5	24/30	$0.00044
Ministral 3 14B	23/30	$0.00103
Magistral Small 2509	23/30	$0.00270
Ministral 3 3B	22/30	$0.00079

(Partial dataset, shown for cost-tier context)

At $0.000787/pass, the 3B sits in the cost-efficient mid-tier of the dataset, more expensive per pass than GLM-4.7, GPT-OSS 20B, and Kimi K2.5 but cheaper than Claude Sonnet 4.6 and Magistral Small 2509. Given that it runs on-demand from Bedrock at $0.10 per 1M tokens for both input and output, with no provisioned throughput requirement, the deployment friction is low.
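Sorting the positioning table by cost per pass shows where the 3B falls among the models listed. This is a sketch over the partial table above, so the rank applies only to these ten rows.

```python
# (model, score out of 30, USD per pass), copied from the positioning table.
dataset = [
    ("Claude Sonnet 4.6", 28, 0.00130),
    ("GLM-4.7", 28, 0.00028),
    ("Ministral 3 8B", 28, 0.00067),
    ("Mistral Large 3 675B", 27, 0.00210),
    ("GLM-5", 27, 0.00058),
    ("GPT-OSS 20B", 25, 0.00048),
    ("Kimi K2.5", 24, 0.00044),
    ("Ministral 3 14B", 23, 0.00103),
    ("Magistral Small 2509", 23, 0.00270),
    ("Ministral 3 3B", 22, 0.00079),
]

by_cost = sorted(dataset, key=lambda row: row[2])
names_by_cost = [name for name, _, _ in by_cost]

# 1-based rank of the 3B when ordered cheapest first.
rank_3b = names_by_cost.index("Ministral 3 3B") + 1
cheaper_than_3b = names_by_cost[: rank_3b - 1]
```

On these rows the 3B is sixth cheapest of ten: five models cost less per pass and four cost more.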

The Ministral 3 3B is not where the floor is. The floor, for this family, appears to be somewhere below 3B.