The base model that proved versioning works

May 24, 2026 · campaign-reports

Campaign: 2026-05-24-minimax-m2-agentic-core-v1
Model: MiniMax M2 (minimax.minimax-m2, AWS Bedrock us-east-1, ON_DEMAND)
Architecture: Mixture-of-experts (MoE) transformer
Harness: agentic-core-v1 (10 tasks x 3 runs = 30 total)
Campaign date: 2026-05-24

MiniMax M2.5 is in this dataset. Scored 27/30. M2.1 is also here. Scored 28/30, beating M2.5 despite having the lower version suffix. Both results raised a reasonable question: where does M2, the base generation, actually sit?

M2 scored 24/30.

Four points below M2.1. Three below M2.5. The base model is definitively the weakest variant in the family — and it is also the most expensive per correct result. With all three MiniMax variants now in the dataset, the picture is clear: MiniMax’s version numbering tracks real capability improvement, not marketing labels. That is rarer than it sounds.

What the harness asks

[Observed — harness spec]

Ten tasks, three independent runs each, 30 total. agentic-core-v1 covers the practical core of agentic software work: fix a failing test, refactor duplicated code, investigate a log file, trace execution through a codebase, implement a function against an ambiguous spec, apply a minimal targeted fix, execute a multi-step sequential plan, recover from an injected tool error, recognize an impossible computation, and run a SQL investigation.

Pass means the task checker accepted the output. Failure modes are classified: wrong_answer means the checker rejected the output because it did not match the acceptance criteria, and gave_up_mid_plan means the model hit the turn limit without completing the task.
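The classification rule above can be sketched in a few lines. This is a minimal illustration, not the harness's own code: `Outcome`, `classify_run`, and both boolean inputs are hypothetical names, and it assumes the checker verdict takes priority over the turn-limit flag.

```python
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    WRONG_ANSWER = "wrong_answer"
    GAVE_UP_MID_PLAN = "gave_up_mid_plan"


def classify_run(checker_accepted: bool, hit_turn_limit: bool) -> Outcome:
    """Map one run to an outcome label (hypothetical helper, not harness code)."""
    if checker_accepted:
        return Outcome.PASS
    if hit_turn_limit:
        # The run reached the turn limit without completing the task.
        return Outcome.GAVE_UP_MID_PLAN
    # The run finished, but the checker rejected the output.
    return Outcome.WRONG_ANSWER


print(classify_run(checker_accepted=False, hit_turn_limit=True).value)
```

Only the two failure labels that appear in this campaign are modelled; infrastructure errors and malformed tool calls would need their own branches.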

One task deserves a note before the results: task_09 (know_when_to_stop) presents a three-row CSV and asks for a ten-day moving average. The correct response is to recognize that three data points cannot support a ten-day window and say so in the output file. M2.1 passed this 3/3. M2.5 failed it 0/3. M2 sits between them.

What happened

[Observed — pass_rate_by_task.csv, cost_breakdown.csv]

Task	Score	Avg latency	Total cost
task_01 fix_failing_test	3/3	11.04s	$0.0072
task_02 refactor_duplicated_code	2/3	19.89s	$0.0127
task_03 investigate_log	3/3	14.59s	$0.033
task_04 trace_through_codebase	3/3	12.16s	$0.0065
task_05 minimal_fix	3/3	16.38s	$0.0113
task_06 handle_ambiguous_requirement	3/3	16.99s	$0.0105
task_07 multi_step_plan	2/3	6.30s	$0.0036
task_08 recover_from_tool_error	2/3	6.34s	$0.0033
task_09 know_when_to_stop	1/3	13.56s	$0.0076
task_10 sql_investigation	2/3	14.62s	$0.009

Total: 24/30. $0.10 campaign cost. $0.00417/pass.

Five clean sweeps on the structurally bounded tasks — problems with a defined answer and a verifiable output. Four partial failures (2/3 each) on the tasks that require multi-step execution with intermediate state tracking. One partial pass on task_09.

Failure mode histogram (verified: failure_mode_histogram.csv): wrong_answer 4, gave_up_mid_plan 2. No infrastructure errors, no malformed tool calls, no context overflow.
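The totals can be re-derived from the table as a sanity check. The sketch below copies the per-task figures by hand; the variable names are illustrative and nothing here reads the actual CSV files. One caveat: the rounded per-task costs sum to slightly more than $0.10, and the reported $/pass matches the rounded $0.10 campaign figure, so that is the value used for the division.

```python
# (passes, runs, total cost in USD) per task, copied from the table above.
results = {
    "task_01": (3, 3, 0.0072),
    "task_02": (2, 3, 0.0127),
    "task_03": (3, 3, 0.033),
    "task_04": (3, 3, 0.0065),
    "task_05": (3, 3, 0.0113),
    "task_06": (3, 3, 0.0105),
    "task_07": (2, 3, 0.0036),
    "task_08": (2, 3, 0.0033),
    "task_09": (1, 3, 0.0076),
    "task_10": (2, 3, 0.009),
}
failure_modes = {"wrong_answer": 4, "gave_up_mid_plan": 2}
REPORTED_CAMPAIGN_COST = 0.10  # rounded figure quoted in the report

total_passes = sum(passes for passes, _, _ in results.values())
total_runs = sum(runs for _, runs, _ in results.values())
total_failures = total_runs - total_passes
summed_task_cost = round(sum(cost for _, _, cost in results.values()), 4)
cost_per_pass = REPORTED_CAMPAIGN_COST / total_passes

# Every failed run should be accounted for by exactly one failure mode.
assert total_failures == sum(failure_modes.values())

print(f"{total_passes}/{total_runs} passed, {total_failures} failed")
print(f"per-task costs sum to ${summed_task_cost:.4f}")
print(f"cost per pass: ${cost_per_pass:.5f}")
```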

Does the verbosity explain the cost?

[Observed — brief data, cost_breakdown.csv]

M2 generates approximately 363 output tokens per smoke run — the highest in the MiniMax family. M2.5 produces around 47 on the same prompt. M2.1 produces around 332.

That verbosity drives cost. At $0.00417/pass, M2 costs 76% more per correct result than M2.5 ($0.00237) and 62% more than M2.1 ($0.002575). In absolute terms, the differences are small — a fraction of a cent per run. At scale, they add up.
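The percentages follow directly from the three $/pass figures. A minimal sketch, with `premium_pct` as an illustrative helper name:

```python
def premium_pct(cost: float, baseline: float) -> float:
    """Percentage by which `cost` exceeds `baseline`."""
    return (cost / baseline - 1.0) * 100.0


# $/pass figures quoted in the report.
M2_COST = 0.00417
M2_5_COST = 0.00237
M2_1_COST = 0.002575

print(f"M2 vs M2.5: {premium_pct(M2_COST, M2_5_COST):.0f}% more per pass")
print(f"M2 vs M2.1: {premium_pct(M2_COST, M2_1_COST):.0f}% more per pass")
```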

The relationship between verbose output and score is not what you might expect. M2 produces the most tokens and scores the lowest. M2.1 produces nearly as many tokens (332) and scores the highest. M2.5 produces far fewer and sits in between. High output volume is not a reliable predictor of better results — it predicts higher cost.

[Speculation]

M2.1’s verbosity appears to be caution-first: it generates caveats and qualifications before committing to outputs. That habit paid off on task_09. M2’s verbosity seems to manifest differently — more expansive exploration on tasks like task_02 (refactor, 19.89s average) and task_03 (investigate_log, highest cost in the campaign at $0.033) without converting that exploration into additional passes.

Where the failures are

[Observed — failure_mode_histogram.csv, cross_task_consistency.md]

The four wrong_answer failures did not cluster on one task. They appeared across four distinct task IDs: task_02, task_07, task_08, and task_10 (verified: cross_task_consistency.md). The evidence bundler classifies wrong_answer across 4 distinct tasks as a model-level pattern rather than task-specific brittleness.

The common thread is tasks that require multi-step execution with intermediate state tracking. M2 approaches these tasks correctly and executes most steps without error. The failures occur at the final output step — writing to a specific location, matching an expected format, or getting a count precisely right. It gets the approach right and loses precision on delivery.

Every failure, including the two gave_up_mid_plan runs, landed on a task that M2 passed on at least one other run. That is a different profile from a structural collapse: the model can execute these tasks, but not consistently.

[Unobserved]

The evidence pack shows zero diagnosis-then-regression observations: no runs where M2 stated a clear diagnosis and then walked it back (verified: diagnosis_then_regression.md). When M2 commits to an approach, it does not backtrack. Similarly, there were zero long-tail runs: no run exceeded 12 turns. M2 terminates cleanly and does not spiral on hard tasks.

Two of 30 runs showed minor tool redundancy — consecutive identical tool calls — both on passing runs. Not a reliability signal at this count but worth tracking at higher volume.

What task_09 tells us about the family

[Observed — cross-campaign data]

M2’s 1/3 on task_09 sits between M2.1’s 3/3 and M2.5’s 0/3. The task asks the model to recognize that a three-row dataset cannot support a ten-day moving average and flag that limitation in the output file. Passing requires that explicit acknowledgment; a numerically plausible output without it fails.
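The behaviour the checker rewards can be sketched as a guard on the window size. This is an assumption-heavy illustration: `moving_average` and `task_09_output` are hypothetical names, the sample values are made up, and the exact wording the real checker accepts is not known from the report.

```python
def moving_average(values: list[float], window: int) -> list[float]:
    """Return the simple moving average, or raise if the window cannot be filled."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        raise ValueError(
            f"cannot compute a {window}-day moving average from {len(values)} data points"
        )
    return [sum(values[i - window:i]) / window for i in range(window, len(values) + 1)]


def task_09_output(values: list[float], window: int = 10) -> str:
    """Hypothetical output text: the averages, or an explicit limitation note."""
    try:
        averages = moving_average(values, window)
    except ValueError as exc:
        return f"Not computed: {exc}."
    return ", ".join(f"{a:.2f}" for a in averages)


# Three illustrative rows, as in the task's CSV (values invented for the sketch).
print(task_09_output([101.0, 98.5, 102.3]))
```

The point of the guard is that the function refuses to return a number at all when the window cannot be filled, which is the opposite of the "numerically plausible output" failure described above.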

M2.5 never wrote the note. M2.1 wrote it on all three runs. M2 wrote it once.

The verbosity signal that correlates with M2.1’s task_09 performance is present in M2 as well — 363 output tokens per smoke run suggests some tendency toward additional text. But that tendency is not reliable enough to produce the specific acknowledgment consistently. M2.1’s caution-first pattern seems more systematic; M2’s is more sporadic.

[Speculation]

The progression across the family on task_09 — 0/3 for M2.5, 1/3 for M2, 3/3 for M2.1 — does not follow version order. M2.5 has the highest suffix but the worst task_09 result. The more likely explanation is that post-training for M2.1 reinforced hedging and limitation-acknowledgment behaviour in a way that M2.5’s training did not, and M2 sits somewhere between. Whether that is deliberate or a side effect of other training choices is not visible from the outside.

Is this what a baseline looks like?

[Observed — cross-campaign data]

With all three MiniMax variants in the dataset:

Model	Score	$/pass	task_09	Output tokens (smoke)
MiniMax M2.1	28/30	$0.002575	3/3	~332
MiniMax M2.5	27/30	$0.00237	0/3	~47
MiniMax M2	24/30	$0.00417	1/3	~363

M2 is the family floor. It scores 4 points below M2.1 and 3 below M2.5. It costs more per pass than both. On the tasks where M2.5 and M2.1 succeed consistently, M2 produces partial failure.

The more interesting observation is that this gradient exists at all. In a field where version labels are frequently cosmetic, where a “2.0” release is a rebrand of the same checkpoint with a longer context window, MiniMax M2 to M2.1 is a 4-point lift with a 38% cost reduction per pass ($0.00417 to $0.002575). That is a measurable improvement, not a marketing story.

If you are running MiniMax models in an agentic pipeline and still on M2, the data is clear. Switch to M2.1. You get better results at lower cost. M2.5 is the cost-optimised choice if per-pass expense is the primary constraint and you do not need reliable task_09 behaviour.

[Speculation]

M2 may have been a reasonable deployment choice before M2.1 existed. As a historical baseline it serves a useful purpose: it tells you how much the MiniMax team improved their model across two subsequent releases, and it gives any builder still running the original checkpoint a concrete reason to upgrade. But as a current deployment target, it is outclassed by both successors.

Predictions

[Observed — predictions file]

Prediction	Threshold	Result	Verdict
P1: Overall ≥ 24/30	≥ 24	24/30	CORRECT
P2: task_09 ≥ 1/3	≥ 1 run	1/3	CORRECT
P3: $/pass > $0.00237	> M2.5 rate	$0.00417	CORRECT

3/3 correct. Point estimate was 26/30; actual is 24/30. The predictions were directionally right — M2 would be weaker than M2.5, more expensive than M2.5, and would pass task_09 at least once given its verbosity. The underestimated factor was the multi-step precision failures: wrong_answer appearing across four distinct task IDs indicated a broader pattern than the point estimate accounted for.
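The verdict column can be reproduced by evaluating each threshold against the observed numbers. A minimal sketch: the dictionary layout and names are illustrative, not the format of the predictions file.

```python
# Observed campaign results, as reported above.
observed = {"score": 24, "task_09_passes": 1, "cost_per_pass": 0.00417}
POINT_ESTIMATE = 26  # predicted overall score out of 30

# Each prediction is a threshold check on the observed values.
predictions = {
    "P1": lambda o: o["score"] >= 24,
    "P2": lambda o: o["task_09_passes"] >= 1,
    "P3": lambda o: o["cost_per_pass"] > 0.00237,  # M2.5 rate
}

verdicts = {
    name: "CORRECT" if check(observed) else "INCORRECT"
    for name, check in predictions.items()
}
point_estimate_error = observed["score"] - POINT_ESTIMATE

print(verdicts)
print(f"point estimate error: {point_estimate_error:+d}")
```

Note that P1 held with no margin: one more failed run would have flipped it, which is why the 2-point miss on the point estimate is the more informative number.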

Cost in context

[Observed — cost_breakdown.csv, cross-campaign data]

$0.10 total campaign cost. $0.00417/pass.

Model	Score	$/pass
Ministral 3 8B	28/30	$0.00067
GLM-4.7	28/30	$0.0010
MiniMax M2.1	28/30	$0.002575
Devstral 2	27/30	$0.0020
Mistral Large 3	27/30	$0.0022
MiniMax M2.5	27/30	$0.00237
GLM-4.7-Flash	25/30	$0.000565
MiniMax M2	24/30	$0.00417
Kimi K2.5	24/30	$0.0044
Kimi K2 Thinking	12/30	$0.0079

M2 sits in the 24/30 tier, one step below the 25/30 tier, and it is the cheaper of the two models at that level: Kimi K2.5 costs $0.0044/pass. Both M2.1 and M2.5 score higher and cost less. Within the MiniMax family, M2 is the strictly dominated option.
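"Strictly dominated" has a precise meaning here: another option is at least as good on both score and $/pass, and better on at least one. A short sketch of that check over the three MiniMax rows (the `Entry` type and `dominates` function are illustrative names):

```python
from typing import NamedTuple


class Entry(NamedTuple):
    name: str
    score: int            # passes out of 30
    cost_per_pass: float  # USD


def dominates(a: Entry, b: Entry) -> bool:
    """True if `a` is no worse than `b` on both axes and better on at least one."""
    no_worse = a.score >= b.score and a.cost_per_pass <= b.cost_per_pass
    strictly_better = a.score > b.score or a.cost_per_pass < b.cost_per_pass
    return no_worse and strictly_better


family = [
    Entry("MiniMax M2.1", 28, 0.002575),
    Entry("MiniMax M2.5", 27, 0.00237),
    Entry("MiniMax M2", 24, 0.00417),
]

for candidate in family:
    beaten_by = [e.name for e in family if dominates(e, candidate)]
    print(f"{candidate.name}: dominated by {beaten_by or 'nobody'}")
```

M2.1 and M2.5 do not dominate each other, since M2.1 scores higher while M2.5 costs less per pass, so the choice between them is a genuine trade-off.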

Leaderboard

[Observed — cross-campaign data, as of 2026-05-24]

Score	Models
28/30	Claude Sonnet 4.6, GLM-4.7, Ministral 3 8B, MiniMax M2.1
27/30	MiniMax M2.5, Mistral Large 3, Devstral 2, GLM-5
25/30	GPT-OSS 20B, GLM-4.7-Flash
24/30	Kimi K2.5, MiniMax M2
23/30	Ministral 3 14B, Magistral Small 2509
22/30	Ministral 3 3B
12/30	Kimi K2 Thinking

MiniMax now has three entries in the dataset spanning a 4-point range. M2.1 and M2.5 sit in the top two tiers. The family floor (24/30) does not: of the 16 models listed, 10 score above M2, one ties it, and four score below it.

What we still do not know

[Speculation]

M2 failed 6 of 30 runs: four wrong_answer failures spread across four task IDs and two gave_up_mid_plan failures. That is a real pattern, but 6 data points are not enough to characterize the failure precisely. Whether M2 specifically struggles with output-file formatting, with count operations, or with both is not separable at this sample size.

The task_09 1/3 is the minimum possible partial pass. One run produced the limitation acknowledgment; two did not. A targeted re-run of task_09 alone would distinguish between “M2 is sporadically capable of this” and “M2 wrote the note once by chance.” Neither outcome changes the deployment recommendation, but the distinction matters for understanding where the caution-first behaviour sits in the family.

Finally: M2’s verbosity (~363 output tokens on a smoke call) is the highest in the family. Whether this pattern extends to future MiniMax checkpoints is worth watching. If M3 shows similar or higher verbosity, the task_09 prediction based on it becomes more reliable.

ClawWorks Weekly