The older model is better (this time)
Campaign: 2026-05-23-minimax-m2.1-agentic-core-v1
Model: MiniMax M2.1 (minimax.minimax-m2.1, AWS Bedrock us-east-1, ON_DEMAND)
Architecture: Mixture-of-experts (MoE) transformer
Harness: agentic-core-v1 (10 tasks x 3 runs = 30 total)
Campaign date: 2026-05-23
MiniMax M2.5 is already in this dataset, where it scored 27/30. Good result, top tier, first Chinese-lab model to reach it. The expectation going into M2.1 was a small regression: version numbers usually mean something, M2.5 has the higher suffix, and M2.1 is the older variant.
M2.1 scored 28/30.
The story is not “M2.1 is better.” It’s “M2.1 and M2.5 are different, and the difference happened to matter for exactly one task.” That task — task_09, the impossible-computation detector — produced the 3-point swing that explains the entire result.
What the harness asks
[Observed — harness spec]
Ten tasks, three independent runs each. agentic-core-v1 runs software engineering scenarios: fix a failing test, refactor duplicated code, investigate a log, trace execution through a codebase, make a targeted minimal fix, handle an ambiguous requirement, execute a multi-step sequential plan, recover from an injected tool error, recognize an impossible computation and refuse, run a SQL investigation.
The task that defines this article is task_09 (know when to stop). It presents a three-row dataset and asks for a ten-day moving average. The correct response is to recognize that three data points cannot support a ten-day window. The checker accepts an answer file that computes partial averages but includes an explicit acknowledgment of the data limitation. An answer file without that acknowledgment fails.
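The acceptance rule described above can be sketched roughly as follows. This is a hypothetical illustration, not the harness's actual checker: the function name `check_task_09`, the pattern list, and the sample texts are all assumptions. The article only states that an explicit acknowledgment of the data limitation is required.

```python
import re

# Hypothetical phrases treated as an acknowledgment of the data limitation.
# The real checker's criteria are not published in this article.
ACK_PATTERNS = [
    r"only \d+ (days|rows|data points)",
    r"insufficient data",
    r"not enough data",
    r"fewer than 10",
]

def check_task_09(answer_text: str) -> bool:
    """Return True if the answer file acknowledges the data limitation."""
    lowered = answer_text.lower()
    return any(re.search(pattern, lowered) for pattern in ACK_PATTERNS)

# Illustrative answer files (assumed content, three rows of partial averages).
without_note = "date,revenue,10_day_ma\n2026-01-01,100,100.0\n"
with_note = without_note + "Note: only 3 days of data available.\n"

print(check_task_09(without_note))  # False
print(check_task_09(with_note))     # True
```

A keyword match like this is deliberately crude. A real checker might be stricter or looser, which matters for how much weight the task_09 result can bear.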
Pass means the task checker accepted the output. Failure modes are classified: wrong_answer (the checker rejected the output), gave_up_mid_plan (the model hit the turn limit without finishing), tool_call_hallucinated (the model called a tool that doesn’t exist), tool_call_malformed (the model called a real tool with invalid arguments).
What happened
[Observed — data pack task_results, verified: pass_rate_by_task.csv]
| Task | Score | Avg latency | Cost (all 3 runs) | Passes vs M2.5 |
|---|---|---|---|---|
| task_01 fix_failing_test | 3/3 | 7.89s | $0.0051 | = |
| task_02 refactor_duplicated_code | 3/3 | 20.45s | $0.011 | = |
| task_03 investigate_log | 3/3 | 8.56s | $0.0168 | = |
| task_04 trace_through_codebase | 3/3 | 9.06s | $0.0052 | = |
| task_05 minimal_fix | 3/3 | 10.44s | $0.0067 | = |
| task_06 handle_ambiguous_requirement | 2/3 | 13.89s | $0.0088 | -1 |
| task_07 multi_step_plan | 3/3 | 5.4s | $0.0031 | = |
| task_08 recover_from_tool_error | 2/3 | 4.32s | $0.0026 | -1 |
| task_09 know_when_to_stop | 3/3 | 14.06s | $0.0077 | +3 |
| task_10 sql_investigation | 3/3 | 8.23s | $0.0051 | = |
Total: 28/30. $0.0721 campaign cost. $0.002575/pass.
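The totals can be re-derived from the table rows. The snippet below is a minimal sketch that copies the per-task scores and costs from the table; the variable names are illustrative, and it assumes each cost figure covers all three runs of a task.

```python
# (passes, cost in USD) per task, copied from the table above.
task_results = {
    "task_01": (3, 0.0051),
    "task_02": (3, 0.011),
    "task_03": (3, 0.0168),
    "task_04": (3, 0.0052),
    "task_05": (3, 0.0067),
    "task_06": (2, 0.0088),
    "task_07": (3, 0.0031),
    "task_08": (2, 0.0026),
    "task_09": (3, 0.0077),
    "task_10": (3, 0.0051),
}

total_passes = sum(passes for passes, _ in task_results.values())
# Round to 4 decimals to drop floating-point noise from the summation.
total_cost = round(sum(cost for _, cost in task_results.values()), 4)
cost_per_pass = total_cost / total_passes

print(f"{total_passes}/30 passes, ${total_cost:.4f} total, ${cost_per_pass:.6f}/pass")
```

The rows sum to the reported campaign cost, which is why the cost column is read here as a per-task total rather than a per-run figure.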
Failure mode histogram (verified: failure_mode_histogram.csv): wrong_answer 2, everything else 0. No infrastructure errors, no format failures, no context overflow. The run was clean. The two failures are precision errors — not structural problems.
Why does M2.1 beat M2.5?
[Observed — task_09 answer files, verbatim]
The version inversion is real but narrow. M2.1 gains 3 on task_09, loses 1 each on task_06 and task_08. Net: +1.
task_09 is where everything turns. Both models receive the same three-row dataset and the same request for a ten-day moving average. M2.5’s answer file contains:
date,revenue,10_day_ma
2026-01-01,100,100.0
2026-01-02,150,125.0
2026-01-03,200,150.0
No note. No qualification. Partial averages asserted as if they are ten-day averages. Checker rejects it. M2.5 does this all three times.
M2.1’s answer file for run 1 contains exactly the same numeric output, plus one line at the end:
Note: Only 3 days of data available, moving average computed with available data.
Checker accepts it. M2.1 writes this note on all three runs.
That note is the entire version inversion. One model hedges, the other does not.
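To make the distinction concrete, here is a small sketch of a trailing moving average with and without a full-window requirement. The function `moving_average` and its `allow_partial` flag are illustrative names, not anything from the harness or from either model's output.

```python
def moving_average(values, window, allow_partial):
    """Trailing moving average.

    Positions where fewer than `window` values are available yield None,
    unless allow_partial is True, in which case the mean of the available
    values is used instead.
    """
    out = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start : i + 1]
        if len(chunk) < window and not allow_partial:
            out.append(None)
        else:
            out.append(sum(chunk) / len(chunk))
    return out

revenue = [100, 150, 200]  # the three rows from the answer files above

strict = moving_average(revenue, window=10, allow_partial=False)
partial = moving_average(revenue, window=10, allow_partial=True)

print(strict)   # [None, None, None]
print(partial)  # [100.0, 125.0, 150.0]
```

Both models emitted the `partial` values. With a strict ten-day window there is no defined value at all, which is why a column labelled 10_day_ma needs the caveat.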
The signal was visible before the campaign started. On a single-tool smoke call, M2.1 generated 332 output tokens. M2.5 generated 47 on the same prompt — a 7x verbosity difference. At the time this was logged as noise. It was not noise. M2.1’s verbosity is a caution-first inference style: it tends to append caveats and qualifications where M2.5 commits to a direct output. task_09 is the one task in the harness that rewards exactly this behaviour.
[Speculation]
MiniMax is probably not training model variants to “be cautious” or “be precise” as explicit objectives. The more likely explanation is that M2.1 and M2.5 diverged on some axis during post-training — RLHF feedback, instruction-tuning data mix, or output-length reward shaping — and that divergence happened to land differently on task_09. Whether this is a generalizable output-style difference or a task_09-specific artifact is unclear from 30 runs alone.
What the regressions are
[Observed — data pack task_results]
M2.1 drops one pass each on task_06 and task_08. Both failures are precision errors under a different kind of pressure.
task_06 asks the model to implement a count_users function and document its assumptions in a note file. M2.5 passes 3/3 with clean implementations. M2.1’s one failing run produced a note that catalogued structural assumptions (JSON array format, context manager usage, error handling caveats) in more detail than the checker’s acceptance criteria allow. Over-specification, not under-specification. The same verbosity pattern that writes the task_09 caveat also wrote a task_06 note that was too thorough.
task_08 asks the model to read data.txt and write the character count to length.txt. M2.1’s one failing run wrote 35 for a string that contains 34 characters: an off-by-one error, probably from including the trailing newline in the count. M2.5 gets 3/3 with consistent counts.
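The trailing-newline explanation is easy to reproduce. The sketch below uses placeholder content, since the real data.txt is not reproduced in this article, and it illustrates the suspected cause rather than confirming it.

```python
# Placeholder: 34 visible characters followed by a trailing newline.
# The actual contents of data.txt are not shown in the article.
content = "a" * 34 + "\n"

with_newline = len(content)                  # counts the "\n" as a character
without_newline = len(content.rstrip("\n"))  # counts only the visible text

print(with_newline, without_newline)  # 35 34
```

Whether the trailing newline counts as a character is a spec question, so a checker that expects 34 is implicitly choosing the second reading.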
In both cases the failure is not a structural collapse — no gave_up_mid_plan, no tool call errors, no format problems. The model engaged with the task correctly. It just got one thing wrong at the final output step. That is a different failure profile from, say, Kimi K2 Thinking (0/3 on task_07 because the thinking trace never became tool calls) or Jamba 1.5 Large (task_04 0/3 because the model described the fix without writing code).
task_09: the third family to pass
[Observed — cross-campaign data]
M2.1 is the third distinct model family to pass task_09 in the dataset. Ministral 3 8B (3/3 in its campaign) was the second. Every other MiniMax model failed it. Both M2.5 runs in the dataset scored 0/3. The family failure assumption — which drove prediction P2 — did not hold.
[Unobserved]
We don’t know whether every MiniMax model below M2 would also fail task_09, or whether the verbosity-output pattern that produces the caveat is unique to M2.1. The dataset has only M2.1 and M2.5. One more variant (M2 base, or a future M3) would let us check whether caution correlates within the family or is specific to M2.1’s post-training.
task_02: slow and thorough
[Observed — data pack cost_breakdown.csv]
task_02 (refactor duplicated code) took an average of 20.45 seconds per run — the slowest task in the campaign and notably slower than M2.5’s 10.53s for the same task. Both models passed 3/3. The extra time corresponds to an average of 6 tool calls per run and substantially higher output tokens (5,961 vs M2.5’s 2,251). M2.1 explored the codebase more before committing the refactor.
Passes do not require speed. task_02 is M2.1’s slowest task and, at $0.0037 per run ($0.011 over three runs), its second most expensive after task_03. Whether that extra exploration produces higher-quality output than M2.5’s faster refactors is not captured by the binary pass/fail.
Is caution actually better?
[Speculation]
The version inversion is real, but treating it as evidence that M2.1 is the “better” model probably reads too much into one result. The two models appear to sit on a tradeoff surface: M2.5 executes with lower verbosity and higher precision on tasks that require a clean committed output (task_06, task_08). M2.1 hedges more consistently, which pays off on tasks that require acknowledging limitations (task_09).
Which variant you want depends on the workload. If you are deploying an agent that will frequently need to recognize impossibility or ambiguity and surface that to the user, M2.1’s caution-first style is worth the occasional over-specification on other tasks. If you need tight committed outputs with minimal hedging, M2.5’s style is cleaner.
MiniMax version numbers, at least for this family, do not appear to denote a strict quality upgrade chain. They appear to denote variants.
Cost
[Observed — data pack summary, verified: cost_breakdown.csv]
$0.0721 total. $0.002575/pass. Pricing: $0.30 per 1M input tokens, $1.20 per 1M output tokens.
M2.1 costs 8.6% more per pass than M2.5 ($0.00237). In absolute terms the difference is tiny: $0.000205 per pass, which comes to about $0.006 even if all 30 runs of a campaign pass. The relevant comparators at the top tier:
| Model | Score | $/pass |
|---|---|---|
| MiniMax M2.1 | 28/30 | $0.002575 |
| MiniMax M2.5 | 27/30 | $0.00237 |
| Mistral Large 3 | 27/30 | $0.0021 |
| Devstral 2 | 27/30 | $0.0020 |
| GLM-4.7 | 28/30 | $0.0038 |
| Claude Sonnet 4.6 | 28/30 | $0.0514 |
| Ministral 3 8B | 28/30 | $0.00067 |
At the 28/30 tier, Ministral 3 8B is 3.8x cheaper per pass. The gap is large enough to matter if you are running campaigns at volume. The MiniMax models sit between the Mistral 27/30 cluster and GLM-4.7 on cost. Neither dramatically cheap nor dramatically expensive.
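The ratios quoted in this section follow directly from the table. A minimal sketch, using only the $/pass figures listed above (the variable names are illustrative):

```python
# $/pass figures copied from the comparison table above.
cost_per_pass = {
    "MiniMax M2.1": 0.002575,
    "MiniMax M2.5": 0.00237,
    "Mistral Large 3": 0.0021,
    "Devstral 2": 0.0020,
    "GLM-4.7": 0.0038,
    "Claude Sonnet 4.6": 0.0514,
    "Ministral 3 8B": 0.00067,
}

baseline = cost_per_pass["MiniMax M2.1"]
# Each model's cost per pass expressed as a multiple of M2.1's.
ratio_vs_m21 = {model: cost / baseline for model, cost in cost_per_pass.items()}

for model, ratio in sorted(ratio_vs_m21.items(), key=lambda kv: kv[1]):
    print(f"{model:<18} ${cost_per_pass[model]:.6f}  {ratio:6.2f}x M2.1")
```

Sorting by ratio puts Ministral 3 8B at one end and Claude Sonnet 4.6 at the other, with both MiniMax entries in the middle of the range.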
Predictions
[Observed — predictions file, verified: data_pack.json predictions section]
| Prediction | Claim | Result |
|---|---|---|
| P1 | Score ≤ 26/30 | WRONG — actual 28/30 |
| P2 | task_09 0/3 | WRONG — actual 3/3 |
| P3 | cost/pass > $0.00237 | CORRECT — $0.002575 |
1/3 correct. Point estimate was 22/30. Actual was 28/30. Delta of +6.
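The scoring of the three predictions can be expressed as simple checks against the observed results. This is an illustrative sketch: the dictionary layout and names are assumptions, not the format of the predictions file.

```python
# Observed campaign results, taken from the sections above.
actual = {"score": 28, "task_09_passes": 3, "cost_per_pass": 0.002575}

# Each prediction as a predicate over the observed results.
predictions = {
    "P1": lambda a: a["score"] <= 26,
    "P2": lambda a: a["task_09_passes"] == 0,
    "P3": lambda a: a["cost_per_pass"] > 0.00237,
}

results = {name: check(actual) for name, check in predictions.items()}
correct = sum(results.values())

print(results)
print(f"{correct}/{len(results)} correct")
```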
P1 and P2 both relied on the version-ladder assumption: M2.1 < M2.5, therefore M2.5 results bound M2.1’s ceiling. That assumption was wrong. P2 in particular relied on the family failure pattern for task_09 — the observation that every prior MiniMax model failed it. M2.1 broke that pattern.
The verbosity signal (332 tokens on the smoke call vs M2.5’s 47) was logged before the campaign. At the time it was read as noise that might cause latency variance. The correct read was “output style differs, and that will matter on task_09.” The prediction model didn’t make that connection.
Leaderboard
[Observed — cross-campaign data]
| Score | Models |
|---|---|
| 28/30 | Claude Sonnet 4.6, GLM-4.7, Ministral 3 8B, MiniMax M2.1 |
| 27/30 | MiniMax M2.5, Mistral Large 3, Devstral 2, GLM-5 |
| 25/30 | GPT-OSS 20B, GLM-4.7-Flash |
| 24/30 | Kimi K2.5 |
| 23/30 | Ministral 3 14B, Magistral Small 2509 |
| 22/30 | Ministral 3 3B |
| 12/30 | Kimi K2 Thinking |
MiniMax now has two entries spanning a 1-point range. Both at the top of the dataset. Within-family, the version with the lower suffix scores higher. The usual assumption about version numbering does not hold here.
What we don’t know yet
[Speculation]
The task_09 3/3 is the headline result. Whether M2.1’s caveat-writing habit generalizes to other tasks that require limitation acknowledgment — or whether this is a task_09-specific fit — requires a different task to test. There is no equivalent “recognize the impossibility” task in the current harness.
The task_06 and task_08 precision failures are single runs. One wrong_answer in 3 runs is the smallest non-zero failure rate a three-run task can record, so both could be noise. A re-run of those specific tasks would narrow the uncertainty but is not currently in the campaign queue.
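One way to see how little three runs can say is to put an interval around the per-task pass rate. The sketch below uses the Wilson score interval at roughly 95% confidence (z = 1.96). That choice is an assumption made here for illustration; the article's data pack does not report intervals.

```python
import math

def wilson_interval(passes: int, runs: int, z: float = 1.96):
    """Wilson score interval for a binomial pass rate, clamped to [0, 1]."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, center - half), min(1.0, center + half)

for label, passes in [("2/3", 2), ("3/3", 3)]:
    low, high = wilson_interval(passes, 3)
    print(f"{label}: {low:.2f} to {high:.2f}")
```

Under these assumptions a 2/3 result is compatible with a true pass rate anywhere from roughly 21% to 94%, and that interval overlaps heavily with the one for 3/3. This is consistent with treating the two single-run failures as possible noise.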
The smoke-test verbosity difference (332 vs 47 output tokens) is a signal worth tracking for future MiniMax variants. If an M3-family model shows similar verbosity inflation, that would predict a better task_09 result before the campaign runs.