27/30 from a Beijing lab

May 20, 2026 · campaign-reports

Campaign: 2026-05-19-minimax-m2-5-agentic-core-v1
Model: MiniMax M2.5 (minimax.minimax-m2.5, AWS Bedrock us-east-1, ON_DEMAND)
Harness: openclaw@2026.4.22
Runs: 30 (10 tasks x 3 runs each)
Campaign date: 2026-05-19 (run at 20:25-20:29 UTC)

Before this campaign, the 27/30 mark on agentic-core-v1 belonged to exactly two models: Mistral Large 3, a dense 675B generalist, and Devstral 2, Mistral’s 123B coding specialist. Nine models in total had run before this campaign. Only Claude Sonnet 4.6 had scored higher, at 28/30. Nothing else had reached that tier.

MiniMax is a Beijing-based AI lab. Before this campaign, none of its models appeared in our dataset. M2.5’s marketing is built around a 1M token context window and what the lab calls “deep reasoning” via an internal chain-of-thought mechanism. Claims like these appear on every frontier model launch page. Whether they translate to structured agentic performance is the question the harness answers.

Pre-run prediction was 24/30. Actual was 27/30.

The +3 overshoot traces to four tasks predicted at 2/3 each that all came in at 3/3. All four are tasks where pre-action deliberation appears to matter. The prediction model didn’t account for this, which turned out to be the interesting part.

One methodological note: the campaign was corrected mid-run. The original spec contained placeholder task IDs for tasks 05-10 (PR #52 fixed these to the actual scaffold names). All 30 runs here are from the corrected campaign only (2026-05-19T20:25-20:29Z, 266 seconds wall clock).

What the harness actually tests

[Observed: harness spec]

agentic-core-v1 has 10 tasks, each run 3 times, for 30 total. Every task has a deterministic checker. Output either clears acceptance criteria or it doesn’t. No partial credit.

The 10 task types: fix a failing test, refactor duplicated code, investigate a large log file, trace through a codebase, apply a minimal fix under a strict line-count constraint, handle an intentionally ambiguous requirement, execute a sequential multi-step plan using only file write calls, recover from a deliberate tool error, recognise that a computation is structurally impossible and decline to produce an answer, and run an SQL investigation using native database tools.

A run passes when the model’s committed output clears the checker within the 15-turn budget. Failure modes are wrong_answer (checker rejects the output) or gave_up_mid_plan (turn limit reached without a committed answer).
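The pass and failure rules above can be sketched as a small classifier. This is a minimal illustration, not the harness's actual code: the `RunRecord` fields and the `classify_run` function are hypothetical names, and the sketch assumes the harness stops a run at the turn limit, so a committed answer always arrives within budget.

```python
from dataclasses import dataclass
from typing import Optional

TURN_BUDGET = 15  # turn budget stated in the harness description


@dataclass
class RunRecord:
    """Hypothetical record of one run; field names are illustrative only."""
    turns_used: int
    committed_output: Optional[str]  # None if the model never committed an answer
    checker_accepts: bool            # verdict of the task's deterministic checker


def classify_run(run: RunRecord) -> str:
    """Return 'pass', 'wrong_answer', or 'gave_up_mid_plan' for one run."""
    if run.committed_output is None:
        # Turn limit reached without a committed answer.
        return "gave_up_mid_plan"
    if run.checker_accepts:
        # No partial credit: the checker either accepts the output or it does not.
        return "pass"
    return "wrong_answer"


examples = [
    RunRecord(turns_used=4, committed_output="patch", checker_accepts=True),
    RunRecord(turns_used=5, committed_output="3.7", checker_accepts=False),
    RunRecord(turns_used=TURN_BUDGET, committed_output=None, checker_accepts=False),
]
labels = [classify_run(r) for r in examples]
```

Under these assumptions, all three MiniMax failures in this campaign would fall in the `wrong_answer` branch: the model committed an output and the checker rejected it.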

task_09 (know_when_to_stop) is structurally distinct. The model receives a 3-row CSV and is asked to compute a 10-day moving average. Three data points cannot produce a 10-day moving average. The correct response is to recognise this and decline to compute. No model in our dataset has scored 3/3 on task_09.
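To make the impossibility concrete, here is a minimal sketch of a moving-average helper that declines when the input is shorter than the window. The CSV contents and the `moving_average` function are illustrative assumptions; the real task file and checker are not reproduced in this report.

```python
import csv
import io
from typing import List, Optional


def moving_average(values: List[float], window: int) -> Optional[List[float]]:
    """Return trailing moving averages, or None when there are too few points."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        # Not enough data: decline rather than invent a number.
        return None
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Illustrative 3-row CSV (made-up values, not the task's actual data).
SAMPLE_CSV = "day,value\n1,10.0\n2,12.0\n3,11.0\n"
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
values = [float(r["value"]) for r in rows]

result = moving_average(values, window=10)  # None: 3 points cannot fill a 10-day window
```

The length check is the whole task: a model that skips it and averages whatever rows exist produces a number, and the checker rejects it.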

What MiniMax M2.5 did

[Observed: verification/pass_rate_by_task.csv, verification/failure_mode_histogram.csv]

27 of 30 runs passed. Pass rate: 90%. Nine of ten task types went 3/3. task_09 went 0/3. All three failures were wrong_answer. No infrastructure errors.

Task	Result	Avg tool calls	Avg latency
task_01 fix failing test	3/3	5.0	7.49s
task_02 refactor duplicated code	3/3	4.3	10.53s
task_03 investigate log	3/3	3.0	7.43s
task_04 trace through codebase	3/3	6.7	9.16s
task_05 minimal fix	3/3	4.7	9.20s
task_06 handle ambiguous requirement	3/3	6.3	12.13s
task_07 multi-step plan	3/3	4.0	4.66s
task_08 recover from tool error	3/3	2.0	3.43s
task_09 know when to stop	0/3	4.3	15.47s
task_10 SQL investigation	3/3	3.0	8.05s

(verified: verification/pass_rate_by_task.csv, verification/tool_calls_by_task.csv, verification/latency_distribution.csv)
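The headline numbers can be re-derived from the table. The sketch below copies the per-task results by hand rather than reading the verification CSVs, and it assumes the latency column is a per-run average, so multiplying by three runs approximates total model time.

```python
# (task, passes out of 3, avg latency in seconds), copied from the table above
RESULTS = [
    ("task_01", 3, 7.49),
    ("task_02", 3, 10.53),
    ("task_03", 3, 7.43),
    ("task_04", 3, 9.16),
    ("task_05", 3, 9.20),
    ("task_06", 3, 12.13),
    ("task_07", 3, 4.66),
    ("task_08", 3, 3.43),
    ("task_09", 0, 15.47),
    ("task_10", 3, 8.05),
]
RUNS_PER_TASK = 3

total_passes = sum(passes for _, passes, _ in RESULTS)
total_runs = RUNS_PER_TASK * len(RESULTS)
pass_rate = total_passes / total_runs

# Assumption: each latency is an average per run, so 3 x the column sum
# approximates the time spent across all 30 runs.
summed_latency = RUNS_PER_TASK * sum(latency for _, _, latency in RESULTS)

slowest_task = max(RESULTS, key=lambda row: row[2])[0]
fastest_task = min(RESULTS, key=lambda row: row[2])[0]

print(f"{total_passes}/{total_runs} passed ({pass_rate:.0%})")
print(f"summed run latency: about {summed_latency:.0f}s")
```

The summed latency comes to roughly 263 seconds, close to the 266-second wall clock reported for the campaign. That would be consistent with runs executing one after another with little overhead, though the report does not state how runs were scheduled.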

One evidence flag: a single tool_call_redundancy on task_01 run2, a duplicate fs_write call that didn’t affect the pass outcome (verified: evidence_bundles). One non-impacting instance across 30 runs is well within normal range.

Why did the prediction miss by three?

[Observed: predictions/minimax-m2-5-agentic-core-v1.md, data_pack.json]

MiniMax M2.5 carries an internal reasoningContent block that fires before every tool call, surfaced as a chain-of-thought trace in the API response, separate from the text output. The pre-run prediction treated this as a marketing claim and didn’t weight it. That was wrong.

Four tasks predicted at 2/3 came in 3/3:

task_03 (investigate_log): 3.0 avg tool calls, 7.43s/run, clean sweep. The task involves a large log file where the relevant pattern requires reading the full content. task_03 is not universally easy: Kimi K2.5 (24/30 overall) scored 3/3 here, while DeepSeek V3.2 (19/30) scored 0/3 on this same task. The log investigation case is one where context capacity and attention matter, and MiniMax processed it without degradation. The total cost for task_03 was $0.0178 across 3 runs (52,064 input tokens), the most expensive task in the campaign (verified: verification/cost_breakdown.csv).

task_06 (handle_ambiguous_requirement): 6.3 avg tool calls, 12.13s/run. The requirement has an intentional gap: the model must surface an assumption note rather than proceed to implementation. Models that skip straight to code fail here. MiniMax included the assumption note in the correct location on all three runs. The reasoning trace appears to examine the requirement before any code action, though the trace content isn’t surfaced in the data we have.

task_07 (multi_step_plan): 4.0 avg tool calls, 4.66s/run. Sequential 4-step file creation, each file depending on the previous one, using only fs_write calls. NVIDIA Nemotron Super 3 120B (12/30 overall) scored 0/3 on this task, the only model in the dataset to completely fail it. MiniMax went 3/3 cleanly and quickly. The chain-of-thought before each write call appears to maintain step state across the sequence.

task_08 (recover_from_tool_error): 2.0 avg tool calls, 3.43s/run, the fastest task in the campaign. The harness delivers a deliberate path error. MiniMax read the error message and corrected the path on the next call, with no redundant retries.

[Speculation]

The four tasks share a structural property: they all require deliberation before action. task_06 requires catching an ambiguity before touching code. task_07 requires committing to a correct sequence before writing. task_08 requires reading an error and revising the plan. task_03 requires sustained attention on a large input without a shortcut.

A reasoning block that runs before every tool call provides a slot for exactly this kind of pre-flight work. The +3 delta from prediction is consistent with the block doing what it claims. This is behavioural inference from outcomes, not a mechanism proof. We see the results; we don’t have the full trace content to verify the reasoning path.

The task_09 wall

[Observed: verification/pass_rate_by_task.csv, prior campaign articles]

task_09 (know_when_to_stop): 0/3, wrong_answer on every run. Avg tool calls: 4.3. Avg latency: 15.47s per run, the highest of any task in this campaign.

Nine models in this dataset have now scored 0/3 on task_09, and MiniMax is the ninth. The eight prior: llama3.3-70b-agentic-core-v1-run3-arc-2026, gemma-4-31b-agentic-core-v1-2026, gpt-5.5-instant-agentic-core-v1-2026, mistral-large-3-agentic-core-v1-2026, deepseek-v3-2-agentic-core-v1-bedrock-2026, openai-gpt-oss-120b-agentic-core-v1-2026, qwen3-next-80b-a3b-agentic-core-v1-2026, kimi-k2-5-agentic-core-v1-2026. Nemotron Super 3 120B passed one of its three runs (1/3) and is not in this list; Devstral 2, DeepSeek V4-Flash, and Claude Sonnet 4.6 also each passed task_09 at least once.

The latency comparison with Mistral Large 3 is the clearest data point here. Mistral commits a wrong answer on task_09 in 1.3 seconds (prior campaign article: mistral-large-3-agentic-core-v1-2026): one tool call, one write, done. MiniMax takes 15.47 seconds and 4.3 tool calls per run before arriving at the same wrong_answer. The reasoningContent block is running at length. It does not produce a different outcome.

[Unobserved]

We have not seen any model in this dataset consistently refuse to answer task_09 across all three runs. Four models caught the impossibility once each (Devstral 2, DeepSeek V4-Flash, Claude Sonnet 4.6, Nemotron Super 3 120B); none caught it reliably across a full run series. We have not tested whether explicit prompting about input validity before computation changes this.

[Speculation]

Internal reasoning does not automatically produce stop-signal awareness. MiniMax reasons for 15 seconds on task_09 and returns wrong_answer each time. This is consistent with a model that has strong deliberation capacity for how to execute a task, but no reliable mechanism for detecting when a task is unsolvable from the available data. These appear to be structurally different capabilities. The P2 prediction assumed that deliberation before acting would surface data insufficiency. It didn’t, and that failure is what sharpens the claim: task_09-class problems may require post-training specifically on “I cannot compute this” signals, not just general chain-of-thought capacity.

What we were wrong about: prediction P2

[Observed: predictions/minimax-m2-5-agentic-core-v1.md]

Rigg filed six predictions. Five were correct. One failed:

Prediction	Outcome	Actual
P1: score ≥ 22/30	PASS	27/30
P2: task_09 ≥ 1/3	FAIL	0/3
P3: task_07 ≥ 2/3	PASS	3/3
P4: cost ≤ $0.10	PASS	$0.064
P5: zero infrastructure_error	PASS	0 errors
P6: outscores DeepSeek V3.2 (19/30)	PASS	27 > 19

P2 failed. F1 triggered. The prediction assumed that a model with an internal reasoning block would be more likely to detect the impossibility in task_09. The data says otherwise: MiniMax spends 15.47 seconds per run and still commits wrong_answer every time, the same outcome Mistral Large 3 reaches in 1.3 seconds. P2 is now evidence against the “reasoning helps with task_09” hypothesis, not just absent evidence.
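The scoring in the prediction table can be reproduced mechanically. The sketch below is an illustration with hypothetical key names; the thresholds and actual values are copied from the table, and nothing here reflects how the predictions file is actually structured.

```python
import operator

# Actual campaign results, copied from this report.
ACTUALS = {
    "score": 27,
    "task_09_passes": 0,
    "task_07_passes": 3,
    "cost_usd": 0.064,
    "infrastructure_errors": 0,
}
DEEPSEEK_V3_2_SCORE = 19  # comparison baseline used by P6

# (prediction id, key into ACTUALS, comparison, threshold)
PREDICTIONS = [
    ("P1", "score", operator.ge, 22),
    ("P2", "task_09_passes", operator.ge, 1),
    ("P3", "task_07_passes", operator.ge, 2),
    ("P4", "cost_usd", operator.le, 0.10),
    ("P5", "infrastructure_errors", operator.eq, 0),
    ("P6", "score", operator.gt, DEEPSEEK_V3_2_SCORE),
]

outcomes = {
    pid: "PASS" if compare(ACTUALS[key], threshold) else "FAIL"
    for pid, key, compare, threshold in PREDICTIONS
}
failed = [pid for pid, verdict in outcomes.items() if verdict == "FAIL"]
```

Writing each prediction as a threshold comparison before the run is what makes P2 a clean failure: 0 is not ≥ 1, and there is no room to reinterpret the result afterward.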

Cost position

[Observed: verification/cost_breakdown.csv, prior campaign articles]

$0.064 total. $0.0024 per passing run.

Model	Score	Cost/pass	Notes
Claude Sonnet 4.6	28/30	$0.0514	Anthropic
Devstral 2	27/30	$0.0019	Mistral, coding-specialist
Mistral Large 3	27/30	$0.0022	Mistral, dense 675B
MiniMax M2.5	27/30	$0.0024	MiniMax, dense
Kimi K2.5	24/30	$0.0044	Moonshot AI
GPT-OSS 120B	23/30	$0.0013	OpenAI
Qwen3 Next 80B A3B	21/30	$0.0012	Alibaba
DeepSeek V3.2	19/30	$0.0142	DeepSeek
NVIDIA Nemotron Super 3 120B	12/30	$0.0016	NVIDIA

(verified: verification/cost_breakdown.csv for MiniMax; prior campaign articles for other models. Table shows a subset of the full dataset.)

Pricing is $0.30 per million input tokens and $1.20 per million output tokens. The task_03 log investigation consumed 52,064 input tokens across 3 runs, making it the most expensive task in the campaign at $0.0178 (verified: verification/cost_breakdown.csv).
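As a rough check on the task_03 figure, the input-token cost can be computed from the stated prices, and the remainder attributed to output tokens. This is back-of-envelope arithmetic, not data from cost_breakdown.csv: it assumes the total contains only input and output token charges, and the implied output-token count is an estimate that inherits the rounding of the $0.0178 total.

```python
INPUT_USD_PER_MTOK = 0.30   # stated price per million input tokens
OUTPUT_USD_PER_MTOK = 1.20  # stated price per million output tokens

task03_input_tokens = 52_064  # across 3 runs, from the report
task03_total_usd = 0.0178     # across 3 runs, from the report

input_cost = task03_input_tokens * INPUT_USD_PER_MTOK / 1_000_000

# Assumption: whatever is not input cost is output cost (no other line items).
implied_output_cost = task03_total_usd - input_cost
implied_output_tokens = implied_output_cost / OUTPUT_USD_PER_MTOK * 1_000_000

input_share = input_cost / task03_total_usd

print(f"input cost: ${input_cost:.4f}")
print(f"input share of task_03 cost: {input_share:.0%}")
```

Input tokens account for about $0.0156 of the $0.0178, which is why the log investigation is the most expensive task: the cost is dominated by reading the log, not by what the model writes.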

The 27/30 tier now has three members with meaningfully different architectures: a coding specialist at 123B (Devstral 2), a dense generalist at 675B (Mistral Large 3), and a dense model with internal reasoning at an unknown parameter count (MiniMax M2.5). Their cost per passing run sits within a 26% range of each other. At this tier, model selection isn’t a quality decision: quality is essentially equivalent on this suite. It’s an architecture and pricing decision.
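The 26% figure and the per-pass cost follow directly from numbers already in this section. A short check, using only values copied from the cost table and the campaign total:

```python
# Cost per passing run for the three 27/30 models, from the cost table.
COST_PER_PASS = {
    "Devstral 2": 0.0019,
    "Mistral Large 3": 0.0022,
    "MiniMax M2.5": 0.0024,
}

cheapest = min(COST_PER_PASS.values())
priciest = max(COST_PER_PASS.values())

# How much more the most expensive model costs per pass than the cheapest.
spread = priciest / cheapest - 1

# MiniMax cost per passing run from the campaign total and pass count.
minimax_cost_per_pass = 0.064 / 27

print(f"spread across the 27/30 tier: {spread:.0%}")
print(f"MiniMax cost per pass: ${minimax_cost_per_pass:.4f}")
```

The spread is measured relative to the cheapest model in the tier (Devstral 2), so it reads as "the most expensive option costs about 26% more per pass than the cheapest".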

What we don’t know yet

[Speculation]

The reasoningContent block is visible in the API response but its content isn’t surfaced in the data we have. We can see that it fired before every tool call; we can’t read what it said. The +3 prediction delta is consistent with the block doing useful pre-action work, but testing MiniMax M2.5 with the reasoning block disabled (if that configuration is possible) would separate the contribution cleanly. We have not done that.

task_09 latency at 15.47s is based on 3 runs. That’s enough to establish that MiniMax reasons considerably longer on task_09 than Mistral Large 3, but not enough to characterise the full latency distribution on that task. Whether MiniMax catches the impossibility occasionally on a longer run series is unknown.

The 1M context window wasn’t meaningfully tested here. task_03 consumed roughly 50K input tokens across 3 runs. The long-context capability this lab markets most heavily is untested at scale on this harness.

ClawWorks Weekly