The easy ones aren't free

Campaign: 2026-05-19-kimi-k2-5-agentic-core-v1
Model: Kimi K2.5 (moonshotai.kimi-k2.5, AWS Bedrock us-east-1)
Harness: openclaw@2026.4.22
Runs: 30 (10 tasks × 3 runs each)
Campaign date: 2026-05-19


Moonshot AI built its reputation on long-context document Q&A. Kimi Chat grew to millions of users by doing one thing well: ingest a long document and track it without losing the thread. K2.5 is the agentic follow-on, the model Moonshot is shipping on AWS Bedrock as an API product, positioned against DeepSeek and Qwen for multi-step code and tool-use work.

This campaign is Moonshot AI’s first appearance in the modelbattles dataset. The question going in was whether long-context heritage translates to practical sequential execution, the kind of work agentic-core-v1 actually measures. The answer: mostly yes. The interesting result is the failure that shouldn’t have happened.


What the harness actually tests

[Observed: harness spec]

agentic-core-v1 runs each model on 10 tasks, 3 times each, 30 total runs. The tasks cover the core loop of an agentic software workflow: fix a failing test, refactor duplicated code, investigate a log file, trace through a codebase, handle an ambiguous requirement, execute a minimal fix, plan and complete a multi-step sequential write, recover from a tool error, identify an impossible computation, and run a SQL investigation.

A pass requires completing the task correctly. Failures are classified as wrong_answer (incorrect output), gave_up_mid_plan (model abandoned the task mid-execution), or a tool loop that never resolved. Two tasks are structural traps. Task_09 supplies 3 rows of data and asks for a 10-day moving average; the correct response is to refuse, and every non-reasoning model so far has failed it. Task_07 requires four sequential file writes with verification after each step; it tests whether a model holds a plan without collapsing partway through.
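To make the task_09 trap concrete, here is a minimal sketch of a moving-average helper that refuses when the window is longer than the series. The function name, the exception class, and the three sample values are illustrative assumptions, not the harness's fixture or checker.

```python
from typing import Sequence


class InsufficientDataError(ValueError):
    """Raised when the requested window is longer than the series."""


def moving_average(values: Sequence[float], window: int) -> list:
    """Return the simple moving average for every full window in `values`."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        # The task_09 situation: 3 rows cannot fill a 10-day window,
        # so the only correct move is to refuse rather than improvise.
        raise InsufficientDataError(
            f"need at least {window} rows, got {len(values)}"
        )
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest.
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages


# Hypothetical 3-row input standing in for the task data.
try:
    moving_average([100.0, 150.0, 200.0], window=10)
    outcome = "computed"
except InsufficientDataError as exc:
    outcome = f"refused: {exc}"

print(outcome)
```

Averaging the rows that happen to exist, or switching to a cumulative average, returns a number but answers a different question from the one asked.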


What Kimi K2.5 did

[Observed]

24 of 30 runs passed. Pass rate: 80.0% (verified: pass_rate_by_task.csv). Seven task types were clean. Three weren’t.

| Task | Result | Notes |
|---|---|---|
| task_01 fix failing test | 3/3 | |
| task_02 refactor duplicated code | 3/3 | |
| task_03 investigate log | 2/3 | Run 3 hit max_turns (17); runs 1–2 correct |
| task_04 trace through codebase | 3/3 | |
| task_05 minimal fix | 3/3 | |
| task_06 handle ambiguous requirement | 1/3 | Runs 1–2 wrong_answer |
| task_07 multi-step plan | 3/3 | |
| task_08 recover from tool error | 3/3 | |
| task_09 know when to stop | 0/3 | Wrong answers × 2, then correct reasoning rejected on format |
| task_10 SQL investigation | 3/3 | |

(verified: pass_rate_by_task.csv)

Total cost: $0.10558 | $0.0044/pass | avg latency: ~8s (verified: cost_breakdown.csv)
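The headline numbers can be re-derived from the per-task results. A short sketch using only the figures reported above; the variable names are ours:

```python
# Passes out of 3 runs per task, copied from the results table.
passes = {
    "task_01": 3, "task_02": 3, "task_03": 2, "task_04": 3, "task_05": 3,
    "task_06": 1, "task_07": 3, "task_08": 3, "task_09": 0, "task_10": 3,
}
RUNS_PER_TASK = 3
TOTAL_COST_USD = 0.10558

total_passes = sum(passes.values())
total_runs = RUNS_PER_TASK * len(passes)
pass_rate = total_passes / total_runs
cost_per_pass = TOTAL_COST_USD / total_passes

print(f"{total_passes}/{total_runs} passed ({pass_rate:.1%})")
print(f"${cost_per_pass:.4f} per pass")
```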


Why did task_06 burn two runs?

[Observed]

Task_06 is not one of the hard tasks. Implement count_users() against an ambiguous spec and document your assumptions in note.txt. Every prior model on agentic-core-v1 has scored 2/3 or 3/3 on it. Kimi K2.5 scored 1/3.

The failure is reproducible. Runs 1 and 2 each used exactly 2 tool calls: one fs_write to src/counter.py, one fs_write to note.txt. Both were rejected by the checker. Run 3 used 4 tool calls: the same two writes, plus a shell step to run python3 against the function before completing. Run 3 passed.

The checker validates actual function behavior, not file presence. Runs 1 and 2 wrote files that looked plausible but weren’t verified. Run 3 checked its own work first.
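The gap between "files exist" and "function behaves" is easy to show. The sketch below is not the harness checker: the test cases, the reading of the spec (duplicates count once), and both submissions are hypothetical, chosen only to illustrate how a plausible-looking file can pass a presence check and fail a behavior check.

```python
import importlib.util
import os
import tempfile


def write_submission(workdir: str, counter_source: str) -> None:
    """Write the two files the task asks for into a scratch directory."""
    os.makedirs(os.path.join(workdir, "src"), exist_ok=True)
    with open(os.path.join(workdir, "src", "counter.py"), "w") as f:
        f.write(counter_source)
    with open(os.path.join(workdir, "note.txt"), "w") as f:
        f.write("Assumption: each user is counted once.\n")


def presence_check(workdir: str) -> bool:
    """Weak check: only confirms that both files exist."""
    paths = [("src", "counter.py"), ("note.txt",)]
    return all(os.path.exists(os.path.join(workdir, *p)) for p in paths)


def behavior_check(workdir: str, cases) -> bool:
    """Strict check: import the submission and call count_users() on each case."""
    path = os.path.join(workdir, "src", "counter.py")
    spec = importlib.util.spec_from_file_location("counter", path)
    module = importlib.util.module_from_spec(spec)
    try:
        spec.loader.exec_module(module)
        return all(module.count_users(arg) == want for arg, want in cases)
    except Exception:
        return False


# Hypothetical reading of the ambiguous spec: duplicates count once.
CASES = [(["ann", "bob", "ann"], 2), ([], 0)]

PLAUSIBLE_BUT_WRONG = "def count_users(users):\n    return len(users)\n"
VERIFIED = "def count_users(users):\n    return len(set(users))\n"

with tempfile.TemporaryDirectory() as unverified, tempfile.TemporaryDirectory() as verified:
    write_submission(unverified, PLAUSIBLE_BUT_WRONG)
    write_submission(verified, VERIFIED)
    results = {
        "unverified": (presence_check(unverified), behavior_check(unverified, CASES)),
        "verified": (presence_check(verified), behavior_check(verified, CASES)),
    }

print(results)
```

Both submissions pass the presence check. Only the one whose behavior matches the cases passes the strict check, which is the distinction a shell step like run 3's can surface before the task is declared done.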

Task_06 averaged 2.67 tool calls per run for Kimi K2.5 (verified: tool_calls_by_task.csv). For comparison: task_07, task_08, and task_10, all clean tasks with unambiguous correctness criteria, averaged between 2.0 and 4.33 tool calls. The number of tool calls isn’t the issue. The issue is which ones get skipped. On tasks with clear, verifiable success criteria, skipping a verification step usually doesn’t matter. On task_06, where the spec is deliberately vague and the checker is strict about behavior, it burns two of three runs.
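The task_06 average follows directly from the three run counts reported above. A small sketch, with the tuple layout as our own convention:

```python
# (tool_calls, passed) for task_06 runs 1-3, as reported in this section.
task_06_runs = [(2, False), (2, False), (4, True)]

avg_calls = sum(calls for calls, _ in task_06_runs) / len(task_06_runs)
calls_when_passing = [calls for calls, passed in task_06_runs if passed]
calls_when_failing = [calls for calls, passed in task_06_runs if not passed]

print(f"average tool calls: {avg_calls:.2f}")
print(f"passing runs: {calls_when_passing}, failing runs: {calls_when_failing}")
```

The average alone hides the split: the only passing run is also the only one that spent extra calls checking its own output.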

[Speculation]

This is consistent with a model that defaults to the minimal viable sequence when a task appears familiar: write the file, document the assumption, done. The extra shell-verification step in run 3 suggests Kimi K2.5 is capable of the discipline; it just doesn’t apply it consistently unless something in the sampling pushes it toward a more cautious path. Whether task structure (vague spec) reliably triggers that push is unknown without more runs.


Did long-context training help on task_03?

[Observed]

The pre-campaign hypothesis: Kimi K2.5’s long-context background would help on task_03, the log investigation. DeepSeek V3.2 scored 0/3 on task_03 across its campaign: it entered a search loop and couldn’t exit. If Kimi K2.5’s training emphasized extended document analysis, the result should be better.

It was. Runs 1 and 2 were efficient: 3 tool calls each, correct root-cause identification (database connection pool exhaustion at 10:05), both passed. Run 3 entered the same kind of loop DeepSeek V3.2 encountered: 8 tool calls, hit the 17-turn limit, gave_up_mid_plan. Final score: 2/3 (verified: pass_rate_by_task.csv).
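For readers unfamiliar with turn budgets, a minimal sketch of how a capped loop produces a gave_up_mid_plan outcome. This is not openclaw's actual loop: run_episode, the two toy agents, and the "completed" label are hypothetical. Only the 17-turn limit and the failure label come from the campaign data.

```python
from typing import Callable, Tuple

MAX_TURNS = 17  # the limit reported for task_03 run 3


def run_episode(take_turn: Callable[[int], bool],
                max_turns: int = MAX_TURNS) -> Tuple[str, int]:
    """Call take_turn(turn) until it reports completion or the budget runs out."""
    for turn in range(1, max_turns + 1):
        if take_turn(turn):
            return "completed", turn
    # Budget exhausted without the agent declaring the task done.
    return "gave_up_mid_plan", max_turns


def converging_agent(turn: int) -> bool:
    """Toy agent that settles on an answer after a few turns (hypothetical)."""
    return turn >= 4


def looping_agent(turn: int) -> bool:
    """Toy agent that keeps searching and never commits (hypothetical)."""
    return False


print(run_episode(converging_agent))
print(run_episode(looping_agent))
```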

The 2/3 is a solid result but below the ceiling. Several models across the dataset have scored 3/3 on task_03: Claude Sonnet 4.6, DeepSeek V4-Flash, GPT-5.5 Instant, Mistral Large 3, and Devstral 2 all cleared it. Kimi K2.5 joins DeepSeek V3.2 in the 0–2/3 range for this task.

[Unobserved]

We did not find evidence that run 3’s loop was caused by a variation in task environment or input ordering. The setup was identical across all three runs. The divergence appears to be sampling variance, but we don’t have a transcript-level explanation for why the same model strategy succeeds twice and collapses once on the same task.


Task_09 run 3: a different kind of 0/3

[Observed]

Task_09 is 0/3 for Kimi K2.5. That is the expected result and matches every non-reasoning model in the dataset. The run 3 transcript is worth examining anyway.

Runs 1 and 2 produced wrong numeric answers: run 1 computed a cumulative moving average, run 2 returned a simple mean of 150. Both were confident, both were wrong.

Run 3 wrote: "error: not enough data for 10-day moving average (only 3 days provided)". That is correct reasoning. The checker rejected it: the response string didn’t match the expected answer format. The score shows wrong_answer for run 3, same as runs 1 and 2.

Devstral 2 and DeepSeek V4-Flash each scored 1/3 on task_09: one clean, format-matched refusal per campaign, acknowledged and credited by the checker. Kimi K2.5 run 3 produced what looks like the same correct reasoning, but the format didn’t match and the checker returned wrong_answer. The 0/3 on the scoreboard looks the same as Mistral Large 3’s 0/3 or GPT-5.5 Instant’s 0/3. The mechanism isn’t.

| Model | task_09 | Behaviour |
|---|---|---|
| Claude Sonnet 4.6 | 1/3 | Caught impossibility once |
| Devstral 2 123B | 1/3 | Caught it once |
| Mistral Large 3 | 0/3 | wrong_answer × 3 |
| GPT-5.5 Instant | 0/3 | wrong_answer × 3 |
| Kimi K2.5 | 0/3 | wrong_answer × 2, then format-rejected refusal |

(verified: pass_rate_by_task.csv, task_09 run data)

This is a harness question as much as a model question. The checker credits a task_09 refusal only when the output matches the expected format, and Kimi K2.5’s run 3 refusal did not. If format-lenient scoring applied, Kimi K2.5 would be 1/3. Whether to add that scoring mode is a decision for the harness roadmap. What the data shows: at least one run produced correct reasoning that the checker cannot currently credit.
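A sketch of what the two scoring modes could look like side by side. The expected refusal string and the lenient pattern are hypothetical, since the harness's real format is not reproduced in this write-up. The run 3 response is quoted from the transcript.

```python
import re

# Hypothetical stand-in for the format the checker expects.
EXPECTED_REFUSAL = "INSUFFICIENT_DATA"

# Hypothetical lenient rule: any response that says the data is not enough.
REFUSAL_PATTERN = re.compile(r"\b(not enough|insufficient)\s+data\b", re.IGNORECASE)


def strict_score(response: str) -> bool:
    """Credit only an exact match on the expected refusal string."""
    return response.strip() == EXPECTED_REFUSAL


def lenient_score(response: str) -> bool:
    """Credit an exact match or any response that clearly refuses."""
    return strict_score(response) or bool(REFUSAL_PATTERN.search(response))


RUN_3 = "error: not enough data for 10-day moving average (only 3 days provided)"
RUN_2 = "150"  # the simple mean returned in run 2

for label, response in [("run 2", RUN_2), ("run 3", RUN_3)]:
    print(label, strict_score(response), lenient_score(response))
```

A lenient rule like this trades precision for recall: it would also credit a response that mentions insufficient data and then reports a number anyway, so a real implementation would need to rule that case out.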


Leaderboard position

[Observed]

Selected models (subset, not the full dataset):

| Model | Score | Cost/pass |
|---|---|---|
| Claude Sonnet 4.6 | 28/30 | $0.0514 |
| DeepSeek V4-Flash | 28/30 | $0.0014 |
| GPT-5.5 Instant | 27/30 | $0.0700 |
| Mistral Large 3 675B | 27/30 | $0.0022 |
| Devstral 2 123B | 27/30 | $0.0019 |
| Kimi K2.5 | 24/30 | $0.0044 |
| Gemma 4 31B | 23/30 | $0.00 |
| GPT-OSS 120B | 23/30 | $0.0013 |
| Qwen3 Next 80B A3B | 21/30 | $0.00122 |
| Llama 3.3 70B | 20/30 | $0.0045 |
| DeepSeek V3.2 | 19/30 | $0.014 |

(verified: cost_breakdown.csv, leaderboard as of 2026-05-19)

Kimi K2.5 is the only Moonshot AI model in this dataset so far. Among Chinese-lab entries it sits below DeepSeek V4-Flash (28/30) but above Qwen3 Next 80B A3B (Alibaba, 21/30) and DeepSeek V3.2 (19/30). Alibaba and DeepSeek get the most column space in Western AI coverage; Moonshot outscores Qwen3 Next and DeepSeek V3.2 here, though not DeepSeek V4-Flash.

The cost story is less comfortable. At $0.0044/pass, Kimi K2.5 costs 2.3× as much per correct answer as Devstral 2 ($0.0019/pass), which scores three points higher. It also costs more than three times as much per pass as GPT-OSS 120B ($0.0013/pass) while scoring only one point higher. Setting aside Claude Sonnet 4.6 ($0.0514) and GPT-5.5 Instant ($0.0700), which cost an order of magnitude more per pass, the only models in this table that Kimi K2.5 beats on cost-per-pass are DeepSeek V3.2 ($0.014) and Llama 3.3 70B ($0.0045), and neither is a strong comparison.

One confounding factor: task_03 alone accounts for $0.0549 of the $0.106 total campaign cost, 52% of the spend (verified: cost_breakdown.csv). Run 3’s investigation loop drove an output-token spike that distorts the per-task average. If task_03 run 3 had completed cleanly like runs 1 and 2, the total cost would have been materially lower. The structural position vs Devstral 2 wouldn’t change, but the gap would narrow.
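The cost comparisons in this section reduce to a few divisions. A sketch using only the figures quoted above; the constant names are ours:

```python
# Cost per pass in USD, from the leaderboard table.
KIMI_K2_5 = 0.0044
DEVSTRAL_2 = 0.0019
GPT_OSS_120B = 0.0013

# Campaign cost in USD, from cost_breakdown.csv as quoted in the text.
TOTAL_COST = 0.10558
TASK_03_COST = 0.0549

ratio_vs_devstral = KIMI_K2_5 / DEVSTRAL_2
ratio_vs_gpt_oss = KIMI_K2_5 / GPT_OSS_120B
task_03_share = TASK_03_COST / TOTAL_COST
other_tasks_cost = TOTAL_COST - TASK_03_COST

print(f"vs Devstral 2: {ratio_vs_devstral:.1f}x")
print(f"vs GPT-OSS 120B: {ratio_vs_gpt_oss:.1f}x")
print(f"task_03 share of spend: {task_03_share:.0%}")
print(f"other nine tasks combined: ${other_tasks_cost:.4f}")
```

The last line is the useful one for the confound: the other nine tasks together cost slightly less than task_03 did on its own.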


What the predictions got wrong

[Observed]

6 of 6 binary predictions were correct. The point estimate was 25/30; actual was 24/30. The miss: task_06 was predicted at 3/3 because every prior model had scored at least 2/3, and task_06 has never been a differentiator in either direction. Actual: 1/3. Task_05 was under-predicted at 2/3 and came in at 3/3, which recovered one of the two points lost on task_06 (verified: predictions/kimi-k2-5-agentic-core-v1.md).
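The one-point gap reconciles as follows. The sketch assumes the other eight per-task predictions matched exactly, which is consistent with the totals but is not stated task by task here:

```python
POINT_ESTIMATE = 25

# task -> (predicted passes, actual passes), for the two tasks that deviated.
deviations = {
    "task_06": (3, 1),
    "task_05": (2, 3),
}

net_change = sum(actual - predicted for predicted, actual in deviations.values())
reconciled_total = POINT_ESTIMATE + net_change

print(f"net change: {net_change:+d}, reconciled total: {reconciled_total}/30")
```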

The task_06 miss matters more than the one-point gap. It breaks a prior that held for every previous campaign: task_06 is always at least 2/3. That prior is now wrong, and it was wrong specifically for a model that succeeded on task_07 (multi-step sequential execution) and task_08 (error recovery). Task_06 isn’t harder than those; it just requires a verification step that looks optional when you already know how to write the function.


What we don’t know yet

[Speculation]

The task_06 failure pattern predicts trouble on any task where the spec is partially specified and the checker validates behavior rather than output structure. Whether that’s a consistent trait of Kimi K2.5 or specific to how task_06 is framed requires a follow-up campaign with a differently-structured ambiguous-requirement task.

The task_03 run 3 divergence has no explanation in the current data. Same task, same model, same inputs. Two clean runs, then a loop failure. Whether that’s temperature-sampling variance or something about the log investigation task that occasionally triggers a different search strategy is an open question.

The task_09 run 3 result suggests Kimi K2.5 may have stronger data-validation reflexes than most non-reasoning models at this tier. Running task_09 in isolation with 9 or 12 runs would show whether that refusal is reproducible or a one-off. It would also answer a more specific question: does Moonshot’s long-context training produce better implicit input-validation, or did run 3 just get lucky?
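A rough sense of why 3 runs can't settle this: the sketch below computes a Wilson score interval for a refusal rate, counting run 3 as a refusal even though the checker did not. The interval choice is ours, not part of the harness, and the 4-of-12 figure is a hypothetical outcome used only to show how the range tightens.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


observed = wilson_interval(1, 3)       # one refusal in this campaign's 3 runs
hypothetical = wilson_interval(4, 12)  # same rate, if it held over 12 runs

print(f"1 of 3:  {observed[0]:.2f} to {observed[1]:.2f}")
print(f"4 of 12: {hypothetical[0]:.2f} to {hypothetical[1]:.2f}")
```

With 3 runs the plausible range spans most of the unit interval. Twelve runs at the same rate would narrow it noticeably, though still not enough to separate a one-in-three reflex from a one-in-five one.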

ClawWorks Weekly

AI benchmarks, trading bots, and security research — what's actually happening this week.