The plan stayed in its head
Campaign: 2026-05-22-kimi-k2-thinking-agentic-core-v1
Model: Kimi K2 Thinking (moonshot.kimi-k2-thinking, AWS Bedrock eu-west-1)
Harness: openclaw@2026.4.22
Runs: 30 (10 tasks × 3 runs each)
Campaign date: 2026-05-22
Three weeks ago Kimi K2.5 scored 24/30 here and landed solidly in the middle of the leaderboard. Moonshot AI’s second Bedrock entry, Kimi K2 Thinking, adds an explicit reasoning trace: chain-of-thought fires before every tool call, and you can see the model working through the problem before it acts. The question was whether that reasoning would close the gaps K2.5 couldn’t.
The answer is 12/30. Twelve points below its predecessor. Eighty percent more expensive per correct answer. The first clear regression from a reasoning variant in the dataset.
That’s not the only thing worth noting. There’s one task where the reasoning trace did exactly what it was supposed to do. Understanding both results is the point of this article.
What agentic-core-v1 actually measures
[Observed: harness spec]
Ten tasks, three runs each, 30 total. The tasks span the practical core of agentic software work: fix a failing test, refactor duplicated code, investigate a log file, trace execution through a codebase, implement a function against an ambiguous spec, apply a minimal fix, execute a multi-step sequential plan, recover from a tool error, identify an impossible computation, and run a SQL investigation.
A pass requires completing the task correctly. Failures are classified as wrong_answer (model produced incorrect output), gave_up_mid_plan (model abandoned mid-execution), or infra_error. Two tasks are structural traps built into the design. Task_09 gives the model 3 rows of data and asks for a 10-day moving average; correct response is to refuse. No non-reasoning model has passed task_09 more than once in a single campaign. Task_07 requires creating four files under steps/ sequentially; it has never produced a 0/3 in the dataset before this run.
What happened
[Observed]
12 of 30 runs passed. Pass rate: 40.0% (verified: pass_rate_by_task.csv). For comparison, K2.5 passed 24/30.
| Task | Score | K2.5 baseline | Delta |
|---|---|---|---|
| task_01 fix failing test | 1/3 | 3/3 | −2 |
| task_02 refactor duplicated code | 1/3 | 3/3 | −2 |
| task_03 investigate log | 2/3 | 2/3 | 0 |
| task_04 trace through codebase | 3/3 | 3/3 | 0 |
| task_05 minimal fix | 2/3 | 3/3 | −1 |
| task_06 handle ambiguous requirement | 0/3 | 1/3 | −1 |
| task_07 multi-step plan | 0/3 | 3/3 | −3 |
| task_08 recover from tool error | 1/3 | 3/3 | −2 |
| task_09 know when to stop | 1/3 | 0/3 | +1 |
| task_10 SQL investigation | 1/3 | 3/3 | −2 |
(verified: pass_rate_by_task.csv)
Total cost: $0.095 | $0.0079/pass | latency range: 0.77s–22.7s per run (verified: cost_breakdown.csv, latency_distribution.csv)
All 18 failures are wrong_answer. Zero infra_error, zero gave_up_mid_plan. The model never abandoned a task. It finished each one with an answer. The answers were wrong.
Task_07: the one that broke cleanly
[Observed]
Task_07 asks the model to create four files under steps/: step1.txt through step4.txt. It is one of the simpler structural tasks. Every previous model in the dataset, including models that scored below 20/30 overall, has scored at least 2/3 on task_07. Kimi K2.5 did it 3/3. Kimi K2 Thinking: 0/3. First model in the dataset to achieve that.
The tool call data tells you why. Average 0.7 tool calls per run for task_07 (verified: tool_calls_by_task.csv). Runs 2 and 3 each produced 0 tool calls: the model submitted without creating any files. Run 1 created 2 of the 4 required files before stopping. The task requires 4 sequential fs_write calls. Kimi K2 Thinking made an average of less than one across three attempts.
The latency data adds to the picture. Task_07 average latency was 1.75 seconds (verified: latency_distribution.csv). That is fast for a task that should require four file writes. Runs 2 and 3 completed in 0.77 and 0.83 seconds respectively. For comparison, task_04 (trace through codebase, which the model handled perfectly) averaged 14.41 seconds.
[Speculation]
The most straightforward reading: the reasoning trace worked through the plan internally, the model treated the reasoning output as the deliverable, and the tool calls never fired. There is direct support for this in how reasoningContent behaves in Bedrock’s extended thinking architecture: chain-of-thought output precedes action generation, and a model that completed its reasoning may not generate subsequent tool use if its training emphasized reasoning-as-output rather than reasoning-then-act.
This is a hypothesis. We have the tool call counts and the latency. We do not have visibility into the reasoningContent blocks themselves, which are not included in the agentic-core-v1 transcript schema. Whether the reasoning trace shows the complete plan is not verifiable from the current dataset.
Task_04: intact, which is informative
[Observed]
Task_04 requires reading 3–4 files and tracing execution through a codebase to a specific output value. No writes. No multi-step plan. Read, reason, answer. Kimi K2 Thinking scored 3/3.
Average tool calls: 6.3 per run (all reads). Average latency: 14.41 seconds, with a max of 22.7s (verified: latency_distribution.csv, tool_calls_by_task.csv). The reasoning trace is visibly doing work here. The output is correct every time.
The task_04 vs task_07 contrast is the sharpest signal in this campaign. Task_04: read everything, reason, conclude. The model excels. Task_07: plan a sequence of writes, execute them one by one. The model averages 0.7 tool calls and scores 0/3. These are not structurally similar tasks, and the reasoning variant’s performance diverges exactly where you’d expect if the hypothesis above holds.
Task_09: the one the reasoning trace was built for
[Observed]
Task_09 has been 0/3 for every non-reasoning model in the dataset. No exceptions. The task gives the model a 3-row CSV and asks for a 10-day moving average; the correct response is to refuse.
Kimi K2.5, in its run 3, produced the correct diagnosis ("not enough data for 10-day moving average"), but the output format didn’t match the checker’s expected shape and the score came back wrong_answer. The reasoning was there. The format wasn’t.
Kimi K2 Thinking scored 1/3 on task_09. Run 3 passed (transcript ref: data/transcripts/5f678dc6-706e-4807-80a1-a92ab433b20a.jsonl). The transcript shows two sequential reads of data.csv (the tool_call_redundancy evidence flags this at turn 2), followed by writing answer.txt with the content: "Note: Only 3 days of data available (2026-01-01 to 2026-01-03). A complete 10-day moving average requires at least 10 days of data." The checker credited it.
Runs 1 and 2 did not refuse. Run 1 computed a partial moving average treating 3 rows as sufficient. Run 2 did the same. Only run 3 applied the arithmetic gate (data rows < window size, refuse) before producing the answer.
This is what a reasoning trace is built for in this context. A bounded arithmetic check before action. Task_09 is the only task in agentic-core-v1 where the correct answer is nothing, and the reasoning variant is the first model in the dataset to produce that answer in a format the checker accepts.
The absolute improvement is small: 1/3 vs 0/3. It is the only positive delta in this campaign.
Task_06: reasoning satisfied internally, never written to disk
[Observed]
Task_06 requires implementing count_users(path) against a deliberately ambiguous spec and writing assumptions to note.txt. K2.5 scored 1/3 here; its run 3 self-corrected with a shell verification call and passed. Kimi K2 Thinking scored 0/3.
The consistent pattern: all three runs created counter.py correctly. None created note.txt. Average tool calls: 1.3 per run (min 0, max 2). One write call per run, for the Python file (verified: tool_calls_by_task.csv).
[Speculation]
The note.txt requirement asks the model to externalise its assumptions: write down what you assumed about the ambiguous spec. If the reasoning trace already contains that externalisation (the model writes out its assumptions in the reasoning pass before deciding to call a tool), the model may have satisfied the requirement internally and treated the output as complete. The tool call to write note.txt never fired.
This is untestable without the reasoningContent blocks. What the transcript data shows is that the model stopped after one write, every time, on a task where the spec explicitly requires two.
Cost math
[Observed]
$0.0079/pass vs K2.5’s $0.0044/pass (verified: cost_breakdown.csv). 80% more per correct answer, with 50% fewer correct answers.
Kimi K2 Thinking’s output token price is actually lower than K2.5: $2.50/1M vs $3.00/1M. Reasoning tokens add to output volume and inflate per-run cost even on failed runs. When the pass rate halved, the per-pass cost compounded in the wrong direction.
At $0.0079/pass, Kimi K2 Thinking sits between GLM-4.7-Flash ($0.0005/pass) and DeepSeek V3.2 ($0.014/pass) on cost, but at 40% pass rate it doesn’t belong in that tier.
Predictions vs actuals
[Observed]
1 of 3 predictions correct.
| Prediction | Claim | Result | Outcome |
|---|---|---|---|
| P1 | ≥26/30 overall | 12/30 | WRONG |
| P2 | task_09 ≥1/3 | 1/3 | CORRECT |
| P3 | task_06 ≥2/3 | 0/3 | WRONG |
P1 was the confident call: Kimi K2.5 at 24/30 is a strong baseline, and reasoning models have consistently matched or improved on their non-reasoning counterparts in this dataset. Actual: 12/30. The regression wasn’t within the range that would count as a minor underperformance. It’s the worst score in the Moonshot family and the first clear drop from a reasoning variant.
P3 was wrong in both direction and magnitude. Task_06 was predicted at ≥2/3 because it has never been a differentiating task. Every prior model scored at least 1/3. The reasoning-task interaction turned it into a 0/3.
Where this leaves the dataset
[Observed]
Every reasoning model tested on agentic-core-v1 before this campaign matched or improved on its non-reasoning counterpart. Kimi K2 Thinking is the first exception. The regression is specific: tasks that require sequential write execution (task_07, task_08) dropped the most, while tasks that require sustained read-and-reason work (task_04) held.
The failure mode is consistent across the campaign. 18 wrong answers, no gave_up runs, average latency on failed tasks is lower than on passed tasks. The model completes every run quickly and confidently. It is not uncertain. It is wrong.
[Unobserved]
We did not see any gave_up_mid_plan failures in this campaign. No run ran long. Task_07’s 0-tool-call runs completed in under 1 second. Whatever the cause, it is not hesitation or overextension.
We looked for the diagnosis-then-regression pattern (model states a diagnosis, then walks it back) across all 30 transcripts and found zero matches (verified: evidence/diagnosis_then_regression.md). The model does not second-guess itself.
What we don’t know yet
The reasoning trace for this model uses Bedrock’s reasoningContent block, which is not captured in the agentic-core-v1 transcript schema. The hypothesis (that plans stay in the model’s reasoning output and don’t generate downstream tool calls) is consistent with the tool call counts and latency distributions. It is not confirmed. Logging reasoningContent blocks alongside tool calls is the experiment that would settle it.
Task_09’s 1/3 is the first clean task_09 pass from a Moonshot model and the clearest example in the dataset of a reasoning trace providing correct pre-action validation. Whether that’s reproducible in a second campaign, or whether run 3 was a high-temperature sampling outcome where the arithmetic gate happened to fire, is an open question. Task_09 in isolation with more runs would answer it.
The task_06 note.txt omission pattern appeared in K2.5 on two of three runs, and in K2 Thinking on all three. This is now two consecutive Moonshot campaigns showing the same task_06 failure shape. Whether that’s a model-family trait or a task-framing issue is worth testing with a differently-structured ambiguous-requirement task.
Leaderboard
[Observed]
| Model | Score | Cost/pass |
|---|---|---|
| GLM-4.7 | 28/30 | $0.0010 |
| GLM-4.7-Flash | 28/30 | $0.0005 |
| Devstral 2 123B | 27/30 | $0.0019 |
| Mistral Large 3 675B | 27/30 | $0.0022 |
| Kimi K2.5 | 24/30 | $0.0044 |
| Kimi K2 Thinking | 12/30 | $0.0079 |
(verified: cost_breakdown.csv, leaderboard as of 2026-05-22)
Kimi K2 Thinking scores 12 points below its predecessor and 15 points below the current leaders. On write operations, multi-step plans, and error recovery, every other model in the top half of the leaderboard succeeds consistently.