Zero redundancy at the top tier

Campaign: 2026-05-20-glm-5-agentic-core-v1
Model: GLM-5 (zai.glm-5, AWS Bedrock us-east-1, ON_DEMAND)
Harness: openclaw@2026.4.22
Runs: 30 (10 tasks x 3 runs each)
Campaign date: 2026-05-20 (run at 12:22–12:32 UTC, 567s wall clock)


Zhipu AI is a Beijing-based lab founded in 2019 as a Tsinghua University spin-out. Their ChatGLM series launched publicly before ChatGPT existed. Five years of iteration later, GLM-5 is their current model, and it scored 27/30 on agentic-core-v1, matching MiniMax M2.5, Mistral Large 3, and Devstral 2 at the second-highest tier in this dataset, one point below Claude Sonnet 4.6’s 28/30.

Before this campaign, Zhipu AI had no data in our dataset. The pre-run prediction was 23/30. Actual was 27/30. Two predictions were falsified, both in the positive direction. F1 triggered.

The efficiency profile is what stands out. Zero tool redundancy across all 30 runs. No diagnosis-then-regression anywhere in the campaign. No long-tail turn counts. And on task_04, the codebase trace, GLM-5 used exactly 7 tool calls in each of the three runs. That kind of within-task consistency across separate runs is not typical.


What the harness tests

[Observed: harness spec]

agentic-core-v1 has 10 tasks, each run 3 times independently, for 30 total runs. Every task has a deterministic checker. The output clears acceptance criteria or it does not. No partial credit.

The 10 task types: fix a failing test, refactor duplicated code, investigate a large log file, trace through a codebase, apply a minimal fix under a strict line-count constraint, handle an intentionally ambiguous requirement, execute a sequential multi-step plan using only file write calls, recover from a deliberate tool error, recognise that a computation is structurally impossible and refuse to produce an answer, and run an SQL investigation using native database tools.

A run passes when committed output clears the checker within the 15-turn budget. Failure modes are wrong_answer (checker rejects) or gave_up_mid_plan (turn limit reached without a committed answer).
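The pass rule above can be written as a small decision function. This is a minimal sketch, not the harness's actual code: `classify_run`, its parameters, and the `TURN_BUDGET` constant are hypothetical names. Only the 15-turn limit and the two failure labels come from the spec.

```python
TURN_BUDGET = 15  # from the harness spec: a run gets at most 15 turns


def classify_run(committed: bool, checker_ok: bool, turns_used: int) -> str:
    """Map one run to 'pass', 'wrong_answer', or 'gave_up_mid_plan'.

    Hypothetical helper; the real harness may record outcomes differently.
    """
    if not committed or turns_used > TURN_BUDGET:
        # No committed answer inside the turn budget.
        return "gave_up_mid_plan"
    # A committed answer is binary: the checker accepts it or rejects it.
    return "pass" if checker_ok else "wrong_answer"


print(classify_run(committed=True, checker_ok=False, turns_used=4))
```

There is deliberately no branch for partial credit, matching the deterministic checkers described above.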

task_09 (know_when_to_stop) is structurally distinct. The model receives a 3-row CSV and is asked to compute a 10-day moving average. Three data points cannot produce a 10-day moving average. The correct response is to recognise this and decline. No model in this dataset has scored 3/3 on task_09.
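To make the impossibility concrete, here is a minimal sketch of a moving-average helper that checks the input length before computing. The function name, the `None` return convention, and the sample values are illustrative assumptions; the task's actual CSV contents are not reproduced here.

```python
from __future__ import annotations

from typing import Sequence


def moving_average(values: Sequence[float], window: int) -> list[float] | None:
    """Return trailing moving averages, or None if the window cannot be filled."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        # Fewer data points than the window: no valid average exists.
        return None
    return [
        sum(values[end - window:end]) / window
        for end in range(window, len(values) + 1)
    ]


three_rows = [101.0, 102.5, 99.8]  # hypothetical stand-in for the 3-row CSV
print(moving_average(three_rows, 10))  # prints None: 3 rows cannot fill a 10-day window
```

The length check is the whole task: any numeric output for a 10-day window over three rows is a wrong answer by construction.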


What GLM-5 did

[Observed: verification/pass_rate_by_task.csv, verification/failure_mode_histogram.csv]

27 of 30 runs passed. Pass rate: 90%. Nine of ten task types went 3/3. task_09 went 0/3. All three failures were wrong_answer. No infrastructure errors.

| Task | Result | Avg tool calls | Avg latency |
|---|---|---|---|
| task_01 fix failing test | 3/3 | 5.3 | 18.57s |
| task_02 refactor duplicated code | 3/3 | 5.7 | 22.60s |
| task_03 investigate log | 3/3 | 3.3 | 17.10s |
| task_04 trace through codebase | 3/3 | 7.0 | 16.86s |
| task_05 minimal fix | 3/3 | 6.3 | 18.17s |
| task_06 handle ambiguous requirement | 3/3 | 8.7 | 31.55s |
| task_07 multi-step plan | 3/3 | 8.0 | 16.71s |
| task_08 recover from tool error | 3/3 | 2.0 | 9.79s |
| task_09 know when to stop | 0/3 | 3.3 | 19.41s |
| task_10 SQL investigation | 3/3 | 3.3 | 16.69s |

(verified: verification/pass_rate_by_task.csv, verification/tool_calls_by_task.csv, verification/latency_distribution.csv)
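The headline numbers follow directly from the per-task rows. The sketch below recomputes them from values copied out of the table above; the variable names are illustrative, and the real CSV files may use a different layout.

```python
# (task, passes out of 3, avg tool calls, avg latency in seconds)
RESULTS = [
    ("task_01", 3, 5.3, 18.57),
    ("task_02", 3, 5.7, 22.60),
    ("task_03", 3, 3.3, 17.10),
    ("task_04", 3, 7.0, 16.86),
    ("task_05", 3, 6.3, 18.17),
    ("task_06", 3, 8.7, 31.55),
    ("task_07", 3, 8.0, 16.71),
    ("task_08", 3, 2.0, 9.79),
    ("task_09", 0, 3.3, 19.41),
    ("task_10", 3, 3.3, 16.69),
]
RUNS_PER_TASK = 3

total_runs = len(RESULTS) * RUNS_PER_TASK
total_passed = sum(passes for _, passes, _, _ in RESULTS)
pass_rate = total_passed / total_runs
failed_tasks = [task for task, passes, _, _ in RESULTS if passes < RUNS_PER_TASK]
slowest_task = max(RESULTS, key=lambda row: row[3])[0]

print(f"{total_passed}/{total_runs} passed ({pass_rate:.0%})")
print("tasks below 3/3:", failed_tasks)
print("slowest task by avg latency:", slowest_task)
```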

All four evidence bundles came back clean: tool_call_redundancy (calling a tool whose result is already in context) 0/30, diagnosis_then_regression (correct identification of a fix followed by reverting to a wrong approach) 0/30, long_tail_turn_count (a run using significantly more turns than the per-task median) 0/30, cross_task_consistency (variance in tool-call depth or approach across the 3 independent runs of the same task) 0/30 (verified: evidence_bundles, data_pack.json). No other model at the 27/30 tier has a completely clean sheet across all four. MiniMax M2.5 had one tool_call_redundancy instance on task_01 run 2; GLM-5 had none.
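As an illustration of what a tool_call_redundancy check might look for, the sketch below counts calls that repeat an earlier call with identical arguments. This is an assumption about the detection rule, not the bundle's real logic: the transcript shape (a list of `(tool_name, args)` pairs), the tool names, and the exact-match criterion are all hypothetical.

```python
import json


def count_redundant_calls(calls):
    """Count calls whose (tool, arguments) pair already appeared in the run.

    Assumption: an identical repeat means the result was already in context.
    A real check would also need to treat a re-read after a write as fresh.
    """
    seen = set()
    redundant = 0
    for tool_name, args in calls:
        # Serialise arguments with sorted keys so dict ordering does not matter.
        key = (tool_name, json.dumps(args, sort_keys=True))
        if key in seen:
            redundant += 1
        else:
            seen.add(key)
    return redundant


# Hypothetical transcript: the second read of app.py repeats the first.
example_run = [
    ("read_file", {"path": "app.py"}),
    ("read_file", {"path": "tests/test_app.py"}),
    ("read_file", {"path": "app.py"}),
]
print(count_redundant_calls(example_run))  # prints 1
```

Under a rule like this, a 0/30 result means no run contained a single repeated call.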


Why did the prediction miss by four?

[Observed: predictions/glm-5-agentic-core-v1.md, data_pack.json]

Four tasks predicted at risk all came in at 3/3:

task_03 (investigate_log): GLM-5 swept this at 3.3 avg tool calls, 17.10s/run, and $0.0588 total across 3 runs (verified: verification/cost_breakdown.csv). That $0.059 is 33% of total campaign spend, driven by one run that used 29,993 input tokens at roughly $0.031 to read deeper into the access log. The model still produced the correct finding. task_03 is not uniformly easy in this dataset: DeepSeek V3.2 (19/30 overall) scored 0/3 on this task. GLM-5’s handling of a 30K-token read without stalling or looping is what took task_03 to 3/3 and falsified P4 in the positive direction.

task_04 (trace_through_codebase): 7.0 avg tool calls, which is the per-run figure too: each of the three runs used exactly 7 calls. Read entry point, trace each hop, write result. Three independent runs, identical tool-call depth. The cross_task_consistency bundle found nothing unusual here because the pattern is clean, not erratic. Compare this with MiniMax M2.5’s task_04, which averaged 6.7 calls with some variance. GLM-5’s task_04 trace looks like a model that has a stable internal plan for what code-tracing requires, not a model that arrives at the answer by trial and adjustment.

task_06 (handle_ambiguous_requirement): 8.7 avg tool calls, 31.55s/run. This is the slowest task in the campaign. The requirement has an intentional gap: the model must surface an assumption note rather than proceed silently to implementation. Shortcuts fail here. GLM-5 read the spec, wrote the implementation, and logged assumptions correctly on all three runs. The latency is the tell: 31.55s reads as deliberation, not hesitation.

task_07 (multi_step_plan): 8.0 avg tool calls across 3 runs, but individual run variance was the widest of any task in the campaign: the brief records run 2 at 12 calls and run 3 at 4 calls. The sequential 4-step plan executed correctly regardless of how many calls it took. For comparison, NVIDIA Nemotron Super 3 120B (12/30 overall) scored 0/3 on task_07, the only model in the dataset to completely fail it (prior campaign article: nemotron-super-3-120b-agentic-core-v1-2026). GLM-5 passed all three runs with the widest tool-call variance in its own campaign, which is the honest description of what happened.
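The contrast between task_04 and task_07 can be made explicit with a little arithmetic. In this sketch the task_07 run 1 count is inferred, not observed: it is derived from the reported 8.0 average and the two run counts given above. The variable names are illustrative.

```python
from statistics import mean, pstdev

RUNS = 3

# task_07: the average and runs 2 and 3 are reported; run 1 is inferred.
task_07_avg = 8.0
task_07_run2, task_07_run3 = 12, 4
task_07_run1 = round(task_07_avg * RUNS) - task_07_run2 - task_07_run3
task_07 = [task_07_run1, task_07_run2, task_07_run3]

# task_04: exactly 7 calls in each of the three runs.
task_04 = [7, 7, 7]


def spread(calls):
    """Range of tool-call counts across the runs of one task."""
    return max(calls) - min(calls)


for name, calls in (("task_04", task_04), ("task_07", task_07)):
    print(name, calls, "mean", mean(calls), "range", spread(calls),
          "pstdev", round(pstdev(calls), 2))
```

Both tasks went 3/3; only the path to the answer varied.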

[Speculation]

The four tasks the prediction flagged as at risk share a structural trait: each requires reading carefully before acting. task_03 demands reading a large log without truncating. task_04 demands a stable trace plan. task_06 demands catching an ambiguity before touching code. task_07 demands committing to a sequence before writing. None of these requires reasoning per se; they require that the model not skip the reading step.

GLM-5 is a dense general-purpose model with no explicit reasoning mode (reasoning_content: false in the data pack). The pre-run prediction assumed that without an internal chain-of-thought mechanism, the model would shortcut on these tasks. It didn’t. What GLM-5 appears to have is stable attention across reading-heavy tasks, not pre-action deliberation. Those can produce the same outcome profile while being architecturally different.


The task_09 wall

[Observed: verification/pass_rate_by_task.csv, prior campaign articles]

task_09 (know_when_to_stop): 0/3, wrong_answer on every run. Avg tool calls: 3.3. Avg latency: 19.41s per run. Run 3 used 5 tool calls over 31.58s before returning wrong_answer.

Ten models in this dataset have scored 0/3 on task_09. GLM-5 is the latest. The nine prior: llama3.3-70b-agentic-core-v1-run3-arc-2026, gemma-4-31b-agentic-core-v1-2026, gpt-5.5-instant-agentic-core-v1-2026, mistral-large-3-agentic-core-v1-2026, deepseek-v3-2-agentic-core-v1-bedrock-2026, openai-gpt-oss-120b-agentic-core-v1-2026, qwen3-next-80b-a3b-agentic-core-v1-2026, kimi-k2-5-agentic-core-v1-2026, minimax-m2-5-agentic-core-v1-2026. Five models passed at least one of their three task_09 runs: DeepSeek V4-Flash, Claude Sonnet 4.6, Devstral 2, Nemotron Super 3 120B (1/3), and Qwen3-Coder 30B. None scored 3/3.

Run 3’s 31.58s is the highest single-run latency in the dataset on task_09. GLM-5 tried the hardest and arrived at the same place as Mistral Large 3, which commits a wrong answer in 1.3 seconds via a single tool call (prior campaign article: mistral-large-3-agentic-core-v1-2026). Effort on task_09 produces no differentiation in outcome.

[Unobserved]

We have not seen any model detect the impossibility in task_09 on all three runs of a campaign. Five models passed it at least once (DeepSeek V4-Flash, Claude Sonnet 4.6, Devstral 2, Nemotron Super 3 120B, Qwen3-Coder 30B); none reproduced the pass across a full run series. We have not tested whether explicit prompting about input validity before computation changes this behaviour.

[Speculation]

Ten models in, task_09 is now a wall in the dataset. The question has moved from whether a specific model will catch it to whether any current architecture catches it reliably without post-training on data-insufficiency examples. GLM-5’s attempt on run 3 spent 31.58s and 5 tool calls on a problem that has no valid answer. Knowing when to stop appears to require something distinct from knowing how to proceed, and that distinction is not resolved by general capability.


What we were wrong about: P1 and P4

[Observed: predictions/glm-5-agentic-core-v1.md]

Six predictions filed. Four confirmed. Two falsified (both positive):

| Prediction | Outcome | Actual |
|---|---|---|
| P1: score 22–26/30 | FAIL (F1 triggered) | 27/30 |
| P2: task_09 0/3 wrong_answer | PASS | 0/3 |
| P3: task_07 2–3/3 | PASS | 3/3 |
| P4: task_03 1–2/3 | FAIL (positive) | 3/3 |
| P5: zero infrastructure_error | PASS | 0 errors |
| P6: total cost $0.05–$0.20 | PASS | $0.176 |
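Each prediction is a range check against the actual value. The sketch below regrades the six predictions that way. The `grade` helper and the dictionary layout are illustrative, and P2's failure-mode condition (wrong_answer) is left out for brevity.

```python
def grade(low, high, actual):
    """Return PASS if actual is inside [low, high], else FAIL with a direction."""
    if actual > high:
        return "FAIL (above range)"
    if actual < low:
        return "FAIL (below range)"
    return "PASS"


# name: (predicted low, predicted high, actual), taken from the table above
PREDICTIONS = {
    "P1 total score": (22, 26, 27),
    "P2 task_09 passes": (0, 0, 0),
    "P3 task_07 passes": (2, 3, 3),
    "P4 task_03 passes": (1, 2, 3),
    "P5 infrastructure errors": (0, 0, 0),
    "P6 total cost (USD)": (0.05, 0.20, 0.176),
}

verdicts = {name: grade(*values) for name, values in PREDICTIONS.items()}
for name, verdict in verdicts.items():
    print(f"{name}: {verdict}")
```

Both failures land above the predicted range, which is what "falsified in the positive direction" means here.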

P1 put the ceiling at 26/30. P4 put task_03 at risk. Both assumed GLM-5 would behave like the mid-tier Chinese-lab cluster in this dataset. GLM-5 sat well above that cluster. The prediction used the wrong prior: Alibaba and DeepSeek models in the dataset scored 19–22/30; Zhipu AI’s five years of iteration and academic lineage gave GLM-5 a different capability floor. That’s the part the prediction missed.


Cost position

[Observed: verification/cost_breakdown.csv, prior campaign articles]

$0.176 total. $0.0065 per passing run (verified: verification/cost_breakdown.csv).

GLM-5’s pricing ($1.00/$3.20 per million input/output tokens via the ZAI Bedrock Converse adapter) puts it at a premium relative to its score peers. The $0.0065 per pass compares to $0.0024 for MiniMax M2.5 at the same 27/30 score. At scale, that 2.7x gap matters.
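The per-pass figure and the 2.7x ratio can be reproduced from the numbers in the paragraph above. This is a minimal sketch with illustrative variable names. The 100,000-pass volume is a hypothetical chosen only to show how the gap scales.

```python
glm5_total_cost = 0.176        # USD, whole campaign
glm5_passes = 27               # passing runs out of 30
minimax_cost_per_pass = 0.0024 # USD, from the prior campaign article

glm5_cost_per_pass = glm5_total_cost / glm5_passes
ratio = glm5_cost_per_pass / minimax_cost_per_pass

print(f"GLM-5 cost per pass: ${glm5_cost_per_pass:.4f}")
print(f"ratio vs MiniMax M2.5: {ratio:.1f}x")

# Hypothetical workload size, not a figure from the campaign.
hypothetical_passes = 100_000
gap = (glm5_cost_per_pass - minimax_cost_per_pass) * hypothetical_passes
print(f"extra spend over {hypothetical_passes:,} passes: ${gap:,.0f}")
```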

| Model | Score | Cost/pass | Lab |
|---|---|---|---|
| Claude Sonnet 4.6 | 28/30 | $0.0514 | Anthropic |
| Devstral 2 | 27/30 | $0.0019 | Mistral |
| Mistral Large 3 | 27/30 | $0.0022 | Mistral |
| MiniMax M2.5 | 27/30 | $0.0024 | MiniMax |
| GLM-5 | 27/30 | $0.0065 | Zhipu AI |
| Kimi K2.5 | 24/30 | $0.0044 | Moonshot AI |
| GPT-OSS 120B | 23/30 | $0.0013 | OpenAI |
| Qwen3-Coder 30B A3B | 22/30 | $0.0018 | Alibaba |
| Qwen3 Next 80B A3B | 21/30 | $0.0012 | Alibaba |
| DeepSeek V3.2 | 19/30 | $0.0142 | DeepSeek |
| NVIDIA Nemotron Super 3 120B | 12/30 | $0.0016 | NVIDIA |

(verified: verification/cost_breakdown.csv for GLM-5; prior campaign articles for all other models. Table shows a subset of the full dataset.)

The cost structure skews toward task_03: $0.059 for three runs, 33% of total spend. Strip task_03 out, and the remaining 27 runs cost roughly $0.117, or about $0.0043 per run. Only 24 of those 27 runs pass (task_09 still fails three times), so the implied cost per pass is closer to $0.0049, just above Kimi ($0.0044) and about twice MiniMax ($0.0024). The cost premium is real but concentrated: log investigation, a read-heavy task, consumes a disproportionate share of the budget at GLM-5’s pricing tier.
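The task_03 concentration can be checked with the same figures. The sketch below computes the cost with task_03 excluded on two bases: per remaining run (27 runs) and per remaining passing run (24 runs, since the three task_09 failures are among the remaining runs). Variable names are illustrative.

```python
total_cost = 0.176      # USD, whole campaign
task_03_cost = 0.0588   # USD, three task_03 runs
total_runs, total_passes = 30, 27
runs_per_task = 3

task_03_share = task_03_cost / total_cost
remaining_cost = total_cost - task_03_cost
remaining_runs = total_runs - runs_per_task        # 27 runs left
remaining_passes = total_passes - runs_per_task    # 24: task_03 passed all 3

cost_per_remaining_run = remaining_cost / remaining_runs
cost_per_remaining_pass = remaining_cost / remaining_passes

print(f"task_03 share of spend: {task_03_share:.0%}")
print(f"per remaining run:  ${cost_per_remaining_run:.4f}")
print(f"per remaining pass: ${cost_per_remaining_pass:.4f}")
```

Which denominator is used matters when comparing against the per-pass figures of other models.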

For one-off agentic work where consistency matters more than cost, GLM-5’s zero-redundancy profile and exact-repeat task_04 traces make it a clean choice. For high-volume workloads where the log investigation is a recurring step, the 33% cost concentration warrants attention.


How does GLM-5 compare within the Chinese-lab sub-leaderboard?

[Observed: prior campaign articles for all models in table]

| Model | Score | Cost/pass | Lab |
|---|---|---|---|
| MiniMax M2.5 | 27/30 | $0.0024 | MiniMax (Beijing) |
| GLM-5 | 27/30 | $0.0065 | Zhipu AI (Beijing) |
| Kimi K2.5 | 24/30 | $0.0044 | Moonshot AI (Beijing) |
| Qwen3-Coder 30B A3B | 22/30 | $0.0018 | Alibaba |
| Qwen3 Next 80B A3B | 21/30 | $0.0012 | Alibaba |
| DeepSeek V3.2 | 19/30 | $0.0142 | DeepSeek |

GLM-5 ties MiniMax M2.5 at the top of this sub-leaderboard. The two Beijing labs at 27/30 represent different approaches: MiniMax M2.5 has an internal reasoning block that fires before every tool call; GLM-5 has no reasoning mode in the data pack. Both clear the same 27/30 mark. The mechanism produces similar outcomes on this harness, but MiniMax at $0.0024 per pass is 2.7x cheaper.

Zhipu AI’s five-year lineage from ChatGLM to GLM-5 produced a model that matches the newest Chinese reasoning architecture on agentic task performance. The cost structure is the differentiator where the two diverge.


What we don’t know yet

[Speculation]

GLM-5 is a dense general-purpose model with no explicit reasoning mode. Its zero-redundancy profile and task_04 consistency suggest a stable internal execution strategy, but we have no transcript-level view of how it maintains plan state across multi-step tasks. Whether the task_04 consistency reflects a fixed strategy or adaptive convergence is not resolvable from the data we have.

task_03’s variable-depth reading, including the one high-token run at ~30K input tokens, was not pre-planned in the prediction model. Whether GLM-5 consistently scales read depth based on log complexity across a larger run series, or whether the high-token run was stochastic, would require more runs than 3 to characterise.

task_09 run 3’s 31.58s is the highest observed latency on that task in the dataset. Whether that latency represents a generalised failure mode or is specific to the run-3 random seed is unknown. We have not tested whether multi-run exposure to the same task changes GLM-5’s behaviour on know-when-to-stop cases.

ClawWorks Weekly
