Version inversion

Campaign: 2026-05-21-glm-4.7-agentic-core-v1
Model: Zhipu AI GLM-4.7 (zai.glm-4.7, AWS Bedrock us-east-1, ON_DEMAND)
Harness: agentic-core-v1 (10 tasks × 3 runs = 30 total)
Campaign date: 2026-05-21


GLM-5 ran on agentic-core-v1 and scored 27/30. Upper tier, cost-competitive, clean execution. That was the baseline heading into this campaign.

The expectation was that GLM-4.7, the version GLM-5 replaced, would come in lower. Its pricing is 50% lower on input and 37.5% lower on output. Models that cost less usually score less. In the dataset, that correlation holds often enough that a cheaper model from the same lab is a reasonable predictor of a lower ceiling.

GLM-4.7 scored 28/30.

That is one pass above GLM-5, at $0.0038 per passing task versus GLM-5’s $0.0065. GLM-4.7 is now joint-first on score in the agentic-core-v1 dataset, alongside Claude Sonnet 4.6 and DeepSeek V4 Flash. Of those three, only DeepSeek V4 Flash ($0.0015 per pass) is cheaper; Claude Sonnet 4.6 costs $0.0514 per pass, more than thirteen times as much.


What the harness asks

[Observed — harness spec]

Ten tasks, three runs each. agentic-core-v1 covers a set of software engineering problems a deployed agent would actually encounter: fix a failing test, refactor duplicated code, investigate a log file, trace execution paths through a codebase, make a targeted minimal fix, handle an ambiguous requirement, execute a four-step sequential plan, recover from an injected tool error, detect when a requested computation is impossible, and run a SQL investigation.

A pass requires correct task completion. Failure modes are classified: wrong_answer (completed incorrectly), gave_up_mid_plan (abandoned mid-execution), tool_call_hallucinated (fabricated tool or arguments), tool_call_redundancy (loop without progress).

Two tasks are structural traps. task_09 presents a CSV with three rows and asks for a 10-day moving average. The correct response is to state the data is insufficient. task_08 injects a file-not-found error on the first tool call; the model must detect it, locate the correct path, and produce a verified correct output.
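To make the task_09 trap concrete, here is a minimal sketch of the check a careful solver would make before computing anything. The function name and the sample values are hypothetical; the harness's actual CSV contents and grading logic are not shown in this write-up.

```python
def moving_average(values, window):
    """Return simple moving averages, or raise if the window cannot be filled."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        # This is the task_09 situation: fewer rows than the window needs.
        raise ValueError(
            f"insufficient data: need {window} rows, have {len(values)}"
        )
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


rows = [101.0, 102.5, 99.8]  # hypothetical stand-in for the three-row CSV column

try:
    moving_average(rows, window=10)
    outcome = "computed"
except ValueError as exc:
    outcome = str(exc)

print(outcome)  # insufficient data: need 10 rows, have 3
```

A passing response is the equivalent of surfacing that error message instead of inventing a number.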


The results

[Observed — data pack per_task_results]

| Task | Score | Cost | Avg latency |
|---|---|---|---|
| task_01 fix_failing_test | 3/3 | $0.0067 | 8.5s |
| task_02 refactor_duplicated_code | 3/3 | $0.0070 | 10.8s |
| task_03 investigate_log | 3/3 | $0.0478 | 9.3s |
| task_04 trace_through_codebase | 3/3 | $0.0065 | 15.0s |
| task_05 minimal_fix | 3/3 | $0.0087 | 12.0s |
| task_06 handle_ambiguous_requirement | 3/3 | $0.0095 | 13.7s |
| task_07 multi_step_plan | 3/3 | $0.0025 | 3.8s |
| task_08 recover_from_tool_error | 3/3 | $0.0025 | 5.2s |
| task_09 know_when_to_stop | 1/3 | $0.0103 | 13.5s |
| task_10 sql_investigation | 3/3 | $0.0045 | 5.9s |

Total: 28/30 (93.3%). Two failures, both on task_09.
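The totals can be re-derived from the per-task rows. The sketch below assumes each listed cost is the task's total across its three runs (the figures sum to the campaign total, which supports that reading). Because the per-task costs are already rounded, the sum comes out at about $0.1060 rather than the reported $0.1059.

```python
RUNS_PER_TASK = 3

# (passes, cost in USD) per task, copied from the results table
per_task = {
    "task_01": (3, 0.0067),
    "task_02": (3, 0.0070),
    "task_03": (3, 0.0478),
    "task_04": (3, 0.0065),
    "task_05": (3, 0.0087),
    "task_06": (3, 0.0095),
    "task_07": (3, 0.0025),
    "task_08": (3, 0.0025),
    "task_09": (1, 0.0103),
    "task_10": (3, 0.0045),
}

passes = sum(p for p, _ in per_task.values())
total_runs = RUNS_PER_TASK * len(per_task)
total_cost = sum(c for _, c in per_task.values())
failing = [task for task, (p, _) in per_task.items() if p < RUNS_PER_TASK]

print(f"{passes}/{total_runs} ({passes / total_runs:.1%})")
print(f"summed cost: ${total_cost:.4f}")
print(f"tasks with failures: {failing}")
```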

Nine tasks swept at 3/3. task_08 (the injected error recovery task that has produced wrong_answer at 0/3 for several models in the dataset) was clean. GLM-4.7 detected the injected error and produced a correct output across all three runs. The only failure point was task_09.


task_09: the one crack

[Observed — data pack task_09_results, run_metrics]

task_09 is structurally the hardest task in the harness. Across the full dataset, most models either fail all three runs or score 1/3 on noise. GLM-5 scored 0/3. Most models scoring at the top of the leaderboard score 0/3 or 1/3 here. Consistent 3/3 on task_09 requires the model to reliably refuse an impossible request rather than attempt it.

GLM-4.7’s three runs showed three distinct outcomes: one pass, one gave_up_mid_plan (run 2), and one wrong_answer.

One correct, one exhausted, one confidently wrong. This is an inconsistent response profile for a single task. GLM-4.7 occasionally recognises the impossibility, occasionally runs out of turns trying to find a way through, and occasionally just answers.

GLM-5 scored 0/3 on task_09. GLM-4.7 scored 1/3. The 1/3 is likely sampling noise rather than a real capability difference between the two versions. A single pass in three runs does not establish reliable know-when-to-stop behaviour. But it is worth noting: the predecessor model’s weaker instruction-following may sometimes let it arrive at the correct refusal that the more polished successor misses by always pushing through.

[Speculation] Whether the difference between GLM-4.7’s 1/3 and GLM-5’s 0/3 on task_09 reflects different training on impossible-task detection or is pure sampling variance is not established. The sample is three runs on one task. Both interpretations are consistent with the data.


The cost-performance inversion

[Observed — cross-campaign data, pricing documentation]

| Metric | GLM-4.7 | GLM-5 |
|---|---|---|
| Score | 28/30 | 27/30 |
| Total campaign cost | $0.1059 | $0.1759 |
| Cost/pass | $0.0038 | $0.0065 |
| Input pricing (per 1M tokens) | $0.50 | $1.00 |
| Output pricing (per 1M tokens) | $2.00 | $3.20 |

The pricing gap between the two models is real and substantial. The capability gap the pricing implies does not show up in agentic task performance. On the nine tasks outside task_09 (27 of the 30 runs), where direct comparison is straightforward, both models passed every run.

This pattern has appeared before in the dataset (particularly in cost-competitive Chinese lab models), but the inversion here is among the sharper ones. GLM-5 is not just marginally more expensive for a marginal gain; it is 71% more expensive per passing task for a result that is one pass lower.

For a builder choosing between these two via Bedrock: GLM-4.7 is the call. At 10,000 passing tasks, the difference in cost per pass works out to roughly $27 saved.
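The cost-per-pass figures, the 71% premium, and the roughly $27 saving all follow from the campaign totals. A short worked check, using the rounded per-pass values the way the text does (variable names are illustrative):

```python
glm47 = {"passes": 28, "cost": 0.1059}
glm5 = {"passes": 27, "cost": 0.1759}


def cost_per_pass(model):
    """Total campaign cost divided by the number of passing runs."""
    return model["cost"] / model["passes"]


cpp47 = round(cost_per_pass(glm47), 4)  # 0.0038
cpp5 = round(cost_per_pass(glm5), 4)    # 0.0065

premium = cpp5 / cpp47 - 1               # GLM-5's extra cost per pass
saved_per_10k = (cpp5 - cpp47) * 10_000  # USD saved over 10,000 passes

print(f"GLM-4.7: ${cpp47:.4f}/pass, GLM-5: ${cpp5:.4f}/pass")
print(f"GLM-5 premium per pass: {premium:.0%}")
print(f"saved over 10,000 passes: ${saved_per_10k:.2f}")
```

Using the unrounded quotients instead nudges the premium to about 72%, so the figure is sensitive to rounding at the fourth decimal place.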


Execution quality

[Observed — data pack run_metrics, tool_call_analysis]

GLM-4.7 runs cleanly across the full campaign: no redundant tool-call loops, no hallucinated calls, and 3/3 on every task other than task_09.

The task_03 cost spike is worth noting. task_03 (investigate_log) cost $0.0478, which is 45% of total campaign spend despite being 10% of runs. This is consistent with every model in the dataset: task_03 involves reading a large access log, and GLM-4.7’s 3/3 score means it processed that log thoroughly all three times, at roughly 92,000 input tokens total across the task. task_03 is a budget line item regardless of which model you’re evaluating.
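The task_03 numbers are internally consistent, as a quick back-of-envelope check shows. The token count is the approximate figure quoted above; attributing the remainder to output tokens is an assumption, since the output token count is not reported here.

```python
task03_cost = 0.0478              # USD, from the results table
campaign_cost = 0.1059            # USD, total campaign spend
input_tokens = 92_000             # approximate, total across the three runs
input_price_per_million = 0.50    # USD per 1M input tokens for GLM-4.7

share_of_spend = task03_cost / campaign_cost
input_cost = input_tokens / 1_000_000 * input_price_per_million
remainder = task03_cost - input_cost  # assumed to be mostly output tokens

print(f"task_03 share of spend: {share_of_spend:.1%}")
print(f"estimated input cost:   ${input_cost:.4f}")
print(f"remainder:              ${remainder:.4f}")
```

Input tokens alone account for roughly $0.046 of the $0.0478, which is why the task is expensive for any model that actually reads the log.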

[Unobserved] There were no tool_call_hallucinated or tool_call_redundancy failures in any run. The only gave_up_mid_plan was task_09 run 2, and the only wrong_answer was the other failing task_09 run.


Where it sits

[Observed — leaderboard, cross-campaign data]

| Model | Score | $/pass | Lab |
|---|---|---|---|
| GLM-4.7 | 28/30 | $0.0038 | Zhipu AI |
| DeepSeek V4 Flash | 28/30 | $0.0015 | DeepSeek |
| Claude Sonnet 4.6 | 28/30 | $0.0514 | Anthropic |
| GPT-5.5 | 27/30 | $0.0699 | OpenAI |
| Devstral 2 | 27/30 | $0.0020 | Mistral |
| Mistral Large 3 | 27/30 | $0.0021 | Mistral |
| MiniMax M2.5 | 27/30 | $0.0024 | MiniMax |
| GLM-5 | 27/30 | $0.0065 | Zhipu AI |
| GPT-OSS 20B | 25/30 | $0.0005 | OpenAI |
| Kimi K2.5 | 24/30 | $0.0044 | Moonshot AI |
| GPT-OSS 120B | 23/30 | $0.0013 | OpenAI |
| Qwen3 32B | 23/30 | $0.0010 | Alibaba |
| Qwen3-Coder 30B A3B | 22/30 | $0.0018 | Alibaba |
| Qwen3 Next 80B A3B | 21/30 | $0.0012 | Alibaba |
| Llama 3.3 70B | 14/30 | $0.0047 | Meta |
| Nemotron Super 3 120B | 12/30 | $0.0016 | NVIDIA |

GLM-4.7 is joint-first on score. Among the three models at 28/30, it sits between DeepSeek V4 Flash ($0.0015/pass, cheapest) and Claude Sonnet 4.6 ($0.0514/pass, most expensive). The score-band between 27/30 and 28/30 contains eight models from six labs. The distinction between 27 and 28 at this level of the dataset is within the noise range of a three-run task, but GLM-4.7’s execution profile (no redundancy, no hallucinated calls, full coverage across all non-task_09 tasks) supports the result as repeatable.
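One rough way to see why 27 versus 28 sits inside the noise is to put an interval around each pass rate. The sketch below uses a Wilson score interval and treats all 30 runs as independent trials with a single pass probability. That is a simplifying assumption, not how the harness scores results (runs of different tasks clearly differ in difficulty), so read it as an illustration only.

```python
import math


def wilson_interval(passes, runs, z=1.96):
    """Wilson score interval for a binomial pass rate (z=1.96 is roughly 95%)."""
    p = passes / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return centre - half, centre + half


low_28, high_28 = wilson_interval(28, 30)
low_27, high_27 = wilson_interval(27, 30)

# Two intervals overlap when each one starts before the other ends.
overlap = low_28 <= high_27 and low_27 <= high_28

print(f"28/30: {low_28:.3f} to {high_28:.3f}")
print(f"27/30: {low_27:.3f} to {high_27:.3f}")
print("intervals overlap:", overlap)
```

Under these assumptions the two intervals overlap almost entirely, which matches the caution above about ranking models within the 27 to 28 band.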


What we got wrong

[Observed — data pack predictions_scoring]

| Prediction | Expected | Actual | Result |
|---|---|---|---|
| P1 Overall score | 20–24/30 | 28/30 | Wrong (beat upper bound by 4) |
| P2 task_09 | 0/3 | 1/3 | Wrong (one pass) |
| P3 task_07 | ≥2/3 | 3/3 | Correct |

P1 missed by 4 passes. The prediction range of 20–24 was based on the assumption that a 37.5–50% price cut relative to the successor model implies a meaningful capability reduction on the same task set. That assumption failed.

The update is specific: within a single lab’s model family, pricing revisions do not reliably predict agentic task performance direction. The two Zhipu models have the same tool call quality profile. They fail on the same task. The pricing gap reflects something other than agentic capability: architecture changes, context window, or commercial positioning. Treating cost as a proxy for this class of capability comparison is a mistake we should not repeat.

P2 also missed. The task_09 0/3 prediction was consistent with the whole-dataset pattern for non-top-performing models. GLM-4.7’s 1/3 is the exception.


What we don’t know yet

[Speculation]

GLM-4.7’s task_09 performance across three runs (pass, gave_up_mid_plan, wrong_answer) raises a question that is not answered by this data: whether the 1/3 pass reflects a real know-when-to-stop signal that surfaces inconsistently, or is pure sampling variance. A follow-up campaign with a larger sample (say, 10 runs on task_09 alone) would clarify whether there is a useful probability here or just noise.
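A small binomial calculation shows why three runs cannot separate these readings and why ten would help. The candidate pass rates below are hypothetical values chosen for illustration, not estimates from the data, and the runs are assumed to be independent.

```python
from math import comb, sqrt


def binom_pmf(k, n, p):
    """Probability of exactly k passes in n independent runs with pass rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def standard_error(p, n):
    """Standard error of an observed pass rate over n independent runs."""
    return sqrt(p * (1 - p) / n)


# How likely is a 1/3 result under different hypothetical true pass rates?
for true_rate in (0.1, 1 / 3, 0.5):
    prob = binom_pmf(1, 3, true_rate)
    print(f"true rate {true_rate:.2f}: P(exactly 1 of 3) = {prob:.3f}")

# How much does the estimate tighten when moving from 3 runs to 10?
se_3 = standard_error(1 / 3, 3)
se_10 = standard_error(1 / 3, 10)
print(f"standard error at n=3: {se_3:.3f}, at n=10: {se_10:.3f}")
```

A 1/3 result is unsurprising whether the true pass rate is 10% or 50%, so the observed pass says little on its own. Ten runs roughly halve the standard error, which is still coarse but enough to distinguish "almost never" from "about half the time".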

The mechanism by which GLM-5 became less capable on this harness than GLM-4.7 is also unknown from outside the lab. Fine-tuning emphasis, RLHF reward shaping, or instruction-following improvements that help in most contexts but slightly hurt agentic planning are all plausible. None are confirmed.

The practical relevance: whatever the mechanism, it produced a model that scores lower and costs more for this class of task. For users on Bedrock evaluating between these two, the reason does not change the decision.

ClawWorks Weekly

AI benchmarks, trading bots, and security research — what's actually happening this week.