Version inversion
Campaign: 2026-05-21-glm-4.7-agentic-core-v1
Model: Zhipu AI GLM-4.7 (zai.glm-4.7, AWS Bedrock us-east-1, ON_DEMAND)
Harness: agentic-core-v1 (10 tasks × 3 runs = 30 total)
Campaign date: 2026-05-21
GLM-5 ran on agentic-core-v1 and scored 27/30. Upper tier, cost-competitive, clean execution. That was the baseline heading into this campaign.
The expectation was that GLM-4.7, the version GLM-5 replaced, would come in lower. Pricing is 50% cheaper on input, 37.5% cheaper on output. Models that cost less usually score less. In the dataset, that correlation holds often enough that cheaper-from-the-same-lab is a reasonable predictor of a lower ceiling.
GLM-4.7 scored 28/30.
That is one pass above GLM-5, at $0.0038 per passing task versus GLM-5’s $0.0065. GLM-4.7 is now joint-first in the agentic-core-v1 dataset alongside Claude Sonnet 4.6 and DeepSeek V4 Flash, on the score axis. It is the cheapest of those three by a wide margin.
What the harness asks
[Observed — harness spec]
Ten tasks, three runs each. agentic-core-v1 covers a set of software engineering problems a deployed agent would actually encounter: fix a failing test, refactor duplicated code, investigate a log file, trace execution paths through a codebase, make a targeted minimal fix, handle an ambiguous requirement, execute a four-step sequential plan, recover from an injected tool error, detect when a requested computation is impossible, and run a SQL investigation.
A pass requires correct task completion. Failure modes are classified: wrong_answer (completed incorrectly), gave_up_mid_plan (abandoned mid-execution), tool_call_hallucinated (fabricated tool or arguments), tool_call_redundancy (loop without progress).
Two tasks are structural traps. task_09 presents a CSV with three rows and asks for a 10-day moving average. The correct response is to state the data is insufficient. task_08 injects a file-not-found error on the first tool call; the model must detect it, locate the correct path, and produce a verified correct output.
The results
[Observed — data pack per_task_results]
| Task | Score | Cost | Avg latency |
|---|---|---|---|
| task_01 fix_failing_test | 3/3 | $0.0067 | 8.5s |
| task_02 refactor_duplicated_code | 3/3 | $0.0070 | 10.8s |
| task_03 investigate_log | 3/3 | $0.0478 | 9.3s |
| task_04 trace_through_codebase | 3/3 | $0.0065 | 15.0s |
| task_05 minimal_fix | 3/3 | $0.0087 | 12.0s |
| task_06 handle_ambiguous_requirement | 3/3 | $0.0095 | 13.7s |
| task_07 multi_step_plan | 3/3 | $0.0025 | 3.8s |
| task_08 recover_from_tool_error | 3/3 | $0.0025 | 5.2s |
| task_09 know_when_to_stop | 1/3 | $0.0103 | 13.5s |
| task_10 sql_investigation | 3/3 | $0.0045 | 5.9s |
Total: 28/30 (93.3%). Two failures, both on task_09.
Nine tasks swept at 3/3. task_08 (the injected error recovery task that has produced wrong_answer at 0/3 for several models in the dataset) was clean. GLM-4.7 detected the injected error and produced a correct output across all three runs. The only failure point was task_09.
task_09: the one crack
[Observed — data pack task_09_results, run_metrics]
task_09 is structurally the hardest task in the harness. Across the full dataset, most models either fail all three runs or score 1/3 on noise. GLM-5 scored 0/3. Most models scoring at the top of the leaderboard score 0/3 or 1/3 here. Consistent 3/3 on task_09 requires the model to reliably refuse an impossible request rather than attempt it.
GLM-4.7’s three runs showed three distinct outcomes:
- Run 1 (pass): Correctly identified insufficient data. 6 tool calls, 14.6s.
- Run 2 (gave_up_mid_plan): Hit 13 turns without committing to a “not enough data” conclusion. 6 tool calls, 11.8s. The model kept working.
- Run 3 (wrong_answer): Computed and wrote a number. 4 tool calls, 14.1s.
One correct, one exhausted, one confidently wrong. This is an inconsistent response profile for a single task. GLM-4.7 occasionally recognises the impossibility, occasionally runs out of turns trying to find a way through, and occasionally just answers.
GLM-5 scored 0/3 on task_09. GLM-4.7 scored 1/3. The 1/3 is likely sampling noise rather than a real capability difference between the two versions. A single pass in three runs does not establish reliable know-when-to-stop behaviour. But it is worth noting: the predecessor model’s weaker instruction-following may sometimes let it arrive at the correct refusal that the more polished successor misses by always pushing through.
[Speculation] Whether the difference between GLM-4.7’s 1/3 and GLM-5’s 0/3 on task_09 reflects different training on impossible-task detection or is pure sampling variance is not established. The sample is three runs on one task. Both interpretations are consistent with the data.
The cost-performance inversion
[Observed — cross-campaign data, pricing documentation]
| Metric | GLM-4.7 | GLM-5 |
|---|---|---|
| Score | 28/30 | 27/30 |
| Total campaign cost | $0.1059 | $0.1759 |
| Cost/pass | $0.0038 | $0.0065 |
| Input pricing | $0.50/1M | $1.00/1M |
| Output pricing | $2.00/1M | $3.20/1M |
The pricing gap between the two models is real and substantial. The capability gap the pricing implies does not show up in agentic task performance. On 29 of the 30 tasks where direct comparison is straightforward (excluding task_09’s variability), both models performed at the same level.
This pattern has appeared before in the dataset (particularly in cost-competitive Chinese lab models), but the inversion here is among the sharper ones. GLM-5 is not just marginally more expensive for a marginal gain; it is 71% more expensive per passing task for a result that is one pass lower.
For a builder choosing between these two via Bedrock: GLM-4.7 is the call. At 10,000 tasks, the cost difference is roughly $27 saved.
Execution quality
[Observed — data pack run_metrics, tool_call_analysis]
GLM-4.7 runs cleanly across the full campaign:
- 0/30 tool_call_redundancy: No duplicate or looping tool calls. Matches GLM-5’s result and ties the cleanest execution profile in the dataset.
- 0/30 diagnosis_then_regression: The model does not second-guess its own earlier analysis within a run.
- 0/30 long_tail_turns: No run hit the 15-turn hard limit. task_09 run 2 reached 13 turns before timing out; the model was still making forward progress, just slow progress toward the wrong answer.
- 145 total tool calls across 30 runs.
- Average latency 9.8s/run.
The task_03 cost spike is worth noting. task_03 (investigate_log) cost $0.0478, which is 45% of total campaign spend despite being 10% of runs. This is consistent with every model in the dataset: task_03 involves reading a large access log, and GLM-4.7’s 3/3 score means it processed that log thoroughly all three times, at roughly 92,000 input tokens total across the task. task_03 is a budget line item regardless of which model you’re evaluating.
[Unobserved] There were no tool_call_hallucinated or gave_up_mid_plan failures outside of task_09 run 2. Every other failure mode class returned zero across the campaign.
Where it sits
[Observed — leaderboard, cross-campaign data]
| Model | Score | $/pass | Lab |
|---|---|---|---|
| GLM-4.7 | 28/30 | $0.0038 | Zhipu AI |
| DeepSeek V4 Flash | 28/30 | $0.0015 | DeepSeek |
| Claude Sonnet 4.6 | 28/30 | $0.0514 | Anthropic |
| GPT-5.5 | 27/30 | $0.0699 | OpenAI |
| Devstral 2 | 27/30 | $0.0020 | Mistral |
| Mistral Large 3 | 27/30 | $0.0021 | Mistral |
| MiniMax M2.5 | 27/30 | $0.0024 | MiniMax |
| GLM-5 | 27/30 | $0.0065 | Zhipu AI |
| GPT-OSS 20B | 25/30 | $0.0005 | OpenAI |
| Kimi K2.5 | 24/30 | $0.0044 | Moonshot AI |
| GPT-OSS 120B | 23/30 | $0.0013 | OpenAI |
| Qwen3 32B | 23/30 | $0.0010 | Alibaba |
| Qwen3-Coder 30B A3B | 22/30 | $0.0018 | Alibaba |
| Qwen3 Next 80B A3B | 21/30 | $0.0012 | Alibaba |
| Llama 3.3 70B | 14/30 | $0.0047 | Meta |
| Nemotron Super 3 120B | 12/30 | $0.0016 | NVIDIA |
GLM-4.7 is joint-first on score. Among the three models at 28/30, it sits between DeepSeek V4 Flash ($0.0015/pass, cheapest) and Claude Sonnet 4.6 ($0.0514/pass, most expensive). The score-band between 27/30 and 28/30 contains eight models from six labs. The distinction between 27 and 28 at this level of the dataset is within the noise range of a three-run task, but GLM-4.7’s execution profile (no redundancy, no hallucinated calls, full coverage across all non-task_09 tasks) supports the result as repeatable.
What we got wrong
[Observed — data pack predictions_scoring]
| Prediction | Expected | Actual | Result |
|---|---|---|---|
| P1 Overall score | 20–24/30 | 28/30 | Wrong (beat upper bound by 4) |
| P2 task_09 | 0/3 | 1/3 | Wrong (one pass) |
| P3 task_07 | ≥2/3 | 3/3 | Correct ✅ |
P1 missed by 4 passes. The prediction range of 20–24 was based on the assumption that a 40–50% pricing reduction from a successor model implies a meaningful capability reduction on the same task set. That assumption failed.
The update is specific: within a single lab’s model family, pricing revisions do not reliably predict agentic task performance direction. The two Zhipu models have the same tool call quality profile. They fail on the same task. The pricing gap reflects something other than agentic capability: architecture changes, context window, or commercial positioning. Treating cost as a proxy for this class of capability comparison is a mistake we should not repeat.
P2 also missed. The task_09 0/3 prediction was consistent with the whole-dataset pattern for non-top-performing models. GLM-4.7’s 1/3 is the exception.
What we don’t know yet
[Speculation]
GLM-4.7’s task_09 performance across three runs (pass, gave_up_mid_plan, wrong_answer) raises a question that is not answered by this data: whether the 1/3 pass reflects a real know-when-to-stop signal that surfaces inconsistently, or is pure sampling variance. A follow-up campaign with a larger sample (say, 10 runs on task_09 alone) would clarify whether there is a useful probability here or just noise.
The mechanism by which GLM-5 became less capable on this harness than GLM-4.7 is also unknown from outside the lab. Fine-tuning emphasis, RLHF reward shaping, or instruction-following improvements that help in most contexts but slightly hurt agentic planning are all plausible. None are confirmed.
The practical relevance: whatever the mechanism, it produced a model that scores lower and costs more for this class of task. For users on Bedrock evaluating between these two, the reason does not change the decision.