The flash tax on GLM-4.7 is 3 points and 85%

May 22, 2026 · campaign-reports

Campaign: 2026-05-21-glm-4.7-flash-agentic-core-v1
Model: Zhipu AI GLM-4.7-Flash (zai.glm-4.7-flash, AWS Bedrock us-east-1, ON_DEMAND)
Harness: agentic-core-v1 (10 tasks × 3 runs = 30 total)
Campaign date: 2026-05-21

GLM-4.7 ran on agentic-core-v1 yesterday and scored 28/30, joint-first in the dataset alongside Claude Sonnet 4.6 and DeepSeek V4 Flash. Cost-per-pass: $0.0038. That result was the baseline heading into this campaign.

The flash variant is a different commercial proposition. Input pricing drops from $0.50/1M to $0.07/1M (86% cheaper on input). Output pricing drops from $2.00/1M to $0.40/1M (80% cheaper). Flash models in this dataset have produced a wide range of outcomes: some hold near-parent capability, some fall off a cliff. Nemotron Nano 3 30B, the most recent flash-adjacent example, scored 10/30 against the Nemotron Super’s 12/30 (practically the same, both in the lower tier). GLM-4.7-Flash is a different question because the parent starts at the top.

GLM-4.7-Flash scored 25/30 at $0.000565/pass.

Three points below the non-flash parent. 85% cheaper per passing task. Runs in 3.1s average versus the parent’s 9.8s.

What the harness asks

[Observed — harness spec]

Ten tasks, three runs each. agentic-core-v1 covers software engineering scenarios a deployed agent would actually encounter: fix a failing test, refactor duplicated code, investigate a log file, trace execution paths through a codebase, make a targeted minimal fix, handle an ambiguous requirement, execute a four-step sequential plan, recover from an injected tool error, detect when a requested computation is impossible, and run a SQL investigation.

A pass requires correct task completion. Failure modes are classified: wrong_answer (completed incorrectly), gave_up_mid_plan (abandoned mid-execution), tool_call_hallucinated (fabricated tool or arguments), tool_call_malformed (structurally broken tool call output), tool_call_redundancy (loop without progress).

Two tasks are structural traps. task_09 presents a CSV with three rows and asks for a 10-day moving average. The correct response is to refuse on the grounds that the data is insufficient. task_08 injects a file-not-found error on the first tool call; the model must detect it, find the correct path, and produce a verified output.

What happened

[Observed — data pack per_task_results]

Task	Score	Cost	Avg latency
task_01 fix_failing_test	3/3	$0.00095	3.0s
task_02 refactor_duplicated_code	3/3	$0.00175	4.5s
task_03 investigate_log	3/3	$0.00511	3.0s
task_04 trace_through_codebase	3/3	$0.00098	3.0s
task_05 minimal_fix	3/3	$0.00122	3.4s
task_06 handle_ambiguous_requirement	2/3	$0.00123	4.2s
task_07 multi_step_plan	3/3	$0.00054	2.1s
task_08 recover_from_tool_error	2/3	$0.00058	2.2s
task_09 know_when_to_stop	0/3	$0.00102	3.3s
task_10 sql_investigation	3/3	$0.00074	2.2s

Total: 25/30 (83.3%). Seven tasks swept at 3/3. Three failures: task_09 entirely, task_06 and task_08 one run each.

The three failures

[Observed — data pack task_09_results, task_06_results, task_08_results, run_metrics]

task_09 (know_when_to_stop): 0/3. All three runs returned wrong_answer. The model computed and wrote a 10-day moving average from a three-row CSV each time, without pausing to check whether the window was satisfiable. Two to four tool calls per run, and a confident number at the end of each. This is the most common failure mode across the full dataset, and GLM-4.7-Flash reproduces it cleanly and consistently. The non-flash parent (GLM-4.7) scored 1/3 on task_09 (one pass in three runs), which itself is likely noise rather than signal. The flash model shows no indication of the intermittent hesitation the parent occasionally exhibited.

task_06 (handle_ambiguous_requirement): 2/3. Run 3 failed with tool_call_malformed. The flash model emitted a structurally broken tool call mid-generation: the output block terminated early, producing invalid schema output. Runs 1 and 2 were clean. This is the first tool_call_malformed in the Zhipu AI family across agentic-core-v1: GLM-4.7 and GLM-5 both returned zero malformed calls across their full campaigns. One occurrence in 30 runs is not a disqualifying rate, but it is a new failure mode for this family.

task_08 (recover_from_tool_error): 2/3. Run 1 failed with wrong_answer. The task injects a file-not-found error on the first tool call; the model must detect it, locate the correct path, and produce a verified correct output. Run 1 did not recover correctly. Runs 2 and 3 passed with 3 and 2 tool calls respectively. The parent (GLM-4.7) swept task_08 at 3/3, so this is a regression, though a one-run regression. Sampling variance is the more likely explanation than a structural gap.

The efficiency trade

[Observed — cross-campaign data, pricing documentation]

Metric	GLM-4.7	GLM-4.7-Flash	Delta
Score	28/30	25/30	−3 pts
Total campaign cost	$0.1059	$0.0141	−87%
Cost/pass	$0.0038	$0.000565	−85%
Avg latency	9.8s	3.1s	−68%
Input pricing	$0.50/1M	$0.07/1M	−86%
Output pricing	$2.00/1M	$0.40/1M	−80%

The flash model runs 3× faster and costs 85% less per passing task for a 3-point score drop. At 10,000 tasks, the saving is roughly $33 ($0.0038 − $0.000565 = $0.003235 × 10,000). Whether that trade is worth the 17% pass-rate reduction depends on which tasks the deployment encounters. For workloads where task_09-class impossible-task detection is critical, the flash model is the wrong choice. For workloads where tasks 1–8 and 10 are representative, 83% at 15% of the cost is a serious option.

The most concrete individual saving is task_03. The non-flash parent spent $0.0478 on the investigate-log task (45% of its total campaign budget) because it processed ~92,000 input tokens across three runs to investigate a large access log. GLM-4.7-Flash spent $0.00511 on the same task: a 9.4× reduction. The flash model used a more surgical read strategy, processing fewer tokens per run while still passing all three. That is not an accuracy loss. It is execution efficiency.

The `tool_call_malformed` signal

[Observed — data pack tool_call_analysis]

One tool_call_malformed across 30 runs is the lowest rate at which this failure mode can appear in a campaign. It is not statistically significant on its own. But it is worth flagging because neither GLM-4.7 nor GLM-5 produced a single malformed call in their campaigns.

The likely mechanism: flash architectures operate with a tighter compute budget per forward pass. Under complex tool-call schemas, the model occasionally hits an internal cutoff mid-generation of the structured output block. The result is a valid-looking partial response that fails schema validation. The parent model, running with more headroom, doesn’t hit this boundary.

[Speculation] At scale, tool_call_malformed errors from flash models are recoverable via retry-with-validation in the caller. A wrapper that detects malformed tool responses and resubmits the turn would likely recover most of these. At 1/30 in this dataset, the expected rate in production is low, but not zero.

Leaderboard position

[Observed — leaderboard, cross-campaign data]

Model	Score	$/pass	Lab
GLM-4.7	28/30	$0.0038	Zhipu AI
DeepSeek V4 Flash	28/30	$0.0015	DeepSeek
Claude Sonnet 4.6	28/30	$0.0514	Anthropic
GPT-5.5	27/30	$0.0699	OpenAI
Devstral 2	27/30	$0.0020	Mistral
Mistral Large 3	27/30	$0.0021	Mistral
MiniMax M2.5	27/30	$0.0024	MiniMax
GLM-5	27/30	$0.0065	Zhipu AI
GPT-OSS-20B	25/30	$0.000481	OpenAI
GLM-4.7-Flash	25/30	$0.000565	Zhipu AI
Kimi K2.5	24/30	$0.0044	Moonshot AI
GPT-OSS-120B	23/30	$0.0013	OpenAI
Qwen3 32B	23/30	$0.0010	Alibaba
Qwen3-Coder 30B A3B	22/30	$0.0018	Alibaba
Qwen3 Next 80B A3B	21/30	$0.0012	Alibaba
Llama 3.3 70B	14/30	$0.0047	Meta
Nemotron Super 3 120B	12/30	$0.0016	NVIDIA
Nemotron Nano 3 30B	10/30	$0.00079	NVIDIA

GLM-4.7-Flash ties GPT-OSS-20B at 25/30. The cost difference between them is small: $0.000565/pass versus $0.000481/pass. GPT-OSS-20B is 15% cheaper per passing task at the same score. Both sit in what is shaping up as an efficient-API tier at 25/30, separated from the top band (27–28) by a meaningful score gap.

The Zhipu family across the full dataset:

Model	Score	$/pass
GLM-4.7	28/30	$0.0038
GLM-5	27/30	$0.0065
GLM-4.7-Flash	25/30	$0.000565

GLM-5 remains off the Pareto frontier: lower score than GLM-4.7 at higher cost. GLM-4.7-Flash occupies a distinct cost tier below both. For a builder deploying at scale: GLM-4.7 for maximum score, GLM-4.7-Flash for maximum cost efficiency, GLM-5 for neither.

Execution profile

[Observed — data pack run_metrics, tool_call_analysis]

Avg latency 3.1s/run, down from 9.8s for the parent (3.2x faster)
Total wall clock 97 seconds for 30 runs
Very low token counts: avg ~4,400 input tokens/run (excluding task_03). The flash model resolves most tasks with minimal context accumulation.
0 infrastructure errors across 30 runs

task_03 latency showed high run-to-run variance: 1.8s to 5.2s. Run 2 processed 41,937 input tokens in 5.2s; runs 1 and 3 were more surgical. The flash model does not consistently choose the same investigation strategy across runs, but it passes all three. High-variance latency on a single task in a 3-run sample is expected noise.

[Unobserved] No gave_up_mid_plan, tool_call_hallucinated, or tool_call_redundancy failures across the campaign. The one tool_call_malformed instance on task_06 run 3 accounts for the only non-wrong_answer failure mode.

What we got wrong

[Observed — data pack predictions_scoring]

Prediction	Expected	Actual	Result
P1 Overall score	20–26/30	25/30	Correct ✅
P2 task_09	0/3	0/3	Correct ✅
P3 task_07	≥2/3	3/3	Correct ✅
P4 task_03	≥1/3	3/3	Correct ✅

4/4 correct. P4 was the notable risk: task_03 is the dataset’s highest-cost task due to a large access log. Flash architectures that read aggressively could spike cost and fail if their truncation strategy misses key events. GLM-4.7-Flash’s surgical read approach passed all three runs at 9.4× lower cost than the parent. The risk didn’t materialise, but flagging it in advance was the right call.

The range on P1 (20–26) was wide to account for flash architecture uncertainty. 25 landed comfortably within it. The prediction framework correctly captured the direction of the efficiency trade; the specific score came in at the upper end of the range.

What we don’t know yet

[Speculation]

Whether the 1/30 tool_call_malformed rate is stable or context-dependent is not established from this campaign. A longer run (60 or 90 tasks) would clarify whether this is a low-but-consistent rate or a one-off from a particular token-boundary edge case in task_06. Until that data exists, treating it as a real-but-low rate is the prudent default.

The mechanism by which GLM-4.7-Flash handles task_03 more efficiently than the non-flash parent (whether it’s an explicit shorter-context strategy baked into the fine-tune, or just lower maximum token generation limits forcing earlier decisions) is not visible from the outside. The result is a 9.4× cost saving with no accuracy loss, and the structural reason is unknown.

The task_08 failure on run 1 also remains unexplained. The model passed task_08 in runs 2 and 3. Whether run 1’s failure reflects a specific bad token sequence, sampling variation in the tool-call generation, or something else is not resolvable from three runs.