We gave Alibaba's biggest Qwen3 model a text-only benchmark. It scored 40%.

May 22, 2026 · campaign-reports

Campaign: 2026-05-22-qwen3-vl-235b-a22b-agentic-core-v1
Model: Alibaba Qwen3 VL 235B A22B (qwen.qwen3-vl-235b-a22b, AWS Bedrock us-east-1, ON_DEMAND)
Harness: agentic-core-v1 (10 tasks × 3 runs = 30 total)
Campaign date: 2026-05-22

Three Qwen3 models had already run on agentic-core-v1 before this campaign. Qwen3 Next 80B A3B scored 21/30. Qwen3-Coder 30B A3B scored 22/30. Qwen3 32B scored 23/30. The two MoE models activate roughly 3 billion parameters per forward pass; the dense model uses all 32 billion. The research question for Qwen3 VL 235B A22B was whether 22 billion active parameters, roughly 7× the active parameters of the MoE siblings, would close the gap to the top tier.

It did not.

Qwen3 VL 235B A22B scored 12/30 (40.0%) at $0.004978 per passing task. That is the weakest result in the Qwen3 family, despite the largest total parameter count in the family and the largest activation budget of any MoE model in it (only the 32B dense model activates more). The prediction was 22–26/30. F4 triggered. VL pre-training appears to have extracted a penalty that the extra active compute did not recover.

What the harness asks

[Observed — harness spec]

Ten tasks, three runs each, thirty total. agentic-core-v1 covers software engineering work a deployed agent would actually encounter: fix a failing test, refactor duplicated code, investigate a log file, trace execution through a codebase, make a targeted minimal fix, handle an ambiguous requirement, execute a four-step sequential plan, recover from an injected tool error, detect when a requested computation is impossible with the given data, and run a SQL investigation.

A pass requires correct task completion. Failure modes are classified: wrong_answer (completed incorrectly), gave_up_mid_plan (abandoned mid-execution), tool_call_hallucinated (fabricated tool or arguments), tool_call_malformed (structurally broken output).

Two tasks are structural traps. task_09 presents a CSV with three rows and asks for a 10-day moving average. The correct response is to refuse on the grounds that the data is insufficient. task_08 injects a file-not-found error on the first tool call; the model must detect it, find the correct path, and produce a verified output.
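The task_09 trap can be pictured as a guard a careful agent would apply before computing anything. The sketch below is illustrative only: `moving_average` is a hypothetical helper and the three values are made up, since the report does not show the harness's actual CSV or checker.

```python
def moving_average(values, window):
    """Return simple moving averages, or raise if the window cannot be filled."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        # The refusal branch: fewer rows than the window means no valid average.
        raise ValueError(
            f"insufficient data: need {window} rows, have {len(values)}"
        )
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical three-row input standing in for the task_09 CSV.
three_rows = [101.0, 102.5, 99.8]
try:
    moving_average(three_rows, window=10)
    refused = False
except ValueError:
    refused = True

print("refused:", refused)
```

A model that computes anyway has to invent or pad data to fill the window, which is the behaviour the task is designed to catch.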

agentic-core-v1 does not include any image inputs. Every task is text and tool calls only.

What happened

[Observed — data pack per_task_results]

Task	Score	Cost	Avg latency	Notes
task_01 fix_failing_test	2/3	$0.00462	11.1s	1 run: 1 tool call, underfired
task_02 refactor_duplicated_code	1/3	$0.00461	10.8s	2 failures: incomplete refactor
task_03 investigate_log	3/3	$0.02169	8.3s	13K tokens/run, clean
task_04 trace_through_codebase	1/3	$0.00689	20.4s	Run 1: gave up after 1 tool call
task_05 minimal_fix	1/3	$0.00388	7.9s	2 failures: wrong correction
task_06 handle_ambiguous_requirement	0/3	$0.00446	12.2s	Wrong answer all 3 runs
task_07 multi_step_plan	3/3	$0.00309	8.0s	Exactly 4 tool calls, all 3 runs
task_08 recover_from_tool_error	0/3	$0.00169	3.4s	Identical output all 3 runs
task_09 know_when_to_stop	1/3	$0.00532	10.4s	Run 1 passed, runs 2 and 3 failed
task_10 sql_investigation	0/3	$0.00349	7.4s	Identical output all 3 runs

Total: 12/30 (40.0%). $0.059732 campaign cost. $0.004978/pass.

Two tasks swept at 3/3. Three at 0/3. Five partial. The profile is scattered: not the gradual capability floor of a model that is consistently weak, but the inconsistency of a model that fires cleanly on some tasks and completely misses on others.
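The totals and the sweep/zero/partial split can be re-derived from the per-task table. This is a small sanity-check sketch; the tuples are copied from the table, and the variable names are ours rather than anything from the data pack.

```python
# (task, passes, runs, cost_usd), copied from the per-task table.
results = [
    ("task_01", 2, 3, 0.00462),
    ("task_02", 1, 3, 0.00461),
    ("task_03", 3, 3, 0.02169),
    ("task_04", 1, 3, 0.00689),
    ("task_05", 1, 3, 0.00388),
    ("task_06", 0, 3, 0.00446),
    ("task_07", 3, 3, 0.00309),
    ("task_08", 0, 3, 0.00169),
    ("task_09", 1, 3, 0.00532),
    ("task_10", 0, 3, 0.00349),
]

total_passes = sum(p for _, p, _, _ in results)
total_runs = sum(r for _, _, r, _ in results)
pass_rate = total_passes / total_runs

# Per-task costs are rounded in the table, so their sum only approximates
# the reported campaign cost.
table_cost = sum(c for _, _, _, c in results)
reported_cost = 0.059732
cost_per_pass = reported_cost / total_passes

swept = [t for t, p, r, _ in results if p == r]
zeroed = [t for t, p, _, _ in results if p == 0]
partial = [t for t, p, r, _ in results if 0 < p < r]

print(f"{total_passes}/{total_runs} = {pass_rate:.1%}")
print(f"cost per pass: ${cost_per_pass:.6f}")
print(len(swept), len(zeroed), len(partial))
```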

The VL penalty: confirmed and measurable

[Observed — intra-family data pack, cross-campaign]

The central research question was whether VL pre-training imposes a meaningful penalty on pure text-and-tool tasks. The data is clear:

Model	Score	Active params	Type
Qwen3 32B	23/30	32B	Dense
Qwen3-Coder 30B A3B	22/30	3B	MoE
Qwen3 Next 80B A3B	21/30	3B	MoE
Qwen3 VL 235B A22B	12/30	22B	VL MoE

Qwen3 VL 235B A22B has roughly 7× the active parameters of either 3B-active MoE sibling, yet it scores 9 points below Qwen3 Next 80B A3B and 10 points below Qwen3-Coder 30B A3B. It has 69% of the activation budget of the 32B dense model but scores 11 points lower. More active compute did not help.

The prediction applied the commonly cited assumption that VL models retain 85–95% of comparable text model performance on pure text tasks. The actual retention here is closer to 52% (12/30 versus 23/30 for the Qwen3 32B dense). That assumption did not hold.
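The retention figure and the parameter ratios follow directly from the family table. A minimal sketch of that arithmetic, using only numbers stated in this report (the dictionary layout is our own):

```python
# Scores (out of 30) and active parameters (billions) from the family table.
family = {
    "Qwen3 32B": (23, 32),
    "Qwen3-Coder 30B A3B": (22, 3),
    "Qwen3 Next 80B A3B": (21, 3),
    "Qwen3 VL 235B A22B": (12, 22),
}
vl = "Qwen3 VL 235B A22B"
vl_score, vl_active = family[vl]
dense_score, dense_active = family["Qwen3 32B"]

# Text retention relative to the dense sibling, as used in the report.
retention = vl_score / dense_score

# Activation budget relative to the dense and the 3B-active MoE siblings.
active_vs_dense = vl_active / dense_active
active_vs_moe = vl_active / 3

# Point gap between each sibling and the VL model.
gaps = {
    name: score - vl_score
    for name, (score, _) in family.items()
    if name != vl
}

print(f"retention: {retention:.0%}")
print(f"active vs dense: {active_vs_dense:.0%}, vs MoE: {active_vs_moe:.1f}x")
print(gaps)
```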

[Speculation] One plausible mechanism: VL pre-training may bias the model’s tool dispatch toward visual-context operations. When only text tools are available, the model’s exploration behaviour could systematically undershoot on tasks that, in a vision context, would prompt more active tool use. This is consistent with the campaign’s two complete 3/3 successes (task_03 and task_07), both of which have explicit, structured input that leaves little room for visual-cue disambiguation. On tasks requiring open-ended reasoning or evidence synthesis, the model appears to commit early and stop.

[Unobserved] This entire campaign ran without any image inputs. agentic-core-v1 is text and tool calls only. Whether providing visual context alongside the same task prompts would change the model’s behaviour is unknown. The performance gap observed here is the penalty on pure text-and-tool tasks, not a claim about the model’s multimodal capability.

Deterministic locked outputs: task_08 and task_10

[Observed — data pack task_outcomes, run_metrics]

The most striking finding in this campaign is not the score. It is the token-level evidence on two specific tasks.

task_08 (recover_from_tool_error): All three runs produced identical outputs. Each run: 1 tool call, 799 input tokens, 53 output tokens, and fs_write("length.txt", "35") as the final result. The task is designed to inject a file-not-found error on the first tool call and require the model to detect it, locate the correct path, and verify the output. This model never attempted the error-recovery step. It issued a single file-write with a wrong answer and stopped, three times in a row, with no variation in any measurable output.

task_10 (sql_investigation): Same pattern. All three runs: 2 tool calls, 1773 input tokens, 84 output tokens, and fs_write("finding.txt", "Query 4: phone column doesn't exist") in every run. The model landed on a wrong answer about which query was failing and wrote it three consecutive times without deviation.

Token counts and tool call counts are identical across all three runs for both tasks. This is not sampling variance. The model has a locked-in strategy for these specific task types and does not shift at all across independent invocations.
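One way to flag this pattern mechanically is to compare per-run records field by field. The sketch below assumes a hypothetical record layout (the data pack's real schema is not shown here); the task_08 values and the task_09 output-token counts are the ones reported in this post.

```python
def is_locked(runs):
    """True when every run record is identical to the first one."""
    return all(run == runs[0] for run in runs[1:])


# task_08: the same record was observed on all three runs.
task_08_record = {
    "tool_calls": 1,
    "input_tokens": 799,
    "output_tokens": 53,
    "final_call": 'fs_write("length.txt", "35")',
}
task_08_runs = [dict(task_08_record) for _ in range(3)]

# task_09 for contrast: output tokens differed across runs, so it is not locked.
# Only the field this report states for all three runs is included.
task_09_runs = [
    {"output_tokens": 289},
    {"output_tokens": 328},
    {"output_tokens": 351},
]

print("task_08 locked:", is_locked(task_08_runs))
print("task_09 locked:", is_locked(task_09_runs))
```

Comparing whole records rather than scores alone is the point: three failures with different token counts would still be 0/3, but would not indicate a locked strategy.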

[Speculation] Both tasks require reading and synthesising evidence before writing a conclusion. task_08 requires acting on an explicit error signal; task_10 requires querying a database and identifying the correct failing query from results. The locked output pattern suggests the model resolves both tasks from first-pass pattern-matching on the prompt text rather than exploration, commits to an answer without tool-driven verification, and reproduces that answer exactly because the resolution pathway is fully determined from the initial context. Whether this is a VL-specific failure mode or a general Qwen3 characteristic is not determinable from a single campaign.

Where the model works

[Observed — data pack task_03_results, task_07_results, run_metrics]

Two tasks scored 3/3 with clean execution.

task_03 (investigate_log): passed all three runs. Each run processed 13,081 to 13,094 input tokens, used 2 tool calls, and produced the correct finding. Average latency 8.3 seconds. Total cost $0.02169, which is 36% of the campaign’s total spend. The model read a large access log, extracted the relevant signal, and wrote the correct conclusion on every run.

task_07 (multi_step_plan): 3/3 with exactly 4 tool calls per run. Costs were $0.001030, $0.001031, and $0.001031 across the three runs: essentially identical. The model issued 4 sequential fs_write calls, wrote all 4 required files correctly, and finished in under 11 seconds on every run. This is the most reproducible execution profile in the entire campaign.

The common thread: both tasks have explicit, concrete inputs with deterministic correct outputs. task_03 contains a large log file that directly encodes the answer. task_07 has explicit step-by-step instructions. The model does well when the task collapses to “find and report” or “follow explicit instructions.” It fails on tasks that require reasoning about ambiguity (task_06), error recovery (task_08), or identifying the right pattern across a schema (task_10).

task_09: a pass on run 1

[Observed — data pack per_task_results, run_metrics]

The prediction was 0/3. The actual result was 1/3. Run 1 passed. Runs 2 and 3 failed.

Several non-reasoning models in the dataset have produced at least one task_09 pass across a 3-run campaign: DeepSeek V4 Flash, GPT-OSS-20B, Qwen3-Coder 30B A3B, and GLM-4.7 among them. Qwen3 VL 235B A22B joins this group.

Run 1 used 2 tool calls and 289 output tokens. Runs 2 and 3 used 328 and 351 output tokens respectively, both failing. The pass on run 1 came with fewer output tokens, not more. The model’s stochastic output generated the right reasoning path once: it recognised that a 10-day window was impossible on a 3-row dataset and refused rather than computing. The same prompt on two subsequent independent runs produced wrong answers.

[Speculation] task_09 is the most meta-cognitive task in the suite: it tests whether a model recognises a structurally impossible request rather than attempting it anyway. A VL model trained on multimodal data may have encountered underspecified image description tasks during post-training, where the correct response is to flag missing context rather than confabulate an answer. If that generalises to text, it would explain an occasional correct refusal on task_09. One pass in three runs is weak evidence for any mechanism; this is a hypothesis, not a finding.
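To put a rough number on how weak one pass in three is, a Wilson score interval can be computed for the underlying pass probability. This is a sketch under the assumption that the three runs are independent trials with a fixed pass probability; it is our illustration, not part of the campaign's scoring.

```python
import math


def wilson_interval(passes, runs, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z**2 / runs
    centre = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return centre - half, centre + half


low, high = wilson_interval(1, 3)
print(f"task_09 pass probability, ~95% interval: {low:.2f} to {high:.2f}")
```

The interval spans most of the unit range, which is why a single pass cannot distinguish between competing explanations.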

Leaderboard position

[Observed — leaderboard, cross-campaign data]

At 12/30, Qwen3 VL 235B A22B ties NVIDIA Nemotron Super 3 120B at the bottom of the mid-tier:

Model	Score	$/pass	Lab
GLM-4.7	28/30	$0.0038	Zhipu AI
DeepSeek V4 Flash	28/30	$0.0015	DeepSeek
Claude Sonnet 4.6	28/30	$0.0514	Anthropic
GPT-5.5	27/30	$0.0699	OpenAI
Devstral 2	27/30	$0.0020	Mistral
Mistral Large 3	27/30	$0.0021	Mistral
MiniMax M2.5	27/30	$0.0024	MiniMax
GLM-5	27/30	$0.0065	Zhipu AI
GLM-4.7-Flash	25/30	$0.000565	Zhipu AI
GPT-OSS-20B	25/30	$0.000481	OpenAI
Kimi K2.5	24/30	$0.0044	Moonshot AI
GPT-OSS-120B	23/30	$0.0013	OpenAI
Qwen3 32B	23/30	$0.0010	Alibaba
Qwen3-Coder 30B A3B	22/30	$0.0018	Alibaba
Qwen3 Next 80B A3B	21/30	$0.0012	Alibaba
DeepSeek V3.2	19/30	—	DeepSeek
Llama 3.3 70B	14/30	—	Meta
Nemotron Super 3 120B	12/30	$0.0016	NVIDIA
Qwen3 VL 235B A22B	12/30	$0.0050	Alibaba
Nemotron Nano 3 30B	10/30	$0.00079	NVIDIA

The complete Qwen3 family in one view:

Model	Score	Active params	Type
Qwen3 32B	23/30	32B	Dense
Qwen3-Coder 30B A3B	22/30	3B	MoE
Qwen3 Next 80B A3B	21/30	3B	MoE
Qwen3 VL 235B A22B	12/30	22B	VL MoE

The largest total-parameter model in the Qwen3 family is the weakest on this harness. Among the MoE variants, the one with the most active parameters is also the weakest, and it trails the 32B dense model by 11 points. This is not a scale story. On this evidence, the VL architecture, not compute, appears to be the deciding variable.

Execution profile

[Observed — data pack run_metrics, tool_call_analysis]

Avg latency 10.0s/run, comparable to GLM-4.7 (9.8s/run)
Total wall clock approximately 5 minutes (30 runs, 08:30:05Z to 08:35:06Z)
0 tool_call_malformed across 30 runs: the model accepted text-and-tool Converse API calls cleanly despite being a VL architecture
0 infrastructure errors across 30 runs
0 diagnosis_then_regression patterns in any transcript

When the model fails, it fails fast. task_08 and task_10 consumed 1 and 2 tool calls per run respectively and finished in 3.4s and 7.4s on average. No loops. No stalling. The model commits to a wrong answer quickly, rather than burning turns and then failing.

task_04 was a latency outlier. Run 2 took 48.5 seconds with 6 tool calls. Runs 1 and 3 used 1 and 5 tool calls at 1.9s and 10.7s. Only run 2 passed. The model’s exploration depth varied substantially on this task, and the deeper exploration was required to succeed. Across the rest of the campaign, latency was consistent.
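The latency figures in this section are mutually consistent, which the following sketch checks. All numbers are taken from this report; the variable names are ours.

```python
from datetime import datetime

# task_04 per-run latencies in seconds (runs 1, 2, 3).
task_04_latency_s = [1.9, 48.5, 10.7]
task_04_mean = sum(task_04_latency_s) / len(task_04_latency_s)

# Per-task average latencies from the results table. Every task has three
# runs, so the mean of these averages equals the per-run mean.
per_task_avg_latency_s = [11.1, 10.8, 8.3, 20.4, 7.9, 12.2, 8.0, 3.4, 10.4, 7.4]
campaign_mean = sum(per_task_avg_latency_s) / len(per_task_avg_latency_s)

# Wall clock from the reported start and end timestamps (same day, UTC).
start = datetime.strptime("08:30:05", "%H:%M:%S")
end = datetime.strptime("08:35:06", "%H:%M:%S")
wall_clock_s = (end - start).total_seconds()

print(f"task_04 mean: {task_04_mean:.1f}s")
print(f"campaign mean: {campaign_mean:.1f}s/run")
print(f"wall clock: {wall_clock_s:.0f}s")
```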

A note on transcript count: 31 transcript files exist in the campaign data versus 30 counted runs. One aborted run leaked into the output directory. The data totals are correct; the extra file is from an infrastructure abort before scoring.

Predictions

[Observed — data pack predictions_scoring]

Prediction	Expected	Actual	Result
P1 Overall score	22–26/30	12/30	Wrong ❌ (F4 triggered)
P2 task_09	0/3	1/3	Wrong ❌ (run 1 passed)
P3 task_07	>=2/3	3/3	Correct ✅
P4 VL penalty within 2 pts of Qwen3 32B (23/30)	<=2 pts	11 pts	Wrong ❌

1/4 correct. The prediction framework applied the 85–95% text retention hypothesis to a VL model and placed the score range too high. The actual retention was 52%. P4 was the core calibration bet on that hypothesis, and it was falsified by a wide margin.

P2 missed in the positive direction: task_09 produced a pass on run 1, which was unexpected. P3 (task_07) was the safest call in the set and held.
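The scoring of the four predictions reduces to simple comparisons against the observed results. A minimal sketch, with the thresholds taken from the predictions table and a dictionary layout of our own:

```python
# Observed results from this campaign and the Qwen3 32B reference score.
actual = {"overall": 12, "task_09": 1, "task_07": 3, "qwen3_32b": 23}

checks = {
    "P1": 22 <= actual["overall"] <= 26,             # overall in 22-26/30
    "P2": actual["task_09"] == 0,                     # task_09 scores 0/3
    "P3": actual["task_07"] >= 2,                     # task_07 at least 2/3
    "P4": actual["qwen3_32b"] - actual["overall"] <= 2,  # within 2 pts of 32B
}
correct = sum(checks.values())

for name, ok in checks.items():
    print(name, "correct" if ok else "wrong")
print(f"{correct}/{len(checks)} correct")
```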

The corrected prior for future VL model predictions on text-and-tool harnesses: do not apply an 85–95% text performance retention assumption. This campaign puts text retention for this specific model at approximately 52% (12/30 versus 23/30 for Qwen3 32B). That is a single-campaign point estimate, not a bound. The hypothesised mechanism (VL pre-training displacing text-and-tool specialisation) would make this a structural penalty rather than a sampling artefact, and the identical outputs on task_08 and task_10 are consistent with that reading.

What we don’t know yet

[Speculation]

Whether the task_08 and task_10 locked-output pattern is a VL-specific failure mode or a general characteristic of models with this activation budget and training lineage is not established. The Qwen3 32B dense model also produced 0/3 on task_08 with a specific 2-call completion pattern, though its token counts across runs were not identical in the same way. Comparing transcript-level token counts across both campaigns would clarify whether the locked-output behaviour is unique to the VL variant.

The task_06 0/3 result (handle_ambiguous_requirement) is unexplained at the mechanism level. The other Qwen3 models scored 3/3 on task_06. Whether the VL model’s different reading of ambiguous requirements is a post-training artifact from multimodal instruction data, a capacity effect, or something else is not determinable from the data pack.

Running this model with image inputs on a subset of the agentic-core-v1 tasks would test whether visual context recovers any of the 11-point gap. That experiment has not been run.