Dense enough

Campaign: 2026-05-21-qwen3-32b-agentic-core-v1
Model: Alibaba Qwen3 32B (qwen.qwen3-32b-v1:0, AWS Bedrock us-east-1, ON_DEMAND)
Harness: openclaw@2026.4.22
Runs: 30 (10 tasks × 3 runs each)
Campaign date: 2026-05-21


The Qwen3 family already had two models in the dataset before this campaign ran. Qwen3 Next 80B A3B scored 21/30. Qwen3-Coder-30B-A3B scored 22/30. Both are mixture-of-experts models; despite large total parameter counts, each activates around 3 billion parameters per forward pass.

Qwen3 32B is dense. All 32 billion parameters are active on every forward pass. The controlled question was whether that compute difference (roughly 10× more active parameters) translates into meaningfully better agentic task completion.

The answer: 23/30. One pass above Coder, two above Next. The dense advantage is real. It’s also narrow.


What the harness asks

[Observed — harness spec]

Ten tasks, three runs each, thirty total. agentic-core-v1 covers the everyday work of a software agent on a real codebase: fix a failing test, refactor duplicated logic, investigate a log file, trace execution paths through the codebase, make a targeted minimal fix, handle an ambiguous requirement, execute a four-step sequential plan, recover from an injected tool error, detect when a computation is impossible with the given data, and run a SQL investigation.

A pass requires the model to complete the task correctly. Failure modes are classified: wrong_answer (model attempted the task and produced incorrect output), gave_up_mid_plan (abandoned mid-execution), tool_call_hallucinated (called a nonexistent tool or fabricated arguments), tool_call_redundancy (loop without progress).
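The four failure classes above map naturally onto a small enumeration. The sketch below is illustrative only: the class name `FailureMode` and the helper `tally` are hypothetical and are not part of the openclaw harness; only the four label strings come from the harness description.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    """The four failure classes named in the harness description."""
    WRONG_ANSWER = "wrong_answer"                      # attempted, incorrect output
    GAVE_UP_MID_PLAN = "gave_up_mid_plan"              # abandoned mid-execution
    TOOL_CALL_HALLUCINATED = "tool_call_hallucinated"  # nonexistent tool or fabricated arguments
    TOOL_CALL_REDUNDANCY = "tool_call_redundancy"      # loop without progress


def tally(labels):
    """Count failures per mode; an unknown label raises ValueError."""
    return Counter(FailureMode(label) for label in labels)
```

Parsing labels through the enum means a misspelled or unexpected failure label fails loudly instead of being counted as a new category.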

Two tasks are structural traps. task_09 gives the model a CSV with fewer than 10 rows and asks for a 10-day moving average. The correct response is refusal. task_08 injects a path error on the first tool call; the model must detect it, adapt, and produce a correct result, with the checker verifying the final output, not just the recovery attempt.
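The behaviour task_09 looks for can be sketched as a guard that checks the row count before computing anything. This is a minimal illustration, not the harness checker: the function name `moving_average` and the choice to signal refusal by returning `None` are assumptions made for the example.

```python
from typing import Optional, Sequence


def moving_average(values: Sequence[float], window: int = 10) -> Optional[list[float]]:
    """Return trailing moving averages, or None when there is too little data."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        # Fewer rows than the window: the honest answer is "cannot compute".
        return None
    return [
        sum(values[end - window:end]) / window
        for end in range(window, len(values) + 1)
    ]
```

With three rows and a window of 10 the function returns `None`; with exactly ten rows it returns a single average.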


The results

[Observed — data pack per_task_results]

| Task | Score | Failure mode |
| --- | --- | --- |
| task_01 fix_failing_test | 3/3 | |
| task_02 refactor_duplicated_code | 3/3 | |
| task_03 investigate_log | 3/3 | |
| task_04 trace_through_codebase | 3/3 | |
| task_05 minimal_fix | 2/3 | wrong_answer (run 1) |
| task_06 handle_ambiguous_requirement | 3/3 | |
| task_07 multi_step_plan | 3/3 | |
| task_08 recover_from_tool_error | 0/3 | wrong_answer (all 3) |
| task_09 know_when_to_stop | 0/3 | wrong_answer (all 3) |
| task_10 sql_investigation | 3/3 | |

Total: 23/30 (76.67%). Seven failures, all wrong_answer (verified: data pack task_outcomes.failure_mode). No infrastructure errors, no gave_up_mid_plan, no tool call formatting failures.
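The totals can be re-derived from the per-task scores. The snippet below is just that arithmetic; the variable names are chosen for the example, and the scores are copied from the results table.

```python
# Passes per task out of 3 runs, copied from the results table.
passes = {
    "task_01": 3, "task_02": 3, "task_03": 3, "task_04": 3, "task_05": 2,
    "task_06": 3, "task_07": 3, "task_08": 0, "task_09": 0, "task_10": 3,
}
RUNS_PER_TASK = 3

total_runs = len(passes) * RUNS_PER_TASK
total_passes = sum(passes.values())
failures = total_runs - total_passes
pass_rate = round(100 * total_passes / total_runs, 2)

print(f"{total_passes}/{total_runs} passed ({pass_rate}%), {failures} failures")
# prints: 23/30 passed (76.67%), 7 failures
```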

Seven out of seven failures are the same failure mode. The model called tools, produced output, and produced incorrect output. It did not stall, loop, or break loudly.


What the MoE comparison shows

[Observed — intra-family data pack, cross-campaign]

| Model | Architecture | Active params | Score | $/pass |
| --- | --- | --- | --- | --- |
| Qwen3 Next 80B A3B | MoE | ~3B | 21/30 | $0.00122 |
| Qwen3-Coder 30B A3B | MoE | ~3B | 22/30 | $0.0018 |
| Qwen3 32B | Dense | 32B | 23/30 | $0.000955 |

Same lab, same general training lineage, same Bedrock deployment region. The dense model activates roughly ten times as many parameters per forward pass. The score improvement is 1–2 passes out of 30.

If active parameter count were the primary driver of agentic task completion, ten times the active compute should produce substantially more than a 5–10% relative lift in pass rate. What we see instead is consistent with a threshold effect: beyond the roughly 3B active parameters both MoE models use, additional capacity yields diminishing returns on the tasks agentic-core-v1 measures.
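The size of the lift depends on how it is measured. The sketch below computes it two ways from the scores in the table: relative to the MoE score, and in percentage points of the 30-run total. The helper names are hypothetical, and the 3B figure is the approximate active-parameter count quoted above.

```python
def relative_lift(new: int, old: int) -> float:
    """Lift as a percentage of the old score."""
    return 100 * (new - old) / old


def point_lift(new: int, old: int, runs: int = 30) -> float:
    """Lift in percentage points of the pass rate."""
    return 100 * (new - old) / runs


dense, coder, nxt = 23, 22, 21          # passes out of 30
active_ratio = 32 / 3                   # dense vs ~3B active, roughly 10.7x

print(round(relative_lift(dense, coder), 1), round(relative_lift(dense, nxt), 1))
print(round(point_lift(dense, coder), 1), round(point_lift(dense, nxt), 1))
```

Relative to the MoE scores the lift is about 4.5% and 9.5%; in percentage points of pass rate it is about 3.3 and 6.7. Either way it is small next to a roughly tenfold difference in active parameters.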

The Qwen3 family ceiling on this harness is 23/30, held by the most compute-expensive configuration in the family.

[Speculation] Whether 3B active parameters is close to the threshold where capacity stops being the bottleneck, with training recipe, data quality, or fine-tuning direction taking over, is not established by this campaign alone. The Qwen3 data points are consistent with that interpretation. A 5B or 8B dense variant from the same family would be a clean test.


The task_08 failure

[Observed — data pack task_outcomes, run_metrics]

task_08 injects a path error on the first tool call. The model receives a file-not-found response and must handle it: locate the correct path, retry, and write the right answer. Qwen3 32B made exactly two tool calls in all three runs and declared success each time (verified: data pack run_metrics.tool_calls_per_run). The checker rejected all three outputs as incorrect.

The sequence in each run: first call fails with the injected error, second call produces output, model writes a completion. The error-recovery step, locating the correct path and verifying the result, was skipped. The model processed the error signal and moved forward without checking what it produced.
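For contrast, the recovery sequence the task expects (detect the error, locate the correct path, retry, verify) might look like the sketch below. It is an assumption-laden illustration: `read_with_recovery` is a hypothetical helper, the search-by-file-name strategy and the non-empty check are stand-ins for whatever verification a real agent would need, and none of it reflects the harness internals.

```python
import tempfile
from pathlib import Path


def read_with_recovery(requested: Path, search_root: Path) -> str:
    """Read a file; on file-not-found, locate it by name, retry, and verify."""
    try:
        return requested.read_text()
    except FileNotFoundError:
        # Step 1: locate candidates with the same file name under the root.
        candidates = sorted(
            p for p in search_root.rglob(requested.name) if p.is_file()
        )
        if len(candidates) != 1:
            # Zero or several matches: surface the error rather than guess.
            raise
        # Step 2: retry against the located path.
        text = candidates[0].read_text()
        # Step 3: verify before declaring success.
        if not text.strip():
            raise ValueError(f"{candidates[0]} is empty; refusing to report success")
        return text


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "data").mkdir()
        (root / "data" / "input.csv").write_text("a,b\n1,2\n")
        # The requested path is wrong; the helper finds data/input.csv instead.
        print(read_with_recovery(root / "input.csv", root))
```

The point of the design is that every exit is either a verified result or a raised error; there is no path that reports success without a check.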

This is a specific failure mode: false confidence under adversarial input. The model is not confused about the tool interface. It is not stalling. It is completing with a wrong answer, at normal latency, with a completion that reads successful.

The same wrong_answer pattern appears in task_05 (run 1) and task_09 (all 3). Three different task types, same failure shape. The cross-task consistency suggests a model-level pattern rather than a gap in one specific skill.


task_09: the expected zero

[Observed — data pack per_task_results]

task_09 asks for a 10-day moving average from a three-row CSV. The correct response is to state the data is insufficient. Qwen3 32B computed and returned a numeric result in all three runs. Qwen3-Coder 30B A3B scored 1/3 on task_09; the other non-reasoning Qwen models in the dataset scored 0/3. Qwen3 32B joins the 0/3 group (verified: cross-campaign task_09_results view).

Dense full-activation did not change this. The structural failure (executing the impossible rather than questioning the premise) is not a compute problem. It is a training problem, and the 32B dense configuration inherits it from the family.


The predictions

[Observed — data pack predictions_scoring]

| Prediction | Expected | Actual | Result |
| --- | --- | --- | --- |
| P1 Overall score | 24–27/30 | 23/30 | Wrong (one pass below range) |
| P2 task_09 | 0/3 | 0/3 | Correct |
| P3 task_03 | ≥ 2/3 | 3/3 | Correct |

P1 was wrong by one pass. The prediction range of 24–27 assumed the dense architecture would push the model into the upper-mid tier. It didn’t, and the reason is specific: task_08’s 0/3 cost three passes that had been counted as likely wins in the range.

The P1 miss is meaningful. GPT-OSS-20B at 20B dense scored 25/30. Qwen3 32B at 32B dense scored 23/30. Qwen’s dense model carries 60% more parameters than the OpenAI OSS model and scores two passes lower. Parameter count does not fully explain score. The training recipe matters at least as much.
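The comparison reduces to two numbers, which the short calculation below reproduces from the figures quoted in the paragraph. Variable names are illustrative.

```python
# (parameters in billions, passes out of 30), as quoted above.
gpt_oss_params, gpt_oss_score = 20, 25
qwen_params, qwen_score = 32, 23

extra_params_pct = 100 * (qwen_params - gpt_oss_params) / gpt_oss_params
score_gap = qwen_score - gpt_oss_score

print(f"{extra_params_pct:.0f}% more parameters, {score_gap:+d} passes")
# prints: 60% more parameters, -2 passes
```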


Cost and latency

[Observed — data pack run_summary, cost_breakdown]

$0.0220 total campaign cost. $0.000955 per passing run.

task_03 (log investigation) was the most expensive task: $0.0061, 28% of total spend. The spike comes from large scaffold context: approximately 39K input tokens across three runs at Qwen3 32B’s pricing of $0.15/$0.60 per 1M input/output tokens. The campaign is priced at standard tier; the task_03 spike is present but does not destabilise the overall cost profile.
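The cost figures can be sanity-checked with a few lines of arithmetic. The sketch uses only numbers quoted in this section; the 39K token count is approximate, so the input-cost result is approximate too.

```python
INPUT_PRICE_PER_M = 0.15    # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 0.60   # USD per 1M output tokens (not needed for the input estimate)

task03_input_tokens = 39_000          # approximate, across three runs
task03_cost = 0.0061                  # USD, reported
campaign_cost = 0.0220                # USD, reported
passing_runs = 23

task03_input_cost = task03_input_tokens / 1_000_000 * INPUT_PRICE_PER_M
task03_share_pct = 100 * task03_cost / campaign_cost
cost_per_pass = campaign_cost / passing_runs

print(f"task_03 input cost ~ ${task03_input_cost:.5f}")
print(f"task_03 share of spend ~ {task03_share_pct:.0f}%")
print(f"cost per pass ~ ${cost_per_pass:.6f}")
```

Input tokens alone account for about $0.00585 of the $0.0061 spent on task_03, which supports the reading that scaffold context drives the spike. Dividing the rounded total by 23 gives about $0.000957 rather than the reported $0.000955; the small gap is presumably rounding in the reported total.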

task_04 run 2 was a latency outlier at 67 seconds; the other runs typically took 3–6 seconds (verified: data pack run_metrics.latency_by_run). No infrastructure error was logged; likely a Bedrock cold-start or transient throttle event. All other runs completed within 7 seconds. Campaign wall clock was approximately 3 minutes.

At $0.000955/pass, Qwen3 32B is price-competitive:

| Model | Score | $/pass |
| --- | --- | --- |
| GLM-5 | 27/30 | $0.0065 |
| MiniMax M2.5 | 27/30 | |
| GPT-OSS 20B | 25/30 | $0.000481 |
| Kimi K2.5 | 24/30 | |
| Qwen3 32B | 23/30 | $0.000955 |
| Qwen3-Coder 30B A3B | 22/30 | $0.0018 |
| Qwen3 Next 80B A3B | 21/30 | $0.00122 |
| DeepSeek V3.2 | 19/30 | |
| Nemotron Super 120B | 12/30 | $0.0016 |
| Nemotron Nano 3 30B | 10/30 | $0.00079 |

Mid-tier cost, mid-tier performance. Neither a cost floor (GPT-OSS 20B costs roughly half as much per pass) nor a ceiling (four models in the table score higher).
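Where Qwen3 32B sits in the table can be checked by sorting on each column. The sketch below copies the table rows as given; models with no listed $/pass are kept with `None` and left out of the price ranking rather than assigned a guessed value.

```python
from typing import Optional

# (model, passes out of 30, USD per pass or None when not listed)
rows: list[tuple[str, int, Optional[float]]] = [
    ("GLM-5", 27, 0.0065),
    ("MiniMax M2.5", 27, None),
    ("GPT-OSS 20B", 25, 0.000481),
    ("Kimi K2.5", 24, None),
    ("Qwen3 32B", 23, 0.000955),
    ("Qwen3-Coder 30B A3B", 22, 0.0018),
    ("Qwen3 Next 80B A3B", 21, 0.00122),
    ("DeepSeek V3.2", 19, None),
    ("Nemotron Super 120B", 12, 0.0016),
    ("Nemotron Nano 3 30B", 10, 0.00079),
]

qwen_score, qwen_price = 23, 0.000955

priced = sorted((r for r in rows if r[2] is not None), key=lambda r: r[2])
cheaper = [name for name, _, price in priced if price < qwen_price]
higher = [name for name, score, _ in rows if score > qwen_score]

print("cheaper per pass:", cheaper)
print("higher score:", higher)
```

On the listed figures, two priced models are cheaper per pass than Qwen3 32B and four models score higher.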


What we don’t know yet

[Speculation]

The task_08 false-confidence failure is consistent across runs but its mechanism is not documented at the transcript level for this campaign. Whether the 2-call completion pattern reflects a training-level shortcut on error-recovery sequences (the model learned “handle error, produce output” without a verify step) or an inference-time artifact would require transcript analysis. All we have from the data pack is the tool call count and the failure mode classification. Both are confirmed; the cause is a hypothesis.

The per-parameter gap between Qwen3 32B and GPT-OSS-20B is also unexplained. OpenAI’s smaller dense model scores higher with fewer parameters. Whether the gap comes from training data, fine-tuning methodology, or RLHF emphasis on agentic tasks cannot be determined from external documentation.


What it means for builders

[Observed — cross-campaign data]

For a builder choosing between Qwen3 32B and a Qwen3 MoE variant: the dense model is marginally more reliable on the tasks agentic-core-v1 measures. The score difference is 1–2 passes over 30 runs. Whether that margin justifies the memory and serving cost difference of dense over MoE depends entirely on deployment constraints. At similar cost per pass and similar performance, serving infrastructure dominates the decision.

The task_08 result is the more actionable finding. A model that confidently completes with a wrong answer under an injected tool error is harder to catch in production than one that fails loudly. If your agentic pipeline involves tool calls that can fail, and the correct behaviour is detection-then-retry, Qwen3 32B’s three-run zero on task_08 is a first-pass filter, not a footnote.
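One way to use a result like this as a first-pass filter is to name the tasks a pipeline cannot do without and screen candidates on them before looking at the total. The sketch below is hypothetical: `passes_filter` and the choice of must-pass tasks are illustrative, and only the Qwen3 32B per-task scores come from this campaign.

```python
def passes_filter(per_task: dict[str, int], must_pass: set[str], min_passes: int = 1) -> bool:
    """True when every must-pass task has at least min_passes passing runs."""
    return all(per_task.get(task, 0) >= min_passes for task in must_pass)


# Passes out of 3 per task for Qwen3 32B, from the results table.
qwen3_32b = {
    "task_01": 3, "task_02": 3, "task_03": 3, "task_04": 3, "task_05": 2,
    "task_06": 3, "task_07": 3, "task_08": 0, "task_09": 0, "task_10": 3,
}

# A pipeline whose tool calls can fail would require error recovery.
print(passes_filter(qwen3_32b, {"task_08"}))             # False
# A pipeline that only needs sequential plans and SQL would not.
print(passes_filter(qwen3_32b, {"task_07", "task_10"}))  # True
```

Screening on must-pass tasks first keeps a strong aggregate score from masking a zero on the one behaviour the deployment depends on.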

task_07 (3/3) and task_10 (3/3) show the model handles structured sequential plans and SQL investigation without issue. On tasks with clear output formats and no adversarial input, 23/30 is a credible mid-tier result. The failures are concentrated in the two tasks where the correct response is something other than confident completion.

ClawWorks Weekly
