The activation ceiling

May 18, 2026 · campaign-reports

Campaign: 2026-05-18-qwen3-next-80b-a3b-agentic-core-v1
Model: Qwen3 Next 80B A3B (qwen.qwen3-next-80b-a3b, MoE 80B total / 3B activated, AWS Bedrock us-east-1)
Harness: openclaw@2026.4.22
Runs: 30 (10 tasks × 3 runs each)
Campaign date: 2026-05-18

Mixture-of-experts (MoE) models activate only a fraction of their parameters per forward pass. Qwen3 Next 80B A3B has 80 billion parameters but routes each token through roughly 3 billion of them. In theory, that gives you the capacity of a large model at inference costs closer to a small one. Qwen’s post-training pipeline is strong, so the architecture made sense to test.

The question was specific: does a well-trained MoE at 3B activation outperform a comparable dense model on agentic coding tasks? GPT-OSS 120B had scored 23/30 on agentic-core-v1 the day before, setting a reference point for 120B-scale dense generalists at the same price tier. Qwen3 Next was positioned to beat it, or at least match it, on an efficiency argument.

Rigg predicted 24/30 (range 21–27), with specific predictions on task_09 (0/3), task_07 (2/3), task_03 (2–3/3), cost below $0.004/pass, and no infrastructure errors.

The result: 21/30. Below the point estimate. At the floor of the predicted range. And one failure that landed nowhere in anyone’s predictions.

What the harness asked

agentic-core-v1 is 10 task types, 3 runs each, 30 total. Every run starts from scratch with no memory of prior attempts. A pass means the harness checker accepts the model’s final output. Failure mode is either wrong_answer (model returned an incorrect result) or gave_up_mid_plan (model stopped before finishing). The tasks cover debugging, refactoring, log investigation, codebase tracing, minimal edits, ambiguous requirements, multi-step sequential planning, tool error recovery, impossible computations, and SQL investigation.

A model that passes 9 of 10 task types at 3/3 and fails one at 0/3 scores 27/30. The per-task breakdown matters more than the headline number.

What Qwen3 Next did

[Observed]

21 of 30 runs passed. Pass rate: 70.0% (verified: pass_rate_by_task.csv). All failures were wrong_answer: the model executed the task and its tool calls worked, but the checker rejected the output. No infrastructure errors, no gave_up_mid_plan hits (verified: failure_mode_histogram.csv). The campaign completed in approximately 120 seconds.

Task	Result	Avg tool calls	Avg latency
task_01 fix failing test	0/3	2.0	3.31s
task_02 refactor duplicated code	2/3	—	—
task_03 investigate log	2/3	—	—
task_04 trace through codebase	3/3	6.7	6.58s
task_05 minimal fix	2/3	—	—
task_06 handle ambiguous requirement	3/3	5.0	8.89s
task_07 multi-step plan	3/3	4.0	4.74s
task_08 recover from tool error	3/3	2.0	2.45s
task_09 know when to stop	0/3	1.3	3.40s
task_10 SQL investigation	3/3	3.3	2.90s

(verified: pass_rate_by_task.csv, latency_distribution.csv)
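The headline figures can be rechecked from the Result column above. This is a minimal sketch that hard-codes the table's pass counts rather than reading pass_rate_by_task.csv, whose column layout isn't shown here; the names `results` and `summarize` are illustrative, not part of the harness.

```python
# Pass counts per task, copied from the table above (3 runs per task).
RUNS_PER_TASK = 3
results = {
    "task_01": 0, "task_02": 2, "task_03": 2, "task_04": 3, "task_05": 2,
    "task_06": 3, "task_07": 3, "task_08": 3, "task_09": 0, "task_10": 3,
}

def summarize(passes_by_task, runs_per_task=RUNS_PER_TASK):
    """Return total passes, total runs, pass rate (%), and zero-score tasks."""
    total_passes = sum(passes_by_task.values())
    total_runs = runs_per_task * len(passes_by_task)
    pass_rate = round(100 * total_passes / total_runs, 1)
    zero_tasks = sorted(t for t, p in passes_by_task.items() if p == 0)
    return total_passes, total_runs, pass_rate, zero_tasks

total_passes, total_runs, pass_rate, zero_tasks = summarize(results)
print(f"{total_passes}/{total_runs} passed ({pass_rate}%)")  # 21/30 passed (70.0%)
print("0/3 tasks:", zero_tasks)  # ['task_01', 'task_09']
```

The two zero-score tasks account for 6 of the 9 failed runs; the other 3 are single failures on task_02, task_03, and task_05.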

The task_09 result was expected: every non-reasoning model in the harness fails the impossible-computation task. The task_01 result was not.

Why did the debugging task fail completely?

[Observed]

task_01 (fix_failing_test) is one of two tasks in agentic-core-v1 where Qwen3 Next scored 0/3, and the only one where that was unexpected. A 0/3 on task_01 hasn’t happened before. Every other model in the harness (Claude, GPT-5.5, GPT-OSS 120B, Mistral Large 3, Devstral 2, Llama 3.3 70B, DeepSeek V4-Flash, DeepSeek V3.2) passed task_01 at least 2/3. Qwen3 Next failed all three runs with the same failure mode (wrong_answer) and the same tool call count: 2 per run.

Two tool calls is low. On passing runs across the harness, task_01 typically takes 3–5 tool calls: read the failing test, read the source, identify the bug, write the fix, verify. Two tool calls is read-then-write, no verification. The model isn’t exploring the failure. It’s guessing.

The guess was wrong three times. Not three different wrong guesses, which might suggest different approaches. Consistent: 2 tool calls, wrong answer, done. Three independent runs with no memory of each other, same output shape each time.

[Speculation]

The 3B activation budget appears to be the constraint. task_04 (trace through codebase) passed 3/3 with 6–7 tool calls and required reading several files. That task is read-heavy with no precise write at the end. task_01 requires holding a mental model of a bug, tracing it through a code path, and producing a correct patch. That chain seems to exceed what 3B active parameters can do reliably on this task.

This is a hypothesis, not a conclusion. We only have one model at this activation density in the harness, so a training explanation can’t be ruled out. Qwen3 Next may lack specific debugging training signal, independent of architecture. What we can say: whatever the cause, it’s consistent.

What we were wrong about

[Observed]

Rigg predicted task_07 (multi_step_plan) at 2/3, citing activation-density concerns about sequential planning. Actual result: 3/3.

All three runs used exactly 4 tool calls, average latency 4.74s, no outliers. Compare to Mistral Large 3’s task_07, which also passed 3/3 but produced a 62-second tail-latency run on the same task. Qwen3 Next was correct and consistent on multi-step planning.

The prediction’s own falsification condition stated: “3/3 would confirm that planning coherence is driven by training quality rather than activation density.” That condition was met. On this evidence, training signal matters more than activation count for sequential planning.

What Qwen3 Next at 3B activation could not do reliably in this campaign: fix a failing test. What it handled fine: multi-step planning, tool recovery, codebase tracing, SQL investigation, ambiguous requirements.

Does the MoE efficiency thesis hold?

[Observed]

At the same price tier as GPT-OSS 120B, Qwen3 Next scored 21/30 vs 23/30. Nearly identical total spend. Two fewer passing runs. A critical blind spot on debugging.

Model	Score	Total cost	Cost/pass	Architecture
Devstral 2 123B	27/30	$0.057	$0.0019	MoE 123B/specialist
GPT-OSS 120B	23/30	$0.030	$0.0013	Dense 120B
Qwen3 Next 80B A3B	21/30	$0.026	$0.00122	MoE 80B/3B active
Llama 3.3 70B	20/30	$0.090	$0.0045	Dense 70B

(verified: cost figures from campaign_cost_summary.csv and prior campaign records)

The efficiency argument doesn’t hold at 3B activation for this task distribution. GPT-OSS 120B, with 120B parameters activated, scores 23/30 vs Qwen3 Next’s 21/30 for $0.004 more total. Barely higher cost, 2 more passing runs, no debugging blind spot.

Devstral 2 shows what MoE can do with more activated parameters and specialist training: 27/30 at $0.057 total. That is six more passing runs, at a little over twice the total cost. The architecture isn’t the problem. 3B activation on a general-purpose base is the specific constraint.

The one area where Qwen3 Next wins: $0.00122/pass is the lowest cost per passing run in the harness. For workloads that exclude debugging tasks, the effective score over the remaining 9 task types is 21/27 (77.8%). That’s a narrow but real value case.
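The $0.004 gap and the 21/27 figure are simple arithmetic on the comparison table. A quick sketch, using only the rounded totals printed above (so anything derived from them is approximate); the dictionary names are illustrative.

```python
# Scores and total costs copied from the comparison table above.
qwen = {"passes": 21, "total_cost": 0.026}
gpt_oss = {"passes": 23, "total_cost": 0.030}

# What GPT-OSS 120B costs over Qwen3 Next, and what it buys.
extra_cost = round(gpt_oss["total_cost"] - qwen["total_cost"], 3)
extra_passes = gpt_oss["passes"] - qwen["passes"]

# Effective score if task_01 (0/3) is excluded from the workload:
# the 21 passes are unchanged, but only 27 runs remain in the denominator.
RUNS_PER_TASK = 3
remaining_runs = 30 - RUNS_PER_TASK
effective_rate = round(100 * qwen["passes"] / remaining_runs, 1)

print(f"GPT-OSS 120B: +${extra_cost} total for +{extra_passes} passes")
print(f"Qwen3 Next without task_01: {qwen['passes']}/{remaining_runs} ({effective_rate}%)")
```

Excluding task_01 drops three runs from the denominator and none from the numerator, which is why the rate moves from 70.0% to 77.8%.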

What we don’t know yet

[Unobserved]

We didn’t test Qwen3 Next in thinking mode. The model supports extended chain-of-thought, which spends more compute per task by generating more tokens before answering; it does not change how many parameters are active per forward pass. Thinking mode might close the task_01 gap. We don’t have data on that. It’s the most obvious follow-up.

The run 1 failures on task_02, task_03, and task_05 also remain unexplained. Each failed on run 1 with 2 tool calls, then passed on runs 2 and 3 with 4–5 tool calls. If that pattern is reproducible, it suggests variable exploration effort across runs (possibly MoE routing variance, possibly temperature, possibly neither). We’d need more runs to distinguish.
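To make the pattern concrete, the sketch below tabulates only the runs whose tool-call counts are stated in this report: the three task_01 runs and the three runs each of task_02, task_03, and task_05. Runs 2 and 3 are reported only as "4–5 tool calls", so they carry a range instead of an exact count. The record layout and the `tally` helper are assumptions for illustration, not the harness's own format.

```python
# (task, run, min_calls, max_calls, passed), as described in the text.
runs = [("task_01", r, 2, 2, False) for r in (1, 2, 3)]
for task in ("task_02", "task_03", "task_05"):
    runs.append((task, 1, 2, 2, False))              # run 1: 2 calls, failed
    runs += [(task, r, 4, 5, True) for r in (2, 3)]  # runs 2-3: 4-5 calls, passed

def tally(records, max_calls_threshold=2):
    """Split runs into shallow (<= threshold calls) and deeper, counting passes."""
    buckets = {"shallow": [0, 0], "deeper": [0, 0]}  # [passes, runs]
    for _task, _run, _lo, hi, passed in records:
        key = "shallow" if hi <= max_calls_threshold else "deeper"
        buckets[key][0] += int(passed)
        buckets[key][1] += 1
    return buckets

buckets = tally(runs)
print(buckets)  # {'shallow': [0, 6], 'deeper': [6, 6]}
```

On these four tasks, every 2-call run failed and every 4–5-call run passed. The split does not generalize across tasks: task_08 passed 3/3 at an average of 2.0 tool calls.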

Predictions scored

Prediction	Expected	Actual	Result
P1 — overall score 24/30 (range 21–27)	24/30	21/30	⚠️ In range, below point estimate
P2 — task_09: 0/3	0/3	0/3	✅ Correct
P3 — task_07: 2/3	2/3	3/3	❌ Wrong
P4 — task_03: 2–3/3	2–3/3	2/3	✅ Correct
P5 — cost $0.01–$0.08, <$0.004/pass	in range	$0.026 / $0.00122/pass	✅ Correct
P6 — no infrastructure errors	clean	clean	✅ Correct

4/6. P1 landed at the floor of the predicted range. The task_01 0/3 failure, which wasn’t predicted, cost exactly 3 passes that would have put the score at 24/30. P3 was wrong in the useful direction: planning coherence appears to be a training artifact, not an activation-density artifact.
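The 4/6 tally can be reproduced with a small scoring rule. This is a sketch under one assumption about how the table was graded: a prediction with a point estimate counts as correct only if it hits the point, and as "in range" if it misses the point but stays inside the stated range. `score_range` is a hypothetical helper, not harness code.

```python
def score_range(actual, low, high, point=None):
    """Grade a prediction against its range and optional point estimate."""
    if not (low <= actual <= high):
        return "wrong"
    if point is not None and actual != point:
        return "in range, off point estimate"
    return "correct"

verdicts = {
    "P1": score_range(21, 21, 27, point=24),  # overall score
    "P2": score_range(0, 0, 0),               # task_09 passes
    "P3": score_range(3, 2, 2),               # task_07 passes
    "P4": score_range(2, 2, 3),               # task_03 passes
    # P5 has two conditions: total cost in range and cost/pass under the cap.
    "P5": "correct" if 0.01 <= 0.026 <= 0.08 and 0.00122 < 0.004 else "wrong",
    "P6": score_range(0, 0, 0),               # infrastructure error count
}

correct = sum(v == "correct" for v in verdicts.values())
print(f"{correct}/{len(verdicts)} correct")  # 4/6 correct

# Counterfactual from the text: task_01 at 3/3 instead of 0/3.
counterfactual_score = 21 + 3
```

Under this rule P1 is neither a hit nor a miss, which matches the ⚠️ in the table, and the counterfactual score lands exactly on the 24/30 point estimate.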

The model did what the predictions said it would on cost, infrastructure, and the one expected zero-score task. The surprise was task_01, a task we assumed was too straightforward to flag.

ClawWorks Weekly