The refactor paradox
Campaign: 2026-05-19-qwen3-coder-30b-agentic-core-v1
Model: Qwen3-Coder-30B-A3B (Alibaba, MoE 30B total / 3B activated)
Harness: agentic-core-v1, 10 tasks × 3 runs = 30 total
Campaign date: 2026-05-19
Yesterday we ran the companion test to Qwen3 Next 80B A3B. Same lab, same harness, same activation budget: 3B active parameters per forward pass. Different training signal: Qwen3-Coder-30B-A3B is Alibaba’s code-specialist MoE, fine-tuned specifically on code. The research question was simple. Does code-specialist post-training buy measurable lift when the activation budget is the binding constraint?
It does. Just not where you’d expect it.
The Coder variant scored 22/30 at $0.0018/pass. The generalist scored 21/30 at $0.00122/pass (prior campaign: qwen3-next-80b-a3b-agentic-core-v1-2026). One extra passing run, at roughly 1.5x the cost per pass, with 62.5% fewer total parameters (30B against 80B). The specialist beat the generalist. Then it went and failed refactor_duplicated_code at 1/3, the task most directly in its wheelhouse, while the generalist scored 2/3 on the same task.
That contradiction is the story.
What the harness actually tests
agentic-core-v1 runs 10 task types, 3 runs each, 30 total. Every run starts clean: no state, no memory of previous attempts. A pass means the harness checker accepts the final output. Failures fall into two shapes: wrong_answer (model returned something; checker rejected it) or gave_up_mid_plan (model stopped before finishing).
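The three outcomes described above can be sketched as a small classifier. This is only an illustration of the definitions in the text: the names `Outcome` and `classify_run` and the two boolean inputs are hypothetical, and the harness's real checker interface is not shown in this write-up.

```python
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    WRONG_ANSWER = "wrong_answer"
    GAVE_UP_MID_PLAN = "gave_up_mid_plan"


def classify_run(produced_final_output: bool, checker_accepted: bool) -> Outcome:
    """Map a run onto the outcome shapes named in the text (hypothetical helper)."""
    if not produced_final_output:
        # The model stopped before finishing, so there is nothing to check.
        return Outcome.GAVE_UP_MID_PLAN
    # The model returned something; the checker decides pass vs wrong_answer.
    return Outcome.PASS if checker_accepted else Outcome.WRONG_ANSWER


print(classify_run(True, False).value)   # model answered, checker rejected it
print(classify_run(False, False).value)  # model stopped mid-plan
```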
The task distribution covers debugging, refactoring, log investigation, codebase tracing, minimal edits, ambiguous requirements, multi-step sequential planning, tool error recovery, impossible computation, and SQL investigation.
One task is a known trap. task_09 asks the model to recognise that the provided data is insufficient to answer the question and refuse accordingly. Non-reasoning models usually miss this. The other nine are solvable given adequate agentic behaviour.
What Qwen3-Coder did
[Observed]
22 of 30 runs passed. Pass rate: 73.3% (verified: pass_rate_by_task.csv). Total cost: $0.039 (verified: verification/cost_breakdown.csv). Cost per pass: $0.0018.
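As a quick check on the headline numbers, the arithmetic can be reproduced from the three reported figures. The variable names are ours; the values are the ones stated above.

```python
# Figures as reported for this campaign.
passes = 22
total_runs = 30
total_cost_usd = 0.039

pass_rate = passes / total_runs
cost_per_pass_usd = total_cost_usd / passes

print(f"pass rate: {pass_rate:.1%}")
print(f"cost per pass: ${cost_per_pass_usd:.4f}")
```

The unrounded cost per pass is slightly under $0.0018; the reported figure is that value rounded to four decimal places.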
| Task | Score | Notes |
|---|---|---|
| task_01 (fix_failing_test) | 2/3 | run2 wrong_answer (quick 2s fail) |
| task_02 (refactor_duplicated_code) | 1/3 | Two wrong_answers; predicted 3/3 |
| task_03 (investigate_log) | 3/3 | Clean sweep |
| task_04 (trace_through_codebase) | 2/3 | run2 quick-fail: 0 tool calls, 0.6s |
| task_05 (minimal_fix) | 3/3 | Clean sweep |
| task_06 (handle_ambiguous_requirement) | 2/3 | run1 quick-fail: 0 tool calls, 0.7s |
| task_07 (multi_step_plan) | 3/3 | Exceeded prediction |
| task_08 (recover_from_tool_error) | 3/3 | Clean sweep |
| task_09 (know_when_to_stop) | 1/3 | F4 triggered; run1 PASS, runs 2-3 fail |
| task_10 (sql_investigation) | 2/3 | run2 wrong_answer (quick, 1 tool) |
(verified: pass_rate_by_task.csv)
Four tasks swept clean (task_03, task_05, task_07, task_08). Four tasks dropped one run each (task_01, task_04, task_06, task_10). task_02 was a near-total failure at 1/3. task_09 produced one surprise.
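The grouping above follows directly from the per-task scores in the table. A minimal sketch of that tally, with the scores copied from the table and the variable names chosen for illustration:

```python
RUNS_PER_TASK = 3

# Passing runs per task, copied from the results table.
scores = {
    "task_01": 2, "task_02": 1, "task_03": 3, "task_04": 2, "task_05": 3,
    "task_06": 2, "task_07": 3, "task_08": 3, "task_09": 1, "task_10": 2,
}

total_passes = sum(scores.values())
clean_sweeps = sorted(t for t, s in scores.items() if s == RUNS_PER_TASK)
dropped_one = sorted(t for t, s in scores.items() if s == RUNS_PER_TASK - 1)
single_pass = sorted(t for t, s in scores.items() if s == 1)

print(f"total: {total_passes}/{RUNS_PER_TASK * len(scores)}")
print("clean sweeps:", clean_sweeps)
print("dropped one run:", dropped_one)
print("single pass:", single_pass)
```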
The code task that failed
[Observed]
task_02 (refactor_duplicated_code) scored 1/3. Two wrong_answers on the task built for code-specialist models: consolidate three near-identical functions into a shared abstraction.
Qwen3 Next 80B A3B (generalist) scored 2/3 on this task (prior campaign: qwen3-next-80b-a3b-agentic-core-v1-2026). The Coder variant, trained specifically on code, did worse.
The passing run used the standard pattern: read all three functions, identify the duplication, write the abstraction, verify. The failing runs executed the reading phase but produced the wrong abstraction. task_02 requires the model to decide which of several possible consolidations is the right one. At 3B activation, that discriminating step appears brittle. The Coder training signal may improve structural pattern recognition, but it does not appear to improve multi-function analysis that requires a precise correctness judgement.
[Speculation]
Code-specialist training likely over-represents code generation and under-represents code consolidation. Writing three functions is common in code training data. Deciding that three functions should become one, then producing exactly the right abstraction, is much less common. This is a hypothesis. We have one specialist model at this activation budget in the dataset; a training explanation and an architecture explanation are not currently separable.
Why did task_07 beat predictions?
[Observed]
Rigg predicted task_07 (multi_step_plan) at 2/3, based on the generalist’s 2/3 result and activation-density concerns. The Coder variant swept it at 3/3. All four sequential steps completed in every run.
The generalist dropped task_07 on a single wrong_answer run. The Coder variant had no such failure. Sequential file-writing is well-represented in code training data: scaffolding generation, boilerplate creation, ordered setup scripts. Specialist training appears to generalise beyond syntax to task-structure patterns common in software development workflows. That generalisation was not part of the original prediction model.
The quick-fail pattern at 3B activation
[Observed]
task_04 run2 and task_06 run1 both failed with zero tool calls and sub-second latency (0.6s and 0.7s respectively). The model issued no tool calls, producing a final answer without investigating anything.
This pattern appeared in Qwen3 Next 80B A3B on the same task types (prior campaign: qwen3-next-80b-a3b-agentic-core-v1-2026). Same activation budget, same failure shape, reproduced in a different model. A plausible explanation is that MoE routing is token-path-dependent: on some sequences, the 3B activated path may not engage tool-use behaviour. The model navigates the same tasks correctly on other runs, so the issue is run-to-run variance, not systematic inability.
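The quick-fail shape described in this section (a failed run with zero tool calls and sub-second latency) is simple to flag mechanically. The sketch below is an assumption-laden illustration: `RunRecord`, `is_quick_fail`, and the 1.0s cut-off are our own choices for "sub-second", not part of the harness, and the third record is a made-up contrast case rather than campaign data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    task: str
    run: int
    passed: bool
    tool_calls: int
    latency_s: float


# Assumption: "sub-second" is read as strictly under 1.0 seconds.
QUICK_FAIL_MAX_LATENCY_S = 1.0


def is_quick_fail(record: RunRecord) -> bool:
    """A failed run that answered without investigating anything."""
    return (
        not record.passed
        and record.tool_calls == 0
        and record.latency_s < QUICK_FAIL_MAX_LATENCY_S
    )


runs = [
    RunRecord("task_04", 2, False, 0, 0.6),   # from the results table
    RunRecord("task_06", 1, False, 0, 0.7),   # from the results table
    RunRecord("example_task", 1, False, 3, 12.0),  # illustrative only, not campaign data
]

quick_fails = [(r.task, r.run) for r in runs if is_quick_fail(r)]
print(quick_fails)
```

Requiring both conditions keeps slow failures and failures that did use tools out of the quick-fail bucket, which matters when comparing the pattern across models.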
task_09: a single correct pass, then nothing
[Observed]
task_09 (know_when_to_stop) scored 1/3. run1 passed; run2 returned gave_up_mid_plan; run3 returned wrong_answer. F4 triggered.
This makes Qwen3-Coder-30B-A3B the fourth non-reasoning model in the dataset to score on task_09, joining Devstral 2 (1/3, prior campaign: devstral-2-123b-agentic-core-v1-2026), NVIDIA Nemotron Super 3 120B (1/3, prior campaign: nemotron-super-3-120b-agentic-core-v1-2026), and DeepSeek V4-Flash (1/3, prior campaign: deepseek-v4-flash-agentic-core-v1-2026). The pattern across all four is identical: one correct pass on run1, then failure.
[Speculation]
Code-specialist training may improve data-availability conditioning. Code models are trained to understand when a computation cannot proceed due to missing inputs. That may explain why the correct token path activated on run1. It does not explain why runs 2 and 3 reverted. The insight is there; it is stochastic, not consistent. Production deployment cannot rely on it.
Does specialist training make sense at this activation budget?
[Observed]
22/30 vs 21/30 is a measurable lift, though a narrow one: one extra passing run across 30 attempts, at a higher cost per pass ($0.0018 against $0.00122). The control is as clean as this dataset allows: same lab, same activation budget, same harness version, consecutive campaign days.
The comparison table puts the result in context:
| Model | Score | Cost/pass | Architecture |
|---|---|---|---|
| Devstral 2 (Mistral) | 27/30 | $0.0019 | MoE 123B, code-specialist |
| MiniMax M2.5 | 27/30 | $0.0024 | reasoningContent per call |
| Kimi K2.5 (Moonshot) | 24/30 | $0.0044 | Dense, long-context |
| Qwen3-Coder-30B-A3B | 22/30 | $0.0018 | MoE 30B/3B, code-specialist |
| Qwen3 Next 80B A3B | 21/30 | $0.00122 | MoE 80B/3B, generalist |
| DeepSeek V3.2 | 19/30 | $0.0142 | Dense |
| NVIDIA Nemotron Super 3 120B | 12/30 | $0.0016 | MoE 120B/12B |
(verified: prior campaign records; qwen3-next-80b-a3b-agentic-core-v1-2026, devstral-2-123b-agentic-core-v1-2026, minimax-m2-5-agentic-core-v1-2026, kimi-k2-5-agentic-core-v1-2026, deepseek-v3-2-agentic-core-v1-2026, nemotron-super-3-120b-agentic-core-v1-2026)
Qwen3-Coder is the cheapest code-specialist in the dataset at $0.0018/pass, though only narrowly: Devstral 2 scores 5 runs higher at $0.0001 more per pass. That gap corresponds to roughly 40x more activated parameters (123B against 3B, if all 123B of the Devstral 2 parameters are counted as active). The trade-off argument for cost-constrained deployments is real; the task_02 failure is the thing to audit before committing to it.
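The comparisons drawn from the table can be recomputed with a few lines. This is a sketch over the table values only; the `Result` type is ours, and the `code_specialist` flag is read off the Architecture column rather than taken from any separate record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    passes: int
    cost_per_pass_usd: float
    code_specialist: bool  # taken from the Architecture column


RESULTS = [
    Result("Devstral 2", 27, 0.0019, True),
    Result("MiniMax M2.5", 27, 0.0024, False),
    Result("Kimi K2.5", 24, 0.0044, False),
    Result("Qwen3-Coder-30B-A3B", 22, 0.0018, True),
    Result("Qwen3 Next 80B A3B", 21, 0.00122, False),
    Result("DeepSeek V3.2", 19, 0.0142, False),
    Result("NVIDIA Nemotron Super 3 120B", 12, 0.0016, False),
]

by_name = {r.model: r for r in RESULTS}
coder = by_name["Qwen3-Coder-30B-A3B"]
devstral = by_name["Devstral 2"]
generalist = by_name["Qwen3 Next 80B A3B"]

specialists = [r for r in RESULTS if r.code_specialist]
cheapest_specialist = min(specialists, key=lambda r: r.cost_per_pass_usd)

score_gap = devstral.passes - coder.passes
cost_gap_usd = round(devstral.cost_per_pass_usd - coder.cost_per_pass_usd, 4)
cost_ratio_vs_generalist = round(
    coder.cost_per_pass_usd / generalist.cost_per_pass_usd, 2
)

print(cheapest_specialist.model)
print(f"Devstral 2 lead: {score_gap} runs for ${cost_gap_usd} more per pass")
print(f"Coder vs generalist cost per pass: {cost_ratio_vs_generalist}x")
```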
What we were wrong about
[Observed]
P3 (task_02 at 3/3) was the clearest miss. The prediction assumed code-specialist training would translate to clean refactoring performance. Two wrong_answers on the core code task, behind the generalist on the same task.
The F3 falsification condition (model scores below 20/30) did not trigger. At 22/30, the model performed above the threshold. The task_02 failure cost two runs that would have put the total at 24/30.
4/6 predictions correct overall, counting P4 (predicted 2/3, scored 3/3) as correct because the result exceeded the prediction.
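The two falsification conditions mentioned in this write-up reduce to simple predicates. The sketch below encodes them as we read them from the text: F3 fires when the total is below 20/30, and F4 fires when a non-reasoning model passes task_09 at least once. The function names are hypothetical.

```python
F3_THRESHOLD = 20  # F3: model scores below 20/30


def f3_triggered(total_passes: int) -> bool:
    return total_passes < F3_THRESHOLD


def f4_triggered(task_09_passes: int, is_reasoning_model: bool) -> bool:
    # F4: a non-reasoning model passes task_09 at least once.
    return (not is_reasoning_model) and task_09_passes >= 1


total_passes = 22
task_02_passes = 1
task_09_passes = 1

print("F3:", f3_triggered(total_passes))
print("F4:", f4_triggered(task_09_passes, is_reasoning_model=False))

# Counterfactual from the text: a clean sweep on task_02 adds the two lost runs.
counterfactual_total = total_passes + (3 - task_02_passes)
print(f"with task_02 at 3/3: {counterfactual_total}/30")
```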
What we don’t know yet
[Unobserved]
We did not test Qwen3-Coder in thinking mode. Extended chain-of-thought might close the task_02 gap, where failing runs showed a wrong discriminatory decision rather than a wrong tool call. We do not have that data.
We also cannot currently separate the task_02 failure into a training gap versus an activation-budget constraint. A dense Qwen3-Coder at larger activation would give us that separation. No such model is in the dataset.
Predictions scored
| Prediction | Expected | Actual | Result |
|---|---|---|---|
| P1 — overall score 22/30 | 22/30 | 22/30 | ✅ Correct |
| P2 — task_09: 0/3 (F4 if ≥1) | 0/3 | 1/3 | ⚠️ F4 triggered |
| P3 — task_02: 3/3 (code-specialist clean sweep) | 3/3 | 1/3 | ❌ Wrong |
| P4 — task_07: 2/3 | 2/3 | 3/3 | ✅ Exceeded |
| P5 — quick-fail pattern reproduced | reproduced | reproduced | ✅ Correct |
| P6 — cost below $0.003/pass | in range | $0.0018/pass | ✅ Correct |
4/6. The two misses were both in the task-level predictions. F3 (below 20/30) did not trigger. F4 (non-reasoning model passes task_09) did.
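The 4/6 figure depends on one scoring convention: a result that exceeds its prediction is counted as correct, as the table does for P4. A minimal tally under that convention, with result labels of our own choosing:

```python
# Result labels mirror the Result column of the table; the label strings are ours.
predictions = {
    "P1": "correct",
    "P2": "f4_triggered",
    "P3": "wrong",
    "P4": "exceeded",
    "P5": "correct",
    "P6": "correct",
}

# Convention used in the table: an exceeded prediction is scored as correct.
COUNTED_AS_CORRECT = {"correct", "exceeded"}

hits = sorted(p for p, r in predictions.items() if r in COUNTED_AS_CORRECT)
misses = sorted(p for p, r in predictions.items() if r not in COUNTED_AS_CORRECT)

print(f"{len(hits)}/{len(predictions)} correct; misses: {misses}")
```

Under a stricter convention that scored P4 as a miss, the tally would be 3/6.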