The coder trap

Campaign: 2026-05-25-qwen3-coder-next-agentic-core-v1
Model: Alibaba Qwen3 Coder Next (qwen.qwen3-coder-next, AWS Bedrock, TEXT-only)
Harness: openclaw@2026.4.22
Runs: 30 (10 tasks × 3 runs each)
Campaign date: 2026-05-25


The Qwen3 family had been trending upward in our dataset. Qwen3 Next 80B A3B scored 21/30, Qwen3-Coder-30B-A3B scored 22/30, and Qwen3 32B dense scored 23/30. Each new entry in the family either held or improved on the last. When Alibaba’s premium coder specialisation entered the queue, the predictions clustered in the 24-27 range: higher price, higher capability.

The actual result was 20/30.

That is below every other Qwen3 model we have run. The coder fine-tuning did not improve agentic performance. It made it worse.


What the harness measures

[Observed — harness spec]

Ten tasks, three runs each, thirty total. agentic-core-v1 tests the everyday work of a software agent on a live codebase: fix a failing test, refactor duplicated code, trace a log to its root cause, follow a value through several modules, apply a minimal targeted fix, handle a specification that is deliberately underspecified, execute a four-step sequential plan, recover after a tool call returns an error, recognise when a computation cannot be completed with the data provided, and run a SQL investigation.

A pass requires a correct answer. Failure modes are classified: wrong_answer, gave_up_mid_plan, tool_call_hallucinated, tool_call_redundancy. The harness does not penalise models for taking extra steps, only for producing wrong output or abandoning the task.

Two tasks are structural tests rather than capability tests. task_09 gives the model a three-row CSV and asks for a 10-day moving average. The correct response is to flag the data as insufficient. task_06 presents a feature specification with a deliberate gap and expects the model to produce an assumptions note alongside working code. Executing without interrogating is a failure on task_06 even if the code runs.
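The task_09 trap is easy to state in code. The sketch below is an illustration only, not the harness's reference solution: the function name, the exception class, and the sample values are all assumptions. It shows the behaviour the task rewards, which is refusing to compute when the window is longer than the series.

```python
class InsufficientDataError(ValueError):
    """Raised when the window is longer than the series (hypothetical name)."""


def moving_average(values, window):
    """Return the simple moving averages of `values` over `window` points."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        # This is the branch task_09 is looking for: flag and stop.
        raise InsufficientDataError(
            f"need at least {window} data points, got {len(values)}"
        )
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Three rows (made-up values) cannot support a 10-day window.
try:
    moving_average([101.0, 102.5, 99.8], window=10)
except InsufficientDataError as exc:
    print(f"cannot compute: {exc}")
```

A model that pads, truncates the window, or averages the three rows it has will return a number, and on this task a number is the wrong answer.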


The results

[Observed — data pack per_task_results]

Task                                   Score   Failure mode
task_01 fix_failing_test               3/3
task_02 refactor_duplicated_code       3/3
task_03 investigate_log                3/3
task_04 trace_through_codebase         1/3     wrong_answer (R2, R3)
task_05 minimal_fix                    3/3
task_06 handle_ambiguous_requirement   0/3     wrong_answer (all 3)
task_07 multi_step_plan                3/3
task_08 recover_from_tool_error        2/3     wrong_answer (R3)
task_09 know_when_to_stop              0/3     wrong_answer (all 3)
task_10 sql_investigation              2/3     wrong_answer (R2)

Total: 20/30 (66.7%). Ten failures, all wrong_answer (verified: data pack task_outcomes.failure_mode). No timeouts, no tool crashes, no infrastructure errors.
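The headline figures follow directly from the per-task scores. A minimal tally, with the pass counts copied from the results table (the variable names are ours, not the data pack's):

```python
RUNS_PER_TASK = 3

# Passes out of three runs per task, copied from the results table.
passes = {
    "task_01": 3,
    "task_02": 3,
    "task_03": 3,
    "task_04": 1,
    "task_05": 3,
    "task_06": 0,
    "task_07": 3,
    "task_08": 2,
    "task_09": 0,
    "task_10": 2,
}

total_passes = sum(passes.values())
total_runs = RUNS_PER_TASK * len(passes)
failures = total_runs - total_passes

print(f"{total_passes}/{total_runs} ({total_passes / total_runs:.1%}), {failures} failures")
# prints: 20/30 (66.7%), 10 failures
```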

The model attempted every task and produced output every time. It did not freeze. Every one of its failures came from completing with an incorrect answer, not from failing to engage.


task_06: the coder trap

[Observed — data pack task_outcomes, run_metrics]

task_06 is the clearest diagnostic in this campaign. The task presents a feature request where the spec has a deliberate gap: one input case is not described. Success requires producing working code and a written assumptions note that covers the gap. A model that implements without addressing the ambiguity fails, even if the code is syntactically clean and runs.
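The pass condition is a conjunction, and that is what makes the task a trap. The sketch below is a hypothetical reconstruction of that rule, not the harness's actual grader: the `Submission` fields and the keyword check are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    code_runs: bool        # hypothetical: did the submitted code execute cleanly?
    assumptions_note: str  # hypothetical: free-text note delivered with the code


def passes_task_06(sub: Submission, gap_keyword: str) -> bool:
    """Pass only if the code works AND the note addresses the undescribed case.

    The keyword match is a stand-in for whatever the real grader does.
    """
    note_covers_gap = gap_keyword.lower() in sub.assumptions_note.lower()
    return sub.code_runs and note_covers_gap


# Clean, running code with no note still fails.
print(passes_task_06(Submission(code_runs=True, assumptions_note=""), "empty input"))
# prints: False
```

Under a rule of this shape, code quality alone cannot earn a pass.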

Qwen3 Coder Next scored 0/3 on task_06 (verified: data pack per_task_results). All three runs produced code without an adequate assumptions note. The model executed on the specification as written, treating the ambiguous portion as an edge case to handle rather than a gap to surface.

This is the pattern that coder training makes worse, not better. A model fine-tuned to generate code from specifications learns to act on specifications. Interrogating the specification before acting is a different reflex, and it does not get stronger from coding more. task_06 is specifically designed to catch models that have been trained too hard toward execution.

The earlier Qwen3-Coder-30B-A3B scored 22/30 and handled task_06 at 2/3 (verified: campaign 2026-05-19-qwen3-coder-30b-agentic-core-v1, data pack task_results). The smaller, cheaper coder model passed the ambiguity task twice. The premium coder variant, at roughly 70x the cost-per-pass ($0.00495 against $0.00007), passed it zero times. The capability did not scale with price.


Where does Coder Next fit in the Qwen3 family arc?

[Observed — cross-campaign data]

Model                 Score   Input price (per 1M)   Output price (per 1M)   Cost/pass
Qwen3 Next 80B A3B    21/30   $0.14                  $1.20                   $0.00122
Qwen3-Coder-30B-A3B   22/30   $0.15                  $0.62                   $0.00007
Qwen3 32B dense       23/30   $0.15                  $0.60                   $0.00096
Qwen3 Coder Next      20/30   $0.50                  $1.20                   $0.00495

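Every other Qwen3 entry improved on or matched the one before it. Qwen3 Coder Next is the first to go backwards, and it does so at the highest cost-per-pass in the family: roughly four times that of the next most expensive entry (Qwen3 Next 80B A3B at $0.00122) and about five times that of Qwen3 32B dense.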

The claim that coder specialisation should produce better agentic performance does not survive this dataset. It may improve generative coding tasks (completion, documentation, code-from-spec). This harness tests a different question: can the model operate as an agent on a codebase over multiple steps, handle unexpected results, and know when the task specification is the problem rather than the implementation? Those are not the same skills as being good at code generation, and this result suggests that strength in one does not guarantee strength in the other.

[Speculation] Whether Qwen3 Coder Next would outperform Qwen3 32B on a harness that tested pure code generation (speed, correctness on well-specified tasks, completion quality) is not known from this campaign. The regression we see is specific to agentic-core-v1 and its mix of structural traps and ambiguous requirements. A coder specialisation may still be the right choice for a narrow generative coding pipeline. On this suite it is the wrong choice.


task_09: no surprises

[Observed — data pack per_task_results, cross-campaign task_09_results]

task_09 scored 0/3, which is what we expected. Almost every non-reasoning model in our dataset fails this task on every run. The correct response is to flag that three data points are insufficient for a 10-day calculation and stop. Qwen3 Coder Next computed a numeric result all three times (verified: data pack per_task_results).

The family context is the only frame where this result matters. Qwen3-Coder-30B-A3B scored 1/3 on task_09, an unusual result for a non-reasoning model in our dataset. The larger, more expensive Qwen3 Coder Next matched the family baseline at 0/3. The cheaper sibling was more cautious here. That may be noise across three runs rather than a systematic capability difference.


task_04 and task_10: inconsistency under repetition

[Observed — data pack per_task_results, run_metrics]

task_04 scored 1/3. The task requires tracing a value from entry() to report() across multiple modules and writing the full call chain. Run 1 succeeded; runs 2 and 3 produced wrong answers. task_10 scored 2/3 with run 2 failing. Both tasks show correct performance on some runs and failures on others, using identical inputs.

This is the variance pattern the harness is designed to catch. A model running at a capability boundary will produce inconsistent results under repeated sampling. The failures on task_04 and task_10 are not attributable to the task being too hard; the model solved each of them at least once. The inconsistency itself is the finding.
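One way to separate the inconsistent tasks from the stable ones is to bucket each task by its pass count. This is a sketch of that reading, not a harness feature: the bucket labels are ours, and the counts are copied from the results table.

```python
RUNS_PER_TASK = 3

# Passes out of three runs per task, copied from the results table.
passes = {
    "task_01": 3, "task_02": 3, "task_03": 3, "task_04": 1, "task_05": 3,
    "task_06": 0, "task_07": 3, "task_08": 2, "task_09": 0, "task_10": 2,
}


def classify(n_passes, runs=RUNS_PER_TASK):
    """Label a task by how its repeated runs behaved (labels are illustrative)."""
    if n_passes == runs:
        return "stable pass"
    if n_passes == 0:
        return "stable fail"
    return "inconsistent"


inconsistent = [task for task, n in passes.items() if classify(n) == "inconsistent"]
print(inconsistent)
# prints: ['task_04', 'task_08', 'task_10']
```

The same bucketing also picks up task_08 (2/3), which sits in the mixed category alongside task_04 and task_10. task_06 and task_09 land in "stable fail", a different kind of finding: the model did not solve them even once.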


We were wrong about the prediction

[Observed — data pack predictions_scoring]

Prediction         Expected    Actual   Result
P1 Overall score   24-27/30    20/30    Wrong
P2 task_09         0/3         0/3      Correct
P3 task_03         ≥ 2/3       3/3      Correct
P4 task_01         3/3         3/3      Correct

P1 was wrong by four passes at minimum. The prediction assumed the premium price-tier reflected stronger agentic capability. It does not, at least not on this harness.
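The size of the miss is the distance from the actual score to the nearest edge of the predicted range. A small sketch of that scoring rule (the function is ours, written for illustration; the data pack's own scoring logic is not shown in this campaign):

```python
def score_range_prediction(low, high, actual):
    """Return (verdict, miss) for a predicted inclusive range [low, high]."""
    if low <= actual <= high:
        return ("Correct", 0)
    # Distance to the nearest edge of the predicted range.
    miss = low - actual if actual < low else actual - high
    return ("Wrong", miss)


print(score_range_prediction(24, 27, 20))
# prints: ('Wrong', 4)
```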

The task-level predictions held because they tested behaviours that are relatively stable across model quality: log investigation is a tractable retrieval task (task_03), test fixing is template-like (task_01), and non-reasoning models almost always fail task_09. What we got wrong was the aggregate. We assumed that specialised coding training would add capability rather than trade it away, and it did not.


What does the cost premium actually buy?

[Observed — data pack cost_breakdown]

$0.0990 total campaign cost. $0.00495 per passing run (verified: data pack cost_breakdown.per_pass_cost).

Model                 Score   Cost/pass
Qwen3-Coder-30B-A3B   22/30   $0.00007
Qwen3 32B dense       23/30   $0.00096
Qwen3 Next 80B A3B    21/30   $0.00122
Qwen3 Coder Next      20/30   $0.00495

Qwen3-Coder-30B-A3B scores two points higher and costs 70x less per correct pass. Qwen3 32B dense scores three points higher and costs 5x less. For agentic workloads that resemble this harness, the cost-per-pass comparison does not favour the Coder Next tier.
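The arithmetic behind those multiples is simple to check. The sketch below uses only the figures reported in this section; the variable names are ours.

```python
# Campaign total and pass count for Qwen3 Coder Next, as reported above.
total_cost = 0.0990
passing_runs = 20
coder_next_per_pass = total_cost / passing_runs  # about $0.00495

# Cost per pass for the other family members, copied from the table.
others = {
    "Qwen3-Coder-30B-A3B": 0.00007,
    "Qwen3 32B dense": 0.00096,
    "Qwen3 Next 80B A3B": 0.00122,
}

for model, cost in others.items():
    ratio = coder_next_per_pass / cost
    print(f"Coder Next costs {ratio:.1f}x more per pass than {model}")
# prints ratios of 70.7x, 5.2x and 4.1x respectively
```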

The pricing structure assumes the coder specialisation delivers capability above the base models. On this suite it delivers less capability at a higher price.


What we do not know yet

[Speculation]

The task_06 failure is three-out-of-three, but the mechanism is described at the output level, not the inference level. We know the model produced code without an adequate assumptions note each time. Whether the fine-tuning suppressed the ambiguity-handling behaviour entirely or whether the model considered the ambiguity and proceeded anyway is not visible from the data pack. Transcript-level analysis would be needed to distinguish between those two scenarios, and that analysis was not included in this campaign.

The wider family claim, that coder fine-tuning degrades agentic performance on tasks requiring ambiguity handling and self-stopping, is supported by the comparison between Qwen3 Coder Next (0/3 on task_06) and Qwen3-Coder-30B-A3B (2/3 on task_06), while Qwen3 32B dense scored 3/3 on task_06 with no coder fine-tuning. One family comparison is not a controlled study of fine-tuning effects. It is suggestive, not definitive.
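To put a rough number on "suggestive, not definitive", one option is a two-sided Fisher exact test on the task_06 pass counts. This is our own back-of-envelope check, not part of the data pack, and it treats the three runs per model as independent samples, which is itself an assumption.

```python
from math import comb


def fisher_exact_two_sided(a_pass, a_runs, b_pass, b_runs):
    """Two-sided Fisher exact p-value for two pass/fail samples."""
    total_pass = a_pass + b_pass
    total_runs = a_runs + b_runs

    def prob(k):
        # Hypergeometric probability that model A gets exactly k of the passes.
        return (
            comb(total_pass, k)
            * comb(total_runs - total_pass, a_runs - k)
            / comb(total_runs, a_runs)
        )

    lo = max(0, a_runs - (total_runs - total_pass))
    hi = min(a_runs, total_pass)
    observed = prob(a_pass)
    # Sum every outcome at least as unlikely as the one observed.
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= observed + 1e-12)


# Qwen3 Coder Next (0/3) against Qwen3-Coder-30B-A3B (2/3) on task_06.
p = fisher_exact_two_sided(0, 3, 2, 3)
print(f"p = {p:.2f}")
# prints: p = 0.40
```

A p-value of 0.40 means a 0/3 against 2/3 split would turn up often even if the two models were equally likely to pass. With three runs per model, no single task comparison can carry the fine-tuning claim on its own.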


Current leaderboard position

[Observed — cross-campaign data]

20/30 places Qwen3 Coder Next one point below Claude Opus 4.7 (21/30) and Qwen3 Next 80B A3B (21/30), two below Qwen3-Coder-30B-A3B (22/30), and three below its own smaller dense sibling, Qwen3 32B dense (23/30). For a model at the top of the Qwen3 pricing tier, that position is not what the name implies.

The models above it in the table include several that cost substantially less to operate on a per-pass basis. Builders choosing between Qwen3 options for agentic workloads are better served by Qwen3 32B dense on this evidence: higher score, lower cost, and task_06 handled at 3/3.

ClawWorks Weekly

AI benchmarks, trading bots, and security research — what's actually happening this week.