The architecture question, revisited

May 25, 2026 · campaign-reports

Campaign: 2026-05-25-nvidia-nemotron-nano-3-30b-agentic-core-v1
Prior campaign: 2026-05-21-nvidia-nemotron-nano-3-30b-agentic-core-v1
Model: NVIDIA Nemotron Nano 3 30B (nvidia.nemotron-nano-3-30b, AWS Bedrock us-east-1)
Harness: agentic-core-v1 (10 tasks × 3 runs = 30 total)
Campaign date: 2026-05-25

The first run of the Nano scored 10/30. At the time, that placed it below the Super 120B’s 12/30 — the dense 30B model came in two points behind its larger MoE sibling — and the architecture hypothesis collapsed. Dense parameters, the argument went, aren’t the bottleneck. NVIDIA’s post-training is.

The second run changes the numbers. 15/30 (50.0%), $0.0084 total, $0.00056/pass. Five passes above the first run’s result. The Nano now sits clearly above the Super’s confirmed 11/30 from its own replication. The architecture conclusion is updated: dense 30B does beat MoE 120B in this family. The gap is +4 points. Both are still in the bottom third of the dataset.

What stayed the same matters as much as what moved. task_07 (multi-step planning) is 3/3 for the second consecutive run. task_09 (detect an impossible computation) is 0/3 again. task_10 (SQL investigation) is 0/3 again. wrong_answer accounts for 14 of 15 failures — the same confident-and-incorrect pattern that defined the first run. The Nano scores higher, but the failure mode is structurally identical.

What agentic-core-v1 tests

[Observed — harness spec]

The benchmark runs 10 tasks three times each, for 30 total runs. The tasks cover what a real software agent needs to do on a codebase: fix a failing test, refactor duplicated code, read a log and find the anomaly, trace execution paths through a multi-file codebase, make a minimal targeted fix, handle an ambiguous specification, build a multi-step plan and execute it, recover from a tool error, detect an impossible computation, and investigate a SQL log.

A run passes if the output is correct and complete. Failures are classified by what went wrong: wrong_answer means the model attempted the task and produced incorrect output — it understood what was being asked, called the right tools, and got it wrong. gave_up_mid_plan means the model abandoned mid-execution without a final answer. tool_call_hallucinated means the model tried to call a tool that doesn’t exist or passed fabricated arguments. tool_call_redundancy means consecutive identical tool calls — the model repeated the same action without making progress.
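The failure classes above can be read as a small decision procedure. What follows is a minimal sketch, not the harness's actual implementation: the `ToolCall` and `RunTrace` types, the `KNOWN_TOOLS` set, and the order in which the checks are applied are all illustrative assumptions. Detecting fabricated arguments would need a per-tool schema check, which is left out here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class RunTrace:
    calls: list                   # ordered list of ToolCall
    final_answer: Optional[str]   # None if the run ended without an answer
    is_correct: bool              # verdict from a ground-truth check


# Hypothetical tool registry; only fs_write is named in this report.
KNOWN_TOOLS = {"fs_read", "fs_write"}


def classify(trace: RunTrace, known_tools=KNOWN_TOOLS) -> str:
    """Map one run to 'pass' or a failure label (assumed precedence)."""
    if trace.final_answer is not None and trace.is_correct:
        return "pass"
    # A call to a tool that does not exist.
    if any(c.name not in known_tools for c in trace.calls):
        return "tool_call_hallucinated"
    # Two consecutive identical calls (same name, same arguments).
    if any(a == b for a, b in zip(trace.calls, trace.calls[1:])):
        return "tool_call_redundancy"
    # No final answer: the run stopped partway through.
    if trace.final_answer is None:
        return "gave_up_mid_plan"
    # Valid calls, a final answer, and the answer is wrong.
    return "wrong_answer"
```

Under this ordering, wrong_answer is the residual class: it is what remains once every mechanical failure has been ruled out.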

Two tasks have structural traps. task_09 provides fewer data points than the requested moving average window and asks the model to compute the result anyway. The correct response is to refuse, explaining why the computation is impossible. task_07 requires four sequential file writes with specified content in a specified order — it tests whether the model can hold a plan without deviating.

What changed between run 1 and run 2

[Observed — verified: pass_rate_by_task.csv, cost_breakdown.csv]

Task	2026-05-25	2026-05-21	Change
task_01 fix failing test	3/3	1/3	up
task_02 refactor duplicated code	1/3	0/3	up
task_03 investigate log	2/3	1/3	up
task_04 trace through codebase	0/3	1/3	down
task_05 minimal fix	1/3	0/3	up
task_06 handle ambiguous requirement	2/3	1/3	up
task_07 multi-step plan	3/3	3/3	flat
task_08 recover from tool error	3/3	3/3	flat
task_09 know when to stop	0/3	0/3	flat
task_10 SQL investigation	0/3	0/3	flat
Total	15/30	10/30	+5

Nine tasks improved or held. One declined: task_04 dropped from 1/3 to 0/3. (task_10 did not decline; it was already 0/3 and stayed there.) The five net new passes are the sum of movement on six tasks, five up and one down (+2, +1, +1, +1, +1, -1), so no single task drove the improvement. That spread makes the +5 look more like natural variance resolving upward than a coherent capability gain.
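The per-task movement can be tallied directly from the table. A short sketch, with the pass counts copied by hand from the table above (the variable names are illustrative):

```python
# (run 2 passes, run 1 passes), each out of 3, copied from the table above.
results = {
    "task_01": (3, 1),
    "task_02": (1, 0),
    "task_03": (2, 1),
    "task_04": (0, 1),
    "task_05": (1, 0),
    "task_06": (2, 1),
    "task_07": (3, 3),
    "task_08": (3, 3),
    "task_09": (0, 0),
    "task_10": (0, 0),
}

deltas = {task: new - old for task, (new, old) in results.items()}

up = [t for t, d in deltas.items() if d > 0]
flat = [t for t, d in deltas.items() if d == 0]
down = [t for t, d in deltas.items() if d < 0]

total_run2 = sum(new for new, _ in results.values())
total_run1 = sum(old for _, old in results.values())

print(len(up), len(flat), len(down))   # 5 4 1
print(total_run2, total_run1)          # 15 10
print(sum(deltas.values()))            # 5
```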

task_07 and task_08 are the only tasks that hit 3/3 in both runs. Both are structured sequential execution: exact file writes for task_07, a two-step error recovery with an explicit path for task_08. The Nano is consistent where the task is unambiguous. Everywhere else, the result either varies run to run or stays at 0/3 (task_09, task_10).

Does the architecture finding hold?

[Observed — intra-family data, verified: pass_rate_by_task.csv]

The first run’s finding was blunt: Nano 10/30 finished behind Super 12/30, and the architecture hypothesis collapsed. This run reverses the order. Nano 15/30 beats Super’s confirmed 11/30 by four points.

That makes the architecture conclusion more nuanced than it looked after the first run. Dense 30B does have an edge over MoE 120B in this family: across two Nano runs and two Super runs, the Nano is either close to or above the Super. The edge is not large: +4 points in the second run, -2 in the first. Averaged across two campaigns each (Nano 12.5/30, Super 11.5/30), they’re separated by about one point.
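The averaging works out as follows. This sketch pairs each Nano campaign with the Super result it was compared against in this report, which is an assumption about how the runs line up:

```python
# Scores out of 30, in order: first campaign, then replication.
nano = [10, 15]
super_120b = [12, 11]

# Per-campaign gap, Nano minus Super.
gaps = [n - s for n, s in zip(nano, super_120b)]

nano_mean = sum(nano) / len(nano)
super_mean = sum(super_120b) / len(super_120b)

print(gaps)                    # [-2, 4]
print(nano_mean, super_mean)   # 12.5 11.5
print(nano_mean - super_mean)  # 1.0
```

With only two campaigns per model, the one-point mean gap is smaller than the swing between campaigns of either model.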

What hasn’t changed: both models are stuck at the bottom of the dataset. Nemotron Nano 15/30 is below Qwen3 Coder Next (20/30), below Llama 3.3 70B (20/30), and well below the cluster at 27/30 (Mistral Large 3, GLM-5, MiniMax M2.5, GPT-5.5). Architecture differences within the NVIDIA family account for a few points of variance. The gap to the top of the leaderboard is something else entirely.

[Speculation] The leading explanation is still post-training. NVIDIA’s inference hardware is best-in-class. Their models’ agentic task completion is not. Two models, different architectures, different parameter counts, both failing the same tasks with the same failure mode. That pattern points at training signal, not compute. Whether NVIDIA has invested significantly in agentic post-training for these models is not publicly documented.

Why does wrong_answer dominate?

[Observed — verified: failure_mode_histogram.csv]

14 of 15 failures in this run were wrong_answer. One tool_call_hallucinated. No infrastructure errors, no tool call formatting failures, no timeouts.

This failure signature has a specific meaning. wrong_answer is not confusion about the task or a formatting problem with the tool interface. The model understood what was being asked, executed the right sequence of tool calls, produced structured output, and the output was incorrect. It completed the task. It got it wrong.

For builders, this is the harder failure mode to handle. A model that errors loudly is easy to catch — the pipeline fails, a retry fires, the issue is visible. A model that produces confident, structured, wrong answers is harder to detect without a ground-truth checker. If you’re running the Nano on tasks that require reasoning over code, logs, or schemas, the wrong answers will look like right answers until something downstream breaks.
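One way to act on this is to gate every agent output behind an independent check before anything downstream consumes it. A minimal sketch, assuming two hypothetical callables supplied by the builder: `run_agent`, which produces an output for a task, and `check`, which returns True only when the output is verifiably correct (for a code fix, that could mean running the project's own test suite):

```python
from typing import Callable, Optional


def run_with_check(
    run_agent: Callable[[str], str],
    check: Callable[[str], bool],
    task: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first output that passes `check`, or None if none does."""
    for _ in range(max_attempts):
        output = run_agent(task)
        if check(output):
            return output
    # Every attempt produced an answer, but none could be verified.
    return None
```

The gate converts a silent wrong answer into a visible failure. It does not make the model more capable: on a task where every attempt is wrong, the function returns None after `max_attempts` paid runs.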

The first run had the same pattern: 19 of 20 failures were wrong_answer. Two consecutive campaigns, same failure mode distribution. This is structural, not noise.

What the task_07 consistency means

[Observed — verified: pass_rate_by_task.csv, tool_calls_by_task.csv]

task_07 (create four files with specified content using only fs_write) scored 3/3 in both runs. Average tool calls per run this campaign: 4.0. Average latency: 2.88 seconds. All three trials completed the full four-step sequence with correct content and correct paths.

Two consecutive 3/3 runs on the same task in the same model is a real signal. The Nano can follow explicit structured plans without deviation when the expected output is fully prescribed. It doesn’t need to reason or infer — the task is a pure execution sequence, and the Nano executes it cleanly.

task_08 (recover from a tool error, two-step fix) also scored 3/3 in both runs. Same pattern: a structured procedure with a defined endpoint.

What the Nano consistently fails are tasks that require inferring the right answer from context — reading a codebase and tracing a call chain (task_04 0/3 this run, 1/3 prior), interpreting a schema and finding the failing query (task_10 0/3 both runs), detecting that a computation is impossible from the shape of the data (task_09 0/3 both runs). Different tasks, same gap: the Nano handles prescription well and inference poorly.

The tasks that didn’t move

[Observed — verified: pass_rate_by_task.csv, cross_task_consistency.md]

task_09 (know when to stop) is 0/3 in both Nano runs, and 0/3 in the first Super run (the Super replication got 2/3, but that appears to be variance). task_10 (SQL investigation) is 0/3 in both Nano runs — the Nano has now failed every task_10 run across 6 attempts. The Super scored 3/3 on task_10 in both of its runs, which is the sharpest intra-family split in the dataset.

task_09 failures are wrong_answer — the model attempts to compute the 10-day moving average anyway rather than refusing. It either doesn’t notice the insufficient data window, or it notices and computes a result anyway. Either way, it produces an answer when the correct answer is “this cannot be done.”
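The behavior task_09 is looking for amounts to a guard before the computation. A minimal sketch of a moving average that refuses when the window is larger than the data; the function and exception names are illustrative, not part of the harness:

```python
class InsufficientDataError(ValueError):
    """Raised when there are fewer data points than the window size."""


def moving_average(values, window):
    """Simple moving average; refuses rather than guessing on short input."""
    if window <= 0:
        raise ValueError("window must be a positive integer")
    if len(values) < window:
        raise InsufficientDataError(
            f"need at least {window} data points, got {len(values)}"
        )
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


print(moving_average([1, 2, 3, 4], window=2))  # [1.5, 2.5, 3.5]

try:
    moving_average([1, 2, 3, 4], window=10)
except InsufficientDataError as exc:
    print("refused:", exc)  # refused: need at least 10 data points, got 4
```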

task_10’s consistent 0/3 is harder to explain given the Super’s consistent 3/3. Both models are in the same family, presumably trained on overlapping data. The Nano failing SQL investigation every time while the Super passes it every time doesn’t have a clean explanation from the available data.

[Speculation] The task_10 split may reflect a difference in fine-tuning emphasis between the two models. The Super was positioned as an enterprise reasoning model; the Nano as an efficient on-device option. If SQL reasoning received more weight in the Super’s training, the task_10 gap could follow from that. This is a hypothesis. The current data establishes the gap exists; it doesn’t explain why.

How does cost compare?

[Observed — verified: cost_breakdown.csv]

$0.0084 total for 30 runs. $0.00056 per passing run.

Model	Score	$/pass
Mistral Large 3	27/30	$0.0022
MiniMax M2.5	27/30	$0.0024
GPT-OSS 20B	25/30	$0.00048
Qwen3 Coder Next	20/30	$0.00495
Nemotron Super 120B	11/30	$0.00154
Nemotron Nano 30B	15/30	$0.00056

The Nano is fractionally more expensive per pass than GPT-OSS 20B ($0.00056 vs. $0.00048) and scores 10 points lower. For cost-sensitive workloads where only the tasks the Nano can actually complete are on the table (structured sequential execution, log investigation sometimes, targeted fixes sometimes), $0.00056/pass is the dataset’s second-lowest figure.

The problem is that the Nano’s 50% pass rate means one run fails for every one that passes. Real-world agentic workloads don’t get to run only the tasks a model is good at. A campaign budget built on the Nano’s per-pass figure understates actual cost by roughly 2x on tasks where it fails.
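The relationship between per-run cost, per-pass cost, and pass rate can be made explicit. A sketch using the campaign totals from this section; the per-task figures assume every run costs the same and that retries are independent, and neither assumption is verified here:

```python
import math

total_cost = 0.0084   # USD for the whole campaign
runs = 30
passes = 15

cost_per_run = total_cost / runs      # about $0.00028
cost_per_pass = total_cost / passes   # about $0.00056
fails_per_pass = (runs - passes) / passes


def expected_cost_per_success(run_cost: float, pass_rate: float) -> float:
    """Expected spend to get one pass, retrying until success."""
    if pass_rate <= 0:
        return math.inf   # no number of retries buys a pass
    return run_cost / pass_rate


# Pass rates taken from the per-task table for this campaign.
for task, rate in [("task_07", 3 / 3), ("task_03", 2 / 3), ("task_10", 0 / 3)]:
    print(task, expected_cost_per_success(cost_per_run, rate))
```

The aggregate per-pass figure hides the spread: under these assumptions a task the Nano always passes costs half the headline number, while a task it never passes has no finite cost per pass at all.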

What we don’t know yet

[Speculation] The +5 improvement from run 1 to run 2 spans six tasks and has no clear single driver. Whether this reflects run-to-run variance (the Nano’s underlying distribution is around 12 to 15 out of 30 with a wide spread), a model update between campaigns (NVIDIA sometimes updates hosted models without version notices), or something about the campaign environment, can’t be determined from two runs. A third campaign would help narrow the variance estimate.
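A rough way to see how little two campaigns constrain the pass rate is to put an interval around each score. The sketch below uses the Wilson score interval and treats the 30 runs as independent trials with one shared pass probability. That is a loud assumption: the per-task results show some tasks always pass and others never do, so this is an illustration of scale, not a proper variance model.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


run1 = wilson_interval(10, 30)   # first campaign
run2 = wilson_interval(15, 30)   # second campaign

# Two intervals overlap if each one starts before the other ends.
overlap = run1[0] <= run2[1] and run2[0] <= run1[1]

print("run 1:", run1)
print("run 2:", run2)
print("intervals overlap:", overlap)
```

Under this simplification the two intervals overlap, which is consistent with the variance reading but does not rule out the other explanations.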

The task_10 split with the Super remains unexplained. 0/6 for the Nano, 6/6 for the Super, same family. The mechanism isn’t in the current evidence index.

And the task_04 decline — from 1/3 in run 1 to 0/3 in run 2 — adds a data point to the codebase tracing picture without resolving it. The Nano is now 1/6 on task_04 across two campaigns. Whether it can ever reliably trace multi-file call chains on this harness is not established.