Not the architecture

May 21, 2026 · campaign-reports

Campaign: 2026-05-21-nvidia-nemotron-nano-3-30b-agentic-core-v1
Model: NVIDIA Nemotron Nano 3 30B (nvidia.nemotron-nano-3-30b, AWS Bedrock us-east-1)
Harness: openclaw@2026.4.22
Runs: 30 (10 tasks × 3 runs each)
Campaign date: 2026-05-21

Two weeks ago, Nemotron Super 3 120B scored 12/30 on agentic-core-v1 and landed last in the dataset. It was the chip company’s first model in our harness and the result raised one question that the Super’s data couldn’t answer: was the failure a compute problem or a training problem?

The Super is mixture-of-experts. Despite its 120B total parameters, it activates around 12B per forward pass. Every other model in the top half of the leaderboard runs denser. If the MoE activation gap explains the 12/30, then the dense Nano (30B full parameters, approximately 2.5× the Super’s active compute) should score meaningfully higher. If NVIDIA’s post-training for multi-step tool orchestration is the bottleneck, the Nano should match or underperform.

The Nano scored 10/30. F4 triggered.

The architecture hypothesis is rejected. The dense model scored lower than the sparse model in the same family. NVIDIA’s agentic post-training is the family-level failure, and more active parameters did not help.

What agentic-core-v1 tests

[Observed — harness spec]

Ten tasks, three runs each, 30 total. The benchmark covers the work a software agent does on a real codebase: fix a test failure, refactor duplicated code, investigate a log file, trace execution paths, make a targeted fix, handle an ambiguous specification, write a multi-step sequential plan, recover from a tool error, detect an impossible computation, and run a SQL investigation.

A pass requires completing the task correctly and completely. Failure modes are classified: wrong_answer (model attempted the task and produced incorrect output), gave_up_mid_plan (abandoned mid-execution), tool_call_hallucinated (called a tool that doesn’t exist or with fabricated arguments), tool_call_redundancy (loop without progress).

Two tasks are structural traps. task_09 gives the model a CSV with fewer than 10 rows and asks for a 10-day moving average. The correct response is refusal: the data is insufficient. task_07 requires four sequential file writes with specified content in order; it tests whether the model can hold a plan without deviating.
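The refusal that task_09 is looking for can be sketched as a guard in front of the computation. This is a minimal illustration, not the harness's grader: the function name, the `InsufficientDataError` class, and the default window of 10 are assumptions made for the example.

```python
from typing import Sequence


class InsufficientDataError(ValueError):
    """Raised when the series is shorter than the requested window (hypothetical name)."""


def moving_average(values: Sequence[float], window: int = 10) -> list[float]:
    """Return the simple moving average, refusing when there is too little data."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        # The correct behavior on task_09: stop and report, do not invent numbers.
        raise InsufficientDataError(
            f"need at least {window} rows for a {window}-day moving average, "
            f"got {len(values)}"
        )
    running = sum(values[:window])
    averages = [running / window]
    # Slide the window one row at a time, updating the running sum.
    for i in range(window, len(values)):
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages
```

A model that pads the series, shrinks the window silently, or returns partial averages has produced a number where the only correct output is a refusal.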

The architecture test result

[Observed — data pack per_task_results]

Task	Score	Delta vs Super	Note
task_01 fix_failing_test	1/3	-1	Super scored 2/3
task_02 refactor_duplicated_code	0/3	-1	Super scored 1/3
task_03 investigate_log	1/3	+1	Super scored 0/3
task_04 trace_through_codebase	1/3	+1	Super scored 0/3
task_05 minimal_fix	0/3	-1	Super scored 1/3; `tool_call_redundancy` on run1
task_06 handle_ambiguous_req	1/3	-1	Super scored 2/3; `tool_call_hallucinated` flagged
task_07 multi_step_plan	3/3	+3	Only full-pass task; Super scored 0/3
task_08 recover_from_tool_error	3/3	+1	Super scored 2/3
task_09 know_when_to_stop	0/3	-1	Super scored 1/3; Nano fails entirely
task_10 sql_investigation	0/3	-3	Super scored 3/3; largest regression

Total: 10/30 (33.3%). Failure modes: wrong_answer ×18, gave_up_mid_plan ×1, tool_call_hallucinated ×1 (verified: data pack task_outcomes.failure_mode).

The delta column carries the diagnosis. The Nano scores higher than the Super on tasks 03, 04, 07, and 08 (four tasks where the format is more structured or the context demand is lower). It scores lower on tasks 01, 02, 05, 06, 09, and 10 (six tasks requiring reasoning, code transformation, or schema analysis). The dense model gained on explicit sequential execution (task_07, task_08) and lost on SQL and minimal targeted edits (task_10, task_05). Net effect: -2.
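The arithmetic behind the delta column can be reproduced from the table. In this sketch the Nano scores are taken directly from the table; the Super scores for task_05 and task_06 are not stated in the notes and are recovered as Nano score minus delta.

```python
# (Nano passes, Super passes) per task, out of 3 runs each.
scores = {
    "task_01": (1, 2),
    "task_02": (0, 1),
    "task_03": (1, 0),
    "task_04": (1, 0),
    "task_05": (0, 1),  # Super score derived from the -1 delta
    "task_06": (1, 2),  # Super score derived from the -1 delta
    "task_07": (3, 0),
    "task_08": (3, 2),
    "task_09": (0, 1),
    "task_10": (0, 3),
}

deltas = {task: nano - sup for task, (nano, sup) in scores.items()}
nano_total = sum(nano for nano, _ in scores.values())
super_total = sum(sup for _, sup in scores.values())
gains = sorted(task for task, d in deltas.items() if d > 0)
losses = sorted(task for task, d in deltas.items() if d < 0)
net = sum(deltas.values())

print(f"Nano {nano_total}/30, Super {super_total}/30, net {net:+d}")
print(f"gains on {len(gains)} tasks: {gains}")
print(f"losses on {len(losses)} tasks: {losses}")
```

The totals come out to 10 and 12, with four gains and six losses. The +3 on task_07 and the -3 on task_10 cancel exactly, so the net -2 comes from the smaller deltas.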

That is not a compute story. It is a training story.

The wrong_answer pattern

[Observed — data pack task_outcomes, evidence_patterns]

18 of 20 failures were wrong_answer. One gave_up_mid_plan. One tool_call_hallucinated. No infrastructure errors, no tool call formatting failures, no silent timeouts.
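As a quick consistency check on those counts, the tally below uses only the figures reported above; the variable names are illustrative.

```python
from collections import Counter

TOTAL_RUNS = 30
PASSES = 10

# Failure-mode counts as reported in the data pack summary.
failure_modes = Counter({
    "wrong_answer": 18,
    "gave_up_mid_plan": 1,
    "tool_call_hallucinated": 1,
})

failures = sum(failure_modes.values())
# Every run is either a pass or a classified failure.
assert failures + PASSES == TOTAL_RUNS

wrong_answer_share = failure_modes["wrong_answer"] / failures
pass_rate = PASSES / TOTAL_RUNS

print(f"failures: {failures}")
print(f"wrong_answer share of failures: {wrong_answer_share:.0%}")  # 90%
print(f"pass rate: {pass_rate:.1%}")  # 33.3%
```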

This is the most important diagnostic signal in the campaign. wrong_answer means the model attempted the task, called tools, produced output, and the output was incorrect. The model was not confused about the tool interface. It was not struggling with API format. It produced confident, structured, wrong answers across 8 distinct task IDs.

wrong_answer as the dominant failure mode has a specific implication for builders: this model cannot be trusted to self-validate its output on tasks requiring reasoning, code transformation, or log analysis. A model that breaks loudly is easier to handle in production than one that completes silently wrong.

The cross-task consistency confirms this is a model-level pattern. wrong_answer appears on task_02 (refactoring), task_05 (minimal fix), and task_10 (SQL), which require different skills. A single failure mode appearing across all three points at the training signal, not a capability gap in one narrow area.

What the task_07 result means

[Observed — data pack per_task_results, run_metrics]

task_07 (multi-step plan: create steps/step1.txt through steps/step4.txt with specified content using only fs_write) scored 3/3. This was the biggest prediction miss; Rigg expected 1–2/3 based on the Super’s complete failure at 0/3.

Average tool calls per run: 4.0. Average run time: 2.88s (verified: data pack run_metrics.tool_calls_per_run, run_metrics.latency_by_run). All three runs executed the full 4-step sequence with correct content and correct paths.
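What "the full 4-step sequence with correct paths" means can be expressed as a small check over a run's tool calls. This is a hypothetical sketch: the `ToolCall` record and the `follows_plan` function are invented for illustration and are not the harness's grader, and the content check is omitted because the report does not reproduce the specified file contents.

```python
from typing import NamedTuple


class ToolCall(NamedTuple):
    """Hypothetical record of one tool invocation in a run."""
    tool: str
    path: str


# The four files named in the task description, in the required order.
EXPECTED_PATHS = [f"steps/step{i}.txt" for i in range(1, 5)]


def follows_plan(calls: list[ToolCall]) -> bool:
    """True when the run is exactly four fs_write calls to step1..step4, in order."""
    if len(calls) != len(EXPECTED_PATHS):
        return False
    return all(
        call.tool == "fs_write" and call.path == expected
        for call, expected in zip(calls, EXPECTED_PATHS)
    )


good_run = [ToolCall("fs_write", p) for p in EXPECTED_PATHS]
print(follows_plan(good_run))
```

An average of 4.0 tool calls per run is consistent with every run matching this shape: no retries, no extra reads, no reordering.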

task_08 (recover from tool error) also scored 3/3. Same pattern: two-step instruction with an explicit error recovery path, unambiguous sequencing.

The Nano can follow structured sequential plans when the format is explicit and the output is prescribed. What it cannot do is apply reasoning over code, schema, or log data to produce correct outputs. task_07 and task_08 are structured execution. Tasks 02, 05, 10 require interpretation. The Nano succeeds at the former and fails at the latter regardless of parameter count.

[Speculation] Whether the task_07 improvement over the Super reflects different fine-tuning emphasis on sequential structured plans is not documented by NVIDIA. The pattern fits that explanation. It is a hypothesis. The Super’s 0/3 on task_07 and the Nano’s 3/3 on the same task, in the same family, is a real signal that the two models were trained differently on structured plan execution.

The task_10 regression

[Observed — data pack per_task_results, evidence_index]

Nemotron Super 3 120B scored 3/3 on task_10 (SQL investigation). The Nano scored 0/3.

This is the largest single-task regression in the dataset for any intra-family comparison; the only delta of equal size, task_07’s +3, runs the other way. The Super was the only model in the dataset to score 3/3 on task_10 at launch. It was a genuine standout. The Nano’s 0/3 on the same task is not just a regression; it is the Nano’s biggest gap relative to the family.

The brief does not include a detailed evidence trace for task_10 runs. What can be said from the task_outcomes table: all three failures were wrong_answer. The model attempted the SQL investigation, produced output, and produced incorrect findings.

[Unobserved] No transcript excerpts were captured for task_10 runs in the evidence index. Whether the Nano’s SQL failure is a context-length problem (the task provides a schema and sample data), a reasoning failure, or a training artifact is not established from the current data. We would need per-run transcript analysis to distinguish.

What we were wrong about

[Observed — data pack predictions_scoring]

Prediction	Predicted	Actual	Hit?
P1 overall score	15/30 (range 12–19)	10/30	No (below lower bound)
P2 task_09 score	0/3	0/3	Yes
P3 task_07 score	1–2/3	3/3	No (exceeded upper bound)
P4 task_03 score	1–2/3	1/3	Yes
P5 no infrastructure errors	true	true	Yes

P1 was a confident miss in the wrong direction. The range was 12–19 with a point estimate of 15. Actual: 10, below the lower bound (verified: data pack predictions_scoring.prediction_p1). Rigg expected the dense architecture to provide floor lift over the Super’s 12/30. That assumption turned out to be wrong.

The shortfall happened because three tasks scored 0/3 that were expected to contribute something: task_02, task_05, task_10. Task_10 alone was projected at 1–2/3 based on the Super’s perfect performance. Scoring 0/3 on all three cost more passes than the upside surprise on task_07 could offset.

P3 missed in the right direction: task_07 at 3/3 added 1 to 2 passes beyond the 1–2/3 forecast. It was not enough to bring the total into the predicted range because other tasks declined further than forecast.
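The hit-or-miss column for the range predictions follows from a simple bounds check. The sketch below uses the values from the predictions table; the function name is illustrative, and P5 is left out because it is a true/false prediction, not a range.

```python
# (lower bound, upper bound, actual) for each range prediction.
predictions = {
    "P1 overall score": (12, 19, 10),
    "P2 task_09 score": (0, 0, 0),
    "P3 task_07 score": (1, 2, 3),
    "P4 task_03 score": (1, 2, 1),
}


def score_prediction(low: int, high: int, actual: int) -> str:
    """Classify an actual value against a predicted inclusive range."""
    if actual < low:
        return "miss (below lower bound)"
    if actual > high:
        return "miss (above upper bound)"
    return "hit"


results = {name: score_prediction(*bounds) for name, bounds in predictions.items()}
hits = sum(1 for outcome in results.values() if outcome == "hit")

for name, outcome in results.items():
    print(f"{name}: {outcome}")
print(f"range predictions hit: {hits}/{len(results)}")
```

Two of the four range predictions hit. Adding P5, which also held, gives three of five overall.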

Cost

[Observed — data pack run_summary]

$0.0079 total. $0.00079 per passing run.

Model	Score	$/pass
Claude Sonnet 4.6	28/30	$0.051
GPT-5.5	27/30	$0.070
Mistral Large 3	27/30	$0.0022
GLM-5	27/30	$0.0065
MiniMax M2.5	27/30	$0.0024
GPT-OSS 20B	25/30	$0.000481
Kimi K2.5	24/30	$0.0044
GPT-OSS 120B	23/30	$0.0013
Qwen3-Coder 30B A3B	22/30	$0.00177
Llama 3.3 70B	20/30	$0.0045
Nemotron Super 120B	12/30	$0.0016
Nemotron Nano 3 30B	10/30	$0.00079

$0.00079 per passing run is cheap. The problem is the pass rate behind it: 10/30 means two runs fail for every one that passes, so a correct output takes three runs on average. The per-pass figure already includes those failed runs ($0.0079 over 30 runs is about $0.00026 per run, and three runs cost $0.00079), so the cost does not triple again; what triples is the number of attempts. GPT-OSS 20B delivers $0.000481/pass at an 83% pass rate. That is a smaller number with a higher probability of being correct on the first run.
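The pass-rate comparison can be made concrete with a short calculation. This sketch assumes runs are independent, so the expected number of runs per correct output is the reciprocal of the pass rate; the scores and $/pass values are taken from the table above, and the helper name is illustrative.

```python
RUNS = 30

# model: (passes out of 30, reported $/pass)
leaderboard = {
    "GPT-OSS 20B": (25, 0.000481),
    "Nemotron Super 120B": (12, 0.0016),
    "Nemotron Nano 3 30B": (10, 0.00079),
}


def expected_runs_per_pass(passes: int, runs: int = RUNS) -> float:
    """Mean runs needed for one pass, assuming independent runs."""
    if passes <= 0:
        raise ValueError("no passing runs: expected runs per pass is unbounded")
    return runs / passes


for model, (passes, usd_per_pass) in leaderboard.items():
    pass_rate = passes / RUNS
    attempts = expected_runs_per_pass(passes)
    print(
        f"{model}: first-run pass probability {pass_rate:.0%}, "
        f"about {attempts:.1f} runs per correct output, ${usd_per_pass}/pass"
    )
```

Under that assumption the Nano needs about 3.0 runs per correct output, against about 1.2 for GPT-OSS 20B.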

Average latency per run: 2.37s (verified: data pack run_metrics.latency_by_run). Fast. But speed at 33% accuracy is not a production argument.

What the family result means

[Observed — intra-family data]

NVIDIA now has two models in the dataset: Nemotron Super 120B (12/30, $0.0016/pass) and Nemotron Nano 3 30B (10/30, $0.00079/pass). They hold the last two positions in the leaderboard.

The architecture test was the question this campaign existed to answer. A MoE model activating 12B parameters and a dense model committing 30B parameters ran the same 10-task harness. The dense model scored 2 points lower. That is a clean result: the bottleneck is not compute per forward pass.

The family-level conclusion is that NVIDIA’s agentic post-training does not produce models that reliably complete multi-step tool-use tasks. The failure mode is consistent across both models. wrong_answer dominates: the model calls tools, answers confidently, and gets the answer wrong. Same training lineage, same failure signature, different architecture, different parameter count.

For builders evaluating the Nemotron family for agentic workloads: the architecture variation is now tested. Neither result supports deployment on tasks requiring reasoning over code, schema, or logs. The bright spot is explicit sequential structured execution (task_07, task_08), which may be worth something for narrow workflows. The overall pass rate is not.

What we don’t know yet

[Speculation]

The task_10 regression is unexplained. The Super’s perfect SQL performance was the dataset standout at the time. The Nano’s 0/3 on the same task with a different architecture, but presumably similar training data, is the sharpest open question from this campaign. Transcript analysis for the Nano’s task_10 runs would establish whether it is a context-length failure, a reasoning failure, or something else.

The task_07 improvement also lacks a confirmed mechanism. Both models are post-trained by NVIDIA on similar (presumably overlapping) data. The Super scoring 0/3 on structured sequential plans while the Nano scores 3/3 on the same task implies a fine-tuning difference. What changed between them, and whether it was intentional, is not documented.

A third campaign, either the dense Qwen3 32B family comparison or a prompt-engineering study on the Nano’s task_07 behavior, would add signal. The current data establishes what the architecture test shows. It does not explain the within-family variation.