12/30 was real

May 25, 2026 · campaign-reports

Campaign: 2026-05-24-nvidia-nemotron-super-3-120b-agentic-core-v1
Prior campaign: 2026-05-19-nemotron-super-3-120b-agentic-core-v1
Model: NVIDIA Nemotron Super 3 120B A12B (nvidia.nemotron-super-3-120b, AWS Bedrock ON_DEMAND us-east-1)
Harness: agentic-core-v1 (10 tasks x 3 runs = 30 total)
Campaign date: 2026-05-24

The first run of Nemotron Super 3 120B returned 12/30 — a 13-point miss against the pre-campaign prediction of 25/30. A result that far from expectation raises an obvious question: was it a bad run, or does the model genuinely top out here?

The second run answers that: 11/30 (36.7%), $0.017 total, $0.00154/pass. That is one point from the prior result, with the same failure patterns on the same tasks. The model has a genuine structural ceiling on this harness, and it sits between 11 and 12 passes out of 30 (roughly 37 to 40%).

That settles the noise question. The more interesting finding from the second run lies elsewhere: Nemotron now scores 2/3 on task_09 (know when to stop), which puts it above GPT-OSS 120B’s 1/3 on the one task most high-scoring models still fail. The two models share a nominal 120B parameter count and the same Bedrock infrastructure, yet sit 12 points apart overall, and the lower-scoring one has the better impossibility gate.

What changed between run 1 and run 2?

[Observed — campaign data, verified: pass_rate_by_task.csv]

Task	2026-05-24	2026-05-19	Change
task_01 fix failing test	0/3	2/3	down
task_02 refactor duplicated code	0/3	1/3	down
task_03 investigate log	1/3	0/3	up
task_04 trace through codebase	0/3	0/3	flat
task_05 minimal fix	1/3	1/3	flat
task_06 handle ambiguous requirement	1/3	2/3	down
task_07 multi-step plan	1/3	0/3	up
task_08 recover from tool error	2/3	2/3	flat
task_09 know when to stop	2/3	1/3	up
task_10 SQL investigation	3/3	3/3	flat
Total	11/30	12/30	-1

Total cost: $0.017 | $0.00154/pass | avg latency: ~2.6s/run (verified: cost_breakdown.csv)

The per-task numbers move in both directions — task_01 and task_02 dropped, task_03 and task_07 improved — but the total stays within 1 point of the prior run. task_04 (codebase trace) held at 0/3 for the second consecutive campaign: 6/6 failures total across both runs. task_10 (SQL investigation) held at 3/3 again. The model’s profile is stable.
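The net change of one point hides larger per-task movement. A minimal sketch that recomputes the totals and deltas from the table above; the dictionary layout and variable names are illustrative and are not taken from pass_rate_by_task.csv:

```python
# Pass counts out of 3 runs per task, copied from the table above.
# Tuple order: (2026-05-24, 2026-05-19). The structure is hypothetical.
passes = {
    "task_01": (0, 2),
    "task_02": (0, 1),
    "task_03": (1, 0),
    "task_04": (0, 0),
    "task_05": (1, 1),
    "task_06": (1, 2),
    "task_07": (1, 0),
    "task_08": (2, 2),
    "task_09": (2, 1),
    "task_10": (3, 3),
}

total_new = sum(new for new, _ in passes.values())
total_old = sum(old for _, old in passes.values())
deltas = {task: new - old for task, (new, old) in passes.items()}

up = [t for t, d in deltas.items() if d > 0]
down = [t for t, d in deltas.items() if d < 0]
flat = [t for t, d in deltas.items() if d == 0]

# Gross movement sums the absolute per-task changes, to contrast with the net.
gross = sum(abs(d) for d in deltas.values())

print(f"totals: {total_new}/30 vs {total_old}/30 (net {total_new - total_old:+d})")
print(f"up: {up}")
print(f"down: {down}")
print(f"flat: {flat}")
print(f"gross per-task movement: {gross}")
```

Three tasks moved up, three moved down, and four held, so the totals nearly cancel even though seven run outcomes differ between the two campaigns.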

Does the replication confirm a ceiling, or just a bad batch?

[Observed — combined two-campaign data]

Two independent campaigns, five days apart, same model on the same infrastructure:

Run 1: 12/30, $0.019, task_04 at 0/3, task_10 at 3/3
Run 2: 11/30, $0.017, task_04 at 0/3, task_10 at 3/3

The task-level structure is preserved across both runs. The tasks where Nemotron is strongest reproduce: task_08 at 2/3 both times, task_10 at 3/3 both times, and task_09 at 1/3 then 2/3. The tasks where it is weakest reproduce as well: task_04 at 0/3 both times, plus task_03 and task_07, which never exceed 1/3. A one-point change in total score is what a single run flipping on a single task would produce. Two campaigns is not a large sample, but the shape stability is evidence for structure rather than noise.
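One rough way to see how little a one-point difference means at n = 30 is a binomial confidence interval on each total. The sketch below uses the Wilson score interval and treats all 30 runs as independent trials with a shared pass probability. That is an assumption the per-task data clearly violates, so read the intervals as a loose guide only:

```python
import math

def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate.

    Assumes independent runs with one shared pass probability, which is
    a simplification for a harness made of 10 very different tasks.
    """
    p = passes / runs
    denom = 1 + z**2 / runs
    centre = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return centre - half, centre + half

run1 = wilson_interval(12, 30)
run2 = wilson_interval(11, 30)

print(f"run 1 (12/30): {run1[0]:.2f} to {run1[1]:.2f}")
print(f"run 2 (11/30): {run2[0]:.2f} to {run2[1]:.2f}")
```

Each interval comfortably contains the other run's observed pass rate, which is consistent with treating 11/30 and 12/30 as the same underlying result.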

[Observed — tool call data, verified: tool_call_redundancy.md]

The underlying mechanism shows up in tool-call depth. Average tool calls per task across the second run: 1.0 to 2.0 for most tasks. task_01 averages 1.0 tool call. The model reads one file and returns an answer, with no test run and no verification. All 3 task_01 runs are wrong_answer. task_02 averages 0.7 tool calls — meaning one run issued zero real tool calls (classified tool_call_hallucinated).
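The step from a 0.7 average to "one run issued zero tool calls" is simple counting over 3 runs. A small sketch, assuming the reported average is rounded to one decimal place and that each run has a whole-number tool-call count:

```python
from itertools import combinations_with_replacement

RUNS = 3
REPORTED_AVG = 0.7  # task_02 average tool calls, as reported above

# Which whole-number totals across 3 runs round to the reported average?
# The search bound of 10 is arbitrary; it only needs to exceed the answer.
totals = [t for t in range(0, 10) if round(t / RUNS, 1) == REPORTED_AVG]

# Every way to split that total across 3 runs, ignoring run order.
splits = [
    combo
    for t in totals
    for combo in combinations_with_replacement(range(t + 1), RUNS)
    if sum(combo) == t
]

print("possible totals:", totals)
print("possible per-run splits:", splits)
print("every split has a zero-call run:", all(min(s) == 0 for s in splits))
```

Only a total of 2 calls fits, and any way of spreading 2 calls over 3 runs leaves at least one run with none, which matches the tool_call_hallucinated classification on one run.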

This is not a model that exhausts its turn budget. It declares answers quickly, without the intermediate verification steps that passing runs on this harness typically require. The 12B active parameter budget appears to set a natural limit on how many tool-result integration steps the model will attempt before committing.

Why is Nemotron 12 points below GPT-OSS 120B?

[Observed — cross-campaign data, verified: combined leaderboard]

Model	Score	$/pass	task_04	task_09	Failure mode
OpenAI GPT-OSS 120B	23/30	$0.00087	2/3	1/3	wrong_answer x6
NVIDIA Nemotron Super 120B	11/30	$0.00154	0/3	2/3	wrong_answer x16, tool_call_hallucinated x1

Same nominal parameter count. Same AWS Bedrock ON_DEMAND infrastructure in us-east-1. Both campaigns ran within 48 hours of each other using the same harness version. The 12-point gap is not an infrastructure artifact.

[Speculation] GPT-OSS 120B’s post-training appears optimised for broad executor-style work: code fix, refactor, debugging chains, multi-step sequential tasks. Nemotron’s post-training appears optimised for bounded analytical tasks where the answer space is structured: SQL pattern match, schema navigation, impossibility recognition. Those training objectives produce different strengths. On a harness that is weighted toward execution-heavy tasks (tasks 1, 2, 3, 4, 5, 7 all require sustained multi-step tool use), GPT-OSS 120B’s training objective is better matched.

The parameter count is not the variable that explains this gap. Post-training objective is.

Does Nemotron’s task_09 improvement mean anything?

[Observed — task_09 run data, verified: task_09_transcripts, tool_call_redundancy.md]

task_09 asks the model to compute a 10-day moving average from 3 rows of data. The correct response is to recognize that the data is insufficient and decline. Most models either submit a confident wrong answer or exhaust their turn limit trying to work around the data shortage.

Nemotron Super 120B scored 2/3 in the second run, up from 1/3 in the first. The two passing runs averaged 3 tool calls each (max 4): read the file, register the data shortage, write the refusal. No loop, no turn exhaustion. The one failing run produced a wrong_answer.

[Observed — comparative data]

Model	Score	task_09
Ministral 3 8B	28/30	2/3
Claude Haiku 4.5	27/30	0/3
Devstral 2	27/30	1/3
OpenAI GPT-OSS 120B	23/30	1/3
Kimi K2 Thinking	12/30	1/3
NVIDIA Nemotron Super 120B	11/30	2/3

Nemotron’s 2/3 on task_09 matches Ministral 3 8B — the dataset’s open-weight value leader at 28/30 overall. Claude Haiku 4.5 (27/30) scores 0/3 on the same task. GPT-OSS 120B (23/30) scores 1/3.

[Speculation] Knowing when to stop does not track overall agentic capability in this dataset. The models that reliably recognize impossible work are not necessarily the ones that reliably complete possible work. Whether Nemotron’s impossibility gate generalizes beyond this specific three-row CSV pattern requires more task_09-class tests. Three passing runs out of six across two campaigns is suggestive, but the sample is small: 2/3 on a 3-run task is only one more correct response than 1/3.
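The relationship between overall score and task_09 can be checked against the six rows of the comparison table above. A minimal sketch using a hand-rolled Pearson correlation; with six models and a 0 to 3 scale on one axis, the coefficient is a rough descriptive number, not a significance test:

```python
import math

# (overall passes out of 30, task_09 passes out of 3), from the table above.
rows = {
    "Ministral 3 8B": (28, 2),
    "Claude Haiku 4.5": (27, 0),
    "Devstral 2": (27, 1),
    "OpenAI GPT-OSS 120B": (23, 1),
    "NVIDIA Nemotron Super 120B": (11, 2),
    "Kimi K2 Thinking": (12, 1),
}

def pearson(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

overall = [score for score, _ in rows.values()]
task_09 = [score for _, score in rows.values()]
r = pearson(overall, task_09)

print(f"Pearson r = {r:.2f} over {len(rows)} models")
```

On these six rows the coefficient comes out weakly negative (about -0.3), so higher overall scores do not go with better task_09 results here. The sample is far too small to claim more than that.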

Does the 120B parameter count justify the cost?

[Observed — cost comparison]

Model	Score	Cost/pass
Ministral 3 8B	28/30	$0.00067
GPT-OSS 120B	23/30	$0.00087
Nemotron Nano 3 30B	10/30	$0.00079
Nemotron Super 3 120B	11/30	$0.00154

Nemotron Super 120B is more expensive per passing task than both GPT-OSS 120B ($0.00087) and its own smaller sibling Nemotron Nano 30B ($0.00079), despite scoring lower than GPT-OSS 120B and only 1 point above Nano. The larger MoE model does not pay for itself in pass rate.
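The cost comparison reduces to a few ratios over the cost-per-pass column. A short sketch using only the figures from the table above; the variable names are illustrative:

```python
# Cost per passing run in USD, copied from the cost comparison table.
cost_per_pass = {
    "Ministral 3 8B": 0.00067,
    "GPT-OSS 120B": 0.00087,
    "Nemotron Nano 3 30B": 0.00079,
    "Nemotron Super 3 120B": 0.00154,
}

baseline = cost_per_pass["Nemotron Super 3 120B"]

# How many times more Nemotron Super costs per pass than each other model.
ratios = {
    model: baseline / cost
    for model, cost in cost_per_pass.items()
    if model != "Nemotron Super 3 120B"
}

for model, ratio in ratios.items():
    print(f"Nemotron Super 3 120B costs {ratio:.2f}x {model} per pass")
```

On these numbers Nemotron Super costs roughly 1.8x GPT-OSS 120B, 1.9x Nemotron Nano, and 2.3x Ministral 3 8B per passing run.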

[Speculation] For any production agentic workload, the Nemotron family is currently dominated on cost-per-outcome by both GPT-OSS 120B and Ministral 3 8B. The only specific advantage Nemotron Super has in this dataset is task_09, and Ministral 3 8B matches it there at roughly 2.3x lower cost per passing task overall ($0.00067 vs $0.00154).

Where does Nemotron Super sit in the full leaderboard?

[Observed — combined campaign data]

Rank	Model	Score	Cost/pass
1	Claude Sonnet 4.6	28/30	$0.0514
1	GLM-4.7	28/30	$0.00380
1	MiniMax M2.1	28/30	$0.00258
1	DeepSeek V4-Flash	28/30	$0.00143
1	Ministral 3 8B	28/30	$0.00067
6	Mistral Large 3 675B	27/30	$0.00213
6	MiniMax M2.5	27/30	$0.00240
6	Devstral 2	27/30	$0.00190
6	GLM-5	27/30	—
6	Claude Haiku 4.5	27/30	$0.00316
11	GLM-4.7 Flash	25/30	$0.00057
11	GPT-OSS 20B	25/30	—
13	MiniMax M2	24/30	$0.00417
14	OpenAI GPT-OSS 120B	23/30	$0.00087
15	Kimi K2 Thinking	12/30	$0.00440
16	NVIDIA Nemotron Super 120B	11/30	$0.00154
17	NVIDIA Nemotron Nano 30B	10/30	$0.00079

Nemotron Super 3 120B sits 16th, one point below Kimi K2 Thinking (12/30) and above only Nemotron Nano 30B (10/30). At 120B total parameters and 12B active, it is the largest model in the dataset by total parameter count to score below a 40% pass rate.
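The rank column follows standard competition ranking, where tied scores share a rank and the next rank skips ahead. A minimal sketch that reproduces it from the scores in the leaderboard above; the function name is hypothetical:

```python
# (model, passes out of 30), copied from the leaderboard table.
leaderboard = [
    ("Claude Sonnet 4.6", 28), ("GLM-4.7", 28), ("MiniMax M2.1", 28),
    ("DeepSeek V4-Flash", 28), ("Ministral 3 8B", 28),
    ("Mistral Large 3 675B", 27), ("MiniMax M2.5", 27), ("Devstral 2", 27),
    ("GLM-5", 27), ("Claude Haiku 4.5", 27),
    ("GLM-4.7 Flash", 25), ("GPT-OSS 20B", 25),
    ("MiniMax M2", 24), ("OpenAI GPT-OSS 120B", 23),
    ("Kimi K2 Thinking", 12), ("NVIDIA Nemotron Super 120B", 11),
    ("NVIDIA Nemotron Nano 30B", 10),
]

def competition_rank(entries: list[tuple[str, int]]) -> dict[str, int]:
    """Rank by score, highest first; ties share the best position."""
    ordered = sorted(entries, key=lambda entry: entry[1], reverse=True)
    ranks: dict[str, int] = {}
    prev_score, prev_rank = None, 0
    for position, (model, score) in enumerate(ordered, start=1):
        rank = prev_rank if score == prev_score else position
        ranks[model] = rank
        prev_score, prev_rank = score, rank
    return ranks

ranks = competition_rank(leaderboard)
for model, score in leaderboard:
    print(f"{ranks[model]:>2}  {model:<28} {score}/30")
```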

What did the predictions get right?

[Observed — predictions scored against results, verified: predictions/nvidia-nemotron-super-3-120b-agentic-core-v1.md]

Prediction	Result
P1: Score 10-15/30	Pass: 11/30
P2: task_03 and task_04 both 0/3	Fail: task_03 scored 1/3
P3: task_07 scores 0/3	Fail: task_07 scored 1/3
P4: Cost $0.013-$0.025	Pass: $0.017
P5: Nemotron stays 5+ pts below GPT-OSS 120B	Pass: 12-pt gap (11 vs 23)
P6: task_10 scores 3/3	Pass: 3/3

4/6. The score range (P1), cost range (P4), gap to GPT-OSS 120B (P5), and task_10 (P6) predictions all held. The two misses were in the same direction: task_03 and task_07 each came in at 1/3 instead of the predicted 0/3. Nemotron slightly outperformed on both even as its total fell by 1. The per-task point predictions were off; the macro picture was accurate.
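Scoring predictions like these is mechanical once each one is written as a check against the results. A sketch under that assumption; the dictionary keys and lambda encoding are illustrative and are not the format of the predictions file:

```python
# Observed results for this campaign, taken from the tables above.
results = {
    "total": 11,
    "task_03": 1,
    "task_04": 0,
    "task_07": 1,
    "task_10": 3,
    "cost_usd": 0.017,
    "gpt_oss_120b_total": 23,
}

# Each prediction as a pass/fail check (hypothetical encoding).
predictions = {
    "P1": lambda r: 10 <= r["total"] <= 15,
    "P2": lambda r: r["task_03"] == 0 and r["task_04"] == 0,
    "P3": lambda r: r["task_07"] == 0,
    "P4": lambda r: 0.013 <= r["cost_usd"] <= 0.025,
    "P5": lambda r: r["gpt_oss_120b_total"] - r["total"] >= 5,
    "P6": lambda r: r["task_10"] == 3,
}

outcomes = {pid: check(results) for pid, check in predictions.items()}
hits = sum(outcomes.values())

for pid, passed in outcomes.items():
    print(f"{pid}: {'Pass' if passed else 'Fail'}")
print(f"{hits}/{len(predictions)} predictions held")
```

Writing P2 as a conjunction makes the miss explicit: task_04 matched the prediction, and the whole prediction failed on task_03 alone.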

What we do not know yet

[Speculation]

Whether task_09’s 2/3 result in the second run will hold on a third campaign. Three passing runs out of six across two independent campaigns is more evidence than one campaign alone, but task_09 is a 3-run task, so each campaign adds little. The current evidence (1/3 then 2/3) points toward improvement. Whether Nemotron has a reliable impossibility gate or whether 2/3 is the upper bound of sampling variance requires at least one more campaign result.

[Speculation]

Whether a Nemotron variant with stronger instruction-following post-training would close the gap with GPT-OSS 120B on execution-heavy tasks without sacrificing the task_09 strength. The current model appears to trade executor breadth for analytical depth. A different post-training mix could shift that tradeoff — but that model is not in the dataset.

[Speculation]

The task_01 regression (2/3 to 0/3) is the biggest per-task movement in this campaign. The model went from passing 2 of 3 fix-failing-test runs in run 1 to passing 0 in run 2, at 1.0 average tool calls. Whether this is sampling variance on a task that is at the edge of Nemotron’s capability, or a repeatable shallow-engagement failure, would require a third campaign to separate.

ClawWorks Weekly