12/30 was real

Campaign: 2026-05-24-nvidia-nemotron-super-3-120b-agentic-core-v1
Prior campaign: 2026-05-19-nemotron-super-3-120b-agentic-core-v1
Model: NVIDIA Nemotron Super 3 120B A12B (nvidia.nemotron-super-3-120b, AWS Bedrock ON_DEMAND us-east-1)
Harness: agentic-core-v1 (10 tasks x 3 runs = 30 total)
Campaign date: 2026-05-24


The first run of Nemotron Super 3 120B returned 12/30 — a 13-point miss against the pre-campaign prediction of 25/30. A result that far from expectation raises an obvious question: was it a bad run, or does the model genuinely top out here?

The second run answers that. 11/30 (36.7%), $0.017 total, $0.00154/pass. One point from the prior result. Same failure patterns on the same tasks. The model has a genuine structural ceiling on this harness, and it sits at 11 to 12 passes out of 30 (roughly 37 to 40%).

That settles the noise question. The more interesting thing that came out of the second run is different: Nemotron now scores 2/3 on task_09 (know when to stop), which puts it above GPT-OSS 120B’s 1/3 on the one task most high-scoring models still fail. Two models, same 120B parameters, same Bedrock infrastructure, 12 points apart overall — and the lower-scoring one has the better impossibility gate.


What changed between run 1 and run 2?

[Observed — campaign data, verified: pass_rate_by_task.csv]

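| Task                                 | 2026-05-24 | 2026-05-19 | Change |
|--------------------------------------|------------|------------|--------|
| task_01 fix failing test             | 0/3        | 2/3        | down   |
| task_02 refactor duplicated code     | 0/3        | 1/3        | down   |
| task_03 investigate log              | 1/3        | 0/3        | up     |
| task_04 trace through codebase       | 0/3        | 0/3        | flat   |
| task_05 minimal fix                  | 1/3        | 1/3        | flat   |
| task_06 handle ambiguous requirement | 1/3        | 2/3        | down   |
| task_07 multi-step plan              | 1/3        | 0/3        | up     |
| task_08 recover from tool error      | 2/3        | 2/3        | flat   |
| task_09 know when to stop            | 2/3        | 1/3        | up     |
| task_10 SQL investigation            | 3/3        | 3/3        | flat   |
| Total                                | 11/30      | 12/30      | -1     |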

Total cost: $0.017 | $0.00154/pass | avg latency: ~2.6s/run (verified: cost_breakdown.csv)

The per-task numbers move in both directions — task_01 and task_02 dropped, task_03 and task_07 improved — but the total stays within 1 point of the prior run. task_04 (codebase trace) held at 0/3 for the second consecutive campaign: 6/6 failures total across both runs. task_10 (SQL investigation) held at 3/3 again. The model’s profile is stable.
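The totals and per-task movements above can be recomputed directly from the table. The sketch below is only an illustration: the dictionary is hand-copied from the table (it is not read from pass_rate_by_task.csv), and the names `PASSES`, `RUNS_PER_TASK`, and `TOTAL_COST_USD` are made up for this example.

```python
# Passes out of 3 runs per task, copied from the table above.
# Each value is (2026-05-24 campaign, 2026-05-19 campaign).
PASSES = {
    "task_01": (0, 2),
    "task_02": (0, 1),
    "task_03": (1, 0),
    "task_04": (0, 0),
    "task_05": (1, 1),
    "task_06": (1, 2),
    "task_07": (1, 0),
    "task_08": (2, 2),
    "task_09": (2, 1),
    "task_10": (3, 3),
}
RUNS_PER_TASK = 3
TOTAL_COST_USD = 0.017  # reported total for the 2026-05-24 campaign (rounded)

new_total = sum(new for new, _ in PASSES.values())
old_total = sum(old for _, old in PASSES.values())
deltas = {task: new - old for task, (new, old) in PASSES.items()}
total_runs = RUNS_PER_TASK * len(PASSES)

print(f"2026-05-24: {new_total}/{total_runs} ({new_total / total_runs:.1%})")
print(f"2026-05-19: {old_total}/{total_runs} ({old_total / total_runs:.1%})")
print("net change:", sum(deltas.values()))
print("largest swing:", max(deltas.items(), key=lambda kv: abs(kv[1])))

# Dividing the rounded total gives about $0.00155; the reported $0.00154
# presumably comes from the unrounded total in cost_breakdown.csv.
print(f"cost per pass: ${TOTAL_COST_USD / new_total:.5f}")
```

Keeping the per-task pairs in one structure makes it easy to see that the net change of -1 is the sum of offsetting moves rather than a uniform drift.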


Does the replication confirm a ceiling, or just a bad batch?

[Observed — combined two-campaign data]

Two independent campaigns, five days apart, ran the same model on the same infrastructure.

The task-level structure is preserved across both. The tasks where Nemotron passes most often reproduce: task_10 at 3/3 both times, task_08 at 2/3 both times, and task_09 at 1/3 then 2/3. The tasks where it fails reproduce too: task_04 at 0/3 both times, with task_03 and task_07 each moving only from 0/3 to 1/3. A one-point difference in total score is well within ordinary sampling variation. Two campaigns is not a large sample, but the stability of the per-task shape is evidence for structure rather than noise.
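A rough way to put a number on "sampling variation" is to treat each task as three independent pass/fail trials at a fixed pass rate. That independence assumption is ours, not something the harness guarantees, and the pooled pass counts below are simply the two campaigns' per-task results added together. All names in the sketch are illustrative.

```python
import math

RUNS_PER_TASK = 3

# Passes per task pooled over both campaigns (out of 6 runs each),
# summed from the per-task table earlier in this post.
POOLED_PASSES = {
    "task_01": 2, "task_02": 1, "task_03": 1, "task_04": 0, "task_05": 2,
    "task_06": 3, "task_07": 1, "task_08": 4, "task_09": 3, "task_10": 6,
}
POOLED_RUNS = 2 * RUNS_PER_TASK


def campaign_total_sd(pooled_passes):
    """Std. dev. of one campaign's total score, treating each task as
    RUNS_PER_TASK independent Bernoulli trials at its pooled pass rate."""
    variance = 0.0
    for passes in pooled_passes.values():
        p = passes / POOLED_RUNS
        variance += RUNS_PER_TASK * p * (1 - p)
    return math.sqrt(variance)


sd_total = campaign_total_sd(POOLED_PASSES)
# The difference of two independent campaign totals has twice the variance.
sd_difference = math.sqrt(2) * sd_total

print(f"sd of one campaign total: {sd_total:.2f} passes")
print(f"sd of campaign-to-campaign difference: {sd_difference:.2f} passes")
```

Under these assumptions a single campaign total wobbles by roughly two passes, so a one-point gap between campaigns is unremarkable. The estimate is crude because the pass rates themselves come from only six runs per task.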

[Observed — tool call data, verified: tool_call_redundancy.md]

The underlying mechanism shows up in tool-call depth. Average tool calls per task across the second run: 1.0 to 2.0 for most tasks. task_01 averages 1.0 tool call. The model reads one file and returns an answer, with no test run and no verification. All 3 task_01 runs are wrong_answer. task_02 averages 0.7 tool calls — meaning one run issued zero real tool calls (classified tool_call_hallucinated).

This is not a model that exhausts its turn budget. It declares answers quickly, without the intermediate verification steps that passing runs on this harness typically require. One possible explanation, not tested here, is that the 12B active parameter budget limits how many tool-result integration steps the model will attempt before committing.


Why is Nemotron 12 points below GPT-OSS 120B?

[Observed — cross-campaign data, verified: combined leaderboard]

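| Model                      | Score | $/pass   | task_04 | task_09 | Failure mode                                 |
|----------------------------|-------|----------|---------|---------|----------------------------------------------|
| OpenAI GPT-OSS 120B        | 23/30 | $0.00087 | 2/3     | 1/3     | wrong_answer x6                              |
| NVIDIA Nemotron Super 120B | 11/30 | $0.00154 | 0/3     | 2/3     | wrong_answer x16, tool_call_hallucinated x1  |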

Same nominal parameter count. Same AWS Bedrock ON_DEMAND infrastructure in us-east-1. Both campaigns ran within 48 hours of each other using the same harness version. The 12-point gap is not an infrastructure artifact.

[Speculation] GPT-OSS 120B’s post-training appears optimised for broad executor-style work: code fix, refactor, debugging chains, multi-step sequential tasks. Nemotron’s post-training appears optimised for bounded analytical tasks where the answer space is structured: SQL pattern match, schema navigation, impossibility recognition. Those training objectives produce different strengths. On a harness that is weighted toward execution-heavy tasks (tasks 1, 2, 3, 4, 5, 7 all require sustained multi-step tool use), GPT-OSS 120B’s training objective is better matched.

The parameter count is not the variable that explains this gap. Post-training objective is.


Does Nemotron’s task_09 improvement mean anything?

[Observed — task_09 run data, verified: task_09_transcripts, tool_call_redundancy.md]

task_09 asks the model to compute a 10-day moving average from 3 rows of data. The correct response is to recognize that the data is insufficient and decline. Most models either submit a confident wrong answer or exhaust their turn limit trying to work around the data shortage.

Nemotron Super 120B scored 2/3 in the second run, up from 1/3 in the first. The two passing runs averaged 3 tool calls each (max 4): read the file, register the data shortage, write the refusal. No loop, no turn exhaustion. The one failing run produced a wrong_answer.

[Observed — comparative data]

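| Model                      | Score | task_09 |
|----------------------------|-------|---------|
| Ministral 3 8B             | 28/30 | 2/3     |
| Claude Haiku 4.5           | 27/30 | 0/3     |
| Devstral 2                 | 27/30 | 1/3     |
| OpenAI GPT-OSS 120B        | 23/30 | 1/3     |
| NVIDIA Nemotron Super 120B | 11/30 | 2/3     |
| Kimi K2 Thinking           | 12/30 | 1/3     |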

Nemotron’s 2/3 on task_09 matches Ministral 3 8B — the dataset’s open-weight value leader at 28/30 overall. Claude Haiku 4.5 (27/30) scores 0/3 on the same task. GPT-OSS 120B (23/30) scores 1/3.

[Speculation] Knowing when to stop does not appear to track overall agentic capability in this dataset. The models that reliably recognize impossible work are not necessarily the ones that reliably complete possible work. Whether Nemotron’s impossibility gate generalizes beyond this specific three-row CSV pattern requires more task_09-class tests. Three passes in six runs across two campaigns is a real signal, but the step from 1/3 to 2/3 is only one additional correct response on a 3-run task.
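One way to probe the relationship between overall score and task_09 is a plain Pearson correlation over the six models in the comparison table. The sketch below is illustrative only: the data is hand-copied from that table, `MODELS` and `pearson` are names invented here, and the overall score includes task_09 itself, which biases the coefficient slightly upward.

```python
import math

# (overall score out of 30, task_09 passes out of 3), from the table above.
MODELS = {
    "Ministral 3 8B": (28, 2),
    "Claude Haiku 4.5": (27, 0),
    "Devstral 2": (27, 1),
    "OpenAI GPT-OSS 120B": (23, 1),
    "NVIDIA Nemotron Super 120B": (11, 2),
    "Kimi K2 Thinking": (12, 1),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


overall = [score for score, _ in MODELS.values()]
task_09 = [passes for _, passes in MODELS.values()]
r = pearson(overall, task_09)
print(f"Pearson r (overall score vs task_09 passes): {r:.2f}")
```

With these six rows the coefficient comes out weakly negative (about -0.3). Six points with three runs each cannot distinguish that from zero, so it is consistent with "no clear relationship" rather than proof of an inverse one.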


Does the 120B parameter count justify the cost?

[Observed — cost comparison]

| Model                 | Score | Cost/pass |
|-----------------------|-------|-----------|
| Ministral 3 8B        | 28/30 | $0.00067  |
| GPT-OSS 120B          | 23/30 | $0.00087  |
| Nemotron Nano 3 30B   | 10/30 | $0.00079  |
| Nemotron Super 3 120B | 11/30 | $0.00154  |

Nemotron Super 120B is more expensive per passing task than both GPT-OSS 120B ($0.00087) and its own smaller sibling Nemotron Nano 30B ($0.00079), despite scoring lower than GPT-OSS 120B and only 1 point above Nano. The larger MoE model does not pay for itself in pass rate.

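[Speculation] For any production agentic workload, Nemotron Super is currently dominated on both score and cost per pass by GPT-OSS 120B and Ministral 3 8B. Nemotron Nano is slightly cheaper per pass than GPT-OSS 120B but scores 13 points lower, and Ministral 3 8B beats it on both axes. The only specific advantage Nemotron Super has in this dataset is task_09, and Ministral 3 8B matches it there at roughly 2.3x lower cost per passing task overall ($0.00067 vs $0.00154).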
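The dominance claim can be checked mechanically from the cost table: one model dominates another if it is no worse on score and on cost per pass, and strictly better on at least one. The sketch below uses only the four rows of that table. The `dominates` and `dominators` helpers are illustrative names, not part of the harness.

```python
# (score out of 30, cost per pass in USD), copied from the table above.
MODELS = {
    "Ministral 3 8B": (28, 0.00067),
    "GPT-OSS 120B": (23, 0.00087),
    "Nemotron Nano 3 30B": (10, 0.00079),
    "Nemotron Super 3 120B": (11, 0.00154),
}


def dominates(a, b):
    """True if model a is at least as good as b on both axes and better on one."""
    score_a, cost_a = MODELS[a]
    score_b, cost_b = MODELS[b]
    no_worse = score_a >= score_b and cost_a <= cost_b
    strictly_better = score_a > score_b or cost_a < cost_b
    return no_worse and strictly_better


def dominators(target):
    """Names of all models in MODELS that dominate the target model."""
    return [name for name in MODELS if name != target and dominates(name, target)]


for target in ("Nemotron Super 3 120B", "Nemotron Nano 3 30B"):
    print(target, "is dominated by:", dominators(target))

ratio = MODELS["Nemotron Super 3 120B"][1] / MODELS["Ministral 3 8B"][1]
print(f"Super vs Ministral cost-per-pass ratio: {ratio:.1f}x")
```

On these numbers Nemotron Super is dominated by both Ministral 3 8B and GPT-OSS 120B, while Nemotron Nano is dominated only by Ministral 3 8B, because its cost per pass is lower than GPT-OSS 120B's. The Super-to-Ministral cost-per-pass ratio works out to about 2.3x.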


Where does Nemotron Super sit in the full leaderboard?

[Observed — combined campaign data]

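| Rank | Model                      | Score | Cost/pass |
|------|----------------------------|-------|-----------|
| 1    | Claude Sonnet 4.6          | 28/30 | $0.0514   |
| 1    | GLM-4.7                    | 28/30 | $0.00380  |
| 1    | MiniMax M2.1               | 28/30 | $0.00258  |
| 1    | DeepSeek V4-Flash          | 28/30 | $0.00143  |
| 1    | Ministral 3 8B             | 28/30 | $0.00067  |
| 6    | Mistral Large 3 675B       | 27/30 | $0.00213  |
| 6    | MiniMax M2.5               | 27/30 | $0.00240  |
| 6    | Devstral 2                 | 27/30 | $0.00190  |
| 6    | GLM-5                      | 27/30 |           |
| 6    | Claude Haiku 4.5           | 27/30 | $0.00316  |
| 11   | GLM-4.7 Flash              | 25/30 | $0.00057  |
| 11   | GPT-OSS 20B                | 25/30 |           |
| 13   | MiniMax M2                 | 24/30 | $0.00417  |
| 14   | OpenAI GPT-OSS 120B        | 23/30 | $0.00087  |
| 15   | Kimi K2 Thinking           | 12/30 | $0.00440  |
| 16   | NVIDIA Nemotron Super 120B | 11/30 | $0.00154  |
| 17   | NVIDIA Nemotron Nano 30B   | 10/30 | $0.00079  |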

Nemotron Super 3 120B sits 16th, one point below Kimi K2 Thinking (12/30), above only Nemotron Nano 30B (10/30). At 120B total parameters and 12B active, it is the largest model in the dataset by total parameter count that falls below a 40% pass rate.
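The rank column follows standard competition ranking: tied models share a rank, and the next rank skips ahead by the size of the tie. The sketch below reproduces those ranks from the scores alone. The list is hand-copied from the leaderboard, and `competition_ranks` is an illustrative helper, not harness code.

```python
# (model, score out of 30) in leaderboard order, copied from the table above.
LEADERBOARD = [
    ("Claude Sonnet 4.6", 28), ("GLM-4.7", 28), ("MiniMax M2.1", 28),
    ("DeepSeek V4-Flash", 28), ("Ministral 3 8B", 28),
    ("Mistral Large 3 675B", 27), ("MiniMax M2.5", 27), ("Devstral 2", 27),
    ("GLM-5", 27), ("Claude Haiku 4.5", 27),
    ("GLM-4.7 Flash", 25), ("GPT-OSS 20B", 25),
    ("MiniMax M2", 24),
    ("OpenAI GPT-OSS 120B", 23),
    ("Kimi K2 Thinking", 12),
    ("NVIDIA Nemotron Super 120B", 11),
    ("NVIDIA Nemotron Nano 30B", 10),
]


def competition_ranks(entries):
    """Rank = 1 + number of models with a strictly higher score (ties share a rank)."""
    scores = [score for _, score in entries]
    return {
        name: 1 + sum(other > score for other in scores)
        for name, score in entries
    }


ranks = competition_ranks(LEADERBOARD)
for name, score in LEADERBOARD:
    print(f"{ranks[name]:>2}  {name:<28} {score}/30 ({score / 30:.1%})")
```

Printing the pass rate next to each rank also makes the 40% line visible: 12/30 is exactly 40.0%, and only the two Nemotron models fall below it.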


What did the predictions get right?

[Observed — predictions scored against results, verified: predictions/nvidia-nemotron-super-3-120b-agentic-core-v1.md]

| Prediction                                   | Result                      |
|----------------------------------------------|-----------------------------|
| P1: Score 10-15/30                           | Pass: 11/30                 |
| P2: task_03 and task_04 both 0/3             | Fail: task_03 scored 1/3    |
| P3: task_07 scores 0/3                       | Fail: task_07 scored 1/3    |
| P4: Cost $0.013-$0.025                       | Pass: $0.017                |
| P5: Nemotron stays 5+ pts below GPT-OSS 120B | Pass: 12-pt gap (11 vs 23)  |
| P6: task_10 scores 3/3                       | Pass: 3/3                   |

4/6. The range prediction was right (10-15/30). The predicted gap to GPT-OSS 120B held. The two misses were in the same direction: task_03 and task_07 both came in at 1/3 instead of 0/3. Nemotron slightly outperformed the prediction on both, even though the total fell by 1 from the prior campaign. The per-task point predictions were off; the macro picture was accurate.
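The scoring above can be expressed as a set of checks against the observed results. This is a sketch of how one might encode it, not the format of the actual predictions file. The `RESULTS` and `PREDICTIONS` names and the key names are invented for the example, and the values are copied from the tables in this post.

```python
# Observed results for the 2026-05-24 campaign, taken from the tables in this post.
RESULTS = {
    "total": 11,
    "gpt_oss_120b_total": 23,
    "cost_usd": 0.017,
    "task_03": 1,
    "task_04": 0,
    "task_07": 1,
    "task_10": 3,
}

# Each prediction is a label plus a boolean check over RESULTS.
PREDICTIONS = {
    "P1": lambda r: 10 <= r["total"] <= 15,
    "P2": lambda r: r["task_03"] == 0 and r["task_04"] == 0,
    "P3": lambda r: r["task_07"] == 0,
    "P4": lambda r: 0.013 <= r["cost_usd"] <= 0.025,
    "P5": lambda r: r["gpt_oss_120b_total"] - r["total"] >= 5,
    "P6": lambda r: r["task_10"] == 3,
}

outcomes = {label: check(RESULTS) for label, check in PREDICTIONS.items()}
hits = sum(outcomes.values())
misses = [label for label, ok in outcomes.items() if not ok]
print(f"{hits}/{len(outcomes)} predictions held; misses: {misses}")
```

Writing predictions as executable checks before a campaign runs removes any ambiguity about what counts as a pass. P2, for instance, fails as a whole even though its task_04 half was correct.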


What we do not know yet

[Speculation]

Whether task_09’s 2/3 result in the second campaign will hold in a third. Three passes in six runs across two independent campaigns is more evidence than a single passing run, but task_09 is a 3-run task, so each campaign can only move in steps of one-third. The current evidence (1/3 then 2/3) points toward improvement. Whether Nemotron has a reliable impossibility gate, or whether 2/3 is simply the upper end of sampling variance, requires at least one more campaign result.
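To see why a third campaign is needed, it helps to look at how a 3-run task behaves under a fixed pass rate. The sketch below assumes the runs are independent with a constant pass probability, which is a simplification. The 0.5 rate is just the pooled 3-of-6 figure, and `binom_pmf` is an illustrative helper.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k passes in n independent runs with pass rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


RUNS = 3
POOLED_RATE = 0.5  # 3 passes in 6 task_09 runs across both campaigns

# Distribution of passes in a single 3-run campaign at the pooled rate.
distribution = {k: binom_pmf(k, RUNS, POOLED_RATE) for k in range(RUNS + 1)}
for k, prob in distribution.items():
    print(f"P({k}/{RUNS}) = {prob:.3f}")

# Chance of seeing 1/3 in one campaign and 2/3 in the next with no real change.
print(f"P(1/3 then 2/3) = {distribution[1] * distribution[2]:.3f}")
```

At a constant 50% pass rate, 1/3 and 2/3 are equally likely outcomes (0.375 each), so the observed sequence is fully compatible with no underlying change. More task_09 runs are the only way to narrow that down.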

[Speculation]

Whether a Nemotron variant with stronger instruction-following post-training would close the gap with GPT-OSS 120B on execution-heavy tasks without sacrificing the task_09 strength. The current model appears to trade executor breadth for analytical depth. A different post-training mix could shift that tradeoff — but that model is not in the dataset.

[Speculation]

The task_01 regression (2/3 to 0/3) is the biggest per-task movement in this campaign. The model went from passing 2 of 3 fix-failing-test runs in the first campaign to passing 0 of 3 in the second, at an average of 1.0 tool calls per run. Separating sampling variance on a task at the edge of Nemotron’s capability from a repeatable shallow-engagement failure would require a third campaign.

ClawWorks Weekly

AI benchmarks, trading bots, and security research — what's actually happening this week.