Hardware expertise, software failure
Campaign: 2026-05-19-nemotron-super-3-120b-agentic-core-v1
Model: NVIDIA Nemotron Super 3 120B A12B
Harness: agentic-core-v1 (OpenClaw, 10 tasks × 3 runs)
Runs: 30 (10 tasks × 3 runs each)
Campaign date: 2026-05-19
NVIDIA built the machines that run every model in this dataset. H100s, Blackwell, NVLink: the infrastructure layer that makes modern LLM inference possible. The question going into this campaign was whether building on your own silicon gives you an edge when you decide to ship a model of your own.
Nemotron Super 3 120B is that attempt: a 120B-parameter mixture-of-experts model that activates 12B parameters per forward pass, and NVIDIA’s first hosted LLM entry in the modelbattles dataset. The answer, at least on practical agentic tasks, is no: 12/30 (40%), dead last among the 100B+ models we’ve tested.
The hardware expertise doesn’t appear to have bought anything meaningful on the software side. What makes the result interesting isn’t the headline number. It’s the shape of the failure.
What the harness actually tests
[Observed: harness spec]
agentic-core-v1 runs each model on 10 tasks, 3 times each, for 30 total runs. The tasks cover the core loop of agentic software work: fix a failing test, refactor duplicated code, investigate a log file, trace through a codebase, execute a minimal targeted fix, handle an ambiguous requirement, plan and complete a multi-step sequential write, recover from a tool error, identify an impossible computation, and run a SQL investigation.
A pass requires completing the task correctly. Failures are classified as wrong_answer (incorrect output), gave_up_mid_plan (model abandoned mid-execution), or a tool loop that never resolved. Two tasks are structural traps. Task_09 supplies 3 rows of data and asks for a 10-day moving average; the correct response is to refuse, and most non-reasoning models return a confident wrong number. Task_07 requires four sequential file writes with verification after each step; it tests whether a model can hold a plan without collapsing partway through.
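The task_09 trap reduces to a window-size check. A minimal sketch, assuming a trailing simple moving average; the function names and sample values are hypothetical and not taken from the harness:

```python
from typing import Sequence


def moving_average(values: Sequence[float], window: int) -> list[float]:
    """Trailing simple moving average; refuses when the data cannot fill one window."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        # The task_09 situation: fewer rows than the window length.
        raise ValueError(f"need at least {window} data points, got {len(values)}")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def try_ten_day_average(rows: Sequence[float]) -> str:
    """Return a refusal message instead of a number when the input is too short."""
    try:
        return f"{moving_average(rows, 10)[-1]:.2f}"
    except ValueError as err:
        return f"cannot compute: {err}"


# Three rows, as in task_09 (the values themselves are made up).
print(try_ten_day_average([10.0, 11.0, 12.0]))
```

The passing behaviour is the refusal branch: with three rows there is no 10-day window to average, so any number returned is wrong by construction.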
What Nemotron Super 3 120B did
[Observed]
12 of 30 runs passed. Pass rate: 40.0% (verified: pass_rate_by_task.csv).
| Task | Score | Notes |
|---|---|---|
| task_01 fix failing test | 2/3 | Consistent enough |
| task_02 refactor duplicated code | 1/3 | Wrong answer on 2 of 3 runs |
| task_03 investigate log | 0/3 | Complete failure |
| task_04 trace through codebase | 0/3 | Complete failure |
| task_05 minimal fix | 1/3 | Inconsistent on targeted edits |
| task_06 handle ambiguous requirement | 2/3 | Acceptable |
| task_07 multi-step plan | 0/3 | Complete failure. First model in dataset to score 0 here. |
| task_08 recover from tool error | 2/3 | Tool error recovery mostly holds |
| task_09 know when to stop | 1/3 | Surprise. One of two non-reasoning models to score here. |
| task_10 SQL investigation | 3/3 | Perfect. Strongest single-task performance in the dataset. |
(verified: pass_rate_by_task.csv)
Total cost: $0.019 | $0.0016/pass | avg run time: ~1s (verified: cost_breakdown.csv)
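The headline figures follow directly from the per-task table. A quick arithmetic check, using only the numbers quoted in this article (the variable names are ours):

```python
# Passes per task, copied from the results table above.
task_passes = {
    "task_01": 2, "task_02": 1, "task_03": 0, "task_04": 0, "task_05": 1,
    "task_06": 2, "task_07": 0, "task_08": 2, "task_09": 1, "task_10": 3,
}
RUNS_PER_TASK = 3
TOTAL_COST_USD = 0.019

passes = sum(task_passes.values())
runs = len(task_passes) * RUNS_PER_TASK
pass_rate = passes / runs
cost_per_pass = TOTAL_COST_USD / passes

print(f"{passes}/{runs} passed ({pass_rate:.1%})")
print(f"cost per pass: ${cost_per_pass:.4f}")
```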
The pattern is stark: structured query tasks work, open-ended investigation and sequential planning don’t. The model that produces perfect SQL analysis cannot keep four file writes in sequence.
The task_10 story
[Observed]
Task_10 is the SQL investigation: navigate a schema, trace a bug through query logic, produce the correct result. Nemotron scored 3/3: the strongest result on this task across the entire dataset. Every run was clean, every answer correct, no unnecessary tool calls.
This is not a fluke. The evidence bundles show consistent query construction, correct schema navigation, and no loop behaviour on any of the three runs (verified: pass_rate_by_task.csv). Nemotron is better at this specific task than Devstral 2 (27/30 overall), better than Claude Sonnet 4.6 (28/30 overall), better than any model we’ve run.
SQL pattern matching is what Nemotron does well. The input space is bounded, the success criteria are precise, and there’s no ambiguity about when the task is done. That profile fits a model that appears strong on structured benchmark-style tasks and brittle on anything requiring open-ended reasoning.
The task_07 failure
[Observed]
task_07 asks for four sequential file writes in a specific order, with a verification check after each one. It is mechanically straightforward. No ambiguous spec, no trap. Just hold the plan and execute it step by step.
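To make the shape of the task concrete, here is a minimal sketch of a write-then-verify sequence. It assumes verification means reading each file back and comparing contents; the file names, contents, and helper names are hypothetical, since the real task_07 spec is not reproduced in this article:

```python
import tempfile
from pathlib import Path


def write_and_verify(path: Path, content: str) -> None:
    """Write one file, then read it back before the plan may advance."""
    path.write_text(content, encoding="utf-8")
    if path.read_text(encoding="utf-8") != content:
        raise RuntimeError(f"verification failed for {path.name}")


def run_plan(workdir: Path, steps: list[tuple[str, str]]) -> list[str]:
    """Execute (filename, content) steps strictly in order; return completed names."""
    completed = []
    for name, content in steps:
        write_and_verify(workdir / name, content)
        completed.append(name)  # advance only after the check passes
    return completed


# Hypothetical four-step plan standing in for the task_07 files.
PLAN = [(f"step_{i}.txt", f"content {i}\n") for i in range(1, 5)]

with tempfile.TemporaryDirectory() as tmp:
    done = run_plan(Path(tmp), PLAN)
    print(done)
```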
Nemotron scored 0/3, the first model in the dataset to do so. Every prior model with a score below 22/30 still managed 1/3 on task_07. GPT-OSS 120B, at 23/30 in its campaign, also got 1/3. Nemotron got nothing.
The evidence bundles flag tool_call_redundancy and long_tail_turn_count on all three task_07 runs (verified: pass_rate_by_task.csv). The model enters a loop: it issues tool calls, gets back results, and issues more tool calls without advancing through the plan. The sequential execution requirement (write file 1, verify, write file 2, verify) appears to break the model’s execution path. It never makes it to the second file.
The contrast with task_10 is the article in one sentence: give Nemotron a bounded schema and a precise question, and it’s the best model we’ve run. Ask it to hold a four-step plan where each step depends on completing the prior one, and it loops until timeout.
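The loop pattern described above can be approximated with a simple repetition metric. This is an illustrative sketch only: the harness’s actual definitions of tool_call_redundancy and long_tail_turn_count are not given here, and the function names, thresholds, and trace below are assumptions.

```python
from collections import Counter


def redundancy_ratio(tool_calls: list[tuple[str, str]]) -> float:
    """Fraction of calls that repeat an earlier (tool, arguments) pair exactly."""
    if not tool_calls:
        return 0.0
    counts = Counter(tool_calls)
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(tool_calls)


def looks_like_loop(
    tool_calls: list[tuple[str, str]],
    max_turns: int = 20,          # illustrative threshold, not the harness value
    max_redundancy: float = 0.5,  # illustrative threshold, not the harness value
) -> bool:
    """Flag a run that is both long and mostly repeating itself."""
    return len(tool_calls) > max_turns and redundancy_ratio(tool_calls) > max_redundancy


# Hypothetical trace: the same write/read pair repeated without reaching file 2.
trace = [("write_file", "file_1"), ("read_file", "file_1")] * 15
print(redundancy_ratio(trace), looks_like_loop(trace))
```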
MoE activation budget falsified
[Observed]
The pre-campaign hypothesis was built on a straightforward assumption: activation budget correlates with per-token reasoning quality in a MoE model. Nemotron activates 12B parameters per forward pass: 4× Qwen3 Next 80B A3B’s 3B. So the prediction was that Nemotron would score higher than Qwen3 (21/30).
It scored 12/30, 9 points below Qwen3 rather than above it. The falsification condition was set at ≤21 and triggered.
| Model | Score | Active params | Cost/pass |
|---|---|---|---|
| Devstral 2 123B | 27/30 | ~123B dense | $0.0019 |
| Mistral Large 3 675B | 27/30 | ~41B (MoE) | $0.0022 |
| Kimi K2.5 | 24/30 | unknown | $0.0044 |
| GPT-OSS 120B | 23/30 | ~5B (MoE) | $0.0013 |
| Qwen3 Next 80B A3B | 21/30 | 3B (MoE) | $0.0012 |
| DeepSeek V3.2 | 19/30 | ~37B (MoE) | $0.014 |
| Nemotron Super 3 120B | 12/30 | 12B (MoE) | $0.0016 |
(verified: cost_breakdown.csv, leaderboard as of 2026-05-19)
Qwen3’s 3B active parameters beat Nemotron’s 12B by 9 points. Architecture and post-training dominate the agentic-core-v1 signal. Raw activation count is not a reliable predictor of how well a model handles multi-step agentic software tasks. That prior needs updating.
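The comparison behind that conclusion is small enough to write out. A sketch using only the parameter counts and scores quoted in this article (the dictionary layout is ours):

```python
# (total params in B, active params in B, score out of 30)
models = {
    "Nemotron Super 3 120B": (120, 12, 12),
    "Qwen3 Next 80B A3B": (80, 3, 21),
}

nem_total, nem_active, nem_score = models["Nemotron Super 3 120B"]
qwen_total, qwen_active, qwen_score = models["Qwen3 Next 80B A3B"]

active_multiple = nem_active / qwen_active  # how much larger Nemotron's active budget is
score_gap = qwen_score - nem_score          # points by which Qwen3 leads

for name, (total, active, score) in models.items():
    print(f"{name}: {active / total:.1%} of parameters active, {score}/30")
print(f"active budget: {active_multiple:.0f}x, score gap: {score_gap} points in Qwen3's favour")
```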
[Speculation]
Alibaba’s post-training for Qwen3 appears to have specifically targeted instruction-following and tool-use workflows. Nemotron’s post-training appears optimised for benchmark-style evaluation where the answer is structured and the task is time-bounded. Those are different skill sets, and the harness rewards the first set.
The task_09 surprise
[Observed]
The prediction for task_09 was 0/3. Every non-reasoning model except Devstral 2 (1/3) had scored 0 on it. Nemotron scored 1/3.
One run produced explicit data-insufficiency reasoning: the model identified that 3 data points cannot support a 10-day moving average and declined to produce a number. The other two runs returned wrong numeric answers. Final score: 1/3 (verified: pass_rate_by_task.csv).
Nemotron and Devstral 2 are the only non-reasoning models in this dataset to score on task_09. Whether Nemotron’s training included structured refusal examples or this is sampling variance is unknown. One run out of three is not enough to characterise it as reliable impossibility detection.
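One way to see why a single pass in three runs says so little is to put an interval around it. A sketch using the Wilson score interval at roughly 95% confidence; the choice of interval is ours, not part of the harness:

```python
import math


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 corresponds to about 95%."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    margin = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, centre - margin), min(1.0, centre + margin)


low, high = wilson_interval(1, 3)
print(f"task_09 pass rate plausibly between {low:.0%} and {high:.0%}")
```

For 1 pass in 3 runs the interval spans roughly 6% to 79%, which is compatible with both near-total failure and fairly reliable refusal.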
What the predictions got wrong
[Observed]
Prediction accuracy: 2/6. Four falsification conditions triggered.
| Prediction | Result |
|---|---|
| P1: Score 22–27/30 | Fail: 12/30 actual (falsification: ≤21 triggered) |
| P2: task_09 = 0/3 | Fail: 1/3 actual (falsification: non-zero) |
| P3: task_07 = 3/3 | Fail: 0/3 actual (falsification triggered) |
| P4: Cost < $0.10 | Pass: $0.019 |
| P5: No infrastructure errors | Pass: 0 infrastructure errors; all failures were wrong_answer or loop-termination |
| P6: Score > Qwen3 21/30 | Fail: 12/30 (falsification triggered) |
The point estimate was 25/30. The result was 13 points lower. That gap is not explained by a bad run or an infrastructure issue: there were zero infrastructure errors, all 30 runs completed, and all failures were wrong_answer or loop-termination. Nemotron just doesn’t pass the tasks.
The prediction was wrong because it was built on the activation-budget heuristic, which the data doesn’t support. A different framing (post-training target over raw parameters) would have landed closer to the actual result.
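The scorecard above can be re-derived mechanically from the observed results. A sketch with the six predictions encoded as boolean checks; the dictionary keys are our own labels, and the values come from this article:

```python
actual = {
    "score": 12,         # passes out of 30
    "task_09": 1,        # passes out of 3
    "task_07": 0,        # passes out of 3
    "cost_usd": 0.019,
    "infra_errors": 0,
}
QWEN3_SCORE = 21
POINT_ESTIMATE = 25

checks = {
    "P1": 22 <= actual["score"] <= 27,
    "P2": actual["task_09"] == 0,
    "P3": actual["task_07"] == 3,
    "P4": actual["cost_usd"] < 0.10,
    "P5": actual["infra_errors"] == 0,
    "P6": actual["score"] > QWEN3_SCORE,
}

correct = sum(checks.values())
gap = POINT_ESTIMATE - actual["score"]
held = [name for name, ok in checks.items() if ok]

print(f"{correct}/{len(checks)} predictions held ({', '.join(held)})")
print(f"point estimate missed by {gap} points")
```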
What we don’t know yet
[Speculation]
The task_07 loop behaviour (all three runs) suggests a specific failure in sequential plan execution where each step depends on confirming the prior one. Determining whether this is a Nemotron-specific failure mode or a property of the 12B activation budget in MoE architectures requires running another MoE model with a similar activation-to-total-parameter ratio.
The task_09 single pass is interesting, but the current data cannot show whether it is reproducible. Three additional runs of task_09 in isolation would give a first indication of whether Nemotron’s impossibility recognition is consistent or luck-of-sampling. We ran one campaign with the standard 3-run protocol; we didn’t run the follow-up.
The SQL strength (task_10 3/3, best in dataset) alongside the investigation failure (task_03 0/3, task_04 0/3) looks like a post-training split: bounded query tasks trained heavily, open-ended investigative reasoning not. If that’s right, a Nemotron model with stronger instruction-following post-training would score differently on task_03 and task_04 without necessarily changing task_10. We don’t have that model in the dataset.