The smaller model wins

May 21, 2026 · campaign-reports

Campaign: 2026-05-20-openai-gpt-oss-20b-agentic-core-v1
Model: OpenAI GPT-OSS 20B (openai.gpt-oss-20b-1:0, AWS Bedrock us-east-1)
Harness: openclaw@2026.4.22
Runs: 30 (10 tasks × 3 runs each)
Campaign date: 2026-05-20

The GPT-OSS 120B campaign left a clean prediction on the table: if the 20B follows the same scaling curve, expect worse. Smaller model, lower parameter count, more degradation on the harder tasks. Rigg predicted 18 to 22/30 with a point estimate of 20/30. He introduced three adversarial predictions targeting tool-call failures and planning collapse at smaller scale.

All three predictions were wrong.

GPT-OSS 20B scored 25/30 (83.3%). That is 2 points above the 120B’s 23/30, at $0.000481 per passing run against the 120B’s $0.0013 (verified: data pack run_summary.cost_per_pass_usd). The 20B costs less, runs faster, and scores higher on this harness. For builders asking whether to run the smaller model: the numbers answer that directly.

The full picture is more complicated. The 20B has a specific regression the 120B does not: log-file investigation. GPT-OSS 120B scored 3/3 on task_03 (investigate_log). The 20B scored 1/3. If large-context document analysis is your workload, that regression matters and the overall score inversion does not cancel it.

What agentic-core-v1 tests

[Observed — harness spec]

The benchmark runs each model on 10 tasks, 3 times each, for 30 total runs. The tasks cover the work a real software agent does: fix a failing test, refactor duplicated code, trace a codebase, investigate a log file, handle ambiguous requirements, recover from a tool error, write a multi-step plan, run a SQL investigation.

Two tasks are structural traps. task_09 asks the model to compute a 10-day moving average from a CSV that has fewer than 10 rows. The correct behavior is to detect the data quality problem and refuse to compute. All but one non-reasoning model across prior campaigns has scored 0/3 on this task; DeepSeek V4-Flash is the single exception at 1/3. task_07 asks the model to write four files in sequence, each with specified content, in order. It is not a coding problem. It requires executing a plan without stopping early.

A pass requires the model to complete what was asked, completely, without wrong output or early termination.

The campaign context

[Observed — campaign brief]

After the 120B campaign, the leaderboard had Claude Sonnet 4.6 at the quality ceiling (28/30, $0.051/pass) and GPT-OSS 120B at the cost floor ($0.0013/pass, 23/30). The 120B’s specific weakness was task_07: wrong_answer on both failed runs: the model wrote files but with incorrect content (verified: intel.db task_outcomes.failure_mode). GPT-5.5 scored 3/3 on that same task, suggesting the failure was a fine-tuning artifact rather than an architectural limit.

The research question for the 20B: does 6x smaller mean 6x worse, or something more nuanced? Rigg’s prediction was “something worse.” The point estimate of 20/30 assumed the task_07 failure would persist or worsen, task_09 would stay at 0/3, and smaller parameter count would surface additional failure modes.

None of those assumptions held.

Per-task results

[Observed — data pack per_task_results]

Task	Score	Delta vs 120B	Note
task_01 fix_failing_test	3/3	+1	120B scored 2/3
task_02 refactor_duplicated_code	3/3	0	Clean
task_03 investigate_log	1/3	-2	120B scored 3/3; see below
task_04 trace_through_codebase	2/3	-1	1 wrong_answer; likely run variance
task_05 minimal_fix	3/3	0	Clean
task_06 handle_ambiguous_req	3/3	+1	120B scored 2/3
task_07 multi_step_plan	3/3	+2	120B scored 1/3; see below
task_08 recover_from_tool_error	3/3	0	Clean
task_09 know_when_to_stop	1/3	+1	Non-reasoning pass (DeepSeek V4-Flash also 1/3 in prior campaign); see below
task_10 sql_investigation	3/3	0	Clean

Total: 25/30 (83.3%). Failure modes: wrong_answer ×5 (task_03: ×2, task_04: ×1, task_09: ×2).

The 20B outperforms the 120B on 6 of 10 tasks and draws on 3 more. The only clear regression is task_03. Zero tool-call malformed failures across 30 runs. Rigg’s adversarial prediction P2 (that the 20B would produce tool_call_malformed errors at smaller scale) did not materialise in a single run.

Why task_07 reversed

[Observed — data pack evidence_patterns, brief §task_07]

This is the result that demanded explanation before the campaign could be written up.

GPT-OSS 120B scored 1/3 on task_07. Both failed runs show wrong_answer: the model executed the full 4-step sequence but wrote files with incorrect content (verified: intel.db task_outcomes.failure_mode). The task structure was followed; the output was wrong.

GPT-OSS 20B scored 3/3 on task_07. All three runs executed the 4-step fs_write sequence in full, with correct content, in order. The task showed no variance across runs (verified: data pack per_task_results[task_07_multi_step_plan]). The model treated step execution as a mechanical sequence rather than a planning problem, and it did not deviate.

The most plausible explanation is fine-tuning data distribution. The 120B’s wrong_answer pattern on task_07 looks like a training artifact: the model learned sequential task structure but produced incorrect content, possibly because the fine-tuning signal for exact-match structured writes was weaker than for code or prose tasks. The 20B appears to have been trained with stronger structured-output signal, suppressing the pattern. At 20B parameters with different fine-tuning emphasis, the failure mode is absent.

[Speculation] Whether the 20B was trained on different data or with different RLHF signal than the 120B is not documented. The step-completion explanation is consistent with the data, but it is a hypothesis. OpenAI has not published the training details.

For builders: the 120B’s task_07 failure at 67% was the strongest argument against using it for agentic workflows that require sequential structured execution. The 20B removes that argument.

The task_09 anomaly

[Observed — data pack evidence_index, task_09_run3]

task_09 has historically been hard for non-reasoning models. Most prior non-reasoning campaigns scored 0/3; DeepSeek V4-Flash is the exception at 1/3. The task is a data quality trap: the correct answer is to refuse computation because the dataset is insufficient.

GPT-OSS 20B run3 produced this output:

“Not enough data to compute a 10-day moving average. The dataset contains fewer than 10 entries.”

That is a pass. DeepSeek V4-Flash produced a 1/3 result on the same task in its own campaign. The 20B matches that result.

Run1 and run2 both failed. Run1 produced partial NaN values in answer.txt. Run2 did not write the output file. The score is 1/3, not 3/3, and that matters. This is not a model that reliably detects data quality problems.

The run3 passing run produced 1,292 output tokens. The two failing runs produced approximately 700 and 190 tokens respectively (verified: data pack run_metrics.output_tokens_by_run). The model appears to have reasoned explicitly about data quality before refusing computation in the passing run. The shorter runs went to computation without that step.

[Speculation] The token-count pattern is suggestive but not conclusive. It could mean the model only activates quality-check reasoning when it generates enough tokens to reach it. It could also be noise in a 3-run sample. What the current data shows: the model has this capability somewhere and did not reliably deploy it. That is a different failure mode from most prior non-reasoning models, which showed no capability on this task at all. DeepSeek V4-Flash is the only other non-reasoning model to have passed it, also at 1/3.

The task_03 regression

[Observed — data pack per_task_results, evidence_index task_03_run1, task_03_run3]

GPT-OSS 120B scored 3/3 on task_03 (investigate_log). GPT-OSS 20B scored 1/3.

The task gives the model a large access.log file and asks for the root cause of a burst of HTTP 500 errors. The passing run (run1) consumed 24,041 input tokens: the model read the full log and correctly identified “database pool exhaustion from too many concurrent POST /api/orders requests” (verified: data pack evidence_index.task_03_run1). Run3 consumed 5,202 input tokens and returned in 1.8 seconds with an incorrect finding.

1.8 seconds is not long enough to read and analyze a substantial log file. The model pattern-matched “log analysis task” and wrote a plausible-sounding finding without reading the entries that contained the actual signal. Two of three runs did this.

This is a context-pressure failure pattern: under capacity constraints, the 20B satisfies the task contract (write something to finding.txt) without completing the analysis. The 120B, with more capacity to sustain attention over large inputs, does not take this shortcut.

[Unobserved] We did not run the 20B on a truncated version of the log file to see whether shorter context changes the failure rate. Whether this is strictly a context-length problem or also reflects something in how the 20B handles document-analysis tasks under time pressure is not established.

The practical implication is clear regardless: for tasks requiring sustained analysis of large input documents, the 20B is less reliable than the 120B on this harness. The overall score inversion (25 vs 23) does not change that.

Cost

[Observed — data pack run_summary]

$0.012015 total for 30 runs. $0.000481 per passing run.

Model	Score	$/pass	Pass rate
Claude Sonnet 4.6	28/30	$0.051	93.3%
GPT-5.5	27/30	$0.070	90.0%
Mistral Large 3	27/30	$0.0022	90.0%
DeepSeek V4-Flash	28/30	$0.0015	93.3%
GPT-OSS 20B	25/30	$0.000481	83.3%
GPT-OSS 120B	23/30	$0.0013	76.7%
Llama 3.3 70B	20/30	$0.0045	66.7%

GPT-OSS 20B costs 2.7x less than its 120B sibling at higher quality ($0.000481 vs $0.0013/pass). DeepSeek V4-Flash sits three points higher at 28/30 but costs $0.0015/pass, still 3x more than the 20B. The three models above the 20B in score (Claude Sonnet 4.6, GPT-5.5, Mistral Large 3) cost between 4.6x and 146x more per passing run (verified: data pack run_summary.cost_per_pass_usd).

For workloads where 80%+ pass rate is acceptable and tasks stay away from large-document analysis, GPT-OSS 20B is the strongest cost argument in this dataset by a wide margin. The model ships on Bedrock at $0.07/$0.30 per million input/output tokens, same infrastructure path as the 120B, same Converse API tool-use semantics. Moving from 120B to 20B is a model ID swap and a quality improvement.

What we were wrong about

[Observed — data pack predictions_scoring]

Rigg predicted 20/30 (point estimate), headline range 18 to 22/30. Actual: 25/30.

Prediction	Outcome	Hit?
Headline 18 to 22/30, PE 20/30	25/30	No (+5 above range)
P1: task_09 ≤ 0/3	1/3 (matches DeepSeek V4-Flash; second non-reasoning model to score here)	No
P2: task_07 ≤ 1/3, tool_call_malformed failures	3/3, zero tool failures	No
P3: total < 20/30	25/30	No

Zero predictions hit.

The calibration problem: Rigg anchored on the 120B’s performance and projected degradation from scale reduction. That model breaks when fine-tuning differences outweigh parameter count differences. The 20B was trained with stronger step-completion and instruction-following signal than the 120B, and that signal matters more than the 6x parameter gap for the tasks in this harness.

Going forward: treat sibling models in the same family as independently trained. Do not linearly project from a larger model to a smaller one without evidence that the training data and objectives are comparable.

What we don’t know yet

[Speculation]

The task_07 reversal has a plausible explanation in fine-tuning data distribution, but no confirmed mechanism. OpenAI has not published what changed between 120B and 20B training. The hypothesis fits all three runs. It is still a hypothesis.

The task_09 pass rate is unresolved. One pass in three runs is not reliable calibration. Whether the 20B can be prompted or instruction-tuned to make that behavior consistent is an open question. The current result is: it can do this sometimes. Not reliably.

Task_04 (trace_through_codebase) scored 2/3 with one wrong_answer. A single wrong answer on a codebase-tracing task from a model that otherwise handles code cleanly is probably run variance. It could also be a systematic gap in how the 20B handles deep call-chain analysis at context depth. Three runs per task is not enough data to separate noise from signal on individual task failures.