What OpenAI kept for itself

Campaign: 2026-05-17-openai-gpt-oss-120b-agentic-core-v1
Model: OpenAI GPT-OSS 120B (openai.gpt-oss-120b-1:0, AWS Bedrock us-east-1)
Harness: openclaw@2026.4.22
Runs: 30 (10 tasks × 3 runs each)
Campaign date: 2026-05-17


When OpenAI put GPT-OSS 120B on AWS Bedrock, the obvious research question was: how much does the open release cost you?

It’s a fair question because GPT-5.5 had just run on this harness (27/30, $0.070 per passing run) and GPT-OSS 120B carries the same OpenAI lineage. Same Bedrock Converse tool-use plumbing, no API key required, same infrastructure path. The model is a 120B-class mixture-of-experts transformer with open weights, available on demand in Bedrock’s us-east-1 region. If you believed OpenAI’s open release was close to their closed product, 25–27/30 was a reasonable expectation. That’s what Rigg predicted going in (P6: “≥25/30, within 2 points of GPT-5.5”).

The actual score: 23/30. Four points below GPT-5.5. Second from the bottom in the agentic-core-v1 leaderboard, above Llama 3.3 70B’s 20/30.

The cost: $0.03 total, $0.0013 per passing run (verified: data pack run_summary.cost_per_pass_usd). Cheapest we’ve run, below Mistral Large 3’s previous floor of $0.0022.

Both numbers are real. The cost floor is genuine. The quality penalty is also genuine. And the two don’t cancel out in a way that makes GPT-OSS 120B obviously useful. Whether this is a good deal depends entirely on whether your workload ever touches multi-step planning, which is where the model’s performance collapsed.


What agentic-core-v1 tests

[Observed — harness spec]

The benchmark runs each model on 10 tasks, 3 times each, for 30 total runs. Tasks cover the kinds of work a real software agent does: fix a failing test, refactor duplicated code, trace through a codebase, handle an ambiguous requirement, recover from a tool error. Two tasks are explicit traps. task_09 (know when to stop) expects the model to recognise that a task is impossible and say so. Every non-reasoning model so far has walked into this and produced wrong answers instead. task_07 is what this article is really about.

A pass means the model did what was asked, completely, without inventing output that wasn’t requested. A fail is a wrong answer, a partial completion, a refused task, or a model that called the same tool in a loop without acting on the result.


Situation

[Observed — campaign brief]

GPT-OSS 120B is OpenAI’s first open-weights model on Bedrock. Before this campaign, the leaderboard had a clear quality ceiling (Claude Sonnet 4.6, 28/30) and a clear cost floor (Mistral Large 3, 27/30 at $0.0022/pass). The middle was occupied by GPT-5.5 and DeepSeek V4-Flash, both scoring 25–27/30 but costing between $0.05 and $0.08 per pass. Llama 3.3 70B was the next cheapest at $0.0045 per pass, but at 20/30 it sat below the threshold where the harder tasks in this harness are reliably solvable.

GPT-OSS 120B entered this with two things going for it: OpenAI provenance (which should mean robust instruction-following) and Bedrock pricing ($0.15 input / $0.60 output per million tokens, low by frontier standards). The prediction was that the OSS quality penalty would be small: maybe 1–2 points off GPT-5.5. That was wrong.
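
To make those list prices concrete, here is a minimal sketch of how a per-run cost follows from token counts. The two rates are the Bedrock figures quoted above. The token counts in the example are made-up illustration values, not measurements from this campaign, and run_cost_usd is a hypothetical helper rather than anything in the harness.

```python
# Bedrock list prices quoted in the article, in USD per million tokens.
INPUT_USD_PER_MTOK = 0.15
OUTPUT_USD_PER_MTOK = 0.60


def run_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one run at the quoted rates (hypothetical helper)."""
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000


# Hypothetical run: 5,000 input tokens and 500 output tokens.
example_cost = run_cost_usd(5_000, 500)
print(f"${example_cost:.5f}")
```

At these rates a run has to consume a lot of tokens before it costs even a tenth of a cent, which is consistent with the campaign total staying in the low cents.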


Results

[Observed — data pack per_task_results]

| Task | Score | Note |
| --- | --- | --- |
| task_01 fix_failing_test | 2/3 | 1 wrong_answer (model returned an incorrect result); minor miss on standard debugging |
| task_02 refactor_duplicated_code | 3/3 | Clean |
| task_03 investigate_log | 3/3 | Clean |
| task_04 trace_through_codebase | 3/3 | Clean |
| task_05 minimal_fix | 3/3 | Clean |
| task_06 handle_ambiguous_req | 2/3 | 1 tool-loop failure (see below) |
| task_07 multi_step_plan | 1/3 | Planning coherence gap (see below) |
| task_08 recover_from_tool_error | 3/3 | Clean |
| task_09 know_when_to_stop | 0/3 | Expected trap, consistent with all prior non-reasoning models |
| task_10 sql_investigation | 3/3 | Clean |

Total: 23/30 (76.67%). Failure modes: wrong_answer ×6, gave_up_mid_plan (model halted before completing the task) ×1.
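
The headline figures can be re-derived from the per-task scores. The sketch below copies the pass counts from the results table above and recomputes the total, the pass rate, and the failure count. The variable names are illustrative, not fields from the data pack.

```python
# Per-task passes out of 3 runs, copied from the results table above.
passes = {
    "task_01": 2, "task_02": 3, "task_03": 3, "task_04": 3, "task_05": 3,
    "task_06": 2, "task_07": 1, "task_08": 3, "task_09": 0, "task_10": 3,
}
RUNS_PER_TASK = 3

total_passes = sum(passes.values())
total_runs = RUNS_PER_TASK * len(passes)
pass_rate = total_passes / total_runs
failures = total_runs - total_passes

print(f"{total_passes}/{total_runs} = {pass_rate:.2%}, {failures} failures")
# 23/30 = 76.67%, 7 failures
```

The seven failures match the reported failure modes: six wrong_answer plus one gave_up_mid_plan.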

The model is competent on read/analyze/write tasks. Eight of ten tasks scored 2/3 or better, and six of those were a clean 3/3. The two misses outside of task_07 (task_01 and task_06) are single-run failures that probably sit within run variance. task_09 is the known trap, and every model without explicit reasoning capabilities fails it the same way.

task_07 is different.


Why does multi-step planning break down?

[Observed — data pack evidence_patterns, brief §task_07]

task_07 asks the model to write four files (steps/step1.txt through steps/step4.txt) with specified content, using fs_write only, in order. It’s not a coding task. It doesn’t require understanding a codebase or debugging logic. It requires the model to execute a sequential plan and not stop early.

GPT-5.5: 3/3. GPT-OSS 120B: 1/3.

Two failed runs: one gave_up_mid_plan (execution halted after 2 of 4 steps, did not attempt the remainder) and one wrong_answer (files written with incorrect content). The model knew what the task was. It had the tool. It stopped anyway.
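
To show how those two failure labels differ, here is a rough sketch of a grader for this kind of task. It is an assumption about how such a check could work, not the harness’s actual grader: the expected file contents are invented placeholders, grade_plan is a hypothetical function, and the sketch does not verify write order.

```python
from pathlib import Path
import tempfile

# Hypothetical expected contents; the real task specifies its own.
EXPECTED = {f"steps/step{i}.txt": f"step {i} done\n" for i in range(1, 5)}


def grade_plan(workdir: Path) -> str:
    """Classify a run as pass, gave_up_mid_plan, or wrong_answer."""
    missing = [p for p in EXPECTED if not (workdir / p).exists()]
    if missing:
        # Some steps were never written: the plan was abandoned.
        return "gave_up_mid_plan"
    for rel_path, content in EXPECTED.items():
        if (workdir / rel_path).read_text() != content:
            # Every file exists, but at least one has the wrong content.
            return "wrong_answer"
    return "pass"


# Simulate a run that stops after two of four writes.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "steps").mkdir()
    for rel_path in list(EXPECTED)[:2]:
        (root / rel_path).write_text(EXPECTED[rel_path])
    halted_result = grade_plan(root)

print(halted_result)  # gave_up_mid_plan
```

The distinction matters when reading the failure counts: a halted plan and a completed plan with wrong content are different problems, even though both score zero.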

This is what proprietary fine-tuning buys you. A 120B parameter model from OpenAI, with confirmed tool use and clean performance on trace and refactor tasks, falls apart when asked to maintain state across four sequential write operations. GPT-5.5, presumably trained with more instruction-following signal on exactly this class of sequential planning, handles it cleanly every time.

The implication for builders: if your agentic workflow involves sustained multi-step planning where each step depends on the prior completing successfully, GPT-OSS 120B fails that specific pattern at a 67% rate (verified: data pack per_task_results[task_07_multi_step_plan]). That’s not acceptable for production agentic use. On tasks that don’t require that kind of planning coherence (code tracing, log investigation, SQL queries) the model performs well.


What is the tool-loop pattern?

[Observed — data pack evidence_patterns.tool_call_redundancy]

The task_06 failure surfaced a second distinct behaviour worth naming. One failed run showed the model calling fs_read({'path': 'src/counter.py'}) four consecutive times without acting on the results. It read the same file, four times, before the run failed. The model doesn’t appear to register prior tool results as state updates before re-calling.

This appeared in 3 of 30 runs (verified: evidence_patterns.tool_call_redundancy.affected_runs). In two cases it was benign (task_03, passed regardless). In one it was the direct cause of a task_06 failure.
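
A pattern like this is straightforward to flag in a tool-call trace. The sketch below is one possible detector, not the harness’s own logic: longest_repeat and REPEAT_LIMIT are hypothetical names, and the trace format (a list of tool name and argument pairs) is an assumption.

```python
import json


def longest_repeat(calls: list[tuple[str, dict]]) -> int:
    """Length of the longest streak of identical consecutive tool calls."""
    longest = streak = 0
    previous = None
    for name, args in calls:
        # Serialise arguments with sorted keys so equal dicts compare equal.
        key = (name, json.dumps(args, sort_keys=True))
        streak = streak + 1 if key == previous else 1
        longest = max(longest, streak)
        previous = key
    return longest


# The pattern from the failed task_06 run: the same read, four times in a row.
trace = [("fs_read", {"path": "src/counter.py"})] * 4
REPEAT_LIMIT = 3  # hypothetical threshold, not a harness setting
looping = longest_repeat(trace) >= REPEAT_LIMIT
print(longest_repeat(trace), looping)  # 4 True
```

A check like this only flags redundancy. As the task_03 runs show, a redundant read does not have to sink the run.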

This is different from the planning problem. The planning failure is about sequential state: can the model finish a plan it started? The tool-loop is about immediate state: does the model know what it just read? They may share a root cause (insufficient instruction-following on “process tool results before calling again”) but they surface in different conditions. Planning tasks fail because the model stops early. Ambiguous tasks fail because the model loops instead of acting.


Cost

[Observed — data pack run_summary]

$0.03 total. $0.0013 per passing run.

| Model | Score | $/pass |
| --- | --- | --- |
| Claude Sonnet 4.6 | 28/30 | $0.051 |
| GPT-5.5 | 27/30 | $0.070 |
| Mistral Large 3 | 27/30 | $0.0022 |
| DeepSeek V4-Flash | 25/30 | $0.076 |
| GPT-OSS 120B | 23/30 | $0.0013 |
| Llama 3.3 70B | 20/30 | $0.0045 |

GPT-OSS 120B is the cheapest model in this dataset. At $0.0013/pass, it costs 41% less per passing run than Mistral Large 3, the previous cost floor. That’s real.
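
The 41% figure can be checked directly from the leaderboard numbers. This is a small sketch using the values in the table above; the dictionary layout is illustrative.

```python
# (passes out of 30, USD per passing run), copied from the table above.
leaderboard = {
    "Claude Sonnet 4.6": (28, 0.051),
    "GPT-5.5": (27, 0.070),
    "Mistral Large 3": (27, 0.0022),
    "DeepSeek V4-Flash": (25, 0.076),
    "GPT-OSS 120B": (23, 0.0013),
    "Llama 3.3 70B": (20, 0.0045),
}

cheapest = min(leaderboard, key=lambda m: leaderboard[m][1])
oss = leaderboard["GPT-OSS 120B"][1]
mistral = leaderboard["Mistral Large 3"][1]
saving = (mistral - oss) / mistral  # relative saving per passing run

print(cheapest, f"{saving:.0%}")  # GPT-OSS 120B 41%
```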

The cost story doesn’t help you if your workload hits planning tasks. At 23/30 with task_07 at 1/3, you’re paying the lowest price per pass on the leaderboard for a model that fails two of three runs on the planning task. Mistral Large 3 at $0.0022/pass and 27/30 is the better option for agentic workloads. But for workflows that stay in the read/analyze/write zone and never require multi-step planning (code search, log analysis, SQL investigation), GPT-OSS 120B’s cost advantage is real and the quality trade-off doesn’t bite.


What we were wrong about

[Observed — data pack predictions_scoring]

Rigg predicted 26/30. Actual: 23/30. Predictions score: 2/6.

The misses clustered in one direction: the OSS quality gap was steeper than expected, and it was concentrated in task_07. P1 (pass rate 24–27/30) missed. P3 (task_07: 3/3) missed badly: the model scored 1/3 on a task where GPT-5.5 is reliable. P6 (≥25/30, within 2 of GPT-5.5) also missed.

The calibration lesson is specific: multi-step planning tasks are where OSS models most visibly diverge from their proprietary counterparts. The parameter count doesn’t compensate for the fine-tuning delta. Going forward, planning tasks get a steeper OSS penalty in predictions for any open-weights model without documented instruction-following RLHF on sequential execution.


Rerun confirmation: 2026-05-24

[Observed — campaign 2026-05-24-openai-gpt-oss-120b-agentic-core-v1]

A second independent run of GPT-OSS 120B on agentic-core-v1 completed 2026-05-24. Score: 23/30 (76.67%), identical to the original. Failure modes: wrong_answer ×6, gave_up_mid_plan ×1, the same counts as run 1.

Cost this run: $0.02 total, $0.00087/pass (vs $0.0013/pass in run 1). The difference reflects lighter token use on some tasks in run 2. Both costs confirm GPT-OSS 120B sits well below the previous cost floor (Mistral Large 3 at $0.0022/pass).

The per-task distribution shifted between runs, which is worth documenting:

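| Task | Run 1 (2026-05-17) | Run 2 (2026-05-24) | Change |
| --- | --- | --- | --- |
| task_01 fix_failing_test | 2/3 | 3/3 | +1 |
| task_05 minimal_fix | 3/3 | 1/3 | -2 |
| task_07 multi_step_plan | 1/3 | 1/3 | 0 |
| task_09 know_when_to_stop | 0/3 | 1/3 | +1 |
| All others | same | same | 0 |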

The headline result is stable. The per-task noise is real: task_01 and task_09 improved by one pass each; task_05 regressed by two. task_07 (the planning coherence failure this article is about) held at 1/3 across both runs. That’s the number that matters for builders considering this model for agentic work.
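
The identical headline score is the per-task shifts cancelling out, which a few lines of arithmetic make explicit. The sketch below uses only the numbers reported for the two runs; the names are illustrative.

```python
# (run 1 passes, run 2 passes) for the tasks listed in the comparison.
# All other tasks were reported as unchanged.
shifted = {
    "task_01": (2, 3),
    "task_05": (3, 1),
    "task_07": (1, 1),
    "task_09": (0, 1),
}

deltas = {task: run2 - run1 for task, (run1, run2) in shifted.items()}
net_change = sum(deltas.values())
run1_total = 23
run2_total = run1_total + net_change

print(deltas, run2_total)
# {'task_01': 1, 'task_05': -2, 'task_07': 0, 'task_09': 1} 23
```

A stable total can therefore hide a fair amount of per-task movement, which is why the per-task view is worth reporting alongside the headline.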

The task_09 shift from 0/3 to 1/3 is a single pass, not a pattern change. The original article stated every non-reasoning model fails task_09 entirely; that remains the general picture. One pass in run 2 suggests the model occasionally produces output that satisfies the grader, not that it has a systematic capability. One pass across 6 total runs (1/6) is noise.

Size inversion confirmed across two independent runs. GPT-OSS 120B scored 23/30 in both the May 17 and May 24 campaigns. GPT-OSS 20B scored 25/30 in its May 20 campaign. The 20B model outscores its roughly 6x-larger sibling by 2 points, at about 1.8x lower cost per pass ($0.000481 vs $0.00087). The inversion is not a single-run artefact.


What we don’t know yet

[Speculation]

Whether task_07’s failure rate is stable across harness versions. This is a 3-run sample per task. A 1/3 result on task_07 could reflect genuine 33% reliability, or it could be a 50–60% reliability model that had two bad runs. We would need at least 10 runs on task_07 alone to distinguish between those. We haven’t done that.
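
A quick binomial calculation shows why three runs cannot separate those two readings. The sketch assumes runs are independent with a fixed pass probability, which is a simplification, and uses 0.55 as a stand-in for the 50–60% range.

```python
from math import comb


def prob_at_most(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Chance of seeing 1 or fewer passes in 3 runs under two hypotheses.
low = prob_at_most(1, 3, 1 / 3)   # a genuinely 33%-reliable model
high = prob_at_most(1, 3, 0.55)   # a 55%-reliable model on a bad day
print(f"{low:.2f} {high:.2f}")  # 0.74 0.43
```

Under these assumptions a result of 1/3 or worse shows up about 74% of the time for the weaker model and about 43% of the time for the stronger one, so a single 3-run sample is weak evidence either way.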

We also don’t know how much of the task_01 miss (2/3) is model-level signal vs. run variance. A single wrong_answer on a standard debugging task, from a model that otherwise handles code tasks cleanly, is ambiguous. It could be a systematic gap in GPT-OSS 120B’s test-fixing loop, or it could be noise. The current dataset doesn’t resolve it.

What the data does confirm: planning coherence is GPT-OSS 120B’s specific weakness, and it’s the weakness that matters most for agentic deployments. The cost advantage is real. Whether that combination is useful depends on your task profile.
