OpenAI's flagship didn't move. The leaderboard did.
Campaign: 2026-06-16-gpt-5.5-flagship-agentic-core-v1
Model: OpenAI GPT-5.5 (gpt-5.5, OpenAI direct API — $5.00/$30.00 per 1M input/output tokens)
Harness: agentic-core-v1
Runs: 30 (10 tasks × 3 runs each)
Campaign date: 2026-06-17 (21:55–22:14Z, 19 minutes)
In May we ran GPT-5.5 Instant (the cheaper, faster variant) and it scored 27/30. This month we ran the flagship. Also 27/30.
The flagship costs more per token ($5.00 vs $1.50 input), has a larger context window, and is positioned as the premium option for demanding workloads. On agentic-core-v1 (the harness described below), it posted the exact same pass rate as its cheaper sibling. Same score. Same failure mode. A month apart.
That would be an interesting footnote if the leaderboard had stayed still. It hasn’t. Since the May run, DeepSeek V4-Pro with thinking mode joined the harness at 30/30 for $0.12 total: three more passes at roughly one-thirteenth of the cost ($1.53 vs $0.12). Mistral Small 4 and DeepSeek V4-Flash sit above GPT-5.5 at $0.03 and $0.04 respectively. The question this re-evaluation is really answering isn’t whether GPT-5.5 got better. It’s whether that matters given what’s now available.
What agentic-core-v1 tests
[Observed]
The harness runs 10 tasks, each executed 3 times, for 30 total runs. Every task has a deterministic checker: a program that reads the model’s output file and returns pass or fail against acceptance criteria. No partial credit.
The tasks cover the kind of work that shows up on an actual engineering queue: fix a failing test, refactor duplicated code, investigate a production log, trace through a codebase, write a minimal fix under a constraint, handle an ambiguous requirement, sequence a multi-step plan, recover from a broken tool, recognise when a problem is unsolvable as stated, and run a SQL investigation.
Nine of ten tasks have clear right answers. task_09 asks the model to compute a 10-day moving average from a CSV file containing only three rows. The correct response is to recognise that three data points cannot produce a 10-day average and produce a qualified output or explicit refusal. Every model in our dataset has struggled with this task. Most still do.
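A deterministic checker of this kind can be very small. The sketch below is a hypothetical checker for task_09, not the harness's actual code: the function names, the marker phrases, and the acceptance rule (reject a bare number, accept text that states the data is insufficient) are all assumptions made for illustration.

```python
import re
from pathlib import Path

# Hypothetical phrases that count as acknowledging insufficient data.
# The harness's real acceptance criteria are not published in this article.
INSUFFICIENT_MARKERS = ("insufficient", "not enough", "cannot compute", "fewer than")

# Matches output that is nothing but a single number, e.g. "150" or "150.0".
BARE_NUMBER = re.compile(r"[-+]?\d+(\.\d+)?")


def check_text(text: str) -> bool:
    """Return True for pass, False for fail. No partial credit."""
    cleaned = text.strip().lower()
    if not cleaned:
        return False  # empty output is a fail
    if BARE_NUMBER.fullmatch(cleaned):
        return False  # a lone scalar carries no qualification
    return any(marker in cleaned for marker in INSUFFICIENT_MARKERS)


def check_task_09(answer_path: str) -> bool:
    """Read the model's output file and apply the same pass/fail rule."""
    path = Path(answer_path)
    if not path.is_file():
        return False  # no output file was produced
    return check_text(path.read_text(encoding="utf-8"))
```

Under these assumed criteria, `check_text("150")` fails and `check_text("Insufficient data: 3 rows, need 10.")` passes. A keyword match is a deliberately crude stand-in; a production checker would likely need stricter rules.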
What we predicted
Before the run, Rigg committed three predictions (verified: campaigns/gpt-5.5-flagship-agentic-core-v1.predictions.md):
- task_09 still fails 0/3 — same scalar error as May.
- Score improves to 28+ — OpenAI had a month to address the specific failure mode from the Instant campaign.
- Total cost under $4 — large-context task_03 was cheaper in recent campaigns for similar models.
Results: P1 correct, P2 wrong, P3 correct.
We were wrong about P2. The model OpenAI shipped for the flagship tier in June 2026 does not handle task_09 differently than GPT-5.5 Instant did in May. Six attempts across two independent campaigns. Zero passes.
What GPT-5.5 flagship did
[Observed]
27 of 30 runs passed. Pass rate: 90.0% (verified: pass_rate_by_task.csv). Nine of ten task types were clean at 3/3. task_09 was 0/3. All three task_09 failures were wrong_answer, meaning the checker received output and rejected it, not that the model failed to produce anything (verified: failure_mode_histogram.csv).
| Task | Result | Avg tool calls | Avg latency | Avg cost |
|---|---|---|---|---|
| task_01 fix failing test | 3/3 | 5.0 | 17.6s | $0.016 |
| task_02 refactor duplicated code | 3/3 | 5.0 | 65.9s | $0.026 |
| task_03 investigate log | 3/3 | 5.3 | 82.3s | $0.349 |
| task_04 trace through codebase | 3/3 | 6.0 | 22.0s | $0.017 |
| task_05 minimal fix | 3/3 | 5.0 | 49.9s | $0.017 |
| task_06 handle ambiguous requirement | 3/3 | 6.3 | 45.9s | $0.035 |
| task_07 multi-step plan | 3/3 | 4.0 | 3.4s | $0.007 |
| task_08 recover from tool error | 3/3 | 2.0 | 9.2s | $0.009 |
| task_09 know when to stop | 0/3 | 2.3 | 107.1s | $0.022 |
| task_10 SQL investigation | 3/3 | 3.0 | 32.5s | $0.010 |
(verified: pass_rate_by_task.csv, tool_calls_by_task.csv, latency_distribution.csv, cost_breakdown.csv)
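The headline figures can be re-derived from the table. This is a minimal sketch that copies the per-task rows by hand; the variable names are illustrative, and because the per-task averages are rounded to three decimals, the reconstructed total is only an estimate of the reported campaign spend.

```python
# (task, passes, runs, avg_cost_usd_per_run) copied from the results table.
RESULTS = [
    ("task_01", 3, 3, 0.016),
    ("task_02", 3, 3, 0.026),
    ("task_03", 3, 3, 0.349),
    ("task_04", 3, 3, 0.017),
    ("task_05", 3, 3, 0.017),
    ("task_06", 3, 3, 0.035),
    ("task_07", 3, 3, 0.007),
    ("task_08", 3, 3, 0.009),
    ("task_09", 0, 3, 0.022),
    ("task_10", 3, 3, 0.010),
]

total_passes = sum(passes for _, passes, _, _ in RESULTS)
total_runs = sum(runs for _, _, runs, _ in RESULTS)
pass_rate = total_passes / total_runs

# The cost column is a per-run average, so scale it by the run count.
estimated_total_cost = sum(runs * cost for _, _, runs, cost in RESULTS)

print(f"{total_passes}/{total_runs} passed ({pass_rate:.1%})")
print(f"estimated total cost: ${estimated_total_cost:.2f}")
```

This gives 27/30 (90.0%) and an estimated total of about $1.52, within a cent of the reported $1.53. The small gap is presumably rounding in the per-task averages.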
Zero diagnostic patterns fired across the 30 runs (verified: evidence directory, 4 files, all null results):
- no cross_task_consistency signals (inconsistent approach or quality across different task types)
- no diagnosis_then_regression (model correctly identifies a problem then reverts to a wrong fix at implementation)
- no long_tail_turn_count (runs requiring disproportionately many tool calls, indicating the model got stuck in repair loops)
- no tool_call_redundancy (repeated identical tool calls with no progress between them)
Where GPT-5.5 works, it runs clean.
What does task_09 actually ask?
[Observed]
The task: read data.csv (three rows of revenue by date: 100, 150, 200), compute the 10-day moving average, write results to answer.txt.
Three data points cannot produce a 10-day window. The correct behaviour is to recognise this and say so explicitly: decline to compute, write a qualified explanation, or produce output that acknowledges the data is insufficient.
GPT-5.5 did not do this. In two of the three June runs, the model wrote a scalar: 150. That value is both the mean and the median of the three rows, so the output alone does not show which one the model computed. Either way it is not a moving average, and the file gave no context. Run 3 attempted the correct CSV format (date column, 10_day_moving_average column) but populated only three rows, treating each individual data point as its own window instead of computing a rolling aggregate (verified: pass_rate_by_task.csv, run IDs in the task_09 evidence file).
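To make the arithmetic concrete, here is a minimal sketch of a moving average that refuses to run when the window is longer than the data. The function name and error message are illustrative assumptions; only the three revenue values and the 10-day window come from the task description.

```python
from statistics import mean


def moving_average(values, window):
    """Return the rolling mean for every full window, oldest first.

    Raises ValueError when there are fewer values than the window needs,
    instead of silently falling back to some other statistic.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        raise ValueError(
            f"insufficient data: need at least {window} rows, got {len(values)}"
        )
    return [
        mean(values[end - window:end])
        for end in range(window, len(values) + 1)
    ]


revenue = [100, 150, 200]  # the three rows in data.csv

try:
    moving_average(revenue, 10)
except ValueError as err:
    print(err)  # the qualified refusal the task is looking for

print(mean(revenue))               # plain mean of all rows: the scalar seen in two runs
print(moving_average(revenue, 1))  # window of 1 just echoes the input rows
```

The last two lines reproduce the shapes of the failing outputs described above: a single number, and three rows that are simply the inputs again.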
[Observed]
The latency pattern adds detail. task_09 averaged 107.1 seconds across the three runs, with a minimum of 22 seconds and a maximum of 275 seconds (verified: latency_distribution.csv). That high variance suggests the model tried different approaches before settling on an answer. It did not settle on the correct one, but it did seem to hesitate.
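With only three runs, the mean, minimum and maximum pin down the remaining run. The sketch below works that out; it assumes the reported figures are close to exact (the minimum and maximum appear rounded to whole seconds), so the result is approximate.

```python
RUNS = 3
MEAN_S = 107.1  # average task_09 latency
MIN_S = 22.0    # fastest run, as reported (likely rounded)
MAX_S = 275.0   # slowest run, as reported (likely rounded)

total_s = MEAN_S * RUNS
third_run_s = total_s - MIN_S - MAX_S  # the run that is neither min nor max

print(f"implied third run: about {third_run_s:.0f}s")
```

On those figures the third run took about 24 seconds, so the spread comes from one long run at 275 seconds alongside two runs in the low twenties.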
[Speculation]
The hesitation without correction might indicate the model recognised something was off but defaulted to producing a number over producing an explanation. This is a harder failure mode to fix than a mechanical one: if the model’s training has reinforced “produce a numeric answer” for computation tasks, even recognising the issue at the reasoning layer may not be enough to break that pattern.
Why task_03 costs so much
[Observed]
task_03 accounted for $1.046 of the $1.53 total campaign spend (68% of cost for 10% of the tasks) (verified: cost_breakdown.csv). The task involves reading a multi-thousand-line access log, identifying a burst of HTTP 500 errors, tracing their root cause, and writing a structured finding. GPT-5.5 passed all three runs.
The high cost comes from input tokens. The access log is large; GPT-5.5 read the whole thing. Average input token count per run was approximately 66,000 tokens. The large context window was doing real work. It found the right root cause and produced clean finding.txt output in all three runs (verified: pass_rate_by_task.csv).
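A back-of-envelope check supports the input-token explanation. The sketch below uses only figures quoted in this article; the variable names are illustrative, and the token count is the approximate average given above, so the shares are estimates.

```python
INPUT_PRICE_PER_MILLION = 5.00  # USD per 1M input tokens (from the header)
AVG_INPUT_TOKENS = 66_000       # approximate per-run average for task_03
AVG_RUN_COST = 0.349            # task_03 average cost per run
TASK_03_TOTAL = 1.046           # task_03 spend across 3 runs
CAMPAIGN_TOTAL = 1.53           # total campaign spend

input_cost_per_run = AVG_INPUT_TOKENS * INPUT_PRICE_PER_MILLION / 1_000_000
input_share_of_run = input_cost_per_run / AVG_RUN_COST
task_share_of_campaign = TASK_03_TOTAL / CAMPAIGN_TOTAL

print(f"input cost per run: ${input_cost_per_run:.2f}")
print(f"input share of task_03 run cost: {input_share_of_run:.0%}")
print(f"task_03 share of campaign cost: {task_share_of_campaign:.0%}")
```

Input tokens come to about $0.33 of the $0.349 average run, roughly 95%, and task_03 accounts for about 68% of the campaign total.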
For teams running log analysis pipelines with large files, this is the scenario that plays to GPT-5.5’s strengths. The question is whether that use case justifies the price point relative to alternatives that can also do it.
Where did GPT-5.5 land on the current leaderboard?
[Observed]
| Model | Score | Pass rate | Cost | Period |
|---|---|---|---|---|
| Claude Opus 4.8 | 30/30 | 100% | $7.34 | Jun 2026 |
| DeepSeek V4-Pro (thinking) | 30/30 | 100% | $0.12 | Jun 2026 |
| Mistral Small 4 | 29/30 | 96.7% | $0.03 | May 2026 |
| DeepSeek V4-Flash | 28/30 | 93.3% | $0.04 | May 2026 |
| Claude Sonnet 4.6 | 28/30 | 93.3% | $1.44 | May 2026 |
| GPT-5.5 flagship (Jun 2026) | 27/30 | 90.0% | $1.53 | Jun 2026 |
| GPT-5.5 Instant (May 2026) | 27/30 | 90.0% | $1.89 | May 2026 |
| Claude Fable 5 | 25/30 | 83.3% | $1.97 | Jun 2026 |
| Gemini 3.1 Pro | 23/30 | 76.7% | $0.85 | Jun 2026 |
(verified: pass_rate_by_task.csv for the June campaign; prior campaign data from the leaderboard article)
The flagship campaign cost about 19% less in total than the Instant campaign did in May ($1.53 vs $1.89), largely because task_03 consumed fewer input tokens this run. The cost difference is noise, not an improvement.
GPT-5.5 now sits below three models that score higher at a fraction of the cost (DeepSeek V4-Pro, Mistral Small 4 and DeepSeek V4-Flash), and just behind Sonnet 4.6, which gets one extra pass for slightly less money. The May leaderboard had GPT-5.5 near the top of the pack. The June leaderboard does not.
What is GPT-5.5 actually for?
[Speculation]
The data doesn’t answer this directly. What it does is narrow the field.
Outside large-context log analysis, the cost comparison is unfavourable. On campaign totals, GPT-5.5 costs more per passing run than every higher-scoring model on this harness except Claude Opus 4.8 ($7.34 for 30 passes). DeepSeek V4-Pro (30/30, $0.12) is the direct comparison: three more passes at a fraction of the price. Mistral Small 4 (29/30, $0.03) gets two more passes at nearly nothing. If cost-per-task matters, those two dominate.
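Cost per passing run is simply total campaign cost divided by passes. A minimal sketch for GPT-5.5 and the two cheaper models named above, using the leaderboard figures; the function and dictionary names are illustrative, and the totals include task_03, so this is a campaign-wide figure, not one restricted to small-context tasks.

```python
# model -> (passing runs out of 30, total campaign cost in USD),
# copied from the leaderboard table.
CAMPAIGNS = {
    "GPT-5.5 flagship": (27, 1.53),
    "DeepSeek V4-Pro (thinking)": (30, 0.12),
    "Mistral Small 4": (29, 0.03),
}


def cost_per_pass(passes: int, total_cost: float) -> float:
    """Total campaign cost divided by the number of passing runs."""
    if passes == 0:
        raise ValueError("no passing runs, cost per pass is undefined")
    return total_cost / passes


for model, (passes, total_cost) in CAMPAIGNS.items():
    print(f"{model}: ${cost_per_pass(passes, total_cost):.4f} per passing run")
```

That works out to roughly $0.057 per passing run for GPT-5.5, against about $0.004 for DeepSeek V4-Pro and about $0.001 for Mistral Small 4.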
Where GPT-5.5 has a plausible edge is the scenario represented by task_03: large-context investigation where the model needs to read a substantial input file and produce a structured analysis. The context window does real work there. If a pipeline regularly processes files in the tens-of-thousands-of-tokens range, the question is whether GPT-5.5’s large window and clean execution on those tasks justify the price gap over competitors that may not handle the same input length as reliably.
[Unobserved]
We haven’t tested GPT-5.5 against other models on specifically large-context workloads to confirm whether the task_03 advantage holds at scale, or whether it’s a feature of this particular task’s structure. That comparison doesn’t exist in the current dataset.
What we don’t know yet
[Speculation]
The 27/30 score has been stable across two independent campaigns covering different GPT-5.5 variants and a month of OpenAI development. We don’t know whether that stability reflects an architectural limit, a training data characteristic, or something specific to how this harness presents the problem.
[Speculation]
We also don’t know whether the failure mode on task_09 is specific to moving average computation, or whether it generalises to any task involving underdetermined data. The harness doesn’t have enough task_09 variants to isolate the variable. A targeted campaign on that question would need a different design.
[Speculation]
Predictions for a third GPT-5.5 run, if it happens: task_09 still fails unless OpenAI makes a targeted change to how the model handles explicitly insufficient data. The broader nine-of-ten consistency will hold. Cost will depend on task_03 input token count.
Source references
Campaign data: 2026-06-16-gpt-5.5-flagship-agentic-core-v1/
Predictions file: gpt-5.5-flagship-agentic-core-v1.predictions.md
Pass rates: pass_rate_by_task.csv
Cost breakdown: cost_breakdown.csv
Latency: latency_distribution.csv
Failure modes: failure_mode_histogram.csv
Evidence patterns: evidence/ (4 files, all null results)
Prior run: 2026-05-10-gpt-5.5-agentic-core-v1/