90% and the one model that never refused
Campaign: 2026-05-10-gpt-5.5-agentic-core-v1
Model: OpenAI GPT-5.5 Instant (gpt-5.5, direct OpenAI API, Responses API)
Harness: openclaw@2026.4.22
Runs: 30 (10 tasks × 3 runs each)
Campaign date: 2026-05-16 (run at 16:22–16:27 UTC)
Every other model we have tested has come in on Bedrock or EC2. GPT-5.5 Instant was the first OpenAI model through the pipeline, which meant two things needed to work on the same run: the new OpenAIAdapter, and the model itself.
The adapter worked. No infrastructure errors across all 30 runs, clean tool-call handling, consistent function_call type from the Responses API throughout.
The model scored 27/30 — third on the leaderboard, one run behind the top cluster, 31% more expensive than Claude on the same suite. The result we did not expect: task_08, the tool-error recovery task where lower-performing models have consistently failed. GPT-5.5 got 3/3, joining Claude Sonnet 4.6 and DeepSeek-V4-Flash at the top. Exactly two tool calls each time.
What agentic-core-v1 tests
[Observed]
agentic-core-v1 has 10 tasks, each run 3 times, for 30 total runs. The tasks are structured, scoped, and have deterministic checkers. They cover: fixing a failing test, refactoring duplicated code, investigating a log file, tracing through a codebase, writing a minimal fix under a line constraint, handling an ambiguous requirement, multi-step sequential planning, recovering from a tool error involving byte-count calculation, recognising an impossible problem and declining to compute, and SQL investigation using native tool calls.
A run passes when the model’s output matches the checker’s acceptance criteria before the 15-turn budget runs out. A run fails as wrong_answer when the checker rejects the output, or as gave_up_mid_plan when the turn limit is hit without a committed answer.
task_09 is the outlier in the suite: it asks for a 10-day moving average of a 3-row CSV. The correct answer is to recognise that 3 data points can’t produce a 10-day moving average and refuse to compute. Every model we have run so far has struggled with it. GPT-5.5 is no different.
What GPT-5.5 did
[Observed]
27 of 30 runs passed. Pass rate: 90.00% (verified: verification/pass_rate_by_task.csv). Nine of ten task types were 3/3 clean. task_09 was 0/3. All three failures were wrong_answer (verified: verification/failure_mode_histogram.csv).
| Task | Result | Avg tool calls | Avg latency |
|---|---|---|---|
| task_01 fix failing test | 3/3 | 5.0 | 8.1s |
| task_02 refactor duplicated code | 3/3 | 5.0 | 11.9s |
| task_03 investigate log | 3/3 | 7.0 | 23.0s |
| task_04 trace through codebase | 3/3 | 6.0 | 7.2s |
| task_05 minimal fix | 3/3 | 5.0 | 7.3s |
| task_06 handle ambiguous requirement | 3/3 | 6.3 | 16.0s |
| task_07 multi-step plan | 3/3 | 4.0 | 8.2s |
| task_08 recover from tool error | 3/3 | 2.0 | 5.9s |
| task_09 know when to stop | 0/3 | 2.0 | 8.9s |
| task_10 SQL investigation | 3/3 | 3.0 | 5.7s |
(verified: verification/pass_rate_by_task.csv, verification/tool_calls_by_task.csv, verification/latency_distribution.csv)
The task_08 result
[Observed]
task_08 asks the model to recover from a deliberate tool error involving a byte-count mismatch. The scaffold provides a tool that returns a character count when the task requires a byte count. Most multi-byte character sequences produce different values for these two measurements, and the tool is broken by design.
GPT-5.5 made exactly 2 tool calls per run across all three runs — minimum and maximum identical. Average latency was 5.9s. Cost across all three runs: $0.0287 (verified: verification/cost_breakdown.csv).
The 2-tool-call pattern tells most of the story. The model did not retry the broken path or attempt to calculate byte length itself. It encountered the error, found a route that didn’t depend on the broken tool, and wrote a passing answer.
[Speculation]
The most plausible interpretation: GPT-5.5 reached for a filesystem or shell call to get byte length directly, bypassing the tool whose output couldn’t be trusted. That would explain the 2-call minimum — one to understand the environment, one to write the answer. It’s error isolation rather than computation. Llama 3.3 70B, the model that failed task_08 (0/3), followed the opposite path: attempting to derive byte counts from the tool output, which is structurally wrong when the tool is returning character counts. Claude Sonnet 4.6 and DeepSeek-V4-Flash both got 3/3 as well, so the top-tier convergence on this pattern is consistent.
Whether GPT-5.5 consistently does this across other tool-degradation scenarios, or whether task_08’s specific scaffold happens to match GPT-5.5’s natural instinct, we can’t say from three runs.
The task_09 failure
[Observed]
task_09_know_when_to_stop: 0/3, wrong_answer × 3. GPT-5.5 made exactly 2 tool calls per run — same minimum as task_08 — and wrote an answer each time. Average latency was 8.9s. The model did not deliberate. It read the 3-row CSV, computed a number, and wrote it to answer.txt.
Failure taxonomy across all five campaigns on this task:
| Model | task_09 score | Failure mode |
|---|---|---|
| Claude Sonnet 4.6 | 1/3 | Caught impossibility once |
| DeepSeek-V4-Flash | 1/3 | Caught it once (non-thinking mode) |
| GPT-5.5 Instant | 0/3 | wrong_answer × 3 |
| Gemma 4 31B IT | 0/3 | gave_up_mid_plan × 3 |
| Llama 3.3 70B | 0/3 | — |
GPT-5.5 and Gemma both got 0/3, but through different paths. Gemma hit the turn limit without committing an answer. GPT-5.5 committed a confident wrong answer immediately.
[Speculation]
GPT-5.5’s 0/3 on task_09 and 3/3 on task_08 look like opposite sides of the same characteristic. The model moves fast and commits early. On task_08, that means it finds a working path and doesn’t thrash. On task_09, it means it computes and writes without interrogating whether the computation is valid.
Before the campaign, the prediction was that GPT-5.5’s instruction-following reputation would make it more likely to refuse invalid requests. The campaign showed the opposite. Strong instruction following may bias a model toward attempting a task rather than questioning whether the task makes sense. That applies beyond task_09 — any workflow that can receive invalid or underspecified inputs carries this risk.
The cost story: task_03
[Observed]
task_03 cost $1.3988 across 3 runs — 74% of the entire campaign budget (verified: verification/cost_breakdown.csv). The driver was input tokens: 265,321 total across 3 runs, or roughly 88,440 per run. The task reads a large access.log file. GPT-5.5 read it into context across multiple tool calls, inflating input token count. Output was modest: 2,405 total tokens across all three runs.
At $5.00/M input, 88K tokens per run is $0.44 in input costs before a single output token. The remaining 27 runs cost $0.49 combined — $0.018 per run on average.
For comparison, Gemma 4 31B IT ran task_03 with a 1/3 pass rate: two of three runs returned HTTP 400 due to context overflow at its 8K context limit. GPT-5.5’s 128K context handled the full log without truncation. The 3/3 result came at a real cost premium.
The $1.89 headline understates the cost exposure if large-context tasks dominate a workflow. The model is cheap on most tasks. The billing curve is driven by what goes in the context window.
Execution quality
[Unobserved]
All four evidence-bundle pattern detectors returned zero matches across all 30 runs (verified: evidence bundles in data/campaigns/2026-05-10-gpt-5.5-agentic-core-v1/evidence/):
- tool_call_redundancy: 0/30. No consecutive identical tool calls. The model did not re-read files it had already retrieved in any run.
- diagnosis_then_regression: 0/30. No stated diagnosis followed by walkback. GPT-5.5 committed to its approach and held it.
- long_tail_turn_count: 0/30. No run exceeded 12 turns. The task_09 failure happened in 2 turns, not a drawn-out spiral.
- cross_task_consistency: 0 violations.
This is the cleanest execution profile in the campaign series to date. No thrashing, no hedging loops, no tool redundancy. Whether that reflects model quality or a task-suite ceiling on what these detectors can observe is an open question.
The result in context
[Observed]
| Model | Score | Cost | Tier |
|---|---|---|---|
| Claude Sonnet 4.6 | 28/30 (93.3%) | $1.44 | API (Bedrock) |
| DeepSeek-V4-Flash | 28/30 (93.3%) | $0.04 | API (direct) |
| GPT-5.5 Instant | 27/30 (90.0%) | $1.89 | API (direct, OpenAI) |
| Gemma 4 31B IT | 23/30 (76.7%) | $0.00 | Local (EC2) |
| Llama 3.3 70B | 20/30 (66.7%) | $0.09 | Local (EC2) |
One run behind Claude, 31% higher cost. Against DeepSeek-V4-Flash: same 1-run gap, 47× the cost.
The Claude head-to-head: both models got task_08 3/3 — Claude’s two failures were both on task_09 (its only failing task). The one-run gap traces entirely to task_09: Claude caught the impossibility once (1/3); GPT-5.5 did not catch it at all (0/3 wrong_answer). The gap is not about complementary strengths — it’s about one model being slightly more likely to recognise an invalid request.
What the predictions got wrong
[Observed]
Four predictions were filed before the campaign. Three were wrong.
P1 — task_09 ≥2/3: WRONG. Predicted GPT-5.5 would outperform Claude on task_09 based on instruction-following reputation. Actual: 0/3 wrong_answer. The campaign inverted the rationale — strong instruction following may make a model more likely to attempt an impossible task, not less.
P2 — Responses API edge case surfaces ≥1 missed tool call: WRONG. Predicted the OpenAIAdapter would miss at least one tool call due to item-type parsing ambiguity. Actual: 0 missed tool calls across all 30 runs (verified: verification/tool_calls_by_task.csv). The live Responses API returns function_call type consistently. The prediction was overcautious about adapter novelty.
P3 — Campaign cost >$4.00: WRONG. Predicted total cost would exceed $4.00 based on list pricing applied to estimated token volumes. Actual: $1.89. The model is terse on output — nine of ten tasks averaged under $0.035 per task across three runs. The single large-input exception (task_03) was correctly identified as expensive but still cheaper than estimated.
P4 — Score range 25–29/30: CORRECT. 27/30 falls in the range. Point estimate was 27. Called exactly.
What we don’t know yet
[Speculation]
The task_08 result raises a question we can’t answer from 3 runs: does GPT-5.5 Instant handle tool degradation consistently across tool types, or is task_08’s specific scaffold aligned with the model’s natural inclination? The 2-tool-call minimum is consistent with a model that immediately reaches for an alternative path. But that could be task-specific.
The task_09 failure pattern — confident wrong answers on impossible tasks — is more broadly applicable. We have not tested it on tasks where the impossibility is less structural (no explicit math requirement, just an underspecified goal). That would be a useful follow-up campaign.
We also have not tested GPT-5.5 in thinking mode or with any system-prompt configuration. The baseline here is out-of-the-box defaults. Whether task_09 performance changes with explicit prompting to reason about input validity is unknown.