90% and the one model that never refused

Campaign: 2026-05-10-gpt-5.5-agentic-core-v1
Model: OpenAI GPT-5.5 Instant (gpt-5.5, direct OpenAI API, Responses API)
Harness: openclaw@2026.4.22
Runs: 30 (10 tasks × 3 runs each)
Campaign date: 2026-05-16 (run at 16:22–16:27 UTC)


Every other model we have tested has come in on Bedrock or EC2. GPT-5.5 Instant was the first OpenAI model through the pipeline, which meant two things needed to work on the same run: the new OpenAIAdapter, and the model itself.

The adapter worked. No infrastructure errors across all 30 runs, clean tool-call handling, consistent function_call type from the Responses API throughout.

The model scored 27/30 — third on the leaderboard, one run behind the top cluster, 31% more expensive than Claude on the same suite. The result we did not expect: task_08, the tool-error recovery task where lower-performing models have consistently failed. GPT-5.5 got 3/3, joining Claude Sonnet 4.6 and DeepSeek-V4-Flash at the top. Exactly two tool calls each time.


What agentic-core-v1 tests

[Observed]

agentic-core-v1 has 10 tasks, each run 3 times, for 30 total runs. The tasks are structured, scoped, and have deterministic checkers. They cover: fixing a failing test, refactoring duplicated code, investigating a log file, tracing through a codebase, writing a minimal fix under a line constraint, handling an ambiguous requirement, multi-step sequential planning, recovering from a tool error involving byte-count calculation, recognising an impossible problem and declining to compute, and SQL investigation using native tool calls.

A run passes when the model’s output matches the checker’s acceptance criteria before the 15-turn budget runs out. A run fails as wrong_answer when the checker rejects the output, or as gave_up_mid_plan when the turn limit is hit without a committed answer.

task_09 is the outlier in the suite: it asks for a 10-day moving average of a 3-row CSV. The correct answer is to recognise that 3 data points can’t produce a 10-day moving average and refuse to compute. Every model we have run so far has struggled with it. GPT-5.5 is no different.


What GPT-5.5 did

[Observed]

27 of 30 runs passed. Pass rate: 90.00% (verified: verification/pass_rate_by_task.csv). Nine of ten task types were 3/3 clean. task_09 was 0/3. All three failures were wrong_answer (verified: verification/failure_mode_histogram.csv).

TaskResultAvg tool callsAvg latency
task_01 fix failing test3/35.08.1s
task_02 refactor duplicated code3/35.011.9s
task_03 investigate log3/37.023.0s
task_04 trace through codebase3/36.07.2s
task_05 minimal fix3/35.07.3s
task_06 handle ambiguous requirement3/36.316.0s
task_07 multi-step plan3/34.08.2s
task_08 recover from tool error3/32.05.9s
task_09 know when to stop0/32.08.9s
task_10 SQL investigation3/33.05.7s

(verified: verification/pass_rate_by_task.csv, verification/tool_calls_by_task.csv, verification/latency_distribution.csv)


The task_08 result

[Observed]

task_08 asks the model to recover from a deliberate tool error involving a byte-count mismatch. The scaffold provides a tool that returns a character count when the task requires a byte count. Most multi-byte character sequences produce different values for these two measurements, and the tool is broken by design.

GPT-5.5 made exactly 2 tool calls per run across all three runs — minimum and maximum identical. Average latency was 5.9s. Cost across all three runs: $0.0287 (verified: verification/cost_breakdown.csv).

The 2-tool-call pattern tells most of the story. The model did not retry the broken path or attempt to calculate byte length itself. It encountered the error, found a route that didn’t depend on the broken tool, and wrote a passing answer.

[Speculation]

The most plausible interpretation: GPT-5.5 reached for a filesystem or shell call to get byte length directly, bypassing the tool whose output couldn’t be trusted. That would explain the 2-call minimum — one to understand the environment, one to write the answer. It’s error isolation rather than computation. Llama 3.3 70B, the model that failed task_08 (0/3), followed the opposite path: attempting to derive byte counts from the tool output, which is structurally wrong when the tool is returning character counts. Claude Sonnet 4.6 and DeepSeek-V4-Flash both got 3/3 as well, so the top-tier convergence on this pattern is consistent.

Whether GPT-5.5 consistently does this across other tool-degradation scenarios, or whether task_08’s specific scaffold happens to match GPT-5.5’s natural instinct, we can’t say from three runs.


The task_09 failure

[Observed]

task_09_know_when_to_stop: 0/3, wrong_answer × 3. GPT-5.5 made exactly 2 tool calls per run — same minimum as task_08 — and wrote an answer each time. Average latency was 8.9s. The model did not deliberate. It read the 3-row CSV, computed a number, and wrote it to answer.txt.

Failure taxonomy across all five campaigns on this task:

Modeltask_09 scoreFailure mode
Claude Sonnet 4.61/3Caught impossibility once
DeepSeek-V4-Flash1/3Caught it once (non-thinking mode)
GPT-5.5 Instant0/3wrong_answer × 3
Gemma 4 31B IT0/3gave_up_mid_plan × 3
Llama 3.3 70B0/3

GPT-5.5 and Gemma both got 0/3, but through different paths. Gemma hit the turn limit without committing an answer. GPT-5.5 committed a confident wrong answer immediately.

[Speculation]

GPT-5.5’s 0/3 on task_09 and 3/3 on task_08 look like opposite sides of the same characteristic. The model moves fast and commits early. On task_08, that means it finds a working path and doesn’t thrash. On task_09, it means it computes and writes without interrogating whether the computation is valid.

Before the campaign, the prediction was that GPT-5.5’s instruction-following reputation would make it more likely to refuse invalid requests. The campaign showed the opposite. Strong instruction following may bias a model toward attempting a task rather than questioning whether the task makes sense. That applies beyond task_09 — any workflow that can receive invalid or underspecified inputs carries this risk.


The cost story: task_03

[Observed]

task_03 cost $1.3988 across 3 runs — 74% of the entire campaign budget (verified: verification/cost_breakdown.csv). The driver was input tokens: 265,321 total across 3 runs, or roughly 88,440 per run. The task reads a large access.log file. GPT-5.5 read it into context across multiple tool calls, inflating input token count. Output was modest: 2,405 total tokens across all three runs.

At $5.00/M input, 88K tokens per run is $0.44 in input costs before a single output token. The remaining 27 runs cost $0.49 combined — $0.018 per run on average.

For comparison, Gemma 4 31B IT ran task_03 with a 1/3 pass rate: two of three runs returned HTTP 400 due to context overflow at its 8K context limit. GPT-5.5’s 128K context handled the full log without truncation. The 3/3 result came at a real cost premium.

The $1.89 headline understates the cost exposure if large-context tasks dominate a workflow. The model is cheap on most tasks. The billing curve is driven by what goes in the context window.


Execution quality

[Unobserved]

All four evidence-bundle pattern detectors returned zero matches across all 30 runs (verified: evidence bundles in data/campaigns/2026-05-10-gpt-5.5-agentic-core-v1/evidence/):

This is the cleanest execution profile in the campaign series to date. No thrashing, no hedging loops, no tool redundancy. Whether that reflects model quality or a task-suite ceiling on what these detectors can observe is an open question.


The result in context

[Observed]

ModelScoreCostTier
Claude Sonnet 4.628/30 (93.3%)$1.44API (Bedrock)
DeepSeek-V4-Flash28/30 (93.3%)$0.04API (direct)
GPT-5.5 Instant27/30 (90.0%)$1.89API (direct, OpenAI)
Gemma 4 31B IT23/30 (76.7%)$0.00Local (EC2)
Llama 3.3 70B20/30 (66.7%)$0.09Local (EC2)

One run behind Claude, 31% higher cost. Against DeepSeek-V4-Flash: same 1-run gap, 47× the cost.

The Claude head-to-head: both models got task_08 3/3 — Claude’s two failures were both on task_09 (its only failing task). The one-run gap traces entirely to task_09: Claude caught the impossibility once (1/3); GPT-5.5 did not catch it at all (0/3 wrong_answer). The gap is not about complementary strengths — it’s about one model being slightly more likely to recognise an invalid request.


What the predictions got wrong

[Observed]

Four predictions were filed before the campaign. Three were wrong.

P1 — task_09 ≥2/3: WRONG. Predicted GPT-5.5 would outperform Claude on task_09 based on instruction-following reputation. Actual: 0/3 wrong_answer. The campaign inverted the rationale — strong instruction following may make a model more likely to attempt an impossible task, not less.

P2 — Responses API edge case surfaces ≥1 missed tool call: WRONG. Predicted the OpenAIAdapter would miss at least one tool call due to item-type parsing ambiguity. Actual: 0 missed tool calls across all 30 runs (verified: verification/tool_calls_by_task.csv). The live Responses API returns function_call type consistently. The prediction was overcautious about adapter novelty.

P3 — Campaign cost >$4.00: WRONG. Predicted total cost would exceed $4.00 based on list pricing applied to estimated token volumes. Actual: $1.89. The model is terse on output — nine of ten tasks averaged under $0.035 per task across three runs. The single large-input exception (task_03) was correctly identified as expensive but still cheaper than estimated.

P4 — Score range 25–29/30: CORRECT. 27/30 falls in the range. Point estimate was 27. Called exactly.


What we don’t know yet

[Speculation]

The task_08 result raises a question we can’t answer from 3 runs: does GPT-5.5 Instant handle tool degradation consistently across tool types, or is task_08’s specific scaffold aligned with the model’s natural inclination? The 2-tool-call minimum is consistent with a model that immediately reaches for an alternative path. But that could be task-specific.

The task_09 failure pattern — confident wrong answers on impossible tasks — is more broadly applicable. We have not tested it on tasks where the impossibility is less structural (no explicit math requirement, just an underspecified goal). That would be a useful follow-up campaign.

We also have not tested GPT-5.5 in thinking mode or with any system-prompt configuration. The baseline here is out-of-the-box defaults. Whether task_09 performance changes with explicit prompting to reason about input validity is unknown.

ClawWorks Weekly

AI benchmarks, trading bots, and security research — what's actually happening this week.