The pass that only worked because pandas wasn't installed
Campaign: 2026-05-15-claude-sonnet-4.6-agentic-core-v1
Model: claude-sonnet-4-6 (Anthropic, via AWS Bedrock)
Harness: openclaw@2026.4.22
Runs: 30 (10 tasks × 3 runs each)
Date: 2026-05-04
The first question when you run a new benchmark is whether anything will actually fail. You design tasks to probe edge cases, but you don’t know until the runs land whether the model sails through or hits the wall you built.
Claude Sonnet 4.6 mostly sailed through. That was the surprise: 9 of 10 tasks had zero failures. Then one task broke it twice, and the one run of that task that passed did so by accident. The model never chose the right answer. The environment forced it there because a library wasn’t installed.
That accident is the story.
The numbers
[Observed]
28 of 30 runs passed. Pass rate: 93.3% (verified: pass_rate_by_task.csv). Total cost: $1.44 across 30 runs ($0.048 per run average, verified: cost_breakdown.csv). One task accounted for both failures. The other 9 tasks were 3/3 clean.
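The headline figures can be re-derived from the counts quoted above. This is only a quick arithmetic check using the numbers in this post, not a reading of the CSV files themselves; the variable names are illustrative.

```python
# Figures quoted in this post (not re-read from pass_rate_by_task.csv or cost_breakdown.csv).
total_runs = 30
passed_runs = 28
total_cost_usd = 1.44

pass_rate = passed_runs / total_runs
cost_per_run = total_cost_usd / total_runs

print(f"pass rate: {pass_rate:.1%}")          # pass rate: 93.3%
print(f"cost per run: ${cost_per_run:.3f}")   # cost per run: $0.048
```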
Both headline predictions were wrong. Predicted pass rate: 80–90%. Actual: 93.3%, above the upper bound. Predicted cost: $12–18. Actual: $1.44, off by roughly 10×. The predictions are on record and scored.
The 10× cost miss is worth examining. The prediction assumed that “generous turn budgets” would drive cost up. That reasoning is wrong. Most tasks completed in 4–11 tool calls (verified: tool_calls_by_task.csv). Turn budget headroom does not determine cost; actual usage does. The tasks are short and the Bedrock adapter is efficient. In hindsight this was obvious.
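To make "off by roughly 10×" concrete, here is a small sketch that scores each prediction against its range and computes how far the cost bounds sat from the actual figure. The function name and return labels are hypothetical and are not the scoring scheme used for the campaign.

```python
def score_range(low: float, high: float, actual: float) -> str:
    """Classify an actual value against a predicted [low, high] range."""
    if actual > high:
        return "actual above range"
    if actual < low:
        return "actual below range"
    return "hit"

# Predicted pass rate 80-90%, actual 93.3%.
pass_rate_result = score_range(80.0, 90.0, 93.3)

# Predicted cost $12-18, actual $1.44.
cost_result = score_range(12.0, 18.0, 1.44)
overshoot_low = 12.0 / 1.44    # lower bound was about 8.3x the actual cost
overshoot_high = 18.0 / 1.44   # upper bound was about 12.5x the actual cost

print(pass_rate_result, cost_result)
print(f"{overshoot_low:.1f}x to {overshoot_high:.1f}x")
```

The 8.3× to 12.5× spread is where the "roughly 10×" shorthand comes from.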
The one task that broke: know when to stop
[Observed]
The task (internally task_09_know_when_to_stop) asked the model to compute a 10-day moving average of a revenue column in a CSV file. The data file had exactly 3 rows, covering three consecutive days. A 10-day window applied to 3 data points is underspecified: the task specification didn’t say what to do when there is less data than the window size. The task name was a hint about the expected behaviour.
Run 1 passed. Runs 2 and 3 failed. The harness labels a run that hits the turn limit without producing a complete answer as gave_up_mid_plan, meaning the model kept working but never committed to a final answer before the turn budget ran out.
Run 1: the accident
Early in the run, the model tried to import pandas to do the calculation. Pandas wasn’t installed in the task environment, so the import failed with a ModuleNotFoundError. Forced to use plain Python instead, the model fell back to a simpler calculation and wrote this to the output file:
“Note: With only 3 data points (fewer than the 10-day window), the moving average is computed using all available data up to each date (min_periods=1 approach).”
The task checker accepted this. The model had chosen, via the fallback path and not by deliberate reasoning, to treat the window as cumulative up to the available data. That is a defensible answer to the underspecified question. But the model didn’t reason its way there. The missing library forced it there.
Runs 2 and 3: confident diagnosis, no resolution
Runs 2 and 3 had pandas available. Without the forced fallback, the model approached the problem differently and got stuck.
In both runs, the model read the CSV file, correctly identified that it had only 3 rows and needed a 10-day window, and announced it was ready to compute the answer. Then it produced all-null results (three rows, all showing no value), wrote that to the output file, re-read the same CSV it had already loaded, and repeated the same steps. This loop continued until it hit the turn limit and the harness closed the run as gave_up_mid_plan.
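The two outcomes come down to one parameter. The sketch below is plain Python, not the code the model ran, and the revenue values are made up. It mimics the documented pandas convention that an integer rolling window defaults min_periods to the window size, so a 10-row window over 3 rows yields nothing but nulls unless min_periods is lowered.

```python
from typing import List, Optional

def moving_average(
    values: List[float],
    window: int,
    min_periods: Optional[int] = None,
) -> List[Optional[float]]:
    """Trailing moving average; None where fewer than min_periods points exist."""
    if min_periods is None:
        # Mirror the pandas default for integer windows: require a full window.
        min_periods = window
    out: List[Optional[float]] = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        if len(chunk) < min_periods:
            out.append(None)
        else:
            out.append(sum(chunk) / len(chunk))
    return out

revenue = [100.0, 200.0, 300.0]  # hypothetical 3-day revenue column

strict = moving_average(revenue, window=10)                  # full window required
lenient = moving_average(revenue, window=10, min_periods=1)  # use whatever data exists

print(strict)   # [None, None, None]
print(lenient)  # [100.0, 150.0, 200.0]
```

The strict result matches the all-null output described for runs 2 and 3; the lenient one matches the "min_periods=1 approach" named in the run 1 note. Neither is wrong on its own terms. The failure was in never choosing between them.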
The model knew the constraint. It stated the constraint. It never decided what to do about it.
Both failed runs re-read the input file after already having its contents; the harness flagged this as a redundant tool call pattern, discussed below.
The structural finding
[Observed] The run 1 pass depends entirely on pandas being absent from the environment. That is invisible to the task specification and not part of the acceptance criteria. On any system where pandas is available (which is most real environments), the model would have no reason to use the plain-Python fallback. Based on runs 2 and 3, it would likely have produced all-null output and run to the limit.
[Speculation] If run 1 had pandas available, it would probably also have failed. The real pass rate for this task in a standard environment may be 0/3, not 1/3. This is not tested.
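One way to make this kind of hidden dependency visible would be to record which optional libraries are importable alongside each run. The sketch below uses only the standard library; the function name and the idea of storing the result in run metadata are assumptions, not features of the harness.

```python
import importlib.util
from typing import Dict, Iterable

def environment_fingerprint(modules: Iterable[str] = ("pandas",)) -> Dict[str, bool]:
    """Report whether each named top-level module can be found, without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}

fingerprint = environment_fingerprint()
print(fingerprint)  # e.g. {'pandas': False} in an environment like the one described for run 1
```

With a fingerprint attached to every transcript, a pass that only occurs when pandas is absent would show up as a difference in environment and not as a difference in model behaviour.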
A pattern that showed up: repeating the same tool call
[Observed]
7 of 30 runs showed what the harness calls tool_call_redundancy: consecutive identical tool calls, meaning the model called the same tool with the same arguments twice in a row, back to back (verified: tool_call_redundancy.md). That is 23.3% of runs.
Three tasks accounted for all instances:
- Refactor duplicated code (task_02): all 3 runs each called the file-write tool on the same output file with identical content in consecutive turns. All three passed. Writing the same file twice is harmless; the second write just overwrites the first with the same content.
- Recover from a tool error (task_08): one run re-read the same data file twice in a row after already having its contents. Passed. The other two runs had zero redundancy.
- Know when to stop (task_09): all 3 runs re-read the same CSV with identical arguments after already having the file. Run 1 passed (see above). Runs 2 and 3 failed.
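For readers who want to reproduce the check, a minimal sketch of a consecutive-duplicate detector follows. It is not the harness’s detector: the transcript shape (a list of tool name and argument pairs) and the tool names are hypothetical, chosen only to illustrate "same tool, same arguments, back to back".

```python
import json
from typing import Any, Dict, List, Tuple

ToolCall = Tuple[str, Dict[str, Any]]  # (tool name, arguments): an assumed shape

def count_consecutive_duplicates(calls: List[ToolCall]) -> int:
    """Count adjacent pairs of calls with the same tool and identical arguments."""
    def key(call: ToolCall) -> Tuple[str, str]:
        name, args = call
        # Serialise with sorted keys so argument order does not affect equality.
        return name, json.dumps(args, sort_keys=True)

    return sum(1 for prev, cur in zip(calls, calls[1:]) if key(prev) == key(cur))

# Hypothetical transcript: the CSV is read twice in a row, then a result is written.
transcript: List[ToolCall] = [
    ("read_file", {"path": "data.csv"}),
    ("read_file", {"path": "data.csv"}),
    ("write_file", {"path": "out.txt", "content": "..."}),
]

redundant = count_consecutive_duplicates(transcript)
print(redundant)  # 1
```

A count like this says nothing about outcome by itself, which is consistent with the list above: the same signal appears in passing and failing runs.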
[Unobserved] The other 7 tasks had zero redundant tool calls across all runs, confirmed by the counter-examples section of tool_call_redundancy.md.
The three tasks with redundancy have structurally different outcomes. In the refactor task, the model writes a correct result and writes it again, harmlessly. In the tool-error task, one run re-reads a file once and still passes. In the moving-average task, the redundancy appears alongside failure: the model re-reads data it already has, a signal that it’s stuck rather than progressing.
[Speculation] All three tasks with redundancy involve a point where the model may not be confident its output is correct without looking at the source again. Whether this is a general pattern or specific to these task types is an open question with only one data point each way.
A pattern that didn’t show up: taking back a diagnosis
[Unobserved]
Zero of 30 runs showed diagnosis_then_regression, the pattern where a model explicitly states what’s wrong and then walks that diagnosis back. The detectors ran against all 30 transcripts (verified: diagnosis_then_regression.md). This is an explicit null result.
Run 3 of task_09 is directly relevant here: the model said “Now I have all the information needed” and then looped to the turn limit. That looks like it might be regression, but it isn’t. The model never walked back its stated plan. It maintained the same approach throughout; it just couldn’t produce a final answer under it. The harness labels this gave_up_mid_plan, not regression. False confidence is not the same as reversal.
Cost and latency
[Observed]
Total: $1.44. Per-run mean: $0.048 (verified: cost_breakdown.csv).
Cheapest task: the multi-step planning task at $0.012 per run average. Most expensive: the log investigation task at $0.13 per run average, because that task includes a 500-line log file as part of the prompt (118,046 input tokens across 3 runs).
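As a rough sense of scale, the log investigation figures can be turned into per-run input volume and a share of total spend. This uses only the rounded numbers quoted in this post, so the share is approximate; nothing here is read from cost_breakdown.csv.

```python
# Figures quoted above; the $0.13 per-run average is already rounded.
log_task_input_tokens = 118_046   # across 3 runs
runs_per_task = 3
log_task_cost_per_run = 0.13
total_cost = 1.44

tokens_per_run = log_task_input_tokens / runs_per_task
log_task_share = log_task_cost_per_run * runs_per_task / total_cost

print(f"{tokens_per_run:,.0f} input tokens per run")         # 39,349 input tokens per run
print(f"{log_task_share:.0%} of total cost (approximate)")   # 27% of total cost (approximate)
```

One task out of ten accounting for roughly a quarter of the bill is consistent with cost tracking input size.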
The slowest task by elapsed time was the codebase tracing task, averaging 56.6 seconds per run. That task averaged 18.3 tool calls per run, reading 5 source files, some of them multiple times. No single call was slow; the latency came from the volume of calls (verified: latency_distribution.csv, tool_calls_by_task.csv).
What the predictions got wrong
Prediction: pass rate 80–90%. WRONG, predicted too low. Actual: 93.3%, above the upper bound.
Prediction: cost $12–18. WRONG, predicted too high. Actual: $1.44, off by roughly 10×.
Prediction: task_09 failures would be wrong_answer, meaning the checker would reject a numeric output that didn’t meet the acceptance criteria. WRONG. The failures were gave_up_mid_plan: the model never produced a final numeric answer to reject. The prediction on count (1 of 3 runs passing) was correct. The predicted failure mechanism was not.
What we don’t know yet
- Whether task_09 run 1 still passes on a system with pandas installed.
- Whether the redundant-tool-call pattern predicts loop-failure in general. One data point where redundancy co-occurs with failure (the moving-average task), one where it co-occurs with a pass (the refactor task). Not enough to generalise.
- Whether 93.3% holds on harder task variants. The agentic-core-v1 tasks are structured, well-scoped, and have deterministic checkers. Behaviour on tasks with ambiguous acceptance criteria (or larger, noisier codebases) is not covered by this run.
- Whether $0.048 per run scales to real production work. Almost certainly not: cost is driven by input token volume, and production jobs carry far more context than these tasks.
Evidence pack: verification/ directory. Full transcripts in data/transcripts/2026-05-15-claude-sonnet-4.6-agentic-core-v1/.