Sonnet 4.6 can count cards. It cannot play basic strategy.

Campaign: 2026-05-25-claude-sonnet-4-6-casino-strategy-v1
Model: Claude Sonnet 4.6 (us.anthropic.claude-sonnet-4-6-20250514-v1:0)
Harness: casino-strategy-v1 v1.0 (baseline run)
Runs: 15 (5 tasks x 3)
Campaign date: 2026-05-25

Harness update (2026-05-27): ModelClaw PR #111 recalibrated task_04 to OPR-only scoring. Bankroll survival is now informational, not a pass criterion. Sonnet 4.6’s task_04 result (0/3) is unaffected: it failed on OPR, not bankroll survival. Scores in this article reflect the original v1.0 criteria.


Sonnet 4.6 scored 28/30 (93%) on agentic-core-v1 (verified: 2026-05-15-claude-sonnet-4.6-agentic-core-v1 pass_rate_by_task.csv). At the time of this run, that was the highest score in our dataset (verified: leaderboard state 2026-05-25). Then we ran casino-strategy-v1. It scored 3/15 (20%).

That 73-point gap is why this article exists. It is also why the harness exists.

agentic-core-v1 tests discrete single-turn reasoning tasks with file system tools and clear pass criteria. Each run starts clean. Fail task_04 on agentic-core and the next trial is independent. casino-strategy-v1 is the opposite: the model plays blackjack hand by hand against a command-line game engine, and every wrong decision on hand 12 limits what is recoverable by hand 30.

The question: does Sonnet 4.6’s agentic capability translate into sequential decision-making under a strict rubric? No. But the failure modes are specific, and the most interesting result is not the overall score.


The harness

[Observed: harness spec v1.0]

Five tasks, three runs each, 15 total. The game engine is game.py, a command-line blackjack simulator the model invokes with tool calls. game.py status returns current hand state as JSON. game.py action <choice> executes hit, stand, split, or double. The model reads state and acts until the hand resolves.
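
The read-act loop described above can be sketched roughly as follows. This is an illustration, not the harness code. Only the status and action <choice> subcommands come from the spec; the `python game.py` invocation, the `resolved` field name, and the `choose_action` callback are assumptions made for the example.

```python
import json
import subprocess
from typing import Callable

VALID_ACTIONS = {"hit", "stand", "split", "double"}


def run_engine(args: list[str]) -> str:
    """Invoke the game engine and return its stdout (assumed invocation)."""
    result = subprocess.run(
        ["python", "game.py", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def play_hand(
    choose_action: Callable[[dict], str],
    run: Callable[[list[str]], str] = run_engine,
    max_turns: int = 20,
) -> list[str]:
    """Read state, act, and repeat until the hand resolves.

    Returns the actions issued. `resolved` is a hypothetical field name;
    the real status JSON may signal completion differently.
    """
    actions: list[str] = []
    for _ in range(max_turns):
        state = json.loads(run(["status"]))
        if state.get("resolved"):
            break
        action = choose_action(state)
        if action not in VALID_ACTIONS:
            raise ValueError(f"unsupported action: {action}")
        run(["action", action])
        actions.append(action)
    return actions
```

Passing `run` as a parameter keeps the loop testable against a stub engine without touching the real game.py.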

Scoring criteria:

| Task | Hands | Pass criteria |
| --- | --- | --- |
| task_01 basic strategy drill | 30 | OPR at or above 0.80 |
| task_02 count-aware betting | 20 | OPR at or above 0.75, CABR at or above 0.60 |
| task_03 split/double edge cases | 15 | OPR at or above 0.73 |
| task_04 bankroll survival | 20 | OPR at or above 0.75 and bankroll above $0 |
| task_05 full session | 50 | OPR at or above 0.78, plus a valid session report |

A pass requires clearing all applicable thresholds. There is no partial credit. OPR at 0.74 on a task that requires 0.75 is a fail.
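
A minimal sketch of the all-or-nothing rule, using the v1.0 thresholds from the table above. The `Criteria` class and `passes` function are hypothetical names for illustration; the harness classifier may be structured differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Criteria:
    min_opr: float
    min_cabr: Optional[float] = None
    require_bankroll: bool = False
    require_report: bool = False


# Thresholds copied from the v1.0 scoring table.
CRITERIA = {
    "task_01": Criteria(min_opr=0.80),
    "task_02": Criteria(min_opr=0.75, min_cabr=0.60),
    "task_03": Criteria(min_opr=0.73),
    "task_04": Criteria(min_opr=0.75, require_bankroll=True),
    "task_05": Criteria(min_opr=0.78, require_report=True),
}


def passes(
    task: str,
    opr: float,
    cabr: Optional[float] = None,
    bankroll: Optional[float] = None,
    report_valid: bool = False,
) -> bool:
    """Return True only if every applicable threshold is cleared."""
    c = CRITERIA[task]
    if opr < c.min_opr:
        return False
    if c.min_cabr is not None and (cabr is None or cabr < c.min_cabr):
        return False
    if c.require_bankroll and (bankroll is None or bankroll <= 0):
        return False
    if c.require_report and not report_valid:
        return False
    return True
```

For example, `passes("task_02", 0.74, cabr=0.90)` is False: a strong CABR does not compensate for missing the OPR threshold by 0.01.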


The scoreboard

[Observed]

| Task | Passed | Pass rate |
| --- | --- | --- |
| task_01 basic strategy drill | 0/3 | 0% |
| task_02 count-aware betting | 2/3 | 67% |
| task_03 split/double edge cases | 1/3 | 33% |
| task_04 bankroll survival | 0/3 | 0% |
| task_05 full session | 0/3 | 0% |
| Total | 3/15 | 20% |

(verified: pass_rate_by_task.csv)

14 of 15 runs completed without infrastructure errors. The model ran the game engine. The failures are in strategy quality: 12 of 15 runs were classified wrong_answer.


How it broke: three patterns across 15 runs

[Observed]

Three patterns appear across the evidence bundles. All three span multiple tasks.

Long tail turn count (15/15 runs). Every run shows it. Optimal basic strategy decisions should not require polling game state repeatedly: once the result of a game.py action hit is known, the hand state does not change until the next action. Instead, per-task averages ranged from 14.0 to 19.3 tool calls per run, with a maximum of 29 on task_02. Sonnet re-queries the game engine for confirmation on every move rather than tracking state internally. (verified: long_tail_turn_count.md, 15 entries, all runs)

Diagnosis-then-regression (10/15 runs). The model produces a correct strategy call in its reasoning text, then invokes a different action. On task_03 split/double edge cases, the reasoning identifies the right play and the tool call issues a different action: action hit instead of action split. Strategy knowledge is present. The translation to a tool invocation is where it breaks. (verified: diagnosis_then_regression.md, 10 entries)
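
One way to flag this pattern in a transcript is to compare the last action named in the reasoning text with the action actually issued. The sketch below is deliberately naive and is not the evidence-bundle classifier: `stated_action` and `is_regression` are hypothetical helpers, and plain keyword matching will misread reasoning that names several actions or negates one.

```python
import re
from typing import Optional

# The four actions the engine accepts, matched as whole words.
ACTION_PATTERN = re.compile(r"\b(hit|stand|split|double)\b", re.IGNORECASE)


def stated_action(reasoning: str) -> Optional[str]:
    """Return the last action word mentioned in the reasoning, if any."""
    matches = ACTION_PATTERN.findall(reasoning)
    return matches[-1].lower() if matches else None


def is_regression(reasoning: str, issued_action: str) -> bool:
    """True when the reasoning names one action and the tool call issues another."""
    stated = stated_action(reasoning)
    return stated is not None and stated != issued_action.lower()
```

A hit of this check is a candidate for manual review, not a verdict.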

Tool call redundancy (8/15 runs). Consecutive identical tool calls in back-to-back turns. From the evidence bundle: task_04_bankroll_survival run2, turns 15 through 17, three consecutive game.py action hit calls. The pattern is consistent with the model losing track of game state under bankroll pressure and repeating its most recent action. (verified: tool_call_redundancy.md, 21 entries across 8 runs)
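
Runs of identical consecutive calls are straightforward to find mechanically. A minimal sketch, assuming the transcript has already been reduced to an ordered list of tool-call strings; `redundant_runs` is a hypothetical helper, not part of the harness.

```python
from itertools import groupby


def redundant_runs(
    calls: list[str], first_turn: int = 1, min_length: int = 2
) -> list[tuple[int, int, str]]:
    """Return (start_turn, end_turn, call) for each run of identical consecutive calls."""
    runs = []
    turn = first_turn
    for call, group in groupby(calls):
        length = sum(1 for _ in group)
        if length >= min_length:
            runs.append((turn, turn + length - 1, call))
        turn += length
    return runs
```

Fed turns 14 through 18 of a transcript shaped like the task_04 example (the status calls at turns 14 and 18 are invented for illustration), it reports a single run spanning turns 15 to 17. Repeated hits can be legitimate on a run of small cards, so each flagged run still needs a look at the hand state.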


task_01 and task_02 swapped

[Observed]

Pre-run predictions scored 1/7 (14%) (verified: campaigns/2026-05-25-claude-sonnet-4-6-casino-strategy-v1.predictions.md). The miscalibrations were specific, not scattered, and the task_01/task_02 inversion is the most instructive.

| Prediction | Actual | Score |
| --- | --- | --- |
| task_01 at or above 2/3 | 0/3 | wrong |
| task_02 at or below 1/3 | 2/3 | wrong, opposite direction |

The prediction model expected task_01 (basic strategy drill) to be straightforward for Sonnet 4.6. The basic strategy chart is extensively published, and Sonnet almost certainly encountered it during training. task_02 (count-aware betting) was expected to be harder, because tracking a running count under multi-turn pressure sounded more cognitively demanding.

The actual results inverted the assumption.

task_01 requires OPR at or above 0.80 across 30 consecutive hands. The model starts fine and degrades. Soft hand decisions, pair splits, and marginal calls on specific matchups (16 vs 10, 9-9 vs 7) all count toward the same average. OPR is cumulative, so a wrong play is never erased, only diluted by later correct ones. By hand 22, enough prior context has accumulated that recall of strategy edge cases starts to slip, and a few wrong answers push OPR below the threshold.
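
The arithmetic behind the threshold can be made concrete. The sketch below assumes OPR is simply correct hands divided by hands played, with one scored decision per hand; the harness may score per decision instead, in which case the counts change but the shape of the argument does not. `error_budget` and `still_reachable` are hypothetical helper names.

```python
import math
from fractions import Fraction


def error_budget(hands: int, threshold: str) -> int:
    """Most wrong hands a run can absorb and still meet the threshold.

    The threshold is passed as a string so Fraction can parse it exactly,
    avoiding floating-point rounding at the boundary.
    """
    required_correct = math.ceil(Fraction(threshold) * hands)
    return hands - required_correct


def still_reachable(correct: int, played: int, hands: int, threshold: str) -> bool:
    """Can the run still pass if every remaining hand is played correctly?"""
    errors = played - correct
    return errors <= error_budget(hands, threshold)
```

Under that assumption, task_01 tolerates six wrong hands out of 30. A run with seven errors by hand 22 has already failed, whatever happens on the last eight hands.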

task_02 requires something different. The Hi-Lo true count is exposed directly in the game state JSON. The model needs to read one number and apply a scaling rule: high count, raise the bet; low count, lower it. Each hand produces a single decision that does not depend on prior hands. Two of three runs passed.
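
For readers unfamiliar with Hi-Lo, here is a short sketch of the count and of one possible scaling rule. The card tags and the true-count division are the standard Hi-Lo system. The bet ramp is a hypothetical example, not the rule CABR is scored against, and the base bet and unit cap are invented.

```python
import math


def hi_lo_value(rank: str) -> int:
    """Hi-Lo tag: 2-6 count +1, 7-9 count 0, tens, faces and aces count -1."""
    if rank in {"2", "3", "4", "5", "6"}:
        return 1
    if rank in {"7", "8", "9"}:
        return 0
    if rank in {"10", "J", "Q", "K", "A"}:
        return -1
    raise ValueError(f"unknown rank: {rank}")


def running_count(ranks: list[str]) -> int:
    """Sum of Hi-Lo tags over every card seen so far."""
    return sum(hi_lo_value(r) for r in ranks)


def true_count(running: int, decks_remaining: float) -> float:
    """Running count normalised by the number of decks left in the shoe."""
    return running / decks_remaining


def bet_for_count(tc: float, base_bet: int = 10, max_units: int = 8) -> int:
    """Hypothetical ramp: one unit per whole true-count point, clamped to 1..max_units."""
    units = max(1, min(max_units, math.floor(tc)))
    return base_bet * units
```

In the harness the model never has to compute the count itself, only the last step: read the true count from the state and size the bet.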

[Speculation]

This suggests Sonnet 4.6 performs better on sparse high-value decisions than on sustained decision quality across a session. Bet-sizing is one correct action per hand. Basic strategy is thirty correct actions where each prior decision contributes to context load. Whether this is a training-data effect or a context-pressure effect is unresolved. The four-model leaderboard offers partial evidence: Claude Haiku 4.5 passes task_02 3/3 and also passes task_05 (50 hands), while Sonnet fails both task_01 and task_05. If Haiku handles sustained play better than Sonnet despite similar parameter scale, the context-pressure explanation is weaker than the training-data one. We do not have enough data to choose between them.


Cost profile

[Observed]

Total: $5.67 across 15 runs ($0.378/run average).

| Task | Total | Avg/run |
| --- | --- | --- |
| task_02 count-aware betting | $1.67 | $0.56 |
| task_05 full session | $1.64 | $0.55 |
| task_04 bankroll survival | $1.12 | $0.37 |
| task_03 split/double edge cases | $0.74 | $0.25 |
| task_01 basic strategy drill | $0.50 | $0.17 |

(verified: cost_breakdown.csv)

The cost driver is context re-injection. Each game.py status call returns full game state JSON, which accumulates in conversation history. At 14 to 19 tool calls per run, input tokens grow fast. task_02 accumulated 476K input tokens across three runs (158K/run average) (verified: cost_breakdown.csv). casino-strategy-v1 averaged $0.378/run across all 15 runs. agentic-core-v1 averaged $0.048/run on this model ($1.44 total across 30 runs; verified: 2026-05-15-claude-sonnet-4.6-agentic-core-v1 cost_breakdown.csv). That puts casino-strategy-v1 at approximately 8x the per-run cost.
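
A rough model shows why re-injection compounds. If every turn re-sends the prompt plus all earlier tool results, total input tokens grow with the square of the number of tool calls. The token sizes in the usage note are invented for illustration, and the model ignores output tokens and any prompt caching. The ratio at the end uses the reported totals.

```python
def cumulative_input_tokens(
    tool_calls: int, base_prompt: int, tokens_per_result: int
) -> int:
    """Total input tokens when each turn re-sends the full history so far."""
    total = 0
    history = base_prompt
    for _ in range(tool_calls):
        total += history              # this turn's input is everything so far
        history += tokens_per_result  # the tool result then joins the history
    return total


# Per-run cost comparison from the reported campaign totals.
casino_per_run = 5.67 / 15   # casino-strategy-v1
core_per_run = 1.44 / 30     # agentic-core-v1
cost_ratio = casino_per_run / core_per_run
```

With a hypothetical 1,000-token prompt and 500-token results, 10 tool calls cost 32,500 input tokens and 20 cost 115,000, so doubling the calls more than triples the input. The reported totals give a per-run ratio of about 7.9.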

For a multi-model roster run, this matters. Estimated cost at similar model pricing: $50 to $80, depending on roster size.


We were wrong about task_01

[Observed]

The point estimate was 9/15 (60%). Actual: 3/15 (20%). A 40-point miss.

The prediction anchored on agentic-core-v1 performance: if a model scores 93% on sequential task execution, it should score somewhere in the 60% range on a different execution harness. That logic fails in the casino domain because the constraint is not sequential execution per se. It is sequential execution with no error recovery under a strict threshold. agentic-core-v1 has slack: partial completion still passes some tasks. casino-strategy-v1 does not. A wrong split on hand 4 still counts against the OPR average at hand 30, no matter how well the remaining 26 hands are played.

Beyond the task_01/task_02 inversion, the cost prediction was also wrong. The pre-run estimate anchored on agentic-core-v1 per-run cost without accounting for game-state context accumulation. task_02’s $0.56/run average was the most expensive individual task, which was not predicted.

[Unobserved]

We did not sample transcripts to identify whether task_01 failures are concentrated on specific hand types (soft 17, 9-9 against a 7, 16 against a 10) or distributed across all hand types. That distinction matters for understanding whether this is a knowledge gap on specialist plays or a general context-pressure effect. It is on the list for the next Sonnet run if the multi-model data does not resolve it first.


Harness validation

[Observed]

The harness ran 15/15 tasks to completion with valid JSON output. No stuck runs, no tool-call routing failures. The classifier applied OPR, CABR, and bankroll conditions correctly. Evidence bundles captured real behavioral patterns.

The threshold question is open. task_01’s 0/3 result with our strongest baseline model at the time suggests two possibilities: either OPR at or above 0.80 over 30 hands is calibrated correctly and Sonnet genuinely cannot sustain that play quality, or the threshold is too aggressive and needs recalibration. The data from one model cannot distinguish these.

ModelClaw PR #111 (merged 2026-05-27) added a tool-call guard to casino-strategy-v1 and recalibrated task_04 to OPR-only after the multi-model run revealed a dual-constraint design problem. Sonnet 4.6’s task_04 result (0/3) is unchanged either way.

The four-model leaderboard results (Mistral Large 3 at 67%, Haiku 4.5 at 40%, GPT-4o Turbo and Sonnet tied at 20%) are documented separately: casino-strategy-v1 leaderboard. Mistral Large 3 is the only model to pass task_01 at all (1/3), which is enough to confirm the threshold is reachable. task_01 is not miscalibrated. Sonnet’s 0/3 is a Sonnet result.
