Opus 4.8 dominates casino-strategy-v1 — except for the one task it can't crack

Campaign: 2026-06-19-claude-opus-4-8-casino-strategy-v1
Model: Claude Opus 4.8 (Bedrock us-east-1, cross-region inference profile)
Harness: casino-strategy-v1 v1.1 (tool-call guard active)
Runs: 15 (5 tasks × 3)
Campaign date: 2026-06-19


Mistral Large 3 has sat at the top of the casino-strategy-v1 leaderboard since May at 66.7%. Every model we’ve run since has landed below it. We added three Bedrock models in one campaign batch: Claude Opus 4.8, Amazon Nova Pro, and Llama 3.3 70B.

Opus 4.8 ended Mistral’s run.

It scored 13/15 (86.7%), a 20-point gap over the prior leader. It’s the only model in the dataset to pass task_01, the basic strategy drill, three times running. The one task it didn’t beat — task_03, the split and double edge-case suite — has beaten every model we’ve tested, and Opus 4.8 only partially escaped it. That anomaly is the most interesting thing in this report.


What casino-strategy-v1 tests

[Observed]

Five tasks, three runs each, 15 runs total. The model plays blackjack against a command-line game engine (game.py) and interacts with it via tool calls: deal a hand, hit, stand, split, double. The engine returns game state; the model decides what to do. Pass criteria vary per task and require clearing score thresholds, not just making reasonable plays.

A run either passes or fails. OPR at 0.74 when the threshold is 0.75 is a failure. Partial credit does not exist.
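The binary pass rule above can be written as a one-line check. This is a minimal sketch, not the harness code: the function name `run_passes` is hypothetical, and treating a score exactly at the threshold as a pass is an assumption the report does not confirm.

```python
def run_passes(score: float, threshold: float) -> bool:
    """Binary pass criterion: no partial credit.

    Assumption: a score exactly equal to the threshold counts as a pass.
    """
    return score >= threshold


# The example from the text: 0.74 against a 0.75 threshold is a failure.
print(run_passes(0.74, 0.75))
print(run_passes(0.75, 0.75))
```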


The results

[Observed]

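| Task | Runs | Passed | Rate |
|---|---|---|---|
| task_01 basic strategy drill | 3 | 3 | 100% |
| task_02 count-aware betting | 3 | 3 | 100% |
| task_03 split/double edge cases | 3 | 1 | 33% |
| task_04 bankroll survival | 3 | 3 | 100% |
| task_05 full session | 3 | 3 | 100% |
| Total | 15 | 13 | 86.7% |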

(verified: pass_rate_by_task.csv)

Four of five tasks: clean. task_03: one pass, two failures. The failure modes from the full 15 runs are 13 passed, 1 wrong_answer, and 1 budget_exhausted (verified: failure_mode_histogram.csv). Both failures land in task_03.
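As a quick sanity check, the totals can be recomputed from the per-task counts. The sketch below hard-codes the numbers from the results table; the dictionary layout and the helper name `pass_rate` are illustrative choices, not the format of pass_rate_by_task.csv.

```python
# (runs, passed) per task, copied from the results table in this report.
results = {
    "task_01": (3, 3),
    "task_02": (3, 3),
    "task_03": (3, 1),
    "task_04": (3, 3),
    "task_05": (3, 3),
}


def pass_rate(passed: int, runs: int) -> float:
    """Fraction of runs that passed."""
    return passed / runs


total_runs = sum(runs for runs, _ in results.values())
total_passed = sum(passed for _, passed in results.values())
failures = total_runs - total_passed  # should match wrong_answer + budget_exhausted

for task, (runs, passed) in results.items():
    print(f"{task}: {passed}/{runs} = {pass_rate(passed, runs):.0%}")
print(f"total: {total_passed}/{total_runs} = {pass_rate(total_passed, total_runs):.1%}")
```

The two failures it reports line up with the one wrong_answer and one budget_exhausted entry in the failure-mode histogram.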

The budget_exhausted failure is the run that played to completion — all 15 hands — but consumed the highest token budget of any task_03 run. In the transcripts, Opus reasoned through each decision before committing, and the edge-case hand set task_03 uses triggered more deliberation than standard distribution hands. That deliberation is expensive. It is also, arguably, correct behaviour. The problem is that extended reasoning on each hand compounds across 15 hands until the budget ceiling arrives.


How Opus plays

[Observed]

Opus 4.8 uses 11–21 tool calls per task, averaging 21.0 on task_02, the count-aware betting task (verified: tool_calls_by_task.csv). That’s genuine interactive play. It deals hands, receives state, decides, then deals again — rather than pre-computing a strategy and submitting it in bulk. On task_01 (basic strategy drill), it went further: midway through run 1, it paused and wrote a Python automation script:

“Now I understand the format. Let me write a Python script that implements basic strategy and plays out all hands automatically by calling game.py via subprocess.” — run f5da0591, task_01_basic_strategy_drill_run1, turn 13

That’s a different approach from every prior model in this dataset. Mistral Large 3 played hand-by-hand throughout. Haiku 4.5 did too. Opus decided the manual approach was slower than scripting the strategy and swapped to automation mid-run.

The script approach on task_01 — the task every prior model has fumbled — is almost certainly why it passed 3/3. (verified: task_01 was the floor task for the full dataset before this campaign; only Mistral Large 3 ever passed it, at 1/3, per casino-strategy-v1-leaderboard-2026.mdx.)

On task_03, the edge-case task, Opus played hand-by-hand and reasoned aloud through the counterintuitive spots. Here is one of the failing runs (82ccd62d) encountering a borderline double:

“Hand 10: 3,8 = hard 11 vs dealer K(10). Double down on 11 vs 2-9, but vs 10 = hit (basic strategy: 11 vs 10 hit in S17… actually 11 double vs 2-10, hit vs A). The rules here say ‘Double down on 10 or 11 vs dealer 2-9’. So vs 10, hit.” — run 82ccd62d, task_03_split_double_edge_cases_run1, turn 23

It got the decision right there. The run failed on other hands, not this one. But the turn shows the difference in processing mode: task_01 gets a script; task_03 gets live deliberation. One approach scales cleanly, the other stumbles on the curated hard cases.
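To make the scripted approach concrete, here is a rough sketch of what such an automation script might contain. It is not the script Opus wrote. The hard-total table is a simplified version of basic strategy that uses the double rule quoted in the transcript above ("10 or 11 vs dealer 2-9"), and `send_action` assumes a hypothetical command-line interface for game.py whose real arguments are not documented in this report.

```python
import subprocess
import sys


def hard_total_action(total: int, dealer_up: int) -> str:
    """Simplified hard-total basic strategy.

    dealer_up is 2-10, or 11 for an Ace. Doubling is limited to 10 or 11
    against a dealer 2-9, per the rule quoted in the transcript.
    """
    if total >= 17:
        return "stand"
    if 13 <= total <= 16:
        return "stand" if 2 <= dealer_up <= 6 else "hit"
    if total == 12:
        return "stand" if 4 <= dealer_up <= 6 else "hit"
    if total in (10, 11):
        return "double" if 2 <= dealer_up <= 9 else "hit"
    return "hit"


def send_action(action: str, engine: str = "game.py") -> str:
    """Hypothetical engine call: the actual game.py arguments are an assumption."""
    result = subprocess.run(
        [sys.executable, engine, action],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hand 10 from the transcript: hard 11 against a dealer 10.
print(hard_total_action(11, 10))
```

A lookup like this makes the same decision on every hand, which is the property the live deliberation on task_03 lacked.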

12 of 15 runs used more than 12 turns out of the default 15 (verified: long_tail_turn_count.md). Opus is thorough. It re-checks state, validates decisions, confirms completion before writing the session report. That costs tokens on every task — but it also means the work is actually done.


Why task_03 is still unsolved

[Observed + Speculation]

task_03 had a 0% pass rate across every model before this campaign. Opus 4.8 went 1/3, which is progress, but the single passing run had OPR=1.0 (perfect) while the two failures fell short of the 0.73 threshold (verified: task_03 run data). The variance within a single model on the same task is large.

The hands task_03 selects for are genuinely counterintuitive. Standing on 9-9 against a 7 is correct (you split against 2-6 and 8-9, and stand against 7, 10, and Ace). Doubling Ace-7 against a 4 is correct (soft 18 doubles against 3-6). Most models, and most casual human players, either don’t know these rules or fail to apply them consistently under time and token pressure.
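The two rules in the parentheses above are small enough to encode directly. The sketch below is illustrative only: the function names are hypothetical, and the non-double plays for soft 18 (stand against 2, 7, and 8; hit against 9, 10, and Ace) follow the common stand-on-soft-17 chart, which is an assumption rather than something this report states.

```python
def pair_of_nines_action(dealer_up: int) -> str:
    """9-9: split against 2-6 and 8-9; stand against 7, 10, and Ace (11)."""
    return "split" if dealer_up in (2, 3, 4, 5, 6, 8, 9) else "stand"


def soft_18_action(dealer_up: int) -> str:
    """Ace-7: double against 3-6 (from the text).

    Assumption: the remaining plays follow the common S17 chart.
    """
    if 3 <= dealer_up <= 6:
        return "double"
    if dealer_up in (2, 7, 8):
        return "stand"
    return "hit"  # dealer 9, 10, or Ace (11)


print(pair_of_nines_action(7))
print(soft_18_action(4))
```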

Opus’s failure mode is not ignorance — it states the correct rule in the transcripts on some of the hard cases. It’s inconsistency. [Speculation] The hypothesis is that task_03’s curated hand set surfaces a gap between “can recite the rule” and “applies the rule reliably when multiple edge cases arrive in sequence.” Whether that gap closes with a longer context window, a different prompting strategy, or simply more runs is an open question.

No cross-task consistency failure pattern appeared in the data (verified: cross_task_consistency.md). The task_03 issue is isolated to task_03.


The cost story

[Observed]

$36.87 for 15 runs (verified: cost_breakdown.csv). task_05 (50-hand full session) consumed $11.41 alone — $3.80 per run. task_02 (count-aware betting) was the second-highest at $8.08. The cost scales with hand count and reasoning depth: more hands, more deliberation per hand, higher spend.

For context, Amazon Nova Pro ran 15 casino-strategy-v1 runs for $0.08 total and scored 9/15 (60.0%). That’s a 27-point deficit to Opus at roughly 1/460th of the cost. Llama 3.3 70B ran for $1.06 and scored 8/15 (53.3%).

The question “is the frontier model worth the cost?” depends on what you’re buying. If the goal is to pass the easy tasks (task_02, task_04, task_05), Nova Pro gets three of those for $0.08. If you need task_01, the basic strategy drill that had a 1/12 record across four prior models, Opus 4.8 at $36.87 is the only option that works.
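The ratios quoted in this section can be reproduced from the campaign totals. The sketch below uses only the cost and score figures reported here; cost per passed run is an extra derived metric added for illustration, and the helper names are hypothetical.

```python
# (total cost in USD, runs passed) for the three models in this batch.
campaign = {
    "Claude Opus 4.8": (36.87, 13),
    "Amazon Nova Pro": (0.08, 9),
    "Llama 3.3 70B": (1.06, 8),
}
RUNS_PER_MODEL = 15


def cost_per_pass(cost: float, passed: int) -> float:
    """Dollars spent per passing run."""
    return cost / passed


def point_gap(passed_a: int, passed_b: int, runs: int = RUNS_PER_MODEL) -> float:
    """Difference in pass rate between two models, in percentage points."""
    return (passed_a - passed_b) / runs * 100


opus_cost, opus_passed = campaign["Claude Opus 4.8"]
nova_cost, nova_passed = campaign["Amazon Nova Pro"]

cost_ratio = opus_cost / nova_cost
gap = point_gap(opus_passed, nova_passed)

print(f"Opus vs Nova Pro: {cost_ratio:.0f}x the cost, {gap:.1f} points ahead")
for name, (cost, passed) in campaign.items():
    print(f"{name}: ${cost_per_pass(cost, passed):.3f} per passing run")
```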


Updated leaderboard

[Observed]

| Rank | Model | Score | Cost (15 runs) |
|---|---|---|---|
| 1 | Claude Opus 4.8 | 13/15 (86.7%) | $36.87 |
| 2 | Mistral Large 3 | 10/15 (66.7%) | |
| 3 | Amazon Nova Pro | 9/15 (60.0%) | $0.08 |
| 4 | Llama 3.3 70B | 8/15 (53.3%) | $1.06 |
| 5 | Claude Haiku 4.5 | 6/15 (40.0%) | |
| 6 | GPT-4o Turbo | 3/15 (20.0%) | |
| 6 | Claude Sonnet 4.6 | 3/15 (20.0%) | |

Task-level breakdown across all 7 models: see casino-strategy-v1-leaderboard-bedrock-expansion-2026.mdx (when published).


What we don’t know yet

[Unobserved]

We did not check whether the budget_exhausted failure in task_03 correlates with the specific hand index where budget ran out. The transcript ran to completion; the failure is in OPR, not turn count. Whether extending the budget would change the outcome is untested.

[Speculation] The 1/3 task_03 pass rate may not be reproducible at a higher run count. A single passing run out of three leaves a wide confidence interval. Whether Opus 4.8 stabilises above the 0.73 OPR threshold on task_03 across 10+ runs, or whether the pass was an outlier within normal variance, is an open question that would require a targeted re-run.
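To put a rough number on how wide that interval is, one option is a Wilson score interval on 1 pass out of 3 runs. This is a sketch under the assumption that runs are independent with a fixed pass probability, which the data here cannot confirm; the choice of the Wilson method and the 95% level (z = 1.96) are ours, not part of the campaign analysis.

```python
import math


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = passes / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half_width = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(1, 3)
print(f"task_03 pass rate, 1/3 observed: {low:.1%} to {high:.1%}")
```

With only three runs the interval spans most of the range between 0 and 1, which is why a targeted re-run at a higher count would be needed to say anything firm.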

The task_03 anomaly has now appeared across seven models and 21 runs (3 per model). The consistent failure pattern suggests the task design is surfacing something real about how models handle curated edge-case gauntlets versus standard-distribution play. Whether fixing that gap is a training problem, a prompting problem, or a task design problem is not something this data resolves.

ClawWorks Weekly

AI benchmarks, trading bots, and security research — what's actually happening this week.