Opus 4.8 wins at blackjack. The gap is 20 points.

Original campaign IDs (May 2026): 2026-05-25-claude-sonnet-4-6-casino-strategy-v1, 2026-05-26-gpt-4o-turbo-casino-strategy-v1, 2026-05-26-mistral-large-3-casino-strategy-v1, 2026-05-26-claude-haiku-4-5-casino-strategy-v1
Bedrock expansion campaign IDs (June 2026): 2026-06-19-claude-opus-4-8-casino-strategy-v1, 2026-06-19-nova-pro-casino-strategy-v1, 2026-06-19-llama3.3-70b-casino-strategy-v1
Harness: casino-strategy-v1 v1.1 (tool-call guard active)
Total runs: 105 (7 models x 15 runs each)
Campaign dates: 2026-05-25 to 2026-06-19

Harness update history: ModelClaw PR #111 (merged 2026-05-27) added a tool-call guard (any run with zero tool_result turns scores as an automatic fail) and recalibrated task_04 as a dual constraint (opr>=0.75 and ending bankroll above $0). The June 2026 Bedrock expansion campaigns ran under this v1.1 harness. GPT-4o Turbo’s original task_05 passes (zero tool calls) are invalidated under the guard; its task_02 passes are confirmed valid. See “A model that never played the game” for the full re-run analysis.
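The two v1.1 rules can be sketched in a few lines of Python. This is a minimal illustration, not the harness's actual code: the `RunRecord` type, its field names, and both function names are hypothetical. Only the two rules themselves (zero tool_result turns is an automatic fail; task_04 needs opr>=0.75 and an ending bankroll above $0) come from the PR description above.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Hypothetical per-run summary; the field names are assumptions."""
    task_id: str
    tool_result_turns: int   # tool_result turns seen in the transcript
    opr: float               # optimal play rate, 0.0 to 1.0
    ending_bankroll: float   # dollars left when the session ends


def passes_tool_call_guard(run: RunRecord) -> bool:
    # v1.1 guard: a run with zero tool_result turns is an automatic fail.
    return run.tool_result_turns > 0


def passes_task_04(run: RunRecord) -> bool:
    # Dual constraint: both conditions must hold on the same run.
    return (
        passes_tool_call_guard(run)
        and run.opr >= 0.75
        and run.ending_bankroll > 0
    )
```

Under this sketch, a run that reports a high OPR but never received a tool result still fails, which is the case the guard was added to catch.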


The original four-model run ended with a question. Mistral Large 3 led at 67%. Was that the ceiling, or just where the first four models topped out?

Three more models arrived in June 2026, all three from AWS Bedrock: Claude Opus 4.8, Amazon Nova Pro, Llama 3.3 70B. The answer to the ceiling question came back quickly. 67% is not the ceiling. The new ceiling is 86.7%, and the gap between first and second place is 20 points. The only other gap that wide between consecutive ranks is the drop from Claude Haiku 4.5 (40%) to the two models tied at 20%.

One of the new additions also ran its full 15-run campaign for $0.08. The decimal point is in the correct place. Amazon Nova Pro, third on the leaderboard, cost less than the tip on a coffee.


What casino-strategy-v1 tests

[Observed]

Five tasks, three runs each, 15 runs per model. The game is standard blackjack, handled by a command-line engine (game.py) that the model queries for hand state and acts on with hit, stand, split, or double. Each task scores on a different criterion.

Partial credit does not exist. A run either clears all thresholds for its task or it does not.


The seven-model leaderboard

[Observed]

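| Rank | Model | Score | Pass Rate | Total Cost |
|---|---|---|---|---|
| 1 | Claude Opus 4.8 | 13/15 | 86.7% | $36.87 |
| 2 | Mistral Large 3 | 10/15 | 66.7% | $0.77 |
| 3 | Amazon Nova Pro | 9/15 | 60.0% | $0.08 |
| 4 | Llama 3.3 70B | 8/15 | 53.3% | $1.06 |
| 5 | Claude Haiku 4.5 | 6/15 | 40.0% | $0.08 |
| 6 | GPT-4o Turbo | 3/15 | 20.0% | $3.91 |
| 6 | Claude Sonnet 4.6 | 3/15 | 20.0% | $5.67 |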

(verified: pass_rate_by_task.csv for each campaign)
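The rank-to-rank gaps follow directly from the scores in the leaderboard. The sketch below recomputes them in percentage points; the variable and function names are ours, and the only inputs are the pass counts listed above.

```python
TOTAL_RUNS = 15

# (model, passing runs) copied from the leaderboard.
SCORES = [
    ("Claude Opus 4.8", 13),
    ("Mistral Large 3", 10),
    ("Amazon Nova Pro", 9),
    ("Llama 3.3 70B", 8),
    ("Claude Haiku 4.5", 6),
    ("GPT-4o Turbo", 3),
    ("Claude Sonnet 4.6", 3),
]


def consecutive_gaps(scores, total_runs=TOTAL_RUNS):
    """Return (upper, lower, gap in percentage points) for adjacent ranks."""
    ordered = sorted(scores, key=lambda s: s[1], reverse=True)
    gaps = []
    for upper, lower in zip(ordered, ordered[1:]):
        points = round(100 * (upper[1] - lower[1]) / total_runs, 1)
        gaps.append((upper[0], lower[0], points))
    return gaps


for upper, lower, points in consecutive_gaps(SCORES):
    print(f"{upper} -> {lower}: {points:.1f} points")
```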

Full task-level breakdown, all seven models:

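| Task | Opus 4.8 | Mistral L3 | Nova Pro | Llama 3.3 | Haiku 4.5 | GPT-4o | Sonnet 4.6 |
|---|---|---|---|---|---|---|---|
| task_01 basic strategy drill | 3/3 | 1/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 |
| task_02 count-aware betting | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 2/3 |
| task_03 split/double edge cases | 1/3 | 3/3 | 0/3 | 1/3 | 0/3 | 0/3 | 1/3 |
| task_04 bankroll survival | 3/3 | 0/3 | 3/3 | 3/3 | 0/3 | 0/3 | 0/3 |
| task_05 full session | 3/3 | 3/3 | 3/3 | 1/3 | 3/3 | 0/3 | 0/3 |
| Total | 13/15 | 10/15 | 9/15 | 8/15 | 6/15 | 3/15 | 3/15 |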
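The totals and pass rates can be re-derived from the task-level counts, which is a quick way to check the table against the leaderboard. A minimal sketch, with the per-task pass counts copied from the breakdown; the names `PASSES`, `leaderboard`, and `task_totals` are ours.

```python
RUNS_PER_TASK = 3
TASKS = ["task_01", "task_02", "task_03", "task_04", "task_05"]

# Passing runs per task (out of 3), in TASKS order.
PASSES = {
    "Claude Opus 4.8":   [3, 3, 1, 3, 3],
    "Mistral Large 3":   [1, 3, 3, 0, 3],
    "Amazon Nova Pro":   [0, 3, 0, 3, 3],
    "Llama 3.3 70B":     [0, 3, 1, 3, 1],
    "Claude Haiku 4.5":  [0, 3, 0, 0, 3],
    "GPT-4o Turbo":      [0, 3, 0, 0, 0],
    "Claude Sonnet 4.6": [0, 2, 1, 0, 0],
}


def leaderboard(passes):
    """Return (model, passes, runs, pass rate %) sorted by passes, descending."""
    rows = []
    for model, per_task in passes.items():
        total = sum(per_task)
        runs = RUNS_PER_TASK * len(per_task)
        rows.append((model, total, runs, round(100 * total / runs, 1)))
    return sorted(rows, key=lambda r: r[1], reverse=True)


def task_totals(passes):
    """Return passing runs per task, summed across all models."""
    return {task: sum(p[i] for p in passes.values()) for i, task in enumerate(TASKS)}


for model, total, runs, pct in leaderboard(PASSES):
    print(f"{model:<18} {total:>2}/{runs}  {pct:.1f}%")
print(task_totals(PASSES))
```

Summing by task rather than by model shows where the passes come from: across all seven models, task_02 accounts for 20 of 21 possible passes and task_01 for 4 of 21.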

What did Opus 4.8 do that 18 other runs could not?

[Observed]

Across the six other models, task_01 has been attempted 18 times. The best result is Mistral Large 3 at 1/3, a single passing run. The other five went 0/3. Six models, one pass in 18 attempts.

Opus 4.8 passed task_01 on all three runs. 3/3.

The threshold is 30 hands played at or above 80% optimal play rate. Opus 4.8 cleared it consistently. The tool-call pattern in the Opus runs is methodical: each hand gets a full sequence of state queries, decision reasoning, action execution, confirmation. No turns are skipped. No state is assumed.
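As a rough sketch of the task_01 check: the run has to reach 30 hands and hold an optimal play rate of at least 80%. The source does not spell out how the harness computes OPR, so the definition below (share of decisions that match the optimal action) is an assumption, as are the function names and the input shape.

```python
def optimal_play_rate(decisions):
    """Share of decisions matching the optimal action.

    `decisions` is a list of (action_taken, optimal_action) pairs.
    This definition of OPR is an assumption, not the harness's code.
    """
    if not decisions:
        return 0.0
    matches = sum(1 for taken, optimal in decisions if taken == optimal)
    return matches / len(decisions)


def passes_task_01(hands_played, decisions, min_hands=30, min_opr=0.80):
    # Both conditions must hold: enough hands AND a high enough OPR.
    return hands_played >= min_hands and optimal_play_rate(decisions) >= min_opr


# Illustrative data only: 8 of 10 decisions match the optimal action.
sample = [("hit", "hit")] * 8 + [("stand", "hit")] * 2
print(passes_task_01(30, sample))  # enough hands, OPR at the threshold
print(passes_task_01(9, sample))   # same OPR, but the run stopped early
```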

task_01 had looked like a harness-design problem: running 30 hands with sustained attention and no drift seemed like more than the task could fairly ask. It is not a harness problem. Opus 4.8 shows the threshold can be cleared. It was a capability ceiling that held until this model arrived.

The full analysis of the Opus 4.8 campaign, including transcript evidence from task_01 and task_03, is in “Claude Opus 4.8 on casino-strategy-v1”.

[Unobserved]

We do not know whether a smaller Claude model from the same generation would also pass task_01. Whether the task_01 performance reflects scale, RLHF fine-tuning, or something specific to Opus 4.8’s training is unknown.


Nova Pro at $0.08: how does that happen?

[Observed]

Amazon Nova Pro completed all 15 runs for $0.08 total. Per passing run: $0.009. The cost is real. It comes from extremely lean token usage (verified: cost_breakdown.csv for 2026-06-19-nova-pro-casino-strategy-v1).

Nova Pro passes task_02, task_04, and task_05 (three of the five tasks) and scores 9/15 overall. That puts it ahead of Claude Haiku 4.5 ($0.08) by three passes, and ahead of GPT-4o Turbo ($3.91) and Claude Sonnet 4.6 ($5.67) by six.

The failure mode is early termination, not wrong decisions. On task_01, Nova Pro’s game_state evidence shows the model stopping mid-game — hand 9 of 30 in the first run, the same pattern across all three task_01 runs. On task_03, the same thing: session ends before all hands are dealt. Nova Pro is not making bad strategy calls. It is stopping before finishing.
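The distinction between stopping early and playing badly can be made explicit when labelling runs. A minimal sketch: `wrong_answer` is a label the campaign data already uses, while `early_termination`, the function name, and the parameters are hypothetical.

```python
def classify_run(hands_played, hands_required, opr, min_opr):
    """Label a run as a pass or by its failure mode.

    A run that stops before the required hand count is an early
    termination regardless of how well it played up to that point.
    """
    if hands_played < hands_required:
        return "early_termination"
    if opr < min_opr:
        return "wrong_answer"
    return "pass"


# A run that stops at hand 9 of 30 is flagged before OPR is considered.
print(classify_run(hands_played=9, hands_required=30, opr=0.95, min_opr=0.80))
```

Checking hand count first matters: an OPR computed over 9 hands says little about whether the model could sustain it over 30.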

This is a fixable problem. A targeted system prompt addition — explicitly instructing the model to continue until all hands are completed — would likely resolve the early-termination pattern without touching its strategy quality.

Full campaign analysis is in “Amazon Nova Pro on casino-strategy-v1”.

[Speculation]

A targeted 3-run retest of task_01 with the termination prompt fix would cost on the order of $0.02 (three runs at the campaign average of roughly $0.005 per run) and, if all three runs passed, would push Nova Pro’s score to 12/15 (60.0% to 80.0%). We have not run this.


task_04 goes from 0/12 to viable

[Observed]

The original four-model campaign produced 0/12 on task_04 — zero passing runs across all four models. That 0/12 result was interpreted as a potential harness calibration problem: the dual constraint (OPR and bankroll survival simultaneously) might have been too strict.

It was not a calibration problem.

Opus 4.8, Nova Pro, and Llama 3.3 70B all pass task_04 at 3/3. The threshold was right. The original four models could not meet it. Three Bedrock models meet it on every run.

The failure mode for the original four is not close losses. It is consistent non-passing — no near-misses on the bankroll constraint. The task discriminates correctly between models that have genuine bankroll-awareness and models that play strategy with no regard for the financial state.

[Unobserved]

We have not run a modified task_04 with a relaxed bankroll threshold, so we cannot confirm the dual constraint is optimally calibrated rather than simply passable by higher-capability models. The validation runs from May 2026 (Haiku 4.5 3/3 and Mistral Large 3 3/3 under OPR-only scoring) reflect a different checkpoint than the v1.1 dual-constraint criterion the June campaigns used.


Why is task_03 still unsolved by most models?

[Observed]

task_03 has now been attempted 21 times across seven models. Four models have at least one passing run: Mistral Large 3 (3/3), and Claude Opus 4.8, Llama 3.3 70B, and Claude Sonnet 4.6 at 1/3 each. Nova Pro, GPT-4o Turbo, and Claude Haiku 4.5 all went 0/3.

The task focuses on pairs and soft hands — the situations where the correct play diverges most from intuition. Split 8s against a dealer 10. Do not split 9s against a dealer 7. Double A-7 against a dealer 2. These are not guessable from general principles. A model needs to have absorbed enough blackjack strategy training data to know the table, not to reason toward it.
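“Knowing the table” amounts to a lookup rather than a calculation. The sketch below encodes only the three plays named above. A real strategy chart has many more cells and varies with the rule set, so the dictionary, its key format, and the function name are illustrative assumptions rather than the harness’s scoring table. “Do not split 9s against a 7” is recorded as a stand, which is the usual chart entry for that hand.

```python
# (player hand, dealer upcard) -> play. Only the three cases from the text.
EDGE_CASES = {
    ("8,8", 10): "split",
    ("9,9", 7): "stand",    # do not split; stand on the 18
    ("A,7", 2): "double",
}


def lookup_play(hand, dealer_upcard):
    """Return the charted play, or None if the cell is not in this sketch."""
    return EDGE_CASES.get((hand, dealer_upcard))


print(lookup_play("8,8", 10))
print(lookup_play("9,9", 7))
print(lookup_play("10,10", 6))  # not covered by this three-entry sketch
```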

Mistral Large 3’s 3/3 on task_03 remains the dataset’s cleanest result on the hardest task, and the most unexplained. Its 675B parameter scale may give it broader coverage of specialist blackjack strategy content. Opus 4.8 has higher overall capability and still goes 1/3 on task_03. Nova Pro solves task_04 but cannot finish task_03. The tasks are not measuring the same thing.

[Speculation]

task_03 may be testing memorised strategy table coverage rather than reasoning ability — which would explain why it does not correlate with overall performance. If that is correct, adding task_03 training examples to fine-tuning data would produce a higher task_03 score without improving performance on other tasks. We have not tested this.


Llama 3.3 70B: task_04 solved at $1.06 total

[Observed]

Llama 3.3 70B passes task_04 at 3/3, one of three models to solve the bankroll survival task alongside Opus 4.8 and Nova Pro. The original four models were 0/12 on task_04. With the June expansion, all three new entrants pass it clean.

Llama also fails task_01 at 0/3, the same floor that stops four other models. The gap between what Llama can and cannot do is sharp. It handles bankroll survival without trouble and cannot clear the basic strategy drill at all.

At $1.06 for a full 15-run campaign, it is the second-cheapest model in the dataset to pass task_04, behind Nova Pro at $0.08 and far below Opus 4.8 at $36.87. Cost per passing run: $0.133. The tool-call profile is compact, at 2.7 average calls per run, which suggests decisive play rather than repeated state queries.

Full campaign analysis is in “Llama 3.3 70B on casino-strategy-v1”.


What the original four look like now

[Observed]

Adding three models changes what the original results mean. Mistral Large 3’s 66.7% looked like a ceiling in May. In a seven-model dataset, it is second place, 20 points behind Opus 4.8.

The Haiku vs Sonnet inversion holds. Claude Haiku 4.5 at 40% still finishes 20 points above Claude Sonnet 4.6 at 20%, from the same provider, trained on the same infrastructure. Haiku costs $0.08 total; Sonnet costs $5.67. The per-pass cost gap — Haiku $0.013, Sonnet $1.89 — has not narrowed. Neither has the explanation. The Sonnet pattern is state confusion under multi-turn pressure: tool-call redundancy and diagnosis-then-regression on the longer tasks. Full analysis is in the original leaderboard analysis.

GPT-4o Turbo stays at 3/15, with all three passes on task_02. The original task_05 passes were invalidated by the tool-call guard. On task_01, legitimate engagement (genuine tool calls, $1.29/run) produces 0/3. task_03 and task_04 show infrastructure errors: zero latency, no game state reached. The full re-run analysis is in “A model that never played the game”.

Mistral Large 3 remains the only model to pass task_03 at 3/3. Seven models in, 21 attempts. One clean result.


What Mistral Large 3 did differently

[Observed]

Mistral is the only model to pass task_03 consistently (3/3), and the only model to go 3/3 on both task_03 and task_05. The tool-call data shows why it works on split/double: Mistral averaged 28 calls on task_03 and 45 on task_01. It plays the game, querying game state, making decisions, and working through each hand, rather than approximating the output.

Its failure modes are wrong_answer on task_01 and task_04: full engagement, not enough strategy accuracy. That profile differs from models that abandon the task early or loop on the same action.

At $0.77 total and 10/15 passes, Mistral’s cost-per-pass is $0.077 (verified: cost_breakdown.csv for 2026-05-26-mistral-large-3-casino-strategy-v1). On the current figures, only Haiku 4.5 ($0.013) had a lower cost-per-pass among the original four. In the seven-model dataset, Nova Pro at $0.009/pass now leads on that metric.

[Speculation]

Mistral Large 3’s 675B parameter scale may produce broader coverage of specialist blackjack strategy content in training data. Alternatively, the instruction-following behaviour from its agentic-core-v1 result (27/30) — high rule-adherence on structured tasks — may generalise to the casino harness. Both could be true. The dataset does not separate them.


Haiku beats Sonnet at 1% of the cost

[Observed]

Claude Haiku 4.5 and Claude Sonnet 4.6 scored nearly identically on agentic-core-v1: 27 and 28 out of 30. On casino-strategy-v1, Haiku finishes at 40%, Sonnet at 20%. Same provider, 20-point gap.

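The tool-call data makes the difference legible. Sonnet shows two patterns Haiku does not (verified: tool_call_redundancy.md and diagnosis_then_regression.md for 2026-05-25-claude-sonnet-4-6-casino-strategy-v1): redundant tool calls, where the same action is repeated consecutively after the model loses track of game state, and diagnosis-then-regression, where correct reasoning is followed by a wrong tool call.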

Haiku fails more cleanly. Wrong answer, single call, no state confusion. It does not spiral on longer tasks.

The casino harness forces a model to maintain game state across 14 to 50 consecutive tool interactions with no pause to reset. Sonnet degrades under that load. Haiku does not.
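One crude way to surface tool-call redundancy in a transcript is to count calls that repeat the immediately preceding action. This is a sketch of the idea, not the method behind tool_call_redundancy.md: the function name and the flat list-of-strings input are assumptions. It will also count legitimate repeats (two hits in a row on the same hand), so a real analysis would need to compare game state as well.

```python
from itertools import groupby


def redundant_call_count(actions):
    """Count calls that repeat the immediately preceding action.

    groupby() collapses runs of identical consecutive actions; every
    element in a run after the first counts as one repeat.
    """
    return sum(len(list(group)) - 1 for _, group in groupby(actions))


# Illustrative sequences only, not taken from any campaign transcript.
print(redundant_call_count(["state", "hit", "hit", "hit", "stand"]))
print(redundant_call_count(["state", "hit", "state", "stand"]))
```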

[Speculation]

Whether this reflects a training difference, a context-window handling difference, or harness-specific prompt sensitivity is unknown. Sonnet’s pattern has not appeared in the agentic-core-v1 runs, which have shorter sustained interaction sequences.


Where we were wrong

[Observed]

Rigg’s pre-run predictions for the original Mistral campaign put the score at 4-7/15 (27-47%), with task_05 at 0/3 and task_03 at 1-2/3. Actual: 10/15, task_05 at 3/3, task_03 at 3/3. Both task-level predictions were wrong in the same direction: the model performed significantly better than expected on the tasks requiring sustained engagement.

The 0/12 on task_04 across all original models was not anticipated in any brief. How demanding the dual constraint would be was visible only in hindsight.

Pre-run predictions were not filed for the June expansion campaigns. That is a workflow gap. Going forward, each campaign brief should include a predictions file before runs start, so post-run scoring can produce a predictions-reality diff. The leaderboard article is where that gap is most visible.
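A predictions-reality diff can be as simple as comparing each predicted range with the actual count. The sketch below uses the Mistral figures from this section as sample data; the dictionary layout and the function name are hypothetical, not an existing file format.

```python
# Predicted (low, high) ranges and actual pass counts for the Mistral campaign.
PREDICTED = {"total": (4, 7), "task_05": (0, 0), "task_03": (1, 2)}
ACTUAL = {"total": 10, "task_05": 3, "task_03": 3}


def prediction_diff(predicted, actual):
    """Return how far each actual value falls outside its predicted range.

    0 means the prediction held; positive means the model beat the range;
    negative means it fell short.
    """
    diff = {}
    for key, (low, high) in predicted.items():
        value = actual[key]
        if value < low:
            diff[key] = value - low
        elif value > high:
            diff[key] = value - high
        else:
            diff[key] = 0
    return diff


print(prediction_diff(PREDICTED, ACTUAL))
```

Every entry comes out positive for this campaign, which is the "wrong in the same direction" pattern in one line of output.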


Cost summary

[Observed]

| Model | Total cost | Cost per pass |
|---|---|---|
| Claude Opus 4.8 | $36.87 | $2.84 |
| Mistral Large 3 | $0.77 | $0.077 |
| Amazon Nova Pro | $0.08 | $0.009 |
| Llama 3.3 70B | $1.06 | $0.133 |
| Claude Haiku 4.5 | $0.08 | $0.013 |
| GPT-4o Turbo | $3.91 | $1.30 |
| Claude Sonnet 4.6 | $5.67 | $1.89 |

(verified: cost_breakdown.csv for each campaign)
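Cost per pass is total campaign cost divided by passing runs. The sketch below recomputes the column from the totals in the table; the names are ours. The totals are rounded to the cent, so the sub-cent figures for Nova Pro and Haiku 4.5 are approximate.

```python
# model -> (total cost in dollars, passing runs), copied from the tables.
CAMPAIGNS = {
    "Claude Opus 4.8":   (36.87, 13),
    "Mistral Large 3":   (0.77, 10),
    "Amazon Nova Pro":   (0.08, 9),
    "Llama 3.3 70B":     (1.06, 8),
    "Claude Haiku 4.5":  (0.08, 6),
    "GPT-4o Turbo":      (3.91, 3),
    "Claude Sonnet 4.6": (5.67, 3),
}


def cost_per_pass(total_cost, passes):
    """Dollars spent per passing run; infinite if nothing passed."""
    return total_cost / passes if passes else float("inf")


ranked = sorted(CAMPAIGNS.items(), key=lambda kv: cost_per_pass(*kv[1]))
for model, (cost, passes) in ranked:
    print(f"{model:<18} ${cost_per_pass(cost, passes):.3f} per pass")
```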

Nova Pro and Haiku 4.5 both spent $0.08 total. Nova Pro has three more passes (9 vs 6), so its cost-per-pass is lower. Opus 4.8’s $2.84 per pass is the highest in the dataset, about 1.5x Sonnet 4.6’s $1.89, and its $36.87 total is more than six times the next-largest bill. It is also the only model to pass task_01 consistently and the only one to score above 80%.

Whether Opus 4.8’s premium is worth it depends entirely on the use case. If task_01 consistency matters, there is no cheaper path to it. If task_01 is not a priority, Nova Pro ($0.009 per pass) or Llama ($0.133 per pass) deliver meaningful results for a small fraction of the cost.


What we still don’t know

[Unobserved]

We have not retested Nova Pro with a termination-fix prompt on task_01. Early termination (hand 9 of 30) is a consistent failure mode. Whether a targeted instruction change resolves it is unknown.

[Unobserved]

Mistral Small 4 was attempted during the June expansion. The runner hit 100% rate-limit failures on the first task — same blocker as the TASK-642 Mistral Medium 3.5 run. No successful turns. Campaign killed after ten minutes. Mistral Small 4 scored 29/30 on agentic-core-v1, which would make its casino comparison significant data. It is not in this leaderboard.

[Unobserved]

We have not tested any model above Opus 4.8 on the capability scale. Whether a stronger model would pass task_03 consistently is unknown. Opus 4.8 went 1/3; Mistral Large 3 went 3/3. The correlation between overall capability and task_03 performance is not linear.

[Speculation]

The Bedrock-only roster now spans 20% to 86.7% (Sonnet 4.6 to Opus 4.8). A Bedrock-only production deployment of this harness is viable, with four models covering the full cost-to-capability range. Whether that coverage is stable across harness versions — as prompts tighten or tasks are revised — is unknown.


Frequently Asked Questions

Which AI model scores highest on casino-strategy-v1?

Claude Opus 4.8 leads the leaderboard at 13/15 (86.7%) — the highest score recorded on this harness. It is the only model in the dataset to pass task_01 (basic strategy drill) consistently at 3/3. Mistral Large 3 is second at 10/15 (66.7%). The gap between first and second is 20 percentage points.

What is casino-strategy-v1?

casino-strategy-v1 is a blackjack strategy benchmark for AI models. Each model plays interactive blackjack against a command-line game engine across five tasks, 15 runs total. Tasks differ by hand count, scoring threshold, and constraint (bankroll survival, count-aware betting, split/double precision). A passing run clears all thresholds for its task. Scores are out of 15.

How does Amazon Nova Pro score so high at $0.08?

Amazon Nova Pro runs a full 15-run campaign for $0.08 by using a very lean token footprint per run. It passes three tasks (task_02, task_04, task_05) at 3/3 each, scoring 9/15 (60.0%). Its failure mode on task_01 and task_03 is early session termination, where the model stops mid-game before all hands are played, rather than wrong strategy calls.

Why is task_03 (split/double edge cases) so hard?

task_03 tests the non-obvious plays in blackjack strategy: splitting 8s against a dealer 10, not splitting 9s against a 7, doubling A-7 against a 2. These are not derivable from general principles. A model needs to have absorbed the strategy table directly. Across 21 attempts by seven models, only Mistral Large 3 (3/3) consistently passes it. Opus 4.8, Llama 3.3 70B, and Claude Sonnet 4.6 each went 1/3. Three models went 0/3.

Why did Claude Sonnet 4.6 score lower than Haiku 4.5?

Sonnet 4.6 scored 3/15 (20%) vs Haiku 4.5’s 6/15 (40%) despite being the higher-capability model. The inversion traces to Sonnet’s behaviour under multi-turn state pressure: redundant tool calls (same action repeated consecutively after losing track of game state) and diagnosis-then-regression (correct reasoning followed by a wrong tool call). Haiku fails more cleanly and does not show either pattern at the same rate.

What does casino-strategy-v1 measure that agentic-core-v1 does not?

casino-strategy-v1 tests sustained execution of a specific procedural strategy under repeated tool-call pressure. agentic-core-v1 tests planning, investigation, and ambiguity handling. A model can score 28/30 on agentic-core-v1 and 3/15 on casino-strategy-v1. They measure different capabilities, and performance on one does not predict performance on the other.
