Amazon Nova Pro scores 9/15 on casino-strategy-v1 for eight cents

June 20, 2026 · campaign-reports

Campaign: 2026-06-19-nova-pro-casino-strategy-v1
Model: Amazon Nova Pro (amazon.nova-pro-v1:0, Bedrock us-east-1)
Harness: casino-strategy-v1 v1.1 (tool-call guard active)
Runs: 15 (5 tasks × 3)
Campaign date: 2026-06-19

Amazon Nova Pro was the second model we ran in the Bedrock expansion batch. The first was Opus 4.8, which topped the leaderboard at 13/15. Nova Pro was meant to answer a different question: what does a cheap Bedrock model do with the same harness?

Total campaign cost: $0.08. That is not a rounding error or a formatting choice. The full number is $0.0826 across all 15 runs.

It scored 9/15 (60.0%), third on the leaderboard, ahead of Llama 3.3 70B, Claude Haiku 4.5, GPT-4o Turbo, and Claude Sonnet 4.6. The two tasks it failed have something in common: Nova Pro stopped playing mid-game on both of them. Not because it got confused. It just stopped, at the same hand in every single run. That pattern is the main thing this report is about.

What casino-strategy-v1 tests

[Observed]

Five tasks, three runs each, 15 runs total. The model plays blackjack against a command-line game engine (game.py), interacting through tool calls: deal a hand, hit, stand, split, double. The engine returns a JSON game state; the model reads it and decides what to do. Each task has its own pass threshold, so the model needs to hit a required score, not just play reasonable blackjack:

task_01 (basic strategy drill) — 30 hands, flat bet, optimal play rate (OPR) ≥ 0.80. OPR measures what fraction of decisions matched correct basic strategy. 24 of 30 hands must be right.
task_02 (count-aware betting) — 20 hands, running count visible in game state, OPR ≥ 0.75 and count-aware bet ratio (CABR) ≥ 0.60. The model scales bets with the count.
task_03 (split/double edge cases) — 15 hands, OPR ≥ 0.73. A curated set of counterintuitive decisions: pairs of 6s against high cards, soft doubles, 10 and 11 against face cards.
task_04 (bankroll survival) — 20 hands, $200 starting stack, OPR ≥ 0.75 and ending bankroll above $0.
task_05 (full session) — 50 hands, OPR ≥ 0.78, plus a valid session report.

Pass or fail. No partial credit.
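
The thresholds above can be written down as a small pass/fail check. This is a minimal sketch, not the harness's actual grader: the `TaskSpec` class, the `passes` function, and all field names are hypothetical, and only the numbers come from the task list. It also assumes a run must reach the required hand count to pass, which matches how the failures in this campaign were graded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical container for one task's pass criteria."""
    hands: int                             # required number of hands
    min_opr: float                         # optimal play rate threshold
    min_cabr: Optional[float] = None       # count-aware bet ratio (task_02 only)
    needs_positive_bankroll: bool = False  # task_04 only
    needs_session_report: bool = False     # task_05 only


# Numbers copied from the task descriptions; names are illustrative.
TASKS = {
    "task_01": TaskSpec(hands=30, min_opr=0.80),
    "task_02": TaskSpec(hands=20, min_opr=0.75, min_cabr=0.60),
    "task_03": TaskSpec(hands=15, min_opr=0.73),
    "task_04": TaskSpec(hands=20, min_opr=0.75, needs_positive_bankroll=True),
    "task_05": TaskSpec(hands=50, min_opr=0.78, needs_session_report=True),
}


def passes(task_id, hands_played, opr, cabr=None, bankroll=None, report_valid=False):
    """Return True only if every criterion for the task is met (no partial credit)."""
    spec = TASKS[task_id]
    if hands_played < spec.hands:
        return False  # session ended before the required hand count
    if opr < spec.min_opr:
        return False
    if spec.min_cabr is not None and (cabr is None or cabr < spec.min_cabr):
        return False
    if spec.needs_positive_bankroll and (bankroll is None or bankroll <= 0):
        return False
    if spec.needs_session_report and not report_valid:
        return False
    return True
```

Under this sketch, a run that stops at hand 9 of task_01 fails no matter how well those nine hands were played: `passes("task_01", 9, 1.0)` is `False`.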

The results

[Observed]

Task	Runs	Passed	Rate
task_01 basic strategy drill	3	0	0%
task_02 count-aware betting	3	3	100%
task_03 split/double edge cases	3	0	0%
task_04 bankroll survival	3	3	100%
task_05 full session	3	3	100%
Total	15	9	60.0%

(verified: pass_rate_by_task.sql)

Three tasks at 100%, two at 0%. No partial passes anywhere. The failure mode histogram shows all 6 failures are classified as wrong_answer. The harness rejected the output, not because the model crashed or timed out, but because the game session didn’t reach the required number of hands (verified: failure_mode_histogram.sql).
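
The arithmetic behind the table is simple enough to recheck by hand or in a few lines. The sketch below is a Python restatement of the table, not the contents of pass_rate_by_task.sql (which is not reproduced here); the `RESULTS` dictionary and function names are illustrative.

```python
# (runs, passed) per task, copied from the results table.
RESULTS = {
    "task_01": (3, 0),
    "task_02": (3, 3),
    "task_03": (3, 0),
    "task_04": (3, 3),
    "task_05": (3, 3),
}


def pass_rate(runs: int, passed: int) -> float:
    """Fraction of runs that passed."""
    return passed / runs


def summarize(results):
    """Aggregate per-task counts into campaign-level totals."""
    total_runs = sum(runs for runs, _ in results.values())
    total_passed = sum(passed for _, passed in results.values())
    return {
        "total_runs": total_runs,
        "total_passed": total_passed,
        "total_failed": total_runs - total_passed,
        "overall_rate": pass_rate(total_runs, total_passed),
        # True when every task is either 0% or 100%, i.e. no mixed results.
        "all_or_nothing": all(passed in (0, runs) for runs, passed in results.values()),
    }


summary = summarize(RESULTS)
```

The totals come out to 15 runs, 9 passes, and 6 failures, and the `all_or_nothing` flag confirms that no task had a mixed result.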

What Nova Pro did well

[Observed]

On task_02, task_04, and task_05, Nova Pro ran cleanly. All nine runs passed, with no close calls.

The tool-call footprint on these tasks is tiny. Count-aware betting (task_02) averaged 1.0 calls per run. Bankroll survival (task_04): 2.0 calls. Full session across 50 hands (task_05): 4.7 calls (verified: tool_calls_by_task.sql). For context: Opus 4.8 used 11–21 calls on those same tasks. Llama 3.3 70B used between 3 and 56.

Nova Pro appears to script longer game sequences in a single Python call — run a loop, play multiple hands, return the final state — rather than calling the engine once per hand. On the tasks it passes, that works well. A 50-hand session in under 5 tool calls, completed in under 8 seconds per run, for a fraction of a cent each.
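
To put the batching in numbers, the required hand counts can be divided by the average tool calls reported above. This is a rough sketch under two assumptions: each passing run played exactly the required number of hands, and every tool call played hands (some calls may have done other work, such as writing the task_05 session report). The dictionary and function names are illustrative.

```python
# Required hands per task (from the task list) and average tool calls
# per run (from the figures quoted in this section).
REQUIRED_HANDS = {"task_02": 20, "task_04": 20, "task_05": 50}
AVG_TOOL_CALLS = {"task_02": 1.0, "task_04": 2.0, "task_05": 4.7}


def hands_per_call(task_id: str) -> float:
    """Average hands covered per tool call, assuming the full session was played."""
    return REQUIRED_HANDS[task_id] / AVG_TOOL_CALLS[task_id]


for task_id in REQUIRED_HANDS:
    print(f"{task_id}: {hands_per_call(task_id):.1f} hands per call")
```

On those assumptions the passing tasks average between 10 and 20 hands per tool call.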

Where it stopped

[Observed]

task_01 requires 30 hands. Nova Pro’s runs ended at hand 9, every time. All three runs: hand 9 (verified: game_state.json, task_01_basic_strategy_drill_run1/2/3).

task_03 requires 15 hands. Nova Pro’s runs ended at hand 7. All three runs: hand 7 (verified: game_state.json, task_03_split_double_edge_cases_run1/2/3).

The consistency is the striking part. Three independent runs, same stopping point each time. This is not random. Something about the way Nova Pro approaches these two tasks produces a session that terminates at a fixed hand, across every attempt.

Looking at the run transcripts tells part of the story. On task_01 run1 (e507aba6), Nova Pro’s first tool call played 8 hands at once: the result came back with "hand_number": 8. Then it played one more hand in the second call, reaching hand 9. On the third call, the session ended. The model was mid-thinking, about to continue from hand 9, when the harness stopped it with 3 tool calls used (verified: tool_calls_by_task.sql, task_01 avg 3.0 calls per run).

On task_03 run1 (c8c6b0d2), the first call played 6 hands. The second call played one more. At hand 7, with 2 calls used, the transcript ends.

The pattern: Nova Pro batches a block of hands in its first call, then switches to single actions. On task_02, task_04, task_05, this strategy happens to complete the task. On task_01 and task_03, the combined hand count from the batch plus a few individual calls doesn’t reach the task requirement before the tool-call budget runs out.
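
The shortfall on the two failing tasks follows directly from the transcript counts. The sketch below only restates that arithmetic; the function names are hypothetical, and the batch sizes are the ones seen in run1 of each task.

```python
def hands_reached(first_batch: int, single_hand_calls: int) -> int:
    """Hands played after one batched call plus some one-hand-at-a-time calls."""
    return first_batch + single_hand_calls


def shortfall(required: int, first_batch: int, single_hand_calls: int) -> int:
    """How many hands short of the task requirement the session ended."""
    return max(0, required - hands_reached(first_batch, single_hand_calls))


# task_01 run1: first call played 8 hands, second call played 1 more.
task_01_gap = shortfall(required=30, first_batch=8, single_hand_calls=1)

# task_03 run1: first call played 6 hands, second call played 1 more.
task_03_gap = shortfall(required=15, first_batch=6, single_hand_calls=1)

print(task_01_gap, task_03_gap)
```

At one hand per call after the initial batch, task_01 would have needed 21 more hand-playing calls and task_03 would have needed 8 more.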

Why these two tasks specifically?

[Speculation]

Task_01 is 30 hands, the second-longest requirement in the harness after task_05’s 50. Task_03 is only 15 hands but involves a curated set of difficult decisions that may cause the model to play more cautiously or interactively. One reasonable hypothesis: Nova Pro’s initial batch on these two tasks is too small for the requirement; for 30 hands, a batch of 8 plus a few single-hand calls doesn’t add up to completion. Length alone can’t be the whole story, though, since Nova Pro passed the 50-hand task_05 in under 5 calls.

We can’t confirm this without more runs at different batch sizes or with different system prompt instructions. What we can say is that the sessions were clearly incomplete — the transcripts show the model mid-play, not wrapping up. This isn’t a model that thinks it’s done. The sessions just ran out of budget before the task did.

None of the other evidence bundles turned up anything of note. No long-tail runs (0 of 15), no tool call redundancy (0 of 15), no diagnosis-then-regression patterns (0 of 15), no cross-task consistency (0 of 15) (verified: long_tail_turn_count.md, tool_call_redundancy.md, diagnosis_then_regression.md, cross_task_consistency.md). Nova Pro’s play is compact and doesn’t backtrack. It just doesn’t always finish.

The leaderboard

[Observed]

Model	Score	Cost
Claude Opus 4.8	13/15 (86.7%)	$36.87
Mistral Large 3	10/15 (66.7%)	—
Amazon Nova Pro	9/15 (60.0%)	$0.08
Llama 3.3 70B	8/15 (53.3%)	$1.06
Claude Haiku 4.5	6/15 (40.0%)	—
GPT-4o Turbo	3/15 (20.0%)	—
Claude Sonnet 4.6	3/15 (20.0%)	—

Third place for eight cents. Nova Pro outperforms four models (Llama 3.3 70B, Claude Haiku 4.5, GPT-4o Turbo, and Claude Sonnet 4.6) on a test where cost plays no part in the score: the only question is which model plays better blackjack.

The cost gap is large wherever we can measure it. Opus 4.8 scored four runs higher (13 vs 9) for $36.87, roughly 450 times Nova Pro’s spend. Mistral Large 3 sits one run higher (10 vs 9, or 6.7 percentage points), but we don’t have campaign cost data for it. Against Llama 3.3 70B (the other cheap Bedrock model in this batch), Nova Pro scores one run higher (9 vs 8) at about one-thirteenth the cost ($0.08 vs $1.06). The failure patterns differ: Llama fails task_01 because its OPR falls short of the 0.80 threshold. Nova Pro fails it because the session ends before the 30th hand.
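
For the three models with cost data, the ratios and the cost per passed run can be worked out from the leaderboard. The sketch below uses the unrounded $0.0826 for Nova Pro, so the Llama ratio comes out slightly under the 13× you get from the rounded $0.08. The dictionary and function names are illustrative.

```python
# Campaign cost in USD and passed runs, from the leaderboard.
# Models without cost data are left out.
COSTS = {
    "Claude Opus 4.8": 36.87,
    "Amazon Nova Pro": 0.0826,
    "Llama 3.3 70B": 1.06,
}
PASSES = {
    "Claude Opus 4.8": 13,
    "Amazon Nova Pro": 9,
    "Llama 3.3 70B": 8,
}


def cost_ratio(model_a: str, model_b: str) -> float:
    """How many times more model_a's campaign cost than model_b's."""
    return COSTS[model_a] / COSTS[model_b]


def cost_per_pass(model: str) -> float:
    """Campaign cost divided by the number of passed runs."""
    return COSTS[model] / PASSES[model]


for model in COSTS:
    print(f"{model}: ${cost_per_pass(model):.4f} per passed run")
```

Cost per passed run is a crude measure, since failed runs also cost money, but it puts the three campaigns on a common scale: a bit under one cent for Nova Pro, about 13 cents for Llama 3.3 70B, and close to three dollars for Opus 4.8.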

What we don’t know yet

[Unobserved]

We looked for long-tail behaviour, tool-call loops, and diagnostic patterns across all 15 runs and found none. The null results are real: this model doesn’t spin or backtrack. That part is clean.

What we don’t know: whether explicit continuation instructions (“keep playing until the 30th hand is dealt”) would fix the early termination. The termination happens at a consistent point, which suggests it’s a structural issue, not randomness. A 3-run retest of task_01 with an additional system prompt instruction would cost a cent or two at this campaign’s average of about half a cent per run, and would answer the question.

We also don’t know where the 8-hand batch size comes from. Nova Pro’s first call on task_01 reliably plays 8 hands before returning control. That’s not a harness default. The engine processes one hand at a time. Nova Pro must be scripting a loop inside its first Python call. Why 8 specifically, whether that’s stable across different system prompts, and whether adjusting it would change the task_01 and task_03 outcomes: open questions.

No predictions were committed before this campaign ran, so there’s no prediction-scoring to report.

ClawWorks Weekly