Llama 3.3 70B solved the task nobody could crack. Then failed the easy one.
Campaign: 2026-06-19-llama3.3-70b-casino-strategy-v1
Model: Meta Llama 3.3 70B (llama3.3-70b, Bedrock, provider=meta)
Harness: casino-strategy-v1 v1.1 (tool-call guard active)
Runs: 15 (5 tasks × 3)
Campaign date: 2026-06-19
Llama 3.3 70B was the third model in the Bedrock expansion batch, after Claude Opus 4.8 and Amazon Nova Pro. The question going in: does an open-weight model in the 70B range compete with the proprietary options on this harness?
The result: 8/15 (53.3%) at $1.06 for the full campaign. That is second on the pre-expansion leaderboard, behind Mistral Large 3 (10/15, 67%) and ahead of Claude Haiku 4.5 (6/15, 40%).
But the score isn’t the story. Llama 3.3 70B is the first model on this harness to pass task_04 (bankroll survival), and it did so in 3/3 runs. The four previous models failed it in all twelve of their combined runs. And yet Llama still can’t get through task_01, the same task that floors nearly everyone else. It solved the hard problem and got stuck on the familiar one.
What casino-strategy-v1 tests
[Observed]
Five tasks, three runs each. The model plays blackjack against a command-line game engine (game.py) using tool calls: deal, hit, stand, split, double. The engine returns a JSON game state per turn; the model reads it and acts. Each task has a specific pass threshold:
| Task | What it tests | Threshold | Runs |
|---|---|---|---|
| task_01 | Basic strategy over 30 hands | OPR ≥ 0.80 | 3 |
| task_02 | Count-aware betting over 20 hands | OPR ≥ 0.75 + CABR ≥ 0.60 | 3 |
| task_03 | Split/double edge cases over 15 hands | OPR ≥ 0.73 | 3 |
| task_04 | Bankroll survival from $200 short-stack, 20 hands | OPR ≥ 0.75 + bankroll > 0 | 3 |
| task_05 | Full session, 50 hands, with written report | OPR ≥ 0.78 + valid report | 3 |
OPR is the optimal play rate: the fraction of decisions that match basic strategy. CABR, the second condition on task_02, is the count-aware bet rate. Harness v1.1 has the tool-call guard active, which means a model that plans tool use but never calls a tool fails with an infrastructure error rather than receiving a wrong-answer score. Gemma 3 27B hit that guard at 0/30 on agentic-core-v1. Llama 3.3 70B does not: every run in this campaign produced actual play, actual tool calls, actual scores.
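The pass rule in the table above can be sketched as a small check. This is a minimal illustration, not the harness's scoring code: the `THRESHOLDS` layout, the function names, and the argument names are assumptions made for the example. Only the threshold values come from the table, and task_05's "valid report" condition is reduced to a boolean.

```python
# Threshold values copied from the task table; the dict layout is an assumption.
THRESHOLDS = {
    "task_01": {"opr": 0.80},
    "task_02": {"opr": 0.75, "cabr": 0.60},
    "task_03": {"opr": 0.73},
    "task_04": {"opr": 0.75, "bankroll_positive": True},
    "task_05": {"opr": 0.78, "valid_report": True},
}


def optimal_play_rate(optimal_decisions: int, total_decisions: int) -> float:
    """Fraction of decisions that matched basic strategy."""
    if total_decisions <= 0:
        raise ValueError("total_decisions must be positive")
    return optimal_decisions / total_decisions


def passes(task_id, opr, cabr=None, bankroll=None, valid_report=None) -> bool:
    """Apply every condition listed for the task; all must hold to pass."""
    rule = THRESHOLDS[task_id]
    if opr < rule["opr"]:
        return False
    if "cabr" in rule and (cabr is None or cabr < rule["cabr"]):
        return False
    if rule.get("bankroll_positive") and (bankroll is None or bankroll <= 0):
        return False
    if rule.get("valid_report") and not valid_report:
        return False
    return True
```

For example, 20 optimal decisions out of 28 gives an OPR of about 0.714, so `passes("task_01", optimal_play_rate(20, 28))` is `False`, while `passes("task_04", 0.75, bankroll=10.0)` is `True` because both task_04 conditions hold.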
The task_04 breakthrough
[Observed]
task_04 has been a wall. The original four models — Claude Sonnet 4.6, GPT-4o Turbo, Claude Haiku 4.5, and Mistral Large 3 — went 0/12 combined under the original dual-constraint threshold. After a recalibration to validate the threshold (task04-val reruns), the current bar is OPR ≥ 0.75 + bankroll > 0. Under that bar, Llama 3.3 70B passes 3/3. (verified: pass_rate_by_task.csv)
The tool-call profile is telling. task_04 averages 2.7 calls per run — min 2, max 3. (verified: tool_calls_by_task.csv) That is a model committing to decisions rather than re-querying state every hand. task_04 is 20 hands from a $200 short-stack where bet sizing matters and the bankroll can actually hit zero. Llama’s count-aware bet rate and conservative scaling under a negative count suggest it reads the constraint and adjusts.
The 12 prior failures at task_04 were not about the threshold being too tight. They were a model-calibration problem. Llama 3.3 70B passes the same threshold without special tuning.
task_01 is still the floor
[Observed]
Llama’s task_01 scores across three runs: OPR = 0.75, 0.60, 0.7143. Threshold is 0.80. Three misses, none of them close. (verified: pass_rate_by_task.csv)
Final-turn outputs from each run:
“The optimal play rate is 0.75, which means that 75% of the decisions made during the session were optimal according to basic strategy.”
(run 90311704-7534-4fe6-ac9c-c0810992daff, turn 58)
“The optimal play rate is 0.6, which means that 60% of the decisions made during the session were optimal according to basic strategy.”
(run 4c4ba9a9-44ec-4c9f-8bff-950ceaa78880, turn 56)
“The optimal play rate for the 30 hands played is 0.7143, meaning that 20 out of 28 decisions made were optimal according to basic strategy.”
(run 99cbd87b-07ad-4cb3-94f7-c314677f9d5f, turn 58)
Three runs, three different OPRs (0.75, 0.60, 0.71), none above the 0.80 threshold. Run 2 dropped to 0.60 — this is not a model that consistently hits 0.77 and just falls short. The OPR variance across runs suggests the hand seed distribution matters: some draws concentrate on the soft-total and edge-case decisions where basic strategy is counterintuitive.
task_01 averages 56.3 tool calls per run, the highest of any task. (verified: tool_calls_by_task.csv) Llama is playing every hand in full. The errors are in play quality, not execution.
One more thing the tool_call_redundancy bundle catches: Llama calls python3 game.py deal twice in sequence across all three task_01 runs, and duplicates python3 game.py action hit multiple times per run. (verified: tool_call_redundancy.md, runs 90311704/4c4ba9a9/99cbd87b) The engine handles repeated calls gracefully, so this does not fail the run. But it suggests the model is re-issuing commands it already sent rather than reading the result and moving on.
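One way to surface this kind of repetition is to scan a run's command log for the same command issued back to back. The sketch below is an assumption about how such a check could look, not the logic of the tool_call_redundancy bundle, and the log is made up for illustration rather than taken from a real run.

```python
from itertools import groupby


def consecutive_repeats(commands):
    """Return (command, run_length) for each command issued 2+ times in a row."""
    repeats = []
    for command, group in groupby(commands):
        length = sum(1 for _ in group)
        if length > 1:
            repeats.append((command, length))
    return repeats


# Hypothetical call log, not copied from any run in this campaign.
log = [
    "python3 game.py deal",
    "python3 game.py deal",
    "python3 game.py action hit",
    "python3 game.py action hit",
    "python3 game.py action hit",
    "python3 game.py deal",
]

for command, length in consecutive_repeats(log):
    print(f"{length}x in a row: {command}")
```

A caveat on the design: matching on the command string alone cannot tell a redundant `hit` from a legitimate second hit in the same hand. A stricter check would also compare the game state returned between the two calls.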
Only Mistral Large 3 has ever passed task_01, and only 1/3 runs. Most models land in the 0.70–0.77 OPR range. Llama is in that band on two of three runs, with one outlier at 0.60.
Scores by task
[Observed]
| Task | Runs | Passed | Pass rate | Avg tool calls | Avg cost/run |
|---|---|---|---|---|---|
| task_01 | 3 | 0 | 0% | 56.3 | $0.295 |
| task_02 | 3 | 3 | 100% | 2.0 | $0.003 |
| task_03 | 3 | 1 | 33% | 20.3 | $0.044 |
| task_04 | 3 | 3 | 100% | 2.7 | $0.005 |
| task_05 | 3 | 1 | 33% | 3.7 | $0.008 |
(verified: pass_rate_by_task.csv, tool_calls_by_task.csv, cost_breakdown.csv)
The bimodal tool-call pattern is the structural story here. task_01 and task_03 are in “full-play mode”: 30 hands and 15 hands respectively, with the model calling the game engine for each decision (56.3 and 20.3 average calls). task_02 and task_04 are in “decision mode”: 2.0 and 2.7 average calls, passing 100% of the time. task_05 is the exception: it also averages few calls (3.7) but passes only 1/3, so a low call count alone does not predict a pass. Llama handles discrete-decision tasks cleanly. The failures all live in tasks that require sustained accuracy across many sequential decisions.
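The split can be checked directly against the scores table. The sketch below groups tasks by average tool calls and totals passes per group. The rows are copied from the table, but the cutoff of 10 calls is an arbitrary assumption chosen only because it falls between the two clusters.

```python
# Rows copied from the scores table: (task, runs, passed, avg tool calls).
ROWS = [
    ("task_01", 3, 0, 56.3),
    ("task_02", 3, 3, 2.0),
    ("task_03", 3, 1, 20.3),
    ("task_04", 3, 3, 2.7),
    ("task_05", 3, 1, 3.7),
]

CALL_CUTOFF = 10.0  # assumption: arbitrary line between the two clusters


def summarize(rows, cutoff=CALL_CUTOFF):
    """Total passes and runs for tasks below and above the call cutoff."""
    groups = {
        "low_call": {"passed": 0, "runs": 0, "tasks": []},
        "high_call": {"passed": 0, "runs": 0, "tasks": []},
    }
    for task, runs, passed, avg_calls in rows:
        key = "high_call" if avg_calls >= cutoff else "low_call"
        groups[key]["passed"] += passed
        groups[key]["runs"] += runs
        groups[key]["tasks"].append(task)
    return groups


for name, g in summarize(ROWS).items():
    print(f"{name}: {g['passed']}/{g['runs']} passed, tasks={g['tasks']}")
```

With this cutoff the low-call group (task_02, task_04, task_05) passes 7 of 9 runs and the high-call group (task_01, task_03) passes 1 of 6. The two low-call misses both come from task_05.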
Failure profile
[Observed]
Every failure across the campaign is wrong_answer. No infrastructure errors, no tool-call guard fires, no scaffolding failures. (verified: failure_mode_histogram.csv — 8 passed, 7 wrong_answer)
The cross_task_consistency evidence bundle flags that wrong_answer appears across three distinct task_ids: task_01, task_03, and task_05. (verified: cross_task_consistency.md, run 90311704) The shared structure across those tasks: they all require many sequential decisions, either across many hands or with complex per-hand conditions. task_02 and task_04, which Llama passes 100%, are the ones with fewer, more discrete decisions.
That pattern — pass the constraint-heavy short tasks, fail the sustained-accuracy long ones — does not appear in other models’ campaigns in the same shape. Sonnet 4.6, for example, showed diagnosis_then_regression: it would state a correct action then call the wrong command. Llama does not show that pattern. (verified: diagnosis_then_regression.md — 0/15 runs) Its failures are cleaner. Wrong play, consistently, without the self-contradiction artifact.
Cost profile
[Observed]
$1.06 total. $0.13 per passing run. (verified: cost_breakdown.csv)
task_01 accounts for $0.88 of that — 83% of the campaign budget on 3 runs that all failed. (verified: cost_breakdown.csv — task_01 total $0.8844 of $1.0632 campaign total) This is the same task_01 cost signature seen across every model on this harness: 30-hand full-play sessions are expensive because they require 50+ tool calls with token-heavy JSON game state per call.
At $0.13 per pass, Llama 3.3 70B is not the cheapest path to a pass. Among the Bedrock expansion models, Nova Pro ($0.08 total, 9/15 passes) and Opus 4.8 ($36.87, 13/15) come in at roughly $0.009 and $2.84 per pass respectively, so Nova Pro wins that comparison. What Llama offers at its price is a task_04 solve, which Nova Pro does not have.
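The cost figures in this section are simple ratios and can be reproduced from the totals quoted above. A minimal sketch follows. The function and variable names are illustrative, and the inputs are the campaign totals and pass counts as stated in the text.

```python
def cost_per_pass(total_cost_usd: float, passes: int) -> float:
    """Campaign cost divided by passing runs; infinite if nothing passed."""
    if passes == 0:
        return float("inf")
    return total_cost_usd / passes


# (total campaign cost in USD, passing runs), as quoted in this section.
CAMPAIGNS = {
    "Llama 3.3 70B": (1.0632, 8),
    "Nova Pro": (0.08, 9),
    "Opus 4.8": (36.87, 13),
}

for model, (cost, passed) in CAMPAIGNS.items():
    print(f"{model}: ${cost_per_pass(cost, passed):.3f} per pass")

# Share of Llama's campaign budget spent on task_01.
task_01_share = 0.8844 / 1.0632
print(f"task_01 share of Llama spend: {task_01_share:.0%}")
```

Rounded, this reproduces the $0.13, $0.009, and $2.84 per-pass figures and the 83% task_01 share.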
The agentic-core inversion
[Observed, with Speculation]
Llama 3.3 70B scored 20/30 (66.7%) on agentic-core-v1, which put it well below the top tier on that harness. On casino-strategy-v1 it scores 53.3% — above Claude Haiku 4.5 (40%), which scored 27/30 on agentic-core.
The agentic-core rank does not predict casino-strategy-v1 rank. The two harnesses measure different things. agentic-core-v1 tests file operations, shell manipulation, tool composition, and multi-step coordination. casino-strategy-v1 tests whether a model can sustain decision accuracy under constraint: financial pressure, discrete game rules, long interactive sessions.
[Speculation] Llama 3.3 70B may have a training distribution that favors rule-following under constraint over the multi-tool coordination agentic-core rewards. Its task_04 pass (first ever, 3/3, under short-stack bankroll pressure) and task_02 pass (count-aware betting, 100%) are both tasks where the model reads a constraint and adjusts a single parameter. Its failures are in tasks that require many sequential decisions with no external pressure signal.
That is a theory. It would need cross-harness data to test.
What we were wrong about
[Observed]
The predictions for this campaign were not filed in the campaign repo before the run. This is a gap in the workflow — no pre-run predictions file means no scored miss/hit table for this article.
What we expected based on agentic-core-v1 results: Llama 3.3 70B’s mid-tier agentic-core score suggested it would be a mid-leaderboard casino result, possibly around Haiku 4.5 territory. The task_04 pass was not anticipated — the task had failed 12 consecutive times across four prior models. A first-time 3/3 solve on it, especially from a model that underperformed on agentic-core, was not the predicted outcome.
Pre-run predictions will be filed going forward. The adversarial-predictions workflow in MODELBATTLES_VOICE.md is the right structure — we publish what we expected before seeing results, and score the misses. Omitting them here means this article cannot do that properly.
What we don’t know yet
[Unobserved / Speculation]
The task_01 miss (OPR 0.60–0.75 across three runs) is wide enough that a single follow-up campaign targeting task_01 alone would tell us whether this is a consistent ceiling or variance from the 3-run sample size. Run 2 dropping to 0.60 is a bigger miss than the other two, and 3 runs is a narrow window for a 30-hand stochastic task.
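A rough way to size the sampling problem is to compare the spread between the three runs with the noise expected from a single run. The sketch below rests on strong simplifying assumptions that the source does not confirm: each decision is treated as an independent trial with the same chance of being optimal, and each run is assumed to contain about 28 scored decisions, the count quoted for one of the three runs.

```python
import math
import statistics

# OPRs reported for the three task_01 runs.
observed_oprs = [0.75, 0.60, 0.7143]

# Assumption: about 28 scored decisions per run, the count quoted for one run.
DECISIONS_PER_RUN = 28


def binomial_standard_error(p: float, n: int) -> float:
    """Standard error of a proportion from n independent yes/no trials."""
    return math.sqrt(p * (1.0 - p) / n)


mean_opr = statistics.mean(observed_oprs)
run_to_run_sd = statistics.stdev(observed_oprs)  # sample standard deviation
expected_noise = binomial_standard_error(mean_opr, DECISIONS_PER_RUN)

print(f"mean OPR:                 {mean_opr:.3f}")
print(f"run-to-run std deviation: {run_to_run_sd:.3f}")
print(f"single-run standard error: {expected_noise:.3f}")
```

Under these assumptions the run-to-run spread (about 0.08) is close to the noise expected from one run of 28 decisions (about 0.09). That is consistent with the point above: three runs cannot separate a fixed ceiling from ordinary variance.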
The tool_call_redundancy pattern (repeated deal and hit commands in task_01) is present across all three task_01 runs but absent from task_02/04 runs. Whether the redundancy is causing the accuracy drop or is just a correlate of the high-call-count task structure — we don’t know. The engine handles duplicates gracefully, so it isn’t failing the run, but it might be consuming context that a model with cleaner tool-call discipline would use differently.
[Unobserved] We scanned for diagnosis_then_regression in all 15 runs and found none. That is an explicit null result, not an absence of analysis. Llama’s failure mode is consistent wrong decisions without the self-contradiction pattern seen in Sonnet 4.6.
[Speculation] The task_04 solve suggests Llama might do better on shorter, higher-stakes tasks (bankroll pressure, few hands) than on longer endurance tasks (30-50 hands). If the harness ever adds a task in the 10-hand, high-pressure format, Llama might move up the leaderboard meaningfully.
Leaderboard position
[Observed]
Pre-expansion leaderboard (five models). The 7-model table including Opus 4.8 and Nova Pro publishes separately.
| Rank | Model | Score | Pass rate | Avg cost/run |
|---|---|---|---|---|
| 1 | Mistral Large 3 | 10/15 | 67% | $0.051 |
| 2 | Llama 3.3 70B | 8/15 | 53.3% | $0.071 |
| 3 | Claude Haiku 4.5 | 6/15 | 40% | $0.005 |
| 4 | GPT-4o Turbo | 3/15 | 20% | $0.261 |
| 5 | Claude Sonnet 4.6 | 3/15 | 20% | $0.378 |
Llama 3.3 70B is the only model in this table that passes task_04. That will still be true when the 7-model table publishes.