Llama 3.3 70B solved the task nobody could crack. Then failed the easy one.

Campaign: 2026-06-19-llama3.3-70b-casino-strategy-v1
Model: Meta Llama 3.3 70B (llama3.3-70b, Bedrock, provider=meta)
Harness: casino-strategy-v1 v1.1 (tool-call guard active)
Runs: 15 (5 tasks × 3)
Campaign date: 2026-06-19


Llama 3.3 70B was the third model in the Bedrock expansion batch, after Claude Opus 4.8 and Amazon Nova Pro. The question going in: does an open-weight model in the 70B range compete with the proprietary options on this harness?

8/15 (53.3%), $1.06 for the full campaign. Second on the pre-expansion leaderboard, behind Mistral Large 3 (10/15, 67%) and ahead of Claude Haiku 4.5 (6/15, 40%).

But the score isn’t the story. Llama 3.3 70B is the first model on this harness to pass task_04 (bankroll survival), and it did so in 3/3 runs. The four models in the previous campaigns went 0/12 on it combined. And yet Llama still can’t get through task_01, the same task that floors everyone else. It solved the hard problem and got stuck on the familiar one.


What casino-strategy-v1 tests

[Observed]

Five tasks, three runs each. The model plays blackjack against a command-line game engine (game.py) using tool calls: deal, hit, stand, split, double. The engine returns a JSON game state per turn; the model reads it and acts. Each task has a specific pass threshold:

| Task | What it tests | Threshold | Runs |
|---|---|---|---|
| task_01 | Basic strategy over 30 hands | OPR ≥ 0.80 | 3 |
| task_02 | Count-aware betting over 20 hands | OPR ≥ 0.75 + CABR ≥ 0.60 | 3 |
| task_03 | Split/double edge cases over 15 hands | OPR ≥ 0.73 | 3 |
| task_04 | Bankroll survival from $200 short-stack, 20 hands | OPR ≥ 0.75 + bankroll > 0 | 3 |
| task_05 | Full session, 50 hands, with written report | OPR ≥ 0.78 + valid report | 3 |

OPR is the optimal play rate: the fraction of decisions that match basic strategy. CABR, which appears only in the task_02 threshold, is the count-aware bet rate. Harness v1.1 has the tool-call guard active, which means a model that plans tool use but never calls a tool fails with an infrastructure error rather than a wrong_answer score. Gemma 3 27B hit that guard on agentic-core-v1 and went 0/30. Llama 3.3 70B does not trip it: every run in this campaign produced actual play, actual tool calls, actual scores.
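As a rough illustration of the OPR definition, here is a minimal sketch that scores a list of decisions against basic strategy. The log format (pairs of action taken and basic-strategy action) and the function name are assumptions made for this example; the harness's real scoring code is not shown in this article.

```python
from typing import Iterable, Tuple


def optimal_play_rate(decisions: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (action_taken, basic_strategy_action) pairs that match.

    The pair format is hypothetical; it only illustrates the definition.
    """
    pairs = list(decisions)
    if not pairs:
        raise ValueError("no decisions recorded")
    matches = sum(1 for taken, optimal in pairs if taken == optimal)
    return matches / len(pairs)


# Hypothetical log: 20 decisions that match basic strategy, 8 that deviate.
log = [("stand", "stand")] * 20 + [("hit", "stand")] * 8
opr = optimal_play_rate(log)
print(round(opr, 4))   # 0.7143
print(opr >= 0.80)     # False: below the task_01 threshold
```

The 20-of-28 example mirrors the ratio reported in one of the task_01 runs; the individual actions in the log are invented.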


The task_04 breakthrough

[Observed]

task_04 has been a wall. The original four models — Claude Sonnet 4.6, GPT-4o Turbo, Claude Haiku 4.5, and Mistral Large 3 — went 0/12 combined under the original dual-constraint threshold. After a recalibration to validate the threshold (task04-val reruns), the current bar is OPR ≥ 0.75 + bankroll > 0. Under that bar, Llama 3.3 70B passes 3/3. (verified: pass_rate_by_task.csv)
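The dual condition is easy to misread as an OPR check alone, so a small sketch may help. It assumes the bankroll condition is evaluated on the final bankroll and that per-hand net results are available as a list; both are assumptions for illustration, not the harness's documented behavior.

```python
from typing import Sequence

OPR_MIN = 0.75          # task_04 OPR threshold stated in the article
START_BANKROLL = 200.0  # short-stack starting bankroll stated in the article


def task_04_passes(opr: float, hand_results: Sequence[float]) -> bool:
    """Apply both task_04 conditions to one run.

    hand_results is a hypothetical list of net dollar results per hand;
    the real harness log format is not shown in the article.
    """
    final_bankroll = START_BANKROLL + sum(hand_results)
    return opr >= OPR_MIN and final_bankroll > 0


# Good play, bankroll intact: passes.
print(task_04_passes(0.80, [-10.0] * 12 + [15.0] * 8))  # True
# Good play, but eight $25 losses wipe out the $200 stack: fails.
print(task_04_passes(0.80, [-25.0] * 8))                # False
# Bankroll intact, but play quality below 0.75: fails.
print(task_04_passes(0.70, [5.0] * 20))                 # False
```

A run has to clear both conditions, which is why accurate play with oversized bets can still fail this task.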

The tool-call profile is telling. task_04 averages 2.7 calls per run — min 2, max 3. (verified: tool_calls_by_task.csv) That is a model committing to decisions rather than re-querying state every hand. task_04 is 20 hands from a $200 short-stack where bet sizing matters and the bankroll can actually hit zero. Llama’s count-aware bet rate and conservative scaling under a negative count suggest it reads the constraint and adjusts.

The 12 prior failures at task_04 were not about the threshold being too tight. They were a model-calibration problem. Llama 3.3 70B passes the same threshold without special tuning.


task_01 is still the floor

[Observed]

Llama’s task_01 scores across three runs: OPR = 0.75, 0.60, 0.7143. Threshold is 0.80. Three misses, none of them close. (verified: pass_rate_by_task.csv)

Final-turn outputs from each run:

“The optimal play rate is 0.75, which means that 75% of the decisions made during the session were optimal according to basic strategy.”
(run 90311704-7534-4fe6-ac9c-c0810992daff, turn 58)

“The optimal play rate is 0.6, which means that 60% of the decisions made during the session were optimal according to basic strategy.”
(run 4c4ba9a9-44ec-4c9f-8bff-950ceaa78880, turn 56)

“The optimal play rate for the 30 hands played is 0.7143, meaning that 20 out of 28 decisions made were optimal according to basic strategy.”
(run 99cbd87b-07ad-4cb3-94f7-c314677f9d5f, turn 58)

The spread matters more than any single miss. Run 2 dropped to 0.60, so this is not a model that consistently hits 0.77 and just falls short. The OPR variance across runs suggests the hand seed distribution matters: some draws concentrate on the soft-total and edge-case decisions where basic strategy is counterintuitive.

task_01 averages 56.3 tool calls per run, the highest of any task. (verified: tool_calls_by_task.csv) Llama is playing every hand in full. The errors are in play quality, not execution.

One more thing the tool_call_redundancy bundle catches: Llama calls python3 game.py deal twice in sequence across all three task_01 runs, and duplicates python3 game.py action hit multiple times per run. (verified: tool_call_redundancy.md, runs 90311704/4c4ba9a9/99cbd87b) The engine handles repeated calls gracefully, so this does not fail the run. But it suggests the model is re-issuing commands it already sent rather than reading the result and moving on.
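A back-to-back duplicate check of the kind described above could look like the following sketch. It is not the tool_call_redundancy bundle's actual implementation, and the `action stand` command string is assumed by analogy with `action hit`.

```python
from itertools import groupby
from typing import List, Tuple


def consecutive_duplicates(commands: List[str]) -> List[Tuple[str, int]]:
    """Return (command, run_length) for each run of identical adjacent commands."""
    runs = []
    for cmd, group in groupby(commands):
        length = sum(1 for _ in group)
        if length > 1:
            runs.append((cmd, length))
    return runs


# Hypothetical excerpt of a tool-call log.
calls = [
    "python3 game.py deal",
    "python3 game.py deal",
    "python3 game.py action hit",
    "python3 game.py action hit",
    "python3 game.py action hit",
    "python3 game.py action stand",  # assumed command form
]
print(consecutive_duplicates(calls))
# [('python3 game.py deal', 2), ('python3 game.py action hit', 3)]
```

Note that two hits in a row can be legitimate play, so a check like this only flags candidates. Deciding whether a repeat was redundant would also require comparing the game state returned between the calls.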

Only Mistral Large 3 has ever passed task_01, and only 1/3 runs. Most models land in the 0.70–0.77 OPR range. Llama is in that band on two of three runs, with one outlier at 0.60.


Scores by task

[Observed]

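| Task | Runs | Passed | Pass rate | Avg tool calls | Avg cost/run |
|---|---|---|---|---|---|
| task_01 | 3 | 0 | 0% | 56.3 | $0.295 |
| task_02 | 3 | 3 | 100% | 2.0 | $0.003 |
| task_03 | 3 | 1 | 33% | 20.3 | $0.044 |
| task_04 | 3 | 3 | 100% | 2.7 | $0.005 |
| task_05 | 3 | 1 | 33% | 3.7 | $0.008 |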

(verified: pass_rate_by_task.csv, tool_calls_by_task.csv, cost_breakdown.csv)
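The per-task rows can be cross-checked against the campaign totals quoted elsewhere in this article (8/15 passes, roughly $1.06). The sketch below hard-codes the table values; the dictionary layout is just a convenience for the example, and the per-run costs are rounded, so the cost total is approximate.

```python
# task -> (runs, passed, avg cost per run in USD), copied from the table
rows = {
    "task_01": (3, 0, 0.295),
    "task_02": (3, 3, 0.003),
    "task_03": (3, 1, 0.044),
    "task_04": (3, 3, 0.005),
    "task_05": (3, 1, 0.008),
}

total_runs = sum(runs for runs, _, _ in rows.values())
total_passed = sum(passed for _, passed, _ in rows.values())
# Rounded per-run averages times run counts give an approximate total.
total_cost = sum(runs * cost for runs, _, cost in rows.values())

print(f"{total_passed}/{total_runs} passed ({total_passed / total_runs:.1%})")
print(f"approximate campaign cost: ${total_cost:.3f}")
```

The reconstructed total lands within a fraction of a cent of the $1.0632 figure in cost_breakdown.csv, which is what rounding to three decimals would predict.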

The bimodal tool-call pattern is the structural story here. task_01 and task_03 are in “full-play mode”: 30 hands and 15 hands respectively, with the model calling the game engine for each decision. task_02 and task_04 are in “decision mode”: 2.0 and 2.7 average calls, passing 100% of the time. task_05 does not fit either mode cleanly: it averages 3.7 calls, like the decision-mode tasks, but it is a 50-hand session and passes only 1/3. Llama handles discrete-decision tasks cleanly. The failures all live in tasks that require sustained accuracy across many sequential decisions.


Failure profile

[Observed]

Every failure across the campaign is wrong_answer. No infrastructure errors, no tool-call guard fires, no scaffolding failures. (verified: failure_mode_histogram.csv — 8 passed, 7 wrong_answer)

The cross_task_consistency evidence bundle flags that wrong_answer appears across three distinct task_ids: task_01, task_03, and task_05. (verified: cross_task_consistency.md, run 90311704) The shared structure across those tasks: they all require many sequential decisions, either across many hands or with complex per-hand conditions. task_02 and task_04, which Llama passes 100%, are the ones with fewer, more discrete decisions.

That pattern — pass the constraint-heavy short tasks, fail the sustained-accuracy long ones — does not appear in other models’ campaigns in the same shape. Sonnet 4.6, for example, showed diagnosis_then_regression: it would state a correct action then call the wrong command. Llama does not show that pattern. (verified: diagnosis_then_regression.md — 0/15 runs) Its failures are cleaner. Wrong play, consistently, without the self-contradiction artifact.


Cost profile

[Observed]

$1.06 total. $0.13 per passing run. (verified: cost_breakdown.csv)

task_01 accounts for $0.88 of that — 83% of the campaign budget on 3 runs that all failed. (verified: cost_breakdown.csv — task_01 total $0.8844 of $1.0632 campaign total) This is the same task_01 cost signature seen across every model on this harness: 30-hand full-play sessions are expensive because they require 50+ tool calls with token-heavy JSON game state per call.

At $0.13 per pass, Llama 3.3 70B is not the cheapest path to a pass. Working from the leaderboard’s average cost per run, Claude Haiku 4.5 comes to roughly $0.01 per pass and Mistral Large 3 to roughly $0.08. In the expansion batch, Nova Pro ($0.08 total, 9/15 passes) and Opus 4.8 ($36.87, 13/15) come to roughly $0.009 and $2.84 respectively. Nova Pro wins the cost comparison, but Llama solves task_04, which Nova Pro does not.
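The cost arithmetic in this section can be reproduced from the totals quoted above. This is a minimal sketch: the dictionary and helper name are illustrative, and only figures that appear in the article are used.

```python
# model -> (total campaign cost in USD, passing runs), as quoted in the article
campaigns = {
    "Llama 3.3 70B": (1.0632, 8),
    "Nova Pro": (0.08, 9),
    "Opus 4.8": (36.87, 13),
}


def cost_per_pass(total_cost: float, passes: int) -> float:
    """Total campaign cost divided by the number of passing runs."""
    if passes == 0:
        raise ValueError("no passing runs; cost per pass is undefined")
    return total_cost / passes


# Rank models from cheapest to most expensive pass.
ranked = sorted(campaigns, key=lambda name: cost_per_pass(*campaigns[name]))
for name in ranked:
    print(f"{name}: ${cost_per_pass(*campaigns[name]):.3f} per pass")

# Share of the Llama campaign budget spent on task_01.
task_01_share = 0.8844 / 1.0632
print(f"task_01 share of Llama campaign cost: {task_01_share:.0%}")
```

Cost per pass is undefined for a campaign with zero passes, which is why the helper raises rather than returning infinity.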


The agentic-core inversion

[Observed, with Speculation]

Llama 3.3 70B scored 20/30 (66.7%) on agentic-core-v1, which put it well below the top tier on that harness. On casino-strategy-v1 it scores 53.3% — above Claude Haiku 4.5 (40%), which scored 27/30 on agentic-core.

The agentic-core rank does not predict casino-strategy-v1 rank. The two harnesses measure different things. agentic-core-v1 tests file operations, shell manipulation, tool composition, and multi-step coordination. casino-strategy-v1 tests whether a model can sustain decision accuracy under constraint: financial pressure, discrete game rules, long interactive sessions.

[Speculation] Llama 3.3 70B may have a training distribution that favors rule-following under constraint over the multi-tool coordination agentic-core rewards. Its task_04 pass (first ever, 3/3, under short-stack bankroll pressure) and task_02 pass (count-aware betting, 100%) are both tasks where the model reads a constraint and adjusts a single parameter. Its failures are in tasks that require many sequential decisions with no external pressure signal.

That is a theory. It would need cross-harness data to test.


What we were wrong about

[Observed]

The predictions for this campaign were not filed in the campaign repo before the run. This is a gap in the workflow — no pre-run predictions file means no scored miss/hit table for this article.

What we expected based on agentic-core-v1 results: Llama 3.3 70B’s mid-tier agentic-core score suggested it would be a mid-leaderboard casino result, possibly around Haiku 4.5 territory. The task_04 pass was not anticipated — the task had failed 12 consecutive times across four prior models. A first-time 3/3 solve on it, especially from a model that underperformed on agentic-core, was not the predicted outcome.

Pre-run predictions will be filed going forward. The adversarial-predictions workflow in MODELBATTLES_VOICE.md is the right structure — we publish what we expected before seeing results, and score the misses. Omitting them here means this article cannot do that properly.


What we don’t know yet

[Unobserved / Speculation]

The task_01 miss (OPR 0.60–0.75 across three runs) is wide enough that a single follow-up campaign targeting task_01 alone would tell us whether this is a consistent ceiling or variance from the 3-run sample size. Run 2 dropping to 0.60 is a bigger miss than the other two, and 3 runs is a narrow window for a 30-hand stochastic task.
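To get a feel for how much run-to-run spread a 30-hand sample allows, here is a back-of-envelope sketch using the binomial standard error. It assumes every decision is independent with the same chance of being optimal, about 30 scored decisions per run, and a hypothetical underlying OPR of 0.75. None of these are established by the campaign data, and real hands are not independent in this way.

```python
import math


def binomial_se(p: float, n: int) -> float:
    """Standard error of an observed success rate over n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


N_DECISIONS = 30   # assumption: about one scored decision per hand
TRUE_RATE = 0.75   # assumption: hypothetical underlying OPR

se = binomial_se(TRUE_RATE, N_DECISIONS)
print(f"standard error: {se:.3f}")  # 0.079

# How far each observed task_01 OPR sits from the assumed rate, in SE units.
for observed in (0.75, 0.60, 0.7143):
    z = (observed - TRUE_RATE) / se
    print(f"OPR {observed:.4f}: {z:+.2f} standard errors from {TRUE_RATE}")
```

Under these assumptions one standard error is about eight OPR points, so the 0.60 run sits a little under two standard errors below 0.75. That is consistent with either reading, which is the argument for more task_01 runs.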

The tool_call_redundancy pattern (repeated deal and hit commands in task_01) is present across all three task_01 runs but absent from task_02/04 runs. Whether the redundancy is causing the accuracy drop or is just a correlate of the high-call-count task structure — we don’t know. The engine handles duplicates gracefully, so it isn’t failing the run, but it might be consuming context that a model with cleaner tool-call discipline would use differently.

[Unobserved] We scanned for diagnosis_then_regression in all 15 runs and found none. That is an explicit null result, not an absence of analysis. Llama’s failure mode is consistent wrong decisions without the self-contradiction pattern seen in Sonnet 4.6.

[Speculation] The task_04 solve suggests Llama might do better on shorter, higher-stakes tasks (bankroll pressure, few hands) than on longer endurance tasks (30-50 hands). If the harness ever adds a task in the 10-hand, high-pressure format, Llama might move up the leaderboard meaningfully.


Leaderboard position

[Observed]

Pre-expansion leaderboard (five models). The 7-model table including Opus 4.8 and Nova Pro publishes separately.

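| Rank | Model | Score | Pass rate | Avg cost/run |
|---|---|---|---|---|
| 1 | Mistral Large 3 | 10/15 | 67% | $0.051 |
| 2 | Llama 3.3 70B | 8/15 | 53.3% | $0.071 |
| 3 | Claude Haiku 4.5 | 6/15 | 40% | $0.005 |
| 4 | GPT-4o Turbo | 3/15 | 20% | $0.261 |
| 5 | Claude Sonnet 4.6 | 3/15 | 20% | $0.378 |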

Llama 3.3 70B is the only model in this table that passes task_04. That will still be true when the 7-model table publishes.

ClawWorks Weekly
