Articles

Amazon Nova Pro scores 9/15 on casino-strategy-v1 for eight cents — June 20, 2026
Amazon Nova Pro (Bedrock) scores 9/15 (60%) on casino-strategy-v1, placing third on the leaderboard for $0.08. It passes three tasks cleanly and stops mid-game on the other two — at exactly the same hand every time.
Opus 4.8 wins at blackjack. The gap is 20 points. — June 20, 2026
Seven models on casino-strategy-v1: Claude Opus 4.8 at 86.7%, Mistral Large 3 at 66.7%, Amazon Nova Pro at 60.0%, Llama 3.3 70B at 53.3%. The Bedrock expansion fills in the leaderboard and answers the ceiling question.
Opus 4.8 dominates casino-strategy-v1 — except for the one task it can't crack — June 20, 2026
Claude Opus 4.8 scores 13/15 (86.7%) on casino-strategy-v1, the highest mark in the dataset. It's 20 points ahead of the prior leader and $36.79 more expensive than the model in second. The gap tells you something real about what task complexity costs at the frontier.
Llama 3.3 70B solved the task nobody could crack. Then failed the easy one. — June 20, 2026
Meta Llama 3.3 70B scores 8/15 (53.3%) on casino-strategy-v1 and becomes the first model to pass task_04 (bankroll survival). It also fails task_01 — the same floor that trips every other model on this harness.
Gemma 3 27B knows what to do. It just won't do it. — June 19, 2026
Google's best-available Bedrock model scored 0/30 on agentic-core-v1 — not because it can't use tools, but because it won't invoke them on its own. Approximately 400× cheaper than Sonnet 4.6. Zero passes.
OpenAI's flagship didn't move. The leaderboard did. — June 18, 2026
GPT-5.5 flagship scored 27/30 in June 2026, matching its May result exactly. Same task, same failure mode, one dollar cheaper. The difference is what happened around it: DeepSeek V4-Pro now sits at 30/30 for $0.12.
The second perfect score. It costs $0.12. — June 16, 2026
DeepSeek V4-Pro in thinking mode on agentic-core-v1: 30/30 (100%), $0.12 total. The first perfect score on this harness, Claude Opus 4.8, cost $7.34. Thinking mode closed the 7-point gap from V4-Flash and added nothing to the failure-mode count.
Gemini 3.1 Pro on agentic-core-v1: 23/30, and the 40-point gap from Flash — June 15, 2026
Gemini 3.1 Pro scored 23/30 (76.7%) on agentic-core-v1 — a 40-point jump from Flash. Multi-file nav and SQL landed cleanly. task_09 is still 0/3 for the whole family.
We ran a 7-day production eval. Opus 4.7 did not earn the upgrade. — June 14, 2026
We set up a shadow eval to find out whether Opus 4.7 writes better content than Sonnet 4.6. The routing infrastructure ran for seven days and collected zero comparable articles. With no positive evidence for a 10x cost premium, Sonnet 4.6 stays.
Scout knew the fix. It described the fix. It never wrote the fix. — June 14, 2026
Meta Llama 4 Scout on agentic-core-v1: 10/30 (33%). The model diagnosed bugs correctly, produced working solutions, and then stopped. fs_write was never called.
59 tool calls. Zero mid-plan reversals. Opus 4.8's first run on frontier-eval-v1. — June 13, 2026
Claude Opus 4.8 on frontier-eval-v1: 25/35 total, 25/25 on the five tasks the harness can score ($166.26 actual). The diagnosis-then-regression pattern is absent across all 35 runs. Two task failures trace to harness infrastructure bugs. This run sets the Opus 4.8 baseline before the Fable 5 comparison.
Gemma 4 12B on agentic-core-v1: 91.7% on the tasks it actually ran — June 11, 2026
Gemma 4 12B IT ran 30 agentic tasks on a local EC2 A10G and passed 22. Six failures were infrastructure errors, not the model. Adjusted for that, it outscored its 31B sibling.
Flagship pricing, third-place finish. Claude Fable 5 on agentic-core-v1. — June 10, 2026
Claude Fable 5 scored 25/30 (83.3%) on agentic-core-v1 at $1.97 total. Haiku 4.5 scored 27/30 at $0.085. Sonnet 4.6 scored 28/30 at $1.44. The new Mythos-class flagship trails both in the family that built it.
Claude's worst model became its best. One version bump, 30/30. — June 7, 2026
Claude Opus 4.8 on agentic-core-v1: 30/30 (100%), $7.34 total, $0.245 per passing task. First perfect score in the ModelClaw dataset. The gave_up_mid_plan failure mode that accounted for 7 of 9 failures in Opus 4.7 is entirely absent.
Two models at 27/30 are not the same model — June 3, 2026
Claude Haiku 4.5 and Devstral 2 123B both score 27/30 on agentic-core-v1. Their failure profiles are completely different. Why the per-task breakdown matters more than the headline number.
The model you expected to win didn't — June 3, 2026
Three case studies from the agentic-core-v1 dataset where the bigger, more expensive model lost to a smaller one from the same lab. Claude Opus 4.7 at $0.528 per pass. GPT-OSS 120B beaten by its own 20B sibling. Mistral Small 4 leading the leaderboard at $0.001 per pass.
35 models. One solved it. — June 3, 2026
task_09 is the hardest task in agentic-core-v1: compute a 10-day moving average on 3 rows of data. Across 35 model campaigns, the pass rate is 16%. Only one model has ever passed all three runs. Here's what divides the ones that pass from the ones that loop forever.
29/30 for three cents — May 31, 2026
Mistral Small 4 scored 29/30 on agentic-core-v1, the new harness leader, at $0.03 total and $0.001 per passing run. That is ~50x cheaper per pass than Claude Sonnet 4.6 at 28/30.
We predicted 26. It scored 11. — May 28, 2026
Gemini 3.5 Flash posted 11/30 on agentic-core-v1, the worst result by any model that successfully ran the harness. Our prediction was 26. Here's what happened, why it happened, and what 'gave_up_mid_plan' means for anyone building on top of Google's agentic flagship.
A model that never played the game — May 28, 2026
GPT-4o Turbo scored 6/15 on casino-strategy-v1 without making a single tool call. After hardening the checker to require real game interaction, its score dropped to 3/15.
Anthropic's most expensive model scored worst in its own family — May 25, 2026
Claude Opus 4.7 on agentic-core-v1: 21/30 (70%), $0.528 per passing task. Haiku 4.5 scored 27/30 for $0.00316/pass. Sonnet 4.6 scored 28/30. This is the complete Anthropic 4.x picture.
Sonnet 4.6 can count cards. It cannot play basic strategy. — May 25, 2026
Claude Sonnet 4.6 scores 3/15 (20%) on casino-strategy-v1, the baseline run that launched the harness. The 73-point gap from its 28/30 agentic-core-v1 result is real, and the failure modes are specific.
The architecture question, revisited — May 25, 2026
A second run of NVIDIA Nemotron Nano 3 30B scores 15/30 on agentic-core-v1 -- five points above the first run's 10/30, and now clearly ahead of the 120B Super sibling. The architecture finding is updated: dense 30B does beat MoE 120B. The family ceiling is still low.
12/30 was real — May 25, 2026
A second run of NVIDIA Nemotron Super 3 120B confirms 11/30 on agentic-core-v1 -- one point from the prior result, same failure pattern. The replication also surfaces a sharper finding: Nemotron's impossibility detection now outscores GPT-OSS 120B despite a 12-point overall deficit.
The coder trap — May 25, 2026
Qwen3 Coder Next scores 20/30 on agentic-core-v1 -- below every prior Qwen3 model in this dataset, at 5x the cost-per-pass of Qwen3 32B. Coder specialisation trades away the ambiguity-handling the harness requires.
Why basic strategy alone isn't enough: what card counting actually does and why AI could do it better — May 24, 2026
Basic strategy gets you to a ~0.5% house edge. That's still a steady drain. Card counting flips the edge to the player, but only when executed with near-perfect accuracy. That's the problem we're building a tool to solve.
One point apart — May 24, 2026
Claude Haiku 4.5 scored 27/30 on agentic-core-v1 -- one point below Anthropic's flagship Sonnet 4.6 at 16x lower cost per passing task. On the nine tasks both models can actually solve, the results are identical.
The base model that proved versioning works — May 24, 2026
MiniMax M2 scores 24/30 on agentic-core-v1 -- four points below M2.1 and three below M2.5. That result completes the family dataset. The version numbers actually mean something.
The older model is better (this time) — May 24, 2026
MiniMax M2.1 scores 28/30 on agentic-core-v1 -- one point above its flagship sibling M2.5. The version inversion traces to a single difference: M2.1 hedges before committing to an impossible answer. M2.5 does not. That habit wins task_09 and loses task_06.
The plan stayed in its head — May 23, 2026
Kimi K2 Thinking scored 12/30 on agentic-core-v1, down from K2.5's 24/30. The reasoning trace fired. The tool calls didn't.
The legacy format that couldn't shortcut became the thing that made it work. — May 23, 2026
Magistral Small 2509 scores 23/30 (76.67%) on agentic-core-v1 -- the first reasoning-format model to pass task_07. The [TOOL_CALLS] text format built as a workaround may be why it succeeded where Kimi K2 Thinking failed.
The model fixed the bug. The test runner didn't have pytest. — May 23, 2026
Ministral 3 14B scores 23/30 (76.67%) on agentic-core-v1 at $0.00103/pass. The headline: a correct code fix, a missing pytest install, and a score three points lower than it should have been.
The 3B matches the 14B. One task apart. — May 23, 2026
Ministral 3 3B scores 22/30 (73.33%) on agentic-core-v1 at $0.000787/pass. One task behind the 14B sibling, 24% cheaper per pass. The Ministral 3 family is now fully mapped, and the smallest model is not where the floor is.
The 8B model that tied the dataset leaders -- and made every prediction wrong. — May 23, 2026
Ministral 3 8B scores 28/30 (93.33%) on agentic-core-v1 at $0.00067/pass -- tied with Claude Sonnet 4.6 and GLM-4.7 at the top of the leaderboard, at a fraction of the cost. Every pre-run prediction was wrong.
Jamba 1.5 Large diagnosed the bugs. It did not fix any of them. — May 22, 2026
AI21 Jamba 1.5 Large scores 8/30 (26.67%) on agentic-core-v1 — last in the dataset at $0.0044/pass. The first SSM-Transformer hybrid tested. The headline finding: the model reads the broken code, writes a correct diagnosis in prose, and then stops without executing the fix.
We ran AWS's own flagship LLM on its own cloud. It came 16th. — May 22, 2026
Amazon Nova Pro scores 20/30 on agentic-core-v1, 16th of 21 models. No measurable advantage from running AWS's own model on AWS's own cloud. A new task_02 failure mode: the refactor was correct, but Nova Pro rewrote the test API instead of preserving it.
The flash tax on GLM-4.7 is 3 points and 85% — May 22, 2026
GLM-4.7-Flash scores 25/30 on agentic-core-v1 at $0.000565 per passing task. Three points below its parent, 85% cheaper per pass. Whether that trade works depends on the workload.
We gave Alibaba's biggest Qwen3 model a text-only benchmark. It scored 40%. — May 22, 2026
Qwen3 VL 235B A22B scores 12/30 on agentic-core-v1, the weakest result in the Qwen3 family despite 22B active parameters. VL pre-training extracted an 11-point penalty over the 32B dense variant. Two tasks produced identical wrong outputs across all three runs.
Version inversion — May 21, 2026
GLM-4.7 scores 28/30 on agentic-core-v1, one point above its own successor GLM-5. At $0.0038 per passing task (42% cheaper), the model the upgrade was supposed to replace is now joint-first in the dataset.
Not the architecture — May 21, 2026
Nemotron Nano 3 30B scores 10/30 on agentic-core-v1, lower than the Super 120B's 12/30. Dense parameters didn't fix the family's agentic problem. NVIDIA's training did this, not the MoE design.
The smaller model wins — May 21, 2026
GPT-OSS 20B scored 25/30 on agentic-core-v1, beating its 120B sibling by 2 points at 2.7x lower cost per pass. New cost floor: $0.000481/pass. Cheapest model in the dataset.
Dense enough — May 21, 2026
Qwen3 32B scores 23/30 on agentic-core-v1, marginally ahead of its MoE siblings at 21 and 22. Ten times the active compute per forward pass translated to 1–2 extra passes. The production-relevant finding is task_08: the model doesn't stall when it's wrong, it completes confidently.
Zero redundancy at the top tier — May 20, 2026
GLM-5 scored 27/30 on agentic-core-v1, joining MiniMax M2.5, Mistral Large 3, and Devstral 2 at the highest mark any model outside Claude reaches on this harness. From the oldest Chinese LLM lineage in the dataset. Zero tool redundancy across 30 runs.
27/30 from a Beijing lab — May 20, 2026
MiniMax M2.5 scored 27/30 on agentic-core-v1, matching Mistral Large 3 and Devstral 2 at the top tier. First Chinese-lab model to reach that mark. The internal reasoning block explains why the prediction was off by three, and why task_09 still won.
The refactor paradox — May 20, 2026
Qwen3-Coder-30B-A3B scored 22/30 on agentic-core-v1 at $0.0018/pass, beating the generalist Qwen3 Next 80B A3B at the same activation cost. But it failed the refactor task the generalist passed.
The easy ones aren't free — May 19, 2026
Moonshot AI's first entry in the dataset scores 24/30. The surprise isn't the score. The task that broke it should have been routine.
Hardware expertise, software failure — May 19, 2026
NVIDIA Nemotron Super 3 120B scores 12/30 on agentic-core-v1, placing last among all 100B+ models. The chip-maker thesis doesn't survive contact with the eval harness.
The Bedrock arbitrage that didn't work out — May 18, 2026
DeepSeek V3.2 scored 19/30 on agentic-core-v1 — second-lowest in the dataset, below Llama 3.3 70B. We had a falsification condition on the books before the campaign ran. It fired.
The activation ceiling — May 18, 2026
Qwen3 Next 80B A3B scored 21/30 on agentic-core-v1 at $0.00122/pass, cheapest per successful run in the harness, and became the first model to score 0/3 on the debugging task that every other model passed.
30 runs in 7 seconds: DeepSeek R1 and the API boundary — May 17, 2026
We wanted to know if DeepSeek R1's reasoning translates to agentic performance. Bedrock rejected every request before the model was invoked once. 0 tokens consumed, $0 spent, and a clean finding about what reasoning models actually are.
The specialist wins at 123B — May 17, 2026
Devstral 2 123B scored 27/30 on agentic-core-v1 at $0.0019/pass. That score is identical to the 675B Mistral Large 3's; Devstral 2 is also 7.6× faster on the hard planning task and matches DeepSeek-V4-Flash's prior 1/3 on the impossible-computation trap.
27/30 for six cents — May 17, 2026
Mistral Large 3 scored 27/30 on agentic-core-v1 at $0.06 total, 23x cheaper per passing run than Claude Sonnet 4.6 and 32x cheaper than GPT-5.5 at identical quality.
What OpenAI kept for itself — May 17, 2026
GPT-OSS 120B scored 23/30 on agentic-core-v1 across two independent runs. Second from the bottom, above Llama 3.3 70B's 20/30. Also the model that falls apart when asked to maintain a four-step plan — and the one its smaller sibling (GPT-OSS 20B, 25/30) outscores on this harness.
90% and the one model that never refused — May 16, 2026
GPT-5.5 Instant scored 27/30 on agentic-core-v1 — one point behind Claude, 31% more expensive, and 3/3 on tool-error recovery. Its distinctive gap: 0/3 on impossible-task recognition, worse than every other top-tier model.
Gemma 4 31B on agentic-core-v1: 76.7% at zero token cost — May 15, 2026
Gemma 4 31B ran 30 agentic tasks locally and passed 23. It beat Llama 3.3 70B with 39B fewer parameters, at $0.00. Here is what happened.
0/30 and $0.011: when the adapter speaks a dialect the harness doesn't understand — May 15, 2026
Llama 3.3 70B on agentic-core-v1. Every run failed. Every run also identified the correct starting file. The problem was never the model; it was the wire format.
36x cheaper. Same score. — May 13, 2026
DeepSeek-V4-Flash (13B activated MoE) matched Claude Sonnet 4.6 on agentic-core-v1 at 93.33% — and cost $0.04 total versus $1.44. Four of six pre-run predictions were wrong.
Three runs to a number: Llama 3.3 70B reaches 20/30 after two infrastructure detours — May 8, 2026
0/30 (format mismatch), 14/30 (conversation-history bug), 20/30 (clean eval). The first two runs were adapter debugging. The third is the real score — 66.7%, 16× cheaper than Claude, and six task types where it's perfect.
agentic-core-v1: What We Actually Measure and Why — May 5, 2026
Ten tasks, three runs each, thirty total. A breakdown of the agentic-core-v1 benchmark suite: what it tests, why we built it this way, and what it misses.
The pass that only worked because pandas wasn't installed — May 4, 2026
First campaign: Claude Sonnet 4.6 on agentic-core-v1. 28/30 runs passed. Total cost $1.44. One task produced both failures, and the single run that passed did so for the wrong reason.
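
Several summaries above quote a pass rate, a cost per passing run, and a "Nx cheaper than Sonnet 4.6" ratio. A minimal sketch of that arithmetic follows, assuming cost per pass is simply total campaign cost divided by passing runs. The `Campaign` class and `summarize` helper are hypothetical names for illustration, not part of the harness. The inputs are the scores and totals quoted in the summaries; because those totals are rounded, the derived figures are approximate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Campaign:
    """One model campaign: passing runs, total runs, and total spend (USD)."""
    model: str
    passed: int
    total_runs: int
    total_cost_usd: float

    @property
    def pass_rate_pct(self) -> float:
        # Share of runs that passed, as a percentage.
        return 100.0 * self.passed / self.total_runs

    @property
    def cost_per_pass_usd(self) -> float:
        # Assumption: cost per pass = total campaign cost / passing runs.
        if self.passed == 0:
            raise ValueError(f"{self.model} has no passing runs")
        return self.total_cost_usd / self.passed


# Scores and totals as quoted in the article summaries (agentic-core-v1).
CAMPAIGNS = [
    Campaign("Claude Opus 4.8", 30, 30, 7.34),
    Campaign("Claude Sonnet 4.6", 28, 30, 1.44),
    Campaign("Mistral Small 4", 29, 30, 0.03),
    Campaign("DeepSeek-V4-Flash", 28, 30, 0.04),
]


def summarize(campaigns, baseline_model):
    """Return (model, pass rate %, cost per pass, baseline cost ratio) rows."""
    baseline = next(c for c in campaigns if c.model == baseline_model)
    rows = []
    for c in campaigns:
        # Ratio > 1 means the model is cheaper per pass than the baseline.
        ratio = baseline.cost_per_pass_usd / c.cost_per_pass_usd
        rows.append((c.model, c.pass_rate_pct, c.cost_per_pass_usd, ratio))
    return rows


if __name__ == "__main__":
    for model, rate, cpp, ratio in summarize(CAMPAIGNS, "Claude Sonnet 4.6"):
        print(f"{model:20s} {rate:6.2f}%  ${cpp:.5f}/pass  {ratio:6.1f}x vs Sonnet 4.6")
```

Under that assumption the sketch gives roughly $0.245 per pass for Opus 4.8, about 50x for Mistral Small 4 and 36x for DeepSeek-V4-Flash against Sonnet 4.6, in line with the figures in the summaries.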
