Articles
- Amazon Nova Pro scores 9/15 on casino-strategy-v1 for eight cents —
Amazon Nova Pro (Bedrock) scores 9/15 (60%) on casino-strategy-v1, placing third on the leaderboard for $0.08. It passes three tasks cleanly and stops mid-game on the other two — at exactly the same hand every time.
- Opus 4.8 wins at blackjack. The gap is 20 points. —
Seven models on casino-strategy-v1: Claude Opus 4.8 at 86.7%, Mistral Large 3 at 66.7%, Amazon Nova Pro at 60.0%, Llama 3.3 70B at 53.3%. The Bedrock expansion fills in the leaderboard and answers the ceiling question.
- Opus 4.8 dominates casino-strategy-v1 — except for the one task it can't crack —
Claude Opus 4.8 scores 13/15 (86.7%) on casino-strategy-v1, the highest mark in the dataset. It's 20 points ahead of the prior leader and $36.79 more expensive than the model in second. The gap tells you something real about what task complexity costs at the frontier.
- Llama 3.3 70B solved the task nobody could crack. Then failed the easy one. —
Meta Llama 3.3 70B scores 8/15 (53.3%) on casino-strategy-v1 and becomes the first model to pass task_04 (bankroll survival). It also fails task_01 — the same floor that trips every other model on this harness.
- Gemma 3 27B knows what to do. It just won't do it. —
Google's best-available Bedrock model scored 0/30 on agentic-core-v1 — not because it can't use tools, but because it won't invoke them on its own. Approximately 400× cheaper than Sonnet 4.6. Zero passes.
- OpenAI's flagship didn't move. The leaderboard did. —
GPT-5.5 flagship scored 27/30 in June 2026, matching its May result exactly. Same task, same failure mode, one dollar cheaper. The difference is what happened around it: DeepSeek V4-Pro now sits at 30/30 for $0.12.
- The second perfect score. It costs $0.12. —
DeepSeek V4-Pro in thinking mode on agentic-core-v1: 30/30 (100%), $0.12 total. The first perfect score on this harness, Claude Opus 4.8, cost $7.34. Thinking mode closed the 7-point gap from V4-Flash and added nothing to the failure-mode count.
- Gemini 3.1 Pro on agentic-core-v1: 23/30, and the 40-point gap from Flash —
Gemini 3.1 Pro scored 23/30 (76.7%) on agentic-core-v1 — a 40-point jump from Flash. Multi-file nav and SQL landed cleanly. task_09 is still 0/3 for the whole family.
- We ran a 7-day production eval. Opus 4.7 did not earn the upgrade. —
We set up a shadow eval to find out whether Opus 4.7 writes better content than Sonnet 4.6. The routing infrastructure ran for seven days and collected zero comparable articles. With no positive evidence for a 10x cost premium, Sonnet 4.6 stays.
- Scout knew the fix. It described the fix. It never wrote the fix. —
Meta Llama 4 Scout on agentic-core-v1: 10/30 (33%). The model diagnosed bugs correctly, produced working solutions, and then stopped. fs_write was never called.
- 59 tool calls. Zero mid-plan reversals. Opus 4.8's first run on frontier-eval-v1. —
Claude Opus 4.8 on frontier-eval-v1: 25/35 total, 25/25 on the five tasks the harness can score ($166.26 actual). The diagnosis-then-regression pattern is absent across all 35 runs. Two task failures trace to harness infrastructure bugs. This run sets the Opus 4.8 baseline ahead of the Fable 5 comparison.
- Gemma 4 12B on agentic-core-v1: 91.7% on the tasks it actually ran —
Gemma 4 12B IT ran 30 agentic tasks on a local EC2 A10G and passed 22. Six failures were infrastructure errors, not the model. Adjusted for that, it outscored its 31B sibling.
- Flagship pricing, third-place finish. Claude Fable 5 on agentic-core-v1. —
Claude Fable 5 scored 25/30 (83.3%) on agentic-core-v1 at $1.97 total. Haiku 4.5 scored 27/30 at $0.085. Sonnet 4.6 scored 28/30 at $1.44. The new Mythos-class flagship trails both of its cheaper siblings.
- Claude's worst model became its best. One version bump, 30/30. —
Claude Opus 4.8 on agentic-core-v1: 30/30 (100%), $7.34 total, $0.245 per passing task. First perfect score in the ModelClaw dataset. The gave_up_mid_plan failure mode that accounted for 7 of 9 failures in Opus 4.7 is entirely absent.
- Two models at 27/30 are not the same model —
Claude Haiku 4.5 and Devstral 2 123B both score 27/30 on agentic-core-v1. Their failure profiles are completely different. Why the per-task breakdown matters more than the headline number.
- The model you expected to win didn't —
Three case studies from the agentic-core-v1 dataset where the bigger, more expensive model lost to a smaller one from the same lab. Claude Opus 4.7 at $0.528 per pass. GPT-OSS 120B beaten by its own 20B sibling. Mistral Small 4 leading the leaderboard at $0.001 per pass.
- 35 models. One solved it. —
task_09 is the hardest task in agentic-core-v1: compute a 10-day moving average on 3 rows of data. Across 35 model campaigns, the pass rate is 16%. Only one model has ever passed all three runs. Here's what divides the ones that pass from the ones that loop forever.
- 29/30 for three cents —
Mistral Small 4 scored 29/30 on agentic-core-v1, the new harness leader, at $0.03 total and $0.001 per passing run. That is ~50x cheaper per pass than Claude Sonnet 4.6 at 28/30.
- We predicted 26. It scored 11. —
Gemini 3.5 Flash posted 11/30 on agentic-core-v1, the worst result by any model that successfully ran the harness. Our prediction was 26. Here's what happened, why it happened, and what 'gave_up_mid_plan' means for anyone building on top of Google's agentic flagship.
- A model that never played the game —
GPT-4o Turbo scored 6/15 on casino-strategy-v1 without making a single tool call. After hardening the checker to require real game interaction, its score dropped to 3/15.
- Anthropic's most expensive model scored worst in its own family —
Claude Opus 4.7 on agentic-core-v1: 21/30 (70%), $0.528 per passing task. Haiku 4.5 scored 27/30 for $0.00316/pass. Sonnet 4.6 scored 28/30. This is the complete Anthropic 4.x picture.
- Sonnet 4.6 can count cards. It cannot play basic strategy. —
Claude Sonnet 4.6 scores 3/15 (20%) on casino-strategy-v1, the baseline run that launched the harness. The 73-point gap from its 28/30 agentic-core-v1 result is real, and the failure modes are specific.
- The architecture question, revisited —
A second run of NVIDIA Nemotron Nano 3 30B scores 15/30 on agentic-core-v1 -- five points above the first run's 10/30, and now clearly ahead of the 120B Super sibling. The architecture finding is updated: dense 30B does beat MoE 120B. The family ceiling is still low.
- 12/30 was real —
A second run of NVIDIA Nemotron Super 3 120B confirms 11/30 on agentic-core-v1 -- one point from the prior result, same failure pattern. The replication also surfaces a sharper finding: Nemotron's impossibility detection now outscores GPT-OSS 120B despite a 12-point overall deficit.
- The coder trap —
Qwen3 Coder Next scores 20/30 on agentic-core-v1 -- below every prior Qwen3 model in this dataset, at 5x the cost-per-pass of Qwen3 32B. Coder specialisation trades away the ambiguity-handling the harness requires.
- Why basic strategy alone isn't enough: what card counting actually does and why AI could do it better —
Basic strategy gets you to a ~0.5% house edge. That's still a steady drain. Card counting flips the edge to the player, but only when executed with near-perfect accuracy. That's the problem we're building a tool to solve.
- One point apart —
Claude Haiku 4.5 scored 27/30 on agentic-core-v1 -- one point below Anthropic's flagship Sonnet 4.6 at 16x lower cost per passing task. On the nine tasks both models can actually solve, the results are identical.
- The base model that proved versioning works —
MiniMax M2 scores 24/30 on agentic-core-v1 -- four points below M2.1 and three below M2.5. That result completes the family dataset. The version numbers actually mean something.
- The older model is better (this time) —
MiniMax M2.1 scores 28/30 on agentic-core-v1 -- one point above its flagship sibling M2.5. The version inversion traces to a single difference: M2.1 hedges before committing to an impossible answer. M2.5 does not. That habit wins task_09 and loses task_06.
- The plan stayed in its head —
Kimi K2 Thinking scored 12/30 on agentic-core-v1, down from K2.5's 24/30. The reasoning trace fired. The tool calls didn't.
- The legacy format that couldn't shortcut became the thing that made it work. —
Magistral Small 2509 scores 23/30 (76.67%) on agentic-core-v1 -- the first reasoning-format model to pass task_07. The [TOOL_CALLS] text format built as a workaround may be why it succeeded where Kimi K2 Thinking failed.
- The model fixed the bug. The test runner didn't have pytest. —
Ministral 3 14B scores 23/30 (76.67%) on agentic-core-v1 at $0.00103/pass. The headline: a correct code fix, a missing pytest install, and a score three points lower than it should have been.
- The 3B matches the 14B. One task apart. —
Ministral 3 3B scores 22/30 (73.33%) on agentic-core-v1 at $0.000787/pass. One task behind the 14B sibling, 24% cheaper per pass. The Ministral 3 family is now fully mapped, and the smallest model is not where the floor is.
- The 8B model that tied the dataset leaders -- and made every prediction wrong. —
Ministral 3 8B scores 28/30 (93.33%) on agentic-core-v1 at $0.00067/pass -- tied with Claude Sonnet 4.6 and GLM-4.7 at the top of the leaderboard, at a fraction of the cost. Every pre-run prediction was wrong.
- Jamba 1.5 Large diagnosed the bugs. It did not fix any of them. —
AI21 Jamba 1.5 Large scores 8/30 (26.67%) on agentic-core-v1 — last in the dataset at $0.0044/pass. The first SSM-Transformer hybrid tested. The headline finding: the model reads the broken code, writes a correct diagnosis in prose, and then stops without executing the fix.
- We ran AWS's own flagship LLM on its own cloud. It came 16th. —
Amazon Nova Pro scores 20/30 on agentic-core-v1, 16th of 21 models. No measurable advantage from running AWS's own model on AWS's own cloud. A new task_02 failure mode: the refactor was correct, but Nova Pro rewrote the test API instead of preserving it.
- The flash tax on GLM-4.7 is 3 points and 85% —
GLM-4.7-Flash scores 25/30 on agentic-core-v1 at $0.000565 per passing task. Three points below its parent, 85% cheaper per pass. Whether that trade works depends on the workload.
- We gave Alibaba's biggest Qwen3 model a text-only benchmark. It scored 40%. —
Qwen3 VL 235B A22B scores 12/30 on agentic-core-v1, the weakest result in the Qwen3 family despite 22B active parameters. VL pre-training extracted an 11-point penalty relative to the 32B dense variant. Two tasks produced identical wrong outputs across all three runs.
- Version inversion —
GLM-4.7 scores 28/30 on agentic-core-v1, one point above its own successor GLM-5. At $0.0038 per passing task (42% cheaper), the model the upgrade was supposed to replace is now joint-first in the dataset.
- Not the architecture —
Nemotron Nano 3 30B scores 10/30 on agentic-core-v1, lower than the Super 120B's 12/30. Dense parameters didn't fix the family's agentic problem. NVIDIA's training did this, not the MoE design.
- The smaller model wins —
GPT-OSS 20B scored 25/30 on agentic-core-v1, beating its 120B sibling by 2 points at 2.7x lower cost per pass. New cost floor: $0.000481/pass. Cheapest model in the dataset.
- Dense enough —
Qwen3 32B scores 23/30 on agentic-core-v1, marginally ahead of its MoE siblings at 21 and 22. Ten times the active compute per forward pass translated to 1–2 extra passes. The production-relevant finding is task_08: the model doesn't stall when it's wrong, it completes confidently.
- Zero redundancy at the top tier —
GLM-5 scored 27/30 on agentic-core-v1, joining MiniMax M2.5, Mistral Large 3, and Devstral 2 at the highest mark any model outside Claude reaches on this harness. From the oldest Chinese LLM lineage in the dataset. Zero tool redundancy across 30 runs.
- 27/30 from a Beijing lab —
MiniMax M2.5 scored 27/30 on agentic-core-v1, matching Mistral Large 3 and Devstral 2 at the top tier. First Chinese-lab model to reach that mark. The internal reasoning block explains why the prediction was off by three, and why task_09 still won.
- The refactor paradox —
Qwen3-Coder-30B-A3B scored 22/30 on agentic-core-v1 at $0.0018/pass, beating the generalist Qwen3 Next 80B A3B at the same activation cost. But it failed the refactor task the generalist passed.
- The easy ones aren't free —
Moonshot AI's first entry in the dataset scores 24/30. The surprise isn't the score. The task that broke it should have been routine.
- Hardware expertise, software failure —
NVIDIA Nemotron Super 3 120B scores 12/30 on agentic-core-v1, placing last among all 100B+ models. The chip-maker thesis doesn't survive contact with the eval harness.
- The Bedrock arbitrage that didn't work out —
DeepSeek V3.2 scored 19/30 on agentic-core-v1 — second-lowest in the dataset, below Llama 3.3 70B. We had a falsification condition on the books before the campaign ran. It fired.
- The activation ceiling —
Qwen3 Next 80B A3B scored 21/30 on agentic-core-v1 at $0.00122/pass, cheapest per successful run in the harness, and became the first model to score 0/3 on the debugging task that every other model passed.
- 30 runs in 7 seconds: DeepSeek R1 and the API boundary —
We wanted to know if DeepSeek R1's reasoning translates to agentic performance. Bedrock rejected every request before the model was invoked once. 0 tokens consumed, $0 spent, and a clean finding about what reasoning models actually are.
- The specialist wins at 123B —
Devstral 2 123B scored 27/30 on agentic-core-v1 at $0.0019/pass. That matches the score of the 675B Mistral Large 3, runs 7.6× faster on the hard planning task, and equals DeepSeek-V4-Flash's prior 1/3 on the impossible-computation trap.
- 27/30 for six cents —
Mistral Large 3 scored 27/30 on agentic-core-v1 at $0.06 total, 23x cheaper per passing run than Claude Sonnet 4.6 and 32x cheaper than GPT-5.5 at identical quality.
- What OpenAI kept for itself —
GPT-OSS 120B scored 23/30 on agentic-core-v1 across two independent runs. Second from the bottom, above Llama 3.3 70B's 20/30. Also the model that falls apart when asked to maintain a four-step plan — and the one its smaller sibling (GPT-OSS 20B, 25/30) outscores on this harness.
- 90% and the one model that never refused —
GPT-5.5 Instant scored 27/30 on agentic-core-v1 — one point behind Claude, 31% more expensive, and 3/3 on tool-error recovery. Its distinctive gap: 0/3 on impossible-task recognition, worse than every other top-tier model.
- Gemma 4 31B on agentic-core-v1: 76.7% at zero token cost —
Gemma 4 31B ran 30 agentic tasks locally and passed 23. It beat Llama 3.3 70B with 39B fewer parameters, at $0.00. Here is what happened.
- 0/30 and $0.011: when the adapter speaks a dialect the harness doesn't understand —
Llama 3.3 70B on agentic-core-v1. Every run failed. Every run also identified the correct starting file. The problem was never the model; it was the wire format.
- 36x cheaper. Same score. —
DeepSeek-V4-Flash (13B activated MoE) matched Claude Sonnet 4.6 on agentic-core-v1 at 93.33% — and cost $0.04 total versus $1.44. Four of six pre-run predictions were wrong.
- Three runs to a number: Llama 3.3 70B reaches 20/30 after two infrastructure detours —
0/30 (format mismatch), 14/30 (conversation-history bug), 20/30 (clean eval). The first two runs were adapter debugging. The third is the real score — 66.7%, 16× cheaper than Claude, and six task types where it's perfect.
- agentic-core-v1: What We Actually Measure and Why —
Ten tasks, three runs each, thirty total. A breakdown of the agentic-core-v1 benchmark suite: what it tests, why we built it this way, and what it misses.
- The pass that only worked because pandas wasn't installed —
First campaign: Claude Sonnet 4.6 on agentic-core-v1. 28/30 runs passed. Total cost $1.44. One task produced both failures, and the single run that passed did so for the wrong reason.
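
The per-pass and ratio figures quoted in these summaries appear to follow from two simple quantities: total campaign cost divided by passing runs, and passing runs divided by total runs. A minimal Python sketch, assuming that definition (the function names are hypothetical and not taken from the harness), reproduces a few of the numbers above:

```python
def pass_rate(passes: int, runs: int) -> float:
    """Percentage of runs that passed."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    return 100.0 * passes / runs


def cost_per_pass(total_cost_usd: float, passes: int) -> float:
    """Total campaign cost divided by the number of passing runs.

    Assumption: this is how the '$/pass' figures in the summaries are
    derived. It is undefined when nothing passed.
    """
    if passes <= 0:
        raise ValueError("cost per pass is undefined with zero passes")
    return total_cost_usd / passes


# Figures taken from the summaries above (total cost in USD, passing runs).
opus_48 = cost_per_pass(7.34, 30)        # Claude Opus 4.8, 30/30
sonnet_46 = cost_per_pass(1.44, 28)      # Claude Sonnet 4.6, 28/30
mistral_small = cost_per_pass(0.03, 29)  # Mistral Small 4, 29/30

print(f"Opus 4.8:        ${opus_48:.3f} per pass")
print(f"Sonnet 4.6:      ${sonnet_46:.4f} per pass")
print(f"Mistral Small 4: ${mistral_small:.4f} per pass")
print(f"Sonnet / Mistral Small ratio: {sonnet_46 / mistral_small:.1f}x")
print(f"28/30 as a pass rate: {pass_rate(28, 30):.2f}%")
```

Because the published totals are rounded to the cent, ratios computed this way are approximate, which is consistent with the "~50x" wording in the Mistral Small 4 summary.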