Devstral 2 123B vs Mistral Small 4

Mistral's code specialist vs its latest small model — which is better for agentic workloads?

Devstral 2 123B: 27/30 (90.0% pass rate), $0.00185/pass
vs
Mistral Small 4: 29/30 (96.7% pass rate), $0.00103/pass. Higher score, lower cost.

Head-to-head: agentic-core-v1

Harness: agentic-core-v1 — 10 tasks × 3 runs = 30 total. Binary pass/fail per run. Harness version: openclaw@2026.4.22. Full methodology: what we measure and why.

| Metric | Devstral 2 123B | Mistral Small 4 |
|---|---|---|
| Runs passed / total | 27 / 30 | 29 / 30 |
| Pass rate | 90.0% | 96.7% |
| Campaign cost (30 runs) | $0.05 | $0.03 |
| Cost per run | $0.00167 | $0.00100 |
| Cost per passing run | $0.00185 | $0.00103 |
| Provider | Mistral AI | Mistral AI |
| Campaign date | 2026-05-17 | 2026-05-31 |
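The derived rows in the table follow from the pass counts and the campaign cost. A minimal sketch of that arithmetic, assuming the published campaign costs ($0.05 and $0.03) are the inputs; the `Campaign` class and its property names are hypothetical and not part of the harness:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Campaign:
    """One model's campaign summary (hypothetical helper, not harness code)."""
    model: str
    passed: int
    total: int
    cost_usd: float  # campaign cost as published, rounded to the cent

    @property
    def pass_rate(self) -> float:
        # Percentage of runs the checker accepted.
        return 100.0 * self.passed / self.total

    @property
    def cost_per_run(self) -> float:
        # Campaign cost spread over every run, passing or not.
        return self.cost_usd / self.total

    @property
    def cost_per_pass(self) -> float:
        # Campaign cost spread over passing runs only.
        return self.cost_usd / self.passed


devstral = Campaign("Devstral 2 123B", passed=27, total=30, cost_usd=0.05)
small4 = Campaign("Mistral Small 4", passed=29, total=30, cost_usd=0.03)

for c in (devstral, small4):
    print(f"{c.model}: {c.pass_rate:.1f}% pass, "
          f"${c.cost_per_run:.5f}/run, ${c.cost_per_pass:.5f}/pass")
```

Cost per passing run divides by passes, not by runs, so each failed run raises the effective price of a successful one.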

Devstral 2 123B

Strength

Code-specialist training shows on multi-step planning tasks; efficient tool dispatch strategy

Weakness

task_09 failure consistent with non-reasoning mid-tier models; 123B footprint requires managed API

Full campaign report →

Mistral Small 4

Strength

96.7% pass rate at $0.001/run — best cost-efficiency in the dataset for frontier-tier accuracy

Weakness

Single task_09 failure; less tested on non-coding task variants than Devstral 2

Full campaign report →

Cost ratio: Devstral 2 123B costs about 1.8× as much per passing run as Mistral Small 4 ($0.00185 vs $0.00103). Over 10,000 passing runs, that gap is about $8.
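The ratio and the 10,000-run gap can be checked directly from the two published per-pass figures. This is a rough sketch; the `COST_PER_PASS` mapping and `cost_gap` function are illustrative names, and the calculation assumes cost scales linearly with the number of passing runs:

```python
# Published cost per passing run, in USD (from the comparison table).
COST_PER_PASS = {
    "Devstral 2 123B": 0.00185,
    "Mistral Small 4": 0.00103,
}


def cost_gap(n_passes: int) -> float:
    """Extra spend on Devstral 2 123B for n_passes passing runs (linear assumption)."""
    delta = COST_PER_PASS["Devstral 2 123B"] - COST_PER_PASS["Mistral Small 4"]
    return delta * n_passes


ratio = COST_PER_PASS["Devstral 2 123B"] / COST_PER_PASS["Mistral Small 4"]

print(f"ratio: {ratio:.2f}x")
print(f"gap at 10,000 passing runs: ${cost_gap(10_000):.2f}")
```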

Verdict

Mistral Small 4 leads by 2 runs (29/30 vs 27/30): it failed 1 run where Devstral 2 123B failed 3. The only failure named for either model on this page is task_09; the campaign reports carry the per-task breakdown. Mistral Small 4 also costs less per passing run ($0.00103 vs $0.00185), so it is the stronger choice on both dimensions.

About agentic-core-v1

agentic-core-v1 is modelbattles' flagship benchmark harness. Ten tasks drawn from real engineering work: fix a failing test, refactor duplicated code, investigate a production log, add a null guard, trace through a codebase. Each task runs three times per model. A run passes if and only if the checker accepts the output — no partial credit, no manual grading.

The harness itself is deterministic: same task, same environment, and same checker across all models, so scores are directly comparable. task_09 is the persistent difficulty point: it requires the model to recognise a structurally impossible calculation and refuse to produce a wrong answer instead of looping. Most models fail it at least once.
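The scoring rule described above (binary pass/fail per run, three runs per task, no partial credit) can be sketched as a small aggregator. This is an illustration, not the harness's code: `score_campaign` and the result layout are hypothetical, and the sample data is made up to show the shape of a 29/30 campaign.

```python
from typing import Dict, List


def score_campaign(results: Dict[str, List[bool]]) -> dict:
    """Aggregate binary run results; each task maps to its per-run pass flags."""
    runs = [ok for task_runs in results.values() for ok in task_runs]
    return {
        "passed": sum(runs),
        "total": len(runs),
        "pass_rate": 100.0 * sum(runs) / len(runs),
        # Tasks that failed every run (a consistent failure).
        "always_failed": sorted(t for t, r in results.items() if not any(r)),
        # Tasks that passed some runs and failed others.
        "sometimes_failed": sorted(
            t for t, r in results.items() if any(r) and not all(r)
        ),
    }


# Illustrative data only: 10 tasks x 3 runs, with one made-up task_09 failure.
results = {f"task_{i:02d}": [True, True, True] for i in range(1, 11)}
results["task_09"] = [True, False, True]

summary = score_campaign(results)
print(summary)
```

Separating tasks that fail every run from tasks that fail only sometimes distinguishes a consistent weakness from run-to-run variance, which the headline pass rate alone cannot show.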

Read: What we actually measure and why · How to read an agentic-core-v1 score
