The specialist wins at 123B

Campaign: 2026-05-17-devstral-2-agentic-core-v1
Model: Devstral 2 123B (mistral.devstral-2-123b, AWS Bedrock us-east-1)
Harness: openclaw@2026.4.22
Runs: 30 (10 tasks × 3 runs each)
Campaign date: 2026-05-17


The GPT-OSS 120B campaign answered one question and raised another. At 120B parameters, OpenAI’s open-source generalist scored 23/30 on agentic-core-v1, four points behind the 675B Mistral Large 3, with task_07 (multi-step sequential planning) as the primary failure point. The result established a 120B ceiling for general-purpose models on this suite. Or it appeared to.

Devstral 2 is nearly the same size, at 123B, but it was not trained the same way. Mistral AI built it specifically for code and agentic workflows. If the GPT-OSS 120B failure at task_07 was a knowledge problem rather than a scale problem (a lack of training signal on sequential file-writing workflows, not a lack of reasoning capacity), then Devstral 2 should close the gap. That was the hypothesis going in. Rigg predicted 25–28/30.

The result: 27/30. Same score as Mistral Large 3 with roughly 5.5× fewer parameters. task_07: 3/3. And one more thing that wasn’t predicted.


What the harness actually tests

[Observed: harness spec]

agentic-core-v1 runs each model on 10 tasks, 3 times each, for 30 total runs. The tasks are the kinds of work a real software agent does: fix a failing test, refactor duplicated code, investigate a log, trace through a codebase, handle an ambiguous requirement, recover from a tool error, execute a multi-step sequential plan (task_07), recover under byte-count ambiguity (task_08), identify an impossible computation (task_09), and run a SQL investigation (task_10).

A pass means the model completed the task correctly and completely. Failure modes are wrong_answer (incorrect result returned), gave_up_mid_plan (model halted before finishing), or a tool-loop that never produced usable output. Two tasks deserve a closer look. task_09 is an explicit trap: it supplies 3 rows of data and asks for a 10-day moving average. The correct answer is to recognise the insufficiency and refuse. No non-reasoning model so far has passed it more than once in three runs. task_07 requires four sequential file writes, with verification after each step before proceeding to the next. It tests whether a model can maintain a plan without collapsing mid-execution.


What Devstral 2 did

[Observed]

27 of 30 runs passed. Pass rate: 90.0% (verified: pass_rate_by_task.csv). Eight of ten tasks were clean. The two exceptions: task_08 at 2/3, and task_09 at 1/3. All three failures were wrong_answer (verified: failure_mode_histogram.csv). No infrastructure errors, no turn-limit hits.
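The headline numbers follow directly from the per-task pass counts. A minimal sketch of that tally, assuming the counts are transcribed by hand from the results table rather than read from pass_rate_by_task.csv (whose column layout is not shown here); the function and variable names are illustrative only:

```python
# Per-task pass counts transcribed from the results table (3 runs per task).
# Assumption: these are typed in by hand, not loaded from the campaign CSVs.
passes_by_task = {
    "task_01": 3, "task_02": 3, "task_03": 3, "task_04": 3, "task_05": 3,
    "task_06": 3, "task_07": 3, "task_08": 2, "task_09": 1, "task_10": 3,
}
RUNS_PER_TASK = 3


def summarize(passes, runs_per_task):
    """Return total runs, total passes, failures, and the overall pass rate."""
    total_runs = len(passes) * runs_per_task
    total_passes = sum(passes.values())
    return {
        "total_runs": total_runs,
        "total_passes": total_passes,
        "failures": total_runs - total_passes,
        "pass_rate": total_passes / total_runs,
    }


summary = summarize(passes_by_task, RUNS_PER_TASK)
print(f"{summary['total_passes']}/{summary['total_runs']} "
      f"({summary['pass_rate']:.1%}), {summary['failures']} failures")
```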

| Task | Result | Avg tool calls | Avg latency |
|---|---|---|---|
| task_01 fix failing test | 3/3 | | 4.48s |
| task_02 refactor duplicated code | 3/3 | | 6.18s |
| task_03 investigate log | 3/3 | | 4.16s |
| task_04 trace through codebase | 3/3 | | 6.68s |
| task_05 minimal fix | 3/3 | | 7.13s |
| task_06 handle ambiguous requirement | 3/3 | | 5.50s |
| task_07 multi-step plan | 3/3 | 4.0 | 2.91s |
| task_08 recover from tool error | 2/3 | | 2.07s |
| task_09 know when to stop | 1/3 | | 1.95s |
| task_10 SQL investigation | 3/3 | | 4.59s |

(verified: pass_rate_by_task.csv, latency_distribution.csv)


Does training beat parameter count?

[Observed]

GPT-OSS 120B and Devstral 2 are the same scale. Both around 120B parameters. Both on Bedrock. The results on task_07 are not close:

| Model | task_07 score | Parameters |
|---|---|---|
| GPT-OSS 120B | 1/3 | 120B (generalist) |
| Devstral 2 | 3/3 | 123B (code-specialist) |
| Mistral Large 3 | 3/3 | 675B (generalist) |

task_07 asks for four sequential file writes (step1.txt through step4.txt) with verification after each step before proceeding. It is structured, ordered, and unambiguous. The model that fails it is not confused about what it needs to do; it either doesn’t commit a step before moving on, or it doesn’t check after each write.
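The write-then-verify discipline that task_07 asks for can be sketched in a few lines. This is an illustration of the pattern only, not the harness's task definition: the file contents, the helper names write_step, verify_step, and run_plan, and the use of a temporary directory are all assumptions.

```python
import tempfile
from pathlib import Path


def write_step(directory: Path, step: int) -> Path:
    """Write stepN.txt; the content is a hypothetical placeholder."""
    path = directory / f"step{step}.txt"
    path.write_text(f"step {step} complete\n", encoding="utf-8")
    return path


def verify_step(path: Path, step: int) -> bool:
    """Confirm the file exists and holds exactly what was written."""
    return (path.is_file()
            and path.read_text(encoding="utf-8") == f"step {step} complete\n")


def run_plan(directory: Path, n_steps: int = 4) -> list[Path]:
    """Write each step in order and refuse to continue past a failed check."""
    written = []
    for step in range(1, n_steps + 1):
        path = write_step(directory, step)
        if not verify_step(path, step):
            raise RuntimeError(f"verification failed at step {step}")
        written.append(path)
    return written


with tempfile.TemporaryDirectory() as tmp:
    names = [p.name for p in run_plan(Path(tmp))]
print(names)
```

The point of the sketch is the ordering: a step only counts once its check has passed, which is the behaviour the two failure patterns described above skip.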

Devstral 2 ran task_07 with exactly 4 tool calls per run, consistent across all 3 runs (min=4, max=4; verified: tool_calls_by_task.csv). No variance, no exploration. It knew the pattern going in.

[Speculation]

The most straightforward explanation: Devstral 2’s training data included code that writes files in sequence and verifies them. A model that has seen for step in steps: write_step(step); verify_step(step) in thousands of contexts does not need to reason about task_07 from scratch. GPT-OSS 120B at the same scale does not have that pattern loaded. Mistral Large 3 at 675B closes the gap through sheer scale. Devstral 2 closes it through domain knowledge at a fifth of the parameter count.

The implication: for agentic code workflows, the parameter count on the box is not the right variable. Devstral 2’s training lineage matters more than its size.


What happened on the impossible task?

[Observed]

task_09_know_when_to_stop has been a clean failure across every campaign. The task supplies a 3-row CSV and asks for a 10-day moving average of the revenue column. There is no 10-day moving average to compute from 3 rows. The correct answer is to recognise the insufficiency and refuse.

Most non-reasoning models have scored 0/3 on task_09: Mistral Large 3, GPT-5.5 Instant, and Llama 3.3 70B are all in that group. DeepSeek-V4-Flash had already scored 1/3 in non-thinking mode. Devstral 2 also scored 1/3: one run passed, two returned wrong_answer.

| Model | task_09 score | Failure mode |
|---|---|---|
| Claude Sonnet 4.6 | 1/3 | Caught impossibility once |
| DeepSeek-V4-Flash | 1/3 | Caught it once (non-thinking mode) |
| Mistral Large 3 | 0/3 | wrong_answer × 3 |
| GPT-5.5 Instant | 0/3 | wrong_answer × 3 |
| GPT-OSS 120B | 0/3 | |
| Gemma 4 31B IT | 0/3 | gave_up_mid_plan (model abandoned the task before producing output) × 3 |
| Llama 3.3 70B | 0/3 | |
| Devstral 2 123B | 1/3 | wrong_answer × 2 |

1/3 is not a reliable signal. Two of three runs still failed. Treat this as an anomaly to watch, not a capability unlock.

[Speculation]

Why did Devstral 2 catch it once when Mistral Large 3 (5.5× larger, same model family) did not? The training hypothesis again: code that validates window sizes before computing rolling averages is common. if len(data) < window: return None is a pattern that appears regularly in data-processing code. Devstral 2 may have seen that check often enough to apply it reflexively on one of three runs, where Mistral Large 3’s general training did not establish the same reflex.
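The guard in question is easy to show in context. A minimal sketch of a moving average that checks the window size first, assuming a plain list of floats; the function name and the sample revenue figures are hypothetical and are not the task_09 data:

```python
from typing import Optional, Sequence


def moving_average(values: Sequence[float], window: int) -> Optional[list[float]]:
    """Return the simple moving average, or None when there is too little data."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        # Not enough rows for even one full window: decline to compute.
        return None
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest.
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages


revenue = [100.0, 120.0, 90.0]  # hypothetical 3-row input
result = moving_average(revenue, window=10)
print(result)
```

With 3 rows and a window of 10 the function returns None instead of a number, which is the refusal behaviour task_09 is looking for.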

Whether this is reproducible, and whether Devstral 2 consistently outperforms Mistral Large 3 specifically on input-validation tasks, requires more runs. What we have is one pass on a task where only DeepSeek-V4-Flash had previously managed the same result. A second data point, not a new capability unlock.


How fast is Devstral 2?

[Observed]

Devstral 2 is fast. The fastest in the series on the tasks that matter most.

task_07 averaged 2.91s. Mistral Large 3 averaged 22.11s on the same task (7.6× slower at identical quality, both 3/3). No run of task_07 on Devstral 2 exceeded 4s. Mistral Large 3’s task_07 max was 62s.

The entire campaign ran in under 4 minutes. No p99 latency spike. The slowest task (task_05_minimal_fix) peaked at 9.15s.

For workflows that run agentic tasks at volume, this gap compounds. At 2.91s versus 22.11s, each task_07-style run finishes about 19 seconds sooner on Devstral 2, which adds up quickly on a task_07-heavy pipeline. The cost argument for Devstral 2 over Mistral Large 3 is already clear; the latency argument is separate and additional.
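To make the compounding concrete, here is a rough sketch of the arithmetic using the two task_07 averages reported above. The batch size of 1,000 runs is a hypothetical figure, and the calculation assumes strictly sequential execution with no queueing or rate limits:

```python
DEVSTRAL_TASK07_S = 2.91        # average task_07 latency, Devstral 2
MISTRAL_LARGE_TASK07_S = 22.11  # average task_07 latency, Mistral Large 3


def sequential_seconds(per_run_s: float, n_runs: int) -> float:
    """Wall-clock time for n_runs executed one after another."""
    return per_run_s * n_runs


speedup = MISTRAL_LARGE_TASK07_S / DEVSTRAL_TASK07_S
saved_per_run = MISTRAL_LARGE_TASK07_S - DEVSTRAL_TASK07_S

n_runs = 1000  # hypothetical batch size, not a campaign figure
saved_hours = (sequential_seconds(MISTRAL_LARGE_TASK07_S, n_runs)
               - sequential_seconds(DEVSTRAL_TASK07_S, n_runs)) / 3600

print(f"speedup: {speedup:.1f}x, saved per run: {saved_per_run:.2f}s")
print(f"saved over {n_runs} sequential runs: {saved_hours:.1f} hours")
```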


The cost position

[Observed]

| Model | Score | Cost/pass | Total cost |
|---|---|---|---|
| Claude Sonnet 4.6 | 28/30 | $0.0514 | $1.44 |
| GPT-5.5 Instant | 27/30 | $0.0700 | $1.89 |
| Mistral Large 3 | 27/30 | $0.0022 | $0.06 |
| Devstral 2 123B | 27/30 | $0.0019 | $0.05 |
| DeepSeek-V4-Flash | 25/30 | $0.0756 | |
| GPT-OSS 120B | 23/30 | $0.0013 | $0.03 |
| Llama 3.3 70B | 20/30 | $0.0045 | $0.09 |

(verified: cost_breakdown.csv)

At $0.0019/pass, Devstral 2 is marginally cheaper than Mistral Large 3 ($0.0022/pass) at the same score. The cost range within the 27/30 tier now spans 37×: from $0.0019 (Devstral 2) to $0.0700 (GPT-5.5 Instant). All three models at this tier score identically on this suite. The case for running anything other than Devstral 2 at 27/30 quality requires a reason that isn’t in this data.
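The 37× figure is the ratio of the most expensive to the cheapest cost per pass within the 27/30 tier. A small sketch of that calculation, using the values from the cost table; the helper cost_per_pass_from_total is illustrative and assumes cost per pass is simply total cost divided by passing runs:

```python
# Cost per pass for the three models that scored 27/30 (from the cost table).
cost_per_pass = {
    "GPT-5.5 Instant": 0.0700,
    "Mistral Large 3": 0.0022,
    "Devstral 2 123B": 0.0019,
}


def cost_per_pass_from_total(total_cost: float, passes: int) -> float:
    """Assumed definition: total campaign cost divided by passing runs."""
    if passes <= 0:
        raise ValueError("cost per pass is undefined with no passing runs")
    return total_cost / passes


cheapest = min(cost_per_pass, key=cost_per_pass.get)
priciest = max(cost_per_pass, key=cost_per_pass.get)
spread = cost_per_pass[priciest] / cost_per_pass[cheapest]

print(f"cheapest: {cheapest}, priciest: {priciest}, spread: {spread:.1f}x")
```

The published totals are rounded to the cent, so recomputing cost per pass from them only approximates the four-decimal figures for the cheaper models.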

GPT-OSS 120B is cheaper per pass at $0.0013 but scores 4 points lower (23/30). The efficiency trade-off depends on your tolerance for the 23/30 failure modes (task_07 failures and task_09 behaviour). If those tasks are absent from your pipeline, the cost case for GPT-OSS 120B improves.


What the predictions got wrong

[Observed]

5 of 6 predictions were correct. The miss:

P2 — task_09 0/3: WRONG. Predicted 0/3 based on the prior pattern: every non-reasoning model before Devstral 2 had scored 0/3. Actual: 1/3. The prediction held for 7 models in a row before breaking. The prior was reasonable; the result was not unreasonable, since 2 of 3 runs still failed. But the miss matters because it’s the first evidence that coding specialisation may improve input-validation behaviour on task_09.

The other 5 landed: score range (P1: 27/30 in 25–28 range), task_07 performance (P3: 3/3 as predicted), cost range (P4: $0.05 in $0.04–$0.20), no infrastructure errors (P5), and the specialist outperforming GPT-OSS 120B (P6: 27 > 23).


What we don’t know yet

[Speculation]

The task_09 result (one pass from a non-reasoning model, only the second after DeepSeek-V4-Flash) needs more runs to mean anything. The question is whether Devstral 2’s coding-specialisation training produces consistent input-validation behaviour or whether the one pass was noise. A targeted re-run of task_09 with 9 or 12 runs would be informative.
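One way to see why 3 runs say so little is to put an interval around the observed pass rate. The sketch below uses the Wilson score interval at roughly 95% confidence; this is a choice made here for illustration, not a method the campaign used, and the 4-of-12 case is a hypothetical re-run at the same observed rate:

```python
import math


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is about 95% confidence)."""
    if runs <= 0 or not 0 <= passes <= runs:
        raise ValueError("need runs > 0 and 0 <= passes <= runs")
    p = passes / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# (1, 3) is the observed task_09 result; (4, 12) is a hypothetical re-run.
for passes, runs in [(1, 3), (4, 12)]:
    low, high = wilson_interval(passes, runs)
    print(f"{passes}/{runs}: {low:.2f} to {high:.2f}")
```

At 1/3 the interval covers most of the range between "almost never" and "usually", so a larger run count is what would separate a real validation reflex from noise.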

The task_08 result (2/3, one wrong_answer) is unexplained. GPT-5.5 Instant, Claude Sonnet 4.6, and DeepSeek-V4-Flash all got 3/3. One run failing the byte-count recovery task could be stochastic or could indicate Devstral 2 is less consistent on byte/character ambiguity under tool constraints. Not enough information from 3 runs.

Whether the latency advantage holds under concurrent load (multiple parallel task runs hitting Bedrock) is unknown. The 2.91s task_07 average was measured in sequential single-threaded execution.

ClawWorks Weekly
