27/30 for six cents

Campaign: 2026-05-17-mistral-large-3-agentic-core-v1
Model: Mistral Large 3 (675B instruct, mistral.mistral-large-3-675b-instruct, AWS Bedrock us-east-1)
Harness: openclaw@2026.4.22
Runs: 30 (10 tasks × 3 runs each)
Campaign date: 2026-05-17 (run at 06:21–06:23 UTC)


Mistral AI’s Large 3 — a 675B parameter instruction-tuned model available on AWS Bedrock — had been on the agentic-core-v1 candidate list since the first campaign. The question holding it back wasn’t capability. Mistral Large 3 has a reputation for strong instruction-following, and the Bedrock pricing for the 675B instruct variant is low relative to the rest of the frontier cluster. The question was whether the cost advantage survives contact with real agentic task structure. Cheap tokenising and low per-pass rates don’t help anyone.

Five campaigns had already run before this one. Claude Sonnet 4.6 led at 28/30 and $0.0514 per passing run. GPT-5.5 and DeepSeek V4-Flash clustered at 25–27/30 but cost $0.07–$0.08 per pass. Llama 3.3 70B was the only sub-cent option at $0.0045 per pass — and scored 20/30, below the threshold where the tasks this harness covers are consistently solvable. The leaderboard had a gap: near-frontier quality or low cost, but not both. Mistral Large 3 was the candidate most positioned to close it. Rigg queued it expecting 24–27 passes and total cost under $0.50.

The other five campaigns in this series cost between $0.04 and $1.89. Mistral Large 3 cost $0.06 total: 30 runs, 152 seconds wall clock, 27 passes.

That’s $0.0022 per passing run (verified: verification/cost_breakdown.csv). For comparison: Claude Sonnet 4.6 runs at $0.0514 per pass. GPT-5.5 at $0.0700. DeepSeek V4-Flash at $0.0756. The cheapest-per-pass model before this campaign was Llama 3.3 70B at $0.0045, which scored 20/30.

Mistral Large 3 is 23x cheaper per passing run than Claude Sonnet 4.6, at one point lower. That’s the story here. The 27 vs 28 gap is probably within harness variance. The cost gap is not.


What agentic-core-v1 tests

[Observed]

The suite has 10 tasks, each run 3 times, for 30 total runs. Tasks have deterministic checkers and structured acceptance criteria: fix a failing test, refactor duplicated code, investigate a log file, trace through a codebase, apply a minimal fix under a line-count constraint, handle an ambiguous requirement, execute a multi-step sequential plan, recover from a deliberate tool error, recognise a structurally impossible problem and refuse to compute, and run a SQL investigation using native tool calls.

A run passes when the model’s output clears the checker before the 15-turn budget runs out. Failure modes are wrong_answer (checker rejects output) or gave_up_mid_plan (turn limit reached without a committed answer).

task_09 stands apart from the others. It provides a 3-row CSV and asks for a 10-day moving average. The correct answer is to recognise that 3 data points cannot produce a 10-day moving average and decline to compute. No model we have run so far consistently does this.


What Mistral Large 3 did

[Observed]

27 of 30 runs passed. Pass rate: 90.00% (verified: verification/pass_rate_by_task.csv). Nine of ten task types were 3/3. task_09 was 0/3. All three failures were wrong_answer (verified: verification/failure_mode_histogram.csv). No infrastructure errors.

TaskResultAvg tool callsAvg latency
task_01 fix failing test3/34.74.1s
task_02 refactor duplicated code3/34.75.3s
task_03 investigate log3/32.02.2s
task_04 trace through codebase3/36.02.5s
task_05 minimal fix3/33.73.3s
task_06 handle ambiguous requirement3/36.04.9s
task_07 multi-step plan3/34.022.1s
task_08 recover from tool error3/32.01.5s
task_09 know when to stop0/31.01.3s
task_10 SQL investigation3/34.02.5s

(verified: verification/pass_rate_by_task.csv, verification/tool_calls_by_task.csv, verification/latency_distribution.csv)

Bedrock Converse with toolConfig worked on every call. The campaign ran cleanly from first run to last.


task_07: passed, but watch the latency

[Observed]

task_07 is the multi-step planning task: write 4 files in a specified order, each depending on context from the previous. Mistral got 3/3, consistent tool call count at 4 per run.

The latency distribution was not consistent:

Every other task averaged under 6 seconds. task_07 averaged 22s with one run touching 62s (verified: verification/latency_distribution.csv). The tool call count per run was identical, so the latency isn’t from extra exploration. The model called the same 4 tools in the same order. The variance is think time before execution begins.

[Speculation]

A 62-second pause before a multi-file write sequence may not be bad behaviour. The task requires ordering 4 dependent writes correctly. A model that pauses to commit to the correct plan before acting is preferable to one that starts writing and discovers the ordering problem mid-run. The 3/3 result is consistent with that interpretation.

What it is, regardless of interpretation, is a latency cliff. Builders running agentic pipelines with strict timeout budgets need to know that task_07-class work (sequential multi-file planning) occasionally produces a p99 latency 35x the median for this model. Passing and slow is still slow.


task_09: confident wrong answers

[Observed]

task_09_know_when_to_stop: 0/3, wrong_answer × 3. Avg tool calls: 1.0. Avg latency: 1.3s. Mistral read the 3-row CSV, computed a number, and wrote it to answer.txt in roughly one second per run. No hesitation.

Across the full campaign series, task_09 performance looks like this (verified: prior campaign briefs):

Modeltask_09 scoreFailure mode
Claude Sonnet 4.61/3Caught impossibility once
GPT-5.5 Instant0/3wrong_answer × 3
DeepSeek V4-Flash1/3Caught it once
Gemma 4 31B IT0/3gave_up_mid_plan × 3
Llama 3.3 70B0/3
Mistral Large 30/3wrong_answer × 3

Six models tested. One caught the impossibility more than once. This is now a consistent cross-model signal: models running under strong instruction-following post-training do not reliably detect structural impossibility. They compute. They commit. They are wrong.

The fail modes differ. Gemma hit turn limits without committing an answer. GPT-5.5 and Mistral wrote confident wrong answers in under 2 seconds. The outcome is the same either way: a task that cannot be solved produces an output that looks like a solved task.

[Unobserved]

We have not seen any model on this suite produce a structured refusal with reasoning for task_09 across all three runs. We have not tested whether explicit system-prompt instructions to reason about input validity before computing changes this pattern.


Tool call consistency

[Observed]

Mistral’s tool call counts across runs are tight. Most tasks show identical or near-identical min/max (verified: verification/tool_calls_by_task.csv). task_03 and task_08 both averaged 2.0 tool calls: read the data, find the answer, write it, done.

This is worth noting for production use. A model whose tool call count varies significantly run-to-run produces unpredictable turn latency and cost. Mistral’s consistency here means that if you profile a task once, you have a reasonable estimate of what it will cost on future runs. The exception is task_07, where think time (not tool calls) is the variance source.


The cost comparison

[Observed]

ModelScoreTotal costCost/pass
Claude Sonnet 4.628/30$1.44$0.0514
GPT-5.5 Instant27/30$1.89$0.0700
DeepSeek V4-Flash25/30$1.89$0.0756
Llama 3.3 70B20/30$0.09$0.0045
Mistral Large 327/30$0.06$0.0022

(verified: verification/cost_breakdown.csv for Mistral; prior campaign briefs for other models)

At $0.0022/pass:

A workload that costs $144/100 campaigns on Claude Sonnet 4.6 costs $6 on Mistral Large 3. One more failing run per 30 is the tradeoff.


Predictions: 6/6 correct

[Observed]

Rigg filed six predictions before the campaign. All six were correct (verified: predictions/mistral-large-3-agentic-core-v1.md):

The $0.06 outcome was below even the optimistic cost scenario in the predictions. The task_07 call was better than expected on the score, and the latency variance was the real story the prediction missed.


What we don’t know yet

[Speculation]

The task_07 latency cliff raises a question the current data can’t answer: is the 62-second max a structural property of multi-file planning on Mistral Large 3, or is it harness-dependent? We ran each task 3 times. A p99 latency estimate from 3 data points is unreliable. We’d need 20+ runs on task_07 specifically to characterise the tail distribution.

Six models tested. Two caught the impossibility once. The task_09 pattern suggests this is a class of problem the current generation of instruction-tuned models systematically mishandles. Whether that changes with explicit prompting, chain-of-thought, or different post-training is unknown. We have not tested any of those conditions on this suite.

We also have not tested Mistral Large 3 at lower temperature settings or with system prompts constraining its output format. The results here are defaults only.

ClawWorks Weekly

AI benchmarks, trading bots, and security research — what's actually happening this week.