Gemini 3.1 Pro on agentic-core-v1: 23/30, and the 40-point gap from Flash

Campaign: 2026-06-15-gemini-3.1-pro-agentic-core-v1
Model: Google Gemini 3.1 Pro (gemini-3.1-pro-preview, Google AI Studio API)
Harness: agentic-core-v1 (10 tasks × 3 runs = 30 runs)
Campaign date: 2026-06-15 (19:55–20:15Z)


In May, Gemini 3.5 Flash scored 11/30 on agentic-core-v1. At the time, that was the worst result by any model that had successfully run the harness. The failure mode was consistent: gave_up_mid_plan (the harness failure mode logged when the turn budget expires without required output being written), early and often, across tasks where every other capable model at least attempted a path through.

The question after Flash was simple: is this a Google architecture problem, or a tier problem? The way to find out is to run the Pro tier.

Gemini 3.1 Pro scored 23/30.

That is a gap of 40 percentage points (76.7% against 36.7%) between two models from the same family, tested on the same harness within a month. It is one of the largest intra-family splits we have measured. Routine coding tasks, codebase exploration, SQL investigation, and ambiguity handling came out clean. The Pro tier behaves like a different model, not a spec bump.

Two things did not change. task_09 (know_when_to_stop) is 0/3 for Pro and was 0/3 for Flash. Tool call redundancy (repeated reads of the same file, repeated re-verification of things already confirmed) appeared in 17/30 Pro runs. The family has a ceiling and a consistent exploration habit, and both show up at both tiers.


What agentic-core-v1 tests

[Observed — harness spec]

The suite runs 10 tasks, 3 times each. Each task has a deterministic pass/fail checker. The turn budget is fixed at 15 per run. Failure modes are logged at run time: wrong_answer when the checker rejects the output, gave_up_mid_plan when the turn budget expires without the required output, and infrastructure_error when the harness fails to complete the run at all.
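The three failure modes above can be read as a small decision procedure. The sketch below is illustrative only: the RunRecord fields and the classify function are hypothetical names, not the harness's actual code, and it assumes that any run ending without the required output is logged as gave_up_mid_plan.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    """Hypothetical summary of one run; field names are illustrative."""
    completed: bool                        # False if the harness itself failed
    output_written: bool                   # did the model write the required output?
    checker_passed: Optional[bool] = None  # None when there was nothing to check


def classify(run: RunRecord) -> str:
    """Map a run to 'pass' or one of the three logged failure modes."""
    if not run.completed:
        return "infrastructure_error"
    if not run.output_written:
        # Assumption: the run ended (turn budget spent) with no output.
        return "gave_up_mid_plan"
    return "pass" if run.checker_passed else "wrong_answer"
```

The order of the checks matters: a run that never produced output cannot be a wrong_answer, because the checker had nothing to reject.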

The 10 tasks cover the actual work of a coding agent: fix a failing test, refactor duplicated code, investigate a log file, trace execution through a multi-file codebase, apply a constrained minimal fix, handle an ambiguous requirement with two-artifact output, execute a four-step sequential plan, recover when a tool call returns an error, detect when a computation is impossible on the available data, and run a SQL investigation.

Two tasks carry structural traps. task_09 asks for a 10-day moving average on a three-row dataset. The correct move is to acknowledge the data limitation and write a partial or qualified answer. task_08 injects a tool error mid-task to test whether the model can find an alternative path. Neither has an ambiguous success condition.
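To make the task_09 trap concrete, here is a minimal sketch of the behaviour the task rewards: check whether the window fits the data before computing, and write a qualified answer if it does not. The function names and the placeholder values are assumptions for illustration; the actual contents of data.csv and the checker's expected wording are not reproduced here.

```python
from typing import List, Optional


def moving_average(values: List[float], window: int) -> Optional[List[float]]:
    """Simple moving average, or None when the window exceeds the data."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        return None
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def qualified_answer(values: List[float], window: int) -> str:
    """Always return something writable, even when the computation is impossible."""
    result = moving_average(values, window)
    if result is None:
        return (
            f"Cannot compute a {window}-day moving average: "
            f"only {len(values)} rows available."
        )
    return f"{window}-day moving average: {result}"


# Placeholder values standing in for the three rows of data.csv.
print(qualified_answer([1.0, 2.0, 3.0], 10))
```

The point is the early return: the impossible case produces an output instead of another read.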


The results

[Observed — verification: pass_rate_by_task.csv]

Task                              | Passed / Runs | Avg tool calls | Avg latency | Notes
task_01 fix_failing_test          | 3/3           | 13.3           | 41s         |
task_02 refactor_duplicated_code  | 3/3           | 24.0           | 104s        | Slowest task; highest tool use
task_03 investigate_log           | 2/3           | 8.0            | 34s         | One max-turns failure
task_04 trace_through_codebase    | 3/3           | 11.3           | 72s         |
task_05 minimal_fix               | 3/3           | 8.0            | 29s         |
task_06 handle_ambiguous_requirement | 3/3        | 14.0           | 27s         |
task_07 multi_step_plan           | 2/3           | 8.0            | 16s         | One wrong_answer
task_08 recover_from_tool_error   | 1/3           | 6.0            | 23s         | Two max-turns failures
task_09 know_when_to_stop         | 0/3           | 6.0            | 16s         | Family floor; 0/3 at Flash tier too
task_10 sql_investigation         | 3/3           | 8.0            | 35s         |

Total: 23/30 (76.7%) · $0.845578 total · $0.0282 per run average

(verified: pass_rate_by_task.csv, cost_breakdown.csv, tool_calls_by_task.csv, latency_distribution.csv)
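The totals follow directly from the per-task rows. As a quick check, this snippet recomputes them from the pass counts in the results table and the reported total cost; the variable names are ours.

```python
# Passes per task, copied from the results table (3 runs each).
passes = {
    "task_01": 3, "task_02": 3, "task_03": 2, "task_04": 3, "task_05": 3,
    "task_06": 3, "task_07": 2, "task_08": 1, "task_09": 0, "task_10": 3,
}
RUNS_PER_TASK = 3
total_cost = 0.845578  # USD, from cost_breakdown.csv as reported above

total_passed = sum(passes.values())
total_runs = RUNS_PER_TASK * len(passes)
pass_rate = total_passed / total_runs
cost_per_run = total_cost / total_runs

print(f"{total_passed}/{total_runs} ({pass_rate:.1%}), ${cost_per_run:.4f} per run")
# prints: 23/30 (76.7%), $0.0282 per run
```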


The six tasks it handled cleanly

[Observed — verification: pass_rate_by_task.csv, tool_call_redundancy evidence bundle]

Six tasks were 3/3. The pattern across them: each has a clear output contract and requires structured, sequential exploration of a codebase or log file. The model uses more tool calls than necessary, but on tasks where the success condition is unambiguous, redundant reads do not cause failures.

task_01 (fix_failing_test) averaged 13.3 tool calls per run. The model read the failing test, identified the bug in src/add.py, patched it, and re-ran the tests. The tool_call_redundancy evidence shows repeated bash run_tests.sh calls after the patch was already confirmed — the model ran the tests multiple times before writing its final answer. Redundant, but the task passes on correctness, not efficiency. All three runs passed (verified: tool_call_redundancy.md).

task_02 (refactor_duplicated_code) was the most expensive task: average 24 tool calls and 104 seconds per run. The model re-reads source files heavily across turns. All three passed. Flash also scored 3/3 on task_02, so this is a baseline task for models with basic coding capability. It does not differentiate tiers.

task_04 (trace_through_codebase) is where Pro and Flash split clearly. Flash scored 1/3. Pro hit 3/3. The task requires navigating a multi-file codebase: reading entry points, following imports, writing a structured trace.txt. Flash dropped after shallow reads. Pro explored correctly across all three runs, following the full import chain before writing output (verified: pass_rate_by_task.csv).

task_06 (handle_ambiguous_requirement) shows the same gap: Flash 1/3, Pro 3/3. task_10 (sql_investigation) follows the same pattern: Flash 1/3, Pro 3/3. Both are tasks that require structured investigation across multiple files or schema artifacts. The Pro tier’s more thorough file reading, even when redundant, pays off on tasks where completeness determines the pass.

task_05 (minimal_fix) was 3/3 for both models. Single file, tight line constraint, unambiguous success signal. It does not distinguish tiers.


The gap from Flash: what changed and what did not

[Observed — verification: pass_rate_by_task.csv; comparison baseline: campaign 2026-06-gemini-3.5-flash-agentic-core-v1]

Pro improved on 7 of 10 tasks compared to Flash. The three that did not move: task_02 and task_05 (both models at ceiling) and task_09 (both at floor).

The improvement maps to task type. Multi-file exploration (task_04, task_10): Flash got lost; Pro navigated. Ambiguity handling (task_06): Flash picked the first interpretation and stopped; Pro worked through the space and produced two correct artifacts. Log investigation (task_03): Flash burned its turn budget without completing on any of three runs; Pro passed two. Multi-step execution (task_07): Flash abandoned all three runs; Pro completed two and produced an incorrect answer on the third.

What did not change: tool call redundancy appeared in 17/30 Pro runs (verified: tool_call_redundancy.md, 55 evidence entries), predominantly on passing runs. The same pattern was present in Flash. The Pro tier issues more tool calls per run, which means it can absorb more redundant reads before hitting the turn limit. On tasks where the exit condition is clear, this does not cause failures. On tasks where it is ambiguous, the redundant reads consume turns that could have gone toward finding the answer — which is what links tool_call_redundancy to the gave_up_mid_plan failures below.
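One way to quantify redundancy of this kind is to count reads of a path that has already been read, with no write in between. The sketch below is not the analysis behind tool_call_redundancy.md. The (tool, argument) tuple shape is an assumption, fs_read and cat are the read tools named in the transcripts, and fs_write is a hypothetical name for whatever write tool the harness exposes.

```python
from typing import List, Set, Tuple

ToolCall = Tuple[str, str]  # (tool name, path argument); assumed shape

READ_TOOLS = {"fs_read", "cat"}  # read tools named in the transcripts
WRITE_TOOLS = {"fs_write"}       # hypothetical write tool name


def redundant_reads(calls: List[ToolCall]) -> int:
    """Count reads of a path already read since the last write."""
    seen: Set[str] = set()
    redundant = 0
    for tool, path in calls:
        if tool in WRITE_TOOLS:
            seen.clear()  # contents may have changed, so a re-read is justified
        elif tool in READ_TOOLS:
            if path in seen:
                redundant += 1
            seen.add(path)
    return redundant


# A read-only loop over one file, alternating between the two read tools.
calls = [("fs_read", "data.csv"), ("cat", "data.csv"), ("fs_read", "data.csv")]
print(redundant_reads(calls))  # prints: 2
```

Keying on the path rather than the tool means switching from fs_read to cat on the same file still counts as a repeat.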


Why do runs end in loops?

[Observed — cross_task_consistency evidence bundle, long_tail_turn_count evidence bundle]

Six of the seven failed runs were gave_up_mid_plan: one on task_03, two on task_08, and three on task_09. The seventh was the wrong_answer on task_07. These were not early abandonment but late looping. The model stays engaged, re-reads files, and reaches the turn limit still in motion, without having written the required output (verified: cross_task_consistency.md, long_tail_turn_count.md).

This is structurally different from Flash’s gave_up_mid_plan. Flash stopped after a few turns when it ran out of forward momentum. Pro runs more turns, issues more tool calls, and still does not find the exit. The model recognises the task is unfinished. It just cannot resolve the boundary condition that would let it commit to writing an answer.

The pattern cuts across three different task types: log investigation (task_03), error recovery (task_08), and quantitative reasoning (task_09). That consistency across task types, combined with the tool call redundancy data (57% of runs), points to a model-level exploration habit rather than anything task-specific (verified: cross_task_consistency.md).

task_08 (recover_from_tool_error) went 1/3. The harness injects a tool error mid-task. Two of three runs hit max turns after the injection: the transcript evidence shows the model looping on fs_read('data.txt') 4–5 times consecutively in runs 79508107 and 46454e71 without pivoting to an alternative path. The one passing run (bbd5dd33, run 3) showed the same initial loop — 3 redundant reads after the injected error — before eventually shifting approach and completing the task. Same strategy, different outcome. The difference between pass and fail on task_08 appears probabilistic at this sample size (verified: tool_call_redundancy.md).
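The task_08 loops are consecutive repeats of the same call, which is easy to detect from a transcript. A minimal sketch, assuming tool calls are available as (tool, argument) tuples; the threshold of 3 and the example call sequence are illustrative, not taken from the run transcripts.

```python
from itertools import groupby
from typing import List, Tuple

ToolCall = Tuple[str, str]  # (tool name, argument); assumed shape


def longest_repeat_streak(calls: List[ToolCall]) -> int:
    """Length of the longest run of identical consecutive tool calls."""
    return max((sum(1 for _ in group) for _, group in groupby(calls)), default=0)


def is_looping(calls: List[ToolCall], threshold: int = 3) -> bool:
    """Flag a run whose longest streak reaches the (illustrative) threshold."""
    return longest_repeat_streak(calls) >= threshold


# Illustrative sequence: one listing, then four identical reads in a row.
example = [("bash", "ls")] + [("fs_read", "data.txt")] * 4
print(longest_repeat_streak(example), is_looping(example))  # prints: 4 True
```

A detector like this would flag the passing run as well as the failing ones, since all three opened with the same loop. It identifies the habit, not the outcome.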


task_09: the floor that held

[Observed — cross_task_consistency evidence bundle]

task_09 (know_when_to_stop) is 0/3. Flash was 0/3. The pattern is the same in both: the model reads data.csv (three rows: January 1–3, 2026), recognises the data is insufficient for a 10-day moving average window, and then loops — re-reading the file via fs_read, then cat, then fs_read again — without writing a qualified answer. 6 tool calls across 12–13 turns, all reads, no write, max turns hit. All three runs follow the same trajectory (verified: cross_task_consistency.md).

The task name is intentionally ironic. The expected response is to acknowledge the limitation and produce a partial or qualified output anyway. Both Gemini models loop instead. The Pro tier recognises the limitation but does not add the capability to act on it: to treat the task as unsolvable as stated and terminate gracefully with an explanation.

[Unobserved]

We looked for the diagnosis_then_regression pattern across all 30 runs — a model that correctly identifies the problem, begins writing a qualified answer, and then reverts to another read attempt. The diagnosis_then_regression evidence bundle shows no instances of this pattern (verified: diagnosis_then_regression.md). The model does not diagnose then back off. It loops without ever committing to writing anything.


What we do not know

[Speculation]

The passing run on task_08 (bbd5dd33) is the result we keep returning to. Same model, same task, same injected error, same opening loop, and it eventually found the pivot while the other two runs did not. We can observe the outcome difference, but we did not determine the mechanism. It could be sampling variation at the temperature used, a subtle difference in how tool responses accumulated across turns, or something in the context window that shifted the probability distribution toward the correct pivot. We do not know.

A follow-up campaign focused on task_08 with a larger run count and transcript diffing across pass and fail runs would be the right test. It would also help determine whether the 1/3 result is stable or whether Pro can reliably pass task_08 with slightly different prompting or retry logic.

The 95% confidence interval on 23/30 is roughly ±15pp. That is wide. The sample is large enough to distinguish Pro from Flash (a 40-point gap) and from models below the 50% pass rate line, but not precise enough to establish a stable ordering against Gemma 4 12B (22/30) or Claude Fable 5 (25/30). The real ranking in that band needs a larger sample.
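The ±15pp figure is consistent with a normal-approximation (Wald) interval on a binomial proportion. Whether the campaign tooling used exactly this method is an assumption. A Wilson interval would be asymmetric and slightly different. The sketch below reproduces the half-width and checks which leaderboard neighbours overlap with Pro.

```python
import math
from typing import Tuple

Interval = Tuple[float, float]


def wald_interval(passed: int, runs: int, z: float = 1.96) -> Interval:
    """Normal-approximation 95% interval for a pass rate (z = 1.96)."""
    p = passed / runs
    half_width = z * math.sqrt(p * (1 - p) / runs)
    return p - half_width, p + half_width


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


pro = wald_interval(23, 30)
flash = wald_interval(11, 30)
gemma = wald_interval(22, 30)
fable = wald_interval(25, 30)

print(f"Pro half-width: {(pro[1] - pro[0]) / 2:.1%}")
print("Pro vs Flash overlap:", overlaps(pro, flash))
print("Pro vs Gemma 4 12B overlap:", overlaps(pro, gemma))
print("Pro vs Claude Fable 5 overlap:", overlaps(pro, fable))
```

Under this approximation the Pro and Flash intervals are disjoint, while Pro overlaps both Gemma 4 12B and Claude Fable 5. That matches the reading above.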


Cost and leaderboard

[Observed — verification: cost_breakdown.csv]

$0.845578 total. $0.0282 per run average. Within campaign budget.

task_03 (investigate_log) is the cost outlier: $0.388324, 46% of campaign spend across three runs. The same task was 57% of Flash’s spend for the same reason: the access log file is large, and the model reads it thoroughly on each pass. Budget-aware deployments with log investigation workloads should weight this task accordingly.
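To see how far task_03 skews the average, the reported figures can be split into task_03 and everything else. This is arithmetic on the two numbers above only; the per-run figures are derived here, not read from cost_breakdown.csv.

```python
total_cost = 0.845578    # USD, whole campaign (30 runs)
task_03_cost = 0.388324  # USD, task_03 across its 3 runs

share = task_03_cost / total_cost
task_03_per_run = task_03_cost / 3
other_per_run = (total_cost - task_03_cost) / 27  # the remaining 27 runs
ratio = task_03_per_run / other_per_run

print(f"task_03 share of spend: {share:.0%}")
print(f"task_03 per run: ${task_03_per_run:.4f}")
print(f"other tasks per run: ${other_per_run:.4f}")
print(f"task_03 runs cost about {ratio:.1f}x an average run elsewhere")
```

A per-run average of $0.0282 therefore understates what a log-heavy workload would cost and overstates the rest.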

Leaderboard as of 2026-06-15 (current campaign verified: cost_breakdown.csv, pass_rate_by_task.csv; prior campaigns from their respective campaign IDs in intel.db):

Model                  | Score | Pass rate | Cost (30 runs) | Date
Claude Opus 4.8        | 30/30 | 100%      | $7.34          | Jun 2026
Mistral Small 4        | 29/30 | 96.7%     | $0.03          | May 2026
Claude Fable 5 (rerun) | 25/30 | 83.3%     | $1.97          | Jun 2026
Gemini 3.1 Pro         | 23/30 | 76.7%     | $0.85          | Jun 2026
Gemma 4 12B            | 22/30 | 73.3%     | $0.00*         | Jun 2026
Gemini 3.5 Flash       | 11/30 | 36.7%     | $0.72          | May 2026
Llama 4 Scout 17B-MoE  | 10/30 | 33.3%     |                | Jun 2026

*Gemma 4 12B ran on internal infra; cost not billed through Google API.

At $0.85 for 30 runs, Pro costs less than one-eighth of what Claude Opus 4.8 does ($7.34) for 77% of the pass rate. On the tasks it handles well (structured coding, multi-file investigation, SQL, ambiguity with a clear output contract) the cost efficiency case is real. The caveat is consistent across both Gemini models: pipelines that hand the model open-ended tasks with incomplete or underspecified inputs need explicit output timeouts or an explicit “give up and explain” instruction in the prompt. The model will not self-terminate gracefully on its own.
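A pipeline-side guard of the kind suggested above might look like the following. This is a sketch under stated assumptions: the function, the two-turn reserve, and the prompt wording are all hypothetical, and we have not tested whether such a nudge changes either Gemini model's behaviour.

```python
from typing import Optional

TURN_BUDGET = 15    # agentic-core-v1's fixed budget; other pipelines will differ
RESERVED_TURNS = 2  # hypothetical: turns held back for writing the answer

GIVE_UP_PROMPT = (
    "You are almost out of turns. Stop exploring. Write your best answer now, "
    "and state plainly anything you could not determine and why."
)


def nudge_for_turn(
    turn: int,
    output_written: bool,
    budget: int = TURN_BUDGET,
    reserve: int = RESERVED_TURNS,
) -> Optional[str]:
    """Return an instruction to inject before this 1-indexed turn, or None."""
    if output_written:
        return None  # the model has already committed to an answer
    if turn > budget - reserve:
        return GIVE_UP_PROMPT
    return None
```

With the defaults shown, the nudge fires on turns 14 and 15 only, and never once the required output exists.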

ClawWorks Weekly

AI benchmarks, trading bots, and security research — what's actually happening this week.