We predicted 26. It scored 11.

Campaign: 2026-06-gemini-3.5-flash-agentic-core-v1
Model: Gemini 3.5 Flash (Google AI Studio API)
Harness: openclaw@2026.4.22
Runs: 30 (10 tasks × 3 runs each)
Campaign date: 2026-05-28


Google’s current pitch for Gemini 3.5 Flash is specific: “parallel agentic execution,” “Pro-level coding proficiency,” “near-Pro intelligence at Flash-tier cost and speed.” Those aren’t vague brand adjectives; they describe exactly what agentic-core-v1 tests. Multi-step planning, tool error recovery, sequential execution under ambiguity. When Rigg ran Gemini 3.5 Flash through the harness, we expected it to score somewhere in the 23–28 range. Midpoint estimate: 26/30.

It scored 11.

That is the story here. Not just that the score is low (11/30 is bad, but low scores happen) but that we were wrong by 15 points. Our prediction was anchored to Google’s own marketing. The lesson from this campaign is about what “agentic” in a press release actually tells you, versus what the harness can observe.

Before the run could happen at all, there was a prerequisite. The ModelClaw harness sends tool definitions to models using input_schema (the Anthropic convention), but Gemini’s API expects parameters. Without a fix, tool calls would fail silently. Rigg patched it in PR #114 (feat/google-adapter-tool-schema-fix): google.py now translates input_schema → parameters before handing tool declarations to the Gemini API. The 11/30 is Gemini 3.5 Flash on well-formed tool calls. No harness artifact.
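The actual google.py change is not reproduced in this write-up. As a minimal sketch of the kind of translation described above, assuming the tool definition is a plain dict and that only the key name differs, the rename could look like this. The function name to_gemini_declaration and the fs_read schema are illustrative assumptions, not the harness's code.

```python
from copy import deepcopy


def to_gemini_declaration(tool: dict) -> dict:
    """Rename `input_schema` to `parameters` without mutating the input.

    Hypothetical helper: the real adapter may need further schema changes.
    """
    declaration = deepcopy(tool)
    if "input_schema" in declaration:
        declaration["parameters"] = declaration.pop("input_schema")
    return declaration


# Hypothetical tool definition in the harness's input_schema format.
fs_read = {
    "name": "fs_read",
    "description": "Read a file from the workspace.",
    "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}

gemini_tool = to_gemini_declaration(fs_read)
```

Copying before renaming keeps the original definition intact, so the same tool list can still be sent to providers that expect input_schema.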


What agentic-core-v1 tests

[Observed — harness spec]

Ten tasks, three runs each, thirty total. The tasks cover the day-to-day work of a coding agent: fix a failing test, refactor duplicated functions, investigate a log file, trace execution paths across files, apply a constrained targeted fix, handle an ambiguous requirement, execute a four-step sequential plan, recover when a tool call returns an error, detect a computation that’s impossible on the available data, and run a SQL investigation.

Pass means completing the task correctly and completely. Failure modes are classified: wrong_answer (attempted, produced incorrect output), gave_up_mid_plan (abandoned mid-execution), tool_call_hallucinated (called nonexistent tools or invented arguments), tool_call_redundancy (loop without progress).
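One way to keep a failure taxonomy like this honest in analysis code is to make it a closed set, so a typo in a label fails loudly instead of creating a fifth category. The sketch below assumes labels arrive as plain strings. The class and function names are hypothetical, and the sample labels are illustrative, not campaign data.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    WRONG_ANSWER = "wrong_answer"
    GAVE_UP_MID_PLAN = "gave_up_mid_plan"
    TOOL_CALL_HALLUCINATED = "tool_call_hallucinated"
    TOOL_CALL_REDUNDANCY = "tool_call_redundancy"


def histogram(labels):
    """Count failure labels; an unknown label raises ValueError."""
    return Counter(FailureMode(label) for label in labels)


# Illustrative labels only.
counts = histogram(["gave_up_mid_plan", "wrong_answer", "gave_up_mid_plan"])
```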

Two tasks carry structural traps. task_09 asks for a 10-day moving average on a dataset with fewer than 10 rows. The correct move is to refuse. task_07 requires four sequential file writes with specified content and order, testing whether the model can hold a plan without deviation. Neither trap is ambiguous. Both have clean success conditions.
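The task_09 trap comes down to a precondition check: a 10-day window cannot be filled from fewer than 10 rows. A minimal sketch of a moving average that refuses rather than inventing a number, assuming the data is a simple list of daily values (the task's actual dataset and grader are not shown here):

```python
def moving_average(values, window):
    """Simple moving average; refuses when the data cannot fill one window."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        raise ValueError(
            f"need at least {window} rows for a {window}-day average, "
            f"got {len(values)}"
        )
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


short_window = moving_average([1, 2, 3, 4], 2)  # enough data: three averages

try:
    moving_average([5, 6, 7, 8, 9, 10, 11], 10)  # seven rows, 10-day window
    refused = False
except ValueError:
    refused = True
```

The passing behaviour for an agent is the equivalent of the ValueError branch: report that the computation is impossible on the available data.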


The results

[Observed — verification: pass_rate_by_task.sql]

Task | Passed / Runs | Pass rate | Avg tool calls | Avg cost
task_02 refactor_duplicated_code | 3/3 | 100% | 15.3 | $0.0268
task_05 minimal_fix | 3/3 | 100% | 10.0 | $0.0157
task_01 fix_failing_test | 2/3 | 67% | 8.0 | $0.0111
task_04 trace_through_codebase | 1/3 | 33% | 13.3 | $0.0139
task_06 handle_ambiguous_requirement | 1/3 | 33% | 9.7 | $0.0093
task_10 sql_investigation | 1/3 | 33% | 9.0 | $0.0101
task_03 investigate_log | 0/3 | 0% | 7.0 | $0.1365
task_07 multi_step_plan | 0/3 | 0% | 10.0 | $0.0084
task_08 recover_from_tool_error | 0/3 | 0% | 6.0 | $0.0038
task_09 know_when_to_stop | 0/3 | 0% | 6.0 | $0.0054

Total: 11/30 (36.7%) · $0.72 total · $0.0655/passing run
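The totals can be rechecked from the per-task rows. The sketch below copies the pass counts and average costs from the results table and recomputes the headline figures. Because the per-task averages are already rounded, the recomputed cost per pass can differ from the published $0.0655 in the last digit.

```python
# (passes, runs, avg_cost_usd) per task, copied from the results table.
results = {
    "task_02": (3, 3, 0.0268),
    "task_05": (3, 3, 0.0157),
    "task_01": (2, 3, 0.0111),
    "task_04": (1, 3, 0.0139),
    "task_06": (1, 3, 0.0093),
    "task_10": (1, 3, 0.0101),
    "task_03": (0, 3, 0.1365),
    "task_07": (0, 3, 0.0084),
    "task_08": (0, 3, 0.0038),
    "task_09": (0, 3, 0.0054),
}

passes = sum(p for p, _, _ in results.values())
runs = sum(r for _, r, _ in results.values())
total_cost = sum(r * c for _, r, c in results.values())

pass_rate = passes / runs
cost_per_pass = total_cost / passes  # spend on failed runs is included

print(
    f"{passes}/{runs} ({pass_rate:.1%}), "
    f"${total_cost:.2f} total, ${cost_per_pass:.4f}/pass"
)
```

Dividing total spend by passes, rather than averaging the cost of passing runs only, is what makes expensive failures such as task_03 show up in the per-pass figure.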


What it could do

[Observed — verification: pass_rate_by_task.sql, tool_calls_by_task.sql]

The two 100% tasks share a structure: single file, clear output target, passing tests as the unambiguous success signal. task_02 asked for three near-identical functions in src/metrics.py to be refactored into one parameterised function. task_05 asked for a constrained bug fix (maximum ten-line diff) applied to src/price.py.

Both passed. But look at the tool call counts. Average 15.3 per run on task_02. All three runs hit the turn limit of fifteen while still completing. Runs fdb015ac, f59eed8d, and 68e4140a all showed repeated fs_read on src/metrics.py; in one case the model re-read the same file five times consecutively across turns 11–15 before taking action. task_05 had similar redundancy: repeated git diff src/price.py calls across runs 1 and 2.

The model succeeded on both despite the redundant reads, not because of clean execution. When the task has a single verifiable stopping condition, Gemini 3.5 Flash can complete it. The pass rate just doesn’t tell you how inefficiently.


What does gave_up_mid_plan actually look like?

[Observed — verification: failure_mode_histogram.sql, cross_task_consistency evidence bundle]

17 of the 19 failures across this campaign were classified gave_up_mid_plan. Two were wrong_answer. Zero were infrastructure errors or tool call formatting failures.

The distinction matters more than the count. A wrong_answer failure means the model tried and was incorrect. gave_up_mid_plan means the model started (read files, issued tool calls, sometimes began partial plan execution) and then stopped before producing the required output. Not because it encountered an error it couldn’t parse. Not because it was confused about the interface. It simply abandoned.

task_07 is the clearest example. The task asks for four sequential file writes (steps/step1.txt through steps/step4.txt) with prescribed content. All three runs began: run 71d6ac24 shows fs_write to steps/step1.txt at turn 2. Then the model stopped. It wrote one step of a four-step plan and gave up.

This is the model Google is calling the “agentic flagship.” It wrote one file, had three more to write, and quit.

The cross-task spread is what makes this a model-level signal rather than a task-specific fluke. gave_up_mid_plan appeared across 7 distinct task IDs in this campaign, the broadest cross-task failure distribution in the current dataset (verified: cross_task_consistency evidence bundle). That is not a difficulty problem. It is a persistence problem.

For builders, gave_up_mid_plan is a harder failure to handle in production than wrong_answer. A wrong answer is detectable: you can run tests, compare outputs, check diffs. A partial execution that stops mid-plan leaves state in an unknown intermediate condition. You may not even know it failed until something downstream breaks.
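One cheap defence is to verify the plan's outputs after the agent returns, instead of trusting that it finished. The sketch below assumes a task_07-style contract where success means a known list of files exists. The helper name missing_outputs is hypothetical, and the check covers presence only, not the prescribed content.

```python
import tempfile
from pathlib import Path


def missing_outputs(workdir, expected):
    """Return the expected output files that are absent or empty."""
    root = Path(workdir)
    return [
        rel
        for rel in expected
        if not (root / rel).is_file() or (root / rel).stat().st_size == 0
    ]


EXPECTED = [f"steps/step{i}.txt" for i in range(1, 5)]

with tempfile.TemporaryDirectory() as workdir:
    # Simulate a run that wrote only the first step before stopping.
    step1 = Path(workdir) / "steps" / "step1.txt"
    step1.parent.mkdir(parents=True)
    step1.write_text("placeholder content\n")
    missing = missing_outputs(workdir, EXPECTED)
```

A non-empty result marks the run as incomplete, which turns a silent partial execution into a detectable failure that can be retried or rolled back.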


Why did 57% of the budget go to a task that scored zero?

[Observed — verification: cost_breakdown.sql]

task_03 asked the model to investigate access.log and write finding.txt with a summary. All three runs failed. That’s expected given the campaign-level pattern. What isn’t expected is the cost.

task_03 consumed $0.41 of the $0.72 campaign total: 57% of spend for 0 passes (verified: cost_breakdown.sql). Total input tokens across the three runs: 268,830. The next-highest task for input tokens was task_02 at 42,347, so task_03 consumed 6.35× as many. Average per-run cost on task_03: $0.1365, versus the campaign average of $0.024.
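The ratios in this paragraph follow directly from the reported numbers. As a quick worked check, using only figures stated above:

```python
task_03_cost = 3 * 0.1365       # three runs at the reported per-run average
campaign_cost = 0.72            # reported campaign total
share = task_03_cost / campaign_cost

token_ratio = 268_830 / 42_347  # task_03 vs task_02 input tokens
cost_ratio = 0.1365 / 0.024     # task_03 per-run cost vs campaign average

print(f"share of spend: {share:.0%}")
print(f"input-token ratio: {token_ratio:.2f}x")
print(f"per-run cost ratio: {cost_ratio:.1f}x")
```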

The failure mode on task_03 follows the same gave_up_mid_plan pattern: the model reads access.log, makes exploratory shell calls, then abandons without writing finding.txt. But unlike other tasks where abandonment is cheap, task_03 triggered extensive re-reading of the log file before giving up. The model burned through the context budget on the same input data instead of synthesising findings. Then it stopped.

For most of the other failed tasks, the per-run cost was $0.004–0.014. task_03 ran up $0.1365 each time and produced nothing. If you are building a workflow that feeds Gemini 3.5 Flash a large log file and asks it to synthesise findings, budget for that.

[Speculation] Whether the log re-reading behaviour is consistent across document types (PDFs, large code files, other structured logs) is not established from this campaign. task_03 is the only task with a large input file in agentic-core-v1. The pattern may or may not generalise.


tool_call_redundancy: the dataset record

[Observed — tool_call_redundancy evidence bundle, 99 entries]

21 of 30 runs showed at least one repeated identical tool call: same tool_name, same tool_args, on consecutive turns. That is the highest redundancy rate in the dataset by run count.
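The definition above is mechanical enough to check from a run trace. A minimal sketch, assuming each call is recorded as a dict with tool_name and tool_args keys in turn order; the trace below is a hypothetical illustration, not a recorded run.

```python
import json


def _key(call):
    # Canonicalise args so dict key order does not hide a repeat.
    return call["tool_name"], json.dumps(call["tool_args"], sort_keys=True)


def has_consecutive_repeat(calls):
    """True if any call is identical to the call on the previous turn."""
    return any(_key(a) == _key(b) for a, b in zip(calls, calls[1:]))


def longest_streak(calls):
    """Length of the longest run of identical consecutive calls (0 if empty)."""
    best = current = 0
    previous = None
    for call in calls:
        k = _key(call)
        current = current + 1 if k == previous else 1
        best = max(best, current)
        previous = k
    return best


# Hypothetical trace: three identical reads, then a write.
trace = [
    {"tool_name": "fs_read", "tool_args": {"path": "src/metrics.py"}},
    {"tool_name": "fs_read", "tool_args": {"path": "src/metrics.py"}},
    {"tool_name": "fs_read", "tool_args": {"path": "src/metrics.py"}},
    {"tool_name": "fs_write", "tool_args": {"path": "src/metrics.py", "content": "..."}},
]
```

The streak length is worth tracking alongside the yes/no flag, since a streak of five on a 15-turn budget costs a third of the available turns.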

This is distinct from the task_03 cost anomaly, though related. The redundancy was present in passing runs (task_02, task_05) and failing runs alike. It is not a symptom of struggling; it appears to be a default behaviour. The model reads a file, takes some action, then re-reads the same file before taking the next action, sometimes 3–5 times consecutively.

On simple tasks, this doesn’t prevent passing. On longer tasks, redundant reads eat into the turn budget. If a task has a 15-turn limit and the model spends 5 of those turns re-reading the same file, it may run out of turns before completing required actions. [Speculation] Whether this is a consistent contributor to gave_up_mid_plan failures is not established from this campaign alone; the harness classifies outcomes, not causes.


The prediction miss, scored per task

[Observed — prediction midpoint 26/30, actual 11/30; miss: −15 points from midpoint, −12 from range floor]

We got two of ten per-task predictions right. Here is the full scoring:

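Task | Predicted | Actual | Result
task_01 | 3/3 | 2/3 | Miss (−1)
task_02 | 3/3 | 3/3 | Hit
task_03 | 3/3 | 0/3 | Miss (−3)
task_04 | 2/3 | 1/3 | Miss (−1)
task_05 | 3/3 | 3/3 | Hit
task_06 | 2/3 | 1/3 | Miss (−1)
task_07 | 2/3 | 0/3 | Miss (−2)
task_08 | 3/3 | 0/3 | Miss (−3)
task_09 | 2/3 | 0/3 | Miss (−2)
task_10 | 3/3 | 1/3 | Miss (−2)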
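The per-task scoring reduces to a subtraction per row. The sketch below copies the predicted and actual pass counts from the table and recomputes the hits, the per-task deltas, and the overall miss.

```python
# Passes out of 3 per task, copied from the prediction table.
predicted = {
    "task_01": 3, "task_02": 3, "task_03": 3, "task_04": 2, "task_05": 3,
    "task_06": 2, "task_07": 2, "task_08": 3, "task_09": 2, "task_10": 3,
}
actual = {
    "task_01": 2, "task_02": 3, "task_03": 0, "task_04": 1, "task_05": 3,
    "task_06": 1, "task_07": 0, "task_08": 0, "task_09": 0, "task_10": 1,
}

deltas = {task: actual[task] - predicted[task] for task in predicted}
hits = [task for task, delta in deltas.items() if delta == 0]
worst = min(deltas.values())
largest_misses = [task for task, delta in deltas.items() if delta == worst]
total_miss = sum(actual.values()) - sum(predicted.values())
```

Every delta is zero or negative, so the miss is uniformly optimistic rather than noisy in both directions.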

The misses on task_03, task_08, and task_10 (−3, −3, and −2) were all on predictions rated “High” confidence. All three were anchored to Google’s marketing. task_08 (recover from tool error) was predicted at 3/3 because Google’s materials specifically describe “agentic robustness.” The model scored 0/3. Average tool calls per run: 6.0. Average latency: 6.77s. These are short runs. The model hit the injected error and abandoned, consistently, three times.

The updated position: marketing claims describing a model as “agentic” should be treated as near-zero evidence for agentic-core-v1 performance. “Parallel agentic execution” describes an architecture or deployment pattern, not turn-by-turn tool dispatch reliability, and the latter is what the harness measures. They are different things.


Leaderboard position

[Observed — verification: cost_breakdown.sql, pass_rate_by_task.sql]

Model | Score | $/pass | Tier
Claude Sonnet 4.6 | 28/30 | $0.051 | API
GLM-4.7 | 28/30 | $0.004 | API
Ministral 3 8B | 28/30 | $0.001 | API
Mistral Large 3 675B | 27/30 | $0.002 | API
Claude Haiku 4.5 | 27/30 | $0.003 | API
DeepSeek V4-Flash | 25/30 | | API
GLM-4.7 Flash | 25/30 | $0.001 | API
GPT-OSS 20B | 25/30 | | API
Kimi K2.5 | 24/30 | $0.004 | API
Qwen3 32B | 23/30 | $0.001 | API
DeepSeek V3.2 | 19/30 | | API
Nvidia Nemotron Super 3 120B | 12/30 | $0.002 | API
Gemini 3.5 Flash | 11/30 | $0.0655 | API
Nvidia Nemotron Nano 3 30B | 10/30 | $0.001 | API

One position above last. The score is bad. The cost story is worse.

At $0.0655/pass, Gemini 3.5 Flash is the most expensive model per passing task in the entire dataset, more expensive than Claude Sonnet 4.6 at $0.051/pass, despite Sonnet scoring 28/30 versus 11/30. “Flash” positioning implies budget economics. The actual pricing ($1.50/$9.00 per 1M input/output tokens) and actual task performance combine to make this the worst cost-per-result outcome in the dataset so far.

For comparison: Ministral 3 8B scored 28/30 at $0.001/pass. GLM-4.7 Flash scored 25/30 at $0.001/pass. Both are legitimately cheap, and both score 25/30 or better. Gemini 3.5 Flash is not in the same category.


What we don’t know

[Speculation]

The task_03 token burn is the open question here. The model appears to re-read large inputs rather than maintain a synthesised internal representation. Whether that behaviour is consistent across input types (large codebases, PDFs, structured data) is not established by this campaign. One task is one data point.

The gave_up_mid_plan pattern is well-attested at 17 of 30 runs, but its mechanism isn’t. Does the model hit some internal confidence threshold and stop? Does it misread the task’s success condition? Does turn budget pressure trigger early exits? The harness classifies outcomes; it does not capture model-internal state. We can describe what happened. We cannot explain why.

A follow-up campaign with extended turn limits, or with a different document input type in task_03’s position, would add signal. For now: the failure mode is reliable, the cost risk on large inputs is real, and the prediction methodology needs updating.


For builders

[Observed — task_02, task_05 pass rates; task_07, task_08, task_09, task_10 failure classifications]

Gemini 3.5 Flash handles structured, single-file code tasks when the output contract is clear. Refactoring against passing tests: fine. Constrained targeted fixes: fine. Everything requiring persistence through ambiguity, multi-file coordination, error recovery, or synthesising a large input: not reliable.

The “agentic” positioning is a marketing category, not a harness-verified capability. Builders evaluating API providers for coding agents should not weight Google’s self-description without running the test. We ran it.

ClawWorks Weekly
