Gemma 4 31B on agentic-core-v1: 76.7% at zero token cost

Campaign: 2026-05-09-gemma4-31b-agentic-core-v1
Model: Gemma 4 31B IT (local deployment via LocalAdapter)
Harness: openclaw@2026.4.22
Runs: 30 (10 tasks × 3 runs each)
Campaign date: 2026-05-15


The motivation for this campaign was a specific question: how much agentic capability does a locally deployed 31B model carry without any API cost?

Every prior campaign in this pipeline ran against a cloud-hosted model — Bedrock for Claude and Llama, DeepSeek’s direct API for V4-Flash. This one ran on-premises through the LocalAdapter. The cost line in the output: $0.00. Not rounding. Not cached. The inference ran on local hardware, so there were no token charges.

The score that came back was 23/30 (76.7%). Gemma 4 31B passed 7 of 10 task types cleanly, stumbled on two, and hit the same wall as every model before it on task_09.


What agentic-core-v1 tests

[Observed]

The suite runs 10 tasks, 3 times each. Each task has a deterministic checker — either the output matches the acceptance criteria or it does not. The 15-turn budget is fixed. Failure modes are classified at run time: wrong_answer when the checker rejects the output, gave_up_mid_plan when the turn budget runs out without a committed answer, and infrastructure_error when the harness itself fails to complete the run.
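The classification rules above can be sketched as a small decision function. This is a minimal illustration, not the harness's actual code: the `RunOutcome` record, its field names, and the `classify` function are all hypothetical, and the order of the checks is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunOutcome:
    """Hypothetical summary of one run; field names are illustrative only."""
    harness_completed: bool          # False if the harness itself failed
    committed_answer: Optional[str]  # None if the model never committed an answer
    checker_passed: bool             # verdict of the deterministic checker


def classify(run: RunOutcome) -> str:
    """Map a run outcome to 'pass' or one of the three failure modes."""
    if not run.harness_completed:
        return "infrastructure_error"
    if run.committed_answer is None:
        return "gave_up_mid_plan"
    if not run.checker_passed:
        return "wrong_answer"
    return "pass"
```

Checking the harness first means a run that died on a transport error is never blamed on the model, which matters when reading the failure histogram later.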

The tasks cover: fixing a failing test, refactoring duplicated code, investigating a log file, tracing through a codebase, writing a minimal fix under a line constraint, handling an ambiguous requirement with two-artifact output, multi-step sequential planning, recovering from a tool error involving multibyte character byte-counting, knowing when to stop on an underspecified problem, and SQL investigation using native tool calls.

task_09 (know when to stop) has been the hardest task across every campaign. It involves computing a 10-day moving average on a CSV with three rows. The input is structurally underspecified — no min_periods policy is stated. Passing requires recognising the insufficiency and producing a correct or explicitly hedged output before the turn budget expires. Every model in the pipeline has found this task difficult.
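To make the underspecification concrete, here is a small pure-Python sketch of a trailing moving average with an explicit min_periods policy. The three input values are made up for illustration and are not the task's actual CSV; the `moving_average` helper is likewise hypothetical.

```python
from typing import List, Optional


def moving_average(values: List[float], window: int,
                   min_periods: int) -> List[Optional[float]]:
    """Trailing moving average; emits None until min_periods values exist."""
    out: List[Optional[float]] = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        if len(chunk) >= min_periods:
            out.append(sum(chunk) / len(chunk))
        else:
            out.append(None)
    return out


rows = [10.0, 20.0, 30.0]  # illustrative three-row input (assumed values)

strict = moving_average(rows, window=10, min_periods=10)
lenient = moving_average(rows, window=10, min_periods=1)
print(strict)   # [None, None, None]
print(lenient)  # [10.0, 15.0, 20.0]
```

Both outputs are defensible, which is the point: without a stated policy, a 10-day window over three rows has no single correct answer, so a passing response has to say which policy it assumed.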


What Gemma 4 31B did

[Observed]

23 of 30 runs passed. 7 of 10 task types were 3/3 clean. Failures came from three tasks.

Task                                   Result   Avg tool calls   Avg latency
task_01 fix failing test               3/3      4.0              20.25s
task_02 refactor duplicated code       1/3      4.0              122.66s
task_03 investigate log                1/3      1.3              10.08s
task_04 trace through codebase         3/3      6.0              103.48s
task_05 minimal fix                    3/3      5.3              42.83s
task_06 handle ambiguous requirement   3/3      6.0              39.19s
task_07 multi-step plan                3/3      4.0              14.73s
task_08 recover from tool error        3/3      2.0              33.91s
task_09 know when to stop              0/3      6.0              119.69s
task_10 SQL investigation              3/3      3.0              40.18s

(verified: verification/pass_rate_by_task.csv, verification/tool_calls_by_task.csv, verification/latency_distribution.csv)

Across all 7 failures, the breakdown by mode was: 3 × gave_up_mid_plan, 2 × wrong_answer, 2 × infrastructure_error (verified: verification/failure_mode_histogram.csv).
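As a quick cross-check, the headline numbers can be recomputed from the per-task results and the failure-mode counts reported above. The sketch below hard-codes those figures rather than reading the verification CSVs, whose column layout is not shown here.

```python
# Passes out of 3 runs per task, copied from the results table above.
passes = {
    "task_01": 3, "task_02": 1, "task_03": 1, "task_04": 3, "task_05": 3,
    "task_06": 3, "task_07": 3, "task_08": 3, "task_09": 0, "task_10": 3,
}
RUNS_PER_TASK = 3

# Failure-mode counts as reported in the text.
failure_modes = {"gave_up_mid_plan": 3, "wrong_answer": 2, "infrastructure_error": 2}

passed = sum(passes.values())
total = RUNS_PER_TASK * len(passes)
clean_tasks = sum(1 for p in passes.values() if p == RUNS_PER_TASK)

print(f"{passed}/{total} = {passed / total:.1%}")  # 23/30 = 76.7%
print(clean_tasks)                                 # 7

# Every failed run should be accounted for by exactly one failure mode.
assert sum(failure_modes.values()) == total - passed
```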


The three failure points

[Observed]

task_02 (refactor duplicated code), 1/3: The task asks the model to identify duplicated logic across two files and consolidate it. Gemma 4 31B passed one of three runs and failed two. At 122.66s average latency, this was the slowest task in the campaign: a model can spend two minutes on a refactoring task and still produce output the checker rejects. Both failures were classified as wrong_answer, meaning the model produced a result that did not meet the acceptance criteria.

task_03 (investigate log), 1/3: The log investigation task involves reading a 500-line log file and producing a diagnosis. Gemma 4 31B averaged 1.3 tool calls per run, the lowest of any task. Two of three runs failed with infrastructure_error (HTTP 400 Bad Request), and these were the only infrastructure errors in the campaign. A low tool call count combined with an infrastructure error suggests the run failed before the model could complete a second read: the harness received a 400 before the investigation concluded.

In the DeepSeek-V4-Flash campaign, task_03 was 3/3 and the most expensive task at $0.021 total, driven by the large log file input. Gemma 4 31B’s low tool call count and infrastructure errors on the same task point to a different failure mode entirely.

task_09 (know when to stop), 0/3: Gemma 4 31B went 0/3 on task_09, matching Llama 3.3 70B (also 0/3). DeepSeek-V4-Flash went 1/3 in non-thinking mode; Claude Sonnet 4.6 went 2/3 on run 1 and 0/3 on runs 2–3. task_09 averaged 6.0 tool calls per run with zero variance, and turn count was 13 in every run, 87% of the 15-turn budget. All three runs were classified as gave_up_mid_plan: the model ran nearly to the limit and still did not commit a final answer.


What the evidence detectors found

[Unobserved]

Three pattern detectors ran against all 30 transcripts.

tool_call_redundancy: 0/30 runs. No consecutive identical tool calls detected. This null result holds for every campaign in the pipeline so far — Gemma 4 31B did not regress on this pattern relative to cloud-hosted models. The absence of redundancy in the failure runs (task_02, task_03) means the failures were not caused by a looping read pattern.
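A detector of this kind can be sketched in a few lines. The version below is an assumption about how such a check might work, not the pipeline's implementation: it treats a tool call as a (tool name, serialized arguments) pair and flags any transcript in which two adjacent calls are identical.

```python
from typing import List, Tuple

# Assumed shape: (tool name, serialized arguments). The real transcript
# format is not shown in this write-up.
ToolCall = Tuple[str, str]


def has_consecutive_duplicate(calls: List[ToolCall]) -> bool:
    """Return True if any tool call is immediately repeated verbatim."""
    return any(a == b for a, b in zip(calls, calls[1:]))


looping = [("read_file", "app.log"), ("read_file", "app.log")]
varied = [("read_file", "app.log"), ("grep", "ERROR"), ("read_file", "app.log")]

print(has_consecutive_duplicate(looping))  # True
print(has_consecutive_duplicate(varied))   # False
```

Note that the second transcript repeats a call non-consecutively and is not flagged, so a null result from this kind of check rules out tight loops but not every form of repeated work.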

diagnosis_then_regression: 0/30 runs. No cases of a stated diagnosis being reversed in a later turn.

long_tail_turn_count: 0/30 runs. No run used more than 13 of the 15 available turns, which is just below the long-tail threshold. The task_09 runs sat at that ceiling: a constant 6.0 tool calls per run, but 13 turns in every run (87% of the budget). The gave_up_mid_plan classification there reflects near-exhaustion of the budget rather than early stopping.
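The turn-count arithmetic can be sketched as follows. The threshold value of 14 is an assumption: the text only says that 13 turns falls just below the long-tail threshold, so the real cutoff may be defined differently.

```python
TURN_BUDGET = 15
LONG_TAIL_THRESHOLD = 14  # assumed: the smallest value that 13 turns falls "just below"


def budget_fraction(turns_used: int, budget: int = TURN_BUDGET) -> float:
    """Share of the turn budget a run consumed."""
    return turns_used / budget


def is_long_tail(turns_used: int, threshold: int = LONG_TAIL_THRESHOLD) -> bool:
    """Flag runs whose turn count reaches the (assumed) long-tail threshold."""
    return turns_used >= threshold


task_09_turns = [13, 13, 13]  # turn count reported for every task_09 run

print(f"{budget_fraction(13):.0%}")             # 87%
print(any(is_long_tail(t) for t in task_09_turns))  # False
```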


Infrastructure errors and local inference

[Observed]

Two of the seven failures were classified as infrastructure_error. Cloud-hosted campaigns (Claude, DeepSeek) produced zero infrastructure errors. Llama 3.3 70B’s run 1 and run 2 had infrastructure errors caused by an adapter bug — that was fixed before run 3. For Gemma 4 31B, the LocalAdapter was in place and the run completed (phase_transition to done at 2026-05-15T22:51Z after starting at 22:23Z). The infrastructure errors here are distinct from adapter-level failures.

Both infrastructure errors came from task_03 (HTTP 400 Bad Request). Local inference introduces resource constraints that API calls do not (GPU memory pressure, generation timeouts, thermal management), and the log investigation task drives higher I/O demand than most others. The 400 responses are consistent with the harness or the local adapter hitting a request limit during the file-read phase, though the campaign data does not isolate the cause.

28-minute wall-clock for 30 runs at local inference speed is consistent with the latency distribution — some tasks exceeded 2 minutes per run (task_02 at 143s max, task_04 at 125s max).
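The consistency claim can be checked with simple arithmetic: multiply each task's average latency by its three runs, sum the results, and compare against the window between the two phase timestamps. The sketch assumes the runs executed one after another and ignores harness overhead between them; the timestamps are only given to the minute.

```python
from datetime import datetime

RUNS_PER_TASK = 3

# Average latency per run in seconds, copied from the results table.
avg_latency_s = {
    "task_01": 20.25, "task_02": 122.66, "task_03": 10.08, "task_04": 103.48,
    "task_05": 42.83, "task_06": 39.19, "task_07": 14.73, "task_08": 33.91,
    "task_09": 119.69, "task_10": 40.18,
}

# Total time spent inside runs, assuming sequential execution.
inference_s = RUNS_PER_TASK * sum(avg_latency_s.values())

# Campaign window from the reported phase timestamps (UTC, minute resolution).
start = datetime(2026, 5, 15, 22, 23)
done = datetime(2026, 5, 15, 22, 51)
wall_clock_s = (done - start).total_seconds()

print(round(inference_s))   # 1641
print(round(wall_clock_s))  # 1680
```

Under those assumptions, roughly 1,641 of the 1,680 seconds are accounted for by run latency, leaving well under a minute for everything else.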


The result in context

[Observed]

Model                   Score           Cost    Params
Claude Sonnet 4.6       28/30 (93.3%)   $1.44   -
DeepSeek-V4-Flash       28/30 (93.3%)   $0.04   13B activated
Gemma 4 31B             23/30 (76.7%)   $0.00   31B
Llama 3.3 70B (run 3)   20/30 (66.7%)   $0.09   70B

(verified: verification/pass_rate_by_task.csv, verification/cost_breakdown.csv)

Gemma 4 31B outscored Llama 3.3 70B by 3 runs (10 percentage points) with 39B fewer parameters, at zero token cost. Architecture and post-training matter as much as raw parameter count, so the comparison does not tell you everything. But a local 31B model beating a 70B cloud-hosted model on a standardised agentic benchmark is a concrete result.

The gap to the top tier (DeepSeek-V4-Flash and Claude Sonnet 4.6 at 93.3%) is 5 runs, or 16.7 percentage points. That is not a rounding error. The local model carries meaningful agentic capability; it is not at frontier level on this suite.
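The gaps between models are easiest to state from raw pass counts, since subtracting already-rounded percentages can be off by a tenth of a point. A minimal sketch, using only the scores in the comparison table:

```python
from fractions import Fraction

RUNS = 30

# Passed runs per model, copied from the comparison table.
scores = {
    "Claude Sonnet 4.6": 28,
    "DeepSeek-V4-Flash": 28,
    "Gemma 4 31B": 23,
    "Llama 3.3 70B (run 3)": 20,
}


def gap_pp(a: str, b: str) -> float:
    """Percentage-point gap between two models, from unrounded pass counts."""
    return float(Fraction(scores[a] - scores[b], RUNS) * 100)


print(f"{gap_pp('Claude Sonnet 4.6', 'Gemma 4 31B'):.1f}")      # 16.7
print(f"{gap_pp('Gemma 4 31B', 'Llama 3.3 70B (run 3)'):.1f}")  # 10.0
```

With three runs per task, each run is worth about 3.3 percentage points, so these gaps correspond to 5 runs and 3 runs respectively.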


What we don’t know yet

[Speculation]

No pre-run predictions were filed for this campaign. The predictions framework (per SPEC §13.5) requires a predictions.md committed before the campaign runs; that file was not present for this campaign. There is no formal scoring of what we expected versus what happened.

The open questions are structural:

  1. task_03 low tool call count: Why did Gemma 4 31B use only 1–2 tool calls on the log investigation task when the same task drives high tool usage and high cost in cloud models? Whether this reflects a different file-reading strategy or an early failure before a second read is attempted is not visible from the tool call histogram alone.

  2. task_09 fixed-turn pattern: The 6.0 tool call average with zero variance across 3 runs suggests the model follows a fixed path on this task, using 13 of 15 turns in every run. Whether this is instruction-following rigidity or a specific characteristic of how Gemma 4 31B handles underspecified inputs is an open question. Running task_09 with varied min_periods policies specified would test whether the model’s failure is about structural ambiguity or something else.

  3. Local hardware sensitivity: This campaign ran on a single hardware configuration. Whether the infrastructure errors, high latency on task_02, and the 0/3 on task_09 are stable across different local hardware (different GPU memory, different inference backends) is unknown. Cloud-hosted models are less sensitive to hardware variance.


The local deployment case

[Observed]

Gemma 4 31B ran 30 agentic tasks, passed 23, and cost nothing in API fees. It outscored a larger cloud-hosted model on the same suite. The 76.7% result means the model handles 7 of the 10 agentic task types reliably: test fixing, codebase tracing, minimal constrained edits, ambiguous-requirement handling with multi-artifact output, multi-step planning, tool-error recovery, and SQL investigation.

Of the tasks it missed (code refactoring under quality constraints, log investigation, and knowing when to stop), only task_09 has troubled every model in the pipeline. It is 0/3 for both open-weight models tested so far (Gemma 4 31B and Llama 3.3 70B). Code refactoring (task_02) has been partial for Llama 3.3 70B as well, while log investigation (task_03) was 3/3 for DeepSeek-V4-Flash.

If data sovereignty, API cost, or network isolation are constraints you are actually working around, a local 31B model at 76.7% on a standardised agentic benchmark is something to work with. It is not the top tier. The gap to DeepSeek-V4-Flash and Claude is 16.7 points and that is real. But it is a documented, reproducible result at zero token cost.

ClawWorks Weekly
