Gemma 4 31B on agentic-core-v1: 76.7% at zero token cost

May 15, 2026 · campaign-reports

Campaign: 2026-05-09-gemma4-31b-agentic-core-v1
Model: Gemma 4 31B IT (local deployment via LocalAdapter)
Harness: openclaw@2026.4.22
Runs: 30 (10 tasks × 3 runs each)
Campaign date: 2026-05-15

The motivation for this campaign was a specific question: how much agentic capability does a locally-deployed 31B model carry without any API cost?

Every prior campaign in this pipeline ran against a cloud-hosted model — Bedrock for Claude and Llama, DeepSeek’s direct API for V4-Flash. This one ran on-premises through the LocalAdapter. The cost line in the output: $0.00. Not rounding. Not cached. The inference ran on local hardware, so there were no token charges.

The score that came back was 23/30 (76.7%). Gemma 4 31B passed 7 of 10 task types cleanly, stumbled on two, and hit the same wall as every model before it on task_09.

What agentic-core-v1 tests

[Observed]

The suite runs 10 tasks, 3 times each. Each task has a deterministic checker — either the output matches the acceptance criteria or it does not. The 15-turn budget is fixed. Failure modes are classified at run time: wrong_answer when the checker rejects the output, gave_up_mid_plan when the turn budget runs out without a committed answer, and infrastructure_error when the harness itself fails to complete the run.
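The classification rule described above can be sketched as a small function. This is an illustration only: the `RunOutcome` fields, the `classify` name, and the order in which the conditions are checked are assumptions, since the report does not show the harness's actual schema or precedence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunOutcome:
    # Hypothetical fields; the real harness record is not shown in the report.
    harness_completed: bool          # False if the harness itself failed the run
    committed_answer: Optional[str]  # None if the model never committed an answer
    checker_accepted: bool           # verdict of the deterministic checker


def classify(outcome: RunOutcome) -> str:
    """Return 'pass' or one of the three failure modes named in the report."""
    if not outcome.harness_completed:
        return "infrastructure_error"
    if outcome.committed_answer is None:
        return "gave_up_mid_plan"
    if not outcome.checker_accepted:
        return "wrong_answer"
    return "pass"


print(classify(RunOutcome(True, "diagnosis: disk full", False)))
```

Checking the harness failure first is a design choice in this sketch: a run that never completed cannot be judged on its answer.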

The tasks cover: fixing a failing test, refactoring duplicated code, investigating a log file, tracing through a codebase, writing a minimal fix under a line constraint, handling an ambiguous requirement with two-artifact output, multi-step sequential planning, recovering from a tool error involving multibyte character byte-counting, knowing when to stop on an underspecified problem, and SQL investigation using native tool calls.

task_09 (know when to stop) has been the hardest task across every campaign. It involves computing a 10-day moving average on a CSV with three rows. The input is structurally underspecified — no min_periods policy is stated. Passing requires recognising the insufficiency and producing a correct or explicitly hedged output before the turn budget expires. Every model in the pipeline has found this task difficult.
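To make the ambiguity concrete, here is a minimal pure-Python sketch of a rolling mean with a `min_periods` parameter. The three values are made up, and the function is not the task's checker. The parameter name echoes the pandas rolling API, where an integer window defaults `min_periods` to the window size; the report does not say which library the task expects.

```python
from typing import List, Optional


def rolling_mean(
    values: List[float], window: int, min_periods: Optional[int] = None
) -> List[Optional[float]]:
    """Trailing moving average; None where fewer than min_periods points exist."""
    if min_periods is None:
        min_periods = window  # strict policy: require a full window
    out: List[Optional[float]] = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk) if len(chunk) >= min_periods else None)
    return out


prices = [10.0, 12.0, 14.0]  # illustrative three-row input, not the task's data

strict = rolling_mean(prices, window=10)                  # [None, None, None]
lenient = rolling_mean(prices, window=10, min_periods=1)  # [10.0, 11.0, 12.0]
print(strict, lenient)
```

With three rows and a 10-day window, the strict policy yields no values at all and the lenient one yields partial averages. Both are defensible, which is why an unstated policy leaves the task underspecified.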

What Gemma 4 31B did

[Observed]

23 of 30 runs passed. 7 of 10 task types were 3/3 clean. Failures came from three tasks.

Task	Result	Avg tool calls	Avg latency
task_01 fix failing test	3/3	4.0	20.25s
task_02 refactor duplicated code	1/3	4.0	122.66s
task_03 investigate log	1/3	1.3	10.08s
task_04 trace through codebase	3/3	6.0	103.48s
task_05 minimal fix	3/3	5.3	42.83s
task_06 handle ambiguous requirement	3/3	6.0	39.19s
task_07 multi-step plan	3/3	4.0	14.73s
task_08 recover from tool error	3/3	2.0	33.91s
task_09 know when to stop	0/3	6.0	119.69s
task_10 SQL investigation	3/3	3.0	40.18s

(verified: verification/pass_rate_by_task.csv, verification/tool_calls_by_task.csv, verification/latency_distribution.csv)

Across all 7 failures, the breakdown by mode was: 3 × gave_up_mid_plan, 2 × wrong_answer, 2 × infrastructure_error (verified: verification/failure_mode_histogram.csv).
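The totals above can be cross-checked against the per-task table. The snippet below is a sketch that hard-codes the pass counts from the table rather than reading the verification CSVs, whose column layout is not shown here.

```python
RUNS_PER_TASK = 3

# Passes per task, copied from the results table.
passes = {
    "task_01": 3, "task_02": 1, "task_03": 1, "task_04": 3, "task_05": 3,
    "task_06": 3, "task_07": 3, "task_08": 3, "task_09": 0, "task_10": 3,
}

total_runs = RUNS_PER_TASK * len(passes)
total_passed = sum(passes.values())
failures_by_task = {t: RUNS_PER_TASK - p for t, p in passes.items() if p < RUNS_PER_TASK}
clean_tasks = [t for t, p in passes.items() if p == RUNS_PER_TASK]

# Failure-mode counts as reported in the text.
failure_modes = {"gave_up_mid_plan": 3, "wrong_answer": 2, "infrastructure_error": 2}

print(f"{total_passed}/{total_runs} = {100 * total_passed / total_runs:.1f}%")
print(failures_by_task)
```

The per-task failures (2 + 2 + 3) and the failure-mode histogram (3 + 2 + 2) both sum to the same 7 failed runs.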

The three failure points

[Observed]

task_02 (refactor duplicated code) — 1/3: The task asks the model to identify duplicated logic across two files and consolidate it. Gemma 4 31B passed one of three runs and failed two. Both failures were classified as wrong_answer: the model produced a result, but the checker rejected it against the acceptance criteria. Average latency was 122.66s, the slowest of any task in the campaign, so time spent did not translate into accepted output.

task_03 (investigate log) — 1/3: The log investigation task involves reading a 500-line log file and producing a diagnosis. Gemma 4 31B averaged 1.3 tool calls per run, the lowest of any task (the next lowest, task_08, averaged 2.0). Two of three runs failed with infrastructure_error (HTTP 400 Bad Request), and these were the only infrastructure errors in the campaign. A low tool call count combined with an infrastructure error suggests the harness received the 400 before the model could complete a second read, so the investigation never concluded.

In the DeepSeek-V4-Flash campaign, task_03 was 3/3 and the most expensive task at $0.021 total, driven by the large log file input. Gemma 4 31B’s low tool call count and infrastructure errors on the same task point to a different failure mode entirely.

task_09 (know when to stop) — 0/3: Gemma 4 31B went 0/3 on task_09, consistent with Llama 3.3 70B (also 0/3). DeepSeek-V4-Flash went 1/3 in non-thinking mode; Claude Sonnet 4.6 went 2/3 on run 1 and 0/3 on runs 2–3. task_09 averaged 6.0 tool calls per run with zero variance. Turn count was 13 in every run — 87% of the 15-turn budget. All three runs classified as gave_up_mid_plan: the model ran nearly to the limit and still did not commit a final answer.

What the evidence detectors found

[Observed]

Three pattern detectors ran against all 30 transcripts.

tool_call_redundancy: 0/30 runs. No consecutive identical tool calls detected. This null result holds for every campaign in the pipeline so far — Gemma 4 31B did not regress on this pattern relative to cloud-hosted models. The absence of redundancy in the failure runs (task_02, task_03) means the failures were not caused by a looping read pattern.

diagnosis_then_regression: 0/30 runs. No cases of a stated diagnosis being reversed in a later turn.

long_tail_turn_count: 0/30 runs. No run used more than 13 of the 15 budgeted turns, just below the long-tail threshold. The task_09 runs sat at that ceiling: each used 13 turns (87% of the budget) while averaging only 6.0 tool calls. The gave_up_mid_plan classification therefore reflects budget near-exhaustion, not early stopping.
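As an illustration of the first detector, tool_call_redundancy is described as flagging consecutive identical tool calls. A minimal sketch of that check follows; the `(tool_name, arguments)` transcript shape, the function name, and the sample calls are all assumptions, not the pipeline's actual code.

```python
from typing import Any, Dict, List, Tuple

# A tool call is modelled as (tool_name, arguments); this shape is an assumption.
ToolCall = Tuple[str, Dict[str, Any]]


def has_consecutive_duplicate(calls: List[ToolCall]) -> bool:
    """True if any tool call is immediately followed by an identical one."""
    return any(prev == curr for prev, curr in zip(calls, calls[1:]))


transcript = [
    ("read_file", {"path": "app.log"}),
    ("grep", {"pattern": "ERROR", "path": "app.log"}),
    ("read_file", {"path": "app.log"}),  # repeated, but not consecutively
]
looping = transcript + [("read_file", {"path": "app.log"})]

print(has_consecutive_duplicate(transcript), has_consecutive_duplicate(looping))
```

Under this definition a model that re-reads a file after an intervening call is not flagged; only back-to-back repeats count.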

Infrastructure errors and local inference

[Observed]

Two of the seven failures were classified as infrastructure_error. Among the cloud-hosted campaigns, Claude and DeepSeek produced zero infrastructure errors. Llama 3.3 70B’s runs 1 and 2 had infrastructure errors caused by an adapter bug that was fixed before run 3. For Gemma 4 31B, the LocalAdapter was in place and the campaign completed (phase_transition to done at 2026-05-15T22:51Z after starting at 22:23Z). The infrastructure errors here are therefore distinct from adapter-level failures.

Both infrastructure errors came from task_03 (HTTP 400 Bad Request). Local inference introduces resource constraints that API calls do not (GPU memory pressure, generation timeouts, thermal management), and the log investigation task drives higher I/O demand than most others. The 400 responses are consistent with the harness or the local adapter hitting a request limit during the file-read phase, but that cause is not confirmed.

The 28-minute wall-clock time for 30 runs at local inference speed is consistent with the latency distribution: some tasks exceeded 2 minutes per run (task_02 at 143s max, task_04 at 125s max).
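The consistency claim can be checked with a rough estimate from the per-task average latencies in the results table. The sketch assumes the 30 runs executed sequentially and ignores harness overhead between runs; neither is stated in the report.

```python
RUNS_PER_TASK = 3

# Average latency per run in seconds, copied from the results table.
avg_latency_s = {
    "task_01": 20.25, "task_02": 122.66, "task_03": 10.08, "task_04": 103.48,
    "task_05": 42.83, "task_06": 39.19, "task_07": 14.73, "task_08": 33.91,
    "task_09": 119.69, "task_10": 40.18,
}

total_seconds = sum(avg_latency_s.values()) * RUNS_PER_TASK
total_minutes = total_seconds / 60

print(f"estimated inference time: {total_seconds:.0f}s ({total_minutes:.1f} min)")
```

The estimate comes to a little over 27 minutes of model latency, which fits inside the 28-minute window between the 22:23Z start and the 22:51Z finish.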

The result in context

[Observed]

Model	Score	Cost	Params
Claude Sonnet 4.6	28/30 (93.3%)	$1.44	—
DeepSeek-V4-Flash	28/30 (93.3%)	$0.04	13B activated
Gemma 4 31B	23/30 (76.7%)	$0.00	31B
Llama 3.3 70B (run 3)	20/30 (66.7%)	$0.09	70B

(verified: verification/pass_rate_by_task.csv, verification/cost_breakdown.csv)

Gemma 4 31B outscored Llama 3.3 70B by 3 runs (10 percentage points) using 39B fewer parameters, at zero token cost. Architecture and post-training matter as much as raw parameter count, so the comparison does not tell you everything. But a local 31B model beating a 70B cloud-hosted model on a standardised agentic benchmark is a concrete result.

The gap to the top tier (DeepSeek-V4-Flash and Claude Sonnet 4.6 at 93.3%) is 5 runs out of 30, or roughly 17 percentage points. That is not a rounding error. The local model carries meaningful agentic capability; it is not at frontier level on this suite.
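The gaps discussed above can be recomputed from raw pass counts rather than from rounded percentages. The snippet is a small arithmetic sketch using only the scores in the comparison table; the `pct` helper is introduced here for illustration.

```python
TOTAL_RUNS = 30

# Passed runs per model, copied from the comparison table.
passed = {
    "Claude Sonnet 4.6": 28,
    "DeepSeek-V4-Flash": 28,
    "Gemma 4 31B": 23,
    "Llama 3.3 70B (run 3)": 20,
}


def pct(runs: int) -> float:
    """Pass rate as a percentage of the 30-run campaign."""
    return 100 * runs / TOTAL_RUNS


gap_to_top = pct(passed["Claude Sonnet 4.6"]) - pct(passed["Gemma 4 31B"])
lead_over_llama = pct(passed["Gemma 4 31B"]) - pct(passed["Llama 3.3 70B (run 3)"])

print(f"gap to top tier: {gap_to_top:.2f} pp, lead over Llama: {lead_over_llama:.2f} pp")
```

From raw counts the gap to the top tier is 5 runs, about 16.67 percentage points; subtracting the rounded figures (93.3 and 76.7) gives 16.6 instead.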

What we don’t know yet

[Speculation]

No pre-run predictions were filed for this campaign. The predictions framework (per SPEC §13.5) requires a predictions.md committed before the campaign runs; that file was not present for this campaign. There is no formal scoring of what we expected versus what happened.

The open questions are structural:

task_03 low tool call count: Why did Gemma 4 31B use only 1–2 tool calls on the log investigation task when the same task drives high tool usage and high cost in cloud models? Whether this reflects a different file-reading strategy or an early failure before a second read is attempted is not visible from the tool call histogram alone.
task_09 fixed-turn pattern: The 6.0 tool call average with zero variance across 3 runs suggests the model follows a fixed path on this task, using 13 of 15 turns in every run. Whether this is instruction-following rigidity or a specific characteristic of how Gemma 4 31B handles underspecified inputs is an open question. Running task_09 with varied min_periods policies specified would test whether the model’s failure is about structural ambiguity or something else.
Local hardware sensitivity: This campaign ran on a single hardware configuration. Whether the infrastructure errors, high latency on task_02, and the 0/3 on task_09 are stable across different local hardware (different GPU memory, different inference backends) is unknown. Cloud-hosted models are less sensitive to hardware variance.

The local deployment case

[Observed]

Gemma 4 31B ran 30 agentic tasks, passed 23, and cost nothing in API fees. It outscored a larger cloud-hosted model on the same suite. The 76.7% result means the model handles 7 of the 10 agentic task types reliably: test fixing, codebase tracing, minimal constrained edits, ambiguous-requirement handling with multi-artifact output, multi-step planning, tool-error recovery, and SQL investigation.

The tasks it missed (code refactoring under quality constraints, log investigation, and knowing when to stop) are the same categories that push every model in the pipeline. task_09 is 0/3 for both open-weight models tested so far (Gemma 4 31B and Llama 3.3 70B). Code refactoring (task_02) has been partial for Llama 3.3 70B as well.

If data sovereignty, API cost, or network isolation are constraints you are actually working around, a local 31B model at 76.7% on a standardised agentic benchmark is something to work with. It is not the top tier. The gap to DeepSeek-V4-Flash and Claude is roughly 17 percentage points and that is real. But it is a documented, reproducible result at zero token cost.

ClawWorks Weekly