Gemma 3 27B knows what to do. It just won't do it.

June 19, 2026 · campaign-reports

Campaign: 2026-06-18-gemma3-27b-agentic-core-v1
Model: Gemma 3 27B IT (via AWS Bedrock, google.gemma-3-27b-it, us-east-1)
Harness: openclaw@2026.4.22
Runs: 30 (10 tasks × 3 runs each)
Date: 2026-06-18

Before the campaign ran, the numbers looked like a worthwhile bet. Gemma 3 27B IT costs $0.23 per million input tokens and $0.38 per million output tokens on Bedrock. Sonnet 4.6, which holds the agentic-core-v1 baseline at 28/30 (93.3%) and $1.44 total, costs $3.00 per million input and $15.00 per million output. That is 13× cheaper on input, 39× cheaper on output. Even at half Sonnet’s pass rate, Gemma 3 27B would have been a credible cheap-model story.

The campaign went differently.

Every one of the 30 runs ended after exactly two turns: user prompt, then model response. The model described what it was going to do, listed the tools it would use, and stopped. No tool was ever invoked. The score was 0/30.

What agentic-core-v1 tests

[Observed]

Ten tasks, three runs each, 30 total. The tasks cover the practical end of software engineering work: fixing a failing test, refactoring duplicated code, investigating a log file for the source of 500 errors, tracing a value through a codebase, writing a minimal fix within a line-count constraint, handling a requirement that is deliberately underspecified, executing a sequential multi-step plan, recovering from a tool error, recognising a structurally impossible task and declining to compute, and investigating a database schema via SQL.

A run passes when the model’s output matches the checker’s acceptance criteria before the 15-turn budget expires. The model has three tools available: fs_read to read files, fs_write to write files, and shell to run commands. Every one of the ten tasks requires the model to call at least one of these tools to produce an answer.

That last point is where this campaign failed.

What Gemma 3 27B actually did

[Observed]

Each run followed an identical shape. The model received the task prompt. It generated a response. The harness waited for a tool invocation. None came. The run ended.

On task_01 (fix the failing test), run 1 (transcript ID f6d24fa1-feaf-4822-80bf-e043433257e2, turn 1):

“Okay, I understand the task. I need to identify and fix a bug in src/add.py based on a failing test in tests/test_add.py. I’ll use the available tools (fs_read, fs_write, shell) to achieve this.

Here’s my plan:

Read the failing test: fs_read tests/test_add.py to understand the expected behavior and the failing assertion.

Read the source code: fs_read src/add.py to examine the implementation.

Identify the bug: Compare the test case with the source code to pinpoint the discrepancy causing the failure.

Fix the bug: Modify src/add.py using fs_write to correct the issue.

Run the tests: shell bash run_tests.sh to verify the fix.

Confirm success: If the tests pass (exit code 0), provide the final answer.

Let’s start by reading the test file.

```tool_code fs_read tests/test_add.py ```”

The run ended there. Two turns total. The transcript shows tool_name: null, tool_args: null for the model’s response — no structured tool call was dispatched to the Bedrock Converse API. The model wrote fs_read tests/test_add.py inside a fenced markdown code block labeled tool_code, but that is text output, not a Converse toolUse block. The harness had nothing to invoke.

This pattern was consistent across every task and every run (verified: tool_calls_by_task.csv — avg_tool_calls: 0.0 on all ten task IDs, min 0, max 0 across all 30 runs). The model was not confused about what it should do next. It described the correct plan, in the correct order, with the correct tool names, and then stopped short of executing any of it.

Why didn’t the smoke test catch this?

[Observed]

Before the campaign ran, a smoke test confirmed that Gemma 3 27B IT can use the Bedrock Converse tool API. The result at 2026-06-18T00:10Z: stop_reason=tool_use, a valid toolUse block dispatched, fs_read called with input: {'path': 'src/main.py'}. Tool calling confirmed.

The smoke test used toolChoice=any. That parameter forces the model to make at least one tool invocation before returning.

The campaign runner uses toolChoice=auto — Bedrock’s default, where the model decides for itself whether to call tools on a given turn. Under auto, Gemma 3 27B chose not to, on every turn of every run.

These two modes measure different things. toolChoice=any answers: can this model format a valid Converse toolUse block when required to? Yes. What it does not answer: will this model invoke tools on its own when given a task and a set of available tools? That turned out to be no.

A smoke test with forced tool use is necessary evidence that the adapter and API routing work correctly. It is not sufficient evidence that the model will function in an agentic loop where tool-calling is autonomous, not mandated. The campaign spec’s adapter risk section flagged the smoke test result as confirmation — it should have also included a second validation step: run a task prompt under toolChoice=auto and verify the model actually invokes something.

We predicted 16–23/30. We got 0/30.

[Observed]

Pre-run predictions were committed before the campaign ran (campaigns/gemma3-27b-vs-sonnet4.6-agentic-core-v1.predictions.md, filed 2026-06-18T00:12Z). The overall prediction was 16–23/30, midpoint 19/30. Rationale: Gemma 3 27B has strong coding benchmarks, tool calling was confirmed, and the Bedrock Converse API is a standardized interface with known behavior across other models.

The prediction was wrong by the maximum possible margin.

The per-task predictions were also wrong — not because Gemma 3 27B made errors, but because it never attempted the tasks. task_09 (know when to stop) was correctly predicted to fail; Gemini 3.5 Flash (0/3) and Gemini 3.1 Pro (0/3) both fail it. But here the task_09 failure is for a completely different reason: zero tool calls, not a wrong answer after attempting. The model never reached task_09’s actual problem.

The predictions did not weight tool-initiation failure as a possibility. The smoke test was read as confirmation that agentic behavior would follow. It wasn’t.

What the cost math actually shows

[Observed]

Total cost across 30 runs: $0.0035 (verified: cost_breakdown.csv, summed from 4,464 total input tokens and 6,525 total output tokens across all tasks). Average per run: $0.000117.

Each run lasted about 5.6 seconds on average (verified: latency_distribution.csv). The model responded, the harness processed the output, the run closed. Fast, because nothing happened.

Sonnet 4.6 cost $1.44 for 28/30 passes, or $0.051 per passing task. Gemma 3 27B at $0.0035 for 30 runs is approximately 411× cheaper in raw spend — the 400× figure in the brief is accurate. But cost per passing task for Gemma 3 27B is undefined. Dividing cost by zero passes produces no number.

The model being cheaper to run does not mean it is cheaper to use. On this class of task, it is not usable at any price.

What “supports tool calling” actually means

[Observed, with interpretation]

AWS Bedrock documentation lists Gemma 3 27B IT as supporting the Converse API’s toolConfig parameter. This is accurate. The model can produce a valid toolUse block when the API is configured to require one.

That capability is not the same as agentic tool use — the behavior where a model, given a task and a set of tools, decides on its own when to call them, reads the results, and calls them again if needed. Most capability flags and API compatibility matrices test the former. agentic-core-v1 requires the latter.

The distinction shows up in how the model represented its tool calls. The transcript output contains \“tool_code\nfs_read tests/test_add.py\n```— a markdown code block with atool_codelanguage tag. That format is not a Bedrock ConversetoolUse` block. It looks like an output convention from a fine-tuning dataset that expects tool calls to be expressed as text, not structured API objects. The model is generating what it was trained to generate for tool use in a text context, not triggering the Converse API’s tool dispatch mechanism.

[Speculation] This suggests Gemma 3 27B’s tool-calling fine-tuning may have been oriented toward text-format tool use schemas rather than API-native structured dispatch. If accurate, no amount of prompting within a standard Converse toolChoice=auto session would change the behavior — the model would need either toolChoice=any to force dispatch, or a different prompt structure that does not activate the text-format output pattern.

What we still don’t know

[Unobserved]

We looked for diagnosis_then_regression — the pattern where a model correctly identifies what is wrong, then walks the diagnosis back (verified: diagnosis_then_regression.md). Zero hits across 30 runs. Not applicable here because the model never reached a diagnosis phase.

We looked for long_tail_turn_count — runs that used more than 12 turns, indicating slow convergence (verified: long_tail_turn_count.md). Zero hits. Every run was two turns.

[Speculation] Would toolChoice=any recover partial credit? The plan Gemma 3 27B described on task_01 was structurally correct. If forced to invoke tools, it might execute some tasks. Whether it can manage multi-turn tool chains — read file, analyse output, write fix, run tests, verify result — is unknown. A forced-tools follow-on campaign would be labeled differently and would not be directly comparable to the existing leaderboard, which requires toolChoice=auto throughout.

[Speculation] Gemma 4 31B appears on Bedrock’s pricing page but returned ValidationException for all tested model IDs as of 2026-06-18. Once it becomes accessible, the interesting question is whether the Gemma 4 family has different agentic tuning. Gemma 3’s tool-initiation behavior is a training artifact, not an API constraint — a different model generation might behave differently under auto.

The scoreline

[Observed]

0/30 (0.0%). $0.0035. All 30 runs classified wrong_answer — no tool calls, no output file written, checker rejected each run. Avg latency 5.6 seconds per run (verified: latency_distribution.csv).

Selected agentic-core-v1 context:

Model	Score	Cost	Date
Claude Sonnet 4.6	28/30 (93.3%)	$1.44	2026-05-15
GPT-5.5 Flagship	27/30 (90.0%)	—	2026-06-16
Gemini 3.1 Pro	23/30 (76.7%)	$0.85	2026-06-15
Gemma 3 27B IT	0/30 (0.0%)	$0.0035	2026-06-18

(verified: pass_rate_by_task.csv for each campaign)

The prior 0/30 on this benchmark was DeepSeek R1 — which hit a Bedrock API rejection before the model processed a single token. Infrastructure failure, not a model behavior. Gemma 3 27B is different: the model was reached, the prompt was processed, a coherent response was generated. The failure happened after the model’s response was complete, when the harness waited for a tool invocation and none arrived.

Two zeros. Different failure modes. Gemma 3 27B was reached, read the prompt, and described a correct plan. Then it stopped. Having tool-calling support in the API spec does not mean a model will call tools when left to its own judgment. [Speculation] Gemma 3 27B’s tool-initiation behavior looks like a training characteristic rather than an API constraint — which means a Bedrock API update will not fix it.

Gemma 3 27B knows what to do. It just won't do it.

What agentic-core-v1 tests

What Gemma 3 27B actually did

Why didn’t the smoke test catch this?

We predicted 16–23/30. We got 0/30.

What the cost math actually shows

What “supports tool calling” actually means

What we still don’t know

The scoreline

ClawWorks Weekly