0/30 and $0.011: when the adapter speaks a dialect the harness doesn't understand

May 15, 2026 · campaign-reports

Campaign: 2026-05-15-llama3.3-70b-agentic-core-v1
Model: llama3.3-70b (Meta, via AWS Bedrock Converse API)
Harness: openclaw@2026.4.22
Runs: 30 (10 tasks × 3 runs each)
Date: 2026-05-15

Meta’s Llama 3.3 70B runs on AWS Bedrock alongside the proprietary models we test here. We put it through agentic-core-v1 — the same benchmark we ran Claude Sonnet 4.6 through — to get a direct comparison. What we got instead was a clean infrastructure finding.

Every single run failed. Not because the model got things wrong. The model correctly identified the right starting file for every task, which is the first move any agentic coding run needs to make. The problem was that when it said “read this file,” it said so in a format the harness doesn’t recognise. Zero tools were ever dispatched. Zero files were ever read.

This campaign produced no capability data on Llama 3.3 70B. What it produced is a clean infrastructure finding: the Bedrock adapter and the model are speaking different dialects of the same language, and the mismatch needs fixing before any meaningful benchmark can happen. This is the documented record.

The benchmark [Context]

agentic-core-v1 is a 10-task suite covering the range of agentic coding work: fixing a failing test, refactoring duplicated code, investigating a log, tracing a bug through a codebase, implementing a minimal fix, handling an ambiguous requirement, planning multi-step changes, recovering from a tool error, knowing when to stop on an underspecified task, and investigating a SQL schema issue.

Each task runs 3 times — 30 runs total. A pass means the task completed correctly: the right file written, the right content, the task checker returns pass. A fail is anything else: the task checker rejects the output (labeled wrong_answer), partially completes (partial_complete), hits an infrastructure error (infrastructure_error), or leaves the task scaffold untouched.

The model has access to three tools: fs_read (read a file), fs_write (write a file), and bash (run a shell command). A real agentic coding run looks like: read a failing test, read the source it tests, understand the mismatch, write a fix. Multiple exchanges, multiple tool calls, the checker runs at the end.

What actually happened on every run?

[Observed]

Every run was over in two exchanges. The task prompt landed. The model responded with a function call encoded as plain text. The harness treated that text as the model’s final answer and closed the run.

Here is a representative example from task_01_fix_failing_test, run 1 (transcript UUID: 07b08240-110b-4a1e-bb31-aee58ce21e8b):

{"type": "function", "name": "fs_read", "parameters": {"path": "tests/test_add.py"}}

This is a tool call. The model is asking to read a file. But it expressed that request as a block of plain text, not as the structured toolUse content block the AWS Bedrock Converse API uses to signal a tool invocation. The harness only dispatches tools when it sees a toolUse block in the response. Seeing plain text with no toolUse entry, it treated this as the model’s final answer, passed the raw JSON string to the task checker, and scored the result: the task checker rejected the output (wrong_answer).

The harness records tool_name: null and tool_args: null for every turn in this campaign, confirming that no tool was ever dispatched across all 30 runs.

The roughly 30-token average output and 0.50-second average latency are both fingerprints of the same failure: one short response, then done. Nothing was read. Nothing was written. No task scaffold file was ever touched.

This two-exchange pattern appears in all 30 transcripts, all 10 tasks, without exception (verified: campaign_event_log.csv).

Why did tools never fire?

[Observed — adapter behaviour]

The AWS Bedrock Converse API carries tool calls as toolUse content blocks: structured objects containing the tool name, inputs, and a use ID. The harness adapter reads these toolUse blocks and dispatches the corresponding tool. Responses that contain only text blocks are treated as final answers.

Llama 3.3 70B, when given a prompt that includes tool definitions, responds by emitting function-call JSON as a text string. The format it uses ({"type": "function", "name": ..., "parameters": ...}) matches an older function-calling convention. The Bedrock layer is not translating this text-format call into a toolUse block. So the harness receives a text response, sees no tool invocation signal, and closes the run.

The 0.50-second average latency per run — consistent with a single short text generation and zero tool round-trips — confirms no tool activity occurred at all.

The model knew where to start

[Observed — from text content only]

Even though tools were never executed, the text responses show consistent, correct intent. In every run, the model’s single response was a JSON object encoding an fs_read call on a file from the task scaffold — the right starting move for each task. A few examples:

task_01_fix_failing_test runs 1, 2, 3: requests to read tests/test_add.py, src/add.py, and tests/test_add.py, all reasonable starting reads for a failing-test fix
task_10_sql_investigation: request to read schema.sql
task_09_know_when_to_stop: request to read data.csv

The model identified the correct entry point for every task. It didn’t guess randomly. It diagnosed where to start, expressed what it understood to be a tool call, and stopped. From its perspective, it had made a move and was waiting for a response that never came.

The intent was correct. The wire format was wrong.

Pattern detectors: explicit null results

[Unobserved — all four pattern detectors]

The harness runs four behavioural pattern detectors on every campaign. All four returned zero results:

Cross-task consistency (does the model show similar behaviour across different tasks?): 0 of 30. Not meaningful here; the failure was uniform and came from a single root cause, not task-specific reasoning.
Diagnosis-then-regression (does the model state a correct diagnosis and then walk it back?): 0 of 30. No multi-turn reasoning occurred; there was nothing to regress from.
Tool call redundancy (does the model make the same tool call twice in a row?): 0 of 30. No tool calls were made at all.
Long-tail turn count (did any run reach an unusually high number of exchanges before giving up?): 0 of 30. Every run terminated after a single model response.

These are explicit null results, not missing data. Every detector ran against all 30 transcripts and found nothing, because there was nothing to find. The campaign produced a single failure mode, repeated 30 times.

Predictions: filed after the data

[Observed — procedural failure]

The predictions file was filed after the campaign ran, which violates SPEC §13.5. A prediction filed after the data is available isn’t a prediction; it’s a post-hoc guess. All three are scored UNTESTABLE as pre-run commitments.

The predictions, for the record:

Llama 3.3 70B passes 18–24/30 (60–80%). WRONG (actual: 0/30, though the root cause was adapter incompatibility, not model failure; the prediction assumed tool dispatch would work).
task_09 would show the highest failure rate. UNTESTABLE (uniform failure across all tasks from the same root cause; task-level comparison is meaningless).
Most common failure mode: wrong_answer or partial_complete. WRONG IN MECHANISM: all 30 were technically labeled wrong_answer, but not from model reasoning failures — from tool dispatch never happening.

Prediction 3 got the label right and the cause wrong. That distinction matters for what the next campaign should test.

The procedural failure (filing predictions after data was available) is noted and must not recur.

The verdict

[Observed]

0 of 30 runs passed. Pass rate: 0%. All 30 failures carry the same label: the task checker rejected the output (wrong_answer) — in this case, a raw JSON string instead of a completed file. Total cost: $0.0113 across 30 runs ($0.00038 per run average, verified: cost_breakdown.csv). That is 127× cheaper than the Claude Sonnet 4.6 baseline of $1.44 total.

Total input tokens across all 30 runs: approximately 14,808. Total output tokens: approximately 894. Average output per run: roughly 30 tokens. Average latency: 0.50 seconds per run (verified: latency_distribution.csv).

These numbers are not evidence that Llama 3.3 70B cannot do agentic coding tasks. Every run terminated after a single model response, not because the model reasoned incorrectly, but because it expressed its intent in a format the harness doesn’t dispatch.

The 127× cost difference is not evidence of efficiency; it measures how early the runs terminated. Llama generated one short response and stopped. Claude worked through multiple tool call cycles. If Llama were executing tools, its cost per completed run would almost certainly be higher than $0.00038.

This result should be read as: the test was not administered. Llama 3.3 70B has not been benchmarked on agentic-core-v1 yet.

What we still don’t know

[Speculation] Whether Llama 3.3 70B actually supports Bedrock Converse native toolUse blocks at all. AWS documentation states Converse API tool use is supported for this model family. The responses in this campaign produced only text blocks. Whether this is a model-level behaviour, a Bedrock configuration issue, or a gap in how the adapter formats toolConfig is unresolved.

[Speculation] Whether a toolConfig change would fix it. The current adapter sends toolConfig: {tools: [{toolSpec: {...}}]}. Some Bedrock models require a different toolChoice or autoToolChoice setting. This is untested.

[Speculation] What Llama 3.3 70B would actually score if tool dispatch worked. The model selected the correct starting file for every task, which is a signal of intent, not a guarantee of completion. A working tool-dispatch run is required before any capability conclusion is possible. This campaign produced zero capability signal, only infrastructure signal.

[Speculation] Whether this affects other Meta models on Bedrock. Llama 3.1 70B and 3.1 8B use the same cross-region inference profile pattern. If the root cause is model-level text-format function calling, the same problem would appear for those models under the current adapter.

What comes next

The article worth writing about Llama 3.3 70B is the one after the adapter is fixed. This campaign is the documented baseline: a clean record that the first attempt produced zero tool executions and zero capability data, with a verified root cause.

If the fix is a toolConfig change, the next campaign will either confirm it (runs start executing tools, scores land somewhere measurable) or falsify it (runs still produce text-format calls, root cause was model-level). Either result is informative.

Evidence pack: verification/ directory. Full transcripts in data/transcripts/2026-05-15-llama3.3-70b-agentic-core-v1/.