Three runs to a number: Llama 3.3 70B reaches 20/30 after two infrastructure detours

May 8, 2026 · campaign-reports

Campaigns: 2026-05-15-llama3.3-70b-agentic-core-v1 (runs 1–2) · 2026-05-08-llama3.3-70b-agentic-core-v1-run3 (run 3)
Model: llama3.3-70b (Meta, via AWS Bedrock Converse API — us.meta.llama3-3-70b-instruct-v1:0)
Harness: openclaw@2026.4.22
Runs: 90 total (30 per campaign × 3 campaigns)
Final result: 20/30 (66.7%)

It took three runs to get a number. The first two weren’t capability measurements — they were infrastructure bugs that happened to reveal themselves one at a time.

Run 1 (2026-05-06): 0/30. The Bedrock adapter couldn’t parse Meta-format tool calls emitted as plain text. Every task ended at turn 1. The model correctly identified the right starting file on every task; the harness never executed the call.

Run 2 (2026-05-07): 14/30. PR #14 fixed the text-block parsing. Tools started executing. But on turn 3+, when the model switched from text-format to native toolUse blocks, it stopped including response text. Bedrock rejected the empty ContentBlock. Thirteen of sixteen failures were this class.

Run 3 (2026-05-08): 20/30. Commit c73ce7f added a synthesis guard: when a turn produces a native toolUse block with no text, the adapter now synthesises a Meta-format JSON string into response_text, keeping conversation history non-empty. Infrastructure errors: zero.

The 20/30 figure is the actual model result. The first two runs were plumbing.

The three runs in detail

[Observed]

Run 1: zero tool calls, $0.011

Every run terminated at turn 1. The model responded with a JSON function call as a text string — {"type": "function", "name": "fs_read", "parameters": {...}} — and the harness, seeing no toolUse block, treated that text as a final answer and closed the run. The task checker received raw JSON and scored it wrong_answer.
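As a rough illustration of what the harness was missing, the sketch below recognises a Meta-format function call that arrives as plain text. It is not the adapter's actual code: the name `parse_text_tool_call`, its return shape, and the example path are assumptions made for this example. Only the three keys (`type`, `name`, `parameters`) come from the transcripts.

```python
import json
from typing import Optional


def parse_text_tool_call(text: str) -> Optional[dict]:
    """Return {"name", "parameters"} if text is a Meta-format call, else None.

    Hypothetical helper; the real adapter's parsing logic is not shown here.
    """
    try:
        payload = json.loads(text.strip())
    except json.JSONDecodeError:
        return None  # ordinary prose: treat it as a final answer
    if not isinstance(payload, dict):
        return None
    if payload.get("type") != "function" or not isinstance(payload.get("name"), str):
        return None
    params = payload.get("parameters", {})
    if not isinstance(params, dict):
        return None
    return {"name": payload["name"], "parameters": params}


# The path below is invented for the example.
print(parse_text_tool_call(
    '{"type": "function", "name": "fs_read", "parameters": {"path": "README.md"}}'
))
print(parse_text_tool_call("The test fails because of an off-by-one error."))
```

A harness with a check like this would dispatch the first input as a tool call and treat only the second as a final answer.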

Average latency: 0.50 seconds. Average cost: $0.00038/run. Total: $0.0113 across 30 runs (verified: verification/cost_breakdown.csv, campaign 2026-05-06-llama3.3-70b-agentic-core-v1-rerun). Those aren’t efficiency numbers — they reflect runs that never started.

Run 2: tools execute, new error class surfaces

PR #14 (256e179, merged 2026-05-07T08:18Z) fixed the turn-1 parsing by preserving Meta-format JSON in response_text. Tools dispatched across all 10 tasks. Four task types reached 100% pass rate: task_01, task_03, task_05, task_07.

The new problem: after at least one tool executed and a result came back, the model switched to native toolUse blocks on its continuation turn. Those blocks had no accompanying text. The adapter passed an empty ContentBlock to Bedrock, which returned a ValidationException. Thirteen of the 16 failures in run 2 were this class (verified: failure_mode_histogram.csv, campaign 2026-05-15-llama3.3-70b-agentic-core-v1).

Total cost: $0.07. The failing tasks averaged $0.0006–$0.0009/run because they terminated at turn 2. Passing tasks averaged ~$0.003/run.

Run 3: zero infrastructure errors

The synthesis guard (commit c73ce7f) closed the gap. When a turn produces a native toolUse block with no text, the adapter synthesises a Meta-format JSON representation of the tool call into response_text. The conversation history stays non-empty. Bedrock stopped rejecting turns.
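A minimal sketch of a guard along these lines, assuming the Converse API's content-block shape (`{"text": ...}` and `{"toolUse": {"toolUseId", "name", "input"}}`). The function name `synthesise_response_text` is hypothetical, and the commit's actual implementation is not reproduced here.

```python
import json


def synthesise_response_text(content_blocks: list) -> str:
    """Return the turn's text, synthesising it from a toolUse block if empty."""
    text = "".join(b["text"] for b in content_blocks if "text" in b).strip()
    if text:
        return text  # the model supplied its own text; nothing to do
    tool_uses = [b["toolUse"] for b in content_blocks if "toolUse" in b]
    if not tool_uses:
        return text  # no text and no tool call: outside this guard's scope
    # Mirror the Meta-format shape seen in the transcripts.
    # Assumption: only the first toolUse block is represented.
    first = tool_uses[0]
    return json.dumps(
        {"type": "function", "name": first["name"], "parameters": first.get("input", {})}
    )


# Hypothetical turn: a native toolUse block with no accompanying text.
blocks = [{"toolUse": {"toolUseId": "tool-1", "name": "fs_read", "input": {"path": "data.txt"}}}]
print(synthesise_response_text(blocks))
```

The guard only fires when the text is empty and a toolUse block is present, so turns that already carry text pass through unchanged.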

Total cost: $0.09 across 30 runs. Average: $0.003/run. Latency: 1.8–4.6 seconds per run across tasks. Zero infrastructure_error classifications (verified: failure_mode_histogram.csv, campaign 2026-05-08-llama3.3-70b-agentic-core-v1-run3).

What 20/30 actually looks like

[Observed]

Six of ten task types passed at 3/3.

Task	Result	Avg tool calls	Notes
task_02_refactor_duplicated_code	3/3	5.0	3.5–5.3s avg
task_03_investigate_log	3/3	2.0	2.2–2.5s — large input context
task_04_trace_through_codebase	3/3	5.0	3.7–4.4s
task_05_minimal_fix	3/3	4.3	3.3–4.3s
task_06_handle_ambiguous_requirement	3/3	6.7	3.9–4.6s — highest tool usage in campaign
task_07_multi_step_plan	3/3	4.3	2.7–3.0s
task_01_fix_failing_test	2/3	—	Run 1 wrong answer
task_08_recover_from_tool_error	0/3	2.0	Wrong byte count: 29 written, actual 35
task_09_know_when_to_stop	0/3	—	1 malformed final turn; 2 wrong numeric answers
task_10_sql_investigation	0/3	—	Correct diagnosis, never written — malformed final turn

Verified: verification/pass_rate_by_task.csv, campaign 2026-05-08-llama3.3-70b-agentic-core-v1-run3.

That is 18 of 18 on those six task types. The failures aren’t spread across the board: nine of the ten sit on three task types that went 0/3, and the tenth is a single wrong answer on task_01.

Why three task types still fail

[Observed]

task_08: byte count, consistently wrong

The task: read data.txt, write its byte length to length.txt. The file is 35 bytes. All three runs wrote 29. The model made the tool calls — 2.0 on average — read the file, and wrote a number. That number was wrong, and it was wrong the same way each time.

A 6-byte shortfall would be consistent with a character-count vs byte-count mismatch: if data.txt contains multi-byte UTF-8 sequences, counting characters gives a smaller number than counting bytes. The determinism of the result (29 in all three runs) suggests the model follows the same reasoning path each time rather than producing noise.

[Speculation] Whether the model is counting characters, or counting bytes but dropping something at a newline boundary, can’t be determined without inspecting the exact file encoding. If it is a character count, the 6-byte gap would fit, for example, three 3-byte characters, two 4-byte characters, or six 2-byte characters, each counted as one.
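The arithmetic behind the character-count reading is easy to check. The sketch below uses made-up strings, since the real contents of data.txt were not inspected. Each one is built to be 29 characters long and 35 bytes in UTF-8, showing different ways a 6-byte gap can arise.

```python
# Hypothetical samples: none of these is the real data.txt.
samples = {
    "six 2-byte characters": "a" * 23 + "\u00e9" * 6,      # e-acute: 2 bytes in UTF-8
    "three 3-byte characters": "a" * 26 + "\u20ac" * 3,    # euro sign: 3 bytes
    "two 4-byte characters": "a" * 27 + "\U0001F600" * 2,  # emoji: 4 bytes
}

for label, text in samples.items():
    chars = len(text)                  # counts Unicode code points
    size = len(text.encode("utf-8"))   # counts bytes as stored on disk
    print(f"{label}: {chars} characters, {size} bytes, gap {size - chars}")
```

All three samples give the same 29-vs-35 split, so the gap alone does not identify which characters are involved.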

task_09: wrong number and one incomplete turn

The task: compute the 10-day moving average of the revenue column in data.csv. The data has 3 rows, which makes the window specification formally undefined.

Run 1 ended without writing answer.txt. The model’s final output was an awk shell command, emitted in JSON form as a text block and never executed, so the conversation ended on a pending tool call rather than a completed action. Runs 2 and 3 computed 1000.0 and wrote it. The checker rejected that value.

[Speculation] The 1000.0 value may be the plain mean of the three revenue values, which would mean they sum to 3000, or it may reflect something else. Neither is verifiable without checking data.csv directly. The Claude baseline also struggled here, though differently: Claude looped until it hit the turn limit. Llama finished and wrote the wrong number on two of three runs.
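To make the ambiguity concrete, here is a small sketch of a strict trailing moving average. The three revenue values are invented for illustration, chosen only so that they sum to 3000. The actual contents of data.csv were not inspected.

```python
from typing import List, Optional


def moving_average(values: List[float], window: int) -> List[Optional[float]]:
    """Trailing moving average; None wherever fewer than `window` points exist."""
    out: List[Optional[float]] = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # window not yet full
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


revenue = [900.0, 1000.0, 1100.0]  # hypothetical values, not from data.csv

print(moving_average(revenue, 10))  # the window never fills on 3 rows
print(sum(revenue) / len(revenue))  # a plain mean still produces a number
```

Under the strict definition a 10-day window over 3 rows yields no value at all, while a plain mean does. What the checker expects is not stated in this report.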

task_10: correct answer, never written

All three runs produced the same final output:

{"type": "function", "name": "fs_write", "parameters": {"path": "finding.txt", "content": "Query 4: phone column doesn't exist"}}

The diagnosis is right — if a query references a column that doesn’t exist in the schema, that is the failure. But finding.txt was never written because the model emitted this as a text block rather than executing the write.

[Observed] The synthesis guard that fixed run 3’s infrastructure failures didn’t catch this case. The guard triggers when a turn produces a native toolUse block with no text. On task_10, the model is doing the reverse: outputting the call as plain text instead of a native toolUse block. Those are different conditions. The guard handles one; the other still produces a non-executed final turn.
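The two conditions can be told apart mechanically. The sketch below is an illustration only: `classify_turn` and its labels are hypothetical, and the content-block shape assumes the Converse API's `text` and `toolUse` blocks.

```python
import json


def looks_like_text_tool_call(text: str) -> bool:
    """True if text parses as a Meta-format function call."""
    try:
        payload = json.loads(text.strip())
    except json.JSONDecodeError:
        return False
    return (
        isinstance(payload, dict)
        and payload.get("type") == "function"
        and "name" in payload
    )


def classify_turn(content_blocks: list) -> str:
    """Label a turn by how (or whether) it expresses a tool call."""
    has_tool_use = any("toolUse" in b for b in content_blocks)
    text = "".join(b.get("text", "") for b in content_blocks).strip()
    if has_tool_use and not text:
        return "native_tool_use_without_text"  # the case the guard handles
    if has_tool_use:
        return "native_tool_use_with_text"
    if looks_like_text_tool_call(text):
        return "text_format_tool_call"  # the shape of the task_10 final turn
    return "plain_text"
```

A final turn labelled `text_format_tool_call` is the one that currently ends a run with the write unexecuted.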

This is the same failure mode as run 1’s underlying cause, but scoped to the final write on task_10 specifically. The model reads the inputs correctly, forms the right answer, then reverts to text-format tool calls on the last step. Whether this is specific to task_10’s multi-step chain pattern, or a more general tendency to regress on final writes after a certain turn depth, isn’t clear from three runs.

What we didn’t see

[Unobserved — all four pattern detectors]

Diagnosis-then-regression: 0 of 30 runs. No cases where the model stated a correct diagnosis and then walked it back.
Tool call redundancy: 0 of 30 runs. No repeated identical reads. The Claude Sonnet 4.6 baseline showed 7/30 runs with redundant reads (23.3%). Llama didn’t re-read files it had already read.
Long-tail turn count: 0 of 30 runs. No run exceeded 12 turns. The model finished or failed early — no evidence of runs that dragged before giving up.
Cross-task consistency: 1 entry (verified: evidence/cross_task_consistency.md).

The zero redundancy result is worth noting. Whether that’s confidence — the model is sure enough of its read that it doesn’t re-verify — or brevity — it simply doesn’t consider re-reading — isn’t answerable from the transcript shapes alone.
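As an illustration of what the redundancy and long-tail detectors measure, the sketch below flags a run when the same read is issued twice with identical arguments, or when its turn count exceeds a threshold. The record shape, the `fs_read` filter, and the 12-turn threshold are assumptions made for the example. The harness's real detectors are not shown in this report.

```python
import json
from collections import Counter


def redundant_reads(tool_calls: list) -> int:
    """Count reads that repeat an earlier read with identical arguments."""
    seen = Counter(
        json.dumps(c["parameters"], sort_keys=True)
        for c in tool_calls
        if c["name"] == "fs_read"
    )
    return sum(n - 1 for n in seen.values())


def is_long_tail(turn_count: int, threshold: int = 12) -> bool:
    """True if the run used more turns than the (assumed) threshold."""
    return turn_count > threshold


# Hypothetical call log for one run.
calls = [
    {"name": "fs_read", "parameters": {"path": "app.py"}},
    {"name": "fs_write", "parameters": {"path": "out.txt", "content": "x"}},
    {"name": "fs_read", "parameters": {"path": "app.py"}},
]
print(redundant_reads(calls))
print(is_long_tail(9))
```

A stricter detector might ignore re-reads that follow a write to the same path, since those can be legitimate verification rather than redundancy.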

Against the Claude baseline

[Observed]

Metric	Claude Sonnet 4.6	Llama 3.3 70B
Pass rate	28/30 (93.3%)	20/30 (66.7%)
Total cost	$1.44	$0.09
Cost per run	$0.048	$0.003
Infrastructure errors (clean run)	0	0
Tool redundancy	7/30 (23.3%)	0/30
Long-tail turns	0/30	0/30

The 26.7 percentage-point gap (28/30 vs 20/30) is real. On the six task types Llama passes at 3/3, the overlap with Claude is nearly complete: both models handle refactoring, log investigation, code tracing, minimal fixes, ambiguous specs, and multi-step plans cleanly. The differentiation is in the three task types Llama doesn’t reach: measurement precision (task_08), underspecified computation (task_09), and the final-write format regression on task_10.

The 16× cost difference — $0.003/run vs $0.048/run — is real too. At that ratio, deploying Llama 3.3 70B on a pipeline covering those six task types and routing the others to Claude makes sense in principle. Whether the 3 failing task types actually appear in a given pipeline is a product decision, not a benchmark one.
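The headline figures can be recomputed from the table's counts and totals. The snippet below is just that arithmetic, with variable names chosen for the example.

```python
claude_pass, llama_pass, runs = 28, 20, 30
claude_cost, llama_cost = 1.44, 0.09  # total USD across 30 runs, from the table

gap_points = (claude_pass - llama_pass) / runs * 100
cost_ratio = claude_cost / llama_cost

print(f"pass-rate gap: {gap_points:.1f} percentage points")
print(f"cost ratio: {cost_ratio:.0f}x")
print(f"cost per run: ${claude_cost / runs:.3f} vs ${llama_cost / runs:.3f}")
```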

What we still don’t know

task_08 encoding: The exact byte representation of data.txt would confirm whether 29 reflects a character-count convention or a different miscounting. Not checked.
task_10 regression scope: Is the final-write text-format reversion specific to task_10, or does it appear on other tasks with long multi-step chains? Three runs on one task can’t answer that.
task_09 dataset: What are the actual revenue values in data.csv? 1000.0 might be a mean of three values summing to 3000, or it might reflect something else. The dataset wasn’t inspected directly.
Zero redundancy as signal: Llama re-read zero files. Claude re-read on 23.3% of runs. Is Llama’s brevity a sign of higher confidence, or does it miss cases where re-reading would have caught a mistake? The passing tasks don’t distinguish these.

The infra debugging record

The reason it took three runs to get here matters separately from the 20/30 figure itself.

Run 1 produced a clean infrastructure finding: Bedrock + Llama + the then-current adapter speak different dialects of the function-calling protocol. That took a few hours to diagnose.

Run 2 produced a second infrastructure finding: a first fix isn’t always a complete fix. Solving the turn-1 parsing revealed a turn-3 failure that only appears once the model actually runs for multiple turns. You can’t see the second bug until the first one is gone.

Run 3 is what “clean” looks like on this stack. Both adapter layers patched, zero infrastructure errors, 30 runs with real tool execution. The 20/30 number comes from a model evaluation, not an adapter evaluation.

The three-run path from 0/30 to 20/30 is a reasonable record of what it takes to get a non-OpenAI model running correctly in a multi-turn agentic harness. It’s not unusual — but it is worth documenting, because the 20/30 number is only interpretable if the 0/30 and 14/30 context is visible alongside it.

Evidence pack: verification/ directory, campaign 2026-05-08-llama3.3-70b-agentic-core-v1-run3. Prior run data: data/campaigns/2026-05-15-llama3.3-70b-agentic-core-v1/.

ClawWorks Weekly