Three runs to a number: Llama 3.3 70B reaches 20/30 after two infrastructure detours
Campaigns: 2026-05-15-llama3.3-70b-agentic-core-v1 (runs 1–2) · 2026-05-08-llama3.3-70b-agentic-core-v1-run3 (run 3)
Model: llama3.3-70b (Meta, via AWS Bedrock Converse API — us.meta.llama3-3-70b-instruct-v1:0)
Harness: openclaw@2026.4.22
Runs: 90 total (30 per campaign × 3 campaigns)
Final result: 20/30 (66.7%)
It took three runs to get a number. The first two weren’t capability measurements — they were infrastructure bugs that happened to reveal themselves one at a time.
Run 1 (2026-05-06): 0/30. The Bedrock adapter couldn’t parse Meta-format tool calls emitted as plain text. Every task ended at turn 1. The model correctly identified the right starting file on every task; the harness never executed the call.
Run 2 (2026-05-07): 14/30. PR #14 fixed the text-block parsing. Tools started executing. But on turn 3+, when the model switched from text-format to native toolUse blocks, it stopped including response text. Bedrock rejected the empty ContentBlock. Thirteen of sixteen failures were this class.
Run 3 (2026-05-08): 20/30. Commit c73ce7f added a synthesis guard: when a turn produces a native toolUse block with no text, the adapter now synthesises a Meta-format JSON string into response_text, keeping conversation history non-empty. Infrastructure errors: zero.
The 20/30 figure is the actual model result. The first two runs were plumbing.
The three runs in detail
[Observed]
Run 1: zero tool calls, $0.011
Every run terminated at turn 1. The model responded with a JSON function call as a text string — {"type": "function", "name": "fs_read", "parameters": {...}} — and the harness, seeing no toolUse block, treated that text as a final answer and closed the run. The task checker received raw JSON and scored it wrong_answer.
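The missing step in run 1 was recognising a Meta-format call inside a text block. The sketch below shows one way to do that check. It is not the code from PR #14: the function name `parse_meta_tool_call` and the example path are hypothetical, and only the JSON shape (`type`, `name`, `parameters`) comes from the transcript above.

```python
import json
from typing import Optional


def parse_meta_tool_call(text: str) -> Optional[dict]:
    """Return {"name", "parameters"} if text is a Meta-format call, else None.

    Hypothetical helper: assumes the whole text block is a single JSON object.
    """
    try:
        payload = json.loads(text.strip())
    except json.JSONDecodeError:
        return None  # ordinary prose: treat it as a final answer
    if not isinstance(payload, dict):
        return None
    if payload.get("type") != "function" or not isinstance(payload.get("name"), str):
        return None
    params = payload.get("parameters", {})
    if not isinstance(params, dict):
        return None
    return {"name": payload["name"], "parameters": params}


# The path is an illustrative placeholder, not taken from the benchmark.
raw = '{"type": "function", "name": "fs_read", "parameters": {"path": "example.txt"}}'
print(parse_meta_tool_call(raw))
print(parse_meta_tool_call("The test fails because of an off-by-one error."))
```

A harness that runs a check like this before closing the run can dispatch the call instead of scoring the raw JSON as an answer.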
Average latency: 0.50 seconds. Average cost: $0.00038/run. Total: $0.0113 across 30 runs (verified: verification/cost_breakdown.csv, campaign 2026-05-06-llama3.3-70b-agentic-core-v1-rerun). Those aren’t efficiency numbers — they reflect runs that never started.
Run 2: tools execute, new error class surfaces
PR #14 (256e179, merged 2026-05-07T08:18Z) fixed the turn-1 parsing by preserving Meta-format JSON in response_text. Tools dispatched across all 10 tasks. Four task types reached 100% pass rate: task_01, task_03, task_05, task_07.
The new problem: after at least one tool executed and a result came back, the model switched to native toolUse blocks on its continuation turn. Those blocks had no accompanying text. The adapter passed an empty ContentBlock to Bedrock, which returned a ValidationException. Thirteen of the 16 failures in run 2 were this class (verified: failure_mode_histogram.csv, campaign 2026-05-15-llama3.3-70b-agentic-core-v1).
Total cost: $0.07. The failing tasks averaged $0.0006–$0.0009/run because they terminated at turn 2. Passing tasks averaged ~$0.003/run.
Run 3: zero infrastructure errors
The synthesis guard (commit c73ce7f) closed the gap. When a turn produces a native toolUse block with no text, the adapter synthesises a Meta-format JSON representation of the tool call into response_text. The conversation history stays non-empty. Bedrock stopped rejecting turns.
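A minimal sketch of what such a guard could look like, assuming the adapter sees assistant content as a list of Converse-style blocks (`{"text": ...}` or `{"toolUse": {"toolUseId", "name", "input"}}`). The function name `synthesise_response_text` and the example values are hypothetical. This illustrates the idea and is not the contents of commit c73ce7f.

```python
import json


def synthesise_response_text(content_blocks: list) -> str:
    """Return text for the history entry, non-empty whenever a toolUse block exists."""
    texts = [b["text"] for b in content_blocks if b.get("text", "").strip()]
    if texts:
        return "\n".join(texts)  # the model supplied text: keep it unchanged
    for block in content_blocks:
        tool_use = block.get("toolUse")
        if tool_use:
            # No text came with the native call: synthesise a Meta-format string.
            return json.dumps({
                "type": "function",
                "name": tool_use["name"],
                "parameters": tool_use.get("input", {}),
            })
    return ""  # neither text nor a tool call: nothing to synthesise


# Hypothetical turn: a native toolUse block with no accompanying text.
blocks = [{"toolUse": {"toolUseId": "tooluse_example", "name": "fs_read",
                       "input": {"path": "data.txt"}}}]
print(synthesise_response_text(blocks))
```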
Total cost: $0.09 across 30 runs. Average: $0.003/run. Latency: 1.8–4.6 seconds per run across tasks. Zero infrastructure_error classifications (verified: failure_mode_histogram.csv, campaign 2026-05-08-llama3.3-70b-agentic-core-v1-run3).
What 20/30 actually looks like
[Observed]
Six of ten task types passed at 3/3.
| Task | Result | Avg tool calls | Notes |
|---|---|---|---|
| task_02_refactor_duplicated_code | 3/3 | 5.0 | 3.5–5.3s avg |
| task_03_investigate_log | 3/3 | 2.0 | 2.2–2.5s — large input context |
| task_04_trace_through_codebase | 3/3 | 5.0 | 3.7–4.4s |
| task_05_minimal_fix | 3/3 | 4.3 | 3.3–4.3s |
| task_06_handle_ambiguous_requirement | 3/3 | 6.7 | 3.9–4.6s — highest tool usage in campaign |
| task_07_multi_step_plan | 3/3 | 4.3 | 2.7–3.0s |
| task_01_fix_failing_test | 2/3 | — | Run 1 wrong answer |
| task_08_recover_from_tool_error | 0/3 | 2.0 | Wrong byte count: 29 written, actual 35 |
| task_09_know_when_to_stop | 0/3 | — | 1 malformed final turn; 2 wrong numeric answers |
| task_10_sql_investigation | 0/3 | — | Correct diagnosis, never written — malformed final turn |
Verified: verification/pass_rate_by_task.csv, campaign 2026-05-08-llama3.3-70b-agentic-core-v1-run3.
18 of 18 on those six task types. The failures aren’t spread across the board — they cluster on three specific things.
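The tally can be re-derived from the table. The snippet below only restates the per-task results listed above; the dictionary layout is an illustrative choice, not the format of pass_rate_by_task.csv.

```python
# (passed, attempted) per task type, copied from the run 3 table.
results = {
    "task_01_fix_failing_test": (2, 3),
    "task_02_refactor_duplicated_code": (3, 3),
    "task_03_investigate_log": (3, 3),
    "task_04_trace_through_codebase": (3, 3),
    "task_05_minimal_fix": (3, 3),
    "task_06_handle_ambiguous_requirement": (3, 3),
    "task_07_multi_step_plan": (3, 3),
    "task_08_recover_from_tool_error": (0, 3),
    "task_09_know_when_to_stop": (0, 3),
    "task_10_sql_investigation": (0, 3),
}

passed = sum(p for p, _ in results.values())
attempted = sum(n for _, n in results.values())
fully_passing = [t for t, (p, n) in results.items() if p == n]
never_passing = [t for t, (p, _) in results.items() if p == 0]

print(f"{passed}/{attempted}")   # 20/30
print(len(fully_passing))        # 6
print(len(never_passing))        # 3
```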
Why three task types still fail
[Observed]
task_08: byte count, consistently wrong
The task: read data.txt, write its byte length to length.txt. The file is 35 bytes. All three runs wrote 29. The model made the tool calls — 2.0 on average — read the file, and wrote a number. That number was wrong, and it was wrong the same way each time.
A 6-byte discrepancy is consistent with a character-count vs byte-count mismatch. If data.txt contains multi-byte UTF-8 sequences, counting characters gives a smaller number than counting bytes. The result was deterministic (29 across all three runs), which suggests the model follows the same reasoning path each time rather than producing noise.
[Speculation] Whether the model is measuring characters, or measuring bytes but miscounting a newline boundary, can’t be determined without inspecting the file’s exact encoding. In UTF-8, a 6-byte gap could come from, for example, three 3-byte characters or two 4-byte characters, each counted as one.
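To make the character-vs-byte hypothesis concrete, the sketch below builds two made-up strings that are 29 characters long but 35 bytes in UTF-8. The contents are invented for illustration. The real data.txt was not inspected, so this shows only that the 29/35 pair is reachable, not that it is what happened.

```python
def char_and_byte_counts(text: str) -> tuple:
    """Return (number of code points, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


# Hypothetical contents, not the benchmark file.
three_byte_case = "a" * 26 + "\u20ac" * 3      # euro sign: 3 bytes each in UTF-8
four_byte_case = "a" * 27 + "\U0001F600" * 2   # emoji: 4 bytes each in UTF-8

print(char_and_byte_counts(three_byte_case))  # (29, 35)
print(char_and_byte_counts(four_byte_case))   # (29, 35)
```

A task checker that compares against the on-disk size would expect the second number in each pair.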
task_09: wrong number and one incomplete turn
The task: compute the 10-day moving average of the revenue column in data.csv. The data has 3 rows, which makes the window specification formally undefined.
Run 1 ended without writing answer.txt. The model’s final output was an awk shell command in JSON form, emitted as a text block and never executed, so the conversation ended on an unexecuted tool call rather than a completed write. Runs 2 and 3 computed 1000.0 and wrote it. The checker rejected that value.
[Speculation] The 1000.0 value may be the plain mean of the three revenue values, which would require them to sum to 3000. That isn’t verifiable without checking data.csv directly. The Claude baseline also struggled here, though differently: Claude looped until the turn limit. Llama finished and wrote the wrong number on two of three runs.
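The sketch below shows why a 10-day window over 3 rows has no defined value under a strict full-window definition, and how a plain mean could still produce 1000.0. The revenue values are hypothetical placeholders chosen to sum to 3000, since data.csv was not inspected. The full-window convention is also an assumption, as the checker’s expected behaviour isn’t stated here.

```python
def moving_average(values: list, window: int) -> list:
    """Simple moving average, emitting a value only for each full window."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i - window:i]) / window
        for i in range(window, len(values) + 1)
    ]


# Hypothetical revenue column: three rows summing to 3000.
revenue = [900.0, 1000.0, 1100.0]

print(moving_average(revenue, 10))  # [] : no full 10-row window exists
print(moving_average(revenue, 2))   # [950.0, 1050.0]
print(sum(revenue) / len(revenue))  # 1000.0 : the plain mean
```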
task_10: correct answer, never written
All three runs produced the same final output:
{"type": "function", "name": "fs_write", "parameters": {"path": "finding.txt", "content": "Query 4: phone column doesn't exist"}}
The diagnosis is right — if a query references a column that doesn’t exist in the schema, that is the failure. But finding.txt was never written because the model emitted this as a text block rather than executing the write.
[Observed] The synthesis guard that fixed run 3’s infrastructure failures didn’t catch this case. The guard triggers when a turn produces a native toolUse block with no text. On task_10, the model is doing the reverse: outputting the call as plain text instead of a native toolUse block. Those are different conditions. The guard handles one; the other still produces a non-executed final turn.
This is the same failure mode as run 1’s underlying cause, but scoped to the final write on task_10 specifically. The model reads the inputs correctly, forms the right answer, then reverts to text-format tool calls on the last step. Whether this is specific to task_10’s multi-step chain pattern, or a more general tendency to regress on final writes after a certain turn depth, isn’t clear from three runs.
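The two conditions can be told apart mechanically. The sketch below is a hypothetical classifier over Converse-style content blocks. The function names and category labels are illustrative and not part of the harness.

```python
import json


def looks_like_text_tool_call(text: str) -> bool:
    """True if text parses as a Meta-format call: {"type": "function", "name": ...}."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return False
    return (isinstance(payload, dict)
            and payload.get("type") == "function"
            and "name" in payload)


def classify_turn(content_blocks: list) -> str:
    """Label a turn by how (and whether) it expresses a tool call."""
    has_tool_use = any("toolUse" in b for b in content_blocks)
    text = "".join(b.get("text", "") for b in content_blocks).strip()
    if has_tool_use:
        # The synthesis guard's trigger is the "without_text" case.
        return "native_tool_use" if text else "native_tool_use_without_text"
    if looks_like_text_tool_call(text):
        return "text_format_tool_call"  # the task_10 final-turn shape
    return "plain_text"


# The task_10 final output, rebuilt as a text block.
final_text = json.dumps({"type": "function", "name": "fs_write",
                         "parameters": {"path": "finding.txt",
                                        "content": "Query 4: phone column doesn't exist"}})
print(classify_turn([{"text": final_text}]))  # text_format_tool_call
```

A final turn classified as text_format_tool_call would be a candidate for execution rather than for scoring as an answer.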
What we didn’t see
[Unobserved — all four pattern detectors]
- Diagnosis-then-regression: 0 of 30 runs. No cases where the model stated a correct diagnosis and then walked it back.
- Tool call redundancy: 0 of 30 runs. No repeated identical reads. The Claude Sonnet 4.6 baseline showed 7/30 runs with redundant reads (23.3%). Llama didn’t re-read files it had already read.
- Long-tail turn count: 0 of 30 runs. No run exceeded 12 turns. The model finished or failed early — no evidence of runs that dragged before giving up.
- Cross-task consistency: 1 entry (verified: evidence/cross_task_consistency.md).
The zero redundancy result is worth noting. Whether that’s confidence — the model is sure enough of its read that it doesn’t re-verify — or brevity — it simply doesn’t consider re-reading — isn’t answerable from the transcript shapes alone.
Against the Claude baseline
[Observed]
| Metric | Claude Sonnet 4.6 | Llama 3.3 70B |
|---|---|---|
| Pass rate | 28/30 (93.3%) | 20/30 (66.7%) |
| Total cost | $1.44 | $0.09 |
| Cost per run | $0.048 | $0.003 |
| Infrastructure errors (clean run) | 0 | 0 |
| Tool redundancy | 7/30 (23.3%) | 0/30 |
| Long-tail turns | 0/30 | 0/30 |
The 26.7 percentage-point gap (8 of 30 runs) is real. On the six task types Llama passes at 3/3, the overlap with Claude is nearly complete: both models handle refactoring, log investigation, code tracing, minimal fixes, ambiguous specs, and multi-step plans cleanly. The differentiation is in the three task types Llama doesn’t reach: measurement precision (task_08), underspecified computation (task_09), and the final-write format regression on task_10.
The 16× cost difference — $0.003/run vs $0.048/run — is real too. At that ratio, deploying Llama 3.3 70B on a pipeline covering those six task types and routing the others to Claude makes sense in principle. Whether the 3 failing task types actually appear in a given pipeline is a product decision, not a benchmark one.
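Both headline ratios follow from the table’s raw counts. A quick check, using only figures from the comparison table (the dictionary layout is illustrative):

```python
claude = {"passed": 28, "runs": 30, "total_cost": 1.44}
llama = {"passed": 20, "runs": 30, "total_cost": 0.09}


def pass_rate_pct(model: dict) -> float:
    return 100 * model["passed"] / model["runs"]


def cost_per_run(model: dict) -> float:
    return model["total_cost"] / model["runs"]


gap_points = pass_rate_pct(claude) - pass_rate_pct(llama)
cost_ratio = cost_per_run(claude) / cost_per_run(llama)

print(f"{gap_points:.1f} percentage points")  # 26.7 percentage points
print(f"{cost_ratio:.0f}x cost per run")      # 16x cost per run
```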
What we still don’t know
- task_08 encoding: The exact byte representation of data.txt would confirm whether 29 reflects a character-count convention or a different miscounting. Not checked.
- task_10 regression scope: Is the final-write text-format reversion specific to task_10, or does it appear on other tasks with long multi-step chains? Three runs on one task can’t answer that.
- task_09 dataset: What are the actual revenue values in data.csv? 1000.0 might be a mean of three values summing to 3000, or it might reflect something else. The dataset wasn’t inspected directly.
- Zero redundancy as signal: Llama re-read zero files. Claude re-read on 23.3% of runs. Is Llama’s brevity a sign of higher confidence, or does it miss cases where re-reading would have caught a mistake? The passing tasks don’t distinguish these.
The infra debugging record
The reason it took three runs to get here matters separately from the 20/30 figure itself.
Run 1 produced a clean infrastructure finding: Bedrock + Llama + the then-current adapter speak different dialects of the function-calling protocol. That took a few hours to diagnose.
Run 2 produced a second infrastructure finding: a first fix isn’t always a complete fix. Solving the turn-1 parsing revealed a turn-3 failure that only appears once the model actually runs for multiple turns. You can’t see the second bug until the first one is gone.
Run 3 is what “clean” looks like on this stack. Both adapter layers patched, zero infrastructure errors, 30 runs with real tool execution. The 20/30 number comes from a model evaluation, not an adapter evaluation.
The three-run path from 0/30 to 20/30 is a reasonable record of what it takes to get a non-OpenAI model running correctly in a multi-turn agentic harness. It’s not unusual — but it is worth documenting, because the 20/30 number is only interpretable if the 0/30 and 14/30 context is visible alongside it.
Evidence pack: verification/ directory, campaign 2026-05-08-llama3.3-70b-agentic-core-v1-run3. Prior run data: data/campaigns/2026-05-15-llama3.3-70b-agentic-core-v1/.