Scout knew the fix. It described the fix. It never wrote the fix.
Campaign: 2026-06-14-llama4-scout-agentic-core-v1
Model: Meta Llama 4 Scout (17B active / 109B total MoE, Q4_K_M, llama.cpp local)
Harness: agentic-core-v1 (openclaw@2026.4.22)
Runs: 30 (10 tasks × 3 runs each)
Date: 2026-06-14T05:56–06:01Z
When Meta shipped Llama 4 Scout, the story was the context window. 10 million tokens. The MoE architecture that keeps inference costs down (17B active params out of 109B total) was the other part of the pitch. The implied claim was that Scout is capable at the price point of something much smaller.
We ran Scout on agentic-core-v1 to check whether “capable” extended to software engineering tasks. It scored 10/30.
Scout cleared every task that fit a clean pattern: read one thing, write one thing, in sequence, no ambiguity. On anything harder — tasks that required reasoning about intermediate state, or mixing exploration with file modification — it stopped short, described what it would do, and called that done.
One of our pre-run predictions said task_09 (know_when_to_stop) would score at least 1/3. It scored 0/3, and not in the way we expected. We were wrong about how it would fail.
What the harness tests
agentic-core-v1 runs a model through 10 software engineering tasks, each repeated 3 times independently. The tasks are: fixing a failing test, refactoring duplicated code, investigating a log file, tracing through a codebase, making a minimal fix, handling an ambiguous requirement, multi-step file creation, recovering from a tool error, knowing when to stop, and a SQL investigation. A run passes when the model writes the correct answer to an output file before hitting the turn limit.
The harness measures agentic execution, not comprehension. A model can understand every task perfectly and still fail every run if it never calls the tools.
The score
[Observed]
10 of 30 runs passed. Pass rate: 33.3% (verified: pass_rate_by_task.csv). Campaign compute cost: approximately $0.32 EC2 (g6.12xlarge, 5-minute runtime). Cost per passing run: approximately $0.032, i.e. $0.32 spread over 10 passes (verified: cost_breakdown.csv).
| Task | Passes | Avg tool calls | Avg latency |
|---|---|---|---|
| task_07 (multi_step_plan) | 3/3 | 4.0 | 7.0s |
| task_08 (recover_from_tool_error) | 3/3 | 2.0 | 7.0s |
| task_10 (sql_investigation) | 3/3 | 3.0 | 4.0s |
| task_02 (refactor_duplicated_code) | 1/3 | 3.3 | 11.7s |
| task_01 (fix_failing_test) | 0/3 | 2.0 | 13.7s |
| task_03 (investigate_log) | 0/3 | 1.0 | 8.1s |
| task_04 (trace_through_codebase) | 0/3 | 0.7 | 14.0s |
| task_05 (minimal_fix) | 0/3 | 0.0 | 23.3s |
| task_06 (handle_ambiguous_req) | 0/3 | 2.0 | 19.4s |
| task_09 (know_when_to_stop) | 0/3 | 0.0 | 7.5s |
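The headline numbers can be re-derived from the table. The sketch below copies the per-task pass counts and the approximate campaign cost from this article; the variable names are ours, not the harness's.

```python
# Per-task pass counts copied from the table above (each task ran 3 times).
passes_by_task = {
    "task_01": 0, "task_02": 1, "task_03": 0, "task_04": 0, "task_05": 0,
    "task_06": 0, "task_07": 3, "task_08": 3, "task_09": 0, "task_10": 3,
}
RUNS_PER_TASK = 3
CAMPAIGN_COST_USD = 0.32  # approximate EC2 cost reported in this article

total_runs = RUNS_PER_TASK * len(passes_by_task)
total_passes = sum(passes_by_task.values())
pass_rate = total_passes / total_runs
cost_per_pass = CAMPAIGN_COST_USD / total_passes

print(f"{total_passes}/{total_runs} passed ({pass_rate:.1%})")  # 10/30 passed (33.3%)
print(f"cost per passing run: ${cost_per_pass:.3f}")            # $0.032
```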
Comparison against the same harness, same 30-run structure:
| Model | Score | Pass rate | Approx $/pass | Tier | Source |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 28/30 | 93.3% | $0.051 | API | (source: 2026-05-15-claude-sonnet-4.6-agentic-core-v1.md) |
| Llama 3.3 70B | ~20/30 | 60–73% | ~$0.80 | Local (multi-campaign avg) | (source: 2026-05-07-llama3.3-70b-agentic-core-v1-rerun.md, 2026-05-08-llama3.3-70b-agentic-core-v1-run3.md, 2026-05-15-llama3.3-70b-agentic-core-v1.md) |
| Llama 4 Scout 17B-active MoE | 10/30 | 33.3% | ~$0.032 | Local (llama.cpp) | (verified: pass_rate_by_task.csv, cost_breakdown.csv) |
The 60-percentage-point gap between Scout (33.3%) and Sonnet 4.6 (93.3%) is not rounding error. Scout also sits below Llama 3.3 70B’s multi-campaign average, despite being a newer, hardware-optimised architecture. On per-pass compute cost, Scout wins, but that comparison only holds for the tasks where it actually passes.
The two ways it failed
[Observed]
Two distinct patterns account for the 20 failing runs. They are different problems with different root causes, and it matters which is which.
Failure mode 1: correct diagnosis, execution never fires
On task_05 (minimal_fix), Scout reads the scaffold files, produces a fully correct implementation in a code block in its text response, and stops. fs_write is never called. The solution exists in the model’s output as a suggestion, not an action.
From task_05, run 1 (verified: transcript at data/transcripts/, turn 2, role=assistant):
“To implement the count_users(path) function… [reads file, produces correct solution in markdown code block] …I will assume I need to make the following change: [code block]”
No tool_use event follows. The task fails because the file was never written.
task_01 (fix_failing_test) runs the same way. Scout reads the broken test, reads the implementation file, correctly identifies that return a - b is the bug, produces the fix, and then describes writing the file rather than doing it. Average tool calls: 2.0. Both were reads. Zero writes (verified: tool_calls_by_task.csv).
This pattern is consistent across all three runs of both tasks. It is not a fluke.
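A rough way to flag this pattern in transcripts is to look for runs where the assistant proposed code in text but no write tool was ever called. The sketch below assumes a hypothetical event shape (a list of dicts with `type`, `name`, and `text` keys); the real transcript schema may differ, and only the tool name fs_write comes from the runs described here.

```python
from typing import Iterable

WRITE_TOOLS = {"fs_write"}  # tool name from the article; extend as needed
FENCE = "`" * 3             # markdown code-fence marker

def is_described_not_executed(events: Iterable[dict]) -> bool:
    """True when a run proposed code in assistant text but never called a write tool.

    The event layout (type/name/text keys) is an assumption for illustration.
    """
    wrote = False
    proposed_code = False
    for event in events:
        if event.get("type") == "tool_use" and event.get("name") in WRITE_TOOLS:
            wrote = True
        elif event.get("type") == "text" and FENCE in event.get("text", ""):
            proposed_code = True
    return proposed_code and not wrote

# Hypothetical run shaped like the task_01 failures: two reads, a fix in prose, no write.
run = [
    {"type": "tool_use", "name": "fs_read"},
    {"type": "tool_use", "name": "fs_read"},
    {"type": "text", "text": f"I will make the following change:\n{FENCE}python\n# fix here\n{FENCE}"},
]
print(is_described_not_executed(run))  # True
```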
Failure mode 2: inline bracket notation instead of structured tool calls
On tasks that require investigation mixed with code modification, Scout reverts to embedding tool calls as inline text in the assistant message. [fs_read(path="src/price.py")] appears in the text content block, not as a structured tool_use event. The harness never executes them. The rest of the turn is Scout reasoning about code it has not seen.
From task_06, run 1 (verified: transcript at data/transcripts/, turn 1, role=assistant):
“To solve this task, I will first investigate… [fs_read(path="tests/test_price.py")] Next, I’ll read the price calculator code: [fs_read(path="src/price.py")] After reviewing the code, I notice…”
The reads never happened. What follows is Scout guessing at code structure and producing incorrect analysis.
Rigg identified this as partial-jinja degradation. The --jinja flag was required to enable structured tool calls at all on Scout: without it, the model ignored the tool schema entirely. With it, structured calls work on straightforward tasks. When cognitive load increases (mixed reasoning, file I/O, code modification in the same turn), Scout falls back to an inline prose format, one that plausibly appears in its pretraining data more often than structured JSON tool calls.
[Speculation] This suggests the instruction-following fine-tuning for tool use is shallower than it appears from the clean tasks. The behavior on simple tasks looks reliable; the behavior on complex tasks suggests the model has not internalized structured tool dispatch as a consistent mode, but learned it as a surface-level pattern that breaks under load.
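Inline bracket notation is easy to detect mechanically, since the pseudo-calls follow a `[tool(args)]` shape. The following is a minimal sketch, not the harness's own check; the regex and the function name are our assumptions, and the sample text is taken from the task_06 excerpt above.

```python
import re

# Matches bracketed pseudo-calls such as [fs_read(path="src/price.py")].
INLINE_CALL = re.compile(r"\[(?P<tool>[A-Za-z_]\w*)\((?P<args>[^\[\]]*)\)\]")

def find_inline_calls(text: str) -> list[str]:
    """Return the tool names of any bracketed pseudo-calls found in assistant text."""
    return [match.group("tool") for match in INLINE_CALL.finditer(text)]

sample = (
    "To solve this task, I will first investigate... "
    '[fs_read(path="tests/test_price.py")] Next, I\'ll read the price '
    'calculator code: [fs_read(path="src/price.py")]'
)
print(find_inline_calls(sample))  # ['fs_read', 'fs_read']
```

A non-empty result in a turn that carries no structured tool_use event is a reasonable signal that the run hit this failure mode.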
What worked
[Observed]
The three passing task types share a structural property: unambiguous, sequential, minimal-reasoning operations.
task_07 (multi_step_plan): write 4 files in a specified order, each with fixed content. Scout formats structured tool_use events correctly, executes all four writes cleanly, 3/3 (verified: pass_rate_by_task.csv, tool_calls_by_task.csv).
task_08 (recover_from_tool_error): read a file, hit a deliberate error, retry, write a result. Error recovery triggered and resolved. 3/3, 2.0 avg tool calls.
task_10 (sql_investigation) was the fastest in the campaign: 4.0s average latency, 3/3. Read a schema file and a log, identify the inconsistency, write a finding. Three operations, unambiguous order.
The pattern: when the decision about which tool to call next is obvious from the task state, Scout calls it. When the model has to weigh intermediate results against a goal before deciding on the next action, the failure modes appear.
We were wrong about task_09
[Observed]
Pre-run prediction P2 said task_09 would score at least 1/3 (verified: predictions/llama4-scout-agentic-core-v1-2026-06-08.md). It scored 0/3.
task_09 (know_when_to_stop) asks the model to compute a 10-day moving average on a CSV with 3 rows of data. The task is impossible. 3 data points cannot produce a 10-day window. The correct response is to recognize the constraint and write a clean “cannot compute” finding.
We predicted Scout would catch this at least once, because smaller models in the dataset have generally handled the refusal pattern correctly. We were wrong about the failure mode. Scout did not get confused and attempt a bad calculation. It tried to execute the computation via inline shell syntax and failed silently. The model did not refuse. It hallucinated a solution path, produced inline [shell(command="...")] notation that the harness never ran, and recorded 0 tool calls (verified: tool_calls_by_task.csv).
This is a different failure than what task_09 tests for. The task asks whether the model knows when to stop. Scout did not stop — it attempted something that looked like forward progress. That behavior is arguably harder to detect in production than a clean refusal, because the model’s output reads as if it tried.
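The constraint task_09 hinges on fits in a few lines. The sketch below is our illustration of the check a passing run would need to make before computing anything; the function name, the sample values, and the wording of the finding are hypothetical, not the task's reference solution.

```python
def moving_average(values: list[float], window: int) -> list[float]:
    """Simple moving average; refuses when there are fewer rows than the window."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        raise ValueError(f"cannot compute: {len(values)} rows < window of {window}")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

finding = ""
try:
    moving_average([1.0, 2.0, 3.0], window=10)  # 3 hypothetical rows, 10-day window
except ValueError as err:
    finding = str(err)

print(finding)  # cannot compute: 3 rows < window of 10
```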
The campaign itself needed a fix first
This campaign required a --jinja flag fix before any structured tool calls were possible. Before that fix, Scout’s llama-server was ignoring the Jinja chat template, and the model was running with Content-only format in /props. All tool calls appeared as inline prose.
A full set of 30 runs completed before the misconfiguration was caught and corrected. Rigg reran the full campaign post-fix (05:56–06:01Z, 2026-06-14). The scores in this article are from the corrected run.
Anyone running Llama 4 Scout locally: check /props on your llama-server. If chat_format shows Content-only, the model’s own template is being bypassed. The --jinja flag is required to inject the correct template.
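That check can be scripted. The sketch below assumes the server is reachable at a local address you supply and that /props exposes the format under a `chat_format` key, as described above; both are assumptions to verify against your llama.cpp build, since the response layout can vary between versions.

```python
import json
from urllib.request import urlopen

def template_bypassed(props: dict) -> bool:
    """True if the reported chat format is the Content-only fallback.

    The "chat_format" key is taken from the article; treat it as an assumption.
    """
    return str(props.get("chat_format", "")).strip().lower() == "content-only"

def fetch_props(base_url: str) -> dict:
    """Fetch /props from a running llama-server, e.g. base_url="http://127.0.0.1:8080"."""
    with urlopen(f"{base_url}/props", timeout=5) as resp:
        return json.load(resp)

# Offline demonstration with a hand-written response fragment (hypothetical).
print(template_bypassed({"chat_format": "Content-only"}))  # True
```

Against a live server, `template_bypassed(fetch_props("http://127.0.0.1:8080"))` returning True would suggest restarting with --jinja.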
[Unobserved] We did not collect comparative data from the pre-fix runs in a way that would let us quantify how much of the inline notation failure mode was harness setup versus model behavior. The post-fix run showed structured tool calls on simple tasks and inline notation on complex tasks — so the regression is real, not entirely an artifact of the misconfiguration.
On the MoE question
[Observed, with speculation noted]
The original research question was whether Scout’s MoE architecture translates to adequate agentic execution at local inference cost. The answer from this campaign: adequate on narrow tasks, inadequate on anything requiring sustained tool-loop reasoning.
task_04 (trace_through_codebase) averaged 0.7 tool calls (verified: tool_calls_by_task.csv). The task requires reading 5–8 files in logical sequence and synthesizing a trace. Scout read one file, if that, and stopped: an average of 0.7 works out to about 2 calls across 3 runs, so at least one run made no tool call at all. For comparison, Devstral 2 averaged approximately 6–7 tool calls on the same task with a 100% pass rate (source: 2026-05-17-devstral-2-agentic-core-v1.md).
[Speculation] The 17B active-param constraint is not what caused the tool-call formatting instability — that is a training issue, not an architecture limit. What the active-param count may affect is multi-hop reasoning depth: the capacity to hold intermediate results, plan the next file read, and commit a tool call based on that plan. The task_04 data is consistent with a model that runs out of context-tracking capacity on complex traversal tasks, though we did not instrument for that specifically.
The cost curve is real. At approximately $0.032/pass on local EC2 compute, Scout undercuts Llama 3.3 70B’s ~$0.80/pass by roughly 25×. If a builder’s task mix is 70% simple sequential operations (the kind that look like task_07, task_08, task_10), Scout’s economics are defensible. If the task mix includes open-ended code investigation or modification, the math inverts. Failed runs that require rework or human intervention cost more than the per-pass savings.
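To make the inversion concrete, here is a back-of-envelope model of expected cost per task as the task mix shifts. The pass rates and compute cost come from this campaign; the rework cost is a purely hypothetical placeholder, since we did not measure what a failed run costs to clean up.

```python
CAMPAIGN_COST_USD = 0.32  # approximate EC2 total from this campaign
TOTAL_RUNS = 30
COST_PER_RUN = CAMPAIGN_COST_USD / TOTAL_RUNS  # every run costs compute, pass or fail

# Pass rates taken from the per-task table.
SIMPLE_PASS_RATE = 9 / 9   # task_07, task_08, task_10
HARD_PASS_RATE = 1 / 21    # the other seven tasks

def expected_cost_per_task(simple_share: float, rework_cost_usd: float) -> float:
    """Compute cost of one attempt plus the expected cost of handling a failed run."""
    pass_rate = simple_share * SIMPLE_PASS_RATE + (1 - simple_share) * HARD_PASS_RATE
    return COST_PER_RUN + (1 - pass_rate) * rework_cost_usd

HYPOTHETICAL_REWORK_USD = 1.00  # assumption: not measured in this campaign

for share in (1.0, 0.7, 0.3):
    cost = expected_cost_per_task(share, HYPOTHETICAL_REWORK_USD)
    print(f"simple share {share:.0%}: ~${cost:.3f} per task")
```

Under these assumptions the rework term dominates as soon as the share of simple tasks drops below 100%, because compute per run is only about a cent. A 30% simple share matches this campaign's own mix (3 of 10 tasks).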
Predictions scoring
| Prediction | Result | Evidence |
|---|---|---|
| P1: task_04 passes ≤1/3, avg tool_calls ≤3.0 | CORRECT | task_04: 0/3, avg 0.7 tool calls (verified: pass_rate_by_task.csv, tool_calls_by_task.csv) |
| P2: task_09 scores ≥1/3 | WRONG | task_09: 0/3 — inline shell syntax rather than refusal (verified: pass_rate_by_task.csv) |
| P3: Campaign compute <$25, cost/pass beats Llama 3.3 70B | CORRECT | ~$0.32 EC2 total; ~$0.032/pass vs ~$0.80/pass for Llama 3.3 70B (verified: cost_breakdown.csv) |
Two of three correct. The miss on P2 matters: we picked task_09 as the most likely to pass among the tasks that ended up failing, because the refusal pattern is simpler than execution. Scout showed a third behavior, neither a clean refusal nor an obviously bad calculation, and it was the harder one to catch.
What we don’t know yet
The two failure modes are distinguishable in transcript data, but we have not yet tested whether fine-tuning on structured tool calls would close the gap on mode 2 (inline notation) without fixing mode 1 (execution commitment). They look like separate problems. Mode 1 is a failure to commit to action after correct reasoning. Mode 2 is a formatting regression under cognitive load. It is plausible they have different training interventions.
We also do not have data on Scout at larger context utilization. The 10M token window is central to the product pitch, but none of agentic-core-v1’s tasks approach that scale. Whether MoE routing stability degrades at large context is unobserved in this campaign.
A follow-up campaign using a fine-tuned Scout variant, or Scout with explicit chain-of-thought prompting for tool dispatch decisions, would let us distinguish architecture limits from training gaps.