36x cheaper. Same score.
Campaign: 2026-05-09-deepseek-v4-flash-agentic-core-v1
Model: DeepSeek-V4-Flash (284B total / 13B activated MoE, DeepSeek direct API, non-thinking mode)
Harness: openclaw@2026.4.22
Runs: 30 (10 tasks × 3 runs each)
Campaign date: 2026-05-13
The usual question before a campaign is: will this model clear the bar? DeepSeek-V4-Flash did more than clear it. It tied the top score. And it did that at $0.04.
That number is not a typo. Four cents. Total. Across all 30 runs.
Claude Sonnet 4.6 ran the same suite last month and scored the same 28/30 at $1.44. The math comes out to $0.051 per passing task for Sonnet; $0.0014 for V4-Flash. That is not a marginal cost difference. That is a different tier of economics, and it lands on the same scoreline.
This was also our first campaign against a non-Bedrock provider. V4-Flash ran via DeepSeek’s direct API through a new adapter. No 429s, no infrastructure failures, all 90 calls completed cleanly. The rate-limit risk that was listed as an open question going in did not materialise.
What agentic-core-v1 tests
[Observed]
agentic-core-v1 has 10 tasks, each run 3 times, for 30 total runs. The tasks are structured, scoped, and have deterministic checkers. They cover: fixing a failing test, refactoring duplicated code, investigating a log file, tracing through a codebase, writing a minimal fix under a line constraint, handling an ambiguous requirement (two-artifact output), multi-step sequential planning, recovering from a tool error with multibyte character byte-counting, knowing when to stop on an underspecified problem, and SQL investigation using native tool calls.
A run passes when the model’s output matches the checker’s acceptance criteria before the 15-turn budget runs out. A run fails as wrong_answer when the checker rejects the output, or as gave_up_mid_plan when the model hits the turn limit without committing a final answer.
The suite is calibrated to what frontier models can do in 2026. It is not trivial — task_09 has been the hardest task for every model we have run so far — but it is not adversarial. The environments are clean, the tools are reliable, and the acceptance criteria are explicit.
What V4-Flash did
[Observed]
28 of 30 runs passed. Pass rate: 93.33% (verified: verification/pass_rate_by_task.csv). Nine of ten task types were 3/3 clean. One task, task_09, went 1/3. All failures came from task_09.
| Task | Result | Notes |
|---|---|---|
| task_01 fix failing test | 3/3 | Avg 6.3 tool calls |
| task_02 refactor duplicated code | 3/3 | Avg 5.0 tool calls |
| task_03 investigate log | 3/3 | Most expensive: $0.021 total (500-line log input) |
| task_04 trace through codebase | 3/3 | Avg 6.7 tool calls |
| task_05 minimal fix | 3/3 | 8.3 avg tool calls; line constraint honoured throughout |
| task_06 handle ambiguous requirement | 3/3 | Two-artifact output correct in all runs |
| task_07 multi-step plan | 3/3 | Clean sequential execution, avg 4.0 tool calls |
| task_08 recover from tool error | 3/3 | Prediction reversal — see below |
| task_09 know when to stop | 1/3 | Only failure point |
| task_10 SQL investigation | 3/3 | Native tools confirmed, no adapter issues |
(verified: verification/pass_rate_by_task.csv)
Three pattern detectors ran against all 30 transcripts. All three returned zero hits.
tool_call_redundancy: 0/30 runs (verified: evidence bundle). No back-to-back identical tool calls in the entire campaign. In the Sonnet 4.6 campaign, 7 of 30 runs triggered this pattern — including both task_09 failures, which showed the model re-reading a CSV it already had loaded. V4-Flash had no such loops. It read what it needed once and moved on.
diagnosis_then_regression: 0/30 runs (verified: evidence bundle). No cases of a stated diagnosis being walked back. Same null result as Sonnet 4.6.
long_tail_turn_count: 0/30 runs (verified: evidence bundle). No run used more than 12 of the 15-turn budget. Even the two task_09 failures did not show long-tail turn usage. Average wall-clock time per run: 7.57 seconds. Fastest task: task_08 at 4.2 seconds average. Slowest: task_03 at 12.2 seconds (large log file input).
The one failure: task_09 again
[Observed]
task_09 asks the model to compute a 10-day moving average of a revenue column in a CSV with exactly 3 rows. A 10-day window on 3 data points is structurally underspecified — the task does not state a min_periods policy.
V4-Flash went 1/3 in non-thinking mode. Two failure modes: 1 run produced a wrong_answer, 1 run produced gave_up_mid_plan.
The wrong_answer failure: the model computed and committed an output that the checker rejected. Based on the failure pattern from comparable runs (Sonnet 4.6 ran 1/3 on this same task), this is likely a strict-NaN output path — the model produced NaN values for the full-window rows and wrote them without a compensating note. The checker requires either a valid numeric output with an explicit window-shortage note, or a statement that the computation is undefined under the specified window.
The gave_up_mid_plan failure: the model hit the turn budget without committing a final answer.txt. There was no detected redundancy (tool_call_redundancy returned zero for this run), which means the model was not looping on the same tool call. It may have been iterating with different arguments before running out of turns. The mechanism here is distinct from the Sonnet 4.6 task_09 failures, where redundant re-reads were the tell.
The 1/3 pass: the model correctly recognised the data insufficiency and produced output that met the checker criteria. This is the same best-case outcome that Sonnet 4.6 achieved — but Sonnet’s pass in run 1 was caused by a missing pandas library forcing a fallback path, not deliberate reasoning. Whether V4-Flash’s single pass was deliberate or also a fallback artifact is not determined from the evidence bundle alone.
The headline result is consistent with every model we have run: task_09 is hard. Claude Sonnet 4.6 went 1/3. Llama 3.3 70B went 0/3. V4-Flash went 1/3. The result is not model-specific — it appears to be a property of this underspecified ambiguity class in non-thinking mode.
The task_08 reversal
[Observed]
This is the most consequential number in the dataset.
Llama 3.3 70B failed task_08 across all 3 runs. The failure traced to incorrect multibyte character byte-length counting — a specific skill the 70B model got wrong every time.
Prediction P2 for this campaign assumed the failure would replicate: “byte-count failure replicated from Llama 3.3 70B.” That prediction was wrong (see predictions section below). V4-Flash got task_08 3/3, with correct byte counting in all three runs (verified: verification/pass_rate_by_task.csv, avg 2.3 tool calls per run).
Llama 3.3 70B has 70B activated parameters. V4-Flash activates 13B. The smaller model handled what the larger one could not.
[Speculation] The Llama 3.3 70B byte-count failure was model-specific, not a harness-structural failure or an underspecified task. V4-Flash’s GRPO two-stage RL post-training may have surfaced stronger instruction-following on precise low-level counting tasks. Or this could be an architecture effect from MoE routing. The data shows the reversal; it does not explain it.
Where the predictions went wrong
[Observed]
Four of six pre-run predictions were wrong. Three were wrong in the same direction: the model was better than expected.
P1 — score range (wrong, high miss): Predicted 21–26/30 (70–87%). Point estimate: 23/30. Actual: 28/30. The ceiling was missed by 2 points. The calibration error traces to P2 and P3.
P2 — task_08 (wrong): Predicted 0/3. Got 3/3. This was the most consequential miss. The assumption that Llama 3.3 70B’s byte-count failure would replicate to V4-Flash was wrong. It did not.
P3 — task_09 (wrong — close miss): Predicted 0/3. Got 1/3. The prediction named “1/3 as the best case if post-training surfaces on ambiguous data conditions” — that case occurred. The committed point estimate was 0; the actual was 1. Scored wrong per prediction rules.
P4 — task_10 (correct): Predicted 2/3 or better with native OpenAI tools. Got 3/3. The task_10 failures in Llama 3.3 70B were adapter artifacts from the adapter translation layer, not model capability failures. V4-Flash used native tool calls without that adapter overhead.
P5 — cost (wrong, 10x miss): Predicted $0.25–$0.75. Got $0.04. Off by more than 10x against the floor. The cost prediction assumed 30K input tokens per run from prior local-model data. Actual total input tokens across all 30 runs: approximately 265K (verified: verification/cost_breakdown.csv), not the projected 900K. The agentic-core-v1 harness uses much shorter task scaffolds than production agentic jobs. Cost prediction methodology needs calibrating against harness-specific token logs, not real-world job sizes.
P6 — adapter clean (correct): Predicted zero infrastructure errors from DeepSeekAdapter tool-call parsing. Got zero. An earlier run attempt on 2026-05-10 failed with a message-format bug where harness tool results were sent as role=user instead of role=tool. That bug was fixed in PR #29 before this campaign ran. The P6 prediction was about the adapter’s tool-call parsing specifically; the message-format bug was a separate harness issue. Both resolved.
Cost and latency
[Observed] (verified: verification/cost_breakdown.csv, verification/latency_distribution.csv)
Total cost: $0.04 ($0.0416). Per-run average: $0.0014. Cheapest individual task: task_08 at $0.001 total across 3 runs. Most expensive: task_03 at $0.021 total — the 500-line log file drives cost regardless of which model runs it, same as in the Sonnet 4.6 campaign.
The comparison that matters:
| Model | Score | Cost | Cost per passing run |
|---|---|---|---|
| Claude Sonnet 4.6 | 28/30 (93.3%) | $1.44 | $0.051 |
| DeepSeek-V4-Flash | 28/30 (93.33%) | $0.04 | $0.0014 |
| Llama 3.3 70B (run 3) | 20/30 (66.7%) | $0.09 | $0.0045 |
(source: TASK-303 brief, verified against campaign cost_breakdown CSVs for each run)
V4-Flash costs $0.0014 per passing task run. Sonnet costs $0.051. The ratio is 36x. Both models failed only task_09.
Average wall-clock time per run: 7.57 seconds. This campaign ran through DeepSeek’s direct API. All 90 calls completed without 429s or infrastructure failures (verified: adapter log).
What we don’t know yet
[Speculation]
-
The task_08 mechanism. V4-Flash got 3/3 on multibyte character byte-counting. Llama 3.3 70B got 0/3. The activated parameter count goes the wrong way (13B vs 70B). Whether this traces to GRPO post-training, architecture differences, or training data coverage on byte-level operations is an open question. Upcoming GPT-5.5 Instant and DeepSeek-R1 results will add data points.
-
The task_09 single pass. Does the 1/3 reflect real ambiguity recognition, or run-to-run variance? Three runs per task is not enough to distinguish a 33% recognition rate from noise. The 1/3 result has been stable across all four campaigns, but stable could mean “the real rate is around 30%” or it could mean “only certain random seeds produce the right answer.”
-
Non-thinking versus thinking mode. This campaign used explicit non-thinking mode — the mode most builders will use for cost control. Thinking mode would likely improve task_09. The 93.33% figure is specifically for non-thinking V4-Flash.
-
Harder task variants. agentic-core-v1 is structured, scoped, and checker-deterministic. V4-Flash matched Sonnet here. Whether that holds on agentic-core-v2 tasks — which will include multi-agent coordination, stateful context, and adversarial tool results — is untested.
Evidence pack: data/campaigns/2026-05-09-deepseek-v4-flash-agentic-core-v1/. Predictions file: campaigns/2026-05-09-deepseek-v4-flash-agentic-core-v1.predictions.md. Full transcripts in data/transcripts/.