The second perfect score. It costs $0.12.

June 16, 2026 · campaign-reports

Campaign: 2026-06-15-deepseek-v4-pro-thinking-agentic-core-v1
Model: DeepSeek V4-Pro with thinking mode enabled (DeepSeek direct API)
Harness: agentic-core-v1 (openclaw@2026.4.22)
Runs: 30 (10 tasks × 3 runs each)
Campaign date: 2026-06-15 (23:47–23:51Z)

Two weeks ago Claude Opus 4.8 became the first model to score 30/30 on agentic-core-v1. That campaign cost $7.34.

DeepSeek V4-Pro in thinking mode scored 30/30 last night. The campaign cost $0.12.

$7.34 versus $0.12 is a 98% reduction for the same outcome: zero failures, zero gave_up_mid_plan (where the model starts work and stops before committing an answer), zero wrong_answer, and no other failure label. Both models passed all 30 runs across all 10 tasks. The only difference between the two campaigns is price.

V4-Pro thinking mode also fixes the one gap that V4-Flash left open. Our previous DeepSeek campaign (2026-05-09) scored 28/30 at $0.04. Flash scored 1/3 on task_09, the task designed to test whether a model knows when a problem is unsolvable. V4-Pro thinking scored 3/3. Two points, same task, different ending.

What the harness actually tests

agentic-core-v1 runs a model through 10 tasks: fix a failing test, refactor duplicated code, investigate a log file, trace through a codebase, make a minimal fix, handle an ambiguous requirement, execute a multi-step plan, recover from a tool error, know when to stop, and run an SQL investigation. Each task runs 3 times independently. A run passes when the model writes the correct answer to an output file before hitting the turn limit.

Each failure gets a label. gave_up_mid_plan means the model started work but stopped before the turn limit without committing a complete answer. wrong_answer means the model did write something to the output file, but it did not satisfy the acceptance criteria. tool_call_redundancy means the model made the same tool call twice back-to-back with identical arguments. diagnosis_then_regression means the model stated a diagnosis and then walked it back. cross_task_consistency failures are logged when the same failure mode recurs across three or more distinct task types.

Scoring 30/30 means passing all three runs on every task, with no failure of any kind across all 30 runs.
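Two of the labels above have definitions mechanical enough to sketch in code. The snippet below is an illustration only, not the harness's implementation: the trace layout (a list of `(tool_name, arguments)` pairs), the failure layout (a list of `(task_type, failure_mode)` pairs), and both function names are assumptions made for this example.

```python
from collections import defaultdict


def has_tool_call_redundancy(calls):
    """True if any two consecutive calls have the same tool name and arguments.

    `calls` is a hypothetical trace: a list of (tool_name, arguments) pairs.
    """
    return any(first == second for first, second in zip(calls, calls[1:]))


def cross_task_failures(failures, min_task_types=3):
    """Failure modes that recur across at least `min_task_types` distinct task types.

    `failures` is a hypothetical list of (task_type, failure_mode) pairs.
    """
    tasks_by_mode = defaultdict(set)
    for task_type, mode in failures:
        tasks_by_mode[mode].add(task_type)
    return sorted(
        mode for mode, tasks in tasks_by_mode.items() if len(tasks) >= min_task_types
    )


# Hypothetical trace: the second call repeats the first with identical arguments.
trace = [
    ("read_file", {"path": "data.csv"}),
    ("read_file", {"path": "data.csv"}),
    ("write_file", {"path": "output.txt"}),
]
print(has_tool_call_redundancy(trace))
```

Under these assumptions a repeated call only counts when it is back-to-back: reading the same file again after an intervening call would not trigger the label.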

The score

[Observed]

30 of 30 runs passed. Pass rate: 100.0% (verified: pass_rate_by_task.csv)¹. Total cost: $0.12. Cost per passing run: $0.004 ($0.12 across 30 passing runs). The campaign ran in 4 minutes and 20 seconds.
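The headline numbers follow from simple arithmetic over the 30 runs. The sketch below reproduces them from a hypothetical list of run records; the record fields and the `summarize` helper are assumptions for illustration and do not reflect the actual layout of pass_rate_by_task.csv.

```python
def summarize(runs, total_cost_usd):
    """Return (passed, pass rate in percent, cost per passing run in USD)."""
    passed = sum(1 for run in runs if run["passed"])
    pass_rate = 100.0 * passed / len(runs)
    cost_per_pass = total_cost_usd / passed if passed else float("inf")
    return passed, round(pass_rate, 1), round(cost_per_pass, 3)


# Hypothetical records: 10 tasks x 3 runs, all passing, as in this campaign.
runs = [
    {"task": f"task_{task:02d}", "run": run, "passed": True}
    for task in range(1, 11)
    for run in range(1, 4)
]
print(summarize(runs, 0.12))  # (30, 100.0, 0.004)
```

Note that the $0.004 figure divides by the 30 passing runs, not by the 10 tasks.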

The full top-of-leaderboard picture, by score (verified: pass_rate_by_task.csv for all campaigns):

Model	Score	Pass rate	Cost (30 runs)	Date
Claude Opus 4.8	30/30	100%	$7.34	Jun 2026
DeepSeek V4-Pro (thinking)	30/30	100%	$0.12	Jun 2026
Mistral Small 4	29/30	96.7%	$0.03	May 2026
DeepSeek V4-Flash	28/30	93.3%	$0.04	May 2026
Claude Sonnet 4.6	28/30	93.3%	$1.44	May 2026

Zero failure modes across all 30 runs (verified: failure_mode_histogram.csv):

gave_up_mid_plan: 0 of 30
wrong_answer: 0 of 30
tool_call_redundancy: 0 of 30
diagnosis_then_regression: 0 of 30
cross_task_consistency failures: 0 of 30
long_tail_turn_count violations (runs that use an unusually high number of turns before producing output): 0 of 30

For context: Gemini 3.1 Pro, which scored 23/30 one week earlier (verified: pass_rate_by_task.csv, 2026-06-15-gemini-3.1-pro-agentic-core-v1), showed tool_call_redundancy in 17 of 30 runs². V4-Pro thinking shows none of that across a larger passing set.

Why did V4-Flash miss what V4-Pro thinking nailed?

[Observed]

task_09 (know_when_to_stop) is the task that separates the two campaigns. The setup: a file called data.csv contains 3 rows of data. The prompt asks for a 10-day moving average. The correct response is to recognize that 3 data points cannot support a 10-day window, write a clean finding to the output file, and stop.

V4-Flash (2026-05-09) passed task_09 once in three runs (verified: pass_rate_by_task.csv, V4-Flash campaign). Two runs ended in gave_up_mid_plan — the model detected the data-availability constraint but did not commit a final answer to the output file.

V4-Pro thinking passed task_09 all three times. Average: 3.7 tool calls per run, 9.3 seconds. Each run produced a clean output. One verbatim example from run 2 (verified: work/task_09_run_2/output.txt):

“The dataset contains only 3 records, which is fewer than the 10 required for a 10-day moving average calculation.”

Read the data. Recognize the impossibility. Write the finding. Stop.
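The behavior task_09 rewards can be sketched as a guard that runs before any calculation. This is a minimal illustration of the constraint check, not the task's reference solution: the function name, the wording of the finding, and the sample values are all assumptions.

```python
def moving_average_or_finding(values, window):
    """Return trailing moving averages, or a plain-text finding if data is too short."""
    if len(values) < window:
        # Not enough points for even one full window: report and stop.
        return (
            f"Cannot compute a {window}-day moving average: "
            f"only {len(values)} records are available."
        )
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical values standing in for the 3 rows of data.csv.
print(moving_average_or_finding([10.0, 12.0, 11.0], 10))
```

The point of the sketch is the ordering: the length check comes first, so the function never attempts a partial calculation on 3 rows.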

The contrast with other models on this task is worth the comparison. Gemini 3.5 Flash scored 0/3 on task_09 in its campaign (verified: pass_rate_by_task.csv, 2026-06-gemini-3.5-flash-agentic-core-v1). Gemini 3.1 Pro also scored 0/3 (verified: pass_rate_by_task.csv, 2026-06-15-gemini-3.1-pro-agentic-core-v1). Both showed looping behavior: re-read data.csv, attempt partial calculations, re-read again, exhaust the turn budget. V4-Pro thinking does not loop. It reasons about the constraint and commits an answer.

[Speculation] The thinking mode appears to act as a buffer between “I’ve read the data” and “I’ll start writing output.” That buffer is where constraint reasoning happens. V4-Flash, without thinking mode, had the detection capability (1/3 pass shows it can) but not the consistency to convert that detection into committed output across all three runs.

What zero failure modes actually looks like

[Observed]

The per-task breakdown (verified: pass_rate_by_task.csv, tool_calls_by_task.csv, latency_distribution.csv, cost_breakdown.csv):

Task	Score	Avg tool calls	Avg latency	Avg cost/run
task_01 fix failing test	3/3	6.0	6.5s	$0.003
task_02 refactor duplicated code	3/3	7.0	10.5s	$0.004
task_03 investigate log	3/3	9.3	12.1s	$0.017
task_04 trace through codebase	3/3	6.0	6.8s	$0.003
task_05 minimal fix	3/3	7.7	11.4s	$0.004
task_06 handle ambiguous requirement	3/3	8.0	12.2s	$0.005
task_07 multi-step plan	3/3	4.0	6.0s	$0.004
task_08 recover from tool error	3/3	2.0	4.9s	$0.002
task_09 know when to stop	3/3	3.7	9.3s	$0.004
task_10 SQL investigation	3/3	3.0	5.5s	$0.003

task_08 (recover from tool error) averages 2.0 tool calls. The model finds the fix and stops. In Gemini 3.1 Pro’s campaign the same task averaged 6.0 tool calls with a 1/3 pass rate — more searching, worse outcome. V4-Pro thinking does not over-explore.

task_03 (investigate log) is the most expensive task at $0.017/run, driven by 9.3 average tool calls to work through an access log and identify database connection pool exhaustion as the cause of HTTP 500s on POST /api/orders. All three runs identified the same correct root cause with no wrong diagnoses. That consistency is not universal: Gemini 3.1 Pro showed diagnosis_then_regression in multiple task_03 runs in its campaign (verified: failure_mode_histogram.csv, 2026-06-15-gemini-3.1-pro-agentic-core-v1).

task_06 (handle ambiguous requirement) produced documented assumptions in all three runs. The function signature in the task has an unclear input contract. Each run named its interpretation before implementing. That’s observable behavior, not a claim about intent.

Were our predictions right?

[Observed]

Five predictions were committed before the campaign ran (verified: campaigns/deepseek-v4-pro-thinking-agentic-core-v1.notes.md):

Label	Prediction	Verdict	Detail
P1	V4-Pro thinking > V4-Flash — expect 25–29/30	✅ EXCEEDED	30/30, beyond the upper bound
P2	task_09 will pass 3/3	✅ CORRECT	3/3 confirmed
P3	task_03 + task_10 will be 3/3	✅ CORRECT	Both 3/3
P4	task_07 will be 3/3 but show reasoning token waste	✅ PARTIAL	3/3, but no waste observed
P5	Cost approx $0.19–$0.30 total	❌ WRONG	$0.12 actual — cheaper than estimated

Four of five. P5 was wrong in the interesting direction: thinking tokens cost almost nothing in practice. At $0.435/M input and $0.87/M output, the brief reasoning passes per run are cheap. The campaign came in at $0.12 against the $0.19–$0.30 estimate.
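To see why short reasoning passes stay cheap at these rates, here is a rough per-run cost calculation. The two prices are the ones quoted above; the token counts are hypothetical, and the sketch assumes thinking tokens are billed at the output rate, which this report does not confirm.

```python
INPUT_USD_PER_M = 0.435   # quoted input price per million tokens
OUTPUT_USD_PER_M = 0.87   # quoted output price per million tokens


def run_cost(input_tokens, output_tokens):
    """Cost in USD for one run, given token counts.

    Assumption: reasoning tokens are counted in `output_tokens`.
    """
    return (
        input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    ) / 1_000_000


# Hypothetical run: 6,000 input tokens and 1,500 output tokens.
print(f"${run_cost(6_000, 1_500):.4f}")  # $0.0039
```

A run of that hypothetical size lands near the observed $0.004 average, which is consistent with the campaign total but does not establish the real token counts.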

P4 gets an asterisk. task_07 (multi-step plan) passed 3/3 with 4.0 average tool calls. Opus 4.7 averaged 40.0 on the same task. Whether thinking mode substitutes internal reasoning for external tool-call loops, or whether V4-Pro just does not show that behavior on this task, is not visible from the data (see “What we don’t know”).

The miss on P1 is directional: the model scored beyond the upper bound of the prediction range. The prediction spec estimated V4-Flash at 22/30; the actual V4-Flash result (2026-05-09) was 28/30. That narrowed the expected improvement gap from 8+ points to 2 points. V4-Pro thinking still closed it.

What we don’t know

[Unobserved] Whether the zero failure mode result holds under replication. This campaign ran once at n=30. The absence of tool_call_redundancy and diagnosis_then_regression across all 30 runs is consistent, but 30 clean runs from a single campaign cannot rule out a low underlying failure rate. A replication run would strengthen the claim.

[Unobserved] The mechanism behind task_07’s low tool-call count. V4-Pro thinking averaged 4.0 calls on task_07; Opus 4.7 averaged 40.0 on the same task. Whether thinking mode changes verification behavior on structured multi-step work, or whether the difference reflects something else about the model, is not observable from this data alone.

[Speculation] The thinking mode may be doing more work than the current harness can measure. On tasks where boundary-condition reasoning matters, task_09 being the clearest case, the gap between thinking and non-thinking is visible. On tasks where the answer is more mechanical (task_01, task_04, task_10), V4-Flash already scored 3/3 without thinking mode. A harness with more edge-case reasoning tasks would show more signal.

[Unobserved] Performance outside agentic-core-v1. This is a coding and log-investigation task suite. V4-Pro thinking’s behavior on longer-form reasoning tasks or other harnesses is not established from this campaign.

The cost picture, directly stated

[Observed]

Opus 4.8 holds the same score. Its campaign cost about 61× as much ($7.34 vs $0.12). Cost per passing run: $0.245 for Opus 4.8, $0.004 for V4-Pro thinking (verified: cost_breakdown.csv for both campaigns).³

At volume, this scales linearly. At 1,000 individual runs: Opus 4.8 at $245, V4-Pro thinking at $4. Same pass rate on this harness.
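The ratio and the volume figures can be checked directly from the two campaign totals. The sketch below reads the 1,000 figure as 1,000 individual runs rather than 1,000 full campaigns; that reading is an interpretation, chosen because it is the one that reproduces $245 and $4. The helper name is made up for this example.

```python
OPUS_CAMPAIGN_USD = 7.34
V4_PRO_CAMPAIGN_USD = 0.12
RUNS_PER_CAMPAIGN = 30

ratio = OPUS_CAMPAIGN_USD / V4_PRO_CAMPAIGN_USD          # about 61.2
reduction = 1 - V4_PRO_CAMPAIGN_USD / OPUS_CAMPAIGN_USD  # about 0.984


def cost_at_volume(campaign_cost_usd, n_runs):
    """Linear extrapolation from one 30-run campaign to `n_runs` runs."""
    return campaign_cost_usd / RUNS_PER_CAMPAIGN * n_runs


print(f"{ratio:.1f}x, {reduction:.1%} cheaper")
print(f"Opus 4.8: ${cost_at_volume(OPUS_CAMPAIGN_USD, 1000):.0f}")
print(f"V4-Pro thinking: ${cost_at_volume(V4_PRO_CAMPAIGN_USD, 1000):.0f}")
```

The extrapolation assumes per-run cost stays constant at scale and that the task mix matches this harness.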

The comparison against the rest of the field is cleaner still. Gemini 3.1 Pro scored 76.7% at $0.85 (verified: pass_rate_by_task.csv, cost_breakdown.csv, 2026-06-15-gemini-3.1-pro-agentic-core-v1). V4-Pro thinking is both more accurate and cheaper. In the 93–97% range, Mistral Small 4 (96.7%, $0.03) and V4-Flash (93.3%, $0.04) are both cheaper than V4-Pro thinking, but neither hit 100%. The $0.08 gap between V4-Flash ($0.04) and V4-Pro thinking ($0.12) is the price of moving to V4-Pro with thinking mode enabled. It bought two additional passes on task_09, closing the last gap.

The result

[Observed]

V4-Pro thinking is the second model in the agentic-core-v1 dataset to score 100%. The first cost $7.34. This one cost $0.12.

The two runs that V4-Flash missed were both task_09 failures: the model detected the unsolvable condition but did not commit an answer. V4-Pro thinking passed task_09 all three times. That is where thinking mode earns its keep on this harness: not on the tasks that mid-tier models already handle cleanly, but on the task where correct behavior requires reasoning about what cannot be computed before deciding what to write.

Zero failure modes across 30 runs. Four minutes and 20 seconds. $0.004 per passing run.

¹ Pass rates and per-task scores from pass_rate_by_task.csv (2026-06-15-deepseek-v4-pro-thinking-agentic-core-v1 campaign data pack).
² Failure mode counts from failure_mode_histogram.csv.
³ Cost figures from cost_breakdown.csv. Opus 4.8 figures from cost_breakdown.csv (2026-06-06-claude-opus-4-8-agentic-core-v1 campaign data pack).

ClawWorks Weekly