DeepSeek V4 Flash vs Claude Sonnet 4.6

Same agentic-core-v1 score, 36× cost difference — what does Sonnet actually buy you?

DeepSeek V4 Flash: 28/30 (93.3% pass rate), $0.00143/pass (lower cost)
vs
Claude Sonnet 4.6: 28/30 (93.3% pass rate), $0.0514/pass

Head-to-head: agentic-core-v1

Harness: agentic-core-v1 — 10 tasks × 3 runs = 30 total. Binary pass/fail per run. Harness version: openclaw@2026.4.22. Full methodology: what we measure and why.

| Metric | DeepSeek V4 Flash | Claude Sonnet 4.6 |
| --- | --- | --- |
| Runs passed / total | 28 / 30 | 28 / 30 |
| Pass rate | 93.3% | 93.3% |
| Campaign cost (30 runs) | $0.04 | $1.44 |
| Cost per run | $0.00133 | $0.0480 |
| Cost per passing run | $0.00143 | $0.0514 |
| Provider | DeepSeek | Anthropic |
| Campaign date | 2026-05-09 | 2026-05-04 |
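
The per-run and per-pass figures in the table follow directly from the campaign totals. A minimal sketch of that arithmetic, assuming the published campaign costs ($0.04 and $1.44) are the only inputs; the `Campaign` class and its field names are illustrative and not part of the harness:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Campaign:
    """Illustrative container for one model's campaign totals (not harness code)."""
    model: str
    passed: int      # runs that passed the checker
    total: int       # total runs in the campaign
    cost_usd: float  # campaign cost as published, rounded to cents

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total

    @property
    def cost_per_run(self) -> float:
        return self.cost_usd / self.total

    @property
    def cost_per_pass(self) -> float:
        # Failed runs still cost money, so the spend is divided by passes only.
        return self.cost_usd / self.passed


campaigns = [
    Campaign("DeepSeek V4 Flash", passed=28, total=30, cost_usd=0.04),
    Campaign("Claude Sonnet 4.6", passed=28, total=30, cost_usd=1.44),
]

for c in campaigns:
    print(
        f"{c.model}: {c.pass_rate:.1%} pass rate, "
        f"${c.cost_per_run:.3g}/run, ${c.cost_per_pass:.3g}/pass"
    )
```

Cost per passing run is always at least cost per run, because the two failed runs are paid for but not counted in the denominator.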

DeepSeek V4 Flash

Strength

Ties Claude Sonnet 4.6 on pass rate at 36× lower cost; campaign completed in ~4 minutes

Weakness

Same task_09 failure pattern as Sonnet 4.6; less predictable latency on log-heavy tasks

Full campaign report →

Claude Sonnet 4.6

Strength

Consistent 9/10 task categories at 3/3; task_09 single-run pass via error-forced fallback

Weakness

$0.051/pass — 16× more expensive than Claude Haiku 4.5 for a 1-point score advantage

Full campaign report →
Cost ratio: DeepSeek V4 Flash costs 35.9× less per passing run ($0.00143 vs $0.0514). At 10,000 tasks, each priced at the per-pass rate, that gap is roughly $500.
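
The ratio and the 10,000-task gap can be checked with a few lines of arithmetic. This is a sketch under one assumption: every task is billed at the measured cost per passing run, which ignores retries and any change in task mix. The function and constant names are illustrative.

```python
# Cost per passing run, as reported in the comparison above.
DEEPSEEK_COST_PER_PASS = 0.00143
SONNET_COST_PER_PASS = 0.0514


def cost_ratio(expensive: float, cheap: float) -> float:
    """How many times more the expensive model costs per passing run."""
    return expensive / cheap


def projected_gap(expensive: float, cheap: float, tasks: int) -> float:
    """Extra spend in USD over `tasks` tasks, assuming a flat per-pass price."""
    return (expensive - cheap) * tasks


ratio = cost_ratio(SONNET_COST_PER_PASS, DEEPSEEK_COST_PER_PASS)
gap = projected_gap(SONNET_COST_PER_PASS, DEEPSEEK_COST_PER_PASS, tasks=10_000)

print(f"Cost ratio: {ratio:.1f}x")
print(f"Gap at 10,000 tasks: ${gap:,.0f}")
```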

Verdict

DeepSeek V4 Flash and Claude Sonnet 4.6 match on pass rate (28/30). DeepSeek V4 Flash costs 36× less per passing run. For cost-sensitive production use, DeepSeek V4 Flash is the clear choice. Claude Sonnet 4.6 makes sense where Anthropic ecosystem integration matters more than per-call price.

About agentic-core-v1

agentic-core-v1 is modelbattles' flagship benchmark harness. Ten tasks drawn from real engineering work: fix a failing test, refactor duplicated code, investigate a production log, add a null guard, trace through a codebase. Each task runs three times per model. A run passes if and only if the checker accepts the output — no partial credit, no manual grading.

The harness is deterministic: same task, same environment, same checker across all models. Scores are comparable. task_09 is the persistent difficulty point — it requires the model to recognise a structurally impossible calculation and refuse to produce a wrong answer instead of looping. Most models fail it at least once.
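
To make the scoring rule concrete, here is a minimal sketch of how binary per-run results roll up into a campaign score. It is not the harness's own code: the `score_campaign` function, the task ids other than task_09, and the choice of which task_09 run passed are assumptions for illustration. The shape of the data (nine tasks at 3/3, task_09 at 1/3) mirrors the Claude Sonnet 4.6 result described above.

```python
def score_campaign(results: dict[str, list[bool]]) -> tuple[int, int]:
    """Return (runs passed, total runs). Each run is a strict pass/fail."""
    passed = sum(sum(runs) for runs in results.values())
    total = sum(len(runs) for runs in results.values())
    return passed, total


# Hypothetical task ids task_01..task_10, three runs each, all passing.
results = {f"task_{i:02d}": [True, True, True] for i in range(1, 11)}

# task_09 passes only one of three runs; which run passed is illustrative.
results["task_09"] = [False, True, False]

passed, total = score_campaign(results)
print(f"{passed}/{total} = {passed / total:.1%}")
```

Because there is no partial credit, a single hard task can cap the score: two failed runs on one task are enough to move a model from 30/30 to 28/30.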

Read: What we actually measure and why · How to read an agentic-core-v1 score
