DeepSeek V4 Flash vs Claude Sonnet 4.6
Same agentic-core-v1 score, 36× cost difference — what does Sonnet actually buy you?
Head-to-head: agentic-core-v1
Harness: agentic-core-v1 — 10 tasks × 3 runs = 30 total. Binary pass/fail per run. Harness version: openclaw@2026.4.22. Full methodology: what we measure and why.
| Metric | DeepSeek V4 Flash | Claude Sonnet 4.6 |
|---|---|---|
| Runs passed / total | 28 / 30 | 28 / 30 |
| Pass rate | 93.3% | 93.3% |
| Campaign cost (30 runs) | $0.04 | $1.44 |
| Cost per run | $0.00133 | $0.0480 |
| Cost per passing run | $0.00143 | $0.0514 |
| Provider | DeepSeek | Anthropic |
| Campaign date | 2026-05-09 | 2026-05-04 |
DeepSeek V4 Flash
Ties Claude Sonnet 4.6 on pass rate at 36× lower cost; campaign completed in ~4 minutes
Same task_09 failure pattern as Sonnet 4.6; less predictable latency on log-heavy tasks
Claude Sonnet 4.6
Consistent 9/10 task categories at 3/3; task_09 single-run pass via error-forced fallback
$0.051/pass — 16× more expensive than Claude Haiku 4.5 for a 1-point score advantage
Verdict
DeepSeek V4 Flash and Claude Sonnet 4.6 match on pass rate (28/30). DeepSeek V4 Flash costs 36× less per passing run. For cost-sensitive production use, DeepSeek V4 Flash is the clear choice. Claude Sonnet 4.6 makes sense where Anthropic/OpenAI ecosystem integration matters more than per-call price.
About agentic-core-v1
agentic-core-v1 is modelbattles' flagship benchmark harness. Ten tasks drawn from real engineering work: fix a failing test, refactor duplicated code, investigate a production log, add a null guard, trace through a codebase. Each task runs three times per model. A run passes if and only if the checker accepts the output — no partial credit, no manual grading.
The harness is deterministic: same task, same environment, same checker across all models. Scores are comparable. task_09 is the persistent difficulty point — it requires the model to recognise a structurally impossible calculation and refuse to produce a wrong answer instead of looping. Most models fail it at least once.
Read: What we actually measure and why · How to read an agentic-core-v1 score
More comparisons
- Claude Sonnet 4.6 vs Claude Haiku 4.5 — A 1-point score gap and a 16× cost gap — is Sonnet worth it?
- Mistral Small 4 vs Claude Haiku 4.5 — The two most cost-efficient models in the top tier — one passes more, the other costs less.
- GPT-5.5 Instant vs Claude Sonnet 4.6 — Both mid-tier flagships at 90%+ — OpenAI vs Anthropic on real agentic tasks.
- Claude Opus 4.8 vs DeepSeek V4 Pro Thinking — Both 30/30 on agentic-core-v1. One costs $7.34. The other costs $0.12.
- Devstral 2 123B vs Mistral Small 4 — Mistral's code specialist vs its latest small model — which is better for agentic workloads?