Claude Opus 4.8 vs DeepSeek V4 Pro Thinking

Both score 30/30 on agentic-core-v1. One campaign cost $7.34; the other cost $0.12.

Claude Opus 4.8: 30/30, 100.0% pass rate, $0.245/pass
vs
DeepSeek V4 Pro Thinking: 30/30, 100.0% pass rate, $0.00400/pass (lower cost)

Head-to-head: agentic-core-v1

Harness: agentic-core-v1 — 10 tasks × 3 runs = 30 total. Binary pass/fail per run. Harness version: openclaw@2026.4.22. Full methodology: what we measure and why.

| Metric | Claude Opus 4.8 | DeepSeek V4 Pro Thinking |
| --- | --- | --- |
| Runs passed / total | 30 / 30 | 30 / 30 |
| Pass rate | 100.0% | 100.0% |
| Campaign cost (30 runs) | $7.34 | $0.12 |
| Cost per run | $0.245 | $0.00400 |
| Cost per passing run | $0.245 | $0.00400 |
| Provider | Anthropic | DeepSeek |
| Campaign date | 2026-06-06 | 2026-06-16 |
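
The per-run figures in the table appear to follow from dividing each campaign cost by the run count. The sketch below reproduces that arithmetic in Python; the function name `cost_per_passing_run` is hypothetical and is not part of the harness, and the division-by-passed-runs definition is an assumption inferred from the numbers above.

```python
from decimal import Decimal


def cost_per_passing_run(campaign_cost: str, passed: int) -> Decimal:
    """Campaign cost divided by the number of passing runs.

    Hypothetical helper: assumes the metric is total spend / passes.
    """
    if passed <= 0:
        raise ValueError("cost per passing run is undefined with zero passes")
    return Decimal(campaign_cost) / passed


# Campaign totals and pass counts taken from the table above.
claude = cost_per_passing_run("7.34", 30)
deepseek = cost_per_passing_run("0.12", 30)

print(f"Claude Opus 4.8:          ${claude:.3f}")    # $0.245
print(f"DeepSeek V4 Pro Thinking: ${deepseek:.5f}")  # $0.00400
```

Because both models passed all 30 runs, cost per run and cost per passing run coincide here; they would diverge for any model with failures.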

Claude Opus 4.8

Strength

A perfect 30/30, with no failures on any task in any run (a result DeepSeek V4 Pro Thinking matches)

Weakness

Most expensive per passing run in the dataset at $0.245; task_07 still averaged 36 tool calls


DeepSeek V4 Pro Thinking

Strength

30/30 at $0.004/run — 61× cheaper per passing run than Claude Opus 4.8

Weakness

Thinking-token overhead adds latency, and reasoning traces incur cost even on simple tasks

Cost ratio: DeepSeek V4 Pro Thinking costs 61.3× less per passing run ($0.00400 vs $0.245). Over 10,000 task runs, that gap comes to $2,410.
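
The ratio and the 10,000-run gap can be checked with a few lines of Python. This is a minimal sketch that starts from the rounded per-pass prices quoted above; `cost_ratio` and `projected_gap` are hypothetical names introduced for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

# Rounded per-pass prices as quoted on this page.
CLAUDE_PER_PASS = Decimal("0.245")
DEEPSEEK_PER_PASS = Decimal("0.00400")


def cost_ratio(expensive: Decimal, cheap: Decimal) -> Decimal:
    """How many times more the expensive model costs per passing run."""
    return expensive / cheap


def projected_gap(expensive: Decimal, cheap: Decimal, runs: int) -> Decimal:
    """Absolute cost difference in dollars over a given number of runs."""
    return (expensive - cheap) * runs


ratio = cost_ratio(CLAUDE_PER_PASS, DEEPSEEK_PER_PASS)
gap = projected_gap(CLAUDE_PER_PASS, DEEPSEEK_PER_PASS, 10_000)

print(ratio.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP))  # 61.3
print(f"${gap:,.0f}")                                          # $2,410
```

The exact quotient of the rounded prices is 61.25, which rounds half-up to 61.3. Dividing the campaign totals instead ($7.34 / $0.12) gives roughly 61.2, so the last digit depends on which figures are used as inputs.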

Verdict

Claude Opus 4.8 and DeepSeek V4 Pro Thinking match on pass rate (30/30). DeepSeek V4 Pro Thinking costs 61× less per passing run. For cost-sensitive production use, DeepSeek V4 Pro Thinking is the clear choice. Claude Opus 4.8 makes sense where provider ecosystem factors outweigh per-call economics.

About agentic-core-v1

agentic-core-v1 is modelbattles' flagship benchmark harness. Ten tasks drawn from real engineering work: fix a failing test, refactor duplicated code, investigate a production log, add a null guard, trace through a codebase. Each task runs three times per model. A run passes if and only if the checker accepts the output — no partial credit, no manual grading.
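
The scoring rule described above (ten tasks, three runs each, binary verdicts, no partial credit) can be sketched as follows. This is an illustration only, not the harness's actual code: `RunResult` and `summarize` are hypothetical names, and the all-pass result set is constructed to mirror the 30/30 campaigns reported on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    task_id: str    # e.g. "task_07"
    run_index: int  # 0, 1, or 2
    passed: bool    # checker verdict; there is no partial credit


def summarize(results: list[RunResult]) -> tuple[int, int, float]:
    """Return (passed, total, pass rate in percent) for a campaign."""
    total = len(results)
    if total == 0:
        raise ValueError("no runs to summarize")
    passed = sum(1 for r in results if r.passed)
    return passed, total, 100.0 * passed / total


# Assumed example data: 10 tasks x 3 runs, every run accepted by the checker.
results = [
    RunResult(task_id=f"task_{task:02d}", run_index=run, passed=True)
    for task in range(1, 11)
    for run in range(3)
]

passed, total, rate = summarize(results)
print(f"{passed}/{total} ({rate:.1f}% pass rate)")  # 30/30 (100.0% pass rate)
```

Counting runs rather than tasks means a model that fails one of three attempts at a single task scores 29/30, not 9/10.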

The harness is deterministic: same task, same environment, same checker across all models. Scores are comparable. task_09 is the persistent difficulty point — it requires the model to recognise a structurally impossible calculation and refuse to produce a wrong answer instead of looping. Most models fail it at least once.

Read: What we actually measure and why · How to read an agentic-core-v1 score
