30 runs in 7 seconds: DeepSeek R1 and the API boundary

Campaign: 2026-05-17-deepseek-r1-agentic-core-v1
Model: DeepSeek R1 (via AWS Bedrock, us.deepseek.r1-v1:0, us-east-1)
Harness: openclaw@2026.4.22
Runs: 30 (10 tasks × 3 runs each)
Date: 2026-05-17


DeepSeek R1 has real benchmark numbers. AIME, MATH, Codeforces — it scores well on hard reasoning tasks, and those scores come from a model that generates extended chain-of-thought before answering. The question we filed predictions against before running: does that reasoning depth translate to agentic coding performance?

We didn’t find out. Bedrock stopped the campaign before R1 generated a single token.


What agentic-core-v1 actually tests

Ten scenarios, three runs each, 30 total. The scenarios cover: fixing a failing test, refactoring duplicated code, tracing a bug through a codebase, log investigation, targeted fix implementation, handling an ambiguous requirement, multi-step planning, recovering from a tool error, knowing when to stop on an underspecified task, and SQL schema investigation.

A pass means the task checker returns pass — correct file written, correct content. Each run calls the Bedrock Converse API with toolConfig set, dispatches tools based on model responses, and collects results. The whole loop depends on the model being able to call tools.

That last part is where R1 stopped.


Run 1: ValidationException [Observed]

The first Bedrock Converse call returned:

ValidationException: An error occurred (ValidationException) when calling the Converse operation: This model doesn't support tool use.

Run 2 returned the same error. Run 3. All 30 (verified: task_outcomes table, data/intel.db — all 30 run_id entries, failure_mode=infrastructure_error). Each run took under 0.4 seconds. Total wall clock for the full 30-run campaign: 7 seconds. Total cost: $0.00. Total tokens consumed: 0.

The harness classified each run infrastructure_error per SPEC §4.4. There was nothing else to classify.


A different kind of failure than Llama 3.3 70B [Observed]

This campaign produced R1’s second appearance in a row in the failure column, but the failure mode is not the same as Llama 3.3 70B’s campaign.

Llama 3.3 70BDeepSeek R1
Failure pointAfter model respondedBefore model invoked
Error typeTool-call format mismatchBedrock API rejection
Model was reachedYesNo
Tokens consumed~500/run (verified: data/intel.db)0
Fix pathAdapter parse text-format JSONDifferent API path or different model

For Llama, the model ran. It correctly identified the right starting file. The harness couldn’t parse the tool call because the wire format was wrong. That’s a solvable adapter problem. For R1, Bedrock closed the door before R1 processed anything — there was no response to parse.


The direct API fallback is also closed [Observed]

Before running, we had a backup assumption: if Bedrock’s R1 deployment doesn’t work, try the direct DeepSeek API. That path has since been investigated.

As of 2026-05-17T02:21Z, api.deepseek.com/v1/models returns two models: deepseek-v4-flash and deepseek-v4-pro. R1 is not listed. The deepseek-reasoner model ID previously associated with R1 now resolves to deepseek-v4-flash in thinking mode — a live API call returned "model": "deepseek-v4-flash" in the response body. No R1 architecture endpoint exists on the direct API.

R1 is only available on Bedrock. Bedrock’s R1 deployment does not support toolConfig. Both access paths are closed.


Reasoning versus agentic: what the API is telling you [Observed, with architectural interpretation]

Bedrock deploys R1 as a text generation endpoint. You send a prompt; you get reasoning-rich text back. What Bedrock’s R1 deployment does not support is the toolConfig parameter in a Converse call — the API rejects the request before the model is invoked at all. This is not a Bedrock bug; it is an accurate reflection of what R1 is designed to do.

R1 was built for single-pass deep reasoning. A problem goes in. The model thinks hard. An answer comes out. The design is optimised for mathematics, logic, and code reasoning — tasks where you want extended chain-of-thought and a single correct answer, not tasks where you want a model to call a tool, read the result, call another tool, and repeat.

An agentic loop requires: (1) decide what tool to call, (2) format a structured tool call, (3) receive the tool result, (4) reason about that result, (5) repeat. R1’s chain-of-thought handles step 4 well. Steps 1–3 are not supported on Bedrock’s R1 deployment because the infrastructure does not expose tool dispatch.

R1’s AIME and Codeforces numbers are real. Those benchmarks do not require a tool loop. agentic-core-v1 does. The two are not measuring the same capability.


The scoreline [Observed]

0/30 (verified: task_outcomes table, data/intel.db). $0.00 cost. R1 does not place on the leaderboard — an infrastructure block before the first token is not a capability measurement.

The leaderboard after five valid campaigns:

ModelScoreCostTier
Claude Sonnet 4.628/30 (93.3%)$1.44API (Bedrock)
DeepSeek-V4-Flash28/30 (93.3%)$0.04API (direct)
GPT-5.5 Instant27/30 (90.0%)$1.89API (OpenAI)
Gemma 4 31B IT23/30 (76.7%)$0.00Local (EC2)
Llama 3.3 70B (run 3)20/30 (66.7%)$0.09Local (EC2)

The predictions were wrong before they could be scored [Observed — null result]

Predictions were committed before the campaign ran, in PR #34 (predictions/deepseek-r1-agentic-core-v1.md, commit 4e89e0b). All five are UNTESTABLE — no model output to score against.

The one worth noting: P5 predicted that R1 would emit native toolUse blocks and the Bedrock adapter would handle them cleanly. It was the right prediction to watch. What actually happened was that Bedrock never let the adapter see anything. P5 was wrong directionally, and wrong worse than we predicted — not a parsing failure, but a request that never got through.


What we still don’t know [Unobserved]

Two questions this campaign left open:

task_09 and structural impossibility: task_09 is designed around a structurally impossible requirement. We predicted a reasoning model might correctly decline rather than hallucinate a path forward. R1’s chain-of-thought would have been interesting here. We have no data.

Whether a ReAct prompting path would work: Bedrock’s invokeModel endpoint bypasses Converse and allows raw prompt construction. Tool calling could be simulated by injecting text-format schemas into the system prompt and parsing R1’s text output for invocations. That changes the evaluation surface entirely — simulated tools are not native Converse tools, and results would not be comparable to the current leaderboard. We have not tried it, and if we do, it will be reported as a separate test, not a continuation of this campaign.

Whether AWS will add Converse tool support for R1 on Bedrock: no public roadmap item found as of this writing.

ClawWorks Weekly

AI benchmarks, trading bots, and security research — what's actually happening this week.