The Bedrock arbitrage that didn't work out

Campaign: 2026-05-17-deepseek-v3-2-agentic-core-v1
Model: DeepSeek V3.2 (deepseek.v3.2, AWS Bedrock us-east-1)
Harness: openclaw@2026.4.22
Runs: 30 (10 tasks × 3 runs each)
Campaign date: 2026-05-18


After testing DeepSeek V4-Flash at 25/30 on the direct DeepSeek API, the next logical question was whether V3.2, the previous-generation model available on AWS Bedrock, could serve as a cheaper routing alternative. V4-Flash runs on DeepSeek’s own infrastructure at $0.0756/pass. V3.2 on Bedrock is priced differently, and if the quality held, that routing choice would matter for teams building on AWS.

Before the campaign started, Rigg filed a falsification condition:

If V3.2 scores ≤22/30: the V4-Flash architectural improvements — Hybrid Attention, mHC residuals, Muon optimizer, extended 32T-token training — are directly relevant to agentic task quality. The V3→V4 upgrade is not incremental.

V3.2 scored 19/30. The condition fired.


What the harness actually tests

agentic-core-v1 runs 10 tasks × 3 runs each = 30 total. The tasks are practical engineering problems: fix a failing test, refactor duplicated code, investigate an access log for 500 errors, trace through a codebase, write a minimal patch in ≤10 lines, handle an ambiguous requirement, build a multi-step plan, recover from a tool error, recognise an impossible computation and stop, and run a SQL investigation.

A pass means the model completed the task correctly within the stated constraints. A gave_up_mid_plan (model abandoned mid-investigation without producing output) or wrong_answer (model returned an incorrect result) counts as a failure. No partial credit.
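The all-or-nothing rule can be sketched in a few lines of Python. This is an illustration, not the harness's own scoring code: the function name `score_campaign` and the `"pass"` label are assumptions, and only the two failure labels come from the campaign output.

```python
from collections import Counter

# "pass" is an assumed label for a passing run; the two failure labels
# are the ones reported in this campaign.
FAILURE_MODES = {"gave_up_mid_plan", "wrong_answer"}


def score_campaign(outcomes):
    """Return (passes, total, failure_counts) for a list of run outcomes.

    Each run is worth exactly 0 or 1: there is no partial credit.
    """
    passes = sum(1 for outcome in outcomes if outcome == "pass")
    failures = Counter(o for o in outcomes if o in FAILURE_MODES)
    return passes, len(outcomes), failures


# Hypothetical three-run task: two passes and one abandoned run.
passes, total, failures = score_campaign(["pass", "gave_up_mid_plan", "pass"])
print(f"{passes}/{total}", dict(failures))
```

A run that got most of the way there and then abandoned the plan scores the same as one that returned a wrong answer, which is why the fail mode is tracked separately from the score.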


Scores by task

[Observed — verified: pass_rate_by_task.sql]

| Task | Passes | Fail mode |
| --- | --- | --- |
| task_01: fix failing test | 3/3 | |
| task_02: refactor duplicated code | 3/3 | |
| task_03: investigate log | 0/3 | gave_up_mid_plan |
| task_04: trace through codebase | 2/3 | gave_up_mid_plan (1 run) |
| task_05: minimal fix | 0/3 | wrong_answer / gave_up_mid_plan |
| task_06: handle ambiguous requirement | 3/3 | |
| task_07: multi-step plan | 3/3 | |
| task_08: recover from tool error | 3/3 | |
| task_09: know when to stop | 0/3 | wrong_answer |
| task_10: SQL investigation | 2/3 | gave_up_mid_plan (1 run) |

V3.2 is a capable model on structured, bounded tasks. Five task types went 3/3. task_07 (multi-step plan: sequential file writes, 3/3, avg 4.3 tool calls, 10.47s avg latency) aligns with the broader V3-lineage pattern in the dataset: all V3-class models handle explicit sequential planning cleanly. The problem is concentrated in the three task types that failed.


Why task_03 is the whole campaign

[Observed — tool_call_redundancy evidence bundle, cross_task_consistency bundle, verified: tool_call_count_by_task.sql and campaign_cost_breakdown.sql]

task_03 asks the model to investigate a burst of 500 errors in an access log and write the root cause to a file. It is not a trick question. It requires systematic log examination, pattern recognition, and writing one clear finding.

Every other model in the dataset passed task_03: Claude Sonnet 4.6 (3/3), Mistral Large 3 (3/3), GPT-5.5 (3/3), DeepSeek V4-Flash (3/3), Devstral 2 123B (3/3), GPT-OSS 120B (3/3), Llama 3.3 70B (3/3). V3.2 went 0/3.

The tool_call_redundancy evidence bundle documents what happened across all three runs: grep -n "500" access.log | head -20 issued twice in sequence, and wc -l access.log called three separate times within a single run. The model reads and re-reads the same data, fails to converge on a finding, and exits without producing output. The cross_task_consistency bundle tags this as a model-level pattern: gave_up_mid_plan appeared on task_03, task_04, task_05, and task_10.
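The kind of check behind a redundancy finding like this can be sketched as a simple duplicate count over the commands a run issued. This is a minimal illustration, not the evidence bundle's actual logic: `redundant_calls` is a hypothetical helper, and the transcript below is made up apart from the two repeated commands quoted above.

```python
from collections import Counter


def redundant_calls(commands):
    """Return {command: count} for every command issued more than once."""
    counts = Counter(cmd.strip() for cmd in commands)
    return {cmd: n for cmd, n in counts.items() if n > 1}


# Illustrative transcript only: the two repeated commands are the ones quoted
# from the evidence bundle, but this ordering and the `tail` call are made up.
transcript = [
    "wc -l access.log",
    'grep -n "500" access.log | head -20',
    'grep -n "500" access.log | head -20',
    "wc -l access.log",
    "tail -50 access.log",
    "wc -l access.log",
]

for cmd, n in redundant_calls(transcript).items():
    print(f"{n}x  {cmd}")
```

Exact string matching is the crudest possible test. It would miss a loop in which the model varies its arguments slightly on each pass, so a real detector would probably need to compare tool outputs as well.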

The cost signature is the clearest evidence. task_03 consumed $0.1354 across 3 runs (212,577 input tokens) for zero passing results (verified: campaign_cost_breakdown.sql). That is 50% of the entire $0.27 campaign budget, on tasks that all failed.

Compare that to Mistral Large 3: 2 average tool calls per task_03 run, $0.0066/run, 3/3 passes (verified: tool_call_count_by_task.sql). V3.2 averaged 8 tool calls per run at $0.0451/run. That is four times as many tool calls and 6.8 times the cost per run, with no passes.
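The ratios in this comparison follow directly from the quoted figures. A short sketch of the arithmetic, using only numbers stated above (the variable names are ours):

```python
# Figures quoted in the text above.
TASK03_COST_USD = 0.1354      # V3.2, all three task_03 runs
CAMPAIGN_COST_USD = 0.27      # V3.2, all 30 runs
RUNS = 3
V32_TOOL_CALLS = 8            # average per task_03 run
MISTRAL_TOOL_CALLS = 2        # average per task_03 run
MISTRAL_COST_PER_RUN = 0.0066

v32_cost_per_run = TASK03_COST_USD / RUNS
budget_share = TASK03_COST_USD / CAMPAIGN_COST_USD
tool_call_ratio = V32_TOOL_CALLS / MISTRAL_TOOL_CALLS
cost_ratio = v32_cost_per_run / MISTRAL_COST_PER_RUN

print(f"V3.2 cost per task_03 run: ${v32_cost_per_run:.4f}")
print(f"Share of campaign cost:    {budget_share:.0%}")
print(f"Tool-call ratio:           {tool_call_ratio:.0f}x")
print(f"Cost-per-run ratio:        {cost_ratio:.1f}x")
```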

V3.2 does not fail fast on task_03. It burns tokens trying, fails to converge, and exits without a result.


Did the architecture explain the loop?

[Speculation — architectural hypothesis, unverified against internal model behaviour]

The V4 changelog documents Hybrid Attention (combining compressed sliding window attention with periodic full attention layers) and mHC residual connections. Both are designed to improve long-context coherence: maintaining state across extended sequences of tool calls, remembering what was already read, recognising that the last tool call returned the same result as three calls ago.

That description maps directly to the failure pattern in V3.2’s task_03 transcripts. A model with better long-context coherence would notice the redundant grep and stop repeating. V3.2 doesn’t notice.

This is a hypothesis. The transcripts show the symptom (redundant tool calls, no convergence), not the mechanism. The architectural story is plausible, but we did not instrument internal attention patterns and we cannot confirm causation. A targeted campaign running V3.2 on investigation-only tasks at varying context lengths would help isolate whether this is a token-budget issue, an attention issue, or something else.


The cost position

[Observed — verified: campaign_cost_breakdown.sql, leaderboard_by_score.sql]

Predicted: $0.03–$0.15. Actual: $0.27. The entire overrun is explained by task_03.

Strip task_03 out: the remaining 27 runs cost $0.1346 in total, around $0.0050/run, and that total falls within the predicted range. The budget blow-up is a direct consequence of the investigation loop.
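The breakdown can be reproduced from the campaign totals. A minimal sketch, assuming cost per pass is simply total campaign cost divided by passing runs (that assumption reproduces the reported $0.0142):

```python
# Campaign totals reported in this article.
CAMPAIGN_COST_USD = 0.27
TASK03_COST_USD = 0.1354
TOTAL_RUNS = 30
TASK03_RUNS = 3
PASSES = 19
PREDICTED_RANGE_USD = (0.03, 0.15)

remaining_cost = CAMPAIGN_COST_USD - TASK03_COST_USD
remaining_runs = TOTAL_RUNS - TASK03_RUNS
cost_per_remaining_run = remaining_cost / remaining_runs

# Assumption: cost per pass = total campaign cost / number of passing runs.
cost_per_pass = CAMPAIGN_COST_USD / PASSES

low, high = PREDICTED_RANGE_USD
within_prediction = low <= remaining_cost <= high

print(f"Cost without task_03:  ${remaining_cost:.4f} over {remaining_runs} runs")
print(f"Per remaining run:     ${cost_per_remaining_run:.4f}")
print(f"Cost per pass:         ${cost_per_pass:.4f}")
print(f"Within predicted range without task_03: {within_prediction}")
```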

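At $0.0142/pass, V3.2’s cost efficiency is worse than four of the five other Bedrock models in the dataset: Devstral 2 123B ($0.0019), Mistral Large 3 ($0.0022), GPT-OSS 120B ($0.0013), and Llama 3.3 70B ($0.0045). Only Claude Sonnet 4.6 ($0.0514) costs more per pass, and it scores 28/30.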

The Bedrock cost-arbitrage story collapses. V3.2 on Bedrock is not a cheaper path to V4-Flash quality. It’s a qualitatively different model, and its cost-per-pass is worse than most of the other Bedrock options we have tested.


Predictions scoring: 3/6

[Observed — verified: predictions_scoring.md in evidence index]

| Prediction | Result |
| --- | --- |
| P1: 23–26/30 (point estimate 25/30) | ❌ Actual 19/30; missed range by 4 points |
| P2: task_09 0/3 wrong_answer | ✅ 0/3, wrong_answer confirmed |
| P3: task_07 2–3/3 | ✅ 3/3; V3 planning coherence holds |
| P4: total cost $0.03–$0.15 | ❌ Actual $0.27; task_03 loop consumed 50% of budget |
| P5: no INFRASTRUCTURE_ERROR | ✅ 30/30 runs completed cleanly |
| P6: V3.2 ≥ GPT-OSS 120B (≥24/30) | ❌ 19/30; below GPT-OSS 120B (23/30) and Llama 3.3 70B (20/30) |

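The three misses are structurally related. P1 (quality shortfall), P4 (cost overrun), and P6 (quality below a 120B model) all trace, at least in part, to the same source: task_03’s investigation loop. It explains the entire cost overrun and three of the lost points, although even a 3/3 on task_03 would have left V3.2 at 22/30, one point short of the P1 range. A model that goes 0/3 on the one task that every other model passes is going to underperform on score, cost, and relative ranking simultaneously.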
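The 3/6 tally can be re-derived mechanically from the reported actuals. The sketch below encodes each prediction as a check. The thresholds are taken from the predictions table, while the dictionary keys and structure are our own illustration rather than the format of predictions_scoring.md.

```python
# Actuals reported for this campaign.
actual = {
    "score": 19,
    "task_09_passes": 0,
    "task_09_fail_mode": "wrong_answer",
    "task_07_passes": 3,
    "total_cost_usd": 0.27,
    "infrastructure_errors": 0,
}

# One boolean check per prediction, using the thresholds as filed.
predictions = {
    "P1": lambda a: 23 <= a["score"] <= 26,
    "P2": lambda a: a["task_09_passes"] == 0
    and a["task_09_fail_mode"] == "wrong_answer",
    "P3": lambda a: 2 <= a["task_07_passes"] <= 3,
    "P4": lambda a: 0.03 <= a["total_cost_usd"] <= 0.15,
    "P5": lambda a: a["infrastructure_errors"] == 0,
    "P6": lambda a: a["score"] >= 24,
}

results = {name: check(actual) for name, check in predictions.items()}
hits = sum(results.values())

for name, ok in results.items():
    print(name, "hit" if ok else "miss")
print(f"{hits}/{len(results)} predictions correct")
```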

This is the largest prediction error in the series to date. The hypothesis going in was that V3.2 would approximate V4-Flash quality because of the shared lineage. It was wrong.


Where the leaderboard stands

[Observed — verified: leaderboard_by_score.sql]

| Model | Score | Pass rate | Cost/pass | Infra |
| --- | --- | --- | --- | --- |
| Claude Sonnet 4.6 | 28/30 | 93.3% | $0.0514 | Bedrock eu-west-1 |
| Devstral 2 123B | 27/30 | 90.0% | $0.0019 | Bedrock us-east-1 |
| Mistral Large 3 | 27/30 | 90.0% | $0.0022 | Bedrock us-east-1 |
| GPT-5.5 | 27/30 | 90.0% | $0.0700 | OpenAI API |
| DeepSeek V4-Flash | 25/30 | 83.3% | $0.0756 | DeepSeek API |
| GPT-OSS 120B | 23/30 | 76.7% | $0.0013 | Bedrock us-east-1 |
| Llama 3.3 70B | 20/30 | 66.7% | $0.0045 | Bedrock eu-west-1 |
| DeepSeek V3.2 | 19/30 | 63.3% | $0.0142 | Bedrock us-east-1 |
| DeepSeek R1 | DNF | | | Bedrock (toolConfig unsupported) |

V3.2 enters last among the models that completed the campaign, ahead only of DeepSeek R1 (DNF) and below a model with 10× fewer activated parameters. At this quality and cost position, there is no Bedrock routing scenario where V3.2 is the right choice over other available options.
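One way to make the routing claim concrete is a dominance check: list every Bedrock model that scores at least as high as V3.2 while costing less per pass. This is a sketch over the leaderboard figures only. The helper name `dominators` is ours, and the check ignores latency, region, and any other factor a real routing decision would weigh.

```python
# (model, passes out of 30, cost per pass in USD, infra) from the leaderboard.
# DeepSeek R1 is omitted because it did not finish (DNF).
LEADERBOARD = [
    ("Claude Sonnet 4.6", 28, 0.0514, "Bedrock eu-west-1"),
    ("Devstral 2 123B", 27, 0.0019, "Bedrock us-east-1"),
    ("Mistral Large 3", 27, 0.0022, "Bedrock us-east-1"),
    ("GPT-5.5", 27, 0.0700, "OpenAI API"),
    ("DeepSeek V4-Flash", 25, 0.0756, "DeepSeek API"),
    ("GPT-OSS 120B", 23, 0.0013, "Bedrock us-east-1"),
    ("Llama 3.3 70B", 20, 0.0045, "Bedrock eu-west-1"),
    ("DeepSeek V3.2", 19, 0.0142, "Bedrock us-east-1"),
]


def dominators(target, rows):
    """Bedrock models that score at least as high and cost less per pass."""
    _, t_score, t_cost, _ = next(row for row in rows if row[0] == target)
    return [
        name
        for name, score, cost, infra in rows
        if name != target
        and infra.startswith("Bedrock")
        and score >= t_score
        and cost < t_cost
    ]


for name in dominators("DeepSeek V3.2", LEADERBOARD):
    print(name)
```

On these two columns, four Bedrock models beat V3.2 on both score and cost per pass. Claude Sonnet 4.6 is the only Bedrock model that does not, because it costs more per pass.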


What we don’t know yet

[Unobserved]

We did not run V3.2 on the direct DeepSeek API. If there is any latency or batching behaviour on Bedrock routing that affects long-context coherence differently from the direct API path, this campaign would not detect it. The Bedrock ON_DEMAND routing was clean (30/30 runs reached a result, no INFRASTRUCTURE_ERROR), so the failure is not an infrastructure error. A subtler routing effect cannot be ruled out, though, because a direct-API comparison was not part of this campaign.

We also do not know the full task_03 failure envelope. The evidence bundles cover task_03 across three runs, all showing the same loop pattern. Whether that pattern generalises to any sufficiently open-ended log analysis task, or is specific to the access log format and tool set in agentic-core-v1, is untested. A campaign targeting investigation-only tasks at different context depths would narrow it down.


Harness: agentic-core-v1. Run date: 2026-05-18. Campaign author: Rigg. Article: Jenn.

ClawWorks Weekly
