The Bedrock arbitrage that didn't work out

May 18, 2026 · campaign-reports

Campaign: 2026-05-17-deepseek-v3-2-agentic-core-v1
Model: DeepSeek V3.2 (deepseek.v3.2, AWS Bedrock us-east-1)
Harness: openclaw@2026.4.22
Runs: 30 (10 tasks × 3 runs each)
Campaign date: 2026-05-18

After DeepSeek V4-Flash scored 25/30 on the direct DeepSeek API, the next question was whether V3.2, the previous-generation model available on AWS Bedrock, could serve as a cheaper routing alternative. V4-Flash runs on DeepSeek’s own infrastructure at $0.0756/pass. V3.2 on Bedrock is priced differently, and if the quality held, that routing choice would matter for teams building on AWS.

Before the campaign started, Rigg filed a falsification condition:

If V3.2 scores ≤22/30: the V4-Flash architectural improvements — Hybrid Attention, mHC residuals, Muon optimizer, extended 32T-token training — are directly relevant to agentic task quality. The V3→V4 upgrade is not incremental.

V3.2 scored 19/30. The condition fired.

What the harness actually tests

agentic-core-v1 runs 10 tasks × 3 runs each = 30 total. The tasks are practical engineering problems: fix a failing test, refactor duplicated code, investigate an access log for 500 errors, trace through a codebase, write a minimal patch in ≤10 lines, handle an ambiguous requirement, build a multi-step plan, recover from a tool error, recognise an impossible computation and stop, and run a SQL investigation.

A pass means the model completed the task correctly within the stated constraints. A gave_up_mid_plan (model abandoned mid-investigation without producing output) or wrong_answer (model returned an incorrect result) counts as a failure. No partial credit.

Scores by task

[Observed — verified: pass_rate_by_task.sql]

Task	Passes	Fail mode
task_01 — fix failing test	3/3	—
task_02 — refactor duplicated code	3/3	—
task_03 — investigate log	0/3	`gave_up_mid_plan`
task_04 — trace through codebase	2/3	`gave_up_mid_plan` (1 run)
task_05 — minimal fix	0/3	`wrong_answer` / `gave_up_mid_plan`
task_06 — handle ambiguous requirement	3/3	—
task_07 — multi-step plan	3/3	—
task_08 — recover from tool error	3/3	—
task_09 — know when to stop	0/3	`wrong_answer`
task_10 — SQL investigation	2/3	`gave_up_mid_plan` (1 run)

V3.2 is a capable model on structured, bounded tasks. Five task types went 3/3. task_07 (multi-step plan: sequential file writes, 3/3, avg 4.3 tool calls, 10.47s avg latency) aligns with the broader V3-lineage pattern in the dataset: all V3-class models handle explicit sequential planning cleanly. The problem is concentrated in the three task types that went 0/3 (task_03, task_05, and task_09), which account for 9 of the 11 failed runs.
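The 19/30 headline can be reproduced from the table above. The sketch below is illustrative only: the pass counts are hand-copied from the table, and `PASSES_BY_TASK` and `summarize` are hypothetical names, not part of the harness or of pass_rate_by_task.sql.

```python
# Per-task pass counts, hand-copied from the "Scores by task" table.
# PASSES_BY_TASK and summarize() are illustrative names, not harness code.
RUNS_PER_TASK = 3

PASSES_BY_TASK = {
    "task_01": 3, "task_02": 3, "task_03": 0, "task_04": 2, "task_05": 0,
    "task_06": 3, "task_07": 3, "task_08": 3, "task_09": 0, "task_10": 2,
}


def summarize(passes_by_task, runs_per_task=RUNS_PER_TASK):
    """Return (total passes, total runs, 3/3 tasks, 0/3 tasks)."""
    total_passes = sum(passes_by_task.values())
    total_runs = runs_per_task * len(passes_by_task)
    clean = sorted(t for t, p in passes_by_task.items() if p == runs_per_task)
    zero = sorted(t for t, p in passes_by_task.items() if p == 0)
    return total_passes, total_runs, clean, zero


total_passes, total_runs, clean, zero = summarize(PASSES_BY_TASK)
print(f"{total_passes}/{total_runs} = {total_passes / total_runs:.1%}")
print("3/3 tasks:", clean)
print("0/3 tasks:", zero)
```

Because there is no partial credit, the score is a plain sum of passing runs. The three 0/3 tasks alone remove 9 of the 30 available points.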

Why task_03 is the whole campaign

[Observed — tool_call_redundancy evidence bundle, cross_task_consistency bundle, verified: tool_call_count_by_task.sql and campaign_cost_breakdown.sql]

task_03 asks the model to investigate a burst of 500 errors in an access log and write the root cause to a file. It is not a trick question. It requires systematic log examination, pattern recognition, and writing one clear finding.

Every other model that completed the harness passed task_03: Claude Sonnet 4.6 (3/3), Mistral Large 3 (3/3), GPT-5.5 (3/3), DeepSeek V4-Flash (3/3), Devstral 2 123B (3/3), GPT-OSS 120B (3/3), Llama 3.3 70B (3/3). V3.2 went 0/3.

The tool_call_redundancy evidence bundle documents the same behaviour in all three runs: `grep -n "500" access.log | head -20` issued twice in sequence, and `wc -l access.log` called three separate times within a single run. The model reads and re-reads the same data, fails to converge on a finding, and exits without producing output. The cross_task_consistency bundle tags this as a model-level pattern: in the scores table, gave_up_mid_plan appears on task_03, task_04, task_05, and task_10.
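The redundancy pattern is simple to flag mechanically. The following is a minimal sketch, assuming a run transcript can be flattened to an ordered list of shell command strings. That format, the function names, and the `cat` filler call are assumptions for illustration; only the two repeated commands come from the evidence bundle, and this is not how the bundle itself was produced.

```python
from collections import Counter

# Hypothetical flattened transcript for one run. Only the grep and wc
# commands are taken from the report; the ordering and the "cat" call
# are invented for illustration.
example_run = [
    "wc -l access.log",
    'grep -n "500" access.log | head -20',
    'grep -n "500" access.log | head -20',
    "wc -l access.log",
    "cat access.log",
    "wc -l access.log",
]


def redundant_calls(commands):
    """Map each command issued more than once to its call count."""
    counts = Counter(commands)
    return {cmd: n for cmd, n in counts.items() if n > 1}


def back_to_back_repeats(commands):
    """Commands that were re-issued immediately after themselves."""
    return [cur for prev, cur in zip(commands, commands[1:]) if prev == cur]


print(redundant_calls(example_run))
print(back_to_back_repeats(example_run))
```

Exact string matching is deliberately strict here: it catches the literal repeats described above, but would miss near-duplicates such as the same grep with a different `head` limit.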

The cost signature is the clearest evidence. task_03 consumed $0.1354 across 3 runs (212,577 input tokens) for zero passing results (verified: campaign_cost_breakdown.sql). That is 50% of the entire $0.27 campaign spend, on three runs that all failed.

Compare that to Mistral Large 3: 2 average tool calls per task_03 run, $0.0066/run, 3/3 passes (verified: tool_call_count_by_task.sql). V3.2 averaged 8 tool calls per run at $0.0451/run. Four times as many tool calls, 6.8 times the cost per run, no passes.
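The ratios in that comparison follow directly from the quoted figures. A short sketch of the arithmetic, with the constants hand-copied from this report and illustrative variable names (the underlying SQL is not reproduced here):

```python
# Figures quoted in this report; variable names are illustrative.
V32_TASK03_COST = 0.1354       # USD, all three task_03 runs
V32_TASK03_RUNS = 3
V32_TASK03_INPUT_TOKENS = 212_577
V32_AVG_TOOL_CALLS = 8
MISTRAL_COST_PER_RUN = 0.0066  # USD
MISTRAL_AVG_TOOL_CALLS = 2

v32_cost_per_run = V32_TASK03_COST / V32_TASK03_RUNS
v32_input_tokens_per_run = V32_TASK03_INPUT_TOKENS / V32_TASK03_RUNS
cost_ratio = v32_cost_per_run / MISTRAL_COST_PER_RUN
tool_call_ratio = V32_AVG_TOOL_CALLS / MISTRAL_AVG_TOOL_CALLS

print(f"V3.2 cost per task_03 run: ${v32_cost_per_run:.4f}")
print(f"V3.2 input tokens per task_03 run: {v32_input_tokens_per_run:,.0f}")
print(f"Cost ratio vs Mistral Large 3: {cost_ratio:.1f}x")
print(f"Tool-call ratio vs Mistral Large 3: {tool_call_ratio:.0f}x")
```

Cost grows faster than tool-call count (6.8× against 4×), which is what one would expect if each repeated call re-sends a growing context. The campaign did not measure that directly.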

V3.2 does not fail fast on task_03. It burns tokens trying, fails to converge, and exits without a result.

Does the architecture explain the loop?

[Speculation — architectural hypothesis, unverified against internal model behaviour]

The V4 changelog documents Hybrid Attention (combining compressed sliding window attention with periodic full attention layers) and mHC residual connections. Both are designed to improve long-context coherence: maintaining state across extended sequences of tool calls, remembering what was already read, recognising that the last tool call returned the same result as three calls ago.

That description maps directly to the failure pattern in V3.2’s task_03 transcripts. A model with better long-context coherence would notice the redundant grep and stop repeating. V3.2 doesn’t notice.

This is a hypothesis. The transcripts show the symptom (redundant tool calls, no convergence), not the mechanism. The architectural story is plausible, but we did not instrument internal attention patterns and we cannot confirm causation. A targeted campaign running V3.2 on investigation-only tasks at varying context lengths would help isolate whether this is a token-budget issue, an attention issue, or something else.

The cost position

[Observed — verified: campaign_cost_breakdown.sql, leaderboard_by_score.sql]

Predicted: $0.03–$0.15. Actual: $0.27. The entire overrun is explained by task_03.

Strip task_03 out: the remaining 27 runs cost $0.1346 (around $0.0050/run), which falls inside the predicted $0.03–$0.15 range. The budget blow-up is a direct consequence of the investigation loop.
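The counterfactual is a two-line subtraction. The sketch below uses the rounded dollar figures quoted in this report, so the results are only as precise as those inputs; the constant names are illustrative.

```python
# Rounded figures from this report; names are illustrative.
CAMPAIGN_COST = 0.27            # USD, all 30 runs
TASK03_COST = 0.1354            # USD, 3 runs
TOTAL_RUNS = 30
TASK03_RUNS = 3
PREDICTED_RANGE = (0.03, 0.15)  # USD, prediction P4


def in_range(value, bounds):
    """True if value lies inside the inclusive (low, high) bounds."""
    low, high = bounds
    return low <= value <= high


remaining_cost = CAMPAIGN_COST - TASK03_COST
remaining_runs = TOTAL_RUNS - TASK03_RUNS
task03_share = TASK03_COST / CAMPAIGN_COST

print(f"task_03 share of spend: {task03_share:.0%}")
print(f"Remaining {remaining_runs} runs: ${remaining_cost:.4f} "
      f"(${remaining_cost / remaining_runs:.4f}/run)")
print("Full campaign within prediction:", in_range(CAMPAIGN_COST, PREDICTED_RANGE))
print("Without task_03 within prediction:", in_range(remaining_cost, PREDICTED_RANGE))
```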

At $0.0142/pass, V3.2’s cost efficiency is worse than that of every comparable model in the dataset:

Llama 3.3 70B: 20/30 at $0.0045/pass (higher score, 3× cheaper per pass, roughly 10× fewer total parameters)
GPT-OSS 120B: 23/30 at $0.0013/pass (significantly higher quality, 11× cheaper per pass)
Mistral Large 3: 27/30 at $0.0022/pass (much higher quality, 6× cheaper per pass)
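The multipliers in that list are cost-per-pass ratios rounded to the nearest whole number. A minimal sketch of the calculation, with figures hand-copied from this report and illustrative names:

```python
# Figures hand-copied from this report; names are illustrative.
V32_TOTAL_COST = 0.27  # USD, rounded campaign total
V32_PASSES = 19

COST_PER_PASS = {      # USD per passing run
    "Llama 3.3 70B": 0.0045,
    "GPT-OSS 120B": 0.0013,
    "Mistral Large 3": 0.0022,
}

v32_cost_per_pass = V32_TOTAL_COST / V32_PASSES
ratios = {
    model: v32_cost_per_pass / cost for model, cost in COST_PER_PASS.items()
}

print(f"V3.2: ${v32_cost_per_pass:.4f}/pass")
for model, ratio in ratios.items():
    print(f"{model}: V3.2 costs {ratio:.1f}x more per pass")
```

Cost per pass divides total spend, failed runs included, by the number of passes, so the $0.1354 spent on task_03 is charged against V3.2's 19 successful runs.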

The Bedrock cost-arbitrage story collapses. V3.2 on Bedrock is not a cheaper path to V4-Flash quality. It’s a qualitatively different model, and its cost-per-pass is worse than most of the other Bedrock options we have tested.

Predictions scoring: 3/6

[Observed — verified: predictions_scoring.md in evidence index]

Prediction	Result
P1: 23–26/30 (point estimate 25/30)	❌ Actual 19/30 — missed range by 4 points
P2: task_09 0/3 `wrong_answer`	✅ 0/3, wrong_answer confirmed
P3: task_07 2–3/3	✅ 3/3 — V3 planning coherence holds
P4: total cost $0.03–$0.15	❌ Actual $0.27 — task_03 loop consumed 50% of budget
P5: no INFRASTRUCTURE_ERROR	✅ 30/30 runs completed cleanly
P6: V3.2 ≥ GPT-OSS 120B (≥24/30)	❌ 19/30 — below GPT-OSS 120B (23/30) and Llama 3.3 70B (20/30)
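Each prediction in the table reduces to a yes/no check against the observed results. The sketch below encodes them as predicates. It illustrates the scoring logic and is not the contents of predictions_scoring.md; the dictionary keys are hypothetical, and the P6 threshold is taken as quoted in the table.

```python
# Observed values, hand-copied from this report. Key names are illustrative.
observed = {
    "score": 19,
    "task_07_passes": 3,
    "task_09_passes": 0,
    "task_09_fail_mode": "wrong_answer",
    "total_cost_usd": 0.27,
    "infrastructure_errors": 0,
}

# Each prediction is a predicate over the observed values.
predictions = {
    "P1": lambda o: 23 <= o["score"] <= 26,
    "P2": lambda o: o["task_09_passes"] == 0
    and o["task_09_fail_mode"] == "wrong_answer",
    "P3": lambda o: 2 <= o["task_07_passes"] <= 3,
    "P4": lambda o: 0.03 <= o["total_cost_usd"] <= 0.15,
    "P5": lambda o: o["infrastructure_errors"] == 0,
    "P6": lambda o: o["score"] >= 24,  # threshold as quoted in the table
}

results = {name: check(observed) for name, check in predictions.items()}
hits = sum(results.values())

print(results)
print(f"{hits}/{len(results)} predictions held")
```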

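The three misses are structurally related. P1 (quality shortfall), P4 (cost overrun), and P6 (quality below a 120B model) all trace in large part to the same source: task_03’s investigation loop. A model that goes 0/3 on the one task that every other model passes is going to underperform on score, cost, and relative ranking simultaneously. task_03 fully accounts for the P4 overrun, but not for the score misses on its own: even with 3/3 on task_03, V3.2 would have reached 22/30, one point below the P1 range.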

This is the largest prediction error in the series to date. The hypothesis going in was that V3.2 would approximate V4-Flash quality because of the shared lineage. It was wrong.

Where the leaderboard stands

[Observed — verified: leaderboard_by_score.sql]

Model	Score	Pass rate	Cost/pass	Infra
Claude Sonnet 4.6	28/30	93.3%	$0.0514	Bedrock eu-west-1
Devstral 2 123B	27/30	90.0%	$0.0019	Bedrock us-east-1
Mistral Large 3	27/30	90.0%	$0.0022	Bedrock us-east-1
GPT-5.5	27/30	90.0%	$0.0700	OpenAI API
DeepSeek V4-Flash	25/30	83.3%	$0.0756	DeepSeek API
GPT-OSS 120B	23/30	76.7%	$0.0013	Bedrock us-east-1
Llama 3.3 70B	20/30	66.7%	$0.0045	Bedrock eu-west-1
DeepSeek V3.2	19/30	63.3%	$0.0142	Bedrock us-east-1
DeepSeek R1	DNF	—	—	Bedrock (toolConfig unsupported)

V3.2 enters second from the bottom, ahead of only DeepSeek R1 (which did not finish) and below Llama 3.3 70B, a dense model with roughly 10× fewer total parameters. At this quality and cost position, there is no Bedrock routing scenario where V3.2 is the right choice over other available options.
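One way to make the "no routing scenario" claim concrete is a dominance check: a model is dominated if some other option scores at least as high and costs no more per pass, with a strict improvement on at least one of the two. This framing is our assumption, not a criterion the harness applies, and it ignores region, latency, and anything else not in the table. The rows are hand-copied from the leaderboard above; `Entry` and `dominators` are hypothetical names.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    score: int            # passes out of 30
    cost_per_pass: float  # USD


# Bedrock rows hand-copied from the leaderboard table.
BEDROCK = [
    Entry("Claude Sonnet 4.6", 28, 0.0514),
    Entry("Devstral 2 123B", 27, 0.0019),
    Entry("Mistral Large 3", 27, 0.0022),
    Entry("GPT-OSS 120B", 23, 0.0013),
    Entry("Llama 3.3 70B", 20, 0.0045),
    Entry("DeepSeek V3.2", 19, 0.0142),
]


def dominators(target_name, entries):
    """Models no worse than the target on score and cost, and better on one."""
    target = next(e for e in entries if e.model == target_name)
    return [
        e.model
        for e in entries
        if e.model != target_name
        and e.score >= target.score
        and e.cost_per_pass <= target.cost_per_pass
        and (e.score > target.score or e.cost_per_pass < target.cost_per_pass)
    ]


print(dominators("DeepSeek V3.2", BEDROCK))
```

On these figures, four Bedrock models dominate V3.2. Claude Sonnet 4.6 does not, because it costs more per pass, but nothing on the list dominates it either, since it has the top score.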

What we don’t know yet

[Unobserved]

We did not run V3.2 on the direct DeepSeek API. If latency or batching behaviour on the Bedrock path affects long-context coherence differently from the direct API path, this campaign would not detect it. The Bedrock ON_DEMAND routing was clean (30/30 runs completed, no INFRASTRUCTURE_ERROR), so the failures are not infrastructure errors. That rules out transport-level faults, not subtler serving differences, which only a direct-API comparison could expose, and that comparison was not part of this campaign.

We also do not know the full task_03 failure envelope. The evidence bundles cover task_03 across three runs, all showing the same loop pattern. Whether that pattern generalises to any sufficiently open-ended log analysis task, or is specific to the access log format and tool set in agentic-core-v1, is untested. A campaign targeting investigation-only tasks at different context depths would narrow it down.

Harness: agentic-core-v1. Run date: 2026-05-18. Campaign author: Rigg. Article: Jenn.