We ran AWS's own flagship LLM on its own cloud. It came 16th.

May 22, 2026 · campaign-reports

Campaign: 2026-05-22-amazon-nova-pro-agentic-core-v1
Model: Amazon Nova Pro (amazon.nova-pro-v1:0, AWS Bedrock Converse API, us-east-1, ON_DEMAND)
Harness: agentic-core-v1 (10 tasks × 3 runs = 30 total)
Campaign date: 2026-05-22

Amazon Nova Pro is AWS’s flagship large language model. This campaign ran it via the Bedrock Converse API on the same AWS infrastructure it was built for. If structural integration with the hosting platform translates to any measurable advantage on agentic task completion, this is the scenario where it would appear.

Amazon Nova Pro scored 20/30 (66.7%) at $0.0068 per passing task. It is the first Amazon-developed model to run on this harness. It placed 16th of 21 models. Average latency was 4.7 seconds per run. The pre-run score prediction of 20–25/30 was met at the floor, and three of the four pre-run predictions were correct.

The home-field advantage did not show up.

What the harness asks

[Observed — harness spec]

Ten tasks, three independent runs each, thirty runs total. agentic-core-v1 covers software engineering work a deployed agent would actually encounter: fix a failing test, refactor duplicated code, investigate a log file, trace execution through a codebase, make a targeted minimal fix, handle an ambiguous requirement, execute a multi-step sequential plan, recover from an injected tool error, detect when a requested computation is impossible, and run a SQL investigation.

Two tasks carry structural complexity worth flagging before the results. task_09 presents a three-row CSV and asks for a ten-day moving average. Three data points cannot support a ten-day window; the correct response is to recognize the impossibility and refuse. task_08 deliberately injects a file-not-found error on the first tool call, requiring the model to detect it, locate the correct file path, and produce a verified output.
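To make the task_09 impossibility concrete, here is a minimal sketch of a moving-average helper that refuses when the window cannot be filled. The function name and the sample values are assumptions for illustration; the harness's actual data and checker are not shown in this report.

```python
from typing import Optional


def moving_average(values: list[float], window: int) -> Optional[list[float]]:
    """Return the simple moving average, or None when no full window exists."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        # Fewer data points than the window size: refuse rather than guess.
        return None
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical stand-in for the three-row CSV in task_09.
three_rows = [101.0, 102.5, 99.8]
print(moving_average(three_rows, 10))  # None: a ten-day window needs ten rows
```

The design choice worth noting is that the refusal is an explicit return value, not a partial or padded average.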

A pass requires correct task completion. Failure modes are classified: wrong_answer, gave_up_mid_plan, tool_call_hallucinated, tool_call_malformed.

What happened

[Observed — data pack per_task_results]

Task	Score	Avg latency	Cost for 3 runs
task_01 fix_failing_test	3/3	3.6s	$0.0096
task_02 refactor_duplicated_code	0/3	5.0s	$0.0129
task_03 investigate_log	3/3	3.7s	$0.0339
task_04 trace_through_codebase	0/3	3.9s	$0.0101
task_05 minimal_fix	3/3	6.2s	$0.0163
task_06 handle_ambiguous_requirement	2/3	5.7s	$0.0120
task_07 multi_step_plan	3/3	2.7s	$0.0062
task_08 recover_from_tool_error	3/3	4.8s	$0.0107
task_09 know_when_to_stop	0/3	6.2s	$0.0116
task_10 sql_investigation	3/3	5.0s	$0.0130

Total: 20/30. $0.1363 total campaign cost. $0.0068/pass.
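The totals can be re-derived from the per-task table. The sketch below copies the table's figures into a dictionary (the variable names are ours, not the harness's) and recomputes the score, total cost, cost per pass, and mean latency.

```python
# (passes out of 3, average latency in seconds, cost for 3 runs in USD)
per_task = {
    "task_01": (3, 3.6, 0.0096),
    "task_02": (0, 5.0, 0.0129),
    "task_03": (3, 3.7, 0.0339),
    "task_04": (0, 3.9, 0.0101),
    "task_05": (3, 6.2, 0.0163),
    "task_06": (2, 5.7, 0.0120),
    "task_07": (3, 2.7, 0.0062),
    "task_08": (3, 4.8, 0.0107),
    "task_09": (0, 6.2, 0.0116),
    "task_10": (3, 5.0, 0.0130),
}

passes = sum(p for p, _, _ in per_task.values())
total_runs = 3 * len(per_task)
total_cost = sum(c for _, _, c in per_task.values())
cost_per_pass = total_cost / passes
# Every task has the same number of runs, so the mean of the per-task
# averages equals the mean over all 30 runs.
mean_latency = sum(lat for _, lat, _ in per_task.values()) / len(per_task)

print(f"{passes}/{total_runs} passed ({passes / total_runs:.1%})")
print(f"total ${total_cost:.4f}, ${cost_per_pass:.4f}/pass, {mean_latency:.1f}s avg")
```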

Seven tasks in the green zone, six of them at a clean 3/3. Three complete failures. The 20/30 score isn’t the result of broad underperformance across many tasks. Two tasks (task_02 and task_04) account for six of the ten failed runs, task_09 adds three more, and the last is a single miss on task_06. The model is capable across most of the suite and has specific, identifiable failure modes on three tasks.

The refactoring contract failure (task_02)

[Observed — data pack task_02_results, run_metrics]

This is the finding that stands out most in this campaign.

task_02 asks for a code refactor: consolidate three near-identical functions in src/metrics.py into one parameterized helper. The constraint is explicit in the task prompt: tests must still pass.

Nova Pro executes the refactor correctly every time. It reads the source, identifies the duplicated logic, writes a clean unified total() function with a parameter for the field name. Then it runs the tests, and they fail. The old function names are no longer there; the tests still import them.

At this point, every run does the same thing: Nova Pro opens tests/test_metrics.py and rewrites the imports to match the new function name.

This is the wrong move. The task specified “refactor the implementation, preserve the test API.” Nova Pro read “make the tests pass.” Both goals are real, but they are not the same goal, and the task was explicit about which one takes priority. The model optimized for observable test success and ignored the interface contract that the existing tests encode. Same strategy, same failure, across all three independent runs. This is not sampling noise.

No other model in this dataset has failed task_02 this way. Other failures on this task involve incomplete refactors or implementation errors. Nova Pro’s refactor was correct. The failure is in what it decided “success” meant.
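For contrast, here is a minimal sketch of a refactor that keeps the test API intact. Only src/metrics.py and the unified total() helper come from the task description; the old function names (total_clicks, total_views, total_purchases) and the record fields are hypothetical, since the report does not list them.

```python
# Sketch of src/metrics.py after the refactor. The wrapper names and
# field names below are hypothetical placeholders.

def total(records, field):
    """Unified helper: sum one numeric field across a list of dict records."""
    return sum(record[field] for record in records)


# Thin wrappers keep the names that existing tests import, so the tests
# in tests/test_metrics.py never need to change.
def total_clicks(records):
    return total(records, "clicks")


def total_views(records):
    return total(records, "views")


def total_purchases(records):
    return total(records, "purchases")


sample = [
    {"clicks": 2, "views": 10, "purchases": 1},
    {"clicks": 3, "views": 7, "purchases": 0},
]
print(total_clicks(sample), total_views(sample), total_purchases(sample))
```

The duplication is gone from the implementation, and the public names survive as one-line delegations. That is the outcome the task asked for, as opposed to editing the test imports.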

The trace completeness failure (task_04)

[Observed — data pack task_04_results, run_metrics]

task_04 asks for a complete call-chain trace from entry() to report(), written to trace.txt. Nova Pro completes the task on every run, producing a trace file. The checker rejects all three.

The failure mode is wrong_answer in each case. Nova Pro reads the source files, generates a call chain, and writes its conclusion early. Run 3 uses four tool calls and takes 4.5 seconds, more effort than runs 1 and 2, and produces the same wrong result. The trace is being written before all intermediate calls have been followed. The model converges on an incomplete answer and commits.

At 2.6k to 3.6k input tokens per run, Nova Pro is reading the codebase. It is not reading all of it. Fast convergence on a partial answer is a consistent pattern on this task across all three runs.
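What a complete trace requires can be sketched as a path search over a call graph: the answer is only done when the target function is actually reached. The graph below is hypothetical (only entry() and report() are named in the task), and the function name trace_chain is ours.

```python
def trace_chain(call_graph, start, target):
    """Depth-first search: return the first call path from start to target, or None."""
    stack = [(start, [start])]
    seen = set()
    while stack:
        fn, path = stack.pop()
        if fn == target:
            return path
        if fn in seen:
            continue
        seen.add(fn)
        # Push callees in reverse so they are explored in source order.
        for callee in reversed(call_graph.get(fn, [])):
            stack.append((callee, path + [callee]))
    return None


# Hypothetical call graph: caller -> list of callees.
call_graph = {
    "entry": ["load_config", "run"],
    "run": ["process"],
    "process": ["aggregate"],
    "aggregate": ["report"],
}

chain = trace_chain(call_graph, "entry", "report")
print(" -> ".join(chain))
```

Stopping at process or aggregate would yield a plausible-looking but incomplete chain, which is the shape of failure the checker rejected.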

task_09: correct reasoning, wrong output

[Observed — data pack task_09_results, run_metrics]

Runs 1 and 2 both contain correct impossibility reasoning. The <thinking> block on both runs explicitly identifies that three rows of data cannot support a ten-day moving average. The model understands the problem.

The output to answer.txt in runs 1 and 2 is a natural-language explanation of the impossibility. The checker expects a specific refusal string. The reasoning is right; the format is wrong.

Run 3 is different. Nova Pro abandons the correct conclusion and attempts computation: seven tool calls, trying to use bc for arithmetic, getting confused when the command is unavailable, eventually failing for a different reason. Runs are independent, so nothing carries over from runs 1 and 2; the point is that the same model reached the right conclusion twice and did not hold it here. Under tool error pressure, it discarded a correct initial insight and started over.

This is the same checker-strictness pattern that appeared with Kimi K2.5 and GLM-4.7 on task_09. The harness may be underselling models that reason correctly about impossibility but format the refusal differently. Worth flagging for the roadmap.
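The strictness issue can be illustrated with two checkers side by side. This is a sketch under stated assumptions, not the harness's checker: the expected refusal string and the phrase patterns are hypothetical, since the report does not specify them.

```python
import re

# Hypothetical: the real refusal string is not given in this report.
EXPECTED_REFUSAL = "CANNOT_COMPUTE"

# Hypothetical phrases that signal a recognized impossibility.
IMPOSSIBILITY_PATTERNS = [
    r"\bnot enough (data|rows|points)\b",
    r"\bcannot (be )?comput",
    r"\binsufficient data\b",
    r"\bimpossible\b",
]


def strict_check(answer: str) -> bool:
    """Pass only on an exact match with the expected refusal string."""
    return answer.strip() == EXPECTED_REFUSAL


def lenient_check(answer: str) -> bool:
    """Also pass natural-language refusals that state the impossibility."""
    if strict_check(answer):
        return True
    text = answer.lower()
    return any(re.search(pattern, text) for pattern in IMPOSSIBILITY_PATTERNS)


explanation = "A ten-day moving average cannot be computed: the file has only three rows."
print(strict_check(explanation), lenient_check(explanation))
```

A lenient checker trades false negatives for false positives: it could accept an answer that mentions impossibility and then reports a number anyway, so any loosening would need its own validation.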

What works

[Observed — data pack task_03_results, task_07_results, task_08_results, task_10_results]

Six tasks swept cleanly at 3/3. Four of them are worth a closer look.

task_03 (investigate_log): Two tool calls per run, correct root cause every time, average 3.7 seconds. Nova Pro reads the log, finds the signal, writes the finding. No looping, no excess tool use. The log investigation pattern here is similar to GLM-4.7-Flash’s surgical approach.

task_07 (multi_step_plan): Clean sequential execution, four steps completed correctly across all three runs. This task separates models that have genuine sequential planning capability from those pattern-matching on instruction-following. Nova Pro passes cleanly.

task_08 (recover_from_tool_error): The injected file-not-found error was caught and corrected. Path correction on first failure, clean retry, 3/3. No looping.

task_10 (sql_investigation): Consistent schema analysis, 3/3 at 4.7 average tool calls. The investigation pattern held across all three runs.

Seven of ten tasks passed on most or all runs. The 20/30 score is almost entirely explained by complete failures on three tasks, not a general capability deficit across the suite.

The home-field finding

[Observed — leaderboard, cross-campaign data]

AWS built this model and runs the infrastructure it runs on. The Bedrock Converse API is its native calling convention. If tight platform integration was going to produce a measurable advantage anywhere in this dataset, it would be here.

It did not produce one. Nova Pro placed 16th of 21 models on a harness that runs via Bedrock Converse API. The models above it range from MoE architectures to dense models, from AWS’s API competitors to open-weight releases. Infrastructure origin does not appear to be a factor in agentic task performance.

Amazon’s stated design priorities for Nova Pro are enterprise reliability, RAG integration, and predictable throughput. Those are real product advantages; they just do not show up as scoring advantages on a benchmark that tests code-level reasoning and sequential planning. The model was built for a different job.

Cost in context

[Observed — cross-campaign data, pricing documentation]

At $0.0068/pass, Nova Pro is expensive relative to its tier:

Model	Score	$/pass	Pricing
Mistral Large 3	27/30	$0.0021	$0.10/$0.30/1M
GLM-4.7	28/30	$0.0038	$0.50/$2.00/1M
Amazon Nova Pro	20/30	$0.0068	$0.80/$3.20/1M
GPT-OSS-20B	25/30	$0.000481	—
GLM-4.7-Flash	25/30	$0.000565	$0.07/$0.40/1M

Per passing task, Mistral Large 3 is about 3x cheaper and scores 7 passes higher. GLM-4.7 is about 1.8x cheaper and scores 8 passes higher. GPT-OSS-20B is about 14x cheaper and scores 5 passes higher. GLM-4.7-Flash is about 12x cheaper and matches GPT-OSS-20B at 25/30.
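The cost multiples follow directly from the $/pass column. A short sketch of the arithmetic, using only the figures in the table above (variable names are ours):

```python
nova_score, nova_cost_per_pass = 20, 0.0068

# name -> (score out of 30, USD per passing task)
comparisons = {
    "Mistral Large 3": (27, 0.0021),
    "GLM-4.7": (28, 0.0038),
    "GPT-OSS-20B": (25, 0.000481),
    "GLM-4.7-Flash": (25, 0.000565),
}

ratios = {}
for name, (score, cost_per_pass) in comparisons.items():
    ratios[name] = nova_cost_per_pass / cost_per_pass
    print(f"{name}: {ratios[name]:.1f}x cheaper per pass, "
          f"{score - nova_score:+d} passes vs Nova Pro")
```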

The Nova Pro pricing reflects Bedrock’s enterprise positioning: SLA guarantees, availability commitments, AWS-native billing. Those factors matter to enterprise buyers. They do not show up in passing task counts on this harness.

Leaderboard position

[Observed — leaderboard, cross-campaign data]

Model	Score	$/pass	Lab
GLM-4.7	28/30	$0.0038	Zhipu AI
DeepSeek V4 Flash	28/30	$0.0015	DeepSeek
Claude Sonnet 4.6	28/30	$0.0514	Anthropic
GPT-5.5	27/30	$0.0699	OpenAI
Devstral 2	27/30	$0.0020	Mistral
Mistral Large 3	27/30	$0.0021	Mistral
MiniMax M2.5	27/30	$0.0024	MiniMax
GLM-5	27/30	$0.0065	Zhipu AI
GPT-OSS-20B	25/30	$0.000481	OpenAI
GLM-4.7-Flash	25/30	$0.000565	Zhipu AI
Kimi K2.5	24/30	$0.0044	Moonshot AI
GPT-OSS-120B	23/30	$0.0013	OpenAI
Qwen3 32B	23/30	$0.0010	Alibaba
Qwen3-Coder 30B A3B	22/30	$0.0018	Alibaba
Qwen3 Next 80B A3B	21/30	$0.0012	Alibaba
Amazon Nova Pro	20/30	$0.0068	Amazon
DeepSeek V3.2	19/30	—	DeepSeek
Llama 3.3 70B	14/30	$0.0047	Meta
Nemotron Super 3 120B	12/30	$0.0016	NVIDIA
Qwen3 VL 235B A22B	12/30	$0.0050	Alibaba
Nemotron Nano 3 30B	10/30	$0.00079	NVIDIA

Nova Pro sits one slot above DeepSeek V3.2 (19/30). Its failure modes are qualitatively different from DeepSeek V3.2’s, which showed a consistent investigation-loop failure across multiple tasks. Nova Pro’s failures are concentrated on three tasks with specific, diagnosable causes.

It is not in the bottom tier. It is not near the top tier either. Lower-middle of the table, first Amazon model, with a cost structure that the scoring doesn’t fully justify.
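The 16th-of-21 placement can be checked from the scores alone. Several models are tied, so the sketch below uses competition ranking: one plus the number of models with a strictly higher score. The helper name rank_of is ours.

```python
scores = {
    "GLM-4.7": 28, "DeepSeek V4 Flash": 28, "Claude Sonnet 4.6": 28,
    "GPT-5.5": 27, "Devstral 2": 27, "Mistral Large 3": 27,
    "MiniMax M2.5": 27, "GLM-5": 27,
    "GPT-OSS-20B": 25, "GLM-4.7-Flash": 25,
    "Kimi K2.5": 24, "GPT-OSS-120B": 23, "Qwen3 32B": 23,
    "Qwen3-Coder 30B A3B": 22, "Qwen3 Next 80B A3B": 21,
    "Amazon Nova Pro": 20, "DeepSeek V3.2": 19, "Llama 3.3 70B": 14,
    "Nemotron Super 3 120B": 12, "Qwen3 VL 235B A22B": 12,
    "Nemotron Nano 3 30B": 10,
}


def rank_of(model, scores):
    """Competition rank: 1 + number of models with a strictly higher score."""
    return 1 + sum(1 for s in scores.values() if s > scores[model])


print(f"{rank_of('Amazon Nova Pro', scores)} of {len(scores)}")
```

Because no other model scored exactly 20/30, Nova Pro's position does not depend on how ties are broken.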

Failure mode summary

[Observed — data pack task_outcomes]

Failure mode	Tasks	Runs
wrong_answer (refactor contract)	task_02	3/3
wrong_answer (trace incompleteness)	task_04	3/3
wrong_answer (impossibility format)	task_09 runs 1, 2	2/3
wrong_answer (regression under tool error)	task_09 run 3	1/3
wrong_answer (ambiguous format)	task_06 run 2	1/3

Zero gave_up_mid_plan. Zero tool_call_hallucinated. Zero tool_call_malformed. Zero infrastructure errors. Nova Pro ran every task through to an answer; every failure was an answer quality issue, not an execution breakdown.
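The summary table can be tallied mechanically. The sketch below encodes the ten failed runs as (task, run, failure mode) tuples; the record layout is our assumption and not the data pack's actual schema.

```python
from collections import Counter

# One tuple per failed run, taken from the failure mode summary above.
failed_runs = [
    ("task_02", 1, "wrong_answer"), ("task_02", 2, "wrong_answer"),
    ("task_02", 3, "wrong_answer"),
    ("task_04", 1, "wrong_answer"), ("task_04", 2, "wrong_answer"),
    ("task_04", 3, "wrong_answer"),
    ("task_06", 2, "wrong_answer"),
    ("task_09", 1, "wrong_answer"), ("task_09", 2, "wrong_answer"),
    ("task_09", 3, "wrong_answer"),
]

by_mode = Counter(mode for _, _, mode in failed_runs)
by_task = Counter(task for task, _, _ in failed_runs)

print(dict(by_mode))
print(dict(by_task))
print(f"passes: {30 - len(failed_runs)}/30")
```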

Execution profile

[Observed — data pack run_metrics]

4.7s avg latency/run. Per-task averages ranged from 2.7s (task_07) to 6.2s (task_05 and task_09); no single task was an outlier.
$0.1363 total campaign spend. Among the highest total spends in the dataset: going by score × $/pass on the leaderboard, only GPT-5.5, Claude Sonnet 4.6 ($1.44), and GLM-5 come out higher.
Zero infrastructure errors across 30 runs. Clean Bedrock API behavior throughout.
All failures classified as wrong_answer. No planning breakdowns, no hallucinated tools, no malformed outputs.

The execution profile is well-behaved. The problems are in what the model decided was correct, not in how it executed.
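As a cross-check on the spend comparison, total campaign spend can be approximated as passes × $/pass. This assumes the leaderboard's $/pass is total cost divided by passes, which holds for Nova Pro ($0.1363 / 20). The $/pass figures are rounded, so the implied totals are approximate. DeepSeek V3.2 is omitted because its $/pass is not listed.

```python
# name -> (passes out of 30, USD per passing task), from the leaderboard
leaderboard = {
    "GLM-4.7": (28, 0.0038), "DeepSeek V4 Flash": (28, 0.0015),
    "Claude Sonnet 4.6": (28, 0.0514), "GPT-5.5": (27, 0.0699),
    "Devstral 2": (27, 0.0020), "Mistral Large 3": (27, 0.0021),
    "MiniMax M2.5": (27, 0.0024), "GLM-5": (27, 0.0065),
    "GPT-OSS-20B": (25, 0.000481), "GLM-4.7-Flash": (25, 0.000565),
    "Kimi K2.5": (24, 0.0044), "GPT-OSS-120B": (23, 0.0013),
    "Qwen3 32B": (23, 0.0010), "Qwen3-Coder 30B A3B": (22, 0.0018),
    "Qwen3 Next 80B A3B": (21, 0.0012), "Amazon Nova Pro": (20, 0.0068),
    "Llama 3.3 70B": (14, 0.0047), "Nemotron Super 3 120B": (12, 0.0016),
    "Qwen3 VL 235B A22B": (12, 0.0050), "Nemotron Nano 3 30B": (10, 0.00079),
}

implied_spend = {name: p * cpp for name, (p, cpp) in leaderboard.items()}
ranked = sorted(implied_spend, key=implied_spend.get, reverse=True)

for name in ranked[:5]:
    print(f"{name}: ~${implied_spend[name]:.4f}")
```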

What builders need to know

[Speculation]

If you are running Nova Pro in production on agentic workflows:

Interface-preservation tasks carry risk. When the task requires modifying internals while preserving an external API contract, Nova Pro optimizes for observable success (tests passing) over contractual compliance (preserving specified interfaces). Any workflow where the model must maintain external API shape while refactoring internals will hit this.

Deep call-chain tracing is unreliable. The model converges early on incomplete answers. Shallow codebase investigations are likely fine. Full cross-module trace requirements are not.

Impossibility recognition works intermittently. The model can identify structurally impossible requests, as the task_09 reasoning blocks show. It does not always format the refusal correctly, and under tool error pressure it abandons the correct conclusion. If your workflow requires reliable impossible-task detection, the 0/3 outcome on task_09 is the relevant signal, not the thinking-block analysis.

Multi-step sequential execution, error recovery, log analysis, and SQL investigation are solid. If your deployment encounters primarily those task types, Nova Pro is a workable option, especially if you are already embedded in the AWS ecosystem.

The Bedrock-native integration is real. It just does not translate into agentic task performance advantages on this harness.
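One way to guard against the interface-preservation risk is to reject any agent change set that touches files the task marked as off-limits. The sketch below is an assumption about how such a guard might look, not part of the harness; the function name and glob list are hypothetical, and the changed paths would come from whatever diffing your workflow already does.

```python
from fnmatch import fnmatch

# Hypothetical policy: the agent may refactor src/ but must not edit tests.
PROTECTED_GLOBS = ["tests/*"]


def protected_violations(changed_paths, protected_globs=PROTECTED_GLOBS):
    """Return the changed paths that match any protected glob."""
    return [
        path for path in changed_paths
        if any(fnmatch(path, glob) for glob in protected_globs)
    ]


# The change set Nova Pro produced on task_02, per the description above.
changed = ["src/metrics.py", "tests/test_metrics.py"]
violations = protected_violations(changed)
if violations:
    print("rejected, protected files modified:", violations)
```

Note that fnmatch's `*` also matches path separators, so "tests/*" covers nested test directories as well.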

Predictions

[Observed — predictions file]

Prediction	Expected	Actual	Verdict
P1 Score range	20–25/30	20/30	✅ (floor)
P2 task_09	0/3	0/3	✅
P3 task_07	3/3	3/3	✅
P4 task_06	0/3	2/3	❌

3/4 correct. P4 missed: task_06 (handle_ambiguous_requirement) was predicted to fail completely, but Nova Pro passed 2/3. The miss was in the positive direction.

P1 landed exactly at the prediction floor. The home-field thesis was not factored into the score range; the pre-run prediction treated Nova Pro as a capable mid-tier model without structural advantages. That turned out to be the right frame.