Palmyra X5 on agentic-core-v1: 23/30 for $0.05, and a marketing claim that didn't show up in the data

Campaign: 2026-06-20-palmyra-x5-agentic-core-v1
Model: Writer Palmyra X5 (us.writer.palmyra-x5-v1:0, Bedrock us-east-1 cross-region inference profile)
Harness: agentic-core-v1 (10 tasks × 3 runs = 30 runs)
Campaign date: 2026-06-20 (10:03–10:10Z, 7 minutes)


Writer positions Palmyra X5 as a model “purpose-built for enterprise agentic workflows.” That is a specific claim. It means the model should do something on agent benchmarks that general-purpose models do not.

On agentic-core-v1, Palmyra X5 scored 23/30. Google’s Gemini 3.1 Pro also scored 23/30. Gemini has no special agentic positioning. The enterprise pitch does not show up in the comparison.

The more interesting number is the cost. Palmyra X5 completed 30 agentic tasks for $0.05. Gemini 3.1 Pro completed the same 30 tasks for $0.85. Same score, 17× cheaper. If Writer’s pitch undersells the model’s real competitive advantage, cost is what they should be talking about.

A caveat before the breakdown: 4 of the 7 failures were not wrong answers. They were infrastructure errors from the Bedrock cross-region inference adapter dropping mid-run. Those are not capability failures. If all four had completed correctly, the score would be 27/30. If half had completed, it would be 25/30, which is the lower bound of the pre-run prediction of 25–28/30. The actual result came in below the prediction range because of infrastructure, not model behaviour.


What agentic-core-v1 tests

[Observed — harness spec]

The suite runs 10 tasks, 3 times each. Each task has a deterministic pass/fail checker. The turn budget is fixed at 15 turns per run. Failure modes are logged at run time: wrong_answer when the checker rejects the output (meaning the model produced something that failed the acceptance criteria), gave_up_mid_plan when the turn budget expires without required output written, and infrastructure_error when the harness fails to complete a run due to environment failure rather than model behaviour.

The 10 tasks cover the day-to-day work of a coding agent: fix a failing test, refactor duplicated code, investigate a log file, trace execution through a multi-file codebase, apply a constrained minimal fix, handle an ambiguous requirement, execute a four-step sequential plan, recover when a tool call returns an error, detect when a computation is impossible on available data, and run a SQL investigation.

Two tasks carry deliberate traps. task_09 asks for a 10-day moving average computed from a three-row dataset. The correct move is to flag the impossibility and document it in the answer file rather than computing a number. task_08 injects a tool error mid-task to test recovery. The model must find an alternative path when its first approach hits a wall. Neither has an ambiguous success condition.


The results

[Observed — verification: pass_rate_by_task.csv]

TaskPassed / RunsAvg tool callsAvg latencyNotes
task_01 fix_failing_test3/34.04.3sClean
task_02 refactor_duplicated_code2/33.317.8s1 infra error
task_03 investigate_log1/31.79.2s2 infra errors
task_04 trace_through_codebase3/36.017.1sClean
task_05 minimal_fix3/33.720.5sClean
task_06 handle_ambiguous_requirement3/34.016.1sClean
task_07 multi_step_plan3/34.012.2sClean
task_08 recover_from_tool_error2/31.310.3s1 infra error
task_09 know_when_to_stop0/31.012.7sTotal fail — impossibility not detected
task_10 sql_investigation3/33.018.2sClean

Total: 23/30 (76.7%, 95% CI ±15.0pp) · $0.05

(verified: pass_rate_by_task.csv, cost_breakdown.csv, tool_calls_by_task.csv, latency_distribution.csv)

Six tasks were 3/3. Seven failures split into two distinct categories: 4 infrastructure errors across task_02, task_03, and task_08; and 3 wrong answers all belonging to task_09. That is a clean separation between a reliability problem and a capability problem.


What the six clean sweeps show

[Observed — verification: pass_rate_by_task.csv, tool_call_redundancy.md]

The six 3/3 tasks were: task_01 (fix failing test), task_04 (trace codebase), task_05 (minimal fix), task_06 (ambiguous requirement), task_07 (multi-step plan), and task_10 (SQL investigation). Eighteen runs passed. None required more than 6 tool calls per run.

The efficiency profile is what stands out. Zero runs triggered tool_call_redundancy (the pattern where a model reads the same file or invokes the same tool call back-to-back without new information; verified: tool_call_redundancy.md). Gemini 3.1 Pro, at an identical score, showed tool redundancy in 17 of its 30 runs. Palmyra X5 reads once and moves on.

task_04 (trace through a multi-file codebase) is worth examining directly. The task requires reading entry points, following imports across files, and writing a structured trace.txt. The average tool call count was 6.0 per run — efficient for a multi-file navigation task. Three runs passed cleanly on 3/3 without re-reading files already seen (verified: pass_rate_by_task.csv).

task_06 (handle ambiguous requirement) produced a notable result given Writer’s enterprise framing. The task gives the model an ambiguous specification and requires both a working implementation and a documented assumption in note.txt. All three runs produced correct implementations and documented assumptions. Whether this reflects enterprise fine-tuning or general instruction-following is not distinguishable from this data alone. [Speculation]

task_07 (multi-step plan) was 3/3. The task requires executing a four-step sequential process where later steps depend on earlier outputs. Average 4.0 tool calls, average 12.2 seconds per run. The model executed each step, wrote intermediate files, and completed cleanly.


The infrastructure failures: a Bedrock adapter problem

[Observed — verification: cross_task_consistency.md, transcript b8e234b5]

Four of the seven failures were infrastructure_error — the harness failure mode for runs where the environment dropped before the model could respond. These are not model failures.

The cross_task_consistency evidence pattern identified a single run (task_02_refactor_duplicated_code, run3) where the same infrastructure failure appeared across three distinct tasks: task_02, task_03, and task_08. The failure pattern is consistent across all four affected runs: the model issues a tool call, receives the tool result, and then the Bedrock adapter returns nothing. The run ends without a model response.

Transcript b8e234b5 (task_03_investigate_log, run1) is the clearest example. The model reads access.log, receives approximately 500 lines of log data, and then the connection drops. The tool call was valid. The model received a complete response from the filesystem. The next turn never arrived.

This is a known intermittent behaviour with the Bedrock us-east-1 cross-region inference profile on heavy-input tasks. It has appeared on previous campaigns with other models running through the same Bedrock adapter. The 4 affected runs are infrastructure failures, not model capability failures (verified: cross_task_consistency.md).

If the 4 infrastructure runs had all completed with a pass, the score would be 27/30 (90%), above GPT-5.5 (27/30) and inside the prediction range. If half had completed with a pass, the score would be 25/30 (83.3%), the lower bound of the prediction. The actual result of 23/30 reflects infrastructure reliability rather than a ceiling on the model (verified: pass_rate_by_task.csv). [Speculation — we cannot know what the model would have output on those runs]


Why task_09 is a full floor

[Observed — verification: pass_rate_by_task.csv]

All three task_09 runs failed with wrong_answer. The task presents three rows of revenue data and asks for a 10-day moving average. Three rows of data cannot support a 10-day window. The correct behaviour is to detect the impossibility, note it explicitly, and write a qualified or partial answer rather than a number.

Palmyra X5 read the data, computed a numeric answer, and wrote it. All three times.

This pattern is not Writer-specific. Gemini 3.1 Pro scored 0/3 on task_09. Gemini 3.5 Flash scored 0/3. The failure mode shows up across models tuned toward task completion — they produce an answer because production coding environments reward completion over epistemic precision. task_09 requires the opposite judgment.

The caveat: task_09 is 0/3 for many models in this leaderboard. It is not evidence of a Palmyra-specific weakness.


Does “enterprise agentic” show up in the data?

[Observed — verification: pass_rate_by_task.csv, tool_call_redundancy.md; comparison baseline: 2026-06-15-gemini-3.1-pro-agentic-core-v1]

Palmyra X5 and Gemini 3.1 Pro share a score of 23/30. On the tasks where they differ in behaviour, Palmyra X5 is more efficient: 0 redundant tool calls versus Gemini’s 17/30 runs with redundancy. That efficiency may reflect enterprise fine-tuning. Models trained on production codebases learn to read once and move. [Speculation]

The score tie does not support the enterprise positioning claim. A model genuinely purpose-built for agentic workflows would be expected to outperform a general-purpose model at this tier, not match it. Whether that gap exists at harder tasks is not answered by agentic-core-v1 alone. [Speculation]

What the data does support: the cost profile. $0.05 for 30 full agentic tasks is a meaningful number for production use. At $0.60/1M input tokens, Palmyra X5 produces an average of 116–555 output tokens per run across tasks. No run exceeded 12 turns. The entire 30-run campaign finished in 7 minutes. For systems where cost per agent-step matters at scale, those numbers are competitive at this score tier.


Where Palmyra X5 sits

[Observed — leaderboard data as of 2026-06-20]

ModelScorePass rateCost (30 runs)Date
Claude Opus 4.830/30100%$7.34Jun 2026
DeepSeek v4 Pro30/30100%$0.12Jun 2026
Mistral Small 429/3096.7%$0.03May 2026
GPT-5.527/3090.0%$1.53Jun 2026
Claude Fable 525/3083.3%$1.97Jun 2026
Palmyra X523/3076.7%$0.05Jun 2026
Gemini 3.1 Pro23/3076.7%$0.85Jun 2026
Gemma 4 12B22/3073.3%$0.00*Jun 2026
Amazon Nova Pro17/3056.7%Jun 2026
Gemini 3.5 Flash11/3036.7%$0.72May 2026
Llama 4 Scout 17B10/3033.3%Jun 2026

*Gemma 4 12B ran on internal infra; cost not billed.

Palmyra X5 lands mid-table, tied with Gemini 3.1 Pro and a point above Gemma 4 12B. At its current score it is not a top-tier model. At its current price it is the cheapest way to reach the 76–77% tier by a large margin.


What we don’t know yet

[Unobserved — no replication campaign run]

The infrastructure failure rate on this campaign was high: 4 of 30 runs affected. A second campaign on a different Bedrock region or API endpoint would clarify whether the $0.05 cost and 23/30 score reflect the actual floor, or whether the infra failures are masking a model closer to 27/30. Without that replication, the 23/30 score carries more uncertainty than the confidence interval alone suggests.

The agentic-core-v1 suite is 10 tasks at a fixed difficulty. Whether the efficiency advantage Palmyra X5 shows on this suite (zero redundant tool calls, 7-minute campaign, low output token counts) scales to harder or longer tasks is not tested here. [Speculation]

The task_06 (ambiguous requirement) pass rate was 3/3. Whether that reflects enterprise fine-tuning or general instruction-following capability is not distinguishable from this campaign. An ablation comparing Palmyra X5 on tasks that specifically reward documenting assumptions versus tasks that do not would give signal here. That ablation has not been run. [Speculation]


What the prediction got wrong

[Observed — predictions baseline: campaigns/palmyra-x5-agentic-core-v1.predictions.md]

The pre-run prediction was 25–28/30 (83–93%). Actual: 23/30 (76.7%). The prediction missed low.

The miss is attributable to the infrastructure failure rate. The prediction was made against expected model capability, not against a Bedrock environment that would drop 4 of 30 runs mid-execution. A correct prediction would have required knowing in advance that the cross-region inference profile would fail at this rate on this campaign date.

On task-level predictions, the task_09 floor was anticipated. The six clean-sweep tasks were also expected to pass. The prediction underestimated the infrastructure exposure on a Bedrock-hosted model at the time of this campaign.

ClawWorks Weekly

AI benchmarks, trading bots, and security research — what's actually happening this week.