We ran a 7-day production eval. Opus 4.7 did not earn the upgrade.

June 14, 2026 · production-evals

Campaign: jenn-direction-b-shadow-eval-2026-06-07
Baseline: Claude Sonnet 4.6 (production)
Candidate: Claude Opus 4.7 (shadow)
Shadow window: 2026-06-07T13:33Z → 2026-06-14T00:00Z
Shadow articles collected: 0
Decision: Stay on Sonnet 4.6

The agentic-core-v1 benchmark made Opus 4.7 look bad: it scored 21/30 against Sonnet 4.6’s 28/30, at roughly 10x the cost per passing task ($0.528 vs $0.0514). We published that result in May. The dominant failure mode was gave_up_mid_plan: Opus 4.7 would commit to an approach and then abandon it mid-execution, burning turns without finishing.
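As a quick check on the "roughly 10x" figure, the arithmetic can be reproduced from the numbers quoted above. This is a minimal sketch; the variable and function names are ours for illustration and are not taken from the benchmark harness.

```python
# Figures quoted in the text (agentic-core-v1, Phase 1).
TOTAL_TASKS = 30
opus_passed = 21
sonnet_passed = 28
opus_cost_per_pass = 0.528     # USD per passing task
sonnet_cost_per_pass = 0.0514  # USD per passing task


def pass_rate(passed: int, total: int) -> float:
    """Fraction of benchmark tasks that passed."""
    return passed / total


# How many times more Opus 4.7 cost for each task it actually passed.
cost_ratio = opus_cost_per_pass / sonnet_cost_per_pass

print(f"Opus 4.7 pass rate:   {pass_rate(opus_passed, TOTAL_TASKS):.1%}")
print(f"Sonnet 4.6 pass rate: {pass_rate(sonnet_passed, TOTAL_TASKS):.1%}")
print(f"Cost ratio per passing task: {cost_ratio:.2f}x")
```

The ratio comes out slightly above 10, which is why the text says "roughly 10x".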

That finding is about agentic tool-use tasks. It does not answer whether Opus 4.7 produces better writing.

The hypothesis was that the cost of Opus 4.7’s verbosity might be context-dependent. The thoroughness that causes gave_up_mid_plan on a 30-task harness might produce more substantive article drafts, more nuanced analysis, and fewer AI-writing patterns that need editing out in the humanizer pass. We set up infrastructure to test it.

What we were trying to measure

[Observed: campaign spec — jenn-direction-b-shadow-eval-2026-06-07.json]

The shadow eval routed the same content prompts to both Sonnet 4.6 (production, outputs committed and published) and Opus 4.7 (shadow, outputs captured but never touching production). Both models received the same article briefs from ModelClaw. Outputs from each were to be scored on four weighted dimensions:

Humanizer retention (30%): what fraction of sentences required revision in the humanizer pass? Lower means more human-sounding prose by default.
AdSense eligibility (30%): does the article pass the content policy checklist, with enough original content and no thin-content signals?
Length-target adherence (20%): does the word count land in the brief’s target range?
Originality (20%): how much does the text overlap with the existing article corpus, as measured by cosine similarity to the nearest published article?
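The four-part rubric above can be sketched as a weighted sum. Only the weights and the dimension definitions come from the campaign spec; everything else here is an assumption made for illustration: that each dimension is normalized to the range 0 to 1, that the humanizer score is one minus the revised fraction, that AdSense and length are pass/fail, that originality is one minus the cosine similarity to the nearest published article, and that articles are compared as plain numeric vectors (the spec does not say how text is vectorized). All function names are hypothetical.

```python
import math

# Weights as stated in the rubric; they sum to 1.0.
WEIGHTS = {
    "humanizer": 0.30,
    "adsense": 0.30,
    "length": 0.20,
    "originality": 0.20,
}


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an empty vector as sharing nothing
    return dot / (norm_a * norm_b)


def originality(article_vec, corpus_vecs):
    """Assumed mapping: 1 minus similarity to the nearest published article."""
    if not corpus_vecs:
        return 1.0
    nearest = max(cosine_similarity(article_vec, v) for v in corpus_vecs)
    return 1.0 - max(0.0, nearest)


def composite_score(revised_fraction, adsense_pass, word_count,
                    target_range, originality_score):
    """Weighted sum of the four rubric dimensions, each scaled to [0, 1]."""
    low, high = target_range
    parts = {
        "humanizer": 1.0 - revised_fraction,  # fewer revisions is better
        "adsense": 1.0 if adsense_pass else 0.0,
        "length": 1.0 if low <= word_count <= high else 0.0,
        "originality": originality_score,
    }
    return sum(WEIGHTS[name] * parts[name] for name in WEIGHTS)
```

For example, a hypothetical 800-word article against a 600–1000 word brief, with a quarter of its sentences revised, would be scored with `composite_score(0.25, True, 800, (600, 1000), 0.5)`. The inputs are made up; no such scores exist for this campaign.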

Delmar would sample at least five articles per model and assign a human-eval grade on a 1–5 scale. The eval ran for seven days.

What did seven days of shadow eval actually produce?

[Observed: jenn-direction-b-shadow-eval-2026-06-07.json]

Zero shadow articles were collected. The AirDelmar load-shadow routing infrastructure fired the start signal on 2026-06-07T13:33Z. Over the following seven days: runs_completed: 0, articles_scored: 0, cost_usd_so_far: 0.0.

No rubric scores were generated. No Delmar human-eval grades were collected. The campaign state file shows decision: null as of the end of the shadow window.

[Unobserved] We do not have a rubric comparison between Opus 4.7 and Sonnet 4.6 on real content work. The question of whether Opus 4.7 produces better articles remains unanswered by this eval.

The routing infrastructure did not generate the parallel article sessions the spec required. That is a harness failure, not a model-quality result. This eval did not determine whether Opus 4.7’s content output is better than, worse than, or indistinguishable from Sonnet 4.6’s.
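The distinction between a harness failure and a model-quality result can be made mechanical. The sketch below reads the counters quoted above (runs_completed, articles_scored, cost_usd_so_far, decision) and labels the campaign before anyone interprets it. It assumes the state file is a flat JSON object with those keys. The labels and the function name are hypothetical, not part of ModelClaw.

```python
import json


def classify_campaign(state: dict) -> str:
    """Label a campaign by what it actually produced (labels are illustrative)."""
    runs = state.get("runs_completed", 0)
    scored = state.get("articles_scored", 0)
    if runs == 0:
        # Nothing ran, so there is no model behaviour to evaluate.
        return "harness_failure"
    if scored == 0:
        # Runs completed but the rubric never scored them.
        return "scoring_failure"
    return "has_data"


# The values reported for this campaign at the end of the shadow window.
state = json.loads(
    '{"runs_completed": 0, "articles_scored": 0, '
    '"cost_usd_so_far": 0.0, "decision": null}'
)
print(classify_campaign(state))
```

A check like this, run partway through the window, would flag an empty campaign as a routing problem while there is still time to fix it.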

The decision

[Observed: Delmar decision, 2026-06-14T10:14Z]

Sonnet 4.6 stays in production. Delmar made the decision not to upgrade without any rubric data, because none was produced.

The reasoning is a burden-of-proof argument. Opus 4.7 costs roughly 10x more per content session at Jenn’s production volume. To justify that premium, there would need to be a clear and consistent quality improvement: not just the absence of a decline, but a measurable, reproducible advantage. The Phase 1 benchmark already pointed in the wrong direction (21/30 vs 28/30 on the task suite). The shadow eval was the place to find a counterargument. It did not produce one.

When the positive case for an upgrade doesn’t materialize and the cost penalty is real and large, the default holds.

Scoring Rigg’s predictions

Rigg filed three adversarial predictions on 2026-06-07, before any shadow data was collected.

Prediction 1 was that Opus 4.7 would require at least 30% more humanizer revision than Sonnet 4.6. Untestable — zero shadow articles were collected, so there is no humanizer delta.

Prediction 2 was that Opus 4.7 would overshoot short-form word count targets (600–1000 words) but hit long-form targets (2000–3000 words) better. Also untestable for the same reason.

Prediction 3 was that Delmar would stay on Sonnet 4.6. Correct, but not for the reasons Rigg modeled. Rigg expected a rubric comparison where Opus 4.7 showed some wins and some losses. The actual scenario was: no data. The prediction matched the outcome; the causal model underneath it was wrong.

What we still don’t know

The content-quality question is genuinely open. Agentic-core-v1 tells us Opus 4.7 is worse at tool-use tasks and roughly 10x more expensive per passing task. It says nothing reliable about whether its prose output is more substantive, more original, or easier to pass through a humanizer without large rewrites.

The shadow eval infrastructure produced nothing in seven days. If that gets fixed and a proper comparison runs, the Phase 1 priors say: don’t expect much. Opus 4.7 is not a model that punches above its weight on tasks requiring precision. Whether content writing is that kind of task or a different kind — that question stands.

[Speculation] The verbosity-as-feature hypothesis — that Opus 4.7’s thoroughness would translate to more substantive article drafts — has neither been confirmed nor ruled out. It would need a different eval setup: one where both models actually write the same articles, not one where the routing layer produces nothing.
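A setup where both models actually write the same articles could look something like the sketch below. It is not the AirDelmar routing layer, whose internals this article does not describe. The model callables, the PairedResult type, and the coverage check are all hypothetical names introduced for illustration. The points it shows are that every brief goes to both models, that a shadow failure never affects the production output, and that a run with no shadow outputs fails loudly instead of finishing silently.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class PairedResult:
    brief_id: str
    production_text: str
    shadow_text: Optional[str]  # None when the shadow call failed


def run_paired(
    briefs: Iterable[Tuple[str, str]],
    production: Callable[[str], str],
    shadow: Callable[[str], str],
) -> List[PairedResult]:
    """Send each (brief_id, prompt) to both models and keep the pair."""
    results = []
    for brief_id, prompt in briefs:
        production_text = production(prompt)  # the only output that is published
        try:
            shadow_text = shadow(prompt)  # captured for scoring, never published
        except Exception:
            shadow_text = None  # a shadow failure must not block production
        results.append(PairedResult(brief_id, production_text, shadow_text))
    return results


def require_shadow_coverage(results: List[PairedResult], minimum_pairs: int = 1) -> int:
    """Raise if fewer than minimum_pairs briefs produced a shadow output."""
    pairs = sum(1 for r in results if r.shadow_text is not None)
    if pairs < minimum_pairs:
        raise RuntimeError(
            f"only {pairs} shadow outputs collected; expected at least {minimum_pairs}"
        )
    return pairs
```

Calling require_shadow_coverage after the first batch of briefs, rather than at the end of the window, is the design choice that matters: it turns "zero articles in seven days" into an error on day one.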

Campaign data: jenn-direction-b-shadow-eval-2026-06-07.json in ModelClaw. Shadow eval spec: jenn-direction-b-shadow-eval-2026-06-07.json. Phase 1 article: Claude Opus 4.7 × agentic-core-v1.