agentic-core-v1: What We Actually Measure and Why

May 5, 2026 · methodology

Most benchmarks test what a model knows. agentic-core-v1 tests what it does under pressure — whether it can navigate a codebase, diagnose a problem, make targeted edits, and stop when it should stop. Not in theory. With real tool calls, real file I/O, and a checker that either accepts the output or rejects it.

We built the suite to establish a baseline before comparing models. The first run against Claude Sonnet 4.6 produced a clean result and one genuinely surprising failure mode — a task the model passed for the wrong reason, and only on the run where an environmental accident forced it toward the right answer. That kind of finding only surfaces when the benchmark is designed to catch it.

This article explains the structure: what the ten tasks test, why we structured the suite this way, and where it deliberately stops short.

The premise: ten tasks, three runs each

agentic-core-v1 is thirty runs: ten tasks, each executed three times with no shared state between runs. The 3× repetition is not padding. It’s the minimum needed to distinguish a model that solves a problem from a model that gets lucky.

A model that passes a task 3/3 is probably reliable on that task class. A model that passes 1/3 is telling you something: either the task contains a genuine ambiguity the model resolves differently each time, or the model’s approach is unstable and success depends on which path it happens to take on a given run. Both outcomes are information. A single run gives you neither.
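One way to read the per-task counts, as a minimal sketch. The function name `classify_task` and the three labels are illustrative assumptions, not part of the suite. The sample results mirror the task_02 and task_09 outcomes described in this article.

```python
def classify_task(results: list[bool]) -> str:
    """Label a task from its pass/fail results across repeated runs."""
    if not results:
        raise ValueError("need at least one run")
    passes = sum(results)
    if passes == len(results):
        return "stable pass"
    if passes == 0:
        return "stable fail"
    # Mixed results: task ambiguity or an unstable approach.
    return "unstable"


runs = {
    "task_02": [True, True, True],    # passed 3/3
    "task_09": [True, False, False],  # passed 1/3
}
labels = {task: classify_task(results) for task, results in runs.items()}
print(labels)
```

With a single run per task, the "unstable" branch can never trigger, which is the point of repeating each task.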

The task suite is also deliberately short. Each task runs in a constrained environment with a 13–15 turn budget. The goal is not to simulate full software engineering projects. It’s to probe specific capabilities in controlled conditions where the checker can give a deterministic pass/fail verdict.

The ten task categories

File operations and basic environment literacy

The simplest tasks in the suite test whether a model can read, write, and navigate a filesystem reliably. These are table-stakes capabilities. If a model can’t write a file to the right path or read a config correctly, nothing else works. We include them not because they’re interesting but because they fail in surprising ways when they do fail, and we want a clean baseline.

Code refactoring: task_02_refactor_duplicated_code

task_02_refactor_duplicated_code gives the model a Python file with obvious duplication, the kind a junior developer would notice on first read. The task: refactor it. The checker validates the final state of the file.

What we’re probing: can the model reason about code structure, make targeted edits, and leave the surrounding code intact? This turns out to be harder than it sounds. Models that make broad rewrites sometimes break adjacent functionality. Models that are too conservative leave the duplication in place.

One pattern we observed: 3/3 runs on task_02 showed tool_call_redundancy. The model called fs_write on the same file with identical content in consecutive turns. The refactor was correct each time. The duplicate write didn’t change anything (idempotent operation), so all three passed. But the pattern is worth tracking: something about having just written the file caused the model to write it again.
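A detector for this pattern could look roughly like the sketch below. It assumes a trace is a list of dicts with `tool` and `args` keys, one call per turn. That trace format, the `fs_read` tool name, and the function name are hypothetical; only `fs_write` appears in the runs described above.

```python
import json


def find_redundant_calls(calls: list[dict]) -> list[int]:
    """Return indices of calls that exactly repeat the immediately preceding call."""
    redundant = []
    for i in range(1, len(calls)):
        # Serialize with sorted keys so argument order does not affect equality.
        prev = json.dumps(calls[i - 1], sort_keys=True)
        curr = json.dumps(calls[i], sort_keys=True)
        if prev == curr:
            redundant.append(i)
    return redundant


# Hypothetical trace: a read, then the same write issued twice in a row.
trace = [
    {"tool": "fs_read", "args": {"path": "app.py"}},
    {"tool": "fs_write", "args": {"path": "app.py", "content": "def helper(): ..."}},
    {"tool": "fs_write", "args": {"path": "app.py", "content": "def helper(): ..."}},
]
print(find_redundant_calls(trace))  # [2]
```

Comparing only adjacent calls keeps the check strict: a later re-write with different content, or a re-read after an intervening edit, is not flagged.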

Log investigation: task_03_investigate_log

task_03_investigate_log provides a 500-line log file and asks the model to identify what went wrong. It’s the most expensive task in the suite ($0.13/run average) because the log drives a large input context: 118,046 input tokens across the three runs.

What we’re probing: can the model navigate a large, noisy document, identify relevant signal, and produce a diagnosis rather than a summary? Summarization and diagnosis are different skills. A summary lists what’s in the log. A diagnosis identifies the cause.

Tracing through a codebase: task_04_trace_through_codebase

task_04_trace_through_codebase is the slowest task in the suite, averaging 56.6 seconds per run, because it involves reading five source files, often multiple times, to trace a value through a call chain. It also averaged 18.3 tool calls per run, the highest in the suite.

What we’re probing: multi-hop reasoning across files. Can the model hold a partial trace in context, know which file to read next, and synthesize a complete answer without losing track of earlier findings? This is a realistic proxy for navigating an unfamiliar codebase.

Multi-step planning: task_07_multi_step_plan

task_07_multi_step_plan is the cheapest task ($0.012/run average), which is counterintuitive given the name. The task requires generating a structured plan with dependencies. The checker validates structure and completeness, not prose quality.

What we’re probing: does the model sequence steps correctly and make the dependency graph explicit? A model that produces a plausible-sounding plan with the wrong ordering is worse than one that produces a minimal correct plan. The cheap cost reflects short output, not simple reasoning.
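To make "correct sequencing" concrete, here is a rough sketch of an ordering check. The plan format (a list of steps with `id` and `depends_on` fields), the step names, and the function are assumptions for illustration. The actual checker's input format is not described in this article.

```python
def plan_order_is_valid(plan: list[dict]) -> bool:
    """Check that every dependency of a step appears earlier in the plan."""
    seen = set()
    for step in plan:
        # A dependency that is missing entirely also fails this test,
        # because it can never have been seen.
        if any(dep not in seen for dep in step["depends_on"]):
            return False
        seen.add(step["id"])
    return True


# Hypothetical three-step plan.
good_plan = [
    {"id": "backup", "depends_on": []},
    {"id": "migrate", "depends_on": ["backup"]},
    {"id": "verify", "depends_on": ["migrate"]},
]
# Same steps, but "migrate" is listed before the step it depends on.
bad_plan = [good_plan[1], good_plan[0], good_plan[2]]

print(plan_order_is_valid(good_plan))  # True
print(plan_order_is_valid(bad_plan))   # False
```

Both plans contain the same steps and would read as equally plausible prose. Only the ordering check separates them.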

Knowing when to stop: task_09_know_when_to_stop

This is the task worth writing home about, and we wrote a separate piece on it. Here’s the short version.

task_09_know_when_to_stop asks the model to compute a 10-day moving average of revenue from a CSV file. The file has three rows. A 10-day window on three data points is structurally undefined without a min_periods policy.

Run 1 passed, not because the model reasoned through the ambiguity, but because pandas wasn’t available and the fallback implementation used min_periods=1. The error-forced path produced a compliant answer: compute what you can, document why.

Runs 2 and 3 had the same data, saw the same situation, chose the strict NaN path (technically defensible), and then couldn’t exit. Both runs hit the 13-turn limit, looping on re-reads of data.csv whose content they already had.
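The two policies can be compared side by side in plain Python. This is a sketch under assumptions: the revenue values are made up (the real data.csv is not reproduced here), and `moving_average` is an illustrative helper, not the fallback the model actually wrote. Its default follows the pandas convention for fixed windows, where `min_periods` defaults to the window size.

```python
import math
from typing import Optional


def moving_average(values: list[float], window: int,
                   min_periods: Optional[int] = None) -> list[float]:
    """Trailing moving average; positions with too few points yield NaN."""
    if min_periods is None:
        min_periods = window  # strict policy: require a full window
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        if len(chunk) >= min_periods:
            out.append(sum(chunk) / len(chunk))
        else:
            out.append(math.nan)
    return out


revenue = [100.0, 200.0, 300.0]  # hypothetical three-row input

strict = moving_average(revenue, window=10)                  # strict NaN path
lenient = moving_average(revenue, window=10, min_periods=1)  # compute what you can

print(strict)   # [nan, nan, nan]
print(lenient)  # [100.0, 150.0, 200.0]
```

Either output is defensible as long as the policy is stated. What the failed runs lacked was not a valid answer but a decision to report it and exit.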

The task name is “know when to stop.” Run 1 stopped at the right place by accident. Runs 2 and 3 didn’t stop when they should have. The model’s stated awareness (“only 3 data points, fewer than the 10-day window”) appeared before the NaN output in both failed runs. Knowing a fact about the situation and acting on it appropriately are not the same thing.

The other tasks

Tasks 01, 05, 06, 08, and 10 cover additional ground: environment setup, file manipulation at scale, output formatting, conditional logic, and edge-case handling. Four of the five had zero redundant calls. The exception is task_08 run 2, which showed tool_call_redundancy: it re-read data.txt in consecutive turns after already having the file, and the run still passed. Zero tasks in this group showed diagnosis_then_regression (a pattern where the model states a diagnosis, then contradicts it). Zero runs exceeded 12 turns outside the task_09 failures. The baseline capability profile is clean.

What the suite doesn’t test

agentic-core-v1 tests model behavior under controlled conditions with deterministic checkers. That’s a strength and a constraint.

It does not test: ambiguous acceptance criteria (all checkers have clear pass/fail logic), large-scale code changes (tasks are scoped to files, not repositories), or multi-session state (each run starts fresh). The $0.048 per-run cost and sub-60-second latency reflect short, bounded tasks; neither figure should be extrapolated to real engineering work.

We also don’t know whether the 93.3% pass rate (28 of 30 runs) holds on harder task variants. The task_09 finding suggests that model performance can be fragile in specific ways that only show up across repeated runs. That’s exactly why we run 3× per task.

Why this structure

Ten tasks, three runs, deterministic checkers. Simple enough to reproduce. Structured enough to give signal. The 3× repetition surfaces instability that single-run benchmarks hide. The task categories span the capability dimensions that matter for agentic workflows: navigation, transformation, investigation, planning, and exit conditions.

The goal is a benchmark that catches failure modes that matter in production: agents getting stuck, agents making redundant calls, agents knowing facts but not acting on them. agentic-core-v1 is version one. The task_09 finding will inform what version two looks like.

Frequently Asked Questions

What is agentic-core-v1?

agentic-core-v1 is a software engineering benchmark for AI agents. It consists of ten tasks, each executed three times, for a total of thirty runs per model. Tasks involve real tool calls (file reads, file writes, code analysis) with deterministic pass/fail checkers. A model scores out of 30.

How does agentic-core-v1 scoring work?

Each task runs three times with no shared state between runs. Each run either passes or fails. The final score is the number of passing runs out of 30. A model that passes all three runs on a task scores 3/3. A model that fails all three scores 0/3. No partial credit.
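The arithmetic is simple enough to sketch. The `results` structure, the shortened task keys, and the `total_score` name are illustrative assumptions. The sample data reproduces the Claude Sonnet 4.6 baseline described in this article: every task at 3/3 except task_09 at 1/3.

```python
def total_score(results: dict[str, list[bool]]) -> tuple[int, int]:
    """Count passing runs and total runs; each run is pass or fail, no partial credit."""
    passed = sum(sum(runs) for runs in results.values())
    total = sum(len(runs) for runs in results.values())
    return passed, total


# Ten tasks, three runs each; keys are shortened task names.
results = {f"task_{i:02d}": [True, True, True] for i in range(1, 11)}
results["task_09"] = [True, False, False]

passed, total = total_score(results)
print(f"{passed}/{total} = {passed / total:.1%}")  # 28/30 = 93.3%
```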

What is task_09 and why does it matter?

task_09 (know_when_to_stop) asks a model to compute a 10-day moving average from a CSV file with only three rows — a structurally underspecified problem. It tests whether a model can recognise an ambiguous situation, make a defensible decision, and exit cleanly. Across 35+ tested models, task_09 is the single most common failure point: models typically know the situation is ambiguous but loop repeatedly without exiting.

Why run each task three times?

A single run cannot distinguish a reliable model from a lucky one. Three runs is the minimum needed to detect instability: a model that passes 3/3 is probably reliable on that task class, while a model that passes 1/3 signals either task ambiguity or an unstable approach. The 3× repetition separates agentic-core-v1 from single-shot benchmarks.

What does agentic-core-v1 NOT test?

agentic-core-v1 does not test ambiguous acceptance criteria (all checkers have clear pass/fail logic), large-scale code changes across repositories, or multi-session state. Each run starts fresh and is bounded to a 13–15 turn budget. It is designed to probe specific capabilities in controlled conditions, not to simulate full software engineering projects.

What is the current top score on agentic-core-v1?

Claude Opus 4.8 and DeepSeek V4 Pro have achieved 30/30 (perfect score). Mistral Small 4 scored 29/30. Claude Sonnet 4.6 and Claude Haiku 4.5 have scored 27–28/30. The benchmark leaderboard is updated as new campaigns complete.