Jamba 1.5 Large diagnosed the bugs. It did not fix any of them.

Campaign: 2026-05-22-ai21-jamba-1-5-large-agentic-core-v1
Model: AI21 Jamba 1.5 Large (ai21.jamba-1-5-large-v1:0, AWS Bedrock, us-east-1, ON_DEMAND)
Architecture: SSM-Transformer hybrid — 398B parameters, Mamba + Attention layers
Harness: agentic-core-v1 (10 tasks × 3 runs = 30 total)
Campaign date: 2026-05-22


AI21 positions Jamba 1.5 Large explicitly for enterprise agentic workflows. It is the company’s flagship model: a 398-billion-parameter SSM-Transformer hybrid that combines Mamba state-space layers with standard attention layers. The architecture is the first of its kind in this dataset, and the pre-run prediction was 19–25/30 based on the enterprise positioning and model scale.

Jamba 1.5 Large scored 8/30 (26.67%) at $0.0044 per passing run. F4 was triggered: the score fell well below the predicted range of 19–25. Average latency was 4.5 seconds per run. 2/4 predictions were correct.

The score is last in the dataset. The story behind it is more specific than the number alone suggests.


What the harness asks

[Observed — harness spec]

Ten tasks, three independent runs each, thirty runs total. agentic-core-v1 covers software engineering work a deployed agent would encounter in practice: fix a failing test, refactor duplicated code, investigate a log file, trace execution through a codebase, make a targeted minimal fix, handle an ambiguous requirement, execute a multi-step sequential plan, recover from an injected tool error, detect when a requested computation is impossible, and run a SQL investigation.

Two tasks have structural complexity worth flagging before the results. task_09 presents a three-row CSV and asks for a ten-day moving average. Three data points cannot support a ten-day window; the correct response is to recognize the impossibility and refuse. task_08 deliberately injects a file-not-found error on the first tool call, requiring the model to detect it, locate the correct path, and produce verified output.
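The impossibility check that task_09 looks for can be sketched in a few lines. This is a minimal illustration, not the harness's grader: the function name `moving_average` and the sample values are hypothetical, and only the three-row input and the ten-day window come from the task description.

```python
def moving_average(values, window):
    """Return simple moving averages, or raise if the window cannot be filled."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        # Refuse rather than silently averaging over a shorter window.
        raise ValueError(
            f"need at least {window} data points, got {len(values)}"
        )
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical three-row input, mirroring the shape of the task_09 CSV.
rows = [10.0, 12.0, 11.0]

try:
    result = moving_average(rows, window=10)
except ValueError as exc:
    result = f"cannot compute: {exc}"

print(result)
```

A model that passes task_09 does the equivalent of the length check before producing any number.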

A pass requires correct task completion. Failure modes are classified: wrong_answer, gave_up_mid_plan, tool_call_hallucinated, tool_call_malformed.


What happened

[Observed — data pack per_task_results]

| Task | Score | Notes |
|---|---|---|
| task_01 fix_failing_test | 0/3 | Found the bug, never wrote the fix |
| task_02 refactor_duplicated_code | 0/3 | Narrated refactor plan, stopped before execution |
| task_03 investigate_log | 1/3 | Partially succeeded on 1 run |
| task_04 trace_through_codebase | 0/3 | Zero tool calls; hallucinated walkthrough inline |
| task_05 minimal_fix | 0/3 | Read files, described fix, never applied it |
| task_06 handle_ambiguous_requirement | 1/3 | 1 run produced a valid clarification response |
| task_07 multi_step_plan | 3/3 | Clean; created 4 files in sequence |
| task_08 recover_from_tool_error | 3/3 | Clean; wrote content to file after error recovery |
| task_09 know_when_to_stop | 0/3 | Confident wrong numeric output |
| task_10 sql_investigation | 0/3 | Failed to produce correct SQL analysis |

Total: 8/30. $0.0355 total campaign cost. $0.0044/pass.

Six of the ten tasks failed completely. The two perfect scores come from tasks with a specific structural property discussed below. 22 of the 22 failure-mode classifications were wrong_answer. No planning breakdowns, no hallucinated tool calls, no malformed outputs. The model completed every task attempt; it just did not complete the task.
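The headline figures follow directly from the per-task scores and the total campaign cost. The short check below reproduces them; the variable names are illustrative, and the inputs are copied from the results above.

```python
# Passing runs per task, copied from the results table (3 runs per task).
passes = {
    "task_01": 0, "task_02": 0, "task_03": 1, "task_04": 0, "task_05": 0,
    "task_06": 1, "task_07": 3, "task_08": 3, "task_09": 0, "task_10": 0,
}
RUNS_PER_TASK = 3
TOTAL_COST_USD = 0.0355  # total campaign cost reported above

total_runs = RUNS_PER_TASK * len(passes)
total_passes = sum(passes.values())
failed_runs = total_runs - total_passes
zero_pass_tasks = [task for task, p in passes.items() if p == 0]

pass_rate = total_passes / total_runs
cost_per_pass = TOTAL_COST_USD / total_passes

print(f"{total_passes}/{total_runs} = {pass_rate:.2%}")
print(f"failed runs: {failed_runs}, tasks at 0/3: {len(zero_pass_tasks)}")
print(f"cost per pass: ${cost_per_pass:.4f}")
```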


The read-analyze-narrate pattern

[Observed — data pack diagnosis_then_regression, task_01_results, task_02_results, task_05_results]

Across three of the six fully failed tasks (task_01, task_02, task_05), and in the failed runs of task_06, Jamba 1.5 Large exhibits a consistent pattern: it identifies the correct fix, describes it clearly in prose, and stops without executing it.

Transcript evidence from task_01 (fix failing test):

“Let’s fix it.” The model narrates the next step and terminates before taking it. The same structure appears in task_02 (refactor), task_05 (minimal fix), and task_06 (ambiguous requirement). Read the relevant files, narrate the diagnosis in correct detail, stop. The action phase of the agentic loop never begins.

This is not ambiguity about what the task requires. The model demonstrates it understood the problem. The tool post-training did not close the loop between “here is the correct answer” and “now write it.”


task_04: the most diagnostic data point

[Observed — data pack task_04_results, tool_calls_by_task]

task_04 asks for a complete call-chain trace from entry() to report(), written to trace.txt. This requires reading three source files.

Jamba 1.5 Large used zero tool calls across all three runs. The model wrote out a pseudo-code walkthrough of the call chain from its training knowledge, without reading any file. The walkthrough is fluent. It is also fabricated. The code in the task repository does not match what the model described.

task_04 is the clearest signal in the campaign. The other failed tasks at least show correct file-reading behavior followed by premature termination. task_04 shows the model deciding it does not need to read the files at all.
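The distinction between "read, then stopped" and "never read at all" can be expressed as a check over a run's tool-call log. The sketch below assumes a hypothetical log format (a list of tool names in call order) and a hypothetical `fs_read` tool name alongside `fs_write`. It is not the harness's classifier.

```python
def classify_run(tool_calls):
    """Label a run by the shape of its tool-call sequence.

    tool_calls is a list of tool names in call order, for example
    ["fs_read", "fs_read", "fs_write"]. The format is hypothetical.
    """
    if not tool_calls:
        return "no_tools"            # answered without touching the repo
    reads = [i for i, name in enumerate(tool_calls) if name == "fs_read"]
    writes = [i for i, name in enumerate(tool_calls) if name == "fs_write"]
    if reads and not writes:
        return "read_no_write"       # diagnosed, never applied a change
    if writes and not reads:
        return "write_only"
    if reads and writes and min(reads) < max(writes):
        return "read_then_write"     # at least one write after a read
    return "other"


print(classify_run([]))                        # no_tools
print(classify_run(["fs_read", "fs_read"]))    # read_no_write
print(classify_run(["fs_write", "fs_write"]))  # write_only
print(classify_run(["fs_read", "fs_write"]))   # read_then_write
```

Under this labeling, the task_04 runs would fall under `no_tools`, while the diagnosis-without-fix runs would fall under `read_no_write`.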


The two tasks that passed

[Observed — data pack task_07_results, task_08_results]

task_07 (multi-step plan) and task_08 (recover from tool error) both scored a clean 3/3. Neither required the model to read existing code and then modify it.

task_07 asks for four new files to be created in sequence: a plan document, a directory structure, a manifest, and a summary. All creation, no modification. Jamba executed each step correctly, in order, across all three runs.

task_08 injects a file-not-found error on the first tool call and asks the model to write a number to a file. The model caught the error, located the correct path, and wrote the output. Again: no existing code to analyze and modify.
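The recovery behavior task_08 rewards has roughly the following shape. This is a sketch under stated assumptions: the helper `read_with_recovery`, the directory layout, and the file contents are all made up for illustration, and the harness's actual tools and paths are not shown in this write-up.

```python
import tempfile
from pathlib import Path


def read_with_recovery(root, requested):
    """Read root/requested; on FileNotFoundError, search root for the same file name."""
    target = Path(root) / requested
    try:
        return target.read_text(), target
    except FileNotFoundError:
        # Recovery step: look for a file with the same name elsewhere under root.
        matches = sorted(Path(root).rglob(target.name))
        if not matches:
            raise
        return matches[0].read_text(), matches[0]


# Demo in a throwaway directory; all paths and contents are made up.
with tempfile.TemporaryDirectory() as tmp:
    real = Path(tmp) / "data" / "input.txt"
    real.parent.mkdir()
    real.write_text("42")

    text, found = read_with_recovery(tmp, "input.txt")  # wrong path on purpose
    out = Path(tmp) / "answer.txt"
    out.write_text(text)            # the write step that completes the task
    recovered = out.read_text()
    found_rel = found.relative_to(tmp).as_posix()

print(recovered, found_rel)
```

The point of the sketch is the last step: the error is handled and the output is then actually written, which is the step the failed tasks never reach.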

The pattern separates cleanly along task type. Write-only tasks pass. Read-modify-write tasks fail. The model’s tool post-training appears to have taught it to generate new content but not to execute the read-then-modify loop that most code-level agentic tasks require.


The architecture angle

[Observed — architecture spec; Speculation — post-training inference]

Jamba 1.5 Large is the first SSM-Transformer hybrid in this dataset. The pre-run hypothesis was that Mamba layers might cause state-tracking degradation across multi-turn tool exchanges, specifically by losing earlier tool results when processing long contexts.

That hypothesis did not hold. The observed failures are not state degradation. The model correctly tracked file contents it read, correctly described bugs it found, correctly narrated refactor steps in sequence. Conversational coherence is intact. The Mamba recurrence appears sufficient for the analysis phase of agentic tasks.

The failure is in the post-training gap between analysis and execution. The model was not trained to close the agentic loop. It does not follow “here is what should change” with an fs_write call that changes it. That is a training decision, not an architecture constraint. An attention-only model with the same post-training gap would produce the same failure mode.

[Speculation]

AI21 markets Jamba 1.5 Large for enterprise document processing, Q&A, and retrieval-augmented generation. Those use cases do not require the read-modify-write loop agentic-core-v1 tests. The model performs the diagnostic half of agentic work well. The “agentic” in AI21’s product positioning appears to cover a narrower capability surface than the harness tests.


Amazon Nova Pro comparison

[Observed — cross-campaign data]

Both Jamba 1.5 Large and Amazon Nova Pro are enterprise-positioned models with explicit agentic claims. Amazon Nova Pro scored 20/30 on the same harness two campaigns earlier. The gap is 12 passing runs at roughly similar per-run pricing.

Nova Pro’s failures were concentrated on two specific tasks (task_02 and task_04) with diagnosable causes. Jamba’s failures are systematic across six tasks. The difference is not architecture. Nova Pro is a standard Transformer. The difference is post-training depth for the read-modify-write loop. Nova Pro was trained to close the loop and gets the answer wrong. Jamba was not trained to close the loop at all.

Both models are positioned for enterprise agentic deployment. The scoring gap between them suggests substantially different training investments in code-level tool execution.


Cost in context

[Observed — cross-campaign data, pricing documentation]

At $0.0044/pass, Jamba 1.5 Large is expensive relative to what it delivers:

| Model | Score | $/pass | Notes |
|---|---|---|---|
| GLM-4.7 | 28/30 | $0.0038 | 3.5x the score, similar cost |
| Amazon Nova Pro | 20/30 | $0.0068 | 2.5x the score, somewhat costlier |
| GLM-4.7-Flash | 25/30 | $0.000565 | 3x the score, 8x cheaper |
| GPT-OSS-20B | 25/30 | $0.000481 | 3x the score, 9x cheaper |
| Jamba 1.5 Large | 8/30 | $0.0044 | |
| Nemotron Nano 3 30B | 10/30 | $0.00079 | Scores 2 pts higher, 6x cheaper |

The $0.0355 total campaign cost is low in absolute terms. The per-pass cost is not competitive because the denominator is so small. GLM-4.7 and GLM-4.7-Flash both deliver substantially more passing tasks at comparable or lower per-pass cost.
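The multipliers in the Notes column can be recomputed from the Score and $/pass columns. The snippet below does that using the table's figures; the variable names are illustrative. The table rounds to whole or half multiples.

```python
# (score out of 30, dollars per pass), copied from the table above.
models = {
    "GLM-4.7": (28, 0.0038),
    "Amazon Nova Pro": (20, 0.0068),
    "GLM-4.7-Flash": (25, 0.000565),
    "GPT-OSS-20B": (25, 0.000481),
    "Nemotron Nano 3 30B": (10, 0.00079),
}
JAMBA_SCORE, JAMBA_COST = 8, 0.0044

ratios = {}
for name, (score, cost) in models.items():
    score_ratio = score / JAMBA_SCORE   # multiple of Jamba's score
    cost_ratio = JAMBA_COST / cost      # above 1: cheaper per pass than Jamba
    ratios[name] = (round(score_ratio, 2), round(cost_ratio, 1))
    print(f"{name}: score x{score_ratio:.2f}, "
          f"Jamba cost / model cost = {cost_ratio:.1f}")
```

A cost ratio below 1, as for Amazon Nova Pro, means the model is more expensive per pass than Jamba despite the higher score.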


Leaderboard position

[Observed — leaderboard, cross-campaign data]

| Model | Score | $/pass | Lab |
|---|---|---|---|
| GLM-4.7 | 28/30 | $0.0038 | Zhipu AI |
| DeepSeek V4 Flash | 28/30 | $0.0015 | DeepSeek |
| Claude Sonnet 4.6 | 28/30 | $0.0514 | Anthropic |
| GPT-5.5 | 27/30 | $0.0699 | OpenAI |
| Devstral 2 | 27/30 | $0.0020 | Mistral |
| Mistral Large 3 | 27/30 | $0.0021 | Mistral |
| MiniMax M2.5 | 27/30 | $0.0024 | MiniMax |
| GLM-5 | 27/30 | $0.0065 | Zhipu AI |
| GPT-OSS-20B | 25/30 | $0.000481 | OpenAI |
| GLM-4.7-Flash | 25/30 | $0.000565 | Zhipu AI |
| Kimi K2.5 | 24/30 | $0.0044 | Moonshot AI |
| GPT-OSS-120B | 23/30 | $0.0013 | OpenAI |
| Qwen3 32B | 23/30 | $0.0010 | Alibaba |
| Qwen3-Coder 30B A3B | 22/30 | $0.0018 | Alibaba |
| Qwen3 Next 80B A3B | 21/30 | $0.0012 | Alibaba |
| Amazon Nova Pro | 20/30 | $0.0068 | Amazon |
| DeepSeek V3.2 | 19/30 | | DeepSeek |
| Llama 3.3 70B | 14/30 | $0.0047 | Meta |
| Nemotron Super 3 120B | 12/30 | $0.0016 | NVIDIA |
| Qwen3 VL 235B A22B | 12/30 | $0.0050 | Alibaba |
| Nemotron Nano 3 30B | 10/30 | $0.00079 | NVIDIA |
| Jamba 1.5 Large | 8/30 | $0.0044 | AI21 |

Last in the dataset. First SSM-Transformer hybrid. Worst result for any 100B+ model tested so far. Nemotron Nano 3 30B (10/30) is a 30B dense model that scored higher. The architectural scale advantage of a 398B parameter count did not translate.


Failure mode summary

[Observed — data pack failure_mode_histogram]

| Failure mode | Count | Tasks |
|---|---|---|
| wrong_answer (diagnosis without fix) | 9 | task_01, task_02, task_05 (all 3 runs each) |
| wrong_answer (no tool calls, hallucinated trace) | 3 | task_04 (all 3 runs) |
| wrong_answer (SQL analysis incomplete) | 3 | task_10 (all 3 runs) |
| wrong_answer (log investigation partial) | 2 | task_03 (2 failed runs) |
| wrong_answer (ambiguity handling incomplete) | 2 | task_06 (2 failed runs) |
| wrong_answer (impossible task) | 3 | task_09 (all 3 runs) |

Zero gave_up_mid_plan. Zero tool_call_hallucinated. Zero tool_call_malformed. The model executed correctly from a mechanics standpoint. Every failure is the model producing an answer that does not satisfy the completion criterion, not a structural breakdown in the tool loop.
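The histogram can be rebuilt from the per-task scores, taking each task's failed runs as 3 minus its passes. In the tally below, the sub-labels are the ones used in the summary and the four mode names come from the harness's classification list. The data structure itself is an assumption for illustration, not the data pack's schema.

```python
from collections import Counter

MODES = ["wrong_answer", "gave_up_mid_plan",
         "tool_call_hallucinated", "tool_call_malformed"]

# (task, failed runs, sub-label); failed runs = 3 minus the task's passes.
failures = [
    ("task_01", 3, "diagnosis without fix"),
    ("task_02", 3, "diagnosis without fix"),
    ("task_05", 3, "diagnosis without fix"),
    ("task_04", 3, "no tool calls, hallucinated trace"),
    ("task_10", 3, "SQL analysis incomplete"),
    ("task_03", 2, "log investigation partial"),
    ("task_06", 2, "ambiguity handling incomplete"),
    ("task_09", 3, "impossible task"),
]

by_label = Counter()
by_mode = Counter({mode: 0 for mode in MODES})
for task, runs, label in failures:
    by_label[label] += runs
    by_mode["wrong_answer"] += runs   # every failed run carries this mode

for label, count in by_label.items():
    print(f"{count:2d}  {label}")
print(dict(by_mode))
```

The sub-label counts sum to the 22 failed runs, with three tasks at three failed runs each in the diagnosis-without-fix group.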


Execution profile

[Observed — data pack run_metrics, latency_distribution]

The execution mechanics are well-behaved. The problems are in the content of what the model produces, specifically the absence of the write step after the analysis step.


What builders need to know

[Speculation]

If you are evaluating Jamba 1.5 Large for production agentic workflows:

The read-modify-write loop is unreliable. Any task that requires the model to read existing content and then modify it is high-risk. The failure pattern is consistent and systematic: correct diagnosis, no execution. If your workflow involves code modification, configuration updates, or file editing based on what the model reads, expect this pattern.

Write-only tasks are a strength. Creating new documents, drafting content, generating sequential plan files: these passed cleanly. If your agentic use case is primarily content generation without a feedback loop into existing code, Jamba 1.5 Large performs substantially better than this score suggests.

Do not rely on it for code tracing. task_04 showed the model fabricating a call-chain walkthrough without reading the source files. Any workflow requiring accurate analysis of existing codebases carries the same risk.

SQL and log investigation are unreliable at this training level. Both task_03 (1/3) and task_10 (0/3) point to inconsistent structured-data analysis. Q&A over logs and databases may produce the same partial-success pattern.

The architecture is not the bottleneck here. The failure mode is a post-training gap, and AI21 may close it in future versions. This campaign is a snapshot of Jamba 1.5 Large specifically.


Predictions

[Observed — predictions file]

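| Prediction | Expected | Actual | Verdict |
|---|---|---|---|
| P1 Score range | 19–25/30 | 8/30 | Miss |
| P2 task_09 score | 0/3 | 0/3 | Correct |
| P3 task_07 score | >=2/3 | 3/3 | Correct |
| P4 task_03 score | >=2/3 | 1/3 | Miss |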

2/4 correct.

P1 missed by a wide margin. The prediction was 19–25/30; the actual was 8/30. The F4 threshold was 15; Jamba fell below even that floor. The failure mode I was watching for, SSM state degradation across multi-turn exchanges, did not materialize. What appeared instead was a post-training gap at a level not suggested by the enterprise positioning or the model scale.

P4 miss: task_03 was predicted to score >=2/3 based on the assumption that Mamba recurrence would support log-pattern identification. 1/3 suggests the model parsed the log correctly some of the time but failed to structure the required output format consistently.

P2 and P3 were correctly called. task_09 impossibility recognition is a consistent failure pattern across non-reasoning models in this dataset. task_07 file-creation tasks have been strong for models that otherwise struggle with code-modification loops.
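The verdicts reduce to four simple comparisons. The snippet below encodes each prediction as a predicate over the actual score; the structure and names are illustrative, and the thresholds and actuals are the ones from the predictions table.

```python
# (label, check on the actual value, actual value)
predictions = [
    ("P1 score range 19-25/30", lambda x: 19 <= x <= 25, 8),
    ("P2 task_09 score 0/3", lambda x: x == 0, 0),
    ("P3 task_07 score >=2/3", lambda x: x >= 2, 3),
    ("P4 task_03 score >=2/3", lambda x: x >= 2, 1),
]

verdicts = {label: check(actual) for label, check, actual in predictions}
correct = sum(verdicts.values())

for label, ok in verdicts.items():
    print(f"{label}: {'correct' if ok else 'miss'}")
print(f"{correct}/{len(predictions)} correct")
```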

ClawWorks Weekly

AI benchmarks, trading bots, and security research — what's actually happening this week.