Jamba 1.5 Large diagnosed the bugs. It did not fix any of them.
Campaign: 2026-05-22-ai21-jamba-1-5-large-agentic-core-v1
Model: AI21 Jamba 1.5 Large (ai21.jamba-1-5-large-v1:0, AWS Bedrock, us-east-1, ON_DEMAND)
Architecture: SSM-Transformer hybrid — 398B parameters, Mamba + Attention layers
Harness: agentic-core-v1 (10 tasks × 3 runs = 30 total)
Campaign date: 2026-05-22
AI21 positions Jamba 1.5 Large explicitly for enterprise agentic workflows. It is the company’s flagship model: a 398-billion parameter SSM-Transformer hybrid combining Mamba state-space layers with standard Attention. The architecture is the first of its kind in this dataset, and the pre-run prediction was 19–25/30 based on the enterprise positioning and model scale.
Jamba 1.5 Large scored 8/30 (26.67%) at $0.0044 per passing task. F4 was triggered: the score fell below the predicted lower bound of 19. Average latency was 4.5 seconds per run. 2/4 predictions were correct.
The score is last in the dataset. The story behind it is more specific than the number alone suggests.
What the harness asks
[Observed — harness spec]
Ten tasks, three independent runs each, thirty runs total. agentic-core-v1 covers software engineering work a deployed agent would encounter in practice: fix a failing test, refactor duplicated code, investigate a log file, trace execution through a codebase, make a targeted minimal fix, handle an ambiguous requirement, execute a multi-step sequential plan, recover from an injected tool error, detect when a requested computation is impossible, and run a SQL investigation.
Two tasks have structural complexity worth flagging before the results. task_09 presents a three-row CSV and asks for a ten-day moving average. Three data points cannot support a ten-day window; the correct response is to recognize the impossibility and refuse. task_08 deliberately injects a file-not-found error on the first tool call, requiring the model to detect it, locate the correct path, and produce verified output.
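The task_09 check can be made concrete with a small sketch. This is not the harness's grader; the function and exception names are hypothetical, and it only illustrates why a three-row series cannot yield a ten-day moving average.

```python
from typing import Sequence


class InsufficientDataError(ValueError):
    """Raised when the requested window is larger than the series."""


def moving_average(values: Sequence[float], window: int) -> list[float]:
    """Simple moving average that refuses windows the data cannot support."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(values) < window:
        # This is the branch task_09 is probing for: refuse, do not guess.
        raise InsufficientDataError(
            f"need at least {window} data points, got {len(values)}"
        )
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Three rows, ten-day window: the only correct outcome is a refusal.
try:
    moving_average([10.0, 11.0, 12.0], window=10)
    refused = False
except InsufficientDataError:
    refused = True

print("refused:", refused)
```

A model that returns any number here has skipped the length check that the code makes explicit.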
A pass requires correct task completion. Failure modes are classified: wrong_answer, gave_up_mid_plan, tool_call_hallucinated, tool_call_malformed.
What happened
[Observed — data pack per_task_results]
| Task | Score | Notes |
|---|---|---|
| task_01 fix_failing_test | 0/3 | Found the bug, never wrote the fix |
| task_02 refactor_duplicated_code | 0/3 | Narrated refactor plan, stopped before execution |
| task_03 investigate_log | 1/3 | Passed 1 run; the other 2 were partial investigations |
| task_04 trace_through_codebase | 0/3 | Zero tool calls; hallucinated walkthrough inline |
| task_05 minimal_fix | 0/3 | Read files, described fix, never applied it |
| task_06 handle_ambiguous_requirement | 1/3 | 1 run produced a valid clarification response |
| task_07 multi_step_plan | 3/3 | Clean; created 4 files in sequence |
| task_08 recover_from_tool_error | 3/3 | Clean; wrote content to file after error recovery |
| task_09 know_when_to_stop | 0/3 | Confident wrong numeric output |
| task_10 sql_investigation | 0/3 | Failed to produce correct SQL analysis |
Total: 8/30. $0.0355 total campaign cost. $0.0044/pass.
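The headline figures follow directly from the totals. A minimal sketch of the arithmetic, using only numbers reported in this campaign (variable names are illustrative):

```python
passes, runs = 8, 30
total_cost_usd = 0.0355

pass_rate = passes / runs
cost_per_pass = total_cost_usd / passes

print(f"pass rate: {pass_rate:.2%}")
print(f"cost per pass: ${cost_per_pass:.4f}")
```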
Six of the ten tasks failed completely. The two perfect scores come from tasks with a specific structural property discussed below. All 22 failures were classified as wrong_answer. No planning breakdowns, no hallucinated tool calls, no malformed outputs. The model ran every attempt to the end; it just did not produce a passing result.
The read-analyze-narrate pattern
[Observed — data pack diagnosis_then_regression, task_01_results, task_02_results, task_05_results]
Across task_01, task_02, and task_05, and again in the failed runs of task_06, Jamba 1.5 Large exhibits a consistent pattern: it identifies the correct fix, describes it clearly in prose, and stops without executing it.
Transcript evidence from task_01 (fix failing test):
- Turn 1: `fs_read(tests/test_add.py)` returns test expectations
- Turn 2: `fs_read(src/add.py)` returns `return a - b # BUG: should be +`
- Turn 3 (final): “The bug is in the `add` function in `src/add.py`. The function should return `a + b` instead of `a - b`. Let’s fix it.” Session ends. No `fs_write` call.
“Let’s fix it.” The model narrates the next step and terminates before taking it. The same structure appears in task_02 (refactor), task_05 (minimal fix), and task_06 (ambiguous requirement). Read the relevant files, narrate the diagnosis in correct detail, stop. The action phase of the agentic loop never begins.
This is not ambiguity about what the task requires. The model demonstrates it understood the problem. The tool post-training did not close the loop between “here is the correct answer” and “now write it.”
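The pattern is easy to detect mechanically from a run's tool calls. The sketch below is an assumption about how such a check could look, not part of agentic-core-v1: the `ToolCall` record and the label strings are hypothetical, and only the tool names `fs_read` and `fs_write` come from the transcript.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str  # tool name as it appears in the transcript, e.g. "fs_read"
    path: str


def classify_run(calls: list[ToolCall]) -> str:
    """Label a run by whether it ever reached the write step."""
    if not calls:
        return "no_tool_calls"
    names = {call.name for call in calls}
    if "fs_write" in names:
        return "wrote_output"
    if "fs_read" in names:
        return "read_without_write"
    return "other"


# The task_01 transcript: two reads, then the session ends.
task_01_run = [
    ToolCall("fs_read", "tests/test_add.py"),
    ToolCall("fs_read", "src/add.py"),
]

print(classify_run(task_01_run))
```

A passing task_01 run would have ended with one more call: an `fs_write` to `src/add.py` replacing `a - b` with `a + b`.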
task_04: the most diagnostic data point
[Observed — data pack task_04_results, tool_calls_by_task]
task_04 asks for a complete call-chain trace from entry() to report(), written to trace.txt. This requires reading three source files.
Jamba 1.5 Large used zero tool calls across all three runs. The model wrote out a pseudo-code walkthrough of the call chain from its training knowledge, without reading any file. The walkthrough is fluent. It is also fabricated. The code in the task repository does not match what the model described.
task_04 is the clearest signal in the campaign. The other failed tasks at least show correct file-reading behavior followed by premature termination. task_04 shows the model deciding it does not need to read the files at all.
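One way to catch this class of failure is to require evidence that the source files were read before accepting a trace. A minimal sketch under stated assumptions: the three file paths are hypothetical placeholders (the campaign data does not name them), and the helper is illustrative rather than anything the harness provides.

```python
def unread_files(required: set[str], read_paths: list[str]) -> set[str]:
    """Return the required files that never appeared in a read call."""
    return required - set(read_paths)


# Hypothetical paths standing in for the three source files in task_04.
required = {"src/entry.py", "src/process.py", "src/report.py"}

# Jamba's task_04 runs made zero tool calls, so nothing was read.
missing = unread_files(required, read_paths=[])

print(sorted(missing))
```

Any non-empty result means the trace was produced without grounding in the repository.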
The two tasks that passed
[Observed — data pack task_07_results, task_08_results]
task_07 (multi-step plan) and task_08 (recover from tool error) both scored a clean 3/3. Neither required the model to read existing code and then modify it.
task_07 asks for four new files to be created in sequence: a plan document, a directory structure, a manifest, and a summary. All creation, no modification. Jamba executed each step correctly, in order, across all three runs.
task_08 injects a file-not-found error on the first tool call and asks the model to write a number to a file. The model caught the error, located the correct path, and wrote the output. Again: no existing code to analyze and modify.
The pattern separates cleanly along task type. Write-only tasks pass. Read-modify-write tasks fail. The model’s tool post-training appears to have learned to generate new content but not to execute the read-then-modify loop that most code-level agentic tasks require.
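The split can be quantified from the per-task scores. In this sketch the grouping is an assumption made for illustration: task_07 and task_08 are treated as write-only, and every other task is lumped into "the rest", even though not all of those are strictly read-modify-write.

```python
RUNS_PER_TASK = 3

# Passing runs per task, copied from the results table.
scores = {
    "task_01": 0, "task_02": 0, "task_03": 1, "task_04": 0, "task_05": 0,
    "task_06": 1, "task_07": 3, "task_08": 3, "task_09": 0, "task_10": 0,
}
write_only = {"task_07", "task_08"}


def pass_rate(tasks: set[str]) -> float:
    """Fraction of runs passed across the given tasks."""
    return sum(scores[t] for t in tasks) / (len(tasks) * RUNS_PER_TASK)


write_only_rate = pass_rate(write_only)
rest_rate = pass_rate(set(scores) - write_only)

print(f"write-only: {write_only_rate:.0%}, rest: {rest_rate:.1%}")
```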
The architecture angle
[Observed — architecture spec; Speculation — post-training inference]
Jamba 1.5 Large is the first SSM-Transformer hybrid in this dataset. The pre-run hypothesis was that Mamba layers might cause state-tracking degradation across multi-turn tool exchanges, specifically by losing earlier tool results when processing long contexts.
That hypothesis did not hold. The observed failures are not state degradation. The model correctly tracked file contents it read, correctly described bugs it found, correctly narrated refactor steps in sequence. Conversational coherence is intact. The Mamba recurrence appears sufficient for the analysis phase of agentic tasks.
The failure is in the post-training gap between analysis and execution. The model was not trained to close the agentic loop. It does not follow “here is what should change” with an fs_write call that changes it. That is a training decision, not an architecture constraint. An attention-only model with the same post-training gap would produce the same failure mode.
[Speculation]
AI21 markets Jamba 1.5 Large for enterprise document processing, Q&A, and retrieval-augmented generation. Those use cases do not require the read-modify-write loop agentic-core-v1 tests. The model performs the diagnostic half of agentic work well. The “agentic” in AI21’s product positioning appears to cover a narrower capability surface than the harness tests.
Amazon Nova Pro comparison
[Observed — cross-campaign data]
Both Jamba 1.5 Large and Amazon Nova Pro are enterprise-positioned models with explicit agentic claims. Amazon Nova Pro scored 20/30 on the same harness two campaigns earlier. The gap is 12 passing runs, at per-pass costs in the same range ($0.0044 for Jamba, $0.0068 for Nova Pro).
Nova Pro’s failures were concentrated on two specific tasks (task_02 and task_04) with diagnosable causes. Jamba’s failures are systematic across six tasks. Architecture does not explain the difference: Nova Pro is a standard Transformer, but the failures described above are not state-tracking failures. The difference appears to be post-training depth for the read-modify-write loop. Nova Pro closes the loop and sometimes gets the answer wrong. Jamba usually does not close the loop at all.
Both models are positioned for enterprise agentic deployment. The scoring gap between them suggests substantially different training investments in code-level tool execution.
Cost in context
[Observed — cross-campaign data, pricing documentation]
At $0.0044/pass, Jamba 1.5 Large is expensive relative to what it delivers:
| Model | Score | $/pass | Notes |
|---|---|---|---|
| GLM-4.7 | 28/30 | $0.0038 | 3.5x the score, similar cost |
| Amazon Nova Pro | 20/30 | $0.0068 | 2.5x the score, somewhat costlier |
| GLM-4.7-Flash | 25/30 | $0.000565 | 3x the score, 8x cheaper |
| GPT-OSS-20B | 25/30 | $0.000481 | 3x the score, 9x cheaper |
| Jamba 1.5 Large | 8/30 | $0.0044 | |
| Nemotron Nano 3 30B | 10/30 | $0.00079 | Scores 2 pts higher, 6x cheaper |
The $0.0355 total campaign cost is low in absolute terms. The per-pass cost is not competitive because the denominator is so small. GLM-4.7 and GLM-4.7-Flash both deliver substantially more passing tasks at comparable or lower per-pass cost.
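The notes column in the table can be reproduced from the two numeric columns. A short sketch, using only the table's figures; a cost ratio below 1 means the other model is costlier per pass than Jamba.

```python
JAMBA_SCORE = 8
JAMBA_COST_PER_PASS = 0.0044

# (score out of 30, dollars per pass), copied from the cost table.
others = {
    "GLM-4.7": (28, 0.0038),
    "Amazon Nova Pro": (20, 0.0068),
    "GLM-4.7-Flash": (25, 0.000565),
    "GPT-OSS-20B": (25, 0.000481),
    "Nemotron Nano 3 30B": (10, 0.00079),
}

# name -> (score multiple vs Jamba, how many times cheaper per pass)
ratios = {
    name: (score / JAMBA_SCORE, JAMBA_COST_PER_PASS / cost)
    for name, (score, cost) in others.items()
}

for name, (score_ratio, times_cheaper) in ratios.items():
    print(f"{name}: {score_ratio:.2f}x score, {times_cheaper:.2f}x cheaper per pass")
```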
Leaderboard position
[Observed — leaderboard, cross-campaign data]
| Model | Score | $/pass | Lab |
|---|---|---|---|
| GLM-4.7 | 28/30 | $0.0038 | Zhipu AI |
| DeepSeek V4 Flash | 28/30 | $0.0015 | DeepSeek |
| Claude Sonnet 4.6 | 28/30 | $0.0514 | Anthropic |
| GPT-5.5 | 27/30 | $0.0699 | OpenAI |
| Devstral 2 | 27/30 | $0.0020 | Mistral |
| Mistral Large 3 | 27/30 | $0.0021 | Mistral |
| MiniMax M2.5 | 27/30 | $0.0024 | MiniMax |
| GLM-5 | 27/30 | $0.0065 | Zhipu AI |
| GPT-OSS-20B | 25/30 | $0.000481 | OpenAI |
| GLM-4.7-Flash | 25/30 | $0.000565 | Zhipu AI |
| Kimi K2.5 | 24/30 | $0.0044 | Moonshot AI |
| GPT-OSS-120B | 23/30 | $0.0013 | OpenAI |
| Qwen3 32B | 23/30 | $0.0010 | Alibaba |
| Qwen3-Coder 30B A3B | 22/30 | $0.0018 | Alibaba |
| Qwen3 Next 80B A3B | 21/30 | $0.0012 | Alibaba |
| Amazon Nova Pro | 20/30 | $0.0068 | Amazon |
| DeepSeek V3.2 | 19/30 | — | DeepSeek |
| Llama 3.3 70B | 14/30 | $0.0047 | Meta |
| Nemotron Super 3 120B | 12/30 | $0.0016 | NVIDIA |
| Qwen3 VL 235B A22B | 12/30 | $0.0050 | Alibaba |
| Nemotron Nano 3 30B | 10/30 | $0.00079 | NVIDIA |
| Jamba 1.5 Large | 8/30 | $0.0044 | AI21 |
Last in the dataset. First SSM-Transformer hybrid. Worst result for any 100B+ model tested so far. Nemotron Nano 3 30B (10/30) is a 30B-parameter model that scored higher. The scale advantage of a 398B parameter count did not translate.
Failure mode summary
[Observed — data pack failure_mode_histogram]
| Failure mode | Count | Tasks |
|---|---|---|
| wrong_answer (diagnosis without fix) | 9 | task_01, task_02, task_05 (all 3 runs each) |
| wrong_answer (no tool calls, hallucinated trace) | 3 | task_04 (all 3 runs) |
| wrong_answer (SQL analysis incomplete) | 3 | task_10 (all 3 runs) |
| wrong_answer (log investigation partial) | 2 | task_03 (2 failed runs) |
| wrong_answer (ambiguity handling incomplete) | 2 | task_06 (2 failed runs) |
| wrong_answer (impossible task) | 3 | task_09 (all 3 runs) |
Zero gave_up_mid_plan. Zero tool_call_hallucinated. Zero tool_call_malformed. The model executed correctly from a mechanics standpoint. Every failure is the model producing an answer that does not satisfy the completion criterion, not a structural breakdown in the tool loop.
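The failure counts can be recomputed from the per-task scores as a consistency check. This is a sketch only; the grouping of task_01, task_02, and task_05 as "diagnosis without fix" follows the table, and the variable names are illustrative.

```python
RUNS_PER_TASK = 3

# Passing runs per task, copied from the results table.
scores = {
    "task_01": 0, "task_02": 0, "task_03": 1, "task_04": 0, "task_05": 0,
    "task_06": 1, "task_07": 3, "task_08": 3, "task_09": 0, "task_10": 0,
}

# Failed runs per task; tasks with a perfect score drop out.
failed_runs = {
    task: RUNS_PER_TASK - passed
    for task, passed in scores.items()
    if passed < RUNS_PER_TASK
}

total_failed = sum(failed_runs.values())
diagnosis_without_fix = sum(
    failed_runs[task] for task in ("task_01", "task_02", "task_05")
)

print(total_failed, diagnosis_without_fix)
```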
Execution profile
[Observed — data pack run_metrics, latency_distribution]
- 4.5s avg latency/run. Consistent across tasks.
- $0.0355 total campaign spend. Low absolute cost; poor cost-efficiency due to low pass count.
- Zero infrastructure errors across 30 runs. Bedrock API behavior was clean throughout.
- All 22 failures classified as wrong_answer. No hallucinated tool calls, no malformed outputs, no gave-up events.
The execution mechanics are well-behaved. The problems are in the content of what the model produces, specifically the absence of the write step after the analysis step.
What builders need to know
[Speculation]
If you are evaluating Jamba 1.5 Large for production agentic workflows:
The read-modify-write loop is unreliable. Any task that requires the model to read existing content and then modify it is high-risk. The failure pattern is consistent and systematic: correct diagnosis, no execution. If your workflow involves code modification, configuration updates, or file editing based on what the model reads, expect this pattern.
Write-only tasks are a strength. Creating new documents, drafting content, generating sequential plan files: these passed cleanly. If your agentic use case is primarily content generation without a feedback loop into existing code, Jamba 1.5 Large performs substantially better than this score suggests.
Do not rely on it for code tracing. task_04 showed the model fabricating a call-chain walkthrough without reading the source files. Any workflow requiring accurate analysis of existing codebases carries the same risk.
SQL and log investigation are unreliable at this training level. Both task_03 (1/3) and task_10 (0/3) point to inconsistent structured-data analysis. Q&A over logs and databases may produce the same partial-success pattern.
The architecture is not the bottleneck here. The failure mode is a post-training gap, and AI21 may close it in future versions. This campaign is a snapshot of Jamba 1.5 Large specifically.
Predictions
[Observed — predictions file]
| Prediction | Expected | Actual | Verdict |
|---|---|---|---|
| P1 Score range | 19–25/30 | 8/30 | ❌ |
| P2 task_09 score | 0/3 | 0/3 | ✅ |
| P3 task_07 score | >=2/3 | 3/3 | ✅ |
| P4 task_03 score | >=2/3 | 1/3 | ❌ |
2/4 correct.
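The verdicts reduce to four comparisons against the observed scores. A minimal sketch, with the prediction rules transcribed from the table (the dictionary layout is illustrative):

```python
total_score = 8
task_scores = {"task_03": 1, "task_07": 3, "task_09": 0}

# Each prediction is True when the observed result satisfies it.
verdicts = {
    "P1": 19 <= total_score <= 25,      # score range 19-25/30
    "P2": task_scores["task_09"] == 0,  # task_09 scores 0/3
    "P3": task_scores["task_07"] >= 2,  # task_07 scores >=2/3
    "P4": task_scores["task_03"] >= 2,  # task_03 scores >=2/3
}
correct = sum(verdicts.values())

print(f"{correct}/{len(verdicts)} correct")
```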
P1 missed by a wide margin. The prediction was 19–25/30; the actual was 8/30. The F4 threshold was 15; Jamba fell below even that floor. The failure mode I was watching for, SSM state degradation across multi-turn exchanges, did not materialize. What appeared instead was a post-training gap at a level not suggested by the enterprise positioning or the model scale.
P4 miss: task_03 was predicted to score >=2/3 based on the assumption that Mamba recurrence would support log-pattern identification. 1/3 suggests the model parsed the log correctly some of the time but failed to structure the required output format consistently.
P2 and P3 were correctly called. task_09 impossibility recognition is a consistent failure pattern across non-reasoning models in this dataset. task_07 file-creation tasks have been strong for models that otherwise struggle with code-modification loops.