April 22, 2026 · 9 min read

Why AI Agents Fail at Data Analysis (And How We Fixed It)

An honest post-mortem of the three failure modes we hit building AI agents for data analytics — and what each one taught us about what actually works.

TL;DR

Failure 1: Over-orchestration. Rigid DAGs work for ETL, fail for analysis. Fix: adaptive sub-tasks inside a structured DAG.
Failure 2: Prompt duct-taping. Bloating prompts to patch bad outputs. Fix: refactor the mental model instead.
Failure 3: Domain-generic statistics. Rolling averages on Bitcoin data. Fix: dynamic skill loading per domain.

Failure Mode 1: Over-Orchestration

The first instinct building a multi-agent analytics system is to build a rigid DAG. Planner outputs steps 1, 2, 3; executor runs each in order; narrator writes it up. Clean, auditable, predictable. Works great for ETL pipelines.

Fails for analysis. The reason: good data analysis fundamentally cannot be fully planned beforehand. Each finding changes what you should do next. A plan that says "clean data, then run regression" breaks when the data cleaning reveals that regression is the wrong model for the shape of the data.

The fix: adaptive sub-tasks embedded within a structured DAG. The top-level plan is stable ("clean, explore, model"), but within each node the agent can iterate and branch based on what it finds. The planner updates the DAG mid-run when it encounters surprises.
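One way to picture "adaptive sub-tasks inside a structured DAG" is a plain ordered plan in which some nodes are marked exploratory and may splice follow-up steps in right after themselves. This is a minimal sketch under stated assumptions, not PlotStudio's orchestration code: `Node`, `execute`, and the step functions are hypothetical names invented for illustration, and the "DAG" is reduced to a linear queue to keep it short.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    """One step in the plan (hypothetical structure, for illustration only)."""
    name: str
    run: Callable[[dict], Optional[List["Node"]]]
    exploratory: bool = False  # only exploratory nodes may extend the plan


def execute(plan: List[Node], context: dict) -> List[str]:
    """Run nodes in order; exploratory nodes can insert follow-up nodes."""
    queue = list(plan)
    trace = []
    while queue:
        node = queue.pop(0)
        follow_ups = node.run(context)
        trace.append(node.name)
        if follow_ups and node.exploratory:
            queue[0:0] = follow_ups  # run follow-ups before the remaining plan
    return trace


def clean(ctx):
    ctx["skewed"] = True  # pretend cleaning revealed a heavy-tailed target
    return None


def log_transform(ctx):
    ctx["transformed"] = True
    return None


def explore(ctx):
    # The surprise found during cleaning changes what should happen next.
    if ctx.get("skewed"):
        return [Node("log_transform", log_transform)]
    return None


def model(ctx):
    ctx["model"] = "regression on log target" if ctx.get("transformed") else "plain regression"
    return None


plan = [
    Node("clean", clean),
    Node("explore", explore, exploratory=True),
    Node("model", model),
]
context = {}
trace = execute(plan, context)
print(trace)
print(context["model"])
```

The top-level order (clean, explore, model) never changes, but the exploratory node is allowed to add a step the original plan did not anticipate.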

Key insight

The biggest threat to a multi-agent system isn’t a lack of intelligence. It’s over-orchestration. The executor needs room to explore; the DAG has to allow it.

Failure Mode 2: Prompt Duct-Taping

The second failure mode is the most seductive. You have a prompt that produces the wrong output in edge cases. You add a rule: "don’t do X." The next edge case: "also don’t do Y." Six months later, the prompt is 14,000 words of patches. Each patch fixed one issue and introduced three more.

Our own data analyst prompt followed exactly this path. Version 1: 5,778 words. Peak bloat, version 47: 14,881 words. Lean rewrite, version 53: 7,441 words, roughly half the peak length, with better output than any of the bloated versions.

The lesson: when a prompt needs patches, it’s a signal that the underlying mental model is wrong. Don’t add rules. Refactor the model.
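The word counts quoted above are easy to turn into a simple guard. The sketch below computes each version's length relative to version 1 and flags versions that exceed a length budget. The 1.5x budget and the function names are assumptions made for illustration; the post does not say any such check was used.

```python
# Word counts per prompt version, taken from the figures quoted in the post.
PROMPT_WORDS = {1: 5778, 47: 14881, 53: 7441}


def growth_vs_baseline(words_by_version, baseline=1):
    """Return each version's length as a multiple of the baseline version."""
    base = words_by_version[baseline]
    return {v: round(w / base, 2) for v, w in words_by_version.items()}


def needs_refactor(words_by_version, version, baseline=1, max_ratio=1.5):
    """Flag a version whose length exceeds a hypothetical budget (max_ratio)."""
    return words_by_version[version] / words_by_version[baseline] > max_ratio


growth = growth_vs_baseline(PROMPT_WORDS)
print(growth)
print(needs_refactor(PROMPT_WORDS, 47), needs_refactor(PROMPT_WORDS, 53))
```

A check like this cannot tell whether a prompt is good, only that it is growing by accretion, which is the signal to stop patching and rethink the model.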

Failure Mode 3: Domain-Generic Statistics

A generic statistical toolkit is not enough. The same "run a regression" instruction means very different things in different domains:

Financial time series need GARCH volatility models, RSI/MACD momentum indicators, and Bollinger Bands, not plain rolling averages.
Clinical trial data needs Kaplan-Meier survival curves and Cox regression, not raw percentages.
Genomics needs FDR correction across 20,000 genes, not uncorrected per-gene t-tests.
Geospatial data needs spatial autocorrelation measures such as Moran’s I, not pairwise distances alone.
Panel data needs fixed-effects models, not pooled OLS.

Generic agents run generic stats. The output is technically correct and analytically useless.
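The genomics case shows why "technically correct" is not enough. Testing 20,000 genes at p < 0.05 would be expected to produce about 1,000 false positives even if no gene truly differed. The sketch below implements the Benjamini-Hochberg procedure, one standard way to control the false discovery rate, in plain Python. The p-values are made-up illustration data, not results from any real study.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR alpha."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= alpha * k / m.
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            largest_rank = rank
    # Reject every hypothesis ranked at or below k.
    reject = [False] * m
    for idx in order[:largest_rank]:
        reject[idx] = True
    return reject


# Made-up p-values standing in for per-gene test results.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

naive_hits = sum(p < 0.05 for p in p_values)
fdr_hits = sum(benjamini_hochberg(p_values))
print(naive_hits, fdr_hits)
```

On this toy input the uncorrected threshold reports five hits while the corrected procedure keeps two, which is the gap between a generic analysis and a domain-appropriate one.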

The fix: dynamic skill loading. The agent identifies the data domain on upload (schema + column names + data shape → LLM inference) and loads the appropriate statistical library before planning the investigation. PlotStudio loads financial, biomedical, geospatial, and panel-data libraries on demand.
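As a rough illustration of the idea, the sketch below maps column names to a domain and returns the matching skill list. Everything here is an assumption: the post describes the detection step as LLM inference over schema, column names, and data shape, whereas this stand-in uses simple keyword overlap, and the registry contents, hint sets, and 0.6 threshold are hypothetical.

```python
# Hypothetical skill registry; names are illustrative, not a real library's API.
SKILLS = {
    "financial": ["garch_volatility", "rsi", "macd", "bollinger_bands"],
    "biomedical": ["kaplan_meier", "cox_regression", "fdr_correction"],
    "geospatial": ["morans_i"],
    "panel": ["fixed_effects"],
    "generic": ["summary_stats", "ols_regression"],
}

# Column names that hint at each domain (a keyword stand-in for LLM inference).
DOMAIN_HINTS = {
    "financial": {"open", "high", "low", "close", "volume"},
    "biomedical": {"time_to_event", "event", "treatment_arm"},
    "geospatial": {"latitude", "longitude"},
    "panel": {"entity_id", "period"},
}


def detect_domain(columns, min_score=0.6):
    """Pick the domain whose hint columns overlap most with the dataset's columns."""
    cols = {c.lower() for c in columns}
    best, best_score = "generic", 0.0
    for domain, hints in DOMAIN_HINTS.items():
        score = len(cols & hints) / len(hints)
        if score > best_score:
            best, best_score = domain, score
    return best if best_score >= min_score else "generic"


def load_skills(columns):
    """Return the skill names to make available before planning starts."""
    return SKILLS[detect_domain(columns)]


print(load_skills(["Date", "Open", "High", "Low", "Close", "Volume"]))
print(load_skills(["customer_id", "age", "plan"]))
```

The point of the design is ordering: the domain is resolved before the planner runs, so the plan can reference domain-specific methods instead of falling back to generic ones.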

What We Got Right on the Third Try

After three rewrites of the orchestration layer, we arrived at a pattern that works:

Profiler runs autonomously on upload before any user prompt. This fixes the "cold start" problem.
Planner produces a DAG with explicit adaptive nodes. Some steps are fixed (cleaning comes before modeling); others are marked as exploratory.
Executor can trigger a plan update mid-run when it encounters surprises; the planner then re-runs with the new context.
Domain skills load dynamically based on data shape — the agent knows to treat stock OHLCV data differently from tabular customer records.
QA agent runs after every step to check assumptions, flag leakage, and verify statistical validity.
Narrator produces the final report with explicit caveats and a limitations section.

Figure: the profiler running on upload, the first step in the multi-agent workflow.
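The "QA after every step" and "report with a limitations section" parts of the pattern above can be sketched as a small loop. This is an illustration under assumptions, not the actual pipeline: the step functions, the 30-row threshold, and the report layout are all hypothetical.

```python
def profile(state):
    """Profiler: record basic facts about the data before anything else runs."""
    return {**state, "n_rows": len(state["rows"])}


def model(state):
    """Stand-in for a modeling step: just computes a mean."""
    values = state["rows"]
    return {**state, "mean": sum(values) / len(values)}


def qa_check(step_name, state):
    """QA: return a list of caveats raised after a step (hypothetical rule)."""
    issues = []
    n = state.get("n_rows", 0)
    if n < 30:  # hypothetical sample-size threshold
        issues.append(f"{step_name}: only {n} rows; estimates may be unstable")
    return issues


def run_pipeline(steps, state):
    """Run each step, then run QA on the resulting state and collect caveats."""
    caveats = []
    for name, step in steps:
        state = step(state)
        caveats.extend(qa_check(name, state))
    return state, caveats


def narrate(state, caveats):
    """Narrator: final report that always includes a limitations section."""
    lines = [f"Mean value: {state['mean']:.2f}", "Limitations:"]
    lines += [f"- {c}" for c in caveats] or ["- none flagged"]
    return "\n".join(lines)


final_state, caveats = run_pipeline(
    [("profile", profile), ("model", model)],
    {"rows": [1.0, 2.0, 3.0]},
)
report = narrate(final_state, caveats)
print(report)
```

Because QA runs inside the loop rather than once at the end, a caveat is attached to the step that triggered it and carried through to the report.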

How to Tell If an AI Agent Has Fixed These

Does it profile the data before asking you questions? (If no → failure mode 1.)
Does the same prompt produce consistent output across edge cases? (If no → failure mode 2.)
Does it apply domain-appropriate stats automatically? (If no → failure mode 3.)
Does it flag sample-size and assumption limits? (If no → probably all three.)

Try an agent that doesn’t fail these tests

Built by engineers who hit all three failure modes, and fixed them.

Download PlotStudio AI
