
Agentflow: Building Task Decomposition While Context Windows Made It Obsolete

What I built in early 2025, the JIRA-to-context-window problem I was solving, and why the models improved faster than I could ship.

12 min read · Chris Knuteson

development · ai · agents · project-management · lessons-learned · postmortem · context-windows · bmad

Agentflow: Racing Against Exponential Model Improvement

I spent a few months in early 2025 building Agentflow using Cursor, Claude Code, and Augment. The goal was clean: bridge the gap between sprawling JIRA epics and the tightly scoped, context-window-friendly tasks that AI coding agents actually need.

The BMAD methodology was my decomposition framework - break down features into bite-sized chunks that fit within Claude's or GPT's context limits, preserve artifacts across the workflow, and maintain traceability from epic to implementation.

Then models got dramatically better at reasoning. MCP tooling landed on top of project management software. And the decomposition problem largely solved itself.

What I thought was solved (task decomposition) wasn't the real problem. What I thought I was building toward (requirements traceability) is still unsolved - but it's not a context window problem, it's a retrieval and memory problem.

The Actual Problem I Was Solving

In early 2025, the disconnect was real: JIRA epics are massive, multi-faceted beasts, while AI coding assistants need narrow, focused tasks with clear acceptance criteria. Feeding a sprawling epic into Claude Code or Cursor just produced garbage.

My thesis:

  • BMAD-driven decomposition: Use structured prompts to break epics into context-window-sized tasks
  • Artifact preservation: Maintain the decomposition trail (PRD → architecture → task specs → implementation)
  • Intelligent handoffs: Route tasks to the right agent based on phase, capabilities, and historical performance
  • Traceability: Connect implementation back to original requirements without losing context

The technical implementation was actually solid:

The Architecture I Built

Agent System (src/server/agents/)

  • AgentRegistry: Singleton managing 7 specialized agents (analyst, PM, architect, dev, QA, designer, generalist)
  • AgentOrchestrator: Intelligent selection using performance scoring, phase alignment, historical metrics
  • AgentManager: Activation/deactivation, configuration, audit trails
  • WorkflowManager: Configurable workflows (BMAD, Linear, etc.) with phase transitions
  • Handoff System: Phase-to-phase context transfer with automatic artifact propagation
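
For a sense of the shape, here's a minimal sketch of the registry half of this - illustrative TypeScript with made-up names and types, not the actual Agentflow source:

```typescript
// Illustrative registry sketch - names and shapes are mine, not the real Agentflow code.
type AgentRole =
  | "analyst" | "pm" | "architect" | "dev" | "qa" | "designer" | "generalist";

type Phase =
  | "soundboard" | "refinement" | "plan" | "implementation" | "review";

interface AgentDefinition {
  role: AgentRole;
  baseConfidence: number;   // prior confidence for this agent type
  primaryPhases: Phase[];   // phases the agent is built for
  secondaryPhases: Phase[]; // phases it can cover if needed
  capabilities: string[];   // e.g. "prd-authoring", "schema-design"
}

class AgentRegistry {
  private static instance: AgentRegistry;
  private agents = new Map<AgentRole, AgentDefinition>();

  static get(): AgentRegistry {
    return (this.instance ??= new AgentRegistry());
  }

  register(agent: AgentDefinition): void {
    this.agents.set(agent.role, agent);
  }

  // Candidates for a phase: primary matches first, then secondary coverage.
  candidatesFor(phase: Phase): AgentDefinition[] {
    const all = [...this.agents.values()];
    return [
      ...all.filter((a) => a.primaryPhases.includes(phase)),
      ...all.filter((a) => a.secondaryPhases.includes(phase)),
    ];
  }
}
```

In the real system, the orchestrator layered performance scoring on top of a lookup like candidatesFor() to pick one agent per phase - more on that scoring below.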

Tech Stack

  • Next.js 15 App Router, React 19, Tailwind CSS
  • tRPC procedures, Drizzle ORM, PostgreSQL (Neon)
  • NextAuth v5 for auth
  • Mastra for long-lived agent context
  • Anthropic/OpenAI SDKs for multi-model support

Database Schema

All project context lived in Postgres:

  • Session chains spanning multiple phases
  • Phase artifacts (PRDs, specs, task breakdowns, QA reports)
  • Agent performance metrics for selection scoring
  • Activation history and audit logs
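
In Drizzle terms, it was something like this - a hedged sketch with illustrative table and column names, not the real schema:

```typescript
import { pgTable, uuid, text, jsonb, real, timestamp } from "drizzle-orm/pg-core";

// Illustrative Drizzle schema, not the actual Agentflow tables.
export const sessionChains = pgTable("session_chains", {
  id: uuid("id").primaryKey().defaultRandom(),
  projectId: uuid("project_id").notNull(),
  currentPhase: text("current_phase").notNull(),   // soundboard | refinement | ...
  createdAt: timestamp("created_at").defaultNow(),
});

export const phaseArtifacts = pgTable("phase_artifacts", {
  id: uuid("id").primaryKey().defaultRandom(),
  sessionId: uuid("session_id").notNull().references(() => sessionChains.id),
  phase: text("phase").notNull(),
  kind: text("kind").notNull(),                    // prd | spec | task-breakdown | qa-report
  content: jsonb("content").notNull(),
  createdAt: timestamp("created_at").defaultNow(),
});

export const agentMetrics = pgTable("agent_metrics", {
  id: uuid("id").primaryKey().defaultRandom(),
  projectId: uuid("project_id").notNull(),
  agentRole: text("agent_role").notNull(),
  phase: text("phase").notNull(),
  successRate: real("success_rate").notNull(),     // feeds selection scoring
  updatedAt: timestamp("updated_at").defaultNow(),
});
```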

The Five-Phase Flow

  1. Soundboard: Exploratory conversations, context capture
  2. Refinement: Auto-generate PRDs from captured context
  3. Plan/Spec: Convert requirements into architecture using BMAD prompts
  4. Implementation: Structured task assignments with progress tracking
  5. Review: QA validation, deployment readiness
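
The WorkflowManager treated a flow like this as configuration rather than code. A hypothetical shape, not the real config format:

```typescript
// Hypothetical shape of a BMAD workflow definition; the real config differed.
interface WorkflowPhase {
  name: string;                // "soundboard" | "refinement" | ...
  producesArtifact: string;    // what the handoff carries into the next phase
  defaultAgent: string;        // which specialized agent runs it by default
  next: string | null;
}

const bmadWorkflow: WorkflowPhase[] = [
  { name: "soundboard",     producesArtifact: "context-capture", defaultAgent: "analyst",   next: "refinement" },
  { name: "refinement",     producesArtifact: "prd",             defaultAgent: "pm",        next: "plan" },
  { name: "plan",           producesArtifact: "architecture",    defaultAgent: "architect", next: "implementation" },
  { name: "implementation", producesArtifact: "task-breakdown",  defaultAgent: "dev",       next: "review" },
  { name: "review",         producesArtifact: "qa-report",       defaultAgent: "qa",        next: null },
];
```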

The agent scoring system was particularly clever - it considered base confidence per agent type, primary/secondary phase alignment, historical performance in the project, user preferences, and capability matching.
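
Conceptually it reduced to a weighted ranking. The weights and field names in this sketch are invented - the idea is what matters:

```typescript
// Illustrative scoring only - the weights are made up, not the production algorithm.
interface ScoringInput {
  baseConfidence: number;      // 0..1 prior for the agent type
  phaseAlignment: "primary" | "secondary" | "none";
  historicalSuccess: number;   // 0..1 success rate in this project
  userPreference: number;      // 0..1 boost if the user favors this agent
  capabilityMatch: number;     // 0..1 overlap between task needs and agent skills
}

// Returns a relative ranking score, not a probability.
function scoreAgent(s: ScoringInput): number {
  const phaseBonus =
    s.phaseAlignment === "primary" ? 0.3 :
    s.phaseAlignment === "secondary" ? 0.1 : 0;

  return (
    0.30 * s.baseConfidence +
    0.25 * s.historicalSuccess +
    0.20 * s.capabilityMatch +
    0.15 * s.userPreference +
    phaseBonus
  );
}

// The orchestrator ranked the phase's candidates and handed the task to the top scorer.
```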

What Actually Happened: The Ground Shifted Beneath Me

Three things killed the need for task decomposition faster than I could ship:

1. Models Got Dramatically Better at Reasoning

Claude Sonnet 4.5, GPT-5, Gemini 2.5 Pro can handle complexity that earlier models choked on. Feed them a sprawling epic and they can figure out the decomposition themselves, on the fly.

My BMAD-structured prompts for breaking down tasks? The models got good enough to do that reasoning natively. The decomposition framework I was building became redundant.

Larger context windows helped (100k → 200k tokens), but that's not what really changed. The models got smarter. They can hold more complexity in working memory, reason about dependencies, and plan implementation steps without explicit decomposition scaffolding.

2. MCP Tooling Integrated Directly into PM Software

The Model Context Protocol (MCP) landed and immediately got integrated into project management platforms. Now Claude Code or Cursor can:

  • Read JIRA epics directly via MCP
  • Access full project context
  • See relationships between tickets
  • Understand acceptance criteria without manual decomposition

My first two phases (Soundboard → Refinement) became obsolete overnight. MCP servers give AI agents direct access to requirements and project management systems. No need for my artifact handoff layer.

3. Claude Code and Remote Agents Ate the Backend Phases

Implementation and Review phases? Claude Code, Cursor with Claude Sonnet, and remote coding agents now handle these autonomously:

  • They read the epic
  • They understand the codebase
  • They write the implementation
  • They run tests and validate

My orchestration system was adding overhead where none was needed. The agents got smart enough to handle the entire flow themselves.

The Piece That's Actually Still Unsolved

There's ONE thing I was building toward that's still a real problem: Requirements traceability and validation across long-running projects.

Six months into a codebase refactor, can you:

  • Trace the implementation decision in PR #247 back to the original epic requirement?
  • Prove acceptance criteria were met across 50 PRs spanning 6 months?
  • Answer "why did we build it this way?" when the context has drifted?

This is NOT a context window problem. Even with unlimited context, you can't solve:

  1. Retrieval: How do you FIND the relevant requirement from hundreds of epics and thousands of comments? Dumping everything into context doesn't help if the model can't surface the right piece.

  2. Attention: "Lost in the middle" problem - models don't maintain equal attention across all tokens. Relevant context buried in token 95,000 gets ignored.

  3. Cost: Processing 200k tokens on every validation check is prohibitively expensive at scale.

  4. Temporal Understanding: Models struggle with "what changed when and why" - they need to understand the SEQUENCE of decisions, not just current state.

  5. Semantic Drift: Original epic says "user authentication" but implementation talks about "identity federation" - connecting those requires semantic understanding and persistent mapping.

  6. Structured Validation: You need to PROVE acceptance criteria were met. That's not a "read and summarize" problem - it's structured querying against a knowledge graph.

The real problems are retrieval, persistent memory, and structured validation - not intelligence or context size.

My artifact preservation system was actually attacking the RIGHT problem - building queryable, persistent state that could answer:

  • "What requirement led to this implementation?"
  • "Was acceptance criteria X met?"
  • "Why did we choose this approach in this context?"

That's still unsolved. But is it a big enough market before models get better at long-term memory? Probably not.

Why I Stopped: Building Workarounds While the Problem Disappeared

Here's the thing that broke me: I was using Cursor and Claude Code to BUILD Agentflow. Every week, the tools I was using got measurably better at the exact problem Agentflow was supposed to solve.

I'd build a BMAD decomposition prompt one week. The next week, Claude Sonnet 4 would drop and handle the decomposition natively better than my structured prompt.

I'd architect a handoff system to preserve context across phases. Then 200k context windows landed and made the handoff unnecessary.

I'd implement an artifact storage system for requirements traceability. Then MCP servers integrated with JIRA and gave Claude direct access to the source of truth.

I was literally watching my solution become obsolete as I built it.

The burnout wasn't from working hard - it was from fighting exponential improvement with linear tooling. You can't win that race.

In short:

  1. Models improved at reasoning faster than expected: Claude Sonnet 4.5, GPT-5, and Gemini 2.5 can handle sprawling requirements that earlier models couldn't. Native decomposition became good enough.
  2. MCP made direct integration trivial: Why build artifact handoffs when Claude can just read JIRA via MCP? The middleware layer I was building became a weekend project.
  3. Coding agents got autonomous: Claude Code, Cursor, and Windsurf don't need task management layers. They read epics and ship code end-to-end.
  4. The unsolved piece (traceability) is real but narrow: Requirements validation across long-term projects is still a problem, but it's retrieval/memory, not decomposition. And the market is small (compliance-heavy orgs).

What I'm Actually Taking Away

You Can't Build Faster Than Models Improve

This is the meta-lesson: Don't build workarounds for model limitations in 2025. By the time you ship, the limitation is gone.

Reasoning ability, tool use, planning - these are improving exponentially. Building decomposition layers or orchestration frameworks to work around current model weaknesses is building on quicksand.

If you're solving a model limitation today, ask: "Will this still be a problem in 6 months?" If not, you're wasting time.

Build for persistent problems, not temporary model gaps.

The Tech Is Solid (For What It's Worth)

The architecture is actually good:

  • Agent registry and orchestration patterns
  • Multi-project isolation
  • Artifact preservation and traceability
  • Performance-based selection algorithms

But good architecture solving the wrong problem is still worthless.

The ONE Unsolved Piece: Retrieval + Temporal Memory

The traceability problem is still real: requirements validation across months-long projects spanning hundreds of PRs.

Can you prove that implementation decision in PR #247 traces back to the original epic requirement? Can you validate acceptance criteria were met when the implementation is scattered across 40 files and 6 months of changes?

This is a retrieval and persistent memory problem, not a model intelligence problem. You need:

  • Semantic search across temporal boundaries (finding requirements from 6 months ago)
  • Knowledge graphs mapping epics → specs → PRs → implementations
  • Structured validation (proving acceptance criteria were met)
  • Temporal causality tracking (why decisions were made in specific contexts)
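
The validation piece is structured querying, not summarization: every acceptance criterion either has linked evidence or gets flagged. A toy sketch with hypothetical shapes:

```typescript
// Structured validation sketch - hypothetical shapes, the query is the point.
interface AcceptanceCriterion { id: string; epicId: string; text: string; }
interface Evidence { criterionId: string; prNumber: number; testRef: string; recordedAt: Date; }

interface ValidationReport {
  met: { criterion: AcceptanceCriterion; evidence: Evidence[] }[];
  unproven: AcceptanceCriterion[];
}

// "Prove it" means every criterion either has linked evidence or is surfaced as unproven.
function validateEpic(criteria: AcceptanceCriterion[], evidence: Evidence[]): ValidationReport {
  const byCriterion = new Map<string, Evidence[]>();
  for (const e of evidence) {
    byCriterion.set(e.criterionId, [...(byCriterion.get(e.criterionId) ?? []), e]);
  }

  const report: ValidationReport = { met: [], unproven: [] };
  for (const c of criteria) {
    const linked = byCriterion.get(c.id) ?? [];
    if (linked.length > 0) report.met.push({ criterion: c, evidence: linked });
    else report.unproven.push(c);
  }
  return report;
}
```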

The artifact preservation system I built was tackling this. But:

  1. It's a narrow use case (compliance-heavy orgs, regulated industries, government)
  2. Models might solve this with better RAG + long-term memory architectures
  3. Market timing is uncertain - is there a business before the tech catches up?

Stopping Is Winning

I validated this wasn't viable before raising money, hiring a team, or wasting another year. That's a win in my book.

The code is on GitHub. The agent orchestration patterns might help someone. The BMAD prompts could work as a library. The traceability piece might have legs if someone wants to tackle the compliance angle.

But I'm not fighting exponential model improvement with incremental tooling.

For Anyone Building AI Tooling Right Now

Here's what I learned the hard way:

1. Don't Solve Model Limitations - They're Temporary

Reasoning scaffolding? The next model does it natively. Task decomposition frameworks? Models learned to plan autonomously. Prompt engineering layers? Better base models made them redundant.

If you're building infrastructure around a model limitation, you're building on quicksand.

2. Use the Tools You're Building For

I was using Cursor and Claude Code to build Agentflow. Every week, those tools got better at exactly what Agentflow was supposed to do. That's a clear signal.

If your development tools are solving your product's problem better than your product will, you're too late.

3. The Unsolved Problems Are About Retrieval and Memory, Not Intelligence

Models are smart enough. The hard problems now are:

  • Retrieval: Finding relevant context from massive historical data
  • Persistent memory: Queryable state across months/years
  • Temporal understanding: Tracking why decisions were made when
  • Structured validation: Proving requirements were met, not just summarizing

The valuable problems aren't "make AI smarter" - they're "give AI persistent, queryable memory across time boundaries and prove what happened."

4. MCP Integration Killed Middleware Businesses

Model Context Protocol made direct integration trivial. Any "bridge" or "orchestration layer" between AI and existing tools is now a weekend MCP server project.

If your value proposition is "AI can access X," that's not defensible anymore.

5. Stopping Is Not Failing

I validated the market moved past this approach using the actual tools that made it obsolete. That's efficient market research.

The tech industry romanticizes "failing fast," but stopping before you waste years is smarter than failing spectacularly.

What Actually Might Still Matter

The long-term requirements traceability piece is still unsolved. When you're 6 months into a refactor, can you:

  • Trace implementation decisions back to original epic requirements?
  • Validate acceptance criteria across 50+ PRs?
  • Answer "why did we build it this way?" months later?

This is a retrieval + persistent memory problem, not an intelligence or context window problem. You need:

  • Semantic search to find relevant requirements from months ago
  • Knowledge graphs mapping requirements → decisions → implementations
  • Temporal causality tracking (context at decision time)
  • Structured validation to prove compliance

MCP servers can read JIRA. Models can reason. But neither solves:

  • Finding the RIGHT requirement from thousands of historical ones
  • Proving acceptance criteria satisfaction across scattered implementations
  • Understanding temporal context (why this decision made sense 6 months ago)

If someone wants to tackle that - compliance-heavy orgs, regulated industries, government contracts - there might be something there. But it's narrow enough that better RAG architectures and long-term memory will probably solve it before there's a big business.


The code is on GitHub. The agent orchestration patterns are solid. The architecture is clean. The BMAD decomposition methodology works.

It's just solving a problem that evaporated while I was building the solution.

Postscript: If you're building AI tooling and want to talk through timing and market signals before you waste months, hit me up. I'm happy to share the real lessons - the ones that don't make it into the polished case studies where everyone pretends they saw it coming.
