April 9, 2026

Building a Recursive Agent-Improvement Pipeline

Patrick Donohoe

AI Engineer

In a previous post, we wrote about building an equity research agent using the You.com Web Search APIs. That post covered the tools and data sources that let the agent produce financial reports. This post is about what comes next: making those reports actually good.

Getting a working agent is one problem. Getting a good one is a different kind of problem. We built an automated pipeline that lets the agent improve its own implementation through a structured feedback loop. Over 15 versions of the core prompts have gone through this system.

Two key elements make this automated pipeline work:

  1. Infrastructure that lets the agent iterate safely on its own: versioned prompts, stored outputs, automated evaluation, and rollback when something gets worse.
  2. A human who steps in at the right moments to reason from first principles about what the agent is missing. They don't tweak prompts; instead, they change the environment the agent operates in.

Defining What Good Looks Like

Before we could automate anything, we needed to define what a good report actually looks like. We evaluated the agent the way a portfolio manager evaluates an analyst. Not on whether the predictions are right, but on the depth of the research. How well does the agent understand the business? How specific are the insights? Can the agent articulate what the market is missing? If it can't, the analysis has no value regardless of how polished it reads.

Reward Engineering

That evaluation framework worked for manual review. But to power an automated improvement loop, we needed to quantify it. This is what we think of as reward engineering. Not building the agent itself, but building the system that tells the agent whether it's getting better. So we built a judge.

The judge is an agent that scores reports across six dimensions:

  • Data accuracy
  • Analytical depth
  • Market gap identification
  • Recommendation quality
  • Insight specificity
  • Completeness

The judge gives each dimension a score of 1-10. We calibrated the judge so that our starting reports scored a three or four. That calibration matters. If the starting score is a seven, the agent has nowhere meaningful to go. The benchmark has to be hard enough that the agent has to work to improve against it.
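As a concrete sketch, the judge's output can be modeled as a small scored record. The six dimension names and the 1-10 scale come from the list above; the dataclass shape and the unweighted-mean aggregate are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class JudgeScore:
    """One judge evaluation; each dimension is scored 1-10."""
    data_accuracy: int
    analytical_depth: int
    market_gap_identification: int
    recommendation_quality: int
    insight_specificity: int
    completeness: int

    def __post_init__(self):
        # Enforce the 1-10 scale on every dimension.
        for name, value in vars(self).items():
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be 1-10, got {value}")

    def overall(self) -> float:
        # Unweighted mean across dimensions (the aggregation is assumed).
        values = list(vars(self).values())
        return sum(values) / len(values)

# A well-calibrated benchmark puts early reports around 3-4 overall.
baseline = JudgeScore(3, 4, 3, 4, 3, 4)
```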

The judge also has to evolve. After running the system for a while, we started noticing patterns in how the agent scored well without actually being good. Reports had confident claims that weren't backed by data. Consensus views were framed as original insights. Generic analysis could have applied to any company. The agent had learned to satisfy the rubric without satisfying the intent behind it.

So, we built a second generation of the judge with harder constraints. For example, uncited claims cap data accuracy at four, consensus opinions presented as novel gaps cap market gap identification at four, and generic language without company-specific data caps insight specificity. None of these existed in the first iteration. They came from watching the system find shortcuts and closing them off one by one.
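The caps themselves are easy to express as a post-processing step on the raw scores. A minimal sketch, assuming a separate checker produces boolean flags (the cap value for generic language isn't stated in the post, so four is a guess):

```python
def apply_caps(raw_scores: dict, flags: dict) -> dict:
    """Clamp raw judge scores when a shortcut pattern is detected."""
    scores = dict(raw_scores)
    if flags.get("uncited_claims"):
        scores["data_accuracy"] = min(scores["data_accuracy"], 4)
    if flags.get("consensus_framed_as_gap"):
        scores["market_gap_identification"] = min(scores["market_gap_identification"], 4)
    if flags.get("generic_language"):
        # The post doesn't give this cap's value; 4 is an assumption.
        scores["insight_specificity"] = min(scores["insight_specificity"], 4)
    return scores

capped = apply_caps(
    {"data_accuracy": 8, "market_gap_identification": 7, "insight_specificity": 9},
    {"uncited_claims": True, "generic_language": True},
)
```

The point is that caps are hard constraints, not weighted penalties: a confident but uncited report can't score above a four on data accuracy no matter how polished it reads.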

This is also how we know when the agent has outgrown a benchmark entirely. Once it consistently scores an eight, for example, the current eval isn't teaching it anything. That's the signal to step back and ask what we're not measuring yet. Then we build that test and the cycle starts over.

Infrastructure for Iteration

The agent can't improve in a vacuum. Before it can iterate on its own outputs, it needs infrastructure built specifically for that purpose.

The most important piece is prompt versioning. Every agent in the pipeline has a system prompt that encodes what it should do, what it should prioritize, and where the bar is. Initially, these lived in config files and got edited by hand. That's fine when you're exploring, but it stops being fine when you want to measure whether a change actually made things better.

We moved every prompt into a versioned database. Each version is stored with a description of what changed, and every report links back to the exact prompt version that produced it. You can look at any report, see exactly which instructions generated it, and compare it against reports generated by previous versions. This is the foundation that makes systematic improvement possible.
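A minimal sketch of that store, using two SQLite tables. Only the fields named above come from the post; everything else about the schema is assumed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE prompt_versions (
    id INTEGER PRIMARY KEY,
    agent_name TEXT NOT NULL,
    version INTEGER NOT NULL,
    prompt_text TEXT NOT NULL,
    change_description TEXT,
    UNIQUE (agent_name, version)
);
CREATE TABLE reports (
    id INTEGER PRIMARY KEY,
    ticker TEXT NOT NULL,
    prompt_version_id INTEGER NOT NULL REFERENCES prompt_versions(id),
    body TEXT NOT NULL
);
""")
conn.execute("INSERT INTO prompt_versions VALUES (1, 'report_writer', 1, 'You are an equity analyst...', 'initial')")
conn.execute("INSERT INTO reports VALUES (1, 'AAPL', 1, '...report text...')")

# Every report links back to the exact prompt version that produced it.
row = conn.execute("""
    SELECT p.version, p.change_description
    FROM reports r JOIN prompt_versions p ON r.prompt_version_id = p.id
    WHERE r.id = 1
""").fetchone()
```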

It also makes rollback safe. We maintain a high-watermark system across an evaluation universe of 20 financial stocks. Every time the agent changes a prompt, we run the full pipeline across all 20 and compare aggregate scores against the current best. If the new version beats the high watermark, it becomes the new baseline. If it doesn't, we revert to the previous version. No single good result on one stock can trick the system into shipping a worse prompt. The bar is set by overall performance, not individual wins.
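The promote-or-revert decision reduces to a comparison of aggregate scores. A sketch, where the mean aggregation and the tickers are assumptions (the post only specifies the 20-stock universe):

```python
def evaluate_candidate(scores_by_ticker: dict, high_watermark: float):
    """Aggregate over the whole evaluation universe, then promote or revert."""
    aggregate = sum(scores_by_ticker.values()) / len(scores_by_ticker)
    if aggregate > high_watermark:
        return "promote", aggregate      # new baseline
    return "revert", high_watermark      # keep the previous prompt

# One standout result on a single stock can't rescue a weaker overall run.
decision, mark = evaluate_candidate(
    {"AAPL": 9.0, "MSFT": 4.0, "GOOG": 4.0, "JPM": 4.0},
    high_watermark=5.6,
)
```

Here a 9.0 on one stock still yields an aggregate of 5.25, which loses to the 5.6 high watermark, so the change is reverted.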

The agent also stores its own run history. It can compare outputs across runs, see which changes led to improvements, and reference what it tried before. This matters because without it, the agent has no memory. It would suggest the same changes repeatedly or lose track of what worked. The run history gives it a foundation to reason from.
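Run history can be as simple as an append-only log the agent consults before proposing its next change. A sketch; the JSONL format and the field names are assumptions:

```python
import json
import tempfile
from pathlib import Path

def record_run(path: Path, prompt_version: int, overall_score: float, change: str) -> None:
    """Append one run to a JSONL history file."""
    with path.open("a") as f:
        f.write(json.dumps({"prompt_version": prompt_version,
                            "overall_score": overall_score,
                            "change": change}) + "\n")

def best_runs(path: Path, n: int = 3) -> list:
    """Return the top-n runs by score, so the agent can see what worked."""
    runs = [json.loads(line) for line in path.read_text().splitlines()]
    return sorted(runs, key=lambda r: r["overall_score"], reverse=True)[:n]

log = Path(tempfile.mkdtemp()) / "runs.jsonl"
record_run(log, 1, 3.5, "initial prompt")
record_run(log, 2, 4.2, "required citations for every claim")
top = best_runs(log, n=1)
```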

None of this is the agent doing its job. This is the infrastructure that lets the agent do its job. Without versioning, you can't trace what changed; without rollback, every experiment is a risk; and without run history, the agent can't learn from its own past. The improvement loop doesn't work unless all three are in place.

Breaking Through Plateaus

The agent is good at incremental improvement. It can see what the judge is penalizing and adjust. Scores go up, reports get more specific, prompts get tighter. When the trajectory is obviously positive, there's no reason to intervene.

Then it plateaus. The scores stop moving. The agent is still suggesting changes, but they're lateral moves. Different but not better. It has optimized within the environment it was given. The evaluation set, given the current tools and data sources, is effectively solved.

That's when the human steps back in. Not to tweak prompts. The agent is already better at that than we are within a fixed environment. The human's job is to reason from first principles about what capability or data source would unlock the next level of quality.

For example, early on, the agent was producing decent reports, but the quality had a ceiling. The reports reasoned about valuation in prose, but they couldn't do real financial modeling. So we built a discounted cash flow (DCF) model and gave the agent access to it. That wasn't a prompt change; it was a structural change to what the agent could do.
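The DCF tool itself is outside the scope of this post, but the idea is standard. A textbook two-stage sketch that discounts explicit-period free cash flows and adds a Gordon-growth terminal value (the production model surely differs):

```python
def dcf_value(free_cash_flows, discount_rate, terminal_growth):
    """Two-stage DCF: discount explicit-period FCFs, then add a
    Gordon-growth terminal value discounted back to today."""
    # Present value of the explicit forecast period.
    pv = sum(fcf / (1 + discount_rate) ** t
             for t, fcf in enumerate(free_cash_flows, start=1))
    # Terminal value at the end of the forecast period.
    terminal = (free_cash_flows[-1] * (1 + terminal_growth)
                / (discount_rate - terminal_growth))
    return pv + terminal / (1 + discount_rate) ** len(free_cash_flows)
```

Giving the agent a tool like this means valuation numbers come out of arithmetic, not prose.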

The same pattern repeated. Reports were missing nuance about management strategy and forward guidance, so we added earnings call transcripts as a data source. The agent was attempting arithmetic inline and getting it wrong, so we built computation tools that let it do the math in code. Each of these came from asking a specific question: what is the agent fundamentally unable to do right now, and what would fix that?

Each addition unlocked a step change in quality that no amount of prompt iteration could have reached. And each time, once the new capability was in place, the agent took over again. It figured out how to use the new tools well. It iterated on the prompts. Scores climbed. And eventually, it plateaued again.

Follow the Pattern to Improvement

The pattern is simple:

  1. Define what good looks like
  2. Build infrastructure that lets the agent iterate safely against that definition
  3. When it plateaus, reason from first principles about what's missing and change the environment
  4. Then, let it run again
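The promote/revert/plateau logic in steps 2-3 can be replayed over a sequence of aggregate judge scores. A control-flow sketch only; the threshold and names are assumptions:

```python
def run_loop(candidate_scores, plateau_rounds=3):
    """Replay promote/revert/plateau decisions over aggregate judge scores."""
    high_watermark, stalled, events = 0.0, 0, []
    for score in candidate_scores:
        if score > high_watermark:
            high_watermark, stalled = score, 0
            events.append("promote")
        else:
            stalled += 1
            events.append("revert")
        if stalled >= plateau_rounds:
            # Plateau: a human changes the environment (new tool or
            # data source), then the loop resumes.
            events.append("plateau")
            break
    return high_watermark, events

# Two improvements, then three lateral moves in a row -> plateau.
hw, events = run_loop([3.5, 4.2, 4.1, 4.0, 3.9])
```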

We used this approach to build a finance research agent that went through 15+ prompt versions and got meaningfully better each round. The goal is to continuously improve. As Richard Socher always says, “Better, better, never done.” 

To start building with You.com APIs, get your API key.
