April 9, 2026

Building a Recursive Agent-Improvement Pipeline

Patrick Donohoe

AI Engineer

In a previous post, we wrote about building an equity research agent using the You.com Web Search APIs. That post covered the tools and data sources that let the agent produce financial reports. This post is about what comes next: making those reports actually good.

Getting a working agent is one problem. Getting a good one is a different kind of problem. We built an automated pipeline that lets the agent improve its own implementation through a structured feedback loop. Over 15 versions of the core prompts have gone through this system.

Two key elements make this automated pipeline work:

  1. Infrastructure that lets the agent iterate safely on its own: versioned prompts, stored outputs, automated evaluation, and rollback when something gets worse.
  2. A human who steps in at the right moments to reason from first principles about what the agent is missing. They don't tweak prompts; instead, they change the environment the agent operates in.

Defining What Good Looks Like

Before we could automate anything, we needed to define what a good report actually looks like. We evaluated the agent the way a portfolio manager evaluates an analyst. Not on whether the predictions are right, but on the depth of the research. How well does the agent understand the business? How specific are the insights? Can the agent articulate what the market is missing? If it can't, the analysis has no value regardless of how polished it reads.

Reward Engineering

That evaluation framework worked for manual review. But to power an automated improvement loop, we needed to quantify it. This is what we think of as reward engineering. Not building the agent itself, but building the system that tells the agent whether it's getting better. So we built a judge.

The judge is an agent that scores reports across six dimensions:

  • Data accuracy
  • Analytical depth
  • Market gap identification
  • Recommendation quality
  • Insight specificity
  • Completeness

The judge gives each dimension a score of 1-10. We calibrated the judge so that our starting reports scored a three or four. That calibration matters. If the starting score is a seven, the agent has nowhere meaningful to go. The benchmark has to be hard enough that the agent has to work to improve against it.
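As a concrete sketch, the judge's output can be modeled as a small scored record. The six dimension names and the 1-10 scale come from the list above; the dataclass shape and the unweighted-mean aggregate are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class JudgeScore:
    """One judge evaluation; each dimension is scored 1-10."""
    data_accuracy: int
    analytical_depth: int
    market_gap_identification: int
    recommendation_quality: int
    insight_specificity: int
    completeness: int

    def __post_init__(self):
        # Enforce the 1-10 scale on every dimension.
        for name, value in vars(self).items():
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be 1-10, got {value}")

    def overall(self) -> float:
        # Unweighted mean across dimensions (the aggregation is assumed).
        values = list(vars(self).values())
        return sum(values) / len(values)

# A well-calibrated benchmark puts early reports around 3-4 overall.
baseline = JudgeScore(3, 4, 3, 4, 3, 4)
```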

The judge also has to evolve. After running the system for a while, we started noticing patterns in how the agent scored well without actually being good. Reports had confident claims that weren't backed by data. Consensus views were framed as original insights. Generic analysis could have applied to any company. The agent had learned to satisfy the rubric without satisfying the intent behind it.

So, we built a second generation of the judge with harder constraints. For example, uncited claims cap data accuracy at four, consensus opinions presented as novel gaps cap market gap identification at four, and generic language without company-specific data caps insight specificity. None of these existed in the first iteration. They came from watching the system find shortcuts and closing them off one by one.
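The caps themselves are easy to express as a post-processing step on the raw scores. A minimal sketch, assuming a separate checker produces boolean flags (the cap value for generic language isn't stated in the post, so four is a guess):

```python
def apply_caps(raw_scores: dict, flags: dict) -> dict:
    """Clamp raw judge scores when a shortcut pattern is detected."""
    scores = dict(raw_scores)
    if flags.get("uncited_claims"):
        scores["data_accuracy"] = min(scores["data_accuracy"], 4)
    if flags.get("consensus_framed_as_gap"):
        scores["market_gap_identification"] = min(scores["market_gap_identification"], 4)
    if flags.get("generic_language"):
        # The post doesn't give this cap's value; 4 is an assumption.
        scores["insight_specificity"] = min(scores["insight_specificity"], 4)
    return scores

capped = apply_caps(
    {"data_accuracy": 8, "market_gap_identification": 7, "insight_specificity": 9},
    {"uncited_claims": True, "generic_language": True},
)
```

The point is that caps are hard constraints, not weighted penalties: a confident but uncited report can't score above a four on data accuracy no matter how polished it reads.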

This is also how we know when the agent has outgrown a benchmark entirely. Once it consistently scores an eight, for example, the current eval isn't teaching it anything. That's the signal to step back and ask what we're not measuring yet. Then we build that test and the cycle starts over.

Infrastructure for Iteration

The agent can't improve in a vacuum. Before it can iterate on its own outputs, it needs infrastructure built specifically for that purpose.

The most important piece is prompt versioning. Every agent in the pipeline has a system prompt that encodes what it should do, what it should prioritize, and where the bar is. Initially, these lived in config files and got edited by hand. That's fine when you're exploring, but it stops being fine when you want to measure whether a change actually made things better.

We moved every prompt into a versioned database. Each version is stored with a description of what changed, and every report links back to the exact prompt version that produced it. You can look at any report, see exactly which instructions generated it, and compare it against reports generated by previous versions. This is the foundation that makes systematic improvement possible.
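A minimal sketch of that store, using two SQLite tables. Only the fields named above come from the post; everything else about the schema is assumed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE prompt_versions (
    id INTEGER PRIMARY KEY,
    agent_name TEXT NOT NULL,
    version INTEGER NOT NULL,
    prompt_text TEXT NOT NULL,
    change_description TEXT,
    UNIQUE (agent_name, version)
);
CREATE TABLE reports (
    id INTEGER PRIMARY KEY,
    ticker TEXT NOT NULL,
    prompt_version_id INTEGER NOT NULL REFERENCES prompt_versions(id),
    body TEXT NOT NULL
);
""")
conn.execute("INSERT INTO prompt_versions VALUES (1, 'report_writer', 1, 'You are an equity analyst...', 'initial')")
conn.execute("INSERT INTO reports VALUES (1, 'AAPL', 1, '...report text...')")

# Every report links back to the exact prompt version that produced it.
row = conn.execute("""
    SELECT p.version, p.change_description
    FROM reports r JOIN prompt_versions p ON r.prompt_version_id = p.id
    WHERE r.id = 1
""").fetchone()
```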

It also makes rollback safe. We maintain a high-watermark system across an evaluation universe of 20 financial stocks. Every time the agent changes a prompt, we run the full pipeline across all 20 and compare aggregate scores against the current best. If the new version beats the high watermark, it becomes the new baseline. If it doesn't, we revert to the previous version. No single good result on one stock can trick the system into shipping a worse prompt. The bar is set by overall performance, not individual wins.
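The promote-or-revert decision reduces to a comparison of aggregate scores. A sketch, where the mean aggregation and the tickers are assumptions (the post only specifies the 20-stock universe):

```python
def evaluate_candidate(scores_by_ticker: dict, high_watermark: float):
    """Aggregate over the whole evaluation universe, then promote or revert."""
    aggregate = sum(scores_by_ticker.values()) / len(scores_by_ticker)
    if aggregate > high_watermark:
        return "promote", aggregate      # new baseline
    return "revert", high_watermark      # keep the previous prompt

# One standout result on a single stock can't rescue a weaker overall run.
decision, mark = evaluate_candidate(
    {"AAPL": 9.0, "MSFT": 4.0, "GOOG": 4.0, "JPM": 4.0},
    high_watermark=5.6,
)
```

Here a 9.0 on one stock still yields an aggregate of 5.25, which loses to the 5.6 high watermark, so the change is reverted.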

The agent also stores its own run history. It can compare outputs across runs, see which changes led to improvements, and reference what it tried before. This matters because without it, the agent has no memory. It would suggest the same changes repeatedly or lose track of what worked. The run history gives it a foundation to reason from.
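Run history can be as simple as an append-only log the agent consults before proposing its next change. A sketch; the JSONL format and the field names are assumptions:

```python
import json
import tempfile
from pathlib import Path

def record_run(path: Path, prompt_version: int, overall_score: float, change: str) -> None:
    """Append one run to a JSONL history file."""
    with path.open("a") as f:
        f.write(json.dumps({"prompt_version": prompt_version,
                            "overall_score": overall_score,
                            "change": change}) + "\n")

def best_runs(path: Path, n: int = 3) -> list:
    """Return the top-n runs by score, so the agent can see what worked."""
    runs = [json.loads(line) for line in path.read_text().splitlines()]
    return sorted(runs, key=lambda r: r["overall_score"], reverse=True)[:n]

log = Path(tempfile.mkdtemp()) / "runs.jsonl"
record_run(log, 1, 3.5, "initial prompt")
record_run(log, 2, 4.2, "required citations for every claim")
top = best_runs(log, n=1)
```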

None of this is the agent doing its job. This is the infrastructure that lets the agent do its job. Without versioning, you can't trace what changed; without rollback, every experiment is a risk; and without run history, the agent can't learn from its own past. The improvement loop doesn't work unless all three are in place.

Breaking Through Plateaus

The agent is good at incremental improvement. It can see what the judge is penalizing and adjust. Scores go up, reports get more specific, prompts get tighter. When the trajectory is obviously positive, there's no reason to intervene.

Then it plateaus. The scores stop moving. The agent is still suggesting changes, but they're lateral moves. Different but not better. It has optimized within the environment it was given. The evaluation set, given the current tools and data sources, is effectively solved.

That's when the human steps back in. Not to tweak prompts. The agent is already better at that than we are within a fixed environment. The human's job is to reason from first principles about what capability or data source would unlock the next level of quality.

For example, early on, the agent was producing decent reports, but the quality had a ceiling. The reports reasoned about valuation in prose, but they couldn't do real financial modeling. So we built a discounted cash flow (DCF) model and gave the agent access to it. That wasn't a prompt change; it was a structural change to what the agent could do.
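The DCF tool itself is outside the scope of this post, but the idea is standard. A textbook two-stage sketch that discounts explicit-period free cash flows and adds a Gordon-growth terminal value (the production model surely differs):

```python
def dcf_value(free_cash_flows, discount_rate, terminal_growth):
    """Two-stage DCF: discount explicit-period FCFs, then add a
    Gordon-growth terminal value discounted back to today."""
    # Present value of the explicit forecast period.
    pv = sum(fcf / (1 + discount_rate) ** t
             for t, fcf in enumerate(free_cash_flows, start=1))
    # Terminal value at the end of the forecast period.
    terminal = (free_cash_flows[-1] * (1 + terminal_growth)
                / (discount_rate - terminal_growth))
    return pv + terminal / (1 + discount_rate) ** len(free_cash_flows)
```

Giving the agent a tool like this means valuation numbers come out of arithmetic, not prose.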

The same pattern repeated. Reports were missing nuance about management strategy and forward guidance, so we added earnings call transcripts as a data source. The agent was attempting arithmetic inline and getting it wrong, so we built computation tools that let it do the math in code. Each of these came from asking a specific question: what is the agent fundamentally unable to do right now, and what would fix that?

Each addition unlocked a step change in quality that no amount of prompt iteration could have reached. And each time, once the new capability was in place, the agent took over again. It figured out how to use the new tools well. It iterated on the prompts. Scores climbed. And eventually, it plateaued again.

Follow the Pattern to Improvement

The pattern is simple:

  1. Define what good looks like
  2. Build infrastructure that lets the agent iterate safely against that definition
  3. When it plateaus, reason from first principles about what's missing and change the environment
  4. Then, let it run again
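The promote/revert/plateau logic in steps 2-3 can be replayed over a sequence of aggregate judge scores. A control-flow sketch only; the threshold and names are assumptions:

```python
def run_loop(candidate_scores, plateau_rounds=3):
    """Replay promote/revert/plateau decisions over aggregate judge scores."""
    high_watermark, stalled, events = 0.0, 0, []
    for score in candidate_scores:
        if score > high_watermark:
            high_watermark, stalled = score, 0
            events.append("promote")
        else:
            stalled += 1
            events.append("revert")
        if stalled >= plateau_rounds:
            # Plateau: a human changes the environment (new tool or
            # data source), then the loop resumes.
            events.append("plateau")
            break
    return high_watermark, events

# Two improvements, then three lateral moves in a row -> plateau.
hw, events = run_loop([3.5, 4.2, 4.1, 4.0, 3.9])
```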

We used this approach to build a finance research agent that went through 15+ prompt versions and got meaningfully better each round. The goal is to continuously improve. As Richard Socher always says, “Better, better, never done.” 

To start building with You.com APIs, get your API key.
