April 21, 2026

The You.com Web Search Eval Harness: Benchmark Any Web Search Provider Yourself

Eddy Nassif

Senior Applied Scientist

AI systems grounded in web search are only as reliable as the search layer underneath them. When that layer retrieves inaccurate, stale, or incomplete information, the model compounds the error.

We have seen this play out repeatedly: developers and enterprise teams spend weeks configuring eval harnesses, only to discover their setup is inconsistent with how the original benchmarks were designed to be scored. Others accept vendor-published numbers at face value and make architecture decisions on claims they cannot reproduce. Either path leads to the same place: AI use cases that underperform in production, and teams that cannot pinpoint why.

The You.com Web Search Eval Harness is our attempt to remove that obstacle. It is an open-source evaluation framework that runs the four benchmarks that have emerged as the industry standard for measuring search accuracy in AI workflows: SimpleQA, FRAMES, BrowseComp, and DeepSearchQA. It works out of the box, produces results you can compare across providers, and is designed to be extended. 

Our goal is to give any developer or enterprise team a reliable starting point for vendor evaluation, so they can spend time on the AI use cases that matter rather than on eval infrastructure.

The Included Benchmarks

Not all search queries are alike, and no single benchmark covers the full spectrum from direct factual retrieval to deep, multi-source research. The You.com Web Search Eval Harness includes four benchmarks that, together, cover all of the bases:

  • SimpleQA (OpenAI) measures factual accuracy on direct questions with unambiguous answers. It is designed to be hard to game: questions are calibrated so that a model without access to current web information cannot reliably answer them. Results reflect whether the search layer is retrieving the right facts, not just relevant documents.
  • FRAMES (Google) tests accuracy on questions that require synthesizing structured information across a broader set of documents, covering the kind of multi-hop reasoning tasks common in enterprise research workflows.
  • BrowseComp (OpenAI) evaluates a system's ability to answer questions that require navigating across multiple sources, where no single page contains the complete answer. It measures the full retrieval and synthesis pipeline under conditions that resemble how research agents operate in practice.
  • DeepSearchQA (Google DeepMind) evaluates agents on complex, open-ended research tasks where correctness is measured by completeness and precision across an exhaustive answer set, not just the first correct result. It is the closest proxy for how search performs under production research workloads.

If there is a benchmark your team relies on that isn’t included, open an issue on GitHub—we’re all ears.

How the You.com Web Search Eval Harness Works

The Web Search Eval Harness handles dataset loading, query execution, answer synthesis, and scoring in a consistent pipeline across benchmarks. Scoring follows the methodology defined by the original benchmark authors. Where an original scoring model was deprecated, we selected the closest available replacement to preserve grading consistency.
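The four stages can be sketched roughly as follows. This is an illustrative outline, not the harness's actual code: every function name, the record layout, and the stubbed search/synthesis/grading logic here are assumptions for the sake of the example.

```python
# Illustrative sketch of the four-stage pipeline: load, search, synthesize,
# grade. All names and stubs here are assumptions, not the harness's real API.
from dataclasses import dataclass


@dataclass
class EvalRecord:
    question: str
    gold_answer: str
    predicted: str = ""
    correct: bool = False


def load_dataset() -> list[EvalRecord]:
    # Stage 1: dataset loading (stubbed with a single toy item).
    return [EvalRecord("What is the capital of France?", "Paris")]


def run_search(question: str) -> list[str]:
    # Stage 2: query execution against the provider under test (stubbed).
    return [f"Snippet mentioning Paris, retrieved for: {question}"]


def synthesize(question: str, snippets: list[str]) -> str:
    # Stage 3: answer synthesis; in the real harness a small LLM does this.
    return "Paris" if any("Paris" in s for s in snippets) else "unknown"


def grade(predicted: str, gold: str) -> bool:
    # Stage 4: grading; in the real harness a fixed LLM grader does this.
    return predicted.strip().lower() == gold.strip().lower()


def run_eval() -> list[EvalRecord]:
    records = load_dataset()
    for r in records:
        r.predicted = synthesize(r.question, run_search(r.question))
        r.correct = grade(r.predicted, r.gold_answer)
    return records
```

The point of the shared pipeline is that only stage 2 varies by provider; stages 1, 3, and 4 stay fixed so that scores are comparable across vendors.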

The framework uses OpenAI's gpt-5.4-mini for grading and gpt-5.4-nano for synthesis, regardless of which search endpoints you are testing. Consistent grading requires a fixed, third-party model that is not tied to any vendor being evaluated. The eval harness is open source, so you can inspect or adjust these choices.
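Conceptually, the model configuration looks something like the sketch below. The dict layout and function are hypothetical, not the harness's actual config format; the model names are the ones cited above, and the invariant being illustrated is that the grading and synthesis models do not change with the provider under test.

```python
# Hypothetical sketch of the fixed-model configuration described above.
# The structure is an assumption; only the model names come from the post.
MODEL_CONFIG = {
    "grader": "gpt-5.4-mini",       # scores answers against the gold label
    "synthesizer": "gpt-5.4-nano",  # turns retrieved snippets into an answer
}


def models_for_provider(provider: str) -> dict:
    # The same grading and synthesis models are used no matter which search
    # provider is being evaluated -- that is what keeps scores comparable.
    return dict(MODEL_CONFIG)
```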

Clone the repo, add your API keys, and you're running evaluations in minutes.

Code Example

  python src/evals/eval_runner.py --samplers you_search_with_livecrawl --datasets simpleqa

To run all default providers and datasets, drop the flags. Results are written to CSV in batches during execution, so longer runs are monitorable. If a run is interrupted, it resumes without reprocessing completed queries. Full setup instructions and configuration options are in the README.
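The batched-CSV and resume behavior described above can be sketched in a few lines. The file layout, column names, and function names below are assumptions for illustration, not the harness's internals: on restart, any question id already present in the results file is skipped.

```python
# Minimal sketch of resumable batch execution: results append to a CSV, and a
# restarted run skips question ids already on disk. Column names are assumed.
import csv
import os


def completed_ids(path: str) -> set[str]:
    # Collect the question ids of rows already written in a previous run.
    if not os.path.exists(path):
        return set()
    with open(path, newline="") as f:
        return {row["question_id"] for row in csv.DictReader(f)}


def run_batch(queries: dict[str, str], path: str) -> int:
    # Process only the queries not yet on disk; return how many were run.
    done = completed_ids(path)
    new_file = not os.path.exists(path)
    processed = 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["question_id", "answer"])
        if new_file:
            writer.writeheader()
        for qid, question in queries.items():
            if qid in done:
                continue  # resume: this query was completed before the interruption
            writer.writerow({"question_id": qid, "answer": f"answer to {question}"})
            processed += 1
    return processed
```

Running the same batch twice processes nothing the second time, which is the property that makes interrupted runs cheap to resume.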

The Results

When a run completes, analyzed_results.csv contains accuracy and latency for every provider and benchmark combination you tested.
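A results file like that is easy to slice yourself. The sketch below summarizes accuracy and mean latency per provider/benchmark pair using only the standard library; the column names (provider, benchmark, correct, latency_ms) are assumptions about the file layout, not documented fields.

```python
# Sketch of summarizing an analyzed-results CSV with the stdlib only.
# Column names are assumed, not taken from the harness's documentation.
import csv
import io
from collections import defaultdict


def summarize(csv_text: str) -> dict:
    # key -> [correct_count, total_count, latency_sum]
    acc = defaultdict(lambda: [0, 0, 0.0])
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["provider"], row["benchmark"])
        acc[key][0] += row["correct"] == "true"
        acc[key][1] += 1
        acc[key][2] += float(row["latency_ms"])
    return {
        key: {"accuracy": c / n, "avg_latency_ms": lat / n}
        for key, (c, n, lat) in acc.items()
    }
```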

Every vendor, including You.com, continuously updates their search endpoints, so results shift over time. For the latest numbers, see the results table on GitHub. The framework exists so you don't have to take anyone's snapshot on faith. When you want to know where things stand, run it yourself.

Getting Started

The repository is at github.com/youdotcom-oss/web-search-api-evals. The README covers setup, configuration, and how to run the evaluations.

You will need a You.com API key and an OpenAI API key to run the default configuration. The OpenAI key is used for the grading and synthesis models. 

If you want to understand more about how we think about search accuracy, evaluation methodology, and what good performance looks like in production AI workflows, we have written about it here: 

We want to hear from you. If you hit a configuration issue, have questions about your eval setup, want to request a benchmark, or just want to talk through how to evaluate search providers for your use case, start a conversation in GitHub Discussions.

For enterprise or private inquiries, reach out directly at api@you.com. We read it.
