Scaling AI Evaluation Through Expertise
How Harvey evaluates AI performance using expert feedback, automated pipelines, and custom data infrastructure.
May 22, 2025
Harvey Team
Introduction
Every day, professionals around the world rely on Harvey to deliver high-quality answers, analyze complex documents, and provide trustworthy citations they can use with confidence. Behind the scenes, this is powered by a sophisticated network of AI systems—and ensuring they work as intended requires an equally thoughtful approach to evaluation. At Harvey, we know high performance doesn’t come from guesswork—it comes from rigorous testing and constant iteration.
Imagine this scenario: A tax attorney needs to quickly understand how a new multinational tax ruling affects their client's overseas operations. They ask Harvey a complex question about cross-border tax implications. Within seconds, Harvey delivers an answer citing relevant tax codes and recent court interpretations—but how do we ensure this response is accurate, helpful, and properly sourced? This is where our evaluation system comes in.
We’ve developed a structured evaluation strategy grounded in three core pillars:
- Expert-led reviews that offer deep domain insights and uphold the highest professional standards
- Automated evaluation pipelines for rapid iteration and continuous monitoring
- Dedicated data services to ensure evaluations are organized, secure, and repeatable
Together, these pillars help us answer a fundamental question: Is our AI actually doing what our users need it to do?
In this blog post, we’ll walk through our evaluation systems and processes—from legal expert feedback to AI-powered citation checks—and how it all comes together to raise the bar for quality in high-stakes domains.
Domain Expert Review
Leveraging the expertise of professionals on the front lines of high-stakes work is the critical first step in our evaluation strategy. By collaborating with legal specialists, tax experts, and other subject matter experts, we ensure every improvement we make is grounded in real-world needs. What sets Harvey apart is how unusually direct this collaboration is. Most companies need to go through layers of abstraction—consultants, account managers, external feedback pipelines—to get expert input. At Harvey, we get it firsthand. Engineers regularly hop on calls with partners at some of the most prestigious law firms in the world. These are people whose time is otherwise reserved for billion-dollar cases or mentoring top-tier associates—but here, they're reviewing product decisions and helping us shape the frontier of legal AI.
Just last month, Omar Puertas from Cuatrecasas, one of Spain’s largest law firms, visited our San Francisco office to talk to the Harvey team about how his firm uses Harvey across practice areas and thinks about AI adoption. Moments like these aren’t one-offs. They’re baked into the way we work, and they create a kind of product feedback loop that’s almost impossible to replicate elsewhere.
Building Expert-Curated Retrieval Datasets
In our previous post about our work with PwC, we highlighted how expert feedback has been critical to improving our Tax AI Assistant. Since then, we’ve expanded this human-in-the-loop evaluation model across multiple product lines, including our Lefebvre Spanish case law offering, our updated EUR-Lex database, our web search capabilities, and many other search tools and knowledge sources.
These evaluations follow a consistent two-step process that mirrors our answer pipeline—the end-to-end system that retrieves documents, constructs context, and generates answers. First, we collaborate with domain experts to develop retrieval datasets: curated “golden” query sets designed to rigorously test how well our systems surface relevant documents. These queries range from common user questions to highly nuanced legal challenges that require deep domain expertise.
For each query, these experts identify the most relevant supporting documents. We then evaluate our retrieval systems—both traditional and agent-based—against these references using metrics like precision (the share of retrieved results that are actually relevant), recall (the share of relevant documents that were retrieved), and NDCG, or Normalized Discounted Cumulative Gain, which measures whether the most important documents appear near the top of the results. These metrics have proven highly predictive of real-world user satisfaction. We also test system performance under varying levels of retrieval power (how strong the underlying search engine is) and model context, or how much of the model’s context window is filled with useful information. This helps us understand tradeoffs between quality, speed, and cost.
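To make these metrics concrete, here is a minimal sketch (illustrative only, not our production code) of how precision, recall, and NDCG might be computed for a single query against an expert-labeled golden set; the document IDs are made up.

```python
import math

def retrieval_metrics(retrieved: list[str], golden: set[str], k: int = 10) -> dict:
    """Score one query's ranked results against an expert-labeled golden set."""
    top_k = retrieved[:k]
    hits = [doc_id in golden for doc_id in top_k]

    precision = sum(hits) / len(top_k) if top_k else 0.0   # share of retrieved docs that are relevant
    recall = sum(hits) / len(golden) if golden else 0.0    # share of relevant docs that were retrieved

    # NDCG: relevant documents earn more credit the closer they sit to the top.
    dcg = sum(1.0 / math.log2(rank + 2) for rank, hit in enumerate(hits) if hit)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(golden), k)))
    ndcg = dcg / idcg if idcg else 0.0

    return {"precision@k": precision, "recall@k": recall, "ndcg@k": ndcg}

# Hypothetical query: two of the three golden documents retrieved, one at the top.
print(retrieval_metrics(["doc_17", "doc_02", "doc_99"], {"doc_17", "doc_44", "doc_99"}, k=3))
```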
This stage is all about search: can we get the right information in front of the model? In an agentic system, this isn’t a one-time step—it’s an iterative process where the system may search, reflect, and refine its context multiple times before generating a response. That’s why it’s so important to get this stage right: mistakes here can cascade, compounding in later steps and undermining the final answer.
Evaluating Answer Quality
Of course, surfacing the right documents is only part of the battle—we also need to make sure the model produces a helpful, accurate answer. But evaluating generative outputs, especially those that require domain expertise to judge, is far harder to automate. That's why we’ve built an internal tool for side-by-side LLM comparisons, enabling domain experts to assess responses in a structured, unbiased way.
We run two complementary human-evaluation protocols:
- A/B preference tests: Experts see two anonymized answers side-by-side, with model order randomized, and choose the better one.
- Likert-scale ratings: Experts rate each answer independently on a scale from 1 (very bad) to 7 (very good), assessing dimensions like accuracy, helpfulness, and clarity.
These controls—randomized ordering, standardized prompts, and anonymized content—help reduce labeling bias and allow us to detect statistically significant improvements when we modify prompts, pipelines, or models.
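As a rough illustration of how the two protocols can be scored (this is not our actual tooling, and the vote counts and ratings below are made up), one might use a two-sided binomial sign test for A/B preferences and a Wilcoxon signed-rank test for paired Likert ratings:

```python
from scipy import stats

# A/B preference test: hypothetical counts of expert votes for each anonymized answer.
wins_candidate, wins_baseline = 78, 42
ab = stats.binomtest(wins_candidate, n=wins_candidate + wins_baseline, p=0.5)
print(f"A/B preference p-value: {ab.pvalue:.4f}")

# Likert ratings: the same queries rated 1-7 under the baseline and candidate systems.
baseline_ratings  = [5, 4, 6, 5, 5, 3, 6, 5, 4, 5, 6, 4]
candidate_ratings = [6, 5, 7, 6, 6, 5, 7, 6, 5, 6, 7, 5]
w_stat, p_value = stats.wilcoxon(candidate_ratings, baseline_ratings)
print(f"Paired Likert p-value: {p_value:.4f}")
```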
We recently used this setup to compare freshly released GPT-4.1 with GPT-4o in one of our AI systems designed to handle complex legal questions. The results were clear: GPT-4.1 significantly outperformed GPT-4o. The mean rating improved by over 10% (from 5.10 to 5.63), and the median score rose from 5 to 6—a full point higher on a 7-point scale. The difference wasn’t just noticeable; it was statistically significant, giving us strong confidence that the improvement was real and consistent.

These evaluations give us the signal we need to justify changes to our underlying model stack. In this case, the feedback helped validate our decision to shift more workloads to GPT-4.1, knowing it would lead to a measurable jump in quality for end users.
Side-by-side evaluations have also been instrumental in refining everything from prompt templates to how we post-process citations. In one recent iteration, a seemingly minor change to how we chunk retrieved documents led to a noticeable boost in both perceived helpfulness and factual grounding—an insight we would’ve likely missed without structured expert review.
Automated Evaluation Pipelines
While expert-led reviews offer a high level of rigor, they face several key limitations:
- Data Scarcity: The sheer volume of potential test cases surpasses what any single expert or team can reasonably evaluate.
- Feedback Latency: Manual reviews typically occur in discrete batches, delaying essential insights and slowing down iteration.
- Fragmented Expertise: Different jurisdictions or practice areas require specialized knowledge, adding complexity and cost.
- Regression Risks: Without systematic, large-scale metrics, improvements in one area can inadvertently lead to declines elsewhere.
That’s where our automated evaluation pipelines come in—extending human feedback with continuous, data-driven methods that enable rapid iteration, broader coverage, and consistent monitoring of system performance. By combining expert-led reviews with automated evaluations, we close critical gaps that either approach might leave on its own—ensuring broad evaluation coverage without sacrificing depth or rigor.
Evaluation Systems Grounded in Legal Expertise
A core innovation at Harvey is our integration of automated evaluation systems informed by deep legal knowledge. These systems go beyond generic benchmarks to capture the nuanced demands of professional legal workflows.
Our automated evaluation systems consider several key elements: the model’s output, the original user request, relevant domain documentation or knowledge bases, and expert-provided prior knowledge. The evaluator uses this information to produce two results: a grade that reflects how well the model’s output meets the expected quality or correctness standards, and a confidence score that indicates how reliable that grade is. The grade may be numerical or categorical, depending on the context, and the confidence score helps determine how much trust to place in the evaluation. For example, consider a Q&A dataset constructed with expert-crafted golden queries and evaluation rubrics for model-based auto-grading. A representative query might be:
“Analyze these trial documents and draft an analysis of conflicts, gaps, contradictions, or ambiguities, including a detailed chronology of events and analysis results.”
The associated evaluation rubric could assess dimensions such as structure (e.g., Is the response presented in a structured format such as a table with columns X, Y, Z?), style (e.g., Does the response emphasize actionable advice?), and substance (e.g., Does the response state a certain fact?), along with a check for hallucinations or misconstrued information. The query and rubrics are used to calculate the grade and confidence score with carefully designed and calibrated model-based grading systems.
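To show how a rubric like this can drive model-based auto-grading, here is a simplified sketch; the class names, prompt wording, grade scale, and the call_grading_model placeholder are assumptions for illustration, not Harvey's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    dimension: str   # e.g., "structure", "style", "substance", "hallucination"
    criterion: str   # the question the grader must answer about the response
    weight: float    # relative importance in the overall grade

def build_grading_prompt(user_request: str, model_output: str,
                         expert_notes: str, rubric: list[RubricItem]) -> str:
    """Assemble a grading prompt from the request, response, expert notes, and rubric."""
    criteria = "\n".join(f"- [{item.dimension}] {item.criterion}" for item in rubric)
    return (
        "You are grading an AI response to a legal request.\n\n"
        f"User request:\n{user_request}\n\n"
        f"Model response:\n{model_output}\n\n"
        f"Expert reference notes:\n{expert_notes}\n\n"
        "Assess the response against each criterion, then return JSON with an "
        "overall 'grade' (1-7) and a 'confidence' (0-1):\n"
        f"{criteria}"
    )

def call_grading_model(prompt: str) -> dict:
    """Placeholder for a calibrated LLM judge; wire this up to a model client of choice."""
    raise NotImplementedError

def grade_response(user_request, model_output, expert_notes, rubric) -> dict:
    prompt = build_grading_prompt(user_request, model_output, expert_notes, rubric)
    return call_grading_model(prompt)  # expected shape: {"grade": 6, "confidence": 0.85}
```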
Through close collaboration between engineers and legal experts, Harvey has developed a suite of automated methods that accelerate R&D while providing continuous, around-the-clock monitoring of answer quality. These methods serve three core purposes:
- Routine Evaluations: We run a suite of lightweight canary evaluations nightly to validate the day's code changes before they go to production, catching regressions in sourcing accuracy, answer quality, legal precision, and more (a minimal sketch of such a regression gate follows this list).
- Production Monitoring: We monitor anonymized production data to track performance trends and gain insights—without compromising client confidentiality.
- Model Vetting: We evaluate newly released foundation models to identify performance gains and guide integration, ensuring Harvey remains at the forefront of AI-driven legal solutions.
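The regression gate behind a nightly canary run can be sketched roughly as follows; the metric names, baseline values, and tolerance are invented for illustration.

```python
# Illustrative canary gate: compare tonight's evaluation scores against fixed
# baselines and fail the run if any metric regresses beyond a tolerance.
BASELINES = {"sourcing_accuracy": 0.93, "answer_quality": 5.4, "citation_precision": 0.95}
TOLERANCE = 0.02  # maximum allowed relative drop

def canary_gate(scores: dict[str, float]) -> list[str]:
    """Return regression messages; an empty list means the gate passes."""
    failures = []
    for metric, baseline in BASELINES.items():
        score = scores.get(metric)
        if score is None:
            failures.append(f"{metric}: missing from tonight's run")
        elif score < baseline * (1 - TOLERANCE):
            failures.append(f"{metric}: {score:.3f} fell below baseline {baseline:.3f}")
    return failures

nightly_scores = {"sourcing_accuracy": 0.94, "answer_quality": 5.5, "citation_precision": 0.96}
if problems := canary_gate(nightly_scores):
    raise SystemExit("Canary evaluation failed:\n" + "\n".join(problems))
print("Canary evaluation passed.")
```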
Knowledge Sources: A Specialized Auto-Evaluation Example
Building on our general auto-evaluation pipeline, certain tasks require more focused, domain-specific techniques. One such example is our Knowledge Source Identification system, designed to verify legal citations generated by LLMs with high accuracy. This system is a core component of our model refinement loop and addresses several unique engineering challenges:
- High-Volume Fuzzy Matching: Quickly and accurately matching citation strings against a corpus of millions of documents, even when data is incomplete or slightly misspelled.
- Metadata Weighting: Properly weighting fields like the document name, date, parties, and publication in situations where the citation is partial or ambiguous.
We overcame these challenges with a custom embedding pipeline that prioritizes document title similarity and accounts for source context. The process starts with structured metadata extraction from each citation, parsing details such as the title, source collection, volume/issue (if any), page range, author/organization, and publication date. Proprietary or incomplete citations are excluded to focus on verifiable public sources.
When reliable publication data exists, the system queries an internal database to retrieve a curated set of candidate documents. When metadata is partial—such as when only a title fragment or source identifier is available—an embedding-based retrieval approach with date filters is used to identify likely matches.
Finally, an LLM performs a binary document-matching evaluation, confirming whether the retrieved candidate refers to the same document as the original citation. This combination of structured parsing, intelligent retrieval, and LLM-based judgment has yielded over 95% accuracy on our internal benchmark dataset validated by attorneys.
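Sketched in Python, the three stages might fit together roughly like this; parse_citation, lookup_publication_index, embedding_search, and llm_same_document are hypothetical names standing in for components not shown here.

```python
from dataclasses import dataclass

@dataclass
class CitationMetadata:
    title: str
    source: str | None = None
    publication_date: str | None = None
    pages: str | None = None

def verify_citation(raw_citation: str) -> bool:
    """Three-stage check: parse the citation, retrieve candidates, LLM-match."""
    # 1. Structured metadata extraction from the citation string.
    metadata = parse_citation(raw_citation)

    # 2. Candidate retrieval: exact index lookup when publication data is reliable,
    #    otherwise title-embedding search constrained by a date filter.
    if metadata.source and metadata.publication_date:
        candidates = lookup_publication_index(metadata)
    else:
        candidates = embedding_search(metadata.title, date_filter=metadata.publication_date)

    # 3. Binary document matching: an LLM confirms whether any candidate is the
    #    same document the original citation refers to.
    return any(llm_same_document(raw_citation, candidate) for candidate in candidates)
```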

Data Management
Enterprise-grade evaluations demand more than just expert input and automated pipelines. They require a secure, organized operations layer that centralizes data, enforces strong security controls, and supports rich stakeholder collaboration. At Harvey, we’ve built a dedicated service for evaluation data that addresses the often-overlooked challenges of organizing, labeling, and versioning data. Through this data service, we can safeguard confidential information, streamline workflows for our domain experts, and maintain a single source of truth for all evaluation activities.

Centralized and Secure Data Management
A recurring challenge in complex, multi-stakeholder evaluation processes is tracking where different datasets reside and controlling who can access them. We established a service dedicated to evaluation data, isolated from Harvey’s primary application to prevent unintentional data leakage or unauthorized dependencies. This gives us complete control over how data is accessed, updated, and versioned, while still allowing for the flexibility needed to handle anything from small pilot projects to large-scale research efforts.
We standardize how inputs, outputs, and annotations are stored, so legal experts, engineers, and automated evaluators all work from the same playbook. This reduces confusion, speeds up iteration, and ensures consistent quality across teams. Moreover, our fine-grained role-based access control system enforces strict privacy policies at the row level, enabling us to segment data into public, confidential, and restricted tiers. This means sensitive legal documents and customer-provided examples remain under the tightest restrictions, while higher-level metrics and aggregate statistics can be safely shared more broadly.
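As a rough sketch of the idea (the role names and role-to-tier mapping are invented for illustration, not our actual policy), row-level tiering can be thought of like this:

```python
from enum import IntEnum

class Tier(IntEnum):
    PUBLIC = 0
    CONFIDENTIAL = 1
    RESTRICTED = 2

# Hypothetical mapping from roles to the highest tier they may read.
ROLE_CLEARANCE = {
    "external_reviewer": Tier.PUBLIC,
    "domain_expert": Tier.CONFIDENTIAL,
    "eval_engineer": Tier.RESTRICTED,
}

def visible_rows(rows: list[dict], role: str) -> list[dict]:
    """Filter evaluation rows down to those the role is cleared to see."""
    clearance = ROLE_CLEARANCE.get(role, Tier.PUBLIC)
    return [row for row in rows if row["tier"] <= clearance]

rows = [
    {"id": 1, "tier": Tier.PUBLIC, "content": "aggregate NDCG by release"},
    {"id": 2, "tier": Tier.RESTRICTED, "content": "customer-provided example"},
]
print([r["id"] for r in visible_rows(rows, "domain_expert")])  # -> [1]
```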
Streamlined Collaboration and Versioning
Building a standalone evaluation platform streamlines cross-functional collaboration. Domain experts get a single interface to add or refine datasets, attach relevant documents, and provide qualitative feedback, while engineers benefit from APIs that let them run automated checks, create or consume new datasets, and track changes over time. Crucially, we have implemented dataset versioning as a core principle: once a collection of evaluation samples is “published,” it becomes locked, ensuring that iterative experiments have a fixed baseline for comparing new features, models, or methods. This immutability enhances reproducibility, helping product teams confirm that any observed quality improvements are the result of deliberate changes rather than shifting datasets or annotation drift.
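The publish-then-lock principle can be illustrated with a small sketch; the class and method names below are hypothetical, not our actual API.

```python
from dataclasses import dataclass, field

@dataclass
class EvalDataset:
    name: str
    version: int = 1
    published: bool = False
    samples: list[dict] = field(default_factory=list)

    def add_sample(self, sample: dict) -> None:
        if self.published:
            raise ValueError(
                f"{self.name} v{self.version} is published and immutable; "
                "create a new version to add samples."
            )
        self.samples.append(sample)

    def publish(self) -> None:
        """Lock this version so experiments compare against a fixed baseline."""
        self.published = True

    def new_version(self) -> "EvalDataset":
        """Start the next (unpublished) version seeded from the current samples."""
        return EvalDataset(self.name, self.version + 1, False, list(self.samples))

golden = EvalDataset("eur-lex-retrieval")
golden.add_sample({"query": "...", "relevant_docs": ["..."]})
golden.publish()
draft = golden.new_version()  # edits go into v2; v1 stays frozen
```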
Conclusion
As we continue to expand what Harvey can do—from legal research to agentic workflows—our commitment to rigorous evaluation remains at the heart of our approach. In high-stakes fields like law and tax, AI isn’t just about speed or scale—it’s about trust. That’s why we evaluate relentlessly. Looking ahead, we're tackling even more exciting evaluation challenges: How do we assess the quality of multi-step reasoning? How do we generalize the automation of domain expert reviews? These are the kinds of questions our engineering team is excited to solve next. Because at Harvey, we’re not just building models—we’re building the future of professional work.
If you're an engineer who gets excited about these types of challenges, explore our open roles or reach out directly to engineering@harvey.ai to learn more about how we work.
Thanks to everyone who’s worked on eval at Harvey!
Samarth Goel, Boling Yang, Nan Wu, Stefan Palombo, Joel Niklaus, Spencer Poff, Pablo Felgueres, Lysia Li, Calvin Qi, Reggie Cai, Emilie McConnachie, Bronwyn Austin, Matthew Guillod, Lauren Oh, Niko Grupen, Julio Pereyra, and the Applied Legal Research team