How We Use ELO Scores to Build Better Legal AI
Turning human preference into measurable, meaningful improvements across the Harvey platform.
In BigLaw Bench: Arena, we described our system for scaling human preference evaluation of results. The core artifacts produced by that system are ELO scores for Harvey systems and non-Harvey baselines. These scores help us understand how likely the output of one system is to be preferred over that of another when both outputs are presented to lawyers.
To make this concrete, the Assistant ELO scores from BLB: Arena suggest that Harvey Assistant's responses are preferred to those of generic foundation models more than 70% of the time.
Importantly, we use ELO not just to celebrate the results of applied AI work, but also to help shape that work. In the rest of this post, we explain how we use ELO generally to understand and improve AI systems through specific, recent examples from across Harvey’s products.
ELO Explained
ELO is a rating system for competitors, originally developed for chess and later adopted to measure skill in many other competitive games. It converts a large number of head-to-head wins and losses into a single, comparative measure of how good a competitor (e.g., an AI system) is.
Specifically, a pair of ELO scores expresses the likelihood that one AI system will win (be preferred in the quality of its output) over another system on any given task. We use ELO at Harvey because it provides an intuitive way to express quality differences between AI systems and measure when those differences become meaningful to our customers.
“ELO expresses the likelihood that one AI system will [be preferred] to another system on any given task.”
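As a concrete sketch of how head-to-head preferences become ratings, here is the standard Elo update rule. This is a minimal illustration, not Harvey's actual fitting procedure; the K-factor of 16, the 1500 starting rating, and the example outcome sequence are assumptions for demonstration:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that system A is preferred over system B,
    under the standard Elo logistic model (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 16.0):
    """Adjust both ratings after one head-to-head comparison.
    k (the K-factor) controls how fast ratings move; 16 is illustrative."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

# Start two systems at 1500 and replay a batch of pairwise preferences.
ratings = {"A": 1500.0, "B": 1500.0}
outcomes = [True, True, False, True, True]  # True = A's output preferred
for a_won in outcomes:
    ratings["A"], ratings["B"] = update(ratings["A"], ratings["B"], a_won)
```

Because each comparison moves both ratings by equal and opposite amounts, the system that is preferred more often ends up with the higher rating, and the gap between ratings encodes how lopsided the preferences were.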
In practice, an ELO gap of around 150–200 points tends to reflect a clear preference. Roughly speaking, this means the higher-rated system is preferred about 70% of the time, and at this separation users routinely report that the better system feels clearly better for their work. Gaps smaller than this tend to be qualitatively neutral to customers: some lawyers prefer one system, while others see no difference in practice.
At around a 400-point ELO gap, systems are usually experienced as fundamentally different, not just improved. With this gap, preference for the stronger system exceeds 90%, and users find it more useful almost all the time. Historically at Harvey, shifts of this size have marked genuine step-changes — such as the introduction of sentence-level citations or focused investments in research models for case law and tax.
Because these patterns recur reliably in human evaluations, we use ELO as a shorthand for understanding when improvements are likely to be felt by customers and how. For that reason, human preference — and the ELO scores derived from it — remains central to how Harvey measures progress and builds meaningfully differentiated AI.
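The gap-to-preference mapping described above follows from the standard Elo expected-score formula; a small sketch (the formula is standard, and the gaps below are the ones discussed in the text):

```python
def preference_rate(elo_gap: float) -> float:
    """Expected rate at which the higher-rated system is preferred,
    under the standard Elo logistic model with a 400-point scale."""
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400))

# A 150-200 point gap lands near a 70% preference rate;
# a 400-point gap pushes past 90%.
for gap in (150, 200, 400):
    print(f"{gap:>3}-point gap -> preferred {preference_rate(gap):.0%} of the time")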
Understanding Systems: Knowledge Sources (November 2025)
The main way we use ELO is to understand what people like and dislike about using AI systems. Oftentimes, we run studies to generate initial ELO scores and then work through the data to convert those scores into a meaningful why. This is especially helpful when we may not have intuitions about a system, such as when we build new international knowledge sources.
When building our Australia knowledge source, we contracted with local lawyers to evaluate at scale across various AI systems and figure out what they genuinely preferred. The results were somewhat unsurprising. Lawyers preferred systems that were accurate, drew on primary sources, and provided clear and actionable content from those sources. We used these pillars to build a strongly differentiated Australia knowledge source.
Australia System* | ELO | Harvey Preference Rate (% of time Harvey is preferred over alternative) |
|---|---|---|
Harvey | 1500 | N/A |
GPT-5 (reasoning high) | 1333 | 72.34% |
Gemini | 1165 | 87.31% |
GPT-5 (no reasoning) | 1154 | 87.99% |
OpenAI Deep Research | 1133 | 89.21% |
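Assuming the preference-rate column is derived from the ELO differences via the standard Elo expected-score formula — which the published numbers are consistent with — here is a quick check, with ratings copied from the table above:

```python
def preference_rate(elo_gap: float) -> float:
    # Standard Elo expected score on a 400-point logistic scale.
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400))

HARVEY_ELO = 1500
baselines = {
    "GPT-5 (reasoning high)": 1333,
    "Gemini": 1165,
    "GPT-5 (no reasoning)": 1154,
    "OpenAI Deep Research": 1133,
}
for name, elo in baselines.items():
    rate = preference_rate(HARVEY_ELO - elo)
    print(f"{name}: Harvey preferred {rate:.2%} of the time")
```

Recovering 72.34%, 87.31%, 87.99%, and 89.21% from the ratings alone shows why a single ELO score is a compact summary: the whole preference-rate column is implied by the rating column.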
As we extended these research cycles to other countries, we found a nuance that did not exist in Australia. In many countries, localization was a challenge for models even when they were given web search access. For instance, questions written in German by Austrian lawyers would often return answers about German law. Adding strong localization allows us to deliver an Austria system that is even more differentiated than our Australia system in enabling high-quality responses for local legal work.
| Austria System* | ELO | Harvey Preference Rate (% of time Harvey Knowledge Source is preferred over alternative) |
|---|---|---|
| Harvey (Knowledge Source) | 1500 | N/A |
| Harvey (General Web) | 1281 | 77.91% |
| GPT-5 (Auto) | 1181 | 86.25% |
| Claude 4.5 Sonnet | 907 | 96.81% |
Setting New Standards: EDGAR (December 2025)
Another way we use ELO is to benchmark and improve against a particular standard. When we rebuilt our EDGAR system, we wanted to ensure it remained differentiated not just against generalist models, but against specialized financial data search APIs.
| EDGAR System | ELO Before EDGAR Rebuild | Harvey Preference Rate (% of time Harvey EDGAR is preferred over alternative) |
|---|---|---|
| Human | 1651 | 29.54% |
| Harvey (EDGAR) | 1500 | N/A |
| Harvey (Web) | 1499 | 50.14% |
| GPT-5 (Financial Search API) | 1489 | 51.58% |
| GPT-5 (Web) | 1488 | 51.73% |
Our results were surprising: none of these systems was clearly differentiated from the others; each had unique strengths and weaknesses that, on balance, came out neutral. So, we set a new standard. Building on our general knowledge source understanding, we asked human lawyers to write short, accurate, primary-sourced answers to dozens of EDGAR search questions. These answers handily beat the AI systems and set a north star for improvement.
From there, we iterated on our EDGAR system, improving where AI-generated answers fell short of human answers (accuracy) and leaning in where they tend to excel (tireless detail). The result: an EDGAR system that can go head-to-head in preference with human experts and is differentiated from other AI systems. While there are still improvements to be had, ELO allowed us to track and iterate on not just performance, but meaningful performance.
| EDGAR System | ELO After EDGAR Rebuild | Harvey Preference Rate (% of time Harvey EDGAR is preferred over alternative) |
|---|---|---|
| Harvey (EDGAR) | 1695 | N/A |
| Human | 1689 | 50.86% |
| Harvey (Web) | 1491 | 76.39% |
| GPT-5 (Financial Search API) | 1489 | 76.60% |
| GPT-5 (Web) | 1488 | 76.70% |
Pushing Our Own Boundaries: Vault (September 2025)
ELO also helps us iterate in places where no external baseline exists. In evaluating Harvey’s Vault product, we knew it was as accurate as humans on meaningful tasks. But we wanted to ensure we were presenting that information as usefully and as quickly as possible. So, we set out with three competing objectives: could we gain ELO over our existing Vault system while also reducing the time to generate a cell and improving citation quality?
To explore the possibilities, Harvey Applied Legal Researchers competed to build different interpretations of Vault using an array of models, prompts, and other techniques to optimize the relevant parameters. The result was an array of choices for a better Vault system, including several systems that both meaningfully improved human preference and citation quality while drastically reducing latency.
Additionally, we found surprising strengths in unexpected models, such as GPT-4.1-mini’s ability to provide clear, compelling answers (though it struggled on citations). Continuing research on possibilities unlocked by new models is expected to compound these improvements, allowing us to continuously improve the Vault experience as new tools become available.
| Vault System (listed by base model) | ELO | Harvey Preference Rate (% of time Harvey A is preferred over alternative) | Citation Quality (% of valid citations) | Latency (average time to complete batch of 50 cells) |
|---|---|---|---|---|
| Harvey A (GPT-4.1-mini) | 1603 | N/A | 62% | 17.37s |
| Harvey B (GPT-5) | 1549 | 57.71% | 97% | 11.27s |
| Harvey C (Sonnet 4) | 1538 | 59.25% | 89% | 16.29s |
| Harvey Baseline | 1500 | 64.40% | 92% | 23.91s |
| Harvey D (GPT-5-mini) | 1454 | 70.22% | 92% | 14.04s |
Continuous Improvement
The baseline expectations of AI are always shifting. New models, paradigms, and tools are constantly becoming available to push the capabilities of AI in professional services. Human preference and ELO are among the main tools we use to separate hype from value by turning complex human data into clear insights. Doing so allows us to stay differentiated from baseline AI capabilities and to keep refining our systems to deliver the best AI for legal work.
Credits: The research cited in this post was performed by Cam MacGregor and Olga Baranoff (Vault); Chris Bello (Knowledge Sources); Emilie McConnachie (EDGAR); and Elizabeth Lebens (Assistant).





