Expanding Harvey’s Model Offerings

With multiple leading foundation models now available in Harvey, users get better results by default—and more control when needed. We’re pairing this shift with deeper evaluations to help match users with the right models for the right legal tasks.

May 13, 2025


Harvey Team

At Harvey, our mission is to build the best AI for legal work. As AI systems have evolved from chat interfaces to agents and workflows, delivering on this mission requires Harvey to change as well. The first change we are making is to incorporate and optimize leading foundation models from Anthropic and Google for use across the Harvey platform. Both providers’ models will be integrated through their respective cloud platforms (AWS Bedrock, Google Vertex AI), with the same security and privacy guarantees that have always existed on the Harvey platform. Over time, we intend to identify other foundation model providers that offer unique advantages for Harvey systems and incorporate them as well.
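To make the integration path concrete, here is a minimal sketch of calling an Anthropic model through AWS Bedrock and a Google model through Vertex AI. The model IDs, project, and region are illustrative placeholders, not Harvey’s actual configuration:

```python
# Illustrative only: model IDs, project, and region are placeholders,
# not Harvey's production configuration.
import boto3
import vertexai
from vertexai.generative_models import GenerativeModel

prompt = "Summarize the key indemnification obligations in this clause."

# Anthropic models served through AWS Bedrock (Converse API).
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
bedrock_response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model ID
    messages=[{"role": "user", "content": [{"text": prompt}]}],
)
print(bedrock_response["output"]["message"]["content"][0]["text"])

# Google models served through Vertex AI.
vertexai.init(project="example-project", location="us-central1")  # placeholders
gemini = GenerativeModel("gemini-1.5-pro")  # placeholder model ID
print(gemini.generate_content(prompt).text)
```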

For most users, this change will only be felt in results. They will get better responses, more collaborative agents, and more powerful workflows, because by default Harvey will auto-route requests to the best model systems for legal work. Multi-model integration will also open up novel customization and optimization options for firm-specific use cases and workflows. In support of both initiatives, Harvey is also redoubling its commitment to transparent evaluation. Below, we detail the reasons for becoming more deeply multi-model, the impact of that change, the evaluations that support these efforts, and how we will continue providing the best AI for legal work.

Multi-model Evaluations

The recent decision to incorporate additional models has been driven by the emergence of a large number of highly performant foundation models. Last year, we published a first set of results on our proprietary benchmark: BigLaw Bench. Those results reflected what we knew: that a small number of models could be optimized for use within Harvey systems to substantially outperform competitors across a variety of legal reasoning tasks.

Since then, general foundation models have improved at baseline legal reasoning. This dramatically reduces the effort of adopting new models into the Harvey platform, as optimization can focus on task execution, incorporating firm knowledge, and user collaboration rather than on baseline reasoning. The graph below shows the performance of various models when incorporated into and optimized for Harvey systems. In less than a year, seven models (including three non-OpenAI models) now outperform the originally benchmarked Harvey system on BigLaw Bench.

[Figure 1: BigLaw Bench scores of models optimized within Harvey systems]

This convergence in general quality hides a more subtle trend. There are still substantial differences in which model is best suited to a particular legal task. Like lawyers, modern models present different strengths, weaknesses, and biases. For example, on BigLaw Bench subtasks, Gemini 2.5 Pro excels at legal drafting but struggles with trial preparation and oral argument because it has difficulty reasoning about complex evidentiary rules, such as hearsay, that models from other providers handle more reliably.

[Figure 2: BigLaw Bench scores by legal subtask]

The same holds true for expert evaluation of models. When asked to assess response quality and preference, experts show wide variance in their ratings of individual model outputs, even as their evaluations of overall model quality converge at scale. In an increasing number of cases, expert preference is also being driven by subjective factors such as tone, style, and structure as models converge on the right answer but present that answer in materially different forms. Taken together, these different perspectives on evaluation point towards the same outcome: there is no longer a single “best” model, but an array of premier models, each particularly suited to different legal tasks and preferences.

Multi-model, Agents and Workflows

While there may no longer be a “best” model, there can continue to be a best model system—one that effectively leverages all of these models to maximize value for our clients. As we continue to integrate and optimize models in the Harvey platform, users will access these improved systems in two ways. Most simply, users will continue to use Harvey as they normally do while seamlessly being routed to the best models for their work. Lawyers and teams interested in more actively exploring how different models can support their work will also have access to a model selector. This will allow them to pick the particular model used by Harvey’s model systems to support their work, or choose “auto” to have Harvey select for them.

[Figure: Harvey’s model selector]
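Conceptually, the interaction between the model selector and auto-routing can be sketched in a few lines. The task categories and model names below are assumptions for illustration, not Harvey’s actual routing logic:

```python
# Hypothetical routing table: task categories and model names are assumptions,
# not Harvey's actual routing logic.
ROUTING_TABLE = {
    "drafting": "provider-a-flagship",
    "research": "provider-b-flagship",
    "diligence": "provider-c-flagship",
}
DEFAULT_MODEL = "provider-a-flagship"

def select_model(task_type: str, user_choice: str = "auto") -> str:
    """Honor an explicit pick from the model selector; otherwise auto-route by task."""
    if user_choice != "auto":
        return user_choice
    return ROUTING_TABLE.get(task_type, DEFAULT_MODEL)

print(select_model("drafting"))                         # auto-routed by task type
print(select_model("research", "provider-a-flagship"))  # explicit selection wins
```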

As systems continue to evolve, incorporating a broad range of models will allow us to lean into each model’s individual strengths, selecting models optimally suited for specific sub-tasks or agentic processes. Optimizing around a model’s inherent strengths becomes increasingly necessary when pursuing more complex agents and workflows. These multi-step problems demand increasingly exacting model quality, as they considerably shrink the margin of error at each step. Even in a relatively straightforward three-step process, completing 80% of the work at each step means completing only about half of the overall work.
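The arithmetic behind that shrinking margin of error is simple compounding, illustrated below:

```python
# Per-step quality compounds across a multi-step process:
# three steps at 80% each leave only about half of the overall work done.
def end_to_end_quality(per_step_quality: float, steps: int) -> float:
    return per_step_quality ** steps

print(round(end_to_end_quality(0.80, 3), 3))  # 0.512
print(round(end_to_end_quality(0.95, 3), 3))  # 0.857 -- small per-step gains compound too
```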

Workflows and other task-specific systems are ideal for hitting this exacting quality bar as they are built to solve specific, high-value problems. When building a research agent, you don’t need to worry about whether a model struggles at writing; you can use a separate model to draft once the research is compiled. Instead, you can identify the optimal solution for a particular workflow or even a particular step—independently choosing the model best suited to plan, orchestrate, and execute each task across a complex system. Broadening the number of base models that the Harvey team can integrate will increase the quality of existing systems and allow for entirely new ones.
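As a rough sketch of what per-step model assignment can look like, the pipeline below uses one model for research and a different model for drafting. The identifiers and step implementations are hypothetical placeholders, not Harvey’s workflow code:

```python
# Hypothetical two-step pipeline: model identifiers and step bodies are placeholders.
RESEARCH_MODEL = "model-strong-at-research"
DRAFTING_MODEL = "model-strong-at-drafting"

def run_research(question: str, model: str) -> str:
    """Placeholder research step (retrieval plus synthesis)."""
    return f"[memo compiled by {model} for: {question}]"

def run_drafting(memo: str, model: str) -> str:
    """Placeholder drafting step that consumes the research memo."""
    return f"[draft produced by {model} from {memo}]"

def research_then_draft(question: str) -> str:
    memo = run_research(question, RESEARCH_MODEL)
    return run_drafting(memo, DRAFTING_MODEL)

print(research_then_draft("Does the parol evidence rule bar this email chain?"))
```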

Quantifying Quality

Enabling a broad range of models for an even broader range of legal workflows requires novel evaluations capable of capturing the task-specific variation that is now essential to pushing the quality frontier. These evaluations not only help us identify the best models but also serve as critical guidance for our clients. Clear evaluations allow us to give precise insights on which models are best suited for particular use cases and preferences, and how Harvey systems overall help solve our clients’ unique problems.

As our model strategy becomes increasingly complex, it becomes more important than ever to show our work in evaluation. To that end, we will be maintaining a public leaderboard for BigLaw Bench, providing an up-to-date standard on how baseline model reasoning is evolving on legal tasks. We will also begin to publish results of our expert evaluation process, where top lawyers provide nuanced insights into model performance not captured by single-score benchmarks. These complementary evaluations will continue to evolve in order to provide the most comprehensive view of model and system performance across legal use cases.

Conclusion

We are excited to announce the integration of Anthropic and Google models to supplement our existing OpenAI models in the Harvey platform. Integrating these additional models will provide us with more optionality in selecting the best models for particular legal tasks. As this broader suite of models unlocks additional performance and net new functionality, we will also redouble our public evaluation efforts to continue to provide clarity in how a rapidly changing AI environment is driving real outcomes for our clients. While the tools may change, our mission remains constant: build the best AI for legal work.