
How Legal Teams are Using AI for Legal Discovery

Learn how litigation teams use legal-grade AI and defensible protocols to manage discovery reviews faster, protect privilege, and meet tight production deadlines.

by Harvey Team • Jun 5, 2026

A litigation team is deep into discovery. The corpus is 800,000 documents pulled from twenty-three custodians. The production deadline is 30 days out. The traditional linear review math, even with a contract attorney rate that has barely moved in a decade, no longer works with the client's budget.

This is the operating reality of modern discovery, and it is why AI has become an architectural question rather than a procurement one. The firms doing this work well are not the ones with the largest technology budgets. They are the ones with the clearest protocols.

A working litigator now has to decide not whether to use AI in discovery but which kind, at which stage, and under what protocol. Predictive coding ranks documents for relevance using attorney-trained models. Generative AI reads complex documents and produces relevance determinations and drafts follow-ups with cited reasoning attached to every output. Agentic AI executes multi-step review workflows under attorney supervision. Each carries a different defensibility profile, a different cost curve, and a different place in the discovery lifecycle. By the end of this article, a litigation team will know where each of the three modes earns its place in a matter, what protocol elements survive a defensibility challenge, and what to look for in a platform that has to hold up in production.

Where AI Fits in the Discovery Lifecycle

Discovery has stopped being a single-tool problem. For most of the past 15 years, "AI in discovery" was synonymous with technology-assisted review at the relevance stage. That framing is now too narrow. AI capabilities operate across the full Electronic Discovery Reference Model (EDRM), and treating them as a single capability obscures both where the value is and where the risk concentrates.

Developed in 2005 by George Socha and Tom Gelbmann, the EDRM remains the most useful map of the discovery process and, by extension, the most useful map of where AI now sits within it. The earliest stages, Information Governance and Identification, are where early case assessment models do their highest-leverage work. By analyzing collected ESI, these models surface likely custodians, communication patterns, and dispositive documents before review formally begins. Preservation and Collection involve AI more narrowly, mostly through deduplication across collection sources and gap analysis on custodian coverage.

Processing is the part of the lifecycle where the older generation of analytics still does the heaviest work. Email threading, near-duplicate detection, and language identification routinely reduce a raw collection by a meaningful fraction before a reviewer sees a single document, with deduplication alone often eliminating large portions of the corpus. These techniques predate the current AI era and remain table stakes for any modern review.

Review is the stage where the most consequential shift of the past three years has taken place. Predictive coding and generative AI review now coexist as the two primary approaches to relevance determination, with different strengths and different defensibility records. The next section walks through the tradeoff in detail.

By the time a matter reaches Analysis and Production, AI's contribution shifts to the deliverables. The model drafts privilege log entries, identifies privilege indicators across large corpora, and generates the production-ready metadata that historically required dedicated paralegal time. Presentation, the final stage, is where generative tools now support deposition preparation, witness kit assembly, and exhibit selection from the produced record.

The human role remains decisive at every stage. The model proposes; the lawyer disposes. What AI changes is the ratio between routine determination and considered judgment, freeing attorneys to spend more time on the latter.

How to Use AI for Relevance Review

The two dominant AI approaches to relevance review work differently, defend differently in court, and suit different case profiles. Choosing between them, or knowing when to combine them, is now one of the more consequential protocol decisions a litigation team makes.

Technology-assisted review (TAR) in its established forms relies on attorney-trained classification models. In TAR 1.0, a senior attorney codes a seed set of documents, the model learns from those determinations, and the model then ranks the remaining corpus by predicted relevance. TAR 2.0, also called continuous active learning, refines this by feeding new attorney determinations back into the model on a rolling basis, so the ranking improves as the review progresses. Validation typically involves a control set with confidence intervals on recall and precision, sampled and measured against attorney-coded ground truth.
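The continuous active learning loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any production TAR tool is built: the token-count scorer stands in for a real statistical classifier, and `train`, `score`, `continuous_active_learning`, the `attorney_codes` callback, and the six-document corpus are all hypothetical.

```python
from collections import Counter

def tokenize(text):
    return set(text.lower().split())

def train(labeled_texts):
    """Toy model: +1 per token seen in a relevant document, -1 otherwise."""
    weights = Counter()
    for text, is_relevant in labeled_texts:
        for token in tokenize(text):
            weights[token] += 1 if is_relevant else -1
    return weights

def score(weights, text):
    # Unknown tokens contribute 0 (Counter returns 0 for missing keys).
    return sum(weights[token] for token in tokenize(text))

def continuous_active_learning(corpus, seed_labels, attorney_codes,
                               batch_size=2, rounds=3):
    """corpus: doc_id -> text; seed_labels: doc_id -> bool (attorney-coded seed set)."""
    labeled = dict(seed_labels)
    for _ in range(rounds):
        # Retrain on every attorney determination made so far.
        weights = train([(corpus[d], rel) for d, rel in labeled.items()])
        unreviewed = [d for d in corpus if d not in labeled]
        if not unreviewed:
            break
        # Rank what is left and send the top batch to the attorney.
        ranked = sorted(unreviewed, key=lambda d: score(weights, corpus[d]),
                        reverse=True)
        for doc_id in ranked[:batch_size]:
            labeled[doc_id] = attorney_codes(doc_id)
    return labeled

# Hypothetical corpus and attorney ground truth, for illustration only.
corpus = {
    "d1": "merger pricing agreement draft",
    "d2": "lunch menu friday",
    "d3": "pricing agreement signed",
    "d4": "office lunch order",
    "d5": "merger pricing call notes",
    "d6": "parking garage notice",
}
truly_relevant = {"d1", "d3", "d5"}
coded = continuous_active_learning(
    corpus, {"d1": True, "d2": False},
    attorney_codes=lambda d: d in truly_relevant, rounds=1)
```

In this small example, one round of feedback pushes the two remaining relevant documents to the top of the queue, so the attorney reaches them before the irrelevant ones. That ordering effect is the point of the loop; the scoring method is interchangeable.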

The defensibility record here is well-established. Da Silva Moore v. Publicis Groupe, decided by Magistrate Judge Andrew Peck in February 2012, was the first federal opinion to approve TAR for use in document review. Rio Tinto PLC v. Vale S.A., also decided by Judge Peck in March 2015, reflected that the use of TAR by a producing party had become well-settled in federal practice, while leaving open the question of how much methodological transparency parties must offer one another. Hyles v. New York City, decided by Judge Peck in August 2016, established that a responding party cannot be compelled to use TAR over its objection, even where TAR would be more efficient than the methodology the responding party preferred. More than a decade of case law now gives TAR a settled posture in federal court.

Generative AI review is structurally different. Instead of ranking, the model reads each document and produces a relevance determination accompanied by reasoning. The reviewer sees not just a score but an explanation, with citations to the specific passages that drove the conclusion. Legal-grade implementations of this approach, like Harvey, ground every output in the underlying source document so the reviewer can verify the determination rather than accept it on faith. This shifts the human review task from coding each document unaided to validating a proposed answer. The result is a review operation that is generally faster on complex, fact-intensive cases, where the explanatory output saves time the reviewer would otherwise spend reconstructing the determination from scratch.

The tradeoffs sort along three dimensions. On defensibility, TAR has the longer track record; generative review is newer, and the case law is still developing, though courts have been receptive where the protocol is rigorous and the validation is documented. On explainability, generative review has a structural advantage. A model that produces written reasoning for each determination creates a transparent record that a reviewer, an opposing party, or a court can interrogate. On scale and complexity, generative review handles fact-intensive and multi-issue relevance criteria that classification models often miss, while predictive coding remains exceptionally efficient for high-volume reviews against narrow criteria.

A working decision rule is starting to settle in practice. For high-volume reviews with clear, narrow relevance criteria, predictive coding remains the most efficient option. For complex, fact-intensive cases where reasoning matters and where the cost of a missed document is high, generative review delivers more value. Many sophisticated protocols now combine the two, using predictive coding to triage at the corpus level and generative review for the documents the model ranks as most likely relevant or most ambiguous.
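As a rough sketch of the combined approach, the routing step might look like the following. The thresholds, the queue names, and the `route_document` function are assumptions made for illustration; a real protocol would set and document its own cutoffs.

```python
def route_document(score, likely_relevant=0.80, ambiguous_band=(0.40, 0.60)):
    """Route one document using its predictive-coding relevance score (0 to 1).

    Thresholds are hypothetical placeholders, not recommended values.
    """
    low, high = ambiguous_band
    if score >= likely_relevant:
        return "generative_review:likely_relevant"
    if low <= score <= high:
        return "generative_review:ambiguous"
    # Everything else keeps the classifier's call, subject to QC sampling.
    return "predictive_coding_only"

queues = {s: route_document(s) for s in (0.95, 0.50, 0.70, 0.05)}
```

The design choice worth noting is that the ambiguous band gets the same reasoning-heavy treatment as the top of the ranking, because that is where a classifier's score says the least.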

Five Elements of a Defensible AI Discovery Protocol

Defensibility is a documentation problem, not a technology problem. Courts evaluate process, not algorithms. A litigation team that loses a discovery dispute over its use of AI usually loses it in the protocol, the validation record, or the meet-and-confer, not in the model itself.

The Federal Rules of Civil Procedure set the frame. Rule 26(b)(1) anchors discovery to proportionality, which is the operative standard a court applies when a party challenges the scope or method of review. Rule 26(f) governs the meet-and-confer and the discovery plan that comes out of it. Rule 26(b)(5) governs the privilege assertions that any review protocol must produce. Federal Rule of Evidence 502 controls the consequences of inadvertent privilege waiver. These four provisions are the framework against which any AI protocol should be tested.

A defensible protocol contains five elements. Most disputes originate in the absence of one of them.

1. Written ESI protocol

The starting point is a written ESI protocol that discloses the AI methodology in operational terms. Not the model architecture, but the workflow. How the corpus was collected. What tools are being used at each review stage. Who is making final relevance determinations. How the team will validate the results. The level of detail should be sufficient that opposing counsel can evaluate the process and a court can review it.

2. Validation methodology

The protocol must produce the metrics courts now expect to see. Recall, precision, and elusion rates measured against a statistically valid sample. The sample size and confidence interval should be documented in advance. The validation should occur on a control set the model has not seen during training. The results should be preserved in the work product file.
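To make the three metrics concrete, here is a minimal sketch that computes recall, precision, and elusion from an attorney-coded control set, with a Wilson score interval on recall. The function names and the 500-document control set are hypothetical, and the Wilson interval is one common choice of interval, not a method any court or rule prescribes.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% confidence."""
    if n <= 0:
        raise ValueError("sample must contain at least one document")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

def validation_metrics(control_set):
    """control_set: iterable of (attorney_says_relevant, model_says_relevant) pairs."""
    tp = fp = fn = tn = 0
    for attorney, model in control_set:
        if attorney and model:
            tp += 1
        elif model:
            fp += 1
        elif attorney:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else None

    return {
        "recall": ratio(tp, tp + fn),        # share of relevant documents the model found
        "precision": ratio(tp, tp + fp),     # share of model "relevant" calls that were right
        "elusion": ratio(fn, fn + tn),       # share of the discard pile that was relevant
        "recall_interval": wilson_interval(tp, tp + fn) if tp + fn else None,
    }

# Hypothetical control set: 80 true positives, 20 false negatives,
# 10 false positives, 390 true negatives.
control = ([(True, True)] * 80 + [(True, False)] * 20 +
           [(False, True)] * 10 + [(False, False)] * 390)
metrics = validation_metrics(control)
```

On this hypothetical control set the point estimate for recall is 80%, but the interval around it is what shows how much a 100-document relevant sample can actually support, which is why the sample size belongs in the protocol before the review starts.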

3. Sampling-based quality control

This is distinct from end-state validation. It means pulling random samples of model-classified documents at intervals during the review itself and having attorneys verify the determinations, with the results logged. When a court asks how the team knew the model was working, this is the answer.
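One way to picture an interim quality-control round is the sketch below. The `qc_round` function, the document identifiers, and the callback standing in for attorney review are hypothetical; the fixed random seed is an assumption about how a team might make each draw reproducible for the record.

```python
import random

def qc_round(model_calls, attorney_review, sample_size, seed):
    """model_calls: doc_id -> model's relevance call; attorney_review: doc_id -> attorney's call."""
    rng = random.Random(seed)  # fixed seed so the same draw can be reproduced later
    population = sorted(model_calls)
    sample = rng.sample(population, min(sample_size, len(population)))
    overturned = sorted(d for d in sample if attorney_review(d) != model_calls[d])
    return {
        "seed": seed,
        "sampled": sample,
        "overturned": overturned,
        "overturn_rate": len(overturned) / len(sample) if sample else None,
    }

# Hypothetical data: 200 model calls, with the attorney disagreeing on one document.
model_calls = {f"DOC-{i:04d}": i % 2 == 0 for i in range(200)}

def attorney_review(doc_id):
    call = model_calls[doc_id]
    return (not call) if doc_id == "DOC-0007" else call

full_check = qc_round(model_calls, attorney_review, sample_size=200, seed=11)
spot_check = qc_round(model_calls, attorney_review, sample_size=25, seed=11)
```

Each returned record is the kind of entry that would be logged at every interval, so the overturn rate can be shown over the life of the review and not only at the end.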

4. Audit trail

The audit trail captures human decisions, model versions, and any methodology changes mid-review. If a senior attorney recalibrates the seed set, that gets logged. If the team switches from one model to another, that gets logged. If a category of documents is excluded from automated review and routed to manual, that gets logged. The audit trail is the evidentiary record of the process.
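A minimal sketch of what such a log could look like as a data structure follows. The field names, event types, and the `AuditTrail` class are illustrative assumptions and do not describe any particular platform's schema.

```python
import json
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class AuditEvent:
    event_type: str      # e.g. "seed_set_recalibrated", "model_switched"
    actor: str           # the person who made or approved the change
    model_version: str   # model in use when the event occurred
    detail: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

class AuditTrail:
    """Append-only in-memory log; events are never edited or removed."""

    def __init__(self):
        self._events = []

    def log(self, event_type, actor, model_version, detail):
        event = AuditEvent(event_type, actor, model_version, detail)
        self._events.append(event)
        return event

    def to_jsonl(self):
        # One JSON object per line, in the order events occurred.
        return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in self._events)

trail = AuditTrail()
trail.log("seed_set_recalibrated", "senior.attorney", "model-v1", "recoded 40 seed documents")
trail.log("model_switched", "ediscovery.lead", "model-v2", "moved from model-v1 to model-v2")
trail.log("category_routed_to_manual", "ediscovery.lead", "model-v2", "handwritten notes excluded")
```

The three logged events mirror the three examples in the paragraph above; the frozen dataclass is a small design choice that keeps an entry from being altered after it is written.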

5. Meet-and-confer disclosure

Disclosure should be calibrated to the case. The Sedona Conference's cooperation principles, set out in its 2008 Cooperation Proclamation, encourage transparency. Most experienced practitioners now disclose the use of TAR or generative review at the Rule 26(f) conference. Disclosure of generative AI use specifically remains less settled. Some courts and standing orders now require it; others do not address it. The working practice in most sophisticated litigation is to disclose the use of AI, to describe the workflow at a level sufficient for the other side to evaluate it, and to reserve the underlying validation data for production if challenged.

Privilege Review is Where AI is Changing the Economics Most Dramatically

Privilege review is the highest-leverage application of AI in discovery. It is consistently characterized in industry analysis as the most time-consuming and expensive phase of document review in complex litigation. The cost of a privilege miss is also asymmetric. A relevance error produces a marginal inefficiency. A privilege error produces an inadvertent waiver that can put attorney-client communications into the hands of an adversary.

This is also where generative AI has shifted the curve most visibly in the past 18 months. The reason is structural: privilege determination is a reasoning task, not a classification task. A document is privileged because of the participants, the content, and the legal context. Classification models can flag candidates by surface features but cannot evaluate the underlying claim. Generative models can read the document, identify the privilege indicators, and articulate the basis for the determination in language a reviewing attorney recognizes as legal analysis.

The operational pattern that has settled in sophisticated practice is human-in-the-loop, with the human at the decision point and the AI at the work point. The model surfaces privilege candidates by reading the corpus and identifying documents that contain attorney involvement, legal advice content, or work product characteristics. For each candidate, the model produces a draft determination with cited reasoning and a draft privilege log entry meeting the descriptive standard required by Rule 26(b)(5). The reviewing attorney confirms, modifies, or rejects each determination, and the audit trail captures both the model's proposal and the attorney's decision.

The risk side requires honest treatment. AI privilege review must be validated against attorney determinations on a representative sample, and the validation rate should be disclosed in the protocol. Incremental gains in accuracy matter at scale, because the difference between a model that misses one privileged document in 20 and one that misses one in 100 becomes large when the production runs to hundreds of thousands of documents. The validation should also test for the failure modes that matter most. False negatives, where a privileged document is classified as non-privileged and is at risk of inadvertent production, are more consequential than false positives, where a non-privileged document is over-designated and held back.
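The arithmetic behind the one-in-20 versus one-in-100 comparison is easy to work through. The count of 40,000 privileged documents is a hypothetical assumption (5% of an 800,000-document corpus), chosen only to make the scale visible.

```python
def expected_misses(privileged_docs, miss_one_in):
    """Expected number of privileged documents wrongly cleared for production."""
    return privileged_docs / miss_one_in

privileged_docs = 40_000  # hypothetical: 5% of an 800,000-document corpus

weaker = expected_misses(privileged_docs, 20)     # misses one in 20
stronger = expected_misses(privileged_docs, 100)  # misses one in 100

print(f"One-in-20 miss rate:  {weaker:,.0f} privileged documents at risk")
print(f"One-in-100 miss rate: {stronger:,.0f} privileged documents at risk")
print(f"Difference:           {weaker - stronger:,.0f} documents")
```

Under that assumption the gap is 1,600 documents, each one a potential inadvertent production.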

Federal Rule of Evidence 502 provides the safety net for inadvertent waiver, particularly through 502(b) for unintentional disclosures and 502(d) for court-ordered non-waiver protections. A 502(d) order is now standard practice in any matter using AI for privilege review. It does not substitute for a defensible process, but it provides the doctrinal backstop when the inevitable edge cases surface.

The privilege log generation use case deserves separate attention. Producing a privilege log that meets the descriptive standard, with enough specificity to permit assessment of the claim without revealing privileged content, has historically been one of the most labor-intensive and error-prone deliverables in discovery. Generative AI can produce log entries that meet this standard at scale, with the same citation grounding and human review that apply to the underlying privilege determinations. For a matter with tens of thousands of privileged documents, this changes the deliverable from a multi-week paralegal project into a structured review of model-generated entries, with the time savings flowing directly to the partner-level review that should have been the focus all along.

Where the AI Value Curve is Steepest in Time-Compressed Reviews

AI's value in discovery scales with time pressure. The steeper the deadline, the higher the return, because the binding constraint shifts from cost to capacity. In a long-running matter, a team can absorb a slower review pace by adding contract attorneys. In a time-compressed matter, no amount of staffing can collapse the timeline below what linear review allows. AI is what changes the math.

The canonical scenarios are well-defined. A Hart-Scott-Rodino Second Request triggers a substantial compliance response that routinely involves the production of millions of documents, with the parties certifying compliance only after an intensive review effort that often spans several months. The 30-day window the reviewing agency has to act runs from the date of certification, which makes the speed of reaching certification a competitive variable in transaction timing. A regulatory investigation under SEC, DOJ, or FTC subpoena often imposes production windows measured in weeks, with the agency expecting privilege logs and rolling productions on a schedule that does not accommodate traditional linear review. An internal investigation with a board reporting deadline, particularly one tied to an earnings release or a public disclosure obligation, runs on the same kind of clock.

The staffing model that emerged over the past two decades, in which a Second Request triggers the assembly of a 50-attorney contract review team within 72 hours, is starting to give way. The replacement is a smaller team of associates and senior reviewers working alongside generative review and continuous learning models. The model handles first-pass relevance against the corpus. The senior team reviews the model's high-confidence relevant set, plus a stratified sample of the rest to validate the recall. The work product, including the validation record, is preserved as part of the substantial compliance certification.
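The recall check in that workflow can be sketched as a stratified estimate: sample each band of the unreviewed remainder, scale the relevant documents found in each sample up to the size of its band, and compare the total against what the senior team already found. The strata, counts, and the `estimate_recall` function are hypothetical, and the sketch returns a point estimate only, with no confidence interval.

```python
def estimate_recall(found_relevant, strata):
    """Estimate recall from stratified samples of the unreviewed remainder.

    strata: name -> (stratum_size, sampled, relevant_in_sample)
    """
    estimated_missed = 0.0
    for size, sampled, relevant_in_sample in strata.values():
        if sampled == 0:
            raise ValueError("every stratum needs at least one sampled document")
        # Scale the sample's relevance rate up to the whole stratum.
        estimated_missed += size * relevant_in_sample / sampled
    total = found_relevant + estimated_missed
    return found_relevant / total if total else None

# Hypothetical matter: 9,000 relevant documents confirmed in the high-confidence set.
strata = {
    "mid_score": (10_000, 500, 25),   # 25 of 500 sampled were relevant
    "low_score": (100_000, 500, 5),   # 5 of 500 sampled were relevant
}
recall = estimate_recall(9_000, strata)
```

Stratifying by score band puts more of the sampling effort where missed documents are most likely to sit, which a single random sample of the remainder would not do.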

The time savings show up most visibly in the first 48 hours. Early case assessment AI, run against the collected ESI before formal review begins, surfaces dispositive documents and key custodian communications in a window that historically required weeks of attorney work. For a Second Request, this changes the team's posture at the agency meet-and-confer because the negotiating attorneys know what is in the corpus before the formal review is complete. For internal investigations, the same dynamic gives the board or special committee a preliminary factual map within days, not weeks, and lets counsel build the witness interview list from the document record rather than from organizational charts.

The pattern is showing up in firm reporting. At Lynn Pinker Hurst and Schwegmann, a Chambers Band 1 litigation boutique handling high-stakes disputes in financial services, healthcare, and insurance, litigators are using Harvey for early case assessment across hundreds of files. They now save more than eight hours per lawyer per week. The firm has reported winning new business because the platform allows it to respond to urgent client requests in under 48 hours, a turnaround time that historically required either preexisting familiarity with the matter or weekend staffing.

AI's contribution to discovery is not uniform across the docket. It concentrates in the matters where the deadline is the binding constraint, and within those, in the early stages where the strategic value of fast factual orientation is highest. The difference between a team with mature protocols and a team without them shows up first in matters like these.

What Separates Legal-Grade AI From a General-Purpose AI Model

The selection criteria for legal discovery AI are different from general-purpose AI evaluation. Defensibility, citation grounding, and security determine fit more than benchmark scores. A model that performs well on public reasoning benchmarks may still be the wrong choice for a matter where every output has to point back to a verifiable source and every workflow has to survive a deposition.

Five criteria sort the field.

Domain-specific training

The first question is whether the model has been trained on legal documents and legal tasks, or whether it is a general-purpose model with a legal interface placed on top. The difference shows up in the work. A model trained on legal corpora recognizes the structural conventions of contracts, pleadings, and correspondence. It understands the difference between a hold notice and a litigation hold instruction. It produces output that reads like legal analysis because it has learned from legal analysis. Domain-specific platforms are designed around these requirements rather than retrofitted to meet them. Harvey, for instance, is purpose-built for legal work and used by more than 60% of the AmLaw 100.

Citation grounding

Every output the platform produces should point back to the source document the reviewing attorney can verify. This is the single most important defensibility feature in a generative AI tool used for legal work. A relevance determination without a citation is an assertion the reviewer has to take on faith. A relevance determination with a citation is a proposed conclusion the reviewer can validate in seconds. The same principle applies to privilege determinations, document summaries, and any other generative output that will appear in the work product file.

Validation tooling

The platform must produce the metrics courts expect to see in a defensibility brief. Recall, precision, and elusion rates measured against a control set, with documented sample sizes and confidence intervals. The validation tooling should be native to the platform, not an external workflow stitched together with spreadsheets. When opposing counsel challenges the methodology at a hearing, the team should be able to produce the validation record in its complete form without reconstructing it from logs.

Security architecture

Discovery work involves the most sensitive material a client possesses. The security requirements are non-negotiable. Matter-level data isolation, so that documents from one client cannot influence work product on another. No model training on client data without explicit authorization, and ideally not at all. SOC 2 Type II certification at minimum, with ISO 27001 and equivalent regional standards where applicable. Encryption in transit and at rest. The security posture should be documented in a form that satisfies a Fortune 500 information security review, because it will face one.

Workflow integration

The platform has to fit the tools the team already uses. Document management platforms like iManage and NetDocuments hold the work product. Microsoft 365 holds the communications and the drafting environment. Review platforms hold the matter-level corpus. AI that requires a separate workflow, with documents moved out of these tools and back again, creates friction that erodes adoption and introduces security exposure.

The Agentic Shift and What is Coming Next in Legal Discovery

Discovery AI is moving from single-task tools to platforms that execute multi-step workflows under attorney supervision. The shorthand for this shift is agentic AI. In a legal context, an agentic platform can plan a sequence of actions, execute them, validate the results, and adjust its approach based on what it finds, all within a defined scope and with human review at the decision points that matter. A predictive coding model ranks documents and a generative review model reads them; an agentic platform takes a more abstract instruction, such as "prepare an early case assessment for this matter," and executes the underlying steps without requiring an attorney to direct each one.

A working example illustrates what this looks like in practice. An associate at a litigation team receives a new complaint at 9 a.m. The matter involves a securities class action with 14 named defendants and a multi-year class period. The associate hands the complaint to an agentic platform configured for early case assessment. The platform identifies likely custodians, runs preliminary collection against the firm's connected data sources, applies de-duplication and threading, and produces a preliminary factual map with citations to the underlying documents. The associate reviews each step, modifies the custodian list, narrows the date range, and approves the next phase. By 2 p.m., the partner has a working briefing that historically would have taken a week to assemble.

This pattern is no longer hypothetical. Harvey Agents execute legal work across a Plan, Research, Work, Deliver, and Review sequence, with the attorney retaining final judgment at each decision point. The platform handles time-intensive workstreams across litigation, investigations, and document review, drawing on connected sources and producing formatted deliverables with granular citations. Major AmLaw firms working with Harvey, including Reed Smith and Vinson and Elkins, are among those building toward agentic workflow adoption.

The governance question that follows is the right one to ask. Agentic platforms require more rigorous audit trails, not less rigorous ones, because the chain of decisions is longer. Every action the agent takes should be logged with the context that produced it, the alternatives the agent considered, and the human decision that approved or modified the next step. The validation methodology must also account for the compounded risk of errors propagating across multi-step workflows.
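A short calculation shows why compounding matters. It assumes, for illustration only, that each step succeeds independently and that no attorney checkpoint corrects errors between steps; the step names and the 98% figures are hypothetical.

```python
import math

def chain_accuracy(step_accuracies):
    """Probability that every step is error-free, assuming independent steps."""
    return math.prod(step_accuracies)

# Hypothetical per-step probabilities that the step's output is error-free.
steps = {
    "identify_custodians": 0.98,
    "preliminary_collection": 0.98,
    "dedupe_and_thread": 0.98,
    "draft_factual_map": 0.98,
}
end_to_end = chain_accuracy(steps.values())
print(f"End-to-end: {end_to_end:.1%} across {len(steps)} steps")
```

Four steps at 98% each leave roughly 92% end to end under these assumptions, which is the case for placing attorney review between steps instead of only at the end.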

Agentic platforms deliver value in repeatable workflows with well-defined inputs and outputs: early case assessment, privilege log generation, production quality control, deposition preparation. They do not replace the strategic judgments that define a matter, the kind of decisions that determine whether to file a motion to dismiss or which witness to depose first. AI's role in discovery is to remove the friction surrounding those judgments so the lawyer can spend more time on them. The teams thinking carefully about the protocols for this kind of work now are the ones who will define the operational practice of the next decade.

A Practical Starting Point

The teams that successfully adopt AI discovery share a pattern. They start with a single, well-scoped use case. They build the protocol around that use case until it is defensible and repeatable. Then they extend the same protocol to adjacent use cases, with each extension informed by what they learned in the first matter.

A first project should have a defined profile. A known matter type, such as a regulatory response, a contract dispute with a contained corpus, or an internal investigation with a clear scope. A bounded dataset, ideally under a million documents for the first deployment. A success metric the team can measure against, whether that is review hours per gigabyte, total cost per matter, time from collection to first factual map, or completeness of the validation record. A timeline that allows for iteration, which generally means a matter where the production deadline is firm but not punishing.

The team composition matters as much as the matter selection. A partner sponsor who can defend the protocol choice if it is challenged later. An eDiscovery lead who owns the operational execution. A data scientist or vendor counterpart who can answer the technical questions when they arise. An IT and security representative who has signed off on the data flows in advance. These four roles should be in the kickoff meeting, not added later when something goes wrong.

The most common adoption mistake is trying to deploy AI across all matters simultaneously. This pattern fails for predictable reasons. The protocols are not yet stable. The team has not built the institutional muscle to recognize the failure modes. The validation record is thin because no single matter has been carried through to a defensible conclusion. The right pattern is the opposite. One matter at a time, with each matter strengthening the protocol that the next matter inherits. AI in discovery is becoming infrastructure, not novelty. The protocols being written this year are the protocols the broader profession will be operating under in five years.

This is where Harvey fits in. Harvey is purpose-built for legal work, used by more than 142,000 legal professionals at 1,500+ organizations in 60+ countries, including more than 60% of the AmLaw 100. Every output is grounded in source documents the reviewing attorney can verify. The platform integrates with the document management tools litigation teams already use, supports matter-level data isolation, and produces the validation records that defensibility requires. The teams that will define the operational practice of the next decade are building their protocols on infrastructure designed for the work, not retrofitted to accommodate it. Request a demo to see how Harvey can support your team's discovery workflow on your next matter.
