LLMs in Law: Harvey at BigLaw, Spellbook for Solos, and What the Benchmarks Show

AI Innovation · Published May 02, 2026 · legal tech · contract ai · biglaw · harvey ai · spellbook

In February 2023, Allen & Overy — one of London's five Magic Circle law firms — deployed Harvey, a GPT-4-based legal assistant, across all 43 of its global offices and made it available to roughly 3,500 lawyers. The announcement was treated as a watershed moment: a top-ten global law firm betting its institutional credibility on a generative AI system that had existed for less than a year. By May 2024, A&O had merged with Shearman & Sterling to form A&O Shearman, with Harvey embedded in the combined firm's workflow.

The reception has been decidedly uneven. Benchmarks show genuine capability — GPT-4 passed the Uniform Bar Exam at approximately the 90th percentile in OpenAI's March 2023 technical report — but courts have sanctioned lawyers for hallucinated citations, researchers flag material accuracy gaps on specialized contract clauses, and bar associations are still wrestling with what competent AI use actually means. Here is what the evidence says.

Harvey's Rise: From Seed Round to BigLaw's Default Stack

Harvey AI was co-founded in 2022 by Winston Weinberg, a former litigation associate at O'Melveny & Myers, and Gabriel Pereyra, a former DeepMind researcher who had worked on language model alignment. The company's core argument was that general-purpose LLMs needed domain-specific fine-tuning and a purpose-built retrieval layer to be trustworthy in legal practice, and that a ChatGPT wrapper with a law-firm-flavored system prompt would not be enough.

OpenAI backed the premise early. The OpenAI Startup Fund led Harvey's $5 million seed round in November 2022 and participated in its $21 million Series A in April 2023, and OpenAI granted Harvey access to fine-tune GPT-4-class models on curated legal corpora: jurisdiction-specific case law, regulatory filings, and transactional precedents that the base model had seen only partially. By December 2023, Harvey had closed an $80 million Series B at a valuation reported by multiple financial outlets to be approximately $700 million, co-led by Elad Gil and Kleiner Perkins with Sequoia Capital and the OpenAI Startup Fund participating. The pace signaled investor confidence that institutional law firms would sign durable contracts, not just run pilots.

Allen & Overy's February 2023 deployment was the marquee reference. The firm reported using Harvey for contract drafting assistance, due diligence summarization, regulatory mapping across jurisdictions, and first-pass document review. Nick Seddon, then A&O's chief information officer, publicly called it "the beginning of a profound change in the way legal services are delivered." PwC followed in March 2023, announcing a global partnership that gave its legal professionals access to Harvey. The move was notable because PwC Legal is a regulated legal services provider that competes directly with BigLaw on transactional work, not a captive internal counsel team.

What Harvey Actually Runs On

Conjecture, marked clearly: Harvey has not disclosed the specific model versions powering its production system. Based on the company's confirmed OpenAI partnership (OpenAI participated in the Series A and granted fine-tuning access), engineering job postings emphasizing retrieval-augmented generation infrastructure, and public statements from the founding team referencing GPT-4 fine-tuning on legal corpora, industry analysts broadly assume Harvey's core engine is a fine-tuned GPT-4 or GPT-4o derivative layered over a jurisdiction-specific retrieval corpus. The RAG layer is likely the primary commercial differentiator — base model hallucination rates on specific case law drop materially when the system can retrieve and cite real documents in context. Harvey has not confirmed the architecture as of early 2026.

The Solo Attorney's Answer: Spellbook

For attorneys outside the Am Law 200, Harvey's enterprise pricing is effectively inaccessible. Spellbook, founded by Scott Stevenson in Canada, has positioned itself as the practical alternative. The product integrates directly into Microsoft Word, the document environment where most solo and small-firm attorneys already work, as a sidebar add-in and offers contract drafting suggestions, clause redlining, and risk-flag identification without requiring workflow changes. Spellbook raised a $10.9 million funding round in 2023, a fraction of Harvey's war chest but significant for a product targeting a segment of the legal market that enterprise legal tech has historically underserved.

Where Harvey pitches institutional transformation at scale, Spellbook's marketing focuses on reviewing contracts in minutes rather than hours. Both value propositions are genuine; both carry the same caveat: output requires attorney verification before it touches a signed document or a filed pleading.

What the Contract-Review Benchmarks Actually Show

The most-cited early data point is a 2018 LawGeex study that evaluated an AI classifier — pre-transformer, not LLM-based — against 20 experienced lawyers on a set of non-disclosure agreements. The AI achieved 94 percent accuracy on identifying 30 pre-specified legal issues, versus an 85 percent human average, and completed the review in 26 seconds against a human average of 92 minutes. The result was widely and somewhat misleadingly reported as proof that AI outperforms lawyers. The study's scope was narrow: a single, highly standardized document type with pre-labeled, well-defined issue categories.

The more demanding test is the CUAD (Contract Understanding Atticus Dataset), released in 2021 by the Atticus Project. CUAD contains 510 commercial contracts with 13,101 manually labeled clause annotations across 41 legal categories: termination-for-convenience rights, governing law selections, indemnification caps, limitation-of-liability provisions, anti-assignment clauses, and most-favored-nation terms, among others. Models must extract the correct clause text or respond "not present" for each category. Early GPT-3-class models achieved F1 scores in the mid-40s on the hardest CUAD categories. Fine-tuned models specifically trained on CUAD reached the mid-80s on common clause types but fell to F1 scores in the 30–50 range on rarer, high-stakes provisions — exactly the clauses that carry outsized risk in litigation and negotiation.
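For readers unfamiliar with the metric, the sketch below shows how a per-category F1 score is derived from extraction counts. The counts are hypothetical, chosen only to land in the ranges described above; they are not CUAD results, and the dataset's own evaluation scripts may differ in detail.

```python
from dataclasses import dataclass


@dataclass
class CategoryCounts:
    """Extraction outcomes for one clause category (hypothetical numbers)."""
    true_positive: int   # clause present and correctly extracted
    false_positive: int  # clause extracted but wrong or not actually present
    false_negative: int  # clause present but missed


def f1_score(c: CategoryCounts) -> float:
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    predicted = c.true_positive + c.false_positive
    actual = c.true_positive + c.false_negative
    if predicted == 0 or actual == 0:
        return 0.0
    precision = c.true_positive / predicted
    recall = c.true_positive / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative only: a common clause type with many examples,
# and a rare clause type with very few.
common = CategoryCounts(true_positive=85, false_positive=15, false_negative=15)
rare = CategoryCounts(true_positive=4, false_positive=6, false_negative=6)

print(f"common clause F1: {f1_score(common):.2f}")  # 0.85
print(f"rare clause F1:   {f1_score(rare):.2f}")    # 0.40
```

With only ten true instances of the rare clause in this made-up example, each additional miss moves the score by several points, which is one reason low-frequency categories are hard to evaluate as well as hard to extract.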

GPT-4, evaluated informally by multiple research groups through 2023, showed meaningful improvement on CUAD's simpler categories. Stanford CodeX-affiliated researchers noted in 2023 workshop proceedings that the primary failure mode was not hallucinated facts but misclassification: the model would identify a clause as present but assign it to the wrong category, or extract the wrong passage as the operative clause. For a solo attorney using Spellbook or a BigLaw associate using Harvey, the operational implication is identical — output requires verification, not blind acceptance.

The Bar Exam Is Not the Hard Test

OpenAI's GPT-4 technical report (March 2023) reported a score of approximately 298 out of 400 on the Uniform Bar Exam, placing the model at roughly the 90th percentile among human test-takers. That result is genuine and impressive. It is also somewhat misleading as a capability proxy: the bar exam tests rule recall, issue spotting, and structured argument construction — tasks heavily represented in GPT-4's training distribution and well-suited to autoregressive text generation. It does not test precise clause-level extraction, multi-paragraph cross-reference reasoning, or the kind of risk judgment that fills a transactional attorney's actual workday.

The Malpractice Flashpoint: Mata v. Avianca

No single event did more to crystallize the legal AI liability debate than the sanctions order in Mata v. Avianca, Inc., 22-cv-1461 (S.D.N.Y.). In June 2023, Judge Kevin Castel sanctioned Steven A. Schwartz and Peter LoDuca of Levidow, Levidow & Oberman after the attorneys submitted a brief citing cases that did not exist — cases entirely fabricated by ChatGPT. The attorneys had asked ChatGPT to identify supporting precedents, received plausible-looking citations complete with case names, volume numbers, and reporters, and filed the brief without verifying the citations against Westlaw, LexisNexis, or any other legal database.

Judge Castel's opinion did not rule AI use in legal practice impermissible; it held that the attorneys had violated their obligations under Rule 11 of the Federal Rules of Civil Procedure by failing to verify their work product and by continuing to stand behind the fabricated citations after the court questioned them. The sanctions totaled $5,000, modest in dollar terms but substantial professionally. The decision became mandatory reading at continuing legal education programs across the country within months of the ruling. It was cited in subsequent bar ethics proceedings as the paradigm case for applying existing competence standards to generative AI, and it made clear that novelty of the tool provides no safe harbor from Rule 11 or from the candor and misconduct provisions of Model Rules 3.3 and 8.4.
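The verification step the attorneys skipped can be sketched in a few lines. This is a minimal illustration, not a production tool: the regular expression covers only a handful of federal reporters, the citations are invented for the example, and the `verified` set is a hypothetical stand-in for a lookup against Westlaw, LexisNexis, or another authoritative database.

```python
import re

# Simplified pattern: volume, reporter abbreviation, first page (e.g. "123 F.3d 456").
# Real citation formats are far more varied; this covers a few federal reporters only.
CITATION_RE = re.compile(
    r"\b(\d{1,4})\s+(U\.S\.|F\.(?:2d|3d|4th)|F\. Supp\.(?: [23]d)?)\s+(\d{1,5})\b"
)


def find_unverified(brief_text: str, verified: set[str]) -> list[str]:
    """Return citations found in the text that are absent from the verified set."""
    unverified: list[str] = []
    for match in CITATION_RE.finditer(brief_text):
        citation = " ".join(match.groups())
        if citation not in verified and citation not in unverified:
            unverified.append(citation)
    return unverified


# Invented citations for illustration; neither refers to a real case.
brief = (
    "See Doe v. Example Corp., 123 F.3d 456 (2d Cir. 1997); "
    "Roe v. Sample Inc., 999 F.3d 111 (9th Cir. 2021)."
)
# Hypothetical result of a database lookup: only the first citation was confirmed.
verified = {"123 F.3d 456"}

print(find_unverified(brief, verified))  # ['999 F.3d 111']
```

A check like this only flags citations for a human to look up. It cannot confirm that a real case actually supports the proposition it is cited for, which remains the attorney's job.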

The Bar Responds: ABA Formal Opinion 512

The American Bar Association addressed the competence question formally in ABA Formal Opinion 512, issued in July 2024. The opinion confirmed that Model Rule 1.1 (competence) extends to understanding the limitations of AI tools a lawyer employs; that Rules 5.1 and 5.3 (supervision of lawyers and nonlawyer assistance) apply to AI-generated work product the same way they apply to output from associates and paralegals; and that disclosure of AI use may be required by applicable court rules or client agreements. The opinion, which is advisory rather than binding, did not prohibit AI use. It set out a verification standard that much existing legal AI marketing undercuts by emphasizing speed and implicitly discouraging post-generation review.

The Liability Gap

The structural tension in legal AI is that the incentive structure and the liability structure point in opposite directions. Law firms deploy AI to reduce the associate hours a matter requires, or to maintain margins while offering modest billing relief to clients who demand it. But every error that reaches a filed document or a signed agreement is a potential malpractice claim. Harvey's enterprise agreements reportedly include limitation-of-liability language capping Harvey's exposure at subscription fees paid, a standard SaaS term that leaves malpractice exposure squarely with the firm. Solo practitioners subscribing to Spellbook face the same exposure with no institutional backstop and typically lower malpractice coverage limits.

Estimate, marked clearly: Harvey's December 2023 Series B was reported at approximately $700 million valuation. At typical high-growth enterprise SaaS revenue multiples of 20–30× ARR, implied ARR at that point was in the $23–35 million range. Harvey has not disclosed revenue figures. This estimate is a rough order-of-magnitude inference from public valuation data, not a reported figure, and should be treated accordingly.
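The arithmetic behind that range is simple enough to reproduce. The sketch below divides the reported valuation by the assumed multiples; the function name and the 20–30× band are illustrative assumptions carried over from the estimate above, not figures from Harvey.

```python
def implied_arr_range(
    valuation_usd: float, low_multiple: float, high_multiple: float
) -> tuple[float, float]:
    """Return (low, high) implied ARR for a valuation and a band of revenue multiples.

    A higher multiple implies lower revenue for the same valuation,
    so the high multiple produces the low end of the range.
    """
    if low_multiple <= 0 or high_multiple < low_multiple:
        raise ValueError("multiples must be positive and ordered low <= high")
    return valuation_usd / high_multiple, valuation_usd / low_multiple


# Assumed inputs: ~$700M reported valuation, 20-30x ARR multiple band.
low, high = implied_arr_range(700e6, low_multiple=20, high_multiple=30)
print(f"${low / 1e6:.1f}M to ${high / 1e6:.1f}M")  # $23.3M to $35.0M
```

The output is only as good as the multiple assumption: widening the band to 15–40× would stretch the implied range to roughly $17.5 million to $46.7 million.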

Where This Is Headed

Competitive pressure from incumbents has compressed the timeline for feature parity across the market. Those incumbents include Thomson Reuters (which acquired Casetext for $650 million in August 2023 and launched Westlaw AI on the back of CoCounsel's technology), LexisNexis (Lexis+ AI), and contract-intelligence platforms like Luminance and Kira. What Harvey's early BigLaw deployments demonstrated is that institutional clients will tolerate a meaningful error rate if the productivity gain is large enough. What Mata v. Avianca demonstrated is that Article III courts will not extend the same tolerance.

The practical resolution is already taking shape in bar guidance and federal court standing orders: AI output touching a filed document must be reviewed and attested to by a licensed attorney. That requirement does not eliminate the value proposition of Harvey or Spellbook. It redirects it. The tools that survive the next wave of regulatory tightening are the ones that make attorney verification fast and auditable — not the ones that market verification as a step the technology has made unnecessary.

Frequently asked

What is Harvey AI and which major law firms have deployed it?

Harvey AI is a legal AI platform co-founded in 2022 by Winston Weinberg and Gabriel Pereyra. It is widely assumed, though not confirmed by the company, to run on fine-tuned GPT-4-class models with a legal-corpus retrieval layer. Allen & Overy (now A&O Shearman) deployed it firm-wide in February 2023, and PwC announced a global partnership in March 2023. Harvey raised a $21 million Series A in April 2023 with OpenAI participating and an $80 million Series B in December 2023 at a reported ~$700 million valuation.

What happened in Mata v. Avianca, and why does it matter for legal AI?

In 2023, attorneys at Levidow, Levidow & Oberman submitted a court brief citing cases that ChatGPT had entirely fabricated. Judge Kevin Castel of the Southern District of New York sanctioned them $5,000 in June 2023 for filing unverified AI-generated citations. The ruling established that existing competence and candor rules fully apply to AI-generated work product, and it has been cited by bar associations nationwide as the benchmark case for generative AI malpractice risk.

How accurate are current AI tools at contract review?

Accuracy varies sharply by task complexity. A 2018 LawGeex study found that a pre-LLM AI classifier achieved 94% accuracy versus 85% for experienced lawyers on standardized NDAs. On the CUAD benchmark (510 commercial contracts with 41 clause categories), models fine-tuned on the dataset reach F1 scores in the mid-80s on common clauses but fall to 30–50 on rarer, high-stakes provisions like anti-assignment and most-favored-nation terms. Informal 2023 evaluations of GPT-4 showed meaningful improvement on the simpler categories. The realistic picture is strong performance on routine work and not enough reliability on complex clauses to use output without review.

What is Spellbook and how does it differ from Harvey?

Spellbook is a contract AI add-in for Microsoft Word, founded in Canada by Scott Stevenson and targeting solo practitioners and small firms priced out of enterprise platforms. It raised a $10.9 million funding round in 2023. Harvey targets large law firms with firm-wide enterprise contracts; Spellbook uses a per-seat subscription accessible to individual attorneys. Both tools use LLM-based clause drafting and risk flagging; the primary differences are pricing, integration model, and retrieval-layer sophistication.

What does ABA Formal Opinion 512 require of lawyers using AI?

Issued in July 2024, ABA Formal Opinion 512 confirmed that Model Rule 1.1 (competence) requires lawyers to understand AI tool limitations, and that supervision rules (5.1, 5.3) apply to AI outputs the same way they apply to associate or paralegal work. The opinion requires disclosure of AI use where court rules or client agreements mandate it. It does not ban AI use but establishes a verification standard: unreviewed AI output is not an acceptable basis for court filings or client advice.

Who is liable when an AI legal tool produces an error — the vendor or the lawyer?

The attorney or law firm, not the AI vendor, holds professional responsibility for work product accuracy under existing rules of professional conduct. Harvey's enterprise agreements reportedly cap Harvey's own liability at subscription fees paid, which is standard SaaS contract language. The attorney's malpractice insurer and the relevant bar's disciplinary authority are the accountability mechanisms, and Mata v. Avianca confirmed that courts will not treat AI-generated errors differently from any other form of professional negligence.

Last reviewed May 02, 2026. AI Pulled News is editorial; corrections welcome at /news/about.html.