LLMs in Law: Harvey at BigLaw, Spellbook for Solos, and What the Benchmarks Show
In February 2023, Allen & Overy — one of London's five Magic Circle law firms — deployed Harvey, a GPT-4-based legal assistant, across all 43 of its global offices and made it available to roughly 3,500 lawyers. The announcement was treated as a watershed moment: a top-ten global law firm betting its institutional credibility on a generative AI system that had existed for less than a year. By May 2024, A&O had merged with Shearman & Sterling to form A&O Shearman, with Harvey embedded in the combined firm's workflow.
The reception has been decidedly uneven. Benchmarks show genuine capability — GPT-4 passed the Uniform Bar Exam at approximately the 90th percentile in OpenAI's March 2023 technical report — but courts have sanctioned lawyers for hallucinated citations, researchers flag material accuracy gaps on specialized contract clauses, and bar associations are still wrestling with what competent AI use actually means. Here is what the evidence says.
Harvey's Rise: From Seed Round to BigLaw's Default Stack
Harvey AI was co-founded in 2022 by Winston Weinberg, a former litigation associate at O'Melveny & Myers, and Gabriel Pereyra, a former DeepMind researcher who had worked on language model alignment. The company's core argument was that general-purpose LLMs needed domain-specific fine-tuning and a purpose-built retrieval layer to be trustworthy in legal practice, rather than a ChatGPT wrapper with a law-firm-flavored system prompt.
OpenAI backed the premise early. Its Startup Fund led Harvey's seed round in late 2022 and participated in the $21 million Series A, led by Sequoia Capital, in April 2023. OpenAI also granted Harvey access to fine-tune GPT-4-class models on curated legal corpora: jurisdiction-specific case law, regulatory filings, and transactional precedents that the base model had seen only partially. By December 2023, Harvey had closed an $80 million Series B at a valuation reported by multiple financial outlets to be approximately $700 million, led by Elad Gil and Kleiner Perkins with Sequoia Capital and the OpenAI Startup Fund participating. The pace signaled that institutional law firms were willing to commit budget, not just run pilots.
Allen & Overy's February 2023 deployment was the marquee reference. The firm reported using Harvey for contract drafting assistance, due diligence summarization, regulatory mapping across jurisdictions, and first-pass document review. Nick Seddon, then A&O's chief information officer, publicly called it "the beginning of a profound change in the way legal services are delivered." PwC followed in March 2023 with an announcement that it would roll Harvey out across its legal business. The move was notable because PwC Legal is a regulated legal services provider that competes directly with BigLaw on transactional work, not a captive internal counsel team.
What Harvey Actually Runs On
The Solo Attorney's Answer: Spellbook
For attorneys outside the Am Law 200, Harvey's enterprise pricing is effectively inaccessible. Spellbook, founded by Scott Stevenson in Canada, has positioned itself as the practical alternative. The product runs as a sidebar add-in inside Microsoft Word, the document environment where most solo and small-firm attorneys already work, and offers contract drafting suggestions, clause redlining, and risk-flag identification without requiring workflow changes. Spellbook raised $10.9 million in 2023, a fraction of Harvey's war chest but significant for a product targeting a segment of the legal market that enterprise legal tech has historically underserved.
Where Harvey pitches institutional transformation at scale, Spellbook's marketing focuses on reviewing contracts in minutes rather than hours. Both value propositions are genuine; both carry the same caveat: output requires attorney verification before it touches a signed document or a filed pleading.
What the Contract-Review Benchmarks Actually Show
The most-cited early data point is a 2018 LawGeex study that evaluated an AI classifier — pre-transformer, not LLM-based — against 20 experienced lawyers on a set of non-disclosure agreements. The AI achieved 94 percent accuracy on identifying 30 pre-specified legal issues, versus an 85 percent human average, and completed the review in 26 seconds against a human average of 92 minutes. The result was widely and somewhat misleadingly reported as proof that AI outperforms lawyers. The study's scope was narrow: a single, highly standardized document type with pre-labeled, well-defined issue categories.
A more demanding test is CUAD (the Contract Understanding Atticus Dataset), released in 2021 by the Atticus Project. CUAD contains 510 commercial contracts with 13,101 manually labeled clause annotations across 41 legal categories: termination-for-convenience rights, governing law selections, indemnification caps, limitation-of-liability provisions, anti-assignment clauses, and most-favored-nation terms, among others. Models must extract the correct clause text or respond "not present" for each category. Early GPT-3-class models achieved F1 scores in the mid-40s on the hardest CUAD categories. Fine-tuned models specifically trained on CUAD reached the mid-80s on common clause types but fell to F1 scores in the 30–50 range on rarer, high-stakes provisions, which are exactly the clauses that carry outsized risk in litigation and negotiation.
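To make the F1 figures concrete, here is a minimal sketch of how a per-category score could be computed for an extract-or-"not present" task. It assumes a simplified exact-match rule and is not CUAD's official evaluation code; the function name and the toy data are hypothetical.

```python
from typing import List, Optional


def f1_for_category(gold: List[Optional[str]], predicted: List[Optional[str]]) -> float:
    """Exact-match F1 for one clause category across a set of contracts.

    Each list holds one entry per contract: the clause text, or None for
    "not present". Exact string matching is an assumption made for brevity.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if g is not None and p is not None and g.strip() == p.strip():
            tp += 1  # correct clause extracted
        else:
            if p is not None:
                fp += 1  # model extracted something that is wrong or absent
            if g is not None:
                fn += 1  # a real clause was missed or mis-extracted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data for one category over four contracts.
gold = ["Clause A", None, "Clause B", "Clause C"]
predicted = ["Clause A", "Clause X", None, "Clause D"]
score = f1_for_category(gold, predicted)
```

Note that a wrong extraction counts against both precision and recall under this rule, which is one reason scores on rare categories with few positive examples can drop quickly.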
GPT-4, evaluated informally by multiple research groups through 2023, showed meaningful improvement on CUAD's simpler categories. Stanford CodeX-affiliated researchers noted in 2023 workshop proceedings that the primary failure mode was not hallucinated facts but misclassification: the model would identify a clause as present but assign it to the wrong category, or extract the wrong passage as the operative clause. For a solo attorney using Spellbook or a BigLaw associate using Harvey, the operational implication is identical — output requires verification, not blind acceptance.
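The distinction between misclassification and other failure types can be made explicit when reviewing model output. The sketch below tallies error types for paired gold and predicted extractions; the category labels, the `Extraction` type, and the sample pairs are illustrative assumptions, not taken from any published evaluation.

```python
from collections import Counter
from typing import List, NamedTuple, Optional, Tuple


class Extraction(NamedTuple):
    category: Optional[str]  # None means "not present"
    text: Optional[str]


def classify_error(gold: Extraction, pred: Extraction) -> str:
    """Label one gold/predicted pair with a coarse outcome type."""
    if gold.category is None and pred.category is None:
        return "correct_absent"
    if pred.category is None:
        return "missed"
    if gold.category is None:
        return "spurious"
    if pred.category != gold.category:
        return "wrong_category"  # clause found, filed under the wrong label
    if pred.text != gold.text:
        return "wrong_passage"  # right label, wrong operative text
    return "correct"


def tally(pairs: List[Tuple[Extraction, Extraction]]) -> Counter:
    return Counter(classify_error(g, p) for g, p in pairs)


# Hypothetical review sample.
pairs = [
    (Extraction("governing_law", "New York law governs."),
     Extraction("governing_law", "New York law governs.")),
    (Extraction("anti_assignment", "No assignment without consent."),
     Extraction("change_of_control", "No assignment without consent.")),
    (Extraction("cap_on_liability", "Liability is capped at fees paid."),
     Extraction("cap_on_liability", "Each party shall indemnify the other.")),
    (Extraction(None, None), Extraction(None, None)),
]
counts = tally(pairs)
```

A breakdown like this tells a reviewer where to spend time: a high share of "wrong_category" results calls for checking labels, while "wrong_passage" calls for rereading the contract text itself.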
The Bar Exam Is Not the Hard Test
OpenAI's GPT-4 technical report (March 2023) reported a score of approximately 298 out of 400 on the Uniform Bar Exam, placing the model at roughly the 90th percentile among human test-takers. That result is genuine and impressive. It is also somewhat misleading as a capability proxy: the bar exam tests rule recall, issue spotting, and structured argument construction — tasks heavily represented in GPT-4's training distribution and well-suited to autoregressive text generation. It does not test precise clause-level extraction, multi-paragraph cross-reference reasoning, or the kind of risk judgment that fills a transactional attorney's actual workday.
The Malpractice Flashpoint: Mata v. Avianca
No single event did more to crystallize the legal AI liability debate than the sanctions order in Mata v. Avianca, Inc., 22-cv-1461 (S.D.N.Y.). In June 2023, Judge Kevin Castel sanctioned Steven A. Schwartz and Peter LoDuca of Levidow, Levidow & Oberman after the attorneys submitted a brief citing cases that did not exist — cases entirely fabricated by ChatGPT. The attorneys had asked ChatGPT to identify supporting precedents, received plausible-looking citations complete with case names, volume numbers, and reporters, and filed the brief without verifying the citations against Westlaw, LexisNexis, or any other legal database.
Judge Castel's opinion did not rule AI use in legal practice impermissible; it ruled that the attorneys had violated their professional obligations by failing to verify their work product. The sanctions, imposed under Federal Rule of Civil Procedure 11, totaled $5,000: modest in dollar terms but substantial professionally. The decision quickly became a staple of continuing legal education programs. It was cited in subsequent bar ethics proceedings as the paradigm case for applying existing competence standards to generative AI, and it made clear that the novelty of a tool provides no safe harbor from the duties of candor and honesty reflected in Model Rules 3.3 and 8.4.
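The verification step that was skipped in Mata can be partly mechanized as a pre-filing gate. The following is a rough sketch under stated assumptions: the regular expression covers only a handful of simple federal reporter formats, the `lookup` callable is a hypothetical stand-in for a check against a real legal database, and the citations in the sample text are invented for illustration. A gate like this only flags citations for attorney review; confirming that a citation exists says nothing about whether the case supports the proposition.

```python
import re
from typing import Callable, List

# Deliberately narrow: "<volume> <reporter> <page>" for a few federal reporters.
# Real citation formats are far more varied.
CITATION_RE = re.compile(
    r"\b\d{1,4} (?:U\.S\.|F\.2d|F\.3d|F\.4th|F\. Supp\. 2d|F\. Supp\. 3d|F\. Supp\.) \d{1,5}\b"
)


def extract_citations(brief_text: str) -> List[str]:
    """Return unique reporter citations in order of first appearance."""
    seen: List[str] = []
    for match in CITATION_RE.finditer(brief_text):
        citation = match.group(0)
        if citation not in seen:
            seen.append(citation)
    return seen


def unverified_citations(brief_text: str, lookup: Callable[[str], bool]) -> List[str]:
    """Return citations that the (hypothetical) database lookup cannot confirm."""
    return [c for c in extract_citations(brief_text) if not lookup(c)]


# Invented sample text and a toy "database" of confirmed citations.
brief = (
    "See Smith v. Jones, 123 F.3d 456 (2d Cir. 1997); "
    "Doe v. Roe, 999 F.3d 111 (11th Cir. 2019)."
)
confirmed = {"123 F.3d 456"}
flagged = unverified_citations(brief, lambda c: c in confirmed)
```

Here `flagged` holds the one citation the toy lookup could not confirm, which is the list an attorney would need to resolve before filing.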
The Bar Responds: ABA Formal Opinion 512
The American Bar Association addressed the competence question formally in ABA Formal Opinion 512, issued in July 2024. The opinion confirmed that Model Rule 1.1 (competence) extends to understanding the limitations of AI tools a lawyer employs; that Rules 5.1 and 5.3 (supervision of lawyers and nonlawyer assistance) apply to AI-generated work product the same way they apply to output from associates and paralegals; and that disclosure of AI use may be required by applicable court rules or client agreements. The opinion did not prohibit AI use. It imposed a verification standard that most existing legal AI marketing — which typically emphasizes speed and implicitly discourages post-generation review — undercuts.
The Liability Gap
The structural tension in legal AI is that the incentive structure and the liability structure point in opposite directions. Law firms deploy AI to reduce associate hours billed — or to maintain margins while offering modest billing relief to clients who demand it. But every error that reaches a filed document or a signed agreement is a potential malpractice claim. Harvey's enterprise agreements reportedly include indemnification language capping Harvey's liability at subscription fees paid — a standard SaaS limitation that places malpractice exposure squarely with the firm. Solo practitioners subscribing to Spellbook face the same exposure with no institutional backstop and typically lower malpractice coverage limits.
Where This Is Headed
Competitive pressure from incumbents — Thomson Reuters (which acquired Casetext for $650 million in August 2023 and launched Westlaw AI on the back of CoCounsel's technology), LexisNexis (Lexis AI), and contract-intelligence platforms like Luminance and Kira — has compressed the timeline for feature parity across the market. What Harvey's early BigLaw deployments demonstrated is that institutional clients will tolerate a meaningful error rate if the productivity gain is large enough. What Mata v. Avianca demonstrated is that Article III courts will not extend the same tolerance.
The practical resolution is already taking shape in bar guidance and federal court standing orders: AI output touching a filed document must be reviewed and attested to by a licensed attorney. That requirement does not eliminate the value proposition of Harvey or Spellbook. It redirects it. The tools that survive the next wave of regulatory tightening are the ones that make attorney verification fast and auditable — not the ones that market verification as a step the technology has made unnecessary.
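What "fast and auditable" verification might look like in practice can be sketched as a small review record tied to the exact text the attorney approved. This is an illustrative assumption about one possible design, not a description of how Harvey, Spellbook, or any court rule works; every field name below is hypothetical.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ReviewRecord:
    output_sha256: str      # fingerprint of the AI output that was reviewed
    reviewer: str           # name of the reviewing attorney
    bar_number: str         # hypothetical identifier field
    reviewed_at: str        # ISO 8601 timestamp in UTC
    citations_checked: int  # how many citations the reviewer confirmed


def fingerprint(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def attest(ai_output: str, reviewer: str, bar_number: str,
           citations_checked: int, now: Optional[datetime] = None) -> ReviewRecord:
    """Create a record binding a reviewer to one specific version of the text."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    return ReviewRecord(fingerprint(ai_output), reviewer, bar_number,
                        stamp, citations_checked)


def matches(record: ReviewRecord, text: str) -> bool:
    """True only if the text is byte-for-byte what the reviewer attested to."""
    return record.output_sha256 == fingerprint(text)


draft = "Section 9.2: Liability is capped at fees paid in the prior 12 months."
record = attest(draft, reviewer="A. Attorney", bar_number="0000000",
                citations_checked=0)
```

Because the record stores a hash of the reviewed text, any later edit to the draft makes `matches` return False, signaling that the changed version has not been reviewed.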
Sources & further reading
- OpenAI GPT-4 Technical Report (March 2023)
- CUAD: An Expert-Annotated NLP Dataset for Legal Contracts (Atticus Project, arXiv 2021)
- Allen & Overy to deploy Harvey AI firm-wide — Reuters, February 2023
- Lawyers sanctioned for filing ChatGPT-hallucinated citations, Mata v. Avianca — Reuters, June 2023
- ABA Formal Opinion 512: Generative Artificial Intelligence Tools (July 2024)
- Stanford CodeX — Center for Legal Informatics
- AI vs. Lawyer: The Battle of the NDA — LawGeex (2018)
Last reviewed May 02, 2026. AI Pulled News is editorial; corrections welcome at /news/about.html.