Legal experts are struggling to assess the rapidly evolving landscape of generative artificial intelligence tools for legal research, according to panelists who spoke during a session at the American Association of Law Libraries annual conference in Portland, Oregon, this week.
The panel, entitled “AI in Legal Research: Measuring Important Things in Benchmarks and Rubrics,” brought together three experts working at the forefront of AI assessment: Sean Harrington, director of technology innovation at the University of Oklahoma College of Law; Cindy Guyer, practice innovation lawyer at O’Melveny & Myers; and Nick Hafen, head of legal technology education at BYU Law School.
The panel was moderated by Debbie Ginsburg, faculty services manager at the Harvard Law School Library.
The Benchmarking Challenge
According to the panelists, benchmarking (comparing software tools against consistent standards) faces unique obstacles in the legal field. Unlike hardware benchmarks, where metrics such as battery life and processing speed can be measured clearly, legal AI tools pose far more complex measurement challenges.
“Legal questions are not always easily categorized as having correct answers,” Hafen explained. “We’re always debating what the correct answer is.”
The challenges go beyond methodology. Harrington said vendors typically do not provide backend access to their systems, and modern legal AI platforms often use combinations of models rather than relying on a single large language model, making direct comparisons difficult.
“In many cases, they use a combination of models,” Harrington said. “When these things first started, they called the GPT-4 API and that was the engine behind it. Now they’re all using a mix of models.”
Recent Research and Its Limitations
The panel reviewed several recent attempts to benchmark legal AI tools, each with its own approach and limitations.
Stanford University Research: Published as “Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools,” the study tested generative AI tools from LexisNexis and Thomson Reuters. It found hallucination rates of 17% to 33%, but faced criticism of its methodology, particularly its use of the Practical Law product for tasks it was not designed to handle. In a revised version of the study issued in response to the criticism, Lexis+ AI answered 65% of queries correctly, while Westlaw’s AI-Assisted Research hallucinated almost twice as often, with a 33% hallucination rate compared to 17% for Lexis+ AI.
Academic Study on Legal Hallucinations: A study entitled “Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models” tested GPT-4, GPT-3.5, PaLM 2, and Llama 2 on 1,000 randomly selected federal court cases. It reported overall hallucination rates ranging from 58% to 88%, with Llama 2 at the high end, but it too was criticized for methodological issues.
Vendor Self-Studies: Both Harvey AI and Paxton AI have produced their own research with more favorable results. Harvey claimed human-level output on 74% of legal tasks, while Paxton reported an accuracy rate of 94.7%. However, the panelists noted that these studies were essentially “scorecards” created by the vendors themselves.
Vals Research: According to the panelists, the most rigorous effort to date came from Vals (validation and evaluation of legal solutions), which used real law firm workflows to evaluate seven different legal tasks. Vals collaborated with multiple law firms to obtain real-world tasks and included a human baseline performed by contract lawyers.
The “AI Smackdown” Experiment
Guyer described her participation in an “AI Smackdown” carried out for the Law Library Association of Southern California (explained here). Three law firm librarians tested three popular AI research platforms, Lexis Protégé, Westlaw Precision, and vLex, using the same prompts on federal and California law questions.
In the experiment, the tools were evaluated on six factors: accuracy, depth of analysis, citation of primary sources, citation of secondary sources, readability of format, and repeatability. Guyer said the results revealed significant variation in performance and highlighted the issue of trust when a tool returns irrelevant results.
“When you’re looking at the answer, it needs to be really relevant or at least relevant adjacent,” Guyer said, adding that some partners at her firm have abandoned the tools after receiving poor results.
Implementation Challenges
The panelists said adoption rates remain low even at firms with expensive AI subscriptions. Harrington cited a colleague at an Am Law 10 firm who reported that only 10% of its lawyers use AI tools.
“The rest feel they can’t rely on it,” Harrington explained, questioning whether firms can continue to justify the cost if trust remains low.
Guyer noted that her firm created a practice innovation billing code to encourage lawyers to participate in evaluating AI tools, in the belief that lawyers need dedicated time and resources to properly test these systems.
Practical Recommendations
For organizations looking to conduct their own assessments, the panelists offered several recommendations:
- Focus on specific use cases rather than attempting comprehensive testing.
- Start with a basic comparison and iterate over time (a simple tracking sketch follows below).
- Accept that any benchmark is a snapshot in time, given the rapid pace of change.
- Test with realistic prompts (such as those a first-year associate might actually use) rather than “gold standard” questions that do not reflect real usage.
- Engage real users in the evaluation process, despite the time constraints.
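As a purely illustrative sketch, and not something presented by the panel, a firm could track this kind of iterative, rubric-based comparison with a short script that records scores on the six Smackdown-style factors for each prompt and averages them per tool, so each run can be saved as a dated snapshot. The tool names, prompt IDs, and scores below are invented placeholders.

```python
# Hypothetical rubric tracker for comparing legal AI research tools over time.
# Tool names, prompts, and scores are placeholders, not real benchmark results.
from statistics import mean

FACTORS = ["accuracy", "depth", "primary_citations",
           "secondary_citations", "format", "repeatability"]

# Each entry: (tool, prompt_id, {factor: score on a 1-5 scale})
scores = [
    ("tool_a", "ca_employment_q1",
     {"accuracy": 4, "depth": 3, "primary_citations": 4,
      "secondary_citations": 2, "format": 5, "repeatability": 3}),
    ("tool_b", "ca_employment_q1",
     {"accuracy": 3, "depth": 4, "primary_citations": 3,
      "secondary_citations": 4, "format": 4, "repeatability": 4}),
]

def summarize(rows):
    """Average each factor per tool so snapshots can be compared across test dates."""
    by_tool = {}
    for tool, _prompt, factor_scores in rows:
        by_tool.setdefault(tool, []).append(factor_scores)
    return {
        tool: {f: round(mean(run[f] for run in runs), 2) for f in FACTORS}
        for tool, runs in by_tool.items()
    }

if __name__ == "__main__":
    for tool, averages in summarize(scores).items():
        print(tool, averages)
```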
Looking Ahead
The panelists highlighted the need for greater vendor transparency and the potential for industry-standard benchmarks, similar to SOC 2 reports in cybersecurity. However, they acknowledged that such changes would likely require pressure from large law firms rather than individual institutions.
“It’s going to be the Kirkland & Ellises of the world,” Harrington predicted, referring to the large firms that would have sufficient leverage to demand better transparency and reliability standards.
Although current benchmarking efforts are limited, the panelists agreed, they represent an important first step in bringing rigor to AI tool assessment. As the technology continues to evolve rapidly, they concluded, the legal profession will need to develop better evaluation frameworks and manage the inherent challenges of assessing “black box” systems that vendors are unwilling to fully disclose.

