I believe we’re 2-4 years too early to apply AI in the legal space, and the bottleneck is not necessarily benchmark-measured LLM performance, but the infrastructure surrounding it: retrieval techniques, ingestion of sources, LLM inference latency, eval infrastructure, AI agents for more complex retrieval, etc. All these pieces require serious build-out, and they’re not related to the core value proposition of the product you’re building. You can try solving all these generic problems in-house, but you’re likely to end up with very serious scope creep before you can test any semblance of Product-Market Fit. These core infrastructure pieces will get built, but I believe it will happen on the back of applying AI to simpler domains than legal, and then applying AI to legal will make sense.
In other words, we’re in a moment like the early internet era, when a lot of ideas made sense but the timing was often 2-5 years off. In a tech business, the idea and the timing must align.
I started my journey of applying AI to legal questions a year ago, building Hotseat. Over time, my two cofounders, Hugo and Lukas, joined. The initial idea was to target in-house lawyers’ questions and have the AI provide source-backed answers. It started with simple prompt engineering combined with single-document (e.g., a regulation) RAG, which gave very promising results. This was when people talked about battling hallucinations, and RAG does indeed help to a large extent. However, closer inspection of the results revealed that LLMs make subtle logical errors when synthesizing source information, and no amount of prompt engineering looked like it would fix that. The problems we found were often in the tail end of topics and concepts people ask about. The tail end is where LLMs struggle the most, but it’s also what professional lawyers care about; the obvious questions already have answers in lawyers’ heads. We concluded that, given the current state of LLM reasoning, end-to-end answers to legal questions from in-house lawyers are a non-goal. To be clear, LLMs are not bad at synthesis across the board; quite the contrary. For many everyday tasks, where single-word subtleties don’t matter, LLMs prove to be excellent.
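For the curious, here’s roughly what that first setup looked like. This is a minimal sketch, assuming the OpenAI Python SDK and a naive chunk-and-embed index over a single regulation; the chunking, model names, and prompt are illustrative, not our production pipeline.

```python
# Minimal single-document RAG sketch (illustrative, not Hotseat's actual code).
from openai import OpenAI

client = OpenAI()


def chunk(text: str, size: int = 1200, overlap: int = 200) -> list[str]:
    """Split one regulation into overlapping character windows."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks


def embed(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))


def answer(question: str, regulation_text: str, top_k: int = 5) -> str:
    chunks = chunk(regulation_text)
    vectors = embed(chunks)
    q_vec = embed([question])[0]
    # Rank chunks by similarity to the question and stuff the best ones into the prompt.
    ranked = sorted(zip(chunks, vectors), key=lambda cv: cosine(q_vec, cv[1]), reverse=True)
    context = "\n\n---\n\n".join(c for c, _ in ranked[:top_k])
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Answer strictly from the provided regulation excerpts and cite "
                        "the excerpt you rely on. If the excerpts don't cover it, say so."},
            {"role": "user", "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```

A setup like this is enough to ground answers in real passages, which is exactly why the early results looked promising; the subtle synthesis errors only show up when you read the output against the source with a lawyer’s eye.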
While building the end-to-end Q&A, we stayed in touch with a few law firm lawyers. They pointed out that while the synthesis was faulty, the surfacing of quotes looked really promising and would already be of tremendous help in their work. They said that research is a major pain. So we pivoted to “semantic search” for lawyers. While I thought this was too small a market, the bet was that we could deliver value rather quickly based on what we had already built, and once we had nailed a real-world workflow, we would take the technical and product lessons and expand our focus (with the help of venture capital). I also kept in mind that we might find an adjacent area in lawyers’ work more amenable to AI-based automation. You can think of this as a customer/product discovery phase.
We started building the semantic search focused on EU laws related to tech, fintech, and a few other areas. For a fully-fledged product, the sources would of course have to encompass a much larger corpus, but we wanted to avoid the classic startup mistake of doing too much at the beginning, so we went comically narrow, to test whether a prototype would lead to some hypothesis about the path to PMF.
We adopted the slot machine metaphor for the product. AI tends to oscillate between being stellar and mediocre, and one common way of dealing with this is to give the AI more than one shot at the result. We would return a list of sources/quotes and let the user skim the results and pick out the relevant ones, the jackpots. The UI would be your typical search results list with the added magic of AI-generated content: a hypothesis of why a given document/quote was relevant to the search query.
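To make the idea concrete, here’s a minimal sketch of how such a per-result relevance hypothesis could be generated. It assumes the OpenAI Python SDK; the data shape, prompt, and model name are illustrative rather than what we actually shipped.

```python
# Sketch of the "slot machine" result list: each retrieved quote gets a one-sentence,
# AI-generated guess at why it might matter, and the user picks the jackpots.
from dataclasses import dataclass

from openai import OpenAI

client = OpenAI()


@dataclass
class SearchResult:
    source: str       # e.g. "GDPR, Art. 33(1)" (hypothetical example)
    quote: str        # verbatim excerpt surfaced by the retrieval layer
    hypothesis: str   # AI-generated guess at why the quote is relevant


def explain_relevance(query: str, source: str, quote: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "In one sentence, state why the quoted passage might be relevant "
                        "to the user's legal question. Do not overclaim."},
            {"role": "user",
             "content": f"Question: {query}\n\nSource: {source}\n\nQuote: {quote}"},
        ],
    )
    return resp.choices[0].message.content


def build_results(query: str, hits: list[tuple[str, str]]) -> list[SearchResult]:
    """hits: (source, quote) pairs coming from whatever retrieval layer sits underneath."""
    return [SearchResult(src, q, explain_relevance(query, src, q)) for src, q in hits]
```

The hypotheses are cheap to generate and easy to skim, which is what made the demos shine; the hard part, as the next paragraphs explain, is getting the right quotes into `hits` in the first place.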
In demos, this looked fantastic. Instead of wading through long, dry legal docs, you get the AI to extract relevant quotes along with succinct descriptions for quick glancing.
However, once we started rolling this out with our design partners (law firms), we faced more and more issues with retrieval. The more we dug into them, the more it looked like we would either need to spend a huge amount of effort on fine-grained tweaking of retrieval or employ some form of agentic, LLM-based retrieval. The LLM-based retrieval could potentially let us execute lightweight, on-the-fly retrieval tasks and sidestep the difficult job of building indices upfront, as sketched below.
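For illustration only, here’s roughly what such an LLM-driven retrieval loop could look like. The keyword scorer is a toy stand-in for a real search backend, and the whole thing is an assumed shape, not our actual implementation; the point is the loop where the model inspects results and decides whether to refine the query or stop.

```python
# Sketch of agentic retrieval: the model drives a lightweight search loop on the fly
# instead of relying on a carefully tuned, prebuilt index.
import json

from openai import OpenAI

client = OpenAI()


def keyword_search(query: str, documents: dict[str, str], top_k: int = 3) -> list[dict]:
    """Toy stand-in for a real search backend: score documents by term overlap."""
    terms = set(query.lower().split())
    scored = [
        {"document": name, "score": sum(text.lower().count(t) for t in terms), "snippet": text[:500]}
        for name, text in documents.items()
    ]
    return sorted(scored, key=lambda d: d["score"], reverse=True)[:top_k]


def agentic_retrieve(question: str, documents: dict[str, str], max_steps: int = 4) -> str:
    trace, query = [], question
    for _ in range(max_steps):
        hits = keyword_search(query, documents)
        trace.append({"query": query, "hits": hits})
        resp = client.chat.completions.create(
            model="gpt-4o",
            response_format={"type": "json_object"},
            messages=[
                {"role": "system",
                 "content": 'You are a legal research agent. Given the search trace, reply as JSON: '
                            'either {"done": true, "quotes": [...]} with passages worth showing the '
                            'lawyer, or {"done": false, "next_query": "..."} to refine the search.'},
                {"role": "user",
                 "content": f"Question: {question}\n\nSearch trace so far:\n{json.dumps(trace, indent=2)}"},
            ],
        )
        decision = json.loads(resp.choices[0].message.content)
        if decision.get("done"):
            return "\n\n".join(decision.get("quotes", []))
        query = decision.get("next_query", query)
    return "No confident result within the step budget; fall back to plain search."
```

Even in this toy form you can see the trade-off: each step costs an LLM round trip, so latency and inference cost add up quickly, which is part of the infrastructure gap described later.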
However, the more we thought about the problem, the more we realized that the indexing infrastructure we would have to build would be in many ways similar to what, e.g., Perplexity has to build.
Business-wise, we realized we would be in a very uncomfortable corner: the bar on retrieval quality needed to reach user utility is really high, and the use case is niche.
We spent a fair amount of time exploring the various jobs and workflows lawyers perform, and upon closer inspection, we concluded that the work consists of many disparate tasks, most of which have a component that isn’t yet amenable to AI: the bar for single-word precision and subtle reasoning is simply too high. This is despite LLMs making strides on legal benchmarks. Anyone seriously working on LLM-based applications is aware of the gap between lab benchmarks and real-world usage, and I don’t believe there’s a local (domain-specific) way to close this gap quickly enough.
Sure, we could build flashy demos for Q&A over regulations, generate common document boilerplate, or perform rudimentary proofreading, and close deals funded from the AI exploration/innovation budgets of large law firms.
In the long term, however, utility wins over hype. I don’t believe we have the right infrastructure tools at our disposal to escape the AI demo trap just yet. Retrieval is rudimentary and hard to iterate on, high-quality document ingestion is only now being built as I write these words, and LLM inference cost and latency are too high for a satisfying product experience in a domain as challenging as legal.
The situation I’m describing reminds me of the video-sharing sites people built in the early 2000s, all of which flopped. The problem was that internet bandwidth was too slow and browsers had unergonomic support for video playback. By 2005, the surrounding infrastructure had advanced, and YouTube was born. Even if the earlier attempts had had $50m of funding, it wouldn’t have changed anything: no amount of money would fix the browsers on people’s computers.
I believe we’re in a similar spot with applying AI. In some areas, like customer service or coding, all the pieces are ready; in others, like legal, they’re not, and large funding rounds won’t fix that. The tech development curve at large has to run its course.
In closing, I’d like to share a few tactical mistakes I made while building Hotseat:
- In building AI products, you need some sort of data loop - a feedback mechanism that lets you improve your product; in legal, this is hard to pull off due to stringent security concerns and the protection of client-privileged information.
- We couldn’t create the data loop in-house - none of the founders were lawyers, and we didn’t find one willing to jump ship.
- We were not building the product for ourselves, and it was too easy to fail the Mom Test, which we did.
- We thought our competition in legal research was legacy legal information systems with truly outdated search and bad UI; however, precisely because those systems are so bad, lawyers often resorted to Googling, so we were effectively competing with a free product.
With a bitter taste, I’m shutting down Hotseat. Having seen AI’s shortcomings up close, am I still bullish on AI and LLMs as the major technology shift? Absolutely. Is there a redeemable piece we found while building Hotseat? Sotto voce.
Thanks to the Traple, LBK&P, and Wardyński law firms for betting on us at such an early stage. If you’re looking for highly competent lawyers with a knack for tech regulations in CEE, I can’t recommend these firms enough.