
You don't want to work with embeddings

Simon Willison wrote an excellent primer on embeddings. I’m here to tell you why I think we should move away from interacting with embeddings directly as quickly as possible.

Programmer-facing embeddings are a weird hack: we took an internal state of a neural net and decided to expose it to the outside world. The core feature of embeddings is that they capture the semantic similarity of the pieces of information they represent. The embedded content can be a picture, a sentence, a whole document, or a fragment of an audio recording. You can create embeddings for, say, the text of two pages from PDFs, and the embeddings let you check whether the pages are semantically similar - e.g. whether they cover similar topics.
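
To make that concrete, here’s a minimal sketch of such a similarity check; it assumes the sentence-transformers library and its all-MiniLM-L6-v2 model, but any embedding model works the same way:

```python
# Minimal sketch of embedding-based similarity.
# Assumes the sentence-transformers library and the all-MiniLM-L6-v2 model.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

page_a = "Quarterly revenue grew 12% on strong cloud demand."
page_b = "Cloud sales drove a double-digit jump in quarterly revenue."

# Each text becomes a 384-dimensional vector of floats.
emb_a, emb_b = model.encode([page_a, page_b])

# Cosine similarity close to 1.0 means "semantically similar".
similarity = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
print(f"cosine similarity: {similarity:.3f}")
```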

As Simon mentioned, embeddings became really popular with the rise of LLMs and RAG: Retrieval Augmented Generation. The idea behind RAG is that instead of relying on the LLM’s frozen, internal, and often fuzzy memory, we present it with text excerpts and ask it to analyze them before providing an answer. RAG is excellent at battling hallucinations and boosting the LLM’s performance, especially when an objective answer is available. RAG is also how you connect the amazing capabilities of LLMs to your private or company data. For the retrieval part, embeddings-based search is used most frequently.
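
The retrieval step of a RAG pipeline then looks roughly like this - a bare-bones sketch, again assuming sentence-transformers, that stops right before the LLM call:

```python
# Bare-bones retrieval for RAG: embed the question, rank stored chunks by
# cosine similarity, and assemble the prompt that would be sent to the LLM.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "The refund policy allows returns within 30 days of purchase.",
    "Our headquarters moved to Berlin in 2021.",
    "Support is available 24/7 via chat and email.",
]
chunk_vectors = model.encode(chunks)  # shape: (n_chunks, 384)

def retrieve(question: str, k: int = 2) -> str:
    q = model.encode([question])[0]
    # Cosine similarity of the question against every stored chunk vector.
    scores = chunk_vectors @ q / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q)
    )
    top = np.argsort(scores)[::-1][:k]
    context = "\n\n".join(chunks[i] for i in top)
    return f"Answer using only these excerpts:\n\n{context}\n\nQuestion: {question}"

print(retrieve("How long do I have to return an item?"))
```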

So why do I hope they go away? Embeddings are a very low-level construct. They’re long vectors of directly uninterpretable numbers. Working with embeddings is programming in the assembly language of AI, with similar pain points. When your data pipelines break and you get nonsensical retrieval results, you have no quick way of singling out the embeddings of the broken pieces of data. You can’t look at a single embedding vector and make sense of it the way you can with a row in a database. The developer experience of working with embeddings is deeply unsatisfactory. I’m a deep skeptic of the long-term potential of vector databases for that reason; they’re betting on the wrong abstraction.

Even worse are slight shifts in embeddings caused again by data issues with inputs or small changes to the model that calculates them. Embeddings are essentially non-debuggable from the point of view of a regular software engineer (or AI engineer) who does not have a PhD in applied mathematics.

What, then, is the better alternative? We should work with purely textual representations, particularly when LLMs are involved. Embeddings are a method of lossy information compression; we should keep compressing, but into a directly interpretable form. For example, a long text chunk should be represented by a short textual description rather than by an abstract vector of numbers. This approach allows a database to perform LLM-based semantic matching, leverage the LLM’s reasoning capability to work out the connection between the query and the matching text chunks, and provide an explanation for a match or its absence.
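
Here’s a rough sketch of what that could look like, with the OpenAI client standing in for whatever LLM the database would call internally; the model name and prompts are purely illustrative assumptions:

```python
# Sketch of LLM-based semantic matching over textual representations.
# The OpenAI client, model name, and prompts are illustrative stand-ins.
from openai import OpenAI

client = OpenAI()

def llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable model will do
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def describe(chunk: str) -> str:
    # At ingest time: compress the chunk into a short, human-readable description.
    return llm(f"Summarize in one sentence what this text covers:\n\n{chunk}")

def matches(query: str, description: str) -> str:
    # At query time: judge relevance and explain the decision in plain text.
    return llm(
        "Does the following document description cover the query? "
        "Answer YES or NO, then explain briefly.\n\n"
        f"Query: {query}\nDescription: {description}"
    )

description = describe("...full text of a PDF page about the 30-day refund policy...")
print(description)  # inspectable as a single value, unlike a vector
print(matches("How do returns work?", description))
```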

Storing purely textual representations in the database would catch several fish with one net:

  • textual representations can be inspected as single values, similar to the rest of your database record
  • regressions in representation are much easier to debug
  • we would likely get rid of reranking models often used as a stop-gap for the hit-and-miss performance of embedding-based retrieval in RAG applications
  • the LLM performing semantic search can be asked to explain its decision¹; nothing comparable exists in the low-level world of embeddings

In sum, swapping embeddings for textual descriptions would offer much better debuggability and legibility of a RAG system. Switching to text would follow the long tradition of textual formats winning in important domains - HTTP, HTML, JSON, SQL, etc. - thanks to their ease of use. Christopher Potts at Stanford argues similarly that the future belongs to specialized LLM-based agents communicating via text.

My friend Igor Canadi quipped that I’m advocating for full table scans - a database person’s worst nightmare. With embeddings, you can perform a very fast approximate nearest neighbor search; with textual descriptions, we would lose that ability. Yes, but you can perform inference in batches to speed up the scan and utilize KV caching for partial caching of the underlying LLM inference. Heck, you can imagine a database that internally uses embeddings to discard the least promising matches, dynamically tuned based on observed usage patterns. There’s a rich and rapidly expanding solution space, largely thanks to the rise of small yet capable models and easy, data-efficient fine-tuning. It’s worth noting that if you’re querying the database to extract text chunks and feed them into a big LLM, a slowdown of 100ms doesn’t matter much; you’re bottlenecked by the big LLM’s response time anyway.
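
To sketch that middle ground: an internal embedding pass prunes the candidates, and an LLM judges the survivors in batches. Here, judge_batch is a placeholder for whatever batched LLM call the database would make:

```python
# Two-stage scan: an internal embedding index prunes candidates, then an LLM
# judges the survivors in batches. `judge_batch` is a placeholder for a
# batched LLM call (one that could also share a KV-cached prompt prefix).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def judge_batch(query: str, descriptions: list[str]) -> list[bool]:
    # Placeholder: ask an LLM, in one batched call, which descriptions match.
    raise NotImplementedError("wire up the LLM of your choice here")

def search(query: str, descriptions: list[str], keep_ratio: float = 0.1) -> list[str]:
    # Stage 1: a cheap, internal embedding scan discards the least promising rows.
    vectors = model.encode(descriptions)
    q = model.encode([query])[0]
    scores = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    keep = max(1, int(len(descriptions) * keep_ratio))
    candidates = [descriptions[i] for i in np.argsort(scores)[::-1][:keep]]

    # Stage 2: the LLM makes the final, explainable call over the survivors.
    verdicts = judge_batch(query, candidates)
    return [d for d, ok in zip(candidates, verdicts) if ok]
```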

I argue that the most natural programming interface for knowledge representation and querying is textual. Embeddings would get relegated to their natural place: internals. I’d love for database people to seriously consider native LLM-based indices and querying primitives that would render vector databases a historical blip.

Footnotes

  1. empirically, this works surprisingly well regardless of whether the LLM is really walking you through its “thought” process or merely dressing up its internal state as a “thought” process. The important bit is that you, a software engineer, get actionable feedback on how to make progress on improving the performance of the semantic search.

