The Failure of RAG
RAG, or Retrieval-Augmented Generation, has been one of the big ideas since the launch of ChatGPT. It’s easy to get up and running as a demo, but it often falls short without a deep understanding of how to make it work.
Search problems
First, RAG relies on accurate and up-to-date information retrieval. If the data sources are flawed or outdated, it’s garbage in, garbage out. Data goes stale, and it can contain opinions and contradictions. When the information passed to the LLM is wrong, the end result is likely to be disappointing at best; an LLM doesn’t understand what is right or wrong. This is a search problem, and there are plenty of ways to overcome it that the simplistic RAG demos don’t deal with.
Second, RAG assumes that the retrieval process will be seamless and effective. But we know that’s not always the case. Sometimes the relevant information is buried under layers of data, making it hard to access. It’s like searching for a needle in a haystack, except you also need to be sure you have the right haystack. You spend more time digging through irrelevant content than finding what you need. This, too, is a search problem.
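To make that concrete, here is a minimal sketch (in Python) of the kind of filtering step the simplistic demos skip: re-ranking retrieved chunks and dropping stale or weakly matching ones before they ever reach the LLM. The Document fields, the thresholds, and the select_context helper are illustrative assumptions, not a prescription.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Document:
    text: str
    source: str
    last_updated: datetime
    score: float  # similarity score from the vector search, assumed precomputed


def select_context(candidates: list[Document],
                   max_age_days: int = 365,
                   min_score: float = 0.75,
                   top_k: int = 3) -> list[Document]:
    """Filter and re-rank retrieved chunks before they reach the LLM.

    A naive RAG demo passes the raw top-k hits straight to the model.
    Here we also drop stale or weakly matching chunks, so garbage in
    is less likely to become garbage out.
    """
    cutoff = datetime.now() - timedelta(days=max_age_days)
    fresh_and_relevant = [
        d for d in candidates
        if d.last_updated >= cutoff and d.score >= min_score
    ]
    # Prefer the strongest matches; ties could also be broken by recency.
    fresh_and_relevant.sort(key=lambda d: d.score, reverse=True)
    return fresh_and_relevant[:top_k]
```

The exact thresholds matter less than the principle: retrieval quality is something you engineer deliberately, not something you inherit for free from a vector database.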
Context is king
Third, there’s the problem of context. Context is king. This is tied to the relevance of the retrieved documents, which can be hit or miss. The model sometimes grabs information that seems right on the surface but doesn’t answer the user’s actual need. RAG doesn’t always grasp the full context of the query, which can lead to outputs that miss the mark entirely. Just because you pull in some data doesn’t mean it fits the generated content or the intended question. The output can feel disjointed, leading to confusion, or it can just be plain wrong.
We’ve seen many instances where LLMs combine answers from different search results into a plausible but entirely wrong answer. One example came when we worked with a client to answer policy questions. There was one policy for children and another for cancer. When asked to create a “child cancer care” policy, the LLM produced a horrifyingly credible description of a child cancer policy that didn’t exist. Output like that is totally unacceptable for a business; it was caught during development and never deployed to a production environment. Hopefully, it illustrates how easy it is for LLMs to add two and two and get five. It’s best not to combine multiple context sources unless you provide enough context for the LLM to understand the nuances of what might be happening.
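One mitigation we find useful is to label every retrieved chunk with its source and to tell the model explicitly not to blend sources. The sketch below is hypothetical; the chunk format and prompt wording are assumptions, and prompt instructions alone are no guarantee, but they give the LLM the context it needs to see that a children’s policy and a cancer policy are different documents.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Label every chunk with its source document so the model cannot
    silently blend two unrelated policies into one plausible answer.

    Items in `chunks` are assumed to look like:
    {"source": "child-policy.pdf", "text": "..."}
    """
    context_blocks = [
        f"[Source: {c['source']}]\n{c['text']}" for c in chunks
    ]
    return (
        "Answer the question using ONLY one of the sources below.\n"
        "If no single source answers it, say that no matching policy exists.\n"
        "Do not combine clauses from different sources.\n\n"
        + "\n\n".join(context_blocks)
        + f"\n\nQuestion: {question}"
    )
```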
Corroboration
Lastly, the human touch is difficult to replicate. While RAG models can provide facts and figures, they often miss the nuance of human interaction. This is improving and will continue to improve.
But, whether they sound human or not, if they are wrong, there will be problems that need attention.
As we build more robust, production-ready LLM systems, we’ve developed ways to handle many of these issues. Our SAGE approach embeds evaluation into our process and is designed to generate corroboration from multiple data sources, building confidence in answers much as a human would. The more weight we can put behind an answer being right, the more confident we can be in it. It’s a trade-off between speed and accuracy; sometimes speed is more important, but for almost any enterprise chat implementation, high accuracy is required.
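As a rough illustration of what corroboration can look like (a simplified stand-in, not the SAGE implementation itself), the sketch below asks several independent pipelines the same question and only returns an answer that enough of them agree on. The callables, the voting scheme, and the min_agreement threshold are assumptions for the example.

```python
from collections import Counter


def corroborated_answer(question: str,
                        answer_fns: list,
                        min_agreement: int = 2) -> str | None:
    """Ask several independent retrieval pipelines the same question and
    only return an answer that enough of them agree on.

    `answer_fns` is a list of callables, each returning a short,
    normalised answer string (or None) from a different data source.
    """
    answers = [fn(question) for fn in answer_fns]
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    best, votes = counts.most_common(1)[0]
    # Trade-off: demanding more agreement costs extra calls (speed)
    # but adds weight behind the answer being right (accuracy).
    return best if votes >= min_agreement else None
```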
A robust and repeatable way to deliver accuracy
These issues show that while RAG has potential, it doesn’t always deliver as we’d like unless you are clear about how answers should be handled. We know we must build many test cases and edge cases for the things that can go wrong. The complexity of human language and the nuances of communication are tough challenges to overcome. Staying aware of these limitations when working with RAG and language models is essential.
Chat and other open-ended generative tasks are some of the hardest things LLMs are asked to do. In our experience, RAG can deliver 60-80% accuracy straight out of the box. For simpler tasks, you can reach 95%+ accuracy quickly. But automated chat will always require higher levels of accuracy for enterprise interactions. You can’t deploy systems that may or may not give the right answer to the public or to unsuspecting users.
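Knowing where you sit in that 60-80% range means running a labelled test set through the pipeline on every change. The sketch below uses a simple exact-match check for brevity; in practice the comparison is usually semantic or LLM-assisted, and the test-case format shown here is an assumption.

```python
def measure_accuracy(test_cases: list[dict], answer_fn) -> float:
    """Run a labelled test set through the pipeline and report accuracy.

    Each test case is assumed to look like:
    {"question": "...", "expected": "..."}
    `answer_fn` is the RAG pipeline under test.
    """
    if not test_cases:
        return 0.0
    correct = sum(
        1 for case in test_cases
        if answer_fn(case["question"]).strip().lower()
        == case["expected"].strip().lower()
    )
    return correct / len(test_cases)
```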
So, what happens when RAG is not getting the results you need? Our SAGE model leads to highly accurate answers and will cater to most knowledge agent systems. However, there are use cases where the required accuracy level, coupled with the nuance expected in incoming questions, drives us towards RAG and/or fine-tuned models. More about that in another blog post soon!