Am I talking about my own personal rags-to-riches story as a result of the cash influx this blog generates for me?
Sadly not.
I am in fact talking about the phenomenal technology that is Retrieval Augmented Generation (RAG), and how it may be the game changer the healthcare LLM scene needs.
If you are floating in the circles of Large Language Models (LLMs) and are interested in what they do or how they work, you have likely come across the concern about hallucinations.
That is, the fact they occasionally make shit up.
RAG is a way of mitigating this risk.
This hallucination problem struck me quite powerfully recently whilst I was looking for my next PC hardware upgrade.
I'm a man who loves technology, and as much as I'd like to say that I'm running a top-end PC with a crazy graphics card for my locally run AI adventures - the truth is far from that.
To be honest, I just wanted a PC that can run Windows 11 before Microsoft cuts me off from my security updates... What?! You, Sam?!... self-proclaimed tech programming AI guru - yes, I have a PC too old for a free upgrade from 10 to 11.
(And no, I'm not spending my hard-earned cash on skiing trips, drug problems or anything as exciting as all that. My lack of excess cash is predominantly a consequence of the responsibilities of small-child-rearing.)
Those who have kids will understand; those without may not. (But as one of my friends - who sent me this meme - so eloquently put it: this is how she deals with the pain of not having children...)
However, I digress.
The concerning hallucination came when I asked ChatGPT to give me a summary of two processors and their compatibility with their respective motherboards. It gave me an extremely detailed and fantastic explanation of how one motherboard was slightly superior to the other for the DDR5 RAM speeds I was looking for, the overclocking settings and... wait a minute - it had just totally made up a load of stuff about compatibility and pricing.
In fact, it sort of conflated the two products into one mishmashed mind-boggle that was hard to pick apart.
But it was so convincing! It's only because I had been salivating over these particular boards - which I knew I couldn't afford - that I picked up the problem. And it was a problem, because had I thrown the fictional money pile I sit on at the CPU / motherboard / memory stack it suggested, the parts would have been totally incompatible.
Drama.
So aside from my dreams of PC upgrades, what does this mean for business?
Especially Healthcare business... let's dig in.
AI in Healthcare, LLM and RAG - why should I care?
Healthcare is a tricky beast to bring the AI game to, trust me, I know.
Having spent weeks staring at information governance standard operating procedures and carefully crafting a data request to tread the micron-thin line of data protection, ethics, Caldicott approval, and actual usefulness in practice...
(If it gets approved, let's hope I've learnt enough to do something actually useful - if only I'd spent as much time honing my AI game as I have learning to complete paperwork... Yes NHS, I'm looking at you.)
The second you throw the kind of hallucination error made above into medical device territory, you've got some real problems. Well, actually, you've got no problem - because no one is going to use your kit. Clinical applications need to be accurate.
How Accurate is Accurate?
The level of accuracy required is an area that probably needs more thought - everyone screams for 100% accuracy, but not much in medicine is 100% accurate, and no test is perfect.
Even the gold-standard tests we do have involve finding a balance - check out the great resource below on understanding sensitivity, specificity and ROC curves.
However, there are some things we can't get wrong in medicine, like a diagnosis or advice about treatment side effects.
For example, let's say I'm using an LLM to pick out diagnoses from patient notes, and it picks up a label of osteosarcoma when actually it was a soft tissue sarcoma.
They sound similar enough right? But they are in fact totally different, as is the respective treatment.
On an individual level, if this error slipped through the net it would be a tragedy...
AI Cancer Chatbot
So let's put this in a relatable context for building something AI-flavoured: I'm building a cancer side-effect chatbot.
My patient asks if a temperature of 38 degrees five days after chemotherapy is normal, and the bot suggests rest, hydration and over-the-counter medication, because it will probably improve, and to monitor for warning signs...
... advice that could reasonably be expected to have been scraped and trained into the model from reputable sources such as the MSD Manual. Which of course is fine, all good and well, and the correct advice - if you hadn't just had chemotherapy...
So can we say LLMs aren't safe for this application?
Can we really say the biggest technological advancement made in AI in history is useless in the healthcare setting due to the potential for hallucination? Not likely.
We just need to mitigate the risks to an acceptable error level.
Retrieval Augmented Generation
As you may have guessed from the name, RAG is an approach that combines two key components: a retrieval system and a generation model.
RAG is designed to generate responses that are grounded in relevant external information (i.e. you feed it your own documents) and it uses these resources with the generation model to answer your questions with contextually relevant responses based on those documents.
So how does it do that then? Let's break it down:
The first step is the embedding and retrieval process
You load up your local LLM with the documents you want it to utilise.
(The Build your own RAG tutorial will follow this, so stay tuned next week)
And by 'load up' I mean you provide your documents to the LLM you're working with, and it embeds them into a vector database.
A vector database is a bit of a leap in understanding - let's take a step back and break this down further:
In RAG, the model responsible for embedding (turning text into vectors) is usually pre-trained. This means the LLM has already been trained on large amounts of text data to understand language and generate meaningful vector representations (embeddings) of text.
For example, Llama 3.1 8B - which is open source and which I use frequently, as it is compact at around 4 gigs and lends itself to these sorts of things.
The retrieval system takes the text documents (or passages) that you want to store in your database and encodes them into vectors. (Vectors are representations of the data in vast chains of numbers, they capture the diction and semantic meaning of the data).
Vectors are a strange beast to understand. If you think about every word and sentiment being given a mathematical pathway and somehow put on a map - it might look vaguely like this diagram below:
You can see Cancer in the vague area of chemotherapy, and right next to chemotherapy is Methotrexate (a chemotherapy) which makes sense...
And then to illustrate the point you've got the topic of exercise and a piano miles away and largely unrelated - but exercise is closer because somewhere in the document it may have described an exercise regime in relation to chemotherapy. And there is a bit of chat about a piano in your uploaded document, but it's nothing to do with the chemotherapy.
The vectors for all documents are stored in a vector database (sometimes called an embedding index). This allows the system to efficiently search for and retrieve the most semantically similar vectors (and thus documents) in response to a query.
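To make that concrete, here's a minimal sketch of the embedding step. I'm assuming the open-source sentence-transformers library and a plain in-memory numpy matrix standing in for a real vector database; the model name and the document snippets are just my own illustrative choices.

```python
# Minimal embedding sketch - sentence-transformers for the embeddings,
# a numpy matrix standing in for the vector database. Illustrative only.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "A temperature above 38 degrees after chemotherapy may indicate neutropenic sepsis.",
    "Methotrexate is a chemotherapy agent used in a number of cancer regimens.",
    "Gentle exercise can help manage fatigue during chemotherapy.",
    "The clinic waiting room has a piano that patients are welcome to play.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small, pre-trained embedding model
doc_vectors = embedder.encode(documents, normalize_embeddings=True)  # shape: (n_docs, dim)

# Our "vector database" is just the embedding matrix kept alongside the original text,
# so we can always get back from a vector to the passage it came from.
vector_db = {"vectors": np.asarray(doc_vectors), "texts": documents}
```

In practice you'd use a proper vector store (Chroma, FAISS, pgvector and friends), but the idea is the same: text in, vectors out, both kept side by side.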
Retrieval
So when we enter our query in the fashion we have become accustomed to with LLMs:
The same process happens. Your query is encoded into a vector that reflects its semantic meaning.
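Continuing the sketch from above, retrieval is just "embed the query the same way, then rank the stored vectors by similarity":

```python
# Retrieval sketch, continuing from the embedding example above.
# The vectors were normalised, so a plain dot product gives cosine similarity.
import numpy as np

def retrieve(query: str, vector_db: dict, embedder, top_k: int = 2) -> list[str]:
    query_vec = embedder.encode([query], normalize_embeddings=True)[0]
    scores = vector_db["vectors"] @ query_vec        # cosine similarity per document
    best = np.argsort(scores)[::-1][:top_k]          # most similar documents first
    return [vector_db["texts"][i] for i in best]

context = retrieve("Is a temperature of 38 degrees after chemotherapy normal?",
                   vector_db, embedder)
```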
Generation
After retrieval, a generation model takes the query along with the retrieved information and generates a response.
The generation step ensures that the output is not just a simple copy-paste from the retrieved text but is synthesized into a coherent, well-formed response based on the retrieved content and the user’s input.
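And the generation step, in the same sketch: stuff the retrieved passages into the prompt and ask the model to answer only from them. Here I'm assuming the ollama Python client and a local llama3.1 model - swap in whichever generation model you actually run.

```python
# Generation sketch - assumes the ollama Python client and a local llama3.1 model.
import ollama

def generate_answer(query: str, context_passages: list[str]) -> str:
    context = "\n".join(f"- {p}" for p in context_passages)
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    response = ollama.chat(model="llama3.1",
                           messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"]

print(generate_answer("Is a temperature of 38 degrees after chemotherapy normal?", context))
```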
So why is this useful?
Responses are grounded in reality (or however real the information in the documents you've embedded is), unlike pure generation models, which are more likely to hallucinate.
Benefits include the ability to adapt to multiple knowledge bases (e.g. clinical documents, research papers), and if something new has happened in the world of medicine that the LLM wasn't trained on, you can provide the up-to-date data, making RAG effective in dynamic fields like healthcare where information evolves rapidly.
Wait, you said 'less likely to hallucinate'... I did indeed.
I know...
RAG systems hallucinate?!
Yes, they can! And it's because of the underlying architecture of how LLMs work.
If we take a step back to our retrieval phase ...
If the system retrieves documents that are semantically related but do not directly answer the user’s query, it can lead to irrelevant or misleading content being presented.
For example, if a patient asks about side effects during chemotherapy, and the system retrieves documents about side effects of immunotherapy - which sit right next to it in our vector database due to vague semantic similarity - this could lead to a misunderstood and incorrect response. Getting information on immunotherapy when you're on chemotherapy is not ideal; they have vastly different management strategies (generally).
A similar issue can occur during the generation phase ...
Even when correct documents are retrieved, the language model may misunderstand or misinterpret the relationships between pieces of information in those documents. This is especially problematic with complex, technical content like medical reports.
The model might overgeneralise or create connections between unrelated facts, generating a response that looks coherent but is factually inaccurate.
I've experienced this first-hand whilst building my patient side-effect chatbot - the chatbot and I have exchanged some heated words about where it sourced its information from. I'm pleased to say I finally won after repeatedly drilling down on it to reveal its sources... if only someone would do that to Trump....
This is more likely to happen if the retrieved documents are incomplete or ambiguous: the model might "fill in the gaps" by producing plausible-sounding but incorrect information.
For example let’s say a patient using my breast cancer chatbot asks:
“What are the chances of experiencing nausea during my second chemotherapy cycle?”
If the system retrieves documents that discuss general side effects of chemotherapy but do not specifically mention second cycle-related nausea, the generation model might hallucinate an answer like:
"Nausea is more common during the first cycle but generally decreases after that," which might not be true according to the specific data on the second cycle.
So now I've given a beacon of hope only to tarnish it again, what can actually be done about the hallucination risks with RAG?
There is a lot - but one area that is on fire right now is combining RAG with knowledge graphs.
What are knowledge graphs?
Knowledge graphs (KGs) provide structured, interconnected information that can enhance the system’s understanding of relationships between concepts and improve the relevance of retrieved documents.
KGs are a data structure that represents information as a set of entities (nodes) and their relationships (edges). So in our healthcare context, you might relate a disease with a treatment, and a treatment with a side effect.
By organising information this way, knowledge graphs allow systems to understand and reason about the connections between concepts in a query or document.
So for our chatbot, if a patient asks about the side effects of a specific drug (e.g. Doxorubicin), the system can use the knowledge graph to infer related drug classes, treatments or cycles and retrieve documents that are more relevant to the query.
This improves retrieval for complex queries by grabbing documents that might not contain exact keyword matches but are highly relevant due to their semantic connection.
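As a toy illustration - the entities and relationships below are made up by me for the example, not pulled from any real ontology - a knowledge graph for the chatbot might look like this, using networkx:

```python
# Toy knowledge graph sketch using networkx. Entities and relations are illustrative.
import networkx as nx

kg = nx.DiGraph()
kg.add_edge("Breast cancer", "Doxorubicin", relation="treated_with")
kg.add_edge("Doxorubicin", "Anthracycline", relation="member_of_class")
kg.add_edge("Doxorubicin", "Nausea", relation="causes")
kg.add_edge("Doxorubicin", "Fatigue", relation="causes")

def related_concepts(entity: str) -> list[str]:
    """Entities directly connected to this one - drug class, side effects, diseases."""
    return list(kg.successors(entity)) + list(kg.predecessors(entity))

print(related_concepts("Doxorubicin"))
# ['Anthracycline', 'Nausea', 'Fatigue', 'Breast cancer']
```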
Another awesome property of the knowledge graph is that it allows the system to expand user queries to related terms. For example, if the patient asks about "fatigue after chemotherapy", the system could expand the query using knowledge from the graph to include related terms like "tiredness" or "lethargy", improving retrieval.
Similarly, a query about a specific cancer subtype (e.g., HER2-positive breast cancer) can be expanded to retrieve documents that discuss general breast cancer treatments and symptoms.
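Query expansion on top of that graph can be as simple as appending any synonym terms the graph links to entities found in the query - again, a rough sketch with made-up synonym edges rather than a real medical ontology:

```python
# Query expansion sketch, building on the toy graph above. Synonym edges are illustrative.
kg.add_edge("Fatigue", "Tiredness", relation="synonym")
kg.add_edge("Fatigue", "Lethargy", relation="synonym")

def expand_query(query: str) -> str:
    extra_terms = []
    for entity in kg.nodes:
        if entity.lower() in query.lower():
            extra_terms += [
                neighbour for neighbour in kg.successors(entity)
                if kg.edges[entity, neighbour]["relation"] == "synonym"
            ]
    return f"{query} {' '.join(extra_terms)}" if extra_terms else query

print(expand_query("What helps with fatigue after chemotherapy?"))
# "What helps with fatigue after chemotherapy? Tiredness Lethargy"
```

The expanded query then goes through the same embedding-and-retrieval step as before, so passages that talk about "tiredness" rather than "fatigue" still get pulled in.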
So this sounds like a no brainer, why don't we all just use this hybrid RAG KG system?
Building and Maintaining the KG: Creating a comprehensive and up-to-date healthcare knowledge graph is a resource-intensive task. You have to map the relationships manually, and even with a small set of data, you're looking at a lot of work.
You also need to design the system so that the retrieval and generation components can effectively interact with the knowledge graph without sacrificing performance or scalability.
Final thoughts
LLMs on their own are a bit risky for healthcare. Hence the tumbleweed you see when you look for any implementations of them in practice.
Integrating knowledge graphs with RAG systems can greatly enhance document retrieval and ranking by providing a structured understanding of relationships between medical concepts. You can reduce hallucinations and improve the system’s accuracy, particularly in complex domains like healthcare.
But at what cost? Is it too much work to justify the benefit?
Perhaps we need to start drawing ROC curves that incorporate the cost of effort, with some sort of diminishing-returns overlay mapped on top.
Work to be done, but exciting times.
As the autumn rolls in, I'm glad for my actual tartan RAG up here in Scotland.
Thanks for reading and see you next time!