As an AI advocate and researcher heavily invested in answering the question of whether LLMs can be used in routine clinical care, the paper "Adapted large language models can outperform medical experts in clinical text summarization" (Nature Medicine) was a total win for the cause.
I've gone into the paper in some depth and shared my summary of it in this post... However, reflecting on the practical reality of such a tool has also led me to ask the question: Just because we can, should we?
AI meets Electronic Health Records
There is a lot of interest in using LLMs with patient notes to create summaries.
Of course, that's not their only potential utility. But analysing vast amounts of unstructured text and summarising the key information is something language models seem to be pretty good at (enter the mainstream AI services that have ingested substantial proportions of the internet to answer my whims on demand).
We clinicians spend an eye-watering amount of time going through notes trying to find any glimmers of useful information to piece together a narrative for our patients.
So you can see why this seems like an easy win for AI.
A system that can potentially automate the summary process, time saved - hallelujah. NHS fixed.
Although, is it?
Some of the best learning for me as a junior doctor was putting together these patient summaries myself. I remember (likely with rose-tinted glasses) my first SHO rotation in infectious diseases.
To set the scene, it's winter 2016. Sub-zero temperatures outside, 350 degrees inside the hospital (as it always was), and I'm in at 7:30 preparing the ward round.
It's during this time that I would trawl through extensive note volumes for each patient to create a summary for the daily ward rounds.
As a side note, this was also the time in my life when I developed the skill of reading barely legible handwriting - if you can even call it that - scratchings might be a more accurate term.
It's by putting this narrative together that I really understood the patient's journey. I tracked the blood tests, infection markers, what samples had been sent and what had been looked for.
As a byproduct of this process, I started to understand why they had been sent.
What was the differential diagnosis we were trying to narrow down?
How did we do it, what test did we send, what did that result tell us...
Sitting down and spending the time needed to ingest this journey - the narrative, the differentials - to the point where I could accurately present the patient as part of the ward round was an absolutely essential part of my learning to be a doctor.
It taught me and guided me to what I needed to know, and what I needed to do in future when it was my turn to make the decisions.
This was gold dust experiential learning. You reflect on what has been done for the patient, what worked, what didn't, and then you're in it, you're part of the team that makes the call on what to do next. Usually doing the jobs you create, and usually first on the team to see the results.
And yeah, that took time.
Lots of time.
In fact, I think across the board we all came in significantly before our shift started to do this...
(How much earlier depended largely on the consultant you were under that week and how meticulous they were about you knowing verbatim memorised test results... and probably correlated directly with the size of the bollocking you'd receive if you didn't know the answers... :D)
Without digressing too far, of course, this wasn't paid time...
But we all knew that was part of the job; it's part of being a doctor.
Anyone who'd done their homework on becoming a doctor knew this was part of the gig.
This is why some of us get a bit pissed off when people start talking about junior doctor salaries and what our hourly wage is - our contracted hours do not reflect the hours we actually work.
Anyway, back to the point.
So you're saying we shouldn't use summary systems then?
Self-performed summaries, in my view, can be a critical part of learning to be a doctor.
However, this of course isn't always the case.
This process is not all gold dust. Sometimes you just want facts, in a bullet-point nature, and quickly.
Not all ward rounds are equal. They are not all infectious disease ward rounds, with long patient admissions and fruitful learning to be had.
In summary then ... (excuse the pun)
As the conversation about AI versus humans keeps coming back to - and maybe it's the case with all of our healthcare problems when you take them to any depth - the answer is nuanced.
Sometimes AI would be great.
Sometimes AI would be a bad idea.
Sometimes we should use both together.
The gift of fire
The reality, though, is that we will use it. Even when we probably shouldn't.
Even when it's detrimental to our learning.
Especially when it means we don't have to come in early for work.
We've got to be pretty damn careful with this gift of fire.
We don't want a generation of doctors not knowing what to do for a patient because AI did the heavy lifting for them when it mattered.
With that off my chest, we can now delve into the paper! :)
Paper review
I was proud to present this paper to the Digital Health Validation Lab (DHVL) team, headed up by Prof. David Lowe.
Clinical note summarisation by large language models ties in heavily with my PhD research question, and this has been the most comprehensive, methodologically robust, and thorough assessment of LLM-generated summaries that I have found to date.
A summary of the journal review is given below, followed by a deeper dive that pulls apart some of the metrics and methodologies used. Fantastically, the whole project is on GitHub:
So you can see what they've done and how they've done it.
Objective
This paper aimed to assess the effectiveness of large language models performing clinical text summarisation across four tasks:
Radiology report summaries
Progress notes
Patient and clinician dialogue summaries
Patient question summaries
The model-generated summaries were compared to 'expert' physician-generated summaries. The core hypothesis was that, with effective adaptation, LLMs may match or even surpass clinician-generated summaries in quality, completeness, and correctness.
Ultimately suggesting that this could alleviate the workload burden on clinicians.
Methodology
A comprehensive methodology framework was used, illustrated below:
Models
Eight LLMs were selected for the study. They were chosen based on a balance of performance and accessibility, as well as the ability to adapt to clinical tasks.
The selection included both sequence-to-sequence (seq2seq) and autoregressive models, and both open-source and proprietary models.
Methods of Adaptation
Two adaptation methods were utilised:
In-context learning (ICL): example input-output pairs are placed in the system and user prompts to guide summary creation, with no change to the model's weights.
Quantized Low-Rank Adaptation (QLoRA): a parameter-efficient fine-tuning method in which small low-rank adapter weights are trained on clinical text while the quantized base model stays frozen. (A rough sketch of both approaches follows below.)
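To make those two routes a bit more concrete, here is a rough sketch of what each can look like in code. This is not the authors' pipeline (that lives in their GitHub repo) - the prompt wording, example data, model name and hyperparameters below are placeholders of my own, using the Hugging Face transformers/peft libraries for the QLoRA part.

```python
# Sketch 1: in-context learning (ICL) - no weights change; we pack a few
# example (report, expert summary) pairs into the prompt before the new case.
def build_icl_prompt(examples, new_report):
    """examples: list of (report_text, expert_summary) tuples - placeholders here."""
    prompt = ("You are an expert medical professional. "
              "Summarise the findings of the following radiology reports.\n\n")
    for report, summary in examples:
        prompt += f"Report:\n{report}\nSummary:\n{summary}\n\n"
    prompt += f"Report:\n{new_report}\nSummary:\n"
    return prompt


# Sketch 2: QLoRA - load the base model quantised to 4-bit and keep it frozen;
# only small low-rank adapter matrices are trained on the summarisation data.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "some-open-llm",  # placeholder model id, not one of the paper's models
    quantization_config=bnb_config,
)
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(base, lora)  # only the LoRA adapters are trainable
```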
There was a bit of comedy snuck into this paper too. Prompt engineering is a bit of a dark art, and within their explanation of prompt anatomy, they tried two ways of telling the language model what its area of expertise was.
The first was an expert medical professional. The second was a mystical wizard in middle earth.
Hilariously, yet expectedly, the wizard was worse at the summaries. I am definitely recreating this component of the experiment via the GitHub link :)
The Data
The study used six datasets, all publicly accessible. The imaging and radiology datasets were de-identified reports produced as part of routine clinical care. (Note: no images were used in the study; the experiment was based purely on the textual reports.) A subset of each dataset was selected based on the quality of the diagnostic content.
Quantitative Evaluation
The models were then benchmarked on summarisation tasks across the six datasets using metrics for syntactic accuracy (BLEU), semantic similarity (BERTScore), and medical concept accuracy (MEDCON).
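If, like me, you want to get a feel for these metrics, here is a minimal sketch of computing the first two on a made-up report/summary pair (not the authors' evaluation code). MEDCON is omitted because it relies on extracting UMLS medical concepts from both texts and comparing the concept sets, which needs a UMLS licence and a concept extractor.

```python
# A rough illustration of reference-based scoring (not the authors' code).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from bert_score import score as bertscore  # pip install bert-score

reference = "No acute cardiopulmonary abnormality."          # expert summary (made up)
candidate = "No evidence of acute cardiopulmonary disease."  # model summary (made up)

# Syntactic overlap: BLEU compares n-grams of the two token sequences.
bleu = sentence_bleu([reference.split()], candidate.split(),
                     smoothing_function=SmoothingFunction().method1)

# Semantic similarity: BERTScore compares contextual token embeddings.
_, _, f1 = bertscore([candidate], [reference], lang="en")

print(f"BLEU: {bleu:.3f}  BERTScore F1: {f1.item():.3f}")
```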
This stage was pretty key: it determined the "best foot forward", i.e. which model and adaptation method went forward into the clinical reader study.
The results of this stage of the analysis are tricky to interpret without some understanding of the metrics being used. However, even without an appreciation of the nuances of the metrics (other than knowing they have something to do with the performance of creating a textual summary), it is relatively evident that GPT-4 was the top performer across the board.
Key:
a) Impact of domain-specific fine-tuning: Alpaca versus Med-Alpaca. Each data point corresponds to one experimental configuration, and the dashed lines denote equal performance.
b) Comparison of adaptation strategies: one in-context example (ICL) versus QLoRA across all open-source models on the Open-i radiology report dataset. (Note: obviously, being proprietary and, at the time of publication, closed, the GPT models could not be fine-tuned with the QLoRA process.)
c) Effect of context length for ICL: MEDCON scores versus number of in-context examples across models and datasets. The best QLoRA fine-tuned model (FLAN-T5) is included as a horizontal dashed line for valid datasets.
d) Head-to-head model comparison: win percentages of each head-to-head model combination, where red/blue intensities highlight the degree to which models on the vertical axis outperform models on the horizontal axis.
We then take our champion into the clinical reader study.
Clinical Reader Study
This involved 10 physicians rating the summaries for completeness, correctness, and conciseness.
The researchers presented 100 randomly selected samples - each comprising the LLM-generated summary and the medical expert's summary - to the physician readers in a blinded, randomized order.
The study used custom software to present the summaries: a user interface built in Qualtrics, which allowed A/B comparisons of summaries from the best LLM and the medical experts.
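For anyone designing a similar reader study, the blinding step itself is simple: shuffle the sample order and randomise which summary appears as A and which as B, keeping a hidden key for later unblinding. A minimal sketch (the data structure is my own invention, not taken from their Qualtrics setup):

```python
import random

def blind_pairs(samples, seed=42):
    """samples: list of dicts with 'llm' and 'expert' summaries (hypothetical schema).
    Returns the blinded presentations plus a hidden key for later unblinding."""
    rng = random.Random(seed)
    presentations, key = [], []
    for s in rng.sample(samples, len(samples)):   # shuffle sample order
        llm_first = rng.random() < 0.5            # randomise which side the LLM gets
        a, b = (s["llm"], s["expert"]) if llm_first else (s["expert"], s["llm"])
        presentations.append({"summary_A": a, "summary_B": b})
        key.append("A=llm" if llm_first else "A=expert")
    return presentations, key
```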
The study developed a custom Likert scale to assess the quality of model-generated summaries across three attributes: completeness, correctness, and conciseness.
For each attribute, physician readers were asked to evaluate the summaries on a five-point scale, focusing on these specific questions:
Completeness: "Which summary more completely captures important information?" – This assessed the summary’s recall of clinically important details.
Correctness: "Which summary includes less false information?" – This question targeted the summary’s precision and ability to avoid including erroneous or fabricated information.
Conciseness: "Which summary contains less non-important information?" – Here, readers evaluated the summary’s ability to avoid superfluous content.
To standardize the presentation of summaries, the study applied simple post-processing steps that unified capitalization, punctuation, and line breaks across all summaries.
This helped ensure that the reader evaluations focused purely on content quality, without bias from formatting differences.
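As an illustration of the sort of thing they mean, here is a toy normalisation function. The rules are my own guesses at "unified capitalisation, punctuation and line breaks" - not the paper's actual post-processing script.

```python
import re

def normalise_summary(text: str) -> str:
    """Toy post-processing in the spirit described above: unify line breaks,
    whitespace, sentence-initial capitalisation and terminal punctuation.
    Illustrative rules only - not the paper's actual script."""
    text = re.sub(r"\s*\n\s*", " ", text.strip())   # collapse line breaks
    text = re.sub(r"\s{2,}", " ", text)             # collapse repeated spaces
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    sentences = [s[0].upper() + s[1:] for s in sentences]  # capitalise first letter
    out = " ".join(sentences)
    return out if out.endswith((".", "!", "?")) else out + "."
```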
Which, as you can see below, was a pretty awesome setup.
At the risk of creating a never-ending blog post - I'll get to the results...
Results
Performance of Adapted Models: The GPT-4 model with ICL was rated superior or non-inferior to the human-generated summaries in 81% of cases, showing especially high accuracy in identifying important details.
Error Rates and Safety: The adapted LLMs demonstrated lower fabrication rates than experts, with fewer hallucinations (incorrect or imagined facts), and summaries produced by LLMs showed a lower risk of potential harm than human-generated summaries.
Conciseness and Completeness: LLMs produced summaries that were generally more concise while also more complete, pointing to the model’s advantage in synthesizing complex clinical information efficiently.
So what were the limitations?
Limitations
Adaptation Requirements:
While the study focused on adaptation using ICL and QLoRA, these methods require substantial computational resources and are highly sensitive to prompt engineering.
This may limit the broader scalability and practicality of the models in general clinical settings without further computational infrastructure.
Potential Overfitting to Task-Specific Examples: ICL involves inputting task-specific examples, which may lead to the model being highly tuned to the example cases rather than generalizing across a broader array of unencountered clinical scenarios. For real-world use, this limitation implies a need for continuous re-calibration with updated examples as medical knowledge and documentation practices evolve.
Although there are best practices, the skill of prompt engineering remains something of a dark art in my experience - with variable results.
Clinical Applicability and Safety Analysis
Limited Scope in Error Analysis:
The study's safety analysis primarily assessed the LLMs' capacity to avoid hallucinations, but further validation in live clinical settings is necessary. It's not just whether a model hallucinates that matters, but what it hallucinates.
Reader Study Sample Size:
The study’s reader sample size (10 physicians) is relatively small, especially considering the high inter-reader variability that clinical text interpretation often entails. This raises questions about the robustness of the findings and whether similar results would emerge with a larger or more varied sample, including non-physician health professionals who also contribute to clinical documentation.
Dataset and Domain-Specific Constraints
Diversity of Datasets:
While the authors used several datasets, they were all selected from publicly available sources, which may not represent the broader clinical data spectrum encountered in real-world hospital settings. Public datasets often lack some of the proprietary or institution-specific variations in clinical language and structure. This limitation points to the need for real-time validation in diverse health settings.
Context Limitation in Summaries:
The study instructed physicians to disregard certain aspects (e.g., recommendations for further studies), but such recommendations are often crucial in clinical settings.
Since the LLMs were not given prior studies or continuity-of-care documentation, their utility is limited for longitudinal patient documentation, where temporal changes are key.
Ethical and Operational Considerations
Reliance on Proprietary Models:
The study found that GPT-4 (a proprietary model) outperformed open-source models, which poses challenges for hospitals in terms of access, cost, and data governance, especially under stringent regulations for patient data protection.
Proprietary models also limit transparency about training data, creating potential biases that might be unknowingly embedded. In short, it's never going to happen with ChatGPT unless OpenAI release an open model or a trusted environment specific to healthcare purposes.
Custom Likert Scale
There was no mention of where this scale came from or any methodology for how it was developed - seemingly just out of thin air. It's logical and makes sense, but typically a Likert scale should go through internal consistency and reliability testing to ensure it is valid.
Key measures such as Cronbach's alpha and split-half reliability should be used in its validation. Perhaps they did this, but I couldn't find it in the paper.
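For what it's worth, Cronbach's alpha is straightforward to compute once you have a respondents-by-items matrix of ratings. A minimal sketch with made-up numbers (not data from the study):

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: respondents x items matrix of Likert ratings.
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical example: 10 readers rating 3 attributes on a 5-point scale.
ratings = np.array([[4, 5, 4], [3, 4, 3], [5, 5, 4], [4, 4, 4], [2, 3, 3],
                    [4, 5, 5], [3, 3, 2], [5, 4, 5], [4, 4, 3], [3, 4, 4]])
print(f"Cronbach's alpha: {cronbach_alpha(ratings):.2f}")  # >~0.7 is often taken as acceptable
```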
Conclusion
The study provides compelling evidence that adapted LLMs, especially when fine-tuned with clinical data, can perform clinical summarization tasks at a level comparable to or better than human experts in terms of completeness and accuracy. But with such a small sample of expert physicians, and with the best-performing approach being in-context learning on GPT-4, it's not a realistically implementable system yet.
That being said, I thought the way they designed this study was very clever. There are a lot of LLMs out there, and multiple ways of teaching or fine-tuning an LLM.
Which is challenging as a researcher in the area – which one to choose?
But there are also a lot of evaluation metrics for summarisation performance, and the authors have provided a sensible approach to dealing with some of this complexity.
The paper was really helpful for my own journey. It forced me to learn about some of the metrics I had been burying my head in the sand about, and it also sparked a bit of a lightbulb moment for me around the accuracy of medical tools: clinicians aren't 100% accurate, and no other tests we use are either.
But that is a post for another day.
Thanks for reading! :)
Dr Samuel McInerney
BSc (Hons) MBBS PGCert MRCP
Founder of doctors that code. Sam is a Medical Oncology registrar at the Beatson Cancer Centre, and a cancer PhD student at the University of Edinburgh.
Sam's passion is the development of AI tools for use in Cancer medicine.
Sam's focus is on bringing AI technology into patients' and doctors' hands: to understand, educate, evidence, and implement AI tools that will bring much-needed improvements to cancer patient care.