Stepping into the rapidly advancing world of large language models (LLMs), researchers put the AI model GPT-3.5 under the microscope, challenging it with the complexities of real-world medical protocols in radiology.
Doctors, when ordering an imaging study, typically provide a succinct summary of the patient’s symptoms, medical history, and clinical findings. This information guides the radiologist in selecting the appropriate radiologic protocol – a detailed set of instructions to conduct the exam.
🧑💻 A friend of mine is a medical doctor who spends 30% of his time doing administrative work that could already be automated. The clinic has ~100 medical doctors, so the right tools could save 30 doctors’ worth of full-time labor. Assuming an average six-figure salary, that’s GPT-level administrative work worth 30 × $100,000 = $3M per year! For one clinic!
Crucial in MRI exams, the choice of protocol can greatly affect the quality and diagnostic accuracy of the results.
However, determining the right protocol requires a deep knowledge of disease appearances in scans, an understanding of the patient’s clinical situation, and familiarity with the capabilities of the institution’s medical equipment.
🤖 Could an AI model handle such a complex task?
The researchers set out to answer this question by testing GPT-3.5 on 4,800 archived physician orders, challenging the AI model to make sense of the complex medical language used in radiological exams.
In addition to gauging GPT-3.5’s performance, the researchers delved deeper into its decision-making processes. They sought to understand whether the AI model could grasp the intricacies of different pathologies, radiological appearances, and the specialist language of human anatomy and physiology.
They didn’t just assess GPT-3.5 on its own. They compared its capabilities to those of existing state-of-the-art models like BERT. They examined how the AI selected specific words related to different protocols and how its decisions compared to a human radiologist’s. They also probed the errors made by GPT-3.5, highlighting potential safety risks in clinical settings.
Setting the Benchmark
To gauge the effectiveness of GPT-3.5 and make a comparative analysis, they set up a benchmark using several renowned pre-trained models: BERT, BioBERT, and RoBERTa.
These models, previously used in studies on medical imaging protocol assignment, served as a performance yardstick. Leveraging the HuggingFace Transformers 🤗 library, they fine-tuned all three to establish a robust baseline for comparison with GPT-3.5.
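The article doesn’t show the fine-tuning code itself. To make the benchmark idea concrete – classify free-text physician orders into protocols, then score the predictions against labels – here’s a minimal stdlib-only stand-in. The keyword rules and sample orders are invented for illustration; the actual study fine-tuned BERT-family models, which learn these associations from data instead of hand-written rules:

```python
# Toy stand-in for the benchmark setup: classify physician orders into MR
# protocols and score the predictions. Keyword rules and sample data are
# invented for illustration; the study fine-tuned BERT-family models.

KEYWORDS = {
    "MR STROKE": ["stroke", "weakness", "aphasia"],
    "MR SELLA": ["pituitary", "prolactin"],
    "MR BRAIN SEIZURE": ["seizure", "epilepsy"],
}

def predict(order: str) -> str:
    """Return the first protocol whose keywords appear in the order."""
    text = order.lower()
    for protocol, words in KEYWORDS.items():
        if any(w in text for w in words):
            return protocol
    return "MR BRAIN ROUTINE"  # fallback protocol

# Tiny labeled evaluation set (invented examples):
dataset = [
    ("New onset seizure, r/o epileptogenic focus", "MR BRAIN SEIZURE"),
    ("Elevated prolactin, evaluate pituitary", "MR SELLA"),
    ("Acute aphasia and right-sided weakness", "MR STROKE"),
    ("Chronic headache, no focal findings", "MR BRAIN ROUTINE"),
]

accuracy = sum(predict(x) == y for x, y in dataset) / len(dataset)
print(f"accuracy: {accuracy:.2f}")  # all four toy examples match
```

A fine-tuned transformer plays the same role as `predict()` here – mapping order text to one of the 11 protocol labels – just with far more nuance than keyword matching.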
If you ask me, comparing general-purpose GPT-3.5 against a fine-tuned BERT is quite a high bar for OpenAI’s LLM!
Here’s the prompt they used to gauge GPT-3.5’s performance:
"You are being evaluated on how well you can perform radiological protocol assignment for MR imaging. Suppose we have 11 possible MR Imaging protocols: MR NASOPHARYNX OROPHARYNX, MR BRAIN MASS/METS/INFECT, MR STROKE, MR SELLA, MR BRAIN SEIZURE, MR BRAIN DEMYELINATING, MR BRAIN MOYA-MOYA DIAMOX, MR SKULL BASE, MR VASCULAR MALFORMATION/ICH/TRAUMA, MR ORBIT SINUS FACE, MR BRAIN ROUTINE. We will provide you with a physician's entry and request that you identify which imaging protocol, out of the 11 options listed, should be used and explain the reasoning behind your choice."
Here’s the prompt they used to get an explanation out of GPT-3.5:
"We will provide you with a physician's entry and request that you identify which imaging protocol, out of the 11 options listed, should be used and explain the reasoning behind your choice. In addition, can you list the 3 most important words (single words) which had the greatest impact on your decision?"
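The article quotes the prompts verbatim but not the code behind them. Here’s a hedged sketch of how the first prompt could be wired into an OpenAI chat call. The client setup, model name, and the sample physician entry are my assumptions, not details from the study, and the actual API call is commented out so the snippet runs without an API key:

```python
# Sketch of sending the paper's protocol-assignment prompt to GPT-3.5.
# Model name and example physician entry are assumptions for illustration.

PROTOCOLS = [
    "MR NASOPHARYNX OROPHARYNX", "MR BRAIN MASS/METS/INFECT", "MR STROKE",
    "MR SELLA", "MR BRAIN SEIZURE", "MR BRAIN DEMYELINATING",
    "MR BRAIN MOYA-MOYA DIAMOX", "MR SKULL BASE",
    "MR VASCULAR MALFORMATION/ICH/TRAUMA", "MR ORBIT SINUS FACE",
    "MR BRAIN ROUTINE",
]

SYSTEM_PROMPT = (
    "You are being evaluated on how well you can perform radiological "
    f"protocol assignment for MR imaging. Suppose we have {len(PROTOCOLS)} "
    f"possible MR Imaging protocols: {', '.join(PROTOCOLS)}. "
    "We will provide you with a physician's entry and request that you "
    "identify which imaging protocol, out of the 11 options listed, should "
    "be used and explain the reasoning behind your choice."
)

def build_messages(physician_entry: str) -> list[dict]:
    """Assemble the chat messages for one physician order."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": physician_entry},
    ]

# Hypothetical physician entry, invented for illustration:
messages = build_messages("58 y/o with acute left-sided weakness, r/o stroke")

# Uncomment to run against the real API (requires OPENAI_API_KEY):
# from openai import OpenAI
# response = OpenAI().chat.completions.create(
#     model="gpt-3.5-turbo", messages=messages)
# print(response.choices[0].message.content)
```

The explanation prompt would simply replace the last sentence of the system prompt with the paper’s second wording, asking for the three most influential words alongside the choice.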
- GPT-3.5’s performance, with a weighted average F1 score of 0.72, fell behind the fine-tuned language models BERT, BioBERT, and RoBERTa, which scored 0.89, 0.92, and 0.90, respectively.
- Despite the gap, GPT-3.5’s performance is impressive considering it was not fine-tuned for this specific task.
- Word importance analysis revealed that GPT-3.5 selected more context-specific words than BERT, aligning more closely with a radiologist’s choices.
- Comparing individual text analyses, GPT-3.5 showed a broader understanding and a more comprehensive analysis of the complete text, unlike BERT, which focused on select keywords.
- GPT-3.5 was surprisingly well calibrated out of the box: its confidence estimates tracked its actual accuracy, so users can rely on its predictions without extensive recalibration.
- Error analysis categorized mistakes into seven broad categories: incomplete understanding of the protocol (44%), anatomy (22%), incomplete understanding of medical conditions/terminology (14%), misunderstanding of acronyms (5%), arbitrary (5%), age-related (8%), and ambiguous prompt (2%).
- Errors due to incomplete protocol understanding emphasize the need for explicit protocol training.
- Anatomy-related errors underscore the need for more training on anatomical relationships and spatial reasoning.
- Errors due to misunderstandings of medical conditions or terminology highlight the necessity of comprehensive medical data in training.
- Misunderstanding of acronyms suggests improved acronym disambiguation is required in training.
- Arbitrary errors show the need for refining the model’s capacity to stick closely to the given prompt.
- Age-related errors suggest age-based reasoning should be incorporated into training.
- Errors due to ambiguous prompts highlight the need for developing strategies for handling uncertainties.
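A quick note on the headline metric: the weighted-average F1 reported above weights each protocol class’s F1 score by its support, i.e. how many true examples that class has, so common protocols count more than rare ones. A minimal stdlib-only sketch (the per-class scores below are invented for illustration, not taken from the study):

```python
def weighted_f1(per_class):
    """Weighted-average F1: each class's F1 weighted by its support
    (number of true examples for that class).

    per_class: list of (f1_score, support) pairs, one per protocol class.
    """
    total = sum(support for _, support in per_class)
    return sum(f1 * support for f1, support in per_class) / total

# Toy example with three hypothetical protocol classes:
scores = [(0.90, 50), (0.60, 30), (0.80, 20)]
print(round(weighted_f1(scores), 3))  # (45 + 18 + 16) / 100 = 0.79
```

This is the same quantity `sklearn.metrics.f1_score(..., average="weighted")` computes, which is the usual choice when class sizes are imbalanced, as protocol frequencies in real physician orders typically are.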
My Own Take
Reviewing studies such as these makes me incredibly optimistic about the future of AI across various sectors, particularly within the medical field. While the authors’ takeaway is straightforward – a fine-tuned BERT outperforms a general-purpose GPT-3.5 – I don’t find the performance gap overwhelming. It’s quite possible that GPT-4 could already surpass a fine-tuned BERT under the same conditions! And a GPT-4 model fine-tuned on a small set of expert-curated training examples could potentially outperform medical professionals themselves.
We are already witnessing the future unfold. Those who have the capability to engineer prompts, embed, and fine-tune large language models are in a position to create astonishing applications. These applications have the potential to replace countless medical professionals and save us millions, if not billions, of dollars in healthcare expenses.
Feel free to check out our prompting course on the Finxter Academy and read the following article if you’re interested! 👇
🧑💻 Recommended: 30 Creative AutoGPT Use Cases to Make Money Online
While working as a researcher in distributed systems, Dr. Christian Mayer found his love for teaching computer science students.
To help students reach higher levels of Python success, he founded the programming education website Finxter.com that has taught exponential skills to millions of coders worldwide. He’s the author of the best-selling programming books Python One-Liners (NoStarch 2020), The Art of Clean Code (NoStarch 2022), and The Book of Dash (NoStarch 2022). Chris also coauthored the Coffee Break Python series of self-published books. He’s a computer science enthusiast, freelancer, and owner of one of the top 10 largest Python blogs worldwide.
His passions are writing, reading, and coding. But his greatest passion is to serve aspiring coders through Finxter and help them to boost their skills. You can join his free email academy here.