Med-PaLM 2: Will This Google Research Help You Increase Your Healthspan?

Executive Summary

  1. Med-PaLM 2 is Google’s medical large language model (LLM), designed to accurately and safely answer medical questions.
  2. Med-PaLM 2 was the first LLM to perform at an “expert” test-taker level on the MedQA dataset of US Medical Licensing Examination (USMLE)-style questions, reaching 85%+ accuracy.
  3. It was also the first AI system to reach a passing score on the MedMCQA dataset comprising Indian AIIMS and NEET medical examination questions, scoring 72.3%.
  4. The LLM has been assessed against multiple criteria, including scientific consensus, medical reasoning, knowledge recall, bias, and likelihood of possible harm. These evaluations were performed by clinicians and non-clinicians from various backgrounds and countries.
  5. Google allows limited access to Med-PaLM 2 for testing and feedback for a select group of Google Cloud customers. The focus of these evaluations will be on safety, equity, and bias.
  6. The creation of Med-PaLM 2 is part of Google’s ongoing research in generative AI technologies. Such models are designed to identify complex relationships in large training data sets and create new data from what they learn.
  7. Other healthcare-related AI projects that Google has been involved in include AI-assisted diagnosis technology for cervical and prostate cancer, an AI algorithm for improving the care of head and neck cancers, and AI to improve breast cancer screening.

Before we dive into Med-PaLM 2, let’s learn about the underlying technology, PaLM 2. 👇

What Is PaLM 2?

Google’s PaLM 2 is a next-generation language model with enhanced multilingual understanding, reasoning, and coding capabilities.

The Advent of PaLM 2

We recently published an article on how Google developed the Transformer architecture that kicked off the recent leaps in LLM technology, including ChatGPT.

🧑‍💻 “We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and …”

Figure: Screenshot of Google’s groundbreaking “Attention Is All You Need” paper

Google has demonstrated that scaling up neural networks can produce remarkable and innovative capabilities. But this journey has taught them that bigger isn’t always better.

Creative research and strategic advances in model architecture and training are key to constructing superior models. PaLM 2, the successor of the groundbreaking PaLM, is a testament to this ethos. This state-of-the-art language model offers enhanced multilingual, reasoning, and coding competencies. 👇👇👇

Multilingual, Reasoning and Coding Capabilities

PaLM 2 is extensively trained on multilingual text across over 100 languages, significantly improving its ability to understand, generate, and translate nuanced text. Its advanced language proficiency can solve complex linguistic challenges such as idioms, poems, and riddles.

The model’s reasoning ability is equally impressive, driven by a diverse dataset that includes scientific papers and web pages containing mathematical expressions. This gives PaLM 2 enhanced capabilities in logic, common sense reasoning, and mathematics.

In the coding realm, PaLM 2 has been pre-trained on a wealth of publicly available source code datasets. This preparation makes the model proficient in popular programming languages like Python and JavaScript and capable of generating specialized code in languages such as Prolog, Fortran, and Verilog.

A Versatile Family of Models

In addition to its advanced capabilities, PaLM 2 is faster and more efficient than its predecessors and comes in a variety of sizes: Gecko, Otter, Bison, and Unicorn.

The lightweight Gecko model can function on mobile devices and is ideal for interactive applications on-device, even when offline.

The variety in sizes makes it easy to fine-tune PaLM 2 for many use cases, enabling it to aid more people.

Powering Over 25 Google Products and Features

Over 25 new products and features announced at Google’s I/O event are powered by PaLM 2. This means PaLM 2’s advanced AI capabilities directly impact consumers, developers, and enterprises worldwide.

🧑‍⚕️ One of its exciting applications is Med-PaLM 2, trained by Google’s health research teams with medical knowledge. Med-PaLM 2 can answer questions and summarize insights from a variety of complex medical texts. It has achieved state-of-the-art results in medical competency and was the first large language model to perform at the “expert” level on U.S. Medical Licensing Exam-style questions.

But before we dive into Med-PaLM 2, let’s quickly examine this question that may be on your mind: 👇

What is the Difference Between PaLM 2 and Google Bard?

Google improved Bard with PaLM 2, its large language model, thereby upgrading Bard’s math, logic, and reasoning skills. CEO Sundar Pichai said PaLM 2 will also power 25 of the company’s products and features. So you can think of Bard as a meta-tool incorporating various Google capabilities such as PaLM 2.

I suspect that Med-PaLM 2 will also be integrated into Google Bard to extend its capabilities over time.

Med-PaLM 2 – Quick Overview

The following short video gives you a glimpse into the Med-PaLM 2 technology:

Google Research continues to push the boundaries of Artificial Intelligence (AI) with Med-PaLM 2 – a cutting-edge large language model (LLM) that specializes in medical queries.

A significant step up from its predecessor, Med-PaLM, this model reaches 86.5% accuracy on USMLE-style questions. That is an improvement of roughly 19 percentage points, and it provides further evidence of AI’s growing significance in healthcare and medicine:

Compare this to the previous version:

Harnessing the knowledge and capabilities of Google’s advanced language models, Med-PaLM 2 has been fine-tuned and calibrated to provide high-quality responses to various medical questions.

Med-PaLM 2 achieves a passing score on US Medical Licensing Examination (USMLE)-style questions and offers precise, comprehensible answers to various consumer health inquiries.

However, Med-PaLM 2’s capabilities extend beyond language understanding.

It’s designed to integrate with an array of data sources, from electronic health records and sensor data to genomics and imaging.

This multi-modal approach seeks to simulate the diversity of medical practice, aligning with the future vision of medical AI systems – providing comprehensive, personalized healthcare for everyone.

As the next step, Google plans to offer Med-PaLM 2 to a select group of Google Cloud customers for limited testing. Through this process, Google hopes to gather invaluable feedback and gain insights on how to make Med-PaLM 2 more practical and valuable in real-world settings.

Despite the significant strides achieved, the researchers at Google are fully aware that the journey toward the broad adoption of such AI systems is still underway. The team is committed to collaborating with the global medical community to further refine and enhance this technology, with a singular aim – to revolutionize healthcare delivery worldwide.

I’m super excited about the following vision:

In the near future, you’ll have access to top-notch medicine that can prevent illnesses years before they even occur, not just when you are already sick.

As the costs of AI training continue to decrease rapidly, it is likely that every individual with a smartphone will have access to an equivalent of a team of the top 100 doctors worldwide constantly monitoring and improving their health status around the clock.

The future is bright! 🌞

A Few Words on Med-PaLM Version 1

Med-PaLM 2 is great, but let’s quickly look at the first version, which had already advanced medical question answering through a blend of instruction prompt tuning and model scaling.

The initial paper on Med-PaLM v1 was published in Nature, one of the world’s top research journals: 👇

Image source: Nature

The gist is this: the performance of language models like PaLM dramatically improves when the number of parameters increases, from 8 billion (8B) to 540 billion (540B) parameters. To put it into perspective, a larger model can enhance the accuracy by more than 30% when answering medical questions, a massive leap from barely surpassing random performance.

Another important aspect is instruction fine-tuning.

Google’s Flan-PaLM models, incorporating this approach, outshone regular PaLM models across all size variations on multiple-choice datasets. These results debunk the theory that the 540B model’s impressive performance solely hinges on memorizing the training corpus.

Let that sink in: the model doesn’t just memorize the training data but shows signs of intelligence that I’ll loosely define as “problem-solving capability through pattern recognition”.


Comparing the new models to previous attempts like BioGPT, PubMedGPT, and Galactica, Google’s models come out on top without any dataset-specific fine-tuning. In essence, as models scale up, their capabilities in recall, reading comprehension, and reasoning within the medical context see considerable improvements.

However, let’s not get carried away. While these are massive strides, scaling alone doesn’t cut it. Despite being powerful, LLMs like Flan-PaLM can still generate inappropriate responses for the safety-sensitive medical field. That’s where instruction prompt tuning comes in.

It’s proving to be a game-changer in enhancing factors such as accuracy, factuality, consistency, safety, harm, and bias. This fine-tuning is inching these models closer to matching clinical experts and making them apt for real-world clinical applications.

All in all, Med-PaLM is a giant leap forward in AI-driven medical query resolution. It’s not perfect yet, but it certainly indicates a promising direction in the marriage of AI and healthcare.

Med-PaLM vs Med-PaLM 2

Image source: Research Paper

👉 Enter Med-PaLM 2, the LLM that’s even better than the first version, Med-PaLM.

The first version did better than earlier attempts like BioGPT, PubMedGPT, and Galactica. It was more accurate in giving answers to medical questions.

But Med-PaLM 2 is even more impressive. It was tested on medical questions from a benchmark called MultiMedQA. The results were great, with the model getting 86.5% of the answers right on the USMLE-style MedQA questions.

It’s not just about the numbers, though. When the researchers compared answers from Med-PaLM 2 and real doctors on 1,066 medical questions, the answers from the model were often chosen as better. This suggests that the model’s answers match up well with what doctors expect.

Med-PaLM 2 is a big step forward in using AI to answer medical questions. It shows that AI can do a good job in this field already, sometimes even better than human doctors.

State-of-the-art LLMs are already better than human doctors in many cases!

Med-PaLM 2 Technical Methods and Ensemble Refinement (ER)

The Med-PaLM 2 system was evaluated using several medical question-answering datasets. These datasets comprised multiple-choice and long-form questions from sources like MultiMedQA, MedQA, and more.

Some were general knowledge-based, and others focused on specific topics like clinical knowledge or medical genetics.

The team also introduced two new sets of questions, the “Adversarial” datasets, that were designed to challenge the system and check for potential harmful or biased outputs.

  • The first set covers issues such as health equity, drug use, mental health, COVID-19, obesity, suicide, and medical misinformation.
  • The second set targets areas of healthcare access, quality, and social/environmental factors.

Med-PaLM 2 was trained using a process called ‘instruction finetuning’, which optimizes it to perform across multiple datasets such as the already mentioned MultiMedQA and MedQA.

The model was evaluated on multiple-choice benchmarks using various strategies like “Few-shot prompting”, “Chain-of-thought”, “Self-consistency”, and a newly developed strategy called “Ensemble Refinement”.

Few-shot prompting involves providing the AI with example inputs and outputs before the final input, while chain-of-thought augments each few-shot example with a step-by-step explanation towards the answer. Self-consistency, on the other hand, is about sampling multiple explanations and answers, and the final answer is decided by a majority vote.
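To make these three strategies more concrete, here is a minimal Python sketch. It is not Google’s code: the example question, the prompt layout, and the helper names `build_prompt` and `self_consistency` are assumptions made purely for illustration. It shows how a few-shot prompt can be assembled, how chain-of-thought adds an explanation to each example, and how self-consistency picks the most frequent answer among several samples.

```python
from collections import Counter

# Hypothetical few-shot example; a real evaluation would use curated medical questions.
FEW_SHOT_EXAMPLES = [
    {
        "question": "Which vitamin deficiency causes scurvy? (A) Vitamin A (B) Vitamin C",
        "explanation": "Scurvy results from impaired collagen synthesis, which requires vitamin C.",
        "answer": "B",
    },
]


def build_prompt(question, examples, chain_of_thought=True):
    """Assemble a few-shot prompt: worked examples first, then the new question."""
    parts = []
    for ex in examples:
        parts.append(f"Question: {ex['question']}")
        if chain_of_thought:
            # Chain-of-thought: each example also shows the reasoning steps.
            parts.append(f"Explanation: {ex['explanation']}")
        parts.append(f"Answer: {ex['answer']}")
        parts.append("")  # blank line between examples
    parts.append(f"Question: {question}")
    # With chain-of-thought, the model is asked to explain before answering.
    parts.append("Explanation:" if chain_of_thought else "Answer:")
    return "\n".join(parts)


def self_consistency(sampled_answers):
    """Self-consistency: return the most frequent answer among the samples."""
    return Counter(sampled_answers).most_common(1)[0][0]


if __name__ == "__main__":
    print(build_prompt("Which vitamin deficiency causes rickets? (A) Vitamin D (B) Vitamin K",
                       FEW_SHOT_EXAMPLES))
    # Pretend the model was sampled five times and returned these answers.
    print(self_consistency(["A", "A", "B", "A", "B"]))
```

In a real setup, the sampled answers would come from repeated calls to the model with a non-zero sampling temperature, so that the reasoning paths differ between samples.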

See this graphic illustrating Ensemble Refinement:

💡 Ensemble Refinement (ER) is a technique used to enhance the quality of the answers generated by an LLM. It builds on other techniques such as chain-of-thought prompting and self-refine, in which the model conditions on its own initial responses (generations) before providing a final answer.

The ER method has two main stages.

In the first stage, the model receives a question and a few examples of how to answer similar questions (a chain-of-thought prompt). Using this information, the model generates multiple potential responses randomly. Each potential response includes an explanation and an answer to the question.

In the second stage, the model revisits the question, the prompt, and its initial responses. Based on this information, the model produces a refined explanation and answer. This stage is repeated several times to enhance performance. The final answer is then selected from all the refined answers through a plurality vote, meaning the most frequent answer wins.

Unlike some other strategies, ER can work on a broad range of questions, not just multiple-choice ones. It can also be used to produce more accurate long-form responses by refining multiple potential answers. However, because the method requires repeated sampling from the model, which can be resource-intensive, in this study it was only used for multiple-choice questions. For the first stage, they did 11 samplings, and for the second stage, they did 33 samplings.
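The two stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `generate` is a hypothetical stand-in for a stochastic LLM call that returns an explanation and an answer, the prompt wording is invented, and the toy model simply guesses at random. Only the sample counts (11 and 33) and the overall two-stage structure come from the description above.

```python
import random
from collections import Counter


def ensemble_refinement(question, generate, n_first=11, n_second=33):
    """Two-stage Ensemble Refinement sketch for a multiple-choice question.

    `generate(prompt)` is a hypothetical model call returning (explanation, answer).
    """
    # Stage 1: sample several independent explanation/answer pairs.
    first_prompt = f"Question: {question}\nExplanation:"
    first = [generate(first_prompt) for _ in range(n_first)]

    # Stage 2: show the model its own drafts and ask for a refined answer,
    # repeating the refinement several times.
    drafts = "\n".join(
        f"Draft {i + 1}: {explanation} Answer: {answer}"
        for i, (explanation, answer) in enumerate(first)
    )
    refine_prompt = f"Question: {question}\n{drafts}\nRefined explanation:"
    refined_answers = [generate(refine_prompt)[1] for _ in range(n_second)]

    # Plurality vote over the refined answers.
    return Counter(refined_answers).most_common(1)[0][0]


def make_toy_model(seed=0):
    """Toy stand-in for an LLM: answers "B" most of the time, otherwise "A"."""
    rng = random.Random(seed)

    def generate(prompt):
        answer = "B" if rng.random() < 0.7 else "A"
        return (f"Toy reasoning that leads to {answer}.", answer)

    return generate


if __name__ == "__main__":
    print(ensemble_refinement("Toy question? (A) ... (B) ...", make_toy_model()))
```

Note the cost: with these settings a single question needs 11 + 33 = 44 model calls, which illustrates why the method is resource-intensive.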

Overall, I really enjoyed the newly-introduced techniques that Google researchers proved to be very effective in fine-tuning their internal PaLM 2 LLM for medical problems. In fact, it seems like fine-tuning itself is the secret sauce for medical performance — much like putting an intelligent young person through medical school.

Let’s close this deep dive into the paper with the following presentation from “The Check Up”, Google’s conference for health research:

My Optimistic View

I’m super excited about the prospect of advanced-level medical AI.

Step into the not-so-distant future, where your health is in your hands like never before. With the rapid advancements in AI, top-notch medicine becomes proactive, preventing illnesses years before they strike, rather than just reacting when you’re already sick.

I have listened to myriads of podcasts by medical experts, and they all argue that an ounce of prevention is worth more than a kilo of curation. (I’m sure I messed this up. 😅)

Thanks to the plummeting costs of AI training, having access to a team of the world’s top 100 doctors monitoring and improving your health around the clock is no longer a far-off dream. All you need is a smartphone, and this remarkable AI companion will be by your side, understanding your unique needs and keeping you in the pink of health.

Imagine a world where medical care is no longer a privilege, but an affordable reality for everyone. It’s not just about extending lifespans; it’s about enhancing the quality of life, making people healthier, happier, and more productive.
