Large language models (LLMs) are the foundation of highly hyped generative AI tools like ChatGPT and Google Bard. These models draw on enormous amounts of data scraped from the internet and other sources to produce eerily natural-seeming responses to everything from poetry prompts to homework help, all through simple, intuitive interfaces that anyone can learn to use.
It hasn’t taken long for healthcare and life sciences organizations to start exploring how to use these capabilities to assist with clinical care, particularly in the realm of diagnostics. LLMs can synthesize vast quantities of structured and unstructured data in novel and unexpected ways, making them potentially ideal AI doctors to supplement the flesh-and-blood workforce.
However, these models aren’t perfect. In fact, research is increasingly showing that they are far from it. A newly published study in JAMA Pediatrics found that ChatGPT misdiagnosed 72 of the 100 pediatric medical cases selected. For another 11, it delivered a diagnosis too broad to be considered correct.
The tool failed to identify certain key relationships between diagnoses and underlying factors, and it named the correct organ system involved in the diagnosis less than half the time.
The study builds on previous evidence showing that LLMs aren’t quite prepared for prime-time diagnostics, especially when asked to take on complex specialties like oncology. For example, a team from Switzerland recently found that ChatGPT’s answers to more than 20% of open-ended questions about radiation oncology were rated “bad” or “very bad” by a team of human clinical reviewers.
And researchers from Mass General, Boston Children’s, and Memorial Sloan Kettering found that LLMs offered one or more treatment suggestions that were not concordant with NCCN guidelines in more than a third of queries (34.3%). The model also produced “hallucinations,” or suggestions that were not part of any recommended treatment, in 12.5% of cases. In addition, the model’s output differed significantly based on how the question was written, further complicating clinical use.
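That last finding, sensitivity to phrasing, is straightforward to probe. The sketch below shows one hedged way to measure it: pose several paraphrases of the same clinical question and report how often pairs of answers agree. The `ask_model` function here is a hypothetical, stubbed stand-in, not a real API; in practice it would wrap an actual model client.

```python
from itertools import combinations

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; replace with a real API client.
    Stubbed with canned answers purely to illustrate the harness."""
    canned = {
        "What is the recommended first-line treatment for stage II disease?": "chemoradiation",
        "For stage II disease, which treatment is recommended first?": "chemoradiation",
        "Which therapy should a stage II patient receive initially?": "surgery",
    }
    return canned[prompt]

def prompt_agreement(paraphrases: list[str]) -> float:
    """Fraction of paraphrase pairs that yield the same answer.

    1.0 means the model is phrasing-invariant for this question;
    lower values indicate the output depends on how the question is asked.
    """
    answers = [ask_model(p) for p in paraphrases]
    pairs = list(combinations(answers, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

paraphrases = [
    "What is the recommended first-line treatment for stage II disease?",
    "For stage II disease, which treatment is recommended first?",
    "Which therapy should a stage II patient receive initially?",
]
print(prompt_agreement(paraphrases))  # 0.333...: only one of three pairs agrees
```

A harness like this, run across many questions, turns an anecdotal complaint about prompt sensitivity into a number that can be tracked as models are retrained.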
This doesn’t mean that LLMs are hopeless and should be abandoned for diagnostic decision support. Instead, it’s a sign that we need to pour more resources into training and tuning large language models to unlock their full potential.
ChatGPT and its publicly available competitors are generalized models that have an acceptable level of competency across many, many areas. To make a difference in healthcare and life sciences, we need to go deep into narrower datasets focused on the insights that matter most.
That means developing a larger number of clean, complete, accurate, and representative datasets that combine medical records and clinical trial information with real-world data (RWD) and real-world evidence (RWE) about clinical care and drug efficacy.
This data must be drawn from multiple sources and represent longitudinal patient journeys while being carefully curated with strong adherence to shared data governance principles.
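Assembling a longitudinal patient journey from multiple sources is, at its core, a merge-and-sort problem. The sketch below illustrates the idea with hypothetical EHR and claims records; every field name and event value here is invented for illustration and would differ in any real data model.

```python
from collections import defaultdict

# Hypothetical records from two sources; field names are illustrative only.
ehr_events = [
    {"patient_id": "p1", "date": "2021-03-02", "event": "diagnosis: stage II NSCLC"},
    {"patient_id": "p1", "date": "2021-04-10", "event": "treatment: chemoradiation"},
]
claims_events = [
    {"patient_id": "p1", "date": "2021-03-15", "event": "claim: CT chest"},
]

def build_journeys(*sources):
    """Merge event lists from multiple sources into per-patient,
    chronologically ordered journeys."""
    journeys = defaultdict(list)
    for source in sources:
        for rec in source:
            journeys[rec["patient_id"]].append((rec["date"], rec["event"]))
    for events in journeys.values():
        events.sort()  # ISO-8601 dates sort correctly as strings
    return dict(journeys)

journeys = build_journeys(ehr_events, claims_events)
print(journeys["p1"])  # diagnosis, then imaging claim, then treatment
```

Real pipelines add the hard parts this sketch omits: patient identity matching across sources, de-duplication, and governance controls on who may link which records.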
Model trainers will also need to be cautious about introducing unintentional biases into the data, which can happen through poor-quality medical literature, outdated care protocols, or studies that are not inclusive and representative of all communities. Training should also involve a variety of human clinical annotators with broad experience treating different patient populations to avoid creating an undesirable loop of confirmation bias.
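One concrete check on representativeness is to compare each group's share of the training set against its share of a reference population and flag large gaps. The sketch below is a minimal version of that audit; the group labels, shares, and 5% tolerance are all assumed values chosen for illustration.

```python
from collections import Counter

def representation_gaps(records, reference_shares, field="group", tolerance=0.05):
    """Compare each group's share of a dataset with its share of a reference
    population; return the groups whose share is off by more than `tolerance`."""
    counts = Counter(rec[field] for rec in records)
    total = sum(counts.values())
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total
        if abs(observed - expected) > tolerance:
            gaps[group] = round(observed - expected, 3)
    return gaps

# Illustrative data: hypothetical group shares in a reference population,
# and a 100-record training sample that over-represents group A.
reference = {"A": 0.60, "B": 0.30, "C": 0.10}
records = [{"group": "A"}] * 80 + [{"group": "B"}] * 18 + [{"group": "C"}] * 2

print(representation_gaps(records, reference))
# {'A': 0.2, 'B': -0.12, 'C': -0.08}
```

An audit like this catches only demographic skew; biases baked into labels or outdated protocols still require the human clinical review the paragraph above describes.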
Creating oncology-specific LLMs that are up to the task of real-world clinical diagnosis will be a long-term project for technology developers, data companies, and clinical partners. Achieving the best results will start with zeroing in on meaningfully curated real-world data that can support bias-free training at scale.
With high-quality fuel for the fire, LLMs could soon become a valuable addition to the clinical and life sciences toolkits and support improved outcomes for patients with cancer and other conditions.