So you want to train an oncology LLM: Start by getting close to the data

Large language models (LLMs) are taking the world by storm with their ability to synthesize and present information. Behind the seemingly simple interfaces of popular models, like ChatGPT and Google Bard, lies a dizzyingly complex artificial neural network trained on billions of data points scraped from all across the Internet, allowing users to ask questions, make art, and easily cheat on their history homework.

But before we start getting too excited about integrating LLMs into critical industry functions – especially in healthcare and pharma, where people’s lives are on the line – we need to tackle some fundamental issues related to how these models are taught and how we think critically about their results.

The training data we use matters. All analytics models are prone to biases and inaccuracies if trained on poor-quality or imbalanced data, and the intricacies of LLMs may make these algorithms even more likely to propagate unwanted artifacts over time.

To create targeted LLMs that are suitable for diving into health-related areas like oncology, we must choose our data carefully and get as close to the original source as possible to avoid “playing telephone” with inaccurate information that could affect drug development processes or cancer outcomes.

How do LLMs get their data? And how can it go wrong?

LLMs can be trained on many different types of datasets, but let’s take a look at the current market leader, ChatGPT. The model, developed and stewarded by OpenAI (with Microsoft as a heavy investor), was largely trained using Common Crawl: a repository of more than 240 billion webpages spanning 16 years. The latest versions of the model also incorporate feedback from human users who respond to its results, through a method called reinforcement learning from human feedback (RLHF).

The enormous breadth of data enables ChatGPT to return seemingly intelligent and informed results on almost any topic imaginable.

But there’s a problem hidden in plain sight. Shockingly (not shockingly), it turns out that not everything on the Internet is right – nor are humans interacting with chatbots universally well-intentioned. Purposeful and inadvertent misinformation are both rampant on the web, and humans like to test the limits of these models with their interactions and feedback, leading to some suspect results.

For example, a recent study by NewsGuard found that ChatGPT-4 convincingly and persuasively advanced 100 out of 100 known conspiracy theories and false narratives when responding to a series of leading prompts. The FTC is even getting involved to learn more about the spread of misinformation and the genesis of “hallucinations,” or made-up responses when the AI can’t find an appropriate answer to a question.

The point is that data volume is essential for training LLMs, but data quality, along with limiting the models’ “urge to please” users, could be even more important for making sure these tools are accurate, trustworthy, and truly useful for high-stakes applications.

What does this mean for oncology-specific LLMs?

Base models like ChatGPT are designed to be broad, but they can be further optimized with specific training datasets to tackle narrower areas of study, like cancer care or drug development. The trick will be to avoid creating an environment vulnerable to the same issues described above.

What we shouldn’t do, for example, is train an oncology-specific LLM exclusively on all the published medical literature about cancer for the past fifty years.

For one thing, cancer care has changed dramatically even in the last five to seven years, so much of that information will be outdated and irrelevant to the current standards of care.

For another, medical literature is prone to its own problems with reliability and accuracy. From incorrectly cited references to poor research design to huge numbers of one-off studies that have not been replicated or confirmed, relying purely on published literature to train LLMs could set up some major headaches for researchers and clinicians.
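To make those two concerns concrete, here is a minimal Python sketch of a screening step that drops literature that is too old or has not been replicated. Everything in it is an assumption for illustration: the `Article` record, its `replicated` flag, the `select_for_training` helper, and the seven-year window (borrowed from the "five to seven years" figure above) are hypothetical, and the sample entries are made up.

```python
from dataclasses import dataclass


@dataclass
class Article:
    """Hypothetical literature record; field names are illustrative only."""
    title: str
    year: int
    replicated: bool  # assumed flag: has the finding been independently confirmed?


def select_for_training(articles, current_year, max_age_years=7, require_replication=True):
    """Keep articles that are recent enough and, optionally, replicated."""
    cutoff = current_year - max_age_years
    kept = []
    for article in articles:
        if article.year < cutoff:
            continue  # likely reflects an older standard of care
        if require_replication and not article.replicated:
            continue  # one-off finding that has not been confirmed
        kept.append(article)
    return kept


# Made-up placeholder entries, not real publications.
corpus = [
    Article("Older chemotherapy regimen trial", 1998, True),
    Article("Unreplicated single-site study", 2021, False),
    Article("Confirmed immunotherapy outcomes study", 2020, True),
]

selected = select_for_training(corpus, current_year=2023)
print([a.title for a in selected])
```

A filter like this only narrows the literature. It does not fix the underlying reliability problem, which is why the raw data matters.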

So what is an algorithm developer to do? Engineers will need to supplement the use of medical literature by allowing models to dig right down into the raw numbers themselves.

The value of getting close to the source with real-world data

High-quality medical research is based on real-world data (RWD): the multifaceted datasets that can include electronic health record (EHR) data, clinical and pharmacy claims, patient-reported information, device data, and more.

When curated appropriately, RWD can offer a rich, comprehensive, and longitudinal view of a cancer patient’s journey across the care continuum that will be essential for ensuring that LLMs truly understand the relationships between diseases, interventions, and outcomes.
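As a rough illustration of what a "longitudinal view" might look like in practice, the sketch below merges events from several RWD sources into one date-ordered timeline per patient. The `Event` record, its fields, and the `build_timelines` helper are hypothetical names chosen for this example, and the records are invented. Real curation would also involve de-identification, record linkage, and coding normalization, none of which is shown here.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class Event:
    """Hypothetical RWD event; the fields are assumptions for illustration."""
    patient_id: str
    when: date
    source: str       # e.g. "ehr", "claims", "patient_reported"
    description: str


def build_timelines(*sources):
    """Merge events from several sources into one date-ordered list per patient."""
    timelines = defaultdict(list)
    for events in sources:
        for event in events:
            timelines[event.patient_id].append(event)
    for events in timelines.values():
        events.sort(key=lambda e: e.when)
    return dict(timelines)


# Made-up example records, not real patient data.
ehr = [Event("p1", date(2022, 3, 1), "ehr", "diagnosis recorded")]
claims = [Event("p1", date(2022, 4, 15), "claims", "infusion claim")]
reported = [Event("p1", date(2022, 4, 20), "patient_reported", "fatigue survey")]

timelines = build_timelines(reported, claims, ehr)
for event in timelines["p1"]:
    print(event.when, event.source, event.description)
```

Ordering every source on a single timeline is what lets a model see a diagnosis, a treatment, and a patient-reported outcome as parts of one journey and not as three unrelated records.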

It’s the difference between conducting a literature review and doing original research: by removing any potentially inaccurate filtering or conclusions that may occur when humans interpret the numbers in a publication, an LLM trained with RWD can give users more direct control over relevant and applicable data points that are closer to the patient – and analyze these data points in an “apples-to-apples” manner.

Integrating RWD into the LLM training process in this way will help future oncology-focused models dodge the challenges of broader algorithms and ensure that new models are built on a firm foundation of useful and trustworthy data that can support improved clinical care, faster drug development, and better outcomes for cancer patients.