COTA and Google Partner to Use Natural Language Processing to Harness Unstructured Data

The electronic health record (EHR) has revolutionized the way healthcare providers capture patient data, but it hasn’t eliminated all the challenges of the pen-and-paper era of clinical care.

While the vast majority of critical data is now created and stored digitally, much of the information is still generated in an unstructured format. Free-text clinical notes and PDF documents remain largely invisible to algorithms that can mine structured data fields for key insights into patient care. As a result, clinicians and researchers may be unable to generate a complete picture of the patient journey and could miss out on opportunities to advance the standard of care.

Finding a way to consistently and accurately extract data elements from these unstructured sources is one of the top goals for technologists, including a newly formed team of machine learning experts at COTA, Google, John Snow Labs, and Quantiphi.

Together, we are using breakthroughs in machine learning and natural language processing (NLP) to extract unstructured data elements from EHRs to fuel innovation in oncology research and treatment.

The challenges of unstructured oncology data

Oncology care can be extraordinarily complex and may generate huge volumes of data for each patient, far too much for a human to read and interpret on their own. We are already highly skilled in extracting the structured elements from patient records, but there is so much more we can learn from the copious information clinicians are capturing about their patients.

For example, next-generation genomic sequencing is becoming particularly important for personalizing cancer care and understanding the relationship between novel biomarkers and outcomes. But the reports providers receive from the genetics labs often arrive as PDF files that are essentially scanned images of the page. Traditional data extraction tools can’t “read” the text in these images, so the data may go unused.

Similarly, clinicians capture a great deal of information in their free-text clinical notes, but these notes are usually full of abbreviations, irregularities, and even inconsistencies. Without sophisticated AI and machine learning algorithms to comb through the data and identify elements of interest, it isn’t practical to integrate that information into large-scale datasets to support clinical trials and other research.
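To make the abbreviation problem concrete, here is a minimal Python sketch of a dictionary-based expansion step. It is an illustration only, not a description of the models our team is building: the `ABBREVIATIONS` map and the `expand_abbreviations` function are hypothetical names, and the handful of shorthand terms shown are common examples rather than a clinical lexicon.

```python
import re

# Hypothetical, illustrative shorthand map. A real system would need a
# curated, context-aware lexicon rather than a fixed dictionary.
ABBREVIATIONS = {
    "pt": "patient",
    "hx": "history",
    "dx": "diagnosis",
    "tx": "treatment",
    "mets": "metastases",
}


def expand_abbreviations(text):
    """Replace known shorthand tokens with their long form, word by word."""

    def replace(match):
        word = match.group(0)
        # Unknown tokens are returned unchanged.
        return ABBREVIATIONS.get(word.lower(), word)

    return re.sub(r"\b[A-Za-z]+\b", replace, text)


print(expand_abbreviations("Pt with hx of breast ca, dx 2019, no mets."))
```

In this sketch the token "ca" is deliberately left out of the map, because its meaning depends on context. Ambiguity of that kind is one reason simple lookups fall short and learned models are attractive.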

Partnering to leverage emerging NLP technology

Natural language processing is a game-changer for unstructured data. NLP algorithms use machine learning techniques to “understand” the contents of free-text notes and other unstructured data assets, and to extract elements of interest in a more standardized manner. The newly structured data can then be integrated with other assets to further our understanding of important clinical concepts and patient outcomes.

After a competitive selection process, COTA chose to partner with Google, John Snow Labs, and Quantiphi to build a series of new NLP models tailored to the nuances of unstructured oncology data. By training these algorithms specifically on oncology information, we are creating targeted capabilities for a unique segment of the clinical environment.

Google’s artificial intelligence and NLP capabilities offer a strong base model and scalable infrastructure for COTA’s current and future projects. In collaboration with data scientists at all three partner institutions, COTA will be able to improve model accuracy over time and tackle ever more advanced elements that may be buried in unstructured notes.

Envisioning the future of NLP-aided oncology research

With enhanced access to unstructured data elements, clinical researchers and life science partners can generate a much more complete understanding of what is happening in the community care setting and how a patient’s unique clinical history may impact their response to treatment.

For example, NLP tools could automatically identify a patient’s family history and personal history of cancer, adverse events due to systemic therapies, and the patient’s response to treatment, all of which are typically documented in a free-text clinical note.

The algorithm would identify the relevant sentences or lines in the unstructured source material, tag the elements it extracts along with its interpretation of them, and allow for expert verification before any information is integrated into other datasets.
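The workflow described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not our production approach: keyword patterns stand in for a trained NLP model, and the names `PATTERNS`, `Candidate`, `extract_candidates`, and `verified_only` are hypothetical. The point it shows is that each extraction keeps its source sentence and stays unverified until an expert reviews it.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword patterns standing in for a trained NLP model.
PATTERNS = {
    "family_history": re.compile(
        r"\b(mother|father|sister|brother|aunt|uncle)\b.*\bcancer\b",
        re.IGNORECASE,
    ),
    "adverse_event": re.compile(
        r"\b(neuropathy|nausea|neutropenia|fatigue)\b", re.IGNORECASE
    ),
    "treatment_response": re.compile(
        r"\b(partial response|complete response|stable disease|progression)\b",
        re.IGNORECASE,
    ),
}


@dataclass
class Candidate:
    element: str            # which data element was tagged
    sentence: str           # source sentence, kept for reviewer context
    matched_text: str       # the span that triggered the tag
    verified: bool = False  # set to True only after expert review


def split_sentences(note):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", note) if s.strip()]


def extract_candidates(note):
    """Tag each sentence with any matching element, leaving it unverified."""
    candidates = []
    for sentence in split_sentences(note):
        for element, pattern in PATTERNS.items():
            match = pattern.search(sentence)
            if match:
                candidates.append(Candidate(element, sentence, match.group(0)))
    return candidates


def verified_only(candidates):
    """Only expert-verified candidates move on to downstream datasets."""
    return [c for c in candidates if c.verified]


note = (
    "Mother had breast cancer at age 52. "
    "Patient reports grade 2 neuropathy after cycle 3. "
    "Imaging shows partial response. "
    "Follow up in six weeks."
)
candidates = extract_candidates(note)
for c in candidates:
    print(c.element, "|", c.matched_text, "|", c.sentence)
```

Nothing passes through `verified_only` until a reviewer sets `verified` to `True` on a candidate, which mirrors the expert check before integration.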

As we continue to build and refine our oncology-specific NLP models, this vision is quickly becoming a reality. With ever more unstructured data to augment our existing data assets, we are looking forward to further enriching the research environment with valuable insights into the next generation of oncology treatments and evidence-based cancer care.