What Makes a Fit-for-Purpose Database for Clinical Oncology Research?

To ask and answer the right questions, researchers need databases that are suited to the specific task at hand. What makes a database fit for purpose? How can researchers choose the best dataset for their needs?

Real-world data (RWD) is generated in various forms, from registries to wearable devices. As we learned in the first blog of this series, some of this data is highly valuable for supporting clinical oncology research. By supplementing randomized clinical trial (RCT) data with real-world patient information drawn from electronic health records (EHRs), claims, registries, and other sources, we can uncover critical insights into the impact of new therapies or care protocols.

However, with so many different types of data available to us, we need to be sure we are using the right data to ask and answer the right questions. We need fit-for-purpose (FFP) databases to support high-quality research that stands up to rigorous scientific scrutiny and supports replicability in future research and clinical settings.

What is a fit-for-purpose database?

No single dataset can answer all research questions. Instead, investigators must choose carefully among the many available data sources to develop the dataset that is most applicable to the research question and the context of the decision (e.g., a regulatory submission vs. market research).

This means selecting data that is complete, timely, and relevant to a specific, well-defined research question with clearly outlined exposures and endpoints.

For example, a researcher investigating the effectiveness of treatment patterns in patients with HER2-negative (HER2-) metastatic breast cancer will first define the outcomes under consideration, such as overall survival and the cost of care since the metastatic diagnosis.

Clearly, a dataset that lacks information on survival or resource utilization will not be fit for the purposes of this research project.

In contrast, a fit-for-purpose database will make it possible to identify the relevant elements, including:

  • Metastatic breast cancer patients and the date of metastatic cancer diagnosis
  • HER2- biomarker status as determined via biomarker testing
  • Treatments administered or received by the patient, as well as the timing and duration of the treatment
  • Healthcare utilization, including outpatient and inpatient visits, medications, surgeries, scans, and other services
  • Mortality data, captured as completely as possible
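
As a rough illustration, the checklist above can be expressed as a simple coverage check against the fields a candidate data source exposes. This is a minimal sketch in Python; the element names and column names (such as `metastatic_dx_date` and `her2_status`) are hypothetical and would need to be mapped to the actual schema of the source being evaluated.

```python
# Hypothetical mapping from each required study element to the columns
# that would need to exist in a candidate data source.
REQUIRED_ELEMENTS = {
    "metastatic diagnosis": ["diagnosis_code", "metastatic_dx_date"],
    "HER2 status": ["her2_status"],
    "treatments": ["drug_name", "treatment_start", "treatment_end"],
    "utilization": ["visit_type", "service_date"],
    "mortality": ["death_date"],
}


def missing_elements(available_columns):
    """Return {element: [missing columns]} for elements not fully covered."""
    available = set(available_columns)
    gaps = {}
    for element, columns in REQUIRED_ELEMENTS.items():
        missing = [c for c in columns if c not in available]
        if missing:
            gaps[element] = missing
    return gaps


# Example: an invented claims-style source with no biomarker or death data.
claims_columns = [
    "diagnosis_code", "metastatic_dx_date", "drug_name",
    "treatment_start", "treatment_end", "visit_type", "service_date",
]
print(missing_elements(claims_columns))
```

In this made-up example, the check flags HER2 status and mortality as gaps, which suggests the source would have to be linked to another one before it could support the study.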


To develop this fit-for-purpose dataset, researchers may need to draw from multiple data sources and aggregate them into a single database. This can be challenging, especially when there is so much variation in the quality and completeness of these real-world data assets.

Features of a fit-for-purpose database

Starting with a clear context and a relevant, answerable research question is the foundation for a successful research initiative. With a well-defined question in hand, researchers can then identify potential data sources to support their investigations. While doing so, they should consider the following criteria:

  • Relevance: Does the data include key variables for the right patient population? Does the source cover the correct time frame?  
  • Reliability: Is the data accurate and complete? Can the provenance of the data be traced back to the source? Can patient linking between data sources be conducted consistently?
  • Integrity: Can this data be appropriately anonymized, transferred, and aggregated without a loss of quality and integrity?
  • Bias: Does the dataset include a diverse and representative sample of individuals? Can any important confounding factors be identified and addressed?      


Because real-world data is created in an uncontrolled environment, some concessions or trade-offs will likely be necessary. Researchers can still use a data source with known gaps, but they must carefully document the limitations of the dataset and the expected impact of these limitations on their results.

For example, the researcher exploring HER2- metastatic breast cancer may have multiple research objectives, with overall survival as the top priority and healthcare resource utilization as a secondary objective. In this case, they may opt to use a data source that has more complete mortality information but only partial resource utilization data, and clearly note the limitations of the conclusions that can be drawn from the results.
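
One way to make this kind of trade-off explicit is to measure how complete each variable is in every candidate source and weight the variables by research priority. The sketch below is illustrative only: the records, field names, and weights are invented, and a real assessment would also consider accuracy, provenance, and bias, not just missing values.

```python
def completeness(records, field):
    """Fraction of records with a non-missing value for `field`."""
    if not records:
        return 0.0
    present = sum(1 for r in records if r.get(field) not in (None, ""))
    return present / len(records)


def weighted_score(records, weights):
    """Priority-weighted average completeness across the fields in `weights`."""
    total = sum(weights.values())
    return sum(w * completeness(records, f) for f, w in weights.items()) / total


# Hypothetical priorities: overall survival is primary, utilization secondary.
WEIGHTS = {"death_or_censor_date": 7, "total_visits": 3}

# Two toy sources (invented data, not real patients).
source_a = [
    {"death_or_censor_date": "2021-03-01", "total_visits": 12},
    {"death_or_censor_date": "2022-07-15", "total_visits": None},
    {"death_or_censor_date": "2020-11-30", "total_visits": None},
    {"death_or_censor_date": "2023-01-09", "total_visits": 4},
]
source_b = [
    {"death_or_censor_date": None, "total_visits": 9},
    {"death_or_censor_date": "2021-05-20", "total_visits": 7},
    {"death_or_censor_date": None, "total_visits": 15},
    {"death_or_censor_date": "2022-02-02", "total_visits": 3},
]

for name, source in [("A", source_a), ("B", source_b)]:
    print(name, round(weighted_score(source, WEIGHTS), 2))
```

With these assumed weights, source A scores higher because its mortality field is fully populated, even though half of its utilization values are missing. That gap is exactly what would need to be documented as a limitation.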

Leveraging tailored databases for innovative research projects

A data source that is fit for purpose for answering a research question must be timely, complete, and relevant. Using a principled approach, researchers can identify existing fit-for-purpose databases or create customized ones by aggregating multiple data sources. Although this is a considerable undertaking, recent technological advances can help. For example, tokenization, which generates anonymous identifiers that can be used to link a patient's data across various data sources, can be used to create custom fit-for-purpose datasets. Additionally, it is important to consider the context of the decision when selecting a database, particularly for regulatory submissions.
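
To make the tokenization idea concrete, the sketch below derives a keyed hash from normalized patient identifiers, so that records for the same person in two sources receive the same token without the identifiers being shared. It is a simplified illustration using Python's standard `hmac` and `hashlib` modules. The field choices, normalization rules, and `SECRET_KEY` value are assumptions, and production tokenization services typically add further safeguards, such as managed keys and handling for typos or name changes.

```python
import hashlib
import hmac

# Hypothetical secret shared by the tokenizing parties; in practice this
# would be managed securely and never hard-coded.
SECRET_KEY = b"example-key-not-for-real-use"


def make_token(first_name, last_name, birth_date):
    """Derive a deterministic token from patient identifiers using HMAC-SHA256."""
    # Normalize so trivial formatting differences do not change the token.
    normalized = "|".join(
        part.strip().lower() for part in (first_name, last_name, birth_date)
    )
    return hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Invented records from two sources describing the same fictional patient.
ehr_record = {"first": "Jane", "last": "Doe", "dob": "1960-04-12", "her2_status": "negative"}
claims_record = {"first": " JANE", "last": "doe ", "dob": "1960-04-12", "total_visits": 12}

ehr_token = make_token(ehr_record["first"], ehr_record["last"], ehr_record["dob"])
claims_token = make_token(claims_record["first"], claims_record["last"], claims_record["dob"])

# Matching tokens let the de-identified records be joined on the token alone.
if ehr_token == claims_token:
    linked = {
        "token": ehr_token,
        "her2_status": ehr_record["her2_status"],
        "total_visits": claims_record["total_visits"],
    }
    print(linked["her2_status"], linked["total_visits"])
```

Because the token is computed the same way in each source, the biomarker result from the EHR and the visit count from claims can be combined into one record keyed only by the token.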

In our next blog, we will discuss how these fit-for-purpose databases can be used to support an innovative approach to clinical research, the external control arm (ECA), and what to consider for regulatory submissions when using this real-world data strategy.