Peek behind the paper: introduction of an evaluation methodology for clinical natural language processing


In this feature, we are taking a ‘Peek behind the paper’ with Miren Taberna (Savana, Madrid, Spain) to discuss an exciting new paper describing a methodology to evaluate clinical natural language processing (cNLP) systems.

Miren graduated from the Navarra University Medical School (Spain) in 2008 and completed her medical oncology training at the Catalan Institute of Oncology (ICO) in Barcelona (Spain) in 2013. Internationally, she joined the head and neck cancer unit at the Dana Farber Cancer Institute in Boston (MA, USA) as a fellow. Additionally, she carried out part of her Ph.D. at Saint James Cancer Center in Ohio University (OH, USA). In 2019, Miren defended her Ph.D. on the impact of human papillomavirus (HPV) in oropharyngeal cancer, receiving two awards from the University of Barcelona (Spain): best Ph.D. in Medicine (Extraordinary Award) and the certificate of recognition within all university degrees of the year.

Miren is the author of 38 scientific articles published in high-impact international journals (ORCID). She has also participated in several scientific congresses and events, such as the American Society of Clinical Oncology (ASCO, VA, USA) as chairwoman of the head and neck poster discussion session. In addition, she has taken part in numerous international clinical trials and university teaching activities, and has written several guidelines and book chapters in the oncology field. In 2020 she joined Savana to develop the scientific department, applying her clinical and research knowledge to find applications of AI models that advance science and improve patient outcomes.

Please could you introduce yourself, your research and your role at Savana?

I am the Chief Scientific Officer at Savana. My academic background is medical oncology. In clinical practice, I quickly recognized the potential of real-world data to better understand and improve patient outcomes. Subsequently, for over 10 years, my international research interest has been in clinical and translational research, with a focus on head and neck cancer. Together with my team at Savana, I apply my research experience to the use of artificial intelligence (AI) models to accelerate health research and improve patient outcomes.

On behalf of the co-authors of this study, and the whole Savana team involved in this work, I am delighted to have this opportunity to discuss our recent publication.

What is the background to Savana publishing this paper?

In daily practice, clinicians record the medical history of their patients within electronic health records (EHRs). As such, the EHR is a hugely valuable source of real-world health data. Savana has developed a scientific methodology that uses AI to extract the clinical information available within the free, unstructured text, as well as the structured text, of EHRs to generate deep real-world evidence (RWE).

However, until now, there has been limited guidance concerning how to evaluate the quality of clinical data extracted from EHRs using AI techniques. Therefore, external evaluation is required to demonstrate to the wider medical and scientific community how NLP technology can be relied upon to help us gain access to this valuable information.

To plug this gap, Savana has developed a methodology for evaluating the performance of cNLP systems. We want to share our knowledge and offer NLP experts a standardized, repeatable methodology for the evaluation of their own cNLP systems.

What do you mean by quality of the clinical data in this context?

By quality we mean that the data extracted from EHRs represent exactly what the physician had in mind when entering information into the EHR, including the context of clinical terms. In our methodology, the performance of the cNLP system is calculated using the standard metrics: precision, recall and F1 score. Precision indicates how accurate the retrieved information is, that is, the proportion of items retrieved by the system that are correct. Recall indicates how much of the relevant information the system retrieves, that is, the proportion of items in the gold standard that the system finds. Finally, the F1 score, the harmonic mean of precision and recall, evaluates the overall performance of information retrieval.
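As an illustration of these three metrics, here is a minimal Python sketch. It assumes that both the gold standard and the system output can be represented as sets of (document, term) pairs; the function name `evaluate`, the `Scores` container and the example data are hypothetical and are not taken from the paper or from Savana's system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    precision: float
    recall: float
    f1: float


def evaluate(gold: set, predicted: set) -> Scores:
    """Compare system output against a gold standard, both given as sets."""
    tp = len(gold & predicted)   # true positives: in the gold standard and found
    fp = len(predicted - gold)   # false positives: found but not in the gold standard
    fn = len(gold - predicted)   # false negatives: in the gold standard but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return Scores(precision, recall, f1)


# Hypothetical (document id, clinical term) pairs, for illustration only.
gold = {("doc1", "asthma"), ("doc1", "dyspnea"), ("doc2", "asthma"), ("doc3", "cough")}
predicted = {("doc1", "asthma"), ("doc2", "asthma"), ("doc2", "rhinitis")}

scores = evaluate(gold, predicted)
print(scores)
```

In this toy example the system finds two of the four gold-standard items and adds one incorrect item, giving a precision of 2/3, a recall of 1/2 and an F1 score of 4/7.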

Why is it so important to have a standardized, repeatable gold standard methodology for cNLP?

At Savana we believe in a new era of medicine where technology can help physicians analyze big data to positively impact patients' lives. To underpin this, we need to develop accurate and repeatable methodologies to demonstrate the quality of the clinical data extracted for research. Medicine, like other scientific disciplines, requires results to be checked against a benchmark, a gold standard. This is particularly important during the early-stage development of any new healthcare technology. The conclusions derived from the data will impact clinical decision-making. Therefore, the ability to demonstrate the adequacy of the clinical data used after NLP extraction should be mandatory.

Can you summarize your evaluation methodology?

Our evaluation methodology offers guidance to NLP experts on how to approach the evaluation of their cNLP systems over five phases. Briefly, phase one defines the target patient population for the study. Phase two describes how to build a body of data, which represents the characteristics of the target patient population. For this, Savana uses an in-house software tool termed Sample Size Calculator for Evaluations (SLiCE) and stratified sampling. Phase three is the design of the annotation task and guidelines. This requires the collaboration of three clinical experts from each participating hospital site to perform the annotation. That is why the annotation task guidelines should be prepared carefully by the NLP and internal medical experts, as the gold standard is the comparison point for all other data. Phase four is the external annotation undertaken by three external clinical experts, two annotators and one annotator curator. Finally, phase five calculates the performance of the cNLP system against the gold standard using precision, recall and F1 score metrics.
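To make the stratified sampling in phase two more concrete, the following Python sketch draws a proportional stratified sample from a collection of records. It is a generic illustration only: it does not reproduce SLiCE, which is an in-house tool, and the function name `stratified_sample`, the `stratum_of` callback and the example document types are all assumptions.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, stratum_of, total, seed=0):
    """Draw roughly `total` records, allocated proportionally to stratum size."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)

    n = len(records)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        # Proportional allocation, with at least one record per stratum.
        # Rounding means the final size can differ slightly from `total`.
        k = min(len(members), max(1, round(total * len(members) / n)))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical corpus: 100 documents of three types.
records = (
    [("discharge", i) for i in range(60)]
    + [("outpatient", i) for i in range(30)]
    + [("emergency", i) for i in range(10)]
)

sample = stratified_sample(records, stratum_of=lambda r: r[0], total=20)
counts = Counter(doc_type for doc_type, _ in sample)
print(counts)
```

With this allocation each document type appears in the sample in the same proportion as in the full corpus, which is the property that makes the sample representative of the target population.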

How do you expect RWE researchers and regulators to respond to your paper?

Any cNLP methodology used to extract information from EHRs should be compared to a gold standard. This will ensure that studies are utilizing high-quality data that meet the study objectives (e.g., to describe a disease, evaluate a drug, or create a predictive model to assess patient outcomes). If the data are not of high quality, it will not matter how precise the analysis is in answering the objectives of the study. The output will not correspond to reality and will therefore invalidate the results.

Savana’s responsibility, as part of the scientific and medical community, is to validate the quality of the extraction method. This is because the data are being used to answer medical questions. Therefore, we are sharing our methodology so that anyone with their own NLP system can assess its performance. We’ve also described a real use case, a study of patients with asthma, as a practical example of how to apply this methodology.

Could this methodology be applied to other NLP tasks?

We apply Savana NLP to extract information from EHRs. Therefore, our evaluation methodology has been developed for the extraction of clinical variables. Nevertheless, the different phases proposed in our paper could be adapted for other disciplines to extract alternative variables. Irrespective of discipline, NLP experts will need to collaborate closely with experts from other fields. The synergy between the NLP expert and those from other disciplines is essential for the success of the process.

How do you think higher quality RWE will impact clinical research in the future?

RWE is becoming essential in clinical research. It describes how patients respond in clinical settings, which may be different from clinical trials. Information from both RWE and clinical trials is needed to improve clinical research, and together they form a strong partnership to inform clinical decision-making and improve patient outcomes.

In the past, RWE studies have utilized structured data such as ICD codes. The full potential of the data contained in the EHR, i.e., all the clinical variables contained within free text such as clinical notes or discharge reports, was not realized. By applying NLP we have the opportunity to extract all the information contained within EHRs homogeneously and without limit. This exponentially increases the capability of RWE for health research.

The ability to guarantee high-quality data extracted utilizing AI techniques, such as cNLP, is impacting the reliability and confidence of researchers and regulators in RWE today and for the future.


The opinions expressed in this feature are those of the interviewee/author and do not necessarily reflect the views of The Evidence Base® or Future Science Group.
