Extracting Relations between Radiotherapy Treatment Details

We present work on the extraction of radiotherapy treatment information from the clinical narrative in electronic medical records. Radiotherapy is a central component of the treatment of most solid cancers. Its details are described in non-standardized ways using jargon not found in other medical specialties, complicating the already difficult task of manual data extraction. We examine the performance of several state-of-the-art neural methods for relation extraction of radiotherapy treatment details, with the goal of automating detailed information extraction. The neural systems perform at 0.82-0.88 macro-average F1, which approximates or in some cases exceeds the inter-annotator agreement. To the best of our knowledge, this is the first effort to develop models for radiotherapy relation extraction and one of the few efforts for relation extraction to describe cancer treatment in general.


Introduction
Radiotherapy is the use of ionizing radiation, which is radiation with enough energy to remove electrons from atoms and molecules, to treat disease (Gunderson and Tepper, 2020). The predominant indication for radiotherapy is the treatment of cancer, where it exerts its antineoplastic effect via DNA damage, which preferentially kills cancer cells over healthy tissue cells (McDermott and Orton, 2010). Radiotherapy plays a central role in the curative and palliative treatment of many cancers. It is estimated that up to 30% of cancer patients receive radiotherapy as a part of their first-line treatment, and approximately 50% of all cancer patients receive radiotherapy during the course of their cancer care (Delaney et al., 2005; Smith et al., JCO 2010).
Despite its importance in cancer treatment, radiotherapy is recorded in cancer registries only in high-level, often cursory detail, if at all. For example, radiotherapy details are only available by custom request in the publicly available Surveillance, Epidemiology, and End Results Program (SEER) cancer registry, acknowledging incompleteness and errors in this manually extracted data (Surveillance, Epidemiology, and End Results Program). The reasons for this are manifold. First, radiotherapy is a highly technical field that is not extensively taught in medical school, and it uses its own jargon not found in other medical texts. Additionally, radiation treatment details are frequently not entered into the electronic medical records (EMR) as structured data. Instead, radiotherapy is described in clinical free text using descriptive and highly non-standardized language. Radiotherapy treatment descriptions are more similar to the documentation of operative procedures than to the documentation of medication regimens. Because radiation is personalized to each patient's disease and anatomy, it cannot be described with standard reporting of the type of radiation, dose, and frequency. Additionally, radiotherapy is often delivered in multiple phases, each treating a different anatomical site to different doses and with different types of radiation, yielding complex descriptions of treatment courses. These features, in concert with a lack of widely used standardized nomenclatures (Mir et al., 2020; Phillips et al., 2020; Traverso et al., 2018), limit manual data extraction, hindering the potential of big data to improve cancer research and clinical care.
While algorithms for named entity recognition have previously been reported for radiotherapy details (Bitterman et al., 2020) and other cancer therapies (Yin et al., 2018; Yim et al., 2016; Savova et al., 2019), relation extraction remains a relatively underexplored task in clinical NLP (Sheikhalishahi et al., 2019). There are few examples of relation extraction models for cancer characteristics in general (Bozkurt et al., 2016; Sheikhalishahi et al., 2019), and to the best of our knowledge none for cancer treatment, including radiotherapy. Identifying treatment entities in isolation without linking them to a specific treatment instance is insufficient to coherently describe cancer therapies, especially because concurrent and serial treatments are often described together in the same note. For example, extracting a frequency without linking it to a specific treatment is not informative by itself. Relation extraction is a critical component of information extraction for radiotherapy, as this treatment is often given in multiple sequential or nested phases. Linking relevant treatment entities to the same phase is necessary to summarize how and why a treatment was delivered, both of which are needed to understand treatment outcomes and quality. However, this is a very challenging task even for expert human annotators, as demonstrated by the difficulty of accurately extracting such data for SEER. Therefore, there is a need for more reliable relation extraction methods for radiotherapy.
Relations can be modeled in various ways in neural networks, including inserting special tokens around the arguments of interest and using this augmented text as input into the model (Dligach et al., 2017), and using token position embeddings to encode the relative distance of words to the arguments (Zeng et al., 2014; Nguyen and Grishman, 2015; Shi and Lin, 2019). Using the former approach, we aimed to investigate several methods for relation extraction from clinical texts describing radiotherapy, with the goal of augmenting reporting of cancer treatment details for research and clinical purposes. The contributions of the work described in this paper are (1) the definition of the task of radiotherapy information extraction from the EMR clinical narrative, (2) the creation of resources for the task (annotation guidelines and corpus), (3) the exploration of state-of-the-art neural methods for this highly impactful clinical task, and (4) the establishment of a baseline for the task.

Data
Data for this work consisted of texts describing radiotherapy from three complementary sources. First, we included 270 clinical descriptions of radiotherapy regimens from HemOnc.org, which is a publicly available wiki of cancer and blood disorder treatment regimens and interventions. Second, we included 73 radiotherapy descriptions from a state cancer registry. These are abstractions from patients' EMR, often copied and pasted from clinician notes, describing details of cancer treatment for use by cancer registrars within a patient-level XML. We used the entire text of the XML categories that contained radiotherapy details as model input. Third, we included 79 completed breast and colorectal cancer clinician notes that contained radiotherapy details, drawn from the THYME corpus from Clinical TempEval (Bethard et al., 2015) and an internal corpus of breast cancer notes.
Annotation guidelines for radiotherapy properties and treatment instances were developed (Bitterman et al., 2020; https://github.com/RTParse/RTAnnot). If an overall radiotherapy treatment course was delivered in more than one phase, as described above, each phase was considered a separate radiotherapy instance (Figure 1). Gold annotations for relations between the following key properties were created: Dose, Fraction Number, Fraction Frequency, Treatment Site, and Boost. Radiotherapy is most often delivered in many small doses, or fractions, over a given period of time. Dose is any description of radiation dosage in the text, either the total dose or fractional doses, generally described using the unit Gray (Gy). Fraction Number is any mention of the number of fractions delivered, and Fraction Frequency is the frequency of fraction delivery. Treatment Site is the actual or relative anatomical site that is targeted with radiotherapy. Boost is a mention that conveys the treatment instance is a second phase of radiotherapy that brings a smaller treatment site to a higher dose. Properties that described the same treatment instance were linked together as a relation.

Dose-Treatment Site relationship is represented as:
…She presented after a screening mammogram showed a nodule in the left breast upper outer quadrant. After lumpectomy, she was treated with radiation to a dose of 50 Gy in 25 fractions to the left breast, followed by a boost of RT_DosageStart 10 Gy RTDosage_End in 5 fractions to the TxSiteStart tumor bed TxSiteEnd…
Figure 1: Mock text segment describing radiotherapy with start and stop tokens, illustrating the Dose-Treatment Site relationship between the two bolded entities: 10 Gy (Dose) and tumor bed (Treatment Site). The zigzag and dashed lines indicate adjacent spans describing two different radiotherapy instances, and italicized entities are non-related anatomical/treatment sites close to the anchor dose mention. Both of these characteristics limited rule-based approaches to relation extraction in radiotherapy texts.
Two expert human annotators completed gold annotations for 47 radiotherapy instances containing 310 relation instances to calculate inter-annotator agreement (IAA), after which a single human annotator completed the gold annotations. As most radiotherapy instances included Dose, we chose to classify the relations between each Dose mention and every other property mentioned in the radiotherapy instance. Thus, Dose mentions served as the anchor for the relations within a radiotherapy instance, and the relations were labeled as Dose-Mention. Of note, Dose-Dose refers to a relation between two different Dose mentions in the same radiotherapy instance. In our dataset, there were on average 1.4 radiotherapy instances per document. Documents were split into train, development, and test sets. The gold annotated HemOnc.org corpus will be made publicly available for research purposes.

Methods
We explored two state-of-the-art neural network methods for this task. First, we used Flair (Akbik et al., 2018), in which a character language model pre-trained on one billion words of text (Chelba et al., 2013) with a multi-layer Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997) generates contextual embeddings. The input text is passed through this language model, the resulting embeddings are fed into an LSTM to obtain a text representation, and this representation is passed into a final linear layer for classification. We were motivated to explore this approach because we hypothesized that contextual character embeddings may be better at handling the rare and misspelled words in cancer texts, as well as the numbers and short abbreviations common in radiation descriptions. These models had a hidden state size of 128 and a dropout of 0.15-0.24, and were trained for 100 epochs using a mini-batch size of 8 and a learning rate starting at 0.2 with an anneal factor of 0.5, using an SGD optimizer (Robbins and Monro, 1951; Kiefer and Wolfowitz, 1952; Bottou et al., 2016). Second, we assessed the performance of the bidirectional encoder representations from transformers (BERT) base uncased model, fine-tuned on this relation task using a recurrent neural network to predict the class label (Devlin et al.). We chose to explore this method given the excellent relation classification performance of attention-based models in biomedical texts (Verga et al., 2018; Wei et al., 2019; Lee et al., 2020). These models had a hidden size of 128 and a dropout of 0.5, and were trained for 30 epochs using a mini-batch size of 8 and a learning rate starting at 0.00003 with an anneal factor of 0.5, using an Adam optimizer (Kingma and Ba, 2014). For both methods, the model that performed best on the development set was evaluated on the held-out test set.
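As a rough illustration of what a starting learning rate with an anneal factor of 0.5 means in practice, the sketch below halves the learning rate whenever the development score stops improving. The plateau criterion, the `patience` value, and the function name are assumptions made for illustration only; the text above specifies just the starting rates (0.2 and 0.00003) and the anneal factor.

```python
def annealed_learning_rates(dev_scores, initial_lr=0.2, anneal_factor=0.5, patience=3):
    """Return the learning rate used in each epoch.

    dev_scores[i] is the development-set score observed at the end of epoch i.
    ASSUMPTION: the rate is multiplied by anneal_factor once the score has
    failed to improve for more than `patience` consecutive epochs. The paper
    does not state the annealing criterion.
    """
    lr = initial_lr
    best = float("-inf")
    bad_epochs = 0
    rates = []
    for score in dev_scores:
        rates.append(lr)  # rate that was in effect during this epoch
        if score > best:
            best = score
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs > patience:
                lr *= anneal_factor
                bad_epochs = 0
    return rates


# Hypothetical development scores, for illustration only.
example_scores = [0.50, 0.60, 0.60, 0.60, 0.60, 0.70]
example_rates = annealed_learning_rates(example_scores, patience=2)
```

With a schedule of this kind, the large SGD starting rate of 0.2 shrinks geometrically as training stalls, while the much smaller Adam starting rate for the BERT models follows the same rule from a lower base.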
Rule-based methods were considered, but because there is often more than one radiotherapy instance mentioned in a clinical text, frequently in close proximity and described in nested fashion, we did not identify a straightforward rule-based approach (Figure 1).
To generate candidate relations, we extracted text windows encompassing two different token lengths on each side of the gold annotated Dose mention anchor: 46 tokens and 90 tokens (the 95th and 99th percentiles, respectively, of token span lengths between Dose-Mention pairs in the same gold radiotherapy instances in the train and development sets). These are referred to below as the 92 and 180 token text windows. Every Dose-Mention pair in the text window was considered a candidate relation. For each relation candidate, start and stop tokens were inserted around the Dose and the candidate property (Figure 1). The text segment was labeled with a positive Dose-Property relation if the Dose and candidate property were in the same radiotherapy instance.
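The candidate generation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: the `Mention` class, the `candidate_relations` function, the marker names (`DoseStart`, `TxSiteEnd`, and so on), and the requirement that a candidate lie entirely inside the window are all hypothetical choices, and mentions are assumed not to overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    start: int     # index of the first token of the mention
    end: int       # index one past the last token of the mention
    kind: str      # property type, e.g. "Dose" or "TxSite" (hypothetical names)
    instance: int  # id of the gold radiotherapy instance the mention belongs to


def mark(tokens, lo, hi, a, b):
    """Copy tokens[lo:hi], inserting start/stop marker tokens around a and b."""
    starts = {a.start: f"{a.kind}Start", b.start: f"{b.kind}Start"}
    ends = {a.end: f"{a.kind}End", b.end: f"{b.kind}End"}
    out = []
    for i in range(lo, hi):
        if i in ends:       # a mention ended just before token i
            out.append(ends[i])
        if i in starts:     # a mention begins at token i
            out.append(starts[i])
        out.append(tokens[i])
    if hi in ends:          # a mention ends exactly at the window edge
        out.append(ends[hi])
    return out


def candidate_relations(tokens, mentions, side=46):
    """Yield (marked_tokens, label) for every Dose anchor and every other
    mention lying fully within `side` tokens on each side of the anchor."""
    for anchor in mentions:
        if anchor.kind != "Dose":
            continue
        lo = max(0, anchor.start - side)
        hi = min(len(tokens), anchor.end + side)
        for other in mentions:
            if other is anchor or other.start < lo or other.end > hi:
                continue
            same = other.instance == anchor.instance
            label = f"Dose-{other.kind}" if same else "None"
            yield mark(tokens, lo, hi, anchor, other), label


# Mock text modeled on Figure 1: two phases, hence two radiotherapy instances.
tokens = ("50 Gy in 25 fractions to the left breast , followed by a boost "
          "of 10 Gy in 5 fractions to the tumor bed").split()
mentions = [
    Mention(0, 2, "Dose", 0), Mention(7, 9, "TxSite", 0),
    Mention(15, 17, "Dose", 1), Mention(22, 24, "TxSite", 1),
]
for marked, label in candidate_relations(tokens, mentions):
    print(label, "|", " ".join(marked))
```

Note that in this sketch each Dose-Dose pair is generated twice, once with each Dose as the anchor; whether such symmetric pairs were deduplicated in the experiments is not stated in the text.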
The precision (True Positives/Predictions), recall (True Positives/Gold Positives), and F1 score ((2*Precision*Recall)/(Precision+Recall)) are reported for each model. Error analysis via manual inspection was carried out to better understand how and where the models performed poorly.
Table 1 shows the IAA for each of the relation categories. IAA was ≥ 0.9 for all categories except for Dose-Boost (0.67) and None (0.74). Of note, there were only 6 total Boost mentions in the pilot dataset, limiting interpretation of this IAA. Table 2 shows the performance of the Flair and BERT relation classification models. Overall, the best performing model was the BERT model fine-tuned on the 92 token text windows (macro-average F1: 0.88), followed by the Flair model fine-tuned on the 92 token text windows (macro-average F1: 0.86). Precision for the Flair model with this window was on average slightly higher than that for the BERT model. The models fine-tuned on the 180 token text windows had slightly worse performance (BERT model macro-average F1: 0.85, Flair model macro-average F1: 0.82).
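The precision, recall, and F1 definitions used in this section, together with the macro-average over relation categories, can be written out as a short sketch. The function names and the toy labels are illustrative assumptions, and whether the "None" category is included in the reported macro-average is not stated in the text, so the set of labels to average over is left to the caller.

```python
from collections import Counter


def per_class_prf(gold, pred, labels):
    """Return {label: (precision, recall, f1)} for parallel gold/pred label lists."""
    true_pos, n_pred, n_gold = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        n_gold[g] += 1
        n_pred[p] += 1
        if g == p:
            true_pos[g] += 1
    scores = {}
    for lab in labels:
        # Precision = True Positives / Predictions
        precision = true_pos[lab] / n_pred[lab] if n_pred[lab] else 0.0
        # Recall = True Positives / Gold Positives
        recall = true_pos[lab] / n_gold[lab] if n_gold[lab] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[lab] = (precision, recall, f1)
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Toy labels for illustration only; these are not results from the paper.
gold = ["Dose-TxSite", "Dose-TxSite", "None", "None"]
pred = ["Dose-TxSite", "None", "None", "None"]
scores = per_class_prf(gold, pred, labels=["Dose-TxSite", "None"])
print(scores, macro_f1(scores))
```

Because the macro-average weights every category equally, a rare category such as Dose-Boost influences the overall score as much as a frequent one.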

Evaluation
Qualitative error analysis of each model revealed six main categories of failure: 1) human errors in gold labeling, 2) false positives due to two mentions being in close proximity despite relating to different treatment instances, 3) false negatives because related mentions were distant and/or crossed sentence boundaries, 4) incorrect labeling in texts with very atypical descriptions of treatment, 5) incorrect labeling in tabular text, and 6) other/unknown. Examples of errors are shown in Table 3. All models suffered from similar modes of failure, and texts describing radiotherapy courses with several phases appeared to be particularly challenging. Interestingly, the BERT model fine-tuned on the 180 token text windows was better able to correctly label relations in tables than the other models, although there were only rare examples of tables in these corpora. Additionally, in the Flair models using the 92 and 180 token text windows, there were 3 and 8 true positive relations, respectively, that were labeled with an incorrect label other than "None". All incorrect labels in the BERT models were either a "None" label assigned to a true relation, or a false positive relation assigned to a true "None" relation.

Discussion and Conclusion
The neural models had very good performance on the radiotherapy relations explored in these experiments, often approaching or exceeding IAA. The performance of the BERT and Flair models was comparable overall, with the best BERT model outperforming the best Flair model. Interestingly, the size of the text windows used for fine-tuning appeared to have a larger impact on performance than the type of model itself, with shorter texts yielding better results. This may be due to the frequent repetition of similar but unrelated entities and treatments in clinical texts, and optimizing this parameter should be explored when developing models for clinical relation extraction.
These findings suggest that neural methods may be a good avenue for clinical relation extraction for complex, highly specialized treatments such as radiotherapy. Future work will develop models to extract relations between Dose and additional relevant entities, and will investigate end-to-end entity and relation extraction systems for robust information extraction pipelines.