Automatic recognition of abdominal lymph nodes from clinical text

Lymph node status plays a pivotal role in the treatment of cancer. The extraction of lymph nodes from radiology text reports enables large-scale training of lymph node detection models on MRI. In this work, we first propose an ontology of 41 types of abdominal lymph nodes with a hierarchical relationship. We then introduce an end-to-end approach based on the combination of rules and transformer-based methods to detect these abdominal lymph node mentions and classify their types from the MRI radiology reports. We demonstrate the superior performance of a model fine-tuned on MRI reports using BlueBERT, called MriBERT. We find that MriBERT outperforms the rule-based labeler (0.957 vs 0.644 in micro weighted F1-score) as well as other BERT-based variations (0.913–0.928). We make the code and MriBERT publicly available at https://github.com/ncbi-nlp/bluebert, with the hope that this method can facilitate the development of medical report annotators to produce labels from scratch at scale.


Introduction
Lymph nodes are organs of the lymphatic system that are present throughout the body. Their status plays a pivotal role in the staging and treatment of cancer (Amin et al., 2017). The development of deep learning (DL) for computer vision has led to increasing interest in applying DL-based AI to identify and segment lymph nodes and to detect lymph node metastasis in imaging studies, such as Magnetic Resonance Imaging (MRI). Applications of machine learning to MRI not only contribute to improving diagnostic accuracy but also reduce the workload of radiologists and enable them to spend additional time on high-level decision-making tasks. However, DL algorithms need to be sufficiently trained and evaluated using large-scale data before clinical adoption. Unlike general computer vision tasks, medical image analysis currently does not have enough annotated data (comparable to ImageNet and MS COCO), mainly because the conventional methods for harvesting labels cannot be applied in the clinical domain: labeling requires extensive clinical expertise, and security and privacy issues restrict data sharing. Therefore, there is an unmet need to construct a large-scale annotated dataset of lymph nodes to increase the generalizability and robustness of DL algorithms.
* These authors contributed equally to this work. † Co-corresponding.
Radiologists report any abnormal lymph node detected in computed tomography (CT) and MRI exams by describing the regional name (type) of the lymph node. Example MRI scans and annotations are shown in Figure 1, where the radiologist describes the lymph node with the sentence "Abdominal/pelvic lymph nodes: There is intraperitoneal and retroperitoneal lymphadenopathy, for example, enlarged mesenteric/peripancreatic lymph node measuring 2.8 … (Bookmark1)". The radiologist places a hyperlink (hereafter "bookmark") in the context to refer to the specified lymph node annotation in the image. Therefore, clinical reports provide a detailed and personalized account of assessments, offering a better context for clinical decision making and follow-up.
Natural language processing (NLP) has been explored recently to unlock evidence buried in clinical narratives, making it available for large-scale analysis. In the clinical domain, NLP has been applied to identify positive, negative, and uncertain findings from radiology reports (Peng et al., 2018; Irvin et al., 2019; Yan et al., 2018). For MRI reports, NLP has been used to identify breast imaging lexicons for breast cancer (Sippo et al., 2013; Liu et al., 2019). However, most of these systems are rule-based, and few studies have investigated NLP in MRI reports of the lymph nodes.
To tackle these obstacles and challenges, this paper outlines a framework based on deep learning to harvest lymph node annotations and construct an annotated dataset of lymph nodes by automatically extracting lymph nodes from clinical reports. The contributions of this study are threefold: (1) We construct an ontology of 41 types of abdominal lymph nodes with a hierarchical relationship.
(2) We develop a transformer-based deep learning module to extract and classify the abdominal lymph node type (or non-abdominal lymph node, or not a lymph node) for each bookmark mentioned in the sentence. (3) We make the code and pre-trained models publicly available.
The rest of the paper is organized as follows. We first present related work in Section 2. Then, we describe the method to construct the ontology and dataset in Section 3, followed by our experimental setup, results, and discussion in Section 4. We conclude with future work in the last section.

Related work
In recent years, there has been considerable interest in harvesting information and knowledge from free text in electronic health records (EHRs) (Jensen et al., 2017). However, manually annotating a large dataset to fulfill the needs of downstream deep learning models is time-consuming and expensive. Therefore, researchers have applied NLP systems to identify structured labels from radiology reports (Irvin et al., 2019; Johnson et al., 2019; Wang et al., 2017; Smit et al., 2020).
Previous efforts in this area have focused mostly on two directions. One is rule-based methods. NegEx, in combination with the Unified Medical Language System (UMLS), is a widely used algorithm that utilizes regular expressions to determine negative concepts in clinical narratives (Chapman et al., 2013; Aronson and Lang, 2010; Chapman et al., 2011). NegBio extended NegEx by utilizing universal dependencies and sub-graph matching to detect both negative and uncertain lung diseases in chest X-rays and was used to generate labels for the NIH Chest X-ray and MIMIC-CXR datasets (Johnson et al., 2019; Wang et al., 2017; Peng et al., 2018). The CheXpert labeler further extended NegBio by increasing the rule sets and improving the NLP pipeline to construct report-level disease annotations (Irvin et al., 2019). CheXpert++ trained a hybrid rule- and BERT-based labeler on the radiograph domain and offers additional commentary on the utility of active-learning strategies to inform the interplay between the hybrid and rule-based labelers (McDermott et al., 2020).
The other direction is to apply machine learning methods to construct labels (Huang and Lowe, 2007; Clark et al., 2011; Xue et al., 2019; Peng et al., 2019a). Huang et al. described a hybrid approach to automatically detect negations in clinical radiology reports (Huang and Lowe, 2007). Clark et al. combined machine learning (conditional random fields and maximum entropy) and rules to determine the assertion status of medical problems mentioned in clinical reports (Clark et al., 2011). Recently, deep learning approaches have also been studied intensively. Chen et al. applied CNNs to classify pulmonary embolism in chest CT reports (Chen et al., 2018). Drozdov et al. compared thirteen supervised classifiers and demonstrated that bidirectional long short-term memory (BiLSTM) networks with attention mechanisms effectively identify labels in CXR reports (Drozdov et al., 2020). Wood et al. presented a transformer-based network for brain magnetic resonance imaging (MRI) radiology report classification, which automates this task by assigning image labels based on free-text expert radiology reports (Wood et al., 2020). Smit et al. introduced a BERT-based approach to medical image report labeling that exploits both the scale of available rule-based systems and the quality of expert annotations (Smit et al., 2020).

Methods
In this section, we first describe the process of constructing the abdominal lymph node ontology and gold-standard labels from the MRI reports associated with lymph nodes on MRI images. Then we demonstrate the development of the transformerbased method to detect lymph nodes from the reports.

Abdominal lymph node ontology construction
The labeling task in this study is to extract the presence of abdominal lymph nodes from radiology reports. Therefore, the first step is to construct the lymph node ontology. The challenge here is that the nomenclature of abdominal lymph nodes is complicated. Most are named after the anatomical organs their lymphatics drain from, but some are named after an adjacent structure, and some are named for an anatomical compartment space. As a result, lymph node types have confusing synonyms and sometimes overlapping areas, which naturally gives them a hierarchical structure.
To make a standardized version of the abdominal lymph node ontology, we used three widely used guidelines (Amin et al., 2017) and textbooks (Harisinghani, 2013;Richter and Feyerabend, 2012) to establish the hierarchical relationship, representative synonyms, and relationships with overlapping areas.
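The hierarchy and synonym relationships described above can be represented with a simple tree structure. The following is an illustrative sketch using a hypothetical three-node subset; the node names, parent links, and synonyms here are examples for exposition, not the paper's full 41-type ontology:

```python
# Minimal sketch of a hierarchical lymph node ontology.
# Node names, parent links, and synonyms below are illustrative only.

class LymphNodeType:
    def __init__(self, name, parent=None, synonyms=()):
        self.name = name
        self.parent = parent
        self.synonyms = set(synonyms)

    def ancestors(self):
        """Walk up the hierarchy from this node's parent to the root."""
        node, chain = self.parent, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain

root = LymphNodeType("abdominal lymph node")
retro = LymphNodeType("retroperitoneal lymph node", parent=root)
para_aortic = LymphNodeType("para-aortic lymph node", parent=retro,
                            synonyms={"paraaortic", "periaortic"})

print(para_aortic.ancestors())
# ['retroperitoneal lymph node', 'abdominal lymph node']
```

Storing the parent links explicitly makes it easy to map a fine-grained prediction (e.g., "para-aortic") up to its coarse-grained group for the coarse-level evaluation.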

MRI dataset
For model development and validation, we collected large-scale MRI studies performed at the NIH Clinical Center between Jan 2015 and Sept 2019, along with their associated radiology reports (Figure 2). The majority (63%) of the MRI studies were from the oncology department. The initial search of the Picture Archiving and Communication System (PACS) database at the NIH Clinical Center returned 21,786 studies from 9,343 patients. We excluded non-abdomen studies and studies with missing reports. The final dataset consists of a total of 2,099 lymph node bookmarks from 1,379 studies of 917 unique patients, together with their corresponding text reports collected retrospectively. These lymph node labels were reviewed by a radiologist with 12 years of post-graduate experience. The study was retrospective and was approved by the Institutional Review Board with a waiver of informed consent. This dataset comprised the reference (gold) standard for our evaluation and comparative analysis.

Framework
We developed a hybrid system to extract abdominal lymph nodes from the MRI reports. It consists of two modules: (1) rule-based lymph node sentence extraction, and (2) transformer-based lymph node type classification.

Sentence extraction with potential lymph node bookmarks
In the reports of our institute, radiologists describe the lymph nodes and insert hyperlinks, size measurements, or slice numbers in the sentence to refer to the imaging findings of interest (called a bookmark). A bookmark is thus a hyperlink connection between the annotation in the image and the written description in the report. From the reports, we selected the full sentences that included the hyperlink, presuming that they carried the information most relevant to the connected image annotation. In this step, we extract sentences with bookmarks that potentially link to lymph nodes. We first split the reports into sections. In our reports, the text is often organized into five sections: Clinical Indication, Technique, Comparison, Findings, and Impression. Among these, the "Findings" section lists the normal, abnormal, or potentially abnormal observations the radiologist saw in each area of the abdomen or pelvis in the exam. Hence, this section is often organized by organs such as the liver and kidney, blood vessels, and lymph nodes. Each section/subsection begins with a heading and ends with one or more empty lines. If available, the section headings were phrases from the beginning of a new line to a colon (e.g., "Liver and Gallbladder:"). We therefore use this information to split the reports into sections. Second, we tokenized the sentences using NLTK (Bird, 2006). If a report contains a "Lymph node" subsection, we extracted sentences mentioning lymph nodes from this subsection; otherwise, we extracted sentences mentioning "lymph node" from the "Findings" section using regular expressions. We skipped reports that were not sectioned (0.3%). In our study, 85% of lymph node bookmarks come from a "Lymph node" subsection, and the remaining 15% come from the "Findings" section of reports without such a subsection.
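The splitting-and-filtering procedure above can be sketched as follows. This is a simplified illustration: the heading regex and the naive sentence splitter are assumptions on our part (the paper tokenized sentences with NLTK), and the real pipeline handles more edge cases.

```python
import re

def split_sections(report: str) -> dict:
    """Split a report into sections keyed by heading.

    Assumes a heading is a phrase from the start of a line up to a colon
    (e.g., "Liver and Gallbladder:"), with subsequent lines appended as
    that section's body.
    """
    sections, current = {}, None
    for line in report.splitlines():
        m = re.match(r"^([A-Za-z][\w /]*?):\s*(.*)$", line)
        if m:
            current = m.group(1).strip()
            sections[current] = m.group(2)
        elif current is not None:
            sections[current] += " " + line
    return sections

def candidate_sentences(report: str):
    """Return sentences mentioning lymph nodes, preferring a dedicated
    lymph node subsection over the whole "Findings" section."""
    sections = split_sections(report)
    target = next((v for k, v in sections.items()
                   if "lymph node" in k.lower()), None)
    if target is None:
        target = sections.get("Findings", "")
    # Naive sentence split on end punctuation (the paper used NLTK).
    sents = re.split(r"(?<=[.!?])\s+", target)
    return [s.strip() for s in sents if re.search(r"lymph\s?node", s, re.I)]
```

For example, given a report whose "Lymph nodes:" subsection reads "There is an enlarged para-aortic lymph node. No other adenopathy.", only the first sentence is returned as a candidate.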

Bookmark classification for the abdominal lymph node type
After obtaining candidate bookmarks that may link to lymph nodes, the next step is to classify the bookmarks by lymph node type. Here, we use the full sentences that included the bookmark, presuming that they carried the information most relevant to the connected image annotation. However, the bookmarked sentences often contain a complex mixture of information describing not only various bookmarked lymph nodes but also other bookmarked abnormalities. A sample sentence is shown in Figure 1. There are four bookmarks in the sentence, each of which has a different lymph node type. Table 1 shows that more than 30% of sentences have at least two bookmarks.
To solve this problem, we developed a transformer-based deep learning module with 43 labels (41 abdominal lymph node types, non-abdominal lymph node, and not a lymph node). Specifically, we treat the lymph node recognition task as sentence classification by replacing the bookmark of interest in the sentence with a predefined tag $BMK$. Let h_0 be the output embedding of the token [CLS]; the probability that a bookmark X is labeled as class c is predicted by a fully connected layer followed by a softmax: P(c|X) = softmax(A h_0 + b). We fine-tune the model on the training set using the categorical cross-entropy loss, −Σ_{c∈C} δ(y_c = ŷ) log P(c|X), where δ(y_c = ŷ) = 1 if the classification ŷ of X is the correct ground truth for class c ∈ C, and δ(y_c = ŷ) = 0 otherwise.
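The bookmark masking and the classification-head math above can be illustrated with a minimal NumPy sketch. The toy dimensions and random weights are placeholders: the actual model uses the 768-dimensional [CLS] embedding of the fine-tuned BERT and 43 classes.

```python
import numpy as np

def mask_bookmark(sentence: str, bookmark: str) -> str:
    """Replace the bookmark of interest with the predefined $BMK$ tag."""
    return sentence.replace(bookmark, "$BMK$", 1)

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def class_probs(h0, A, b):
    """P(c|X) = softmax(A h0 + b), h0 being the [CLS] embedding."""
    return softmax(A @ h0 + b)

def cross_entropy(probs, gold_class):
    """Categorical cross-entropy for one example: -log P(gold|X)."""
    return -np.log(probs[gold_class])

# Toy sizes: 4-dim [CLS] embedding, 3 classes (real model: 768-dim, 43 classes).
rng = np.random.default_rng(0)
h0 = rng.normal(size=4)
A, b = rng.normal(size=(3, 4)), np.zeros(3)
p = class_probs(h0, A, b)
assert np.isclose(p.sum(), 1.0)
```

At training time the loss is averaged over the batch and backpropagated through the full transformer, not just the head shown here.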
BERT is a contextualized word representation model that is pretrained based on a masked language modeling using bidirectional transformers (Devlin et al., 2019). In this paper, we fine-tuned the model using the BlueBERT base model (Peng et al., 2019b). The BlueBERT was pre-trained on the combination of PubMed and MIMIC-III clinical notes. We also compared the performance of our method using other BERT variants.

Abdominal lymph node ontology
We construct an ontology of 41 abdominal lymph nodes relevant to MRI (Figure 4). Because of the nature of lymph node nomenclature, the labels had to have a hierarchical structure and some labels overlapped with others (Harisinghani, 2013;Richter and Feyerabend, 2012;Amin et al., 2017). Those subgroups include coarse, high-level lymph nodes such as "mediastinal lymph node", "retroperitoneal lymph node", and "pelvic lymph node", as well as fine-grained lymph nodes such as "perigastric lymph node along greater curvature" and "pericecal lymph node". Table 2 shows the distribution of lymph nodes in the dataset, which is imbalanced. The majority of abdominal lymph nodes in the dataset are periportal and para-aortic lymph nodes.

Results of the lymph node classification
We trained the model on one NVIDIA® V100 GPU using the TensorFlow framework. We used the Adamax optimizer (Kingma and Ba, 2015) with a learning rate of 10^−5 and a batch size of 32. We used the BlueBERT base model as the domain-specific language model. All texts were tokenized using wordpieces (Wu et al., 2016) and truncated to spans no longer than 128 tokens. We set the maximum number of epochs to 30.
To evaluate the performance of the framework, we split the data into 70% for training, 10% for development, and 20% for testing. Table 3 shows the performance of our system on the classification of 5 coarse-grained lymph node types by (P)recision, (R)ecall, and (F)1-score. The micro metrics count the total true positives, false negatives, and false positives across all lymph node types. The macro metrics calculate precision, recall, and F1 for each lymph node type and take their unweighted mean. The weighted metrics calculate precision, recall, and F1 for each lymph node type and take their average weighted by the number of true instances of each type. Our system achieved an overall precision of 0.960, recall of 0.959, and F1-score of 0.957. We achieved an F1-score ≥ 0.850 on all coarse-grained lymph node types. On the other hand, we observed that on "negative" cases (not a lymph node), the recall is 0.5. This is because the dataset has few negative instances (26 in total), which may not be sufficient to train and test the model. In the future, more negative cases should be manually included to handle the imbalanced dataset. However, we consider this not a major issue in our framework since the first step utilizes rigid extraction patterns and achieves high precision. Table 4 shows the performance on the classification of all fine-grained lymph node types. Our system achieved an overall precision of 0.925, recall of 0.913, and F1-score of 0.912. We achieved an F1-score of 1.00 on 8 types, ≥ 0.90 on 17 types, and ≥ 0.80 on 23 types.
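The micro, macro, and weighted averages defined above can be computed with a small self-contained sketch (equivalent in spirit to scikit-learn's `average='micro'|'macro'|'weighted'` options for `f1_score`):

```python
from collections import Counter

def f1_scores(gold, pred, labels):
    """Micro-, macro-, and support-weighted F1 for multi-class predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    # Micro: pool counts across all classes, then compute F1 once.
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = 2 * TP / (2 * TP + FP + FN) if TP else 0.0
    # Per-class F1, then unweighted (macro) and support-weighted means.
    support = Counter(gold)
    per_class = {}
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        per_class[c] = 2 * tp[c] / denom if denom else 0.0
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[c] * support[c] for c in labels) / len(gold)
    return micro, macro, weighted
```

Note that in single-label multi-class classification every error is simultaneously a false positive for the predicted class and a false negative for the gold class, so the micro F1 reduces to plain accuracy.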
We also compared our model with BERT variants: ClinicalBERT (Alsentzer et al., 2019), BioBERT (Lee et al., 2020), and BlueBERT. The ClinicalBERT was pretrained on MIMIC-III generic clinical text. The BioBERT was pretrained on PubMed. For reference, we include a rule-based system where the type of lymph node is selected based on the nearest keyword (e.g., cardiophrenic, inguinal, etc.) to the bookmark in the sentence. Table 5 shows that deep-learning-based methods can successfully classify the type of each lymph node mentioned in the sentences. The system using BlueBERT (MriBERT) outperforms that using BioBERT. This observation shows the impact of using clinical notes during the pre-training process. On the other hand, the system using ClinicalBERT achieved lower performance. This may suggest that the MIMIC-III clinical text alone is not large enough to sufficiently pre-train the BERT model.
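The nearest-keyword baseline can be sketched as follows. The keyword-to-type map here is a hypothetical three-entry subset; the real rule set covers far more keywords and all 41 ontology types.

```python
import re

# Hypothetical keyword subset for illustration only.
KEYWORDS = {
    "para-aortic": "para-aortic lymph node",
    "periportal": "periportal lymph node",
    "inguinal": "inguinal lymph node",
}

def nearest_keyword_label(sentence: str, bookmark_pos: int) -> str:
    """Assign the type whose keyword occurrence is closest (in character
    offset) to the bookmark position, as in the rule-based baseline."""
    best, best_dist = "not a lymph node", None
    for kw, label in KEYWORDS.items():
        for m in re.finditer(re.escape(kw), sentence, re.I):
            dist = abs(m.start() - bookmark_pos)
            if best_dist is None or dist < best_dist:
                best, best_dist = label, dist
    return best
```

This proximity heuristic is cheap but brittle, which is consistent with the large gap between the rule-based labeler and the BERT-based models reported in Table 5: when a sentence mentions several node types, the nearest keyword is not always the one the bookmark refers to.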

Conclusion
In this study, we introduced an ontology of 41 types of abdominal lymph nodes with a hierarchical relationship. We then proposed an end-to-end framework combining rules and deep learning for accurate bookmark classification of lymph node types from MRI reports. In this framework, a rule-based method is first used to extract sentences with potential lymph node bookmarks. Then a BERT-based model fine-tuned on MRI reports is used to classify each bookmark as one of the 41 abdominal lymph node types, non-abdominal lymph node, or not a lymph node. We evaluated our framework on 2,099 bookmarks manually annotated by a radiological expert. We also compared our framework with a rule-based system and other BERT-based models. We find that our framework achieved 0.912 in F1-score, outperforming the rule-based system and other BERT variations.
Our study has several limitations. First, our model is limited to the 41 abdominal lymph nodes. While we believe the list is comprehensive, we may miss some lymph node types due to training corpus bias. Second, our evaluation is performed on a single corpus. Cross-institutional experiments need to be performed in the future to evaluate the generalizability of the model. While our work only scratches the surface of using text mining and deep learning to extract lymph nodes from radiology reports, we hope it will shed light on the development of generalizable NLP models that can extract highly accurate labels.

Acknowledgment
This work was supported by the Intramural Research Programs of the NIH National Library of Medicine and NIH Clinical Center. This work was also supported by the National Library of Medicine of the NIH under award number 4R00LM013001. This work utilized the computational resources of the NIH HPC Biowulf cluster (http://hpc.nih.gov).