Cancer Registry Information Extraction via Transfer Learning

A cancer registry is a critical and massive database for which various types of domain knowledge are needed and whose maintenance requires labor-intensive data curation. In order to facilitate the curation process for building a high-quality and integrated cancer registry database, we compiled a cross-hospital corpus and applied neural network methods to develop a natural language processing system for extracting cancer registry variables buried in unstructured pathology reports. The performance of the developed networks was compared with various baselines using standard micro-precision, recall and F-measure. Furthermore, we conducted experiments to study the feasibility of applying transfer learning to rapidly develop a well-performing system for processing reports from different sources that might be presented in different writing styles and formats. The results demonstrate that the transfer learning method enables us to develop a satisfactory system for a new hospital with only a few annotations and suggest more opportunities to reduce the burden of cancer registry curation.

continuously are critical issues and burdens of the healthcare system. However, maintaining an individual cancer registry from patient healthcare trajectories requires different types of domain knowledge and is both labor-intensive and time-consuming. In addition, how to validate and integrate data between different hospitals, or between local healthcare resources and the national database, is a crucial topic.
To facilitate the integration of models for a specific cancer, applying information technology tools to improve the acquisition and classification of patients' healthcare trajectories can enable more accurate phenotyping of cancer information. Nevertheless, addressing these issues requires closer cooperation between information technology and medical experts. In order to assist integration among the institutes, a national project was established under the Cancer Center Support Grant Program (CCSG) supported by MOHW. As the coordinator of this project, we conducted research studies and cooperated with several hospitals to establish a platform for working out a model system based on existing cancer data.
One major goal of this project is to apply natural language processing (NLP) techniques to automatically analyze unstructured data, including surgical reports, pathology reports, oncology clinical notes, and laboratory findings, that may not be easy to acquire or share across hospitals for specific cancers. Pathology reports are usually abundant and contain operative findings, general tumor information, pathological assessment, cancer staging, and end results, all of which need to be extracted and classified clearly. In the pilot study, we focus on the collection and deidentification of pathology reports and on data annotation for developing and evaluating deep learning-based NLP systems that extract cancer registry variables from different hospital sites.
To standardize the annotation of pathology material for developing our NLP system, the variables and their definitions were set by consensus of an expert committee composed of hospital investigators and annotators. Furthermore, we applied transfer learning and conducted experiments to examine the performance of the developed neural networks on the cross-hospital pathology materials, in order to gain insight into how effective and concise transfer learning can be. The results not only show which layers of the developed network carry the most important parameters for transfer, but also indicate how many annotations are needed to train a system for a new hospital to reasonable performance.

Datasets
In the present study, we primarily focused on colorectal cancer, which is the third leading cause of cancer-specific death in Taiwan. We cooperated with two medical centers, namely China Medical University Hospital (CMUH) and Kaohsiung Medical University Chung-Ho Memorial Hospital (KMUH), to collect colorectal pathology reports. Non-tumor reports and reports without cancer registration data were excluded when compiling our corpora. Table 1 shows the grouping and the number of the collected datasets.

Corpus Construction
In order to produce high-quality annotations for developing our system, we established an NLP working group focusing on corpus construction. The annotation process was conducted by eight annotators based on an annotation guideline developed in consultation with a committee composed of hospital investigators and cancer registrars. Following the standard of the American Joint Committee on Cancer, nine cancer registry variables were defined for extraction in order to achieve a better and unified understanding of the pathology materials. Table 2 summarizes the nine variables defined for colorectal cancer, including stage classification (SC), pathological TNM classifications (TNM), the number of examined nodes (NE) and positive nodes (PN), tumor size (TS), histology types (H), and grades (G). The entire annotation process is elaborated as follows. A preliminary consistency test was conducted by asking the annotators to individually annotate an identical set of 100 reports randomly selected from the collected datasets. All of them used the annotation tool (Figure 1) developed by our collaborator to conduct their annotations. We then measured their inter-annotator agreement using the Kappa statistic (Viera & Garrett, 2005).
Afterwards, a labeling meeting was organized to discuss issues and concerns encountered during the annotation process, and the annotation guideline was adjusted according to the conclusions of the meeting. This process was repeated iteratively until the annotators reached substantial agreement or better. Finally, the remaining unlabeled reports were evenly distributed among the annotators for labeling. The same annotation process was applied separately to the data collected from each of the two hospitals.
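As an illustration of the agreement measure, the following minimal sketch computes Cohen's kappa for one pair of annotators. The text does not state which kappa variant was used for the eight annotators, so the pairwise form, the function name `cohen_kappa` and the toy labels are assumptions made for illustration only.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items that received identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Toy token-level labels from two hypothetical annotators.
annotator_a = ["T", "T", "N", "O", "O", "O", "G", "O"]
annotator_b = ["T", "N", "N", "O", "O", "O", "G", "O"]
kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

With more than two annotators, one option is to average this value over all annotator pairs; a multi-rater statistic such as Fleiss' kappa is another.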
The annotations produced independently by all annotators on the same 100 reports were collected as the test set for evaluating the developed systems. They were combined by voting: only annotations made by more than four annotators were kept. The remaining reports, which were distributed evenly among the annotators, were collected as the training sets.
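The voting rule can be sketched as follows. This is a minimal illustration that assumes each annotation is an exact `(start, end, label)` span and that two annotators agree only when their spans are identical; the function name and the toy data are hypothetical.

```python
from collections import Counter

def vote_annotations(per_annotator, min_votes=5):
    """Keep the spans produced by at least `min_votes` annotators.

    per_annotator: one collection of (start, end, label) spans per annotator.
    "More than four annotators" in the text corresponds to min_votes=5.
    """
    # set() ensures that one annotator contributes at most one vote per span.
    counts = Counter(span for spans in per_annotator for span in set(spans))
    return sorted(span for span, votes in counts.items() if votes >= min_votes)

# Eight hypothetical annotators labeling one report.
span_t, span_n, span_g = (0, 2, "T"), (5, 6, "N"), (9, 10, "G")
annotations = (
    [{span_t, span_n, span_g}] * 4   # four annotators marked all three spans
    + [{span_t, span_g}]             # a fifth marked T and G only
    + [{span_t}] * 3                 # the remaining three marked T only
)
kept = vote_annotations(annotations)
print(kept)
```

Here the T span receives eight votes and the G span five, so both are kept, while the N span receives only four votes and is dropped.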

Cancer Registry Information Extraction with Different Approaches
For a given pathology report, our clinical toolkit (Dai, Syed-Abdul, Chen, & Wu, 2015) was employed to segment sentences and generate tokens based on MedPost (Smith, Rindflesch, & Wilbur, 2004). The numerical normalization method proposed by Tsai et al. (2006) was employed to reduce variations in numerical parts of each token. We then formulated the problem as a sequential labeling task and applied the IOB-2 tag scheme to encode the span information generated by annotators. All sequences including those that did not contain any annotations were included in the training set to train a neural sequence labeling network model whose architecture is briefly described as follows.
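The IOB-2 encoding mentioned above can be illustrated with a short sketch. The tokens, span offsets and label names below are made up for the example; spans are assumed to be given as token indices with an exclusive end.

```python
def spans_to_iob2(tokens, spans):
    """Encode token-level spans as IOB-2 tags.

    spans: list of (start, end, label) over token indices, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"      # first token of a span
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # tokens inside the span
    return tags

tokens = ["Tumor", "size", ":", "3.5", "x", "2.0", "cm", ",", "pT3"]
spans = [(3, 7, "TS"), (8, 9, "T")]
tags = spans_to_iob2(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Every token outside an annotated span receives the tag `O`, which is why sequences without any annotations can still be used as training instances.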
The input of the network is the pre-processed sequence of tokens in a pathology report, and the output is the sequence of labels for the tokens. Each input token was represented as a vector by concatenating the pre-trained word representations obtained from GloVe (Pennington, Socher, & Manning, 2014) and RoBERTa (Liu et al., 2019). The parameters of the concatenated vectors were kept fixed during the training process. The concatenated representation was then fed to a fully connected layer (denoted as FC1) with variational dropout before the embeddings were passed into a one-layer bidirectional long short-term memory (BiLSTM) network with 256 hidden nodes. The output of the BiLSTM layer goes through another fully connected layer (denoted as FC2) to generate an output whose size equals the number of classes. This output becomes the input of the inference layer, in which a conditional random field (CRF) layer models the dependencies between neighboring labels and is trained with the Viterbi loss to jointly decode the best chain of labels for the given sequence.
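To make the role of the inference layer concrete, the sketch below shows Viterbi decoding over per-token scores (standing in for the FC2 outputs) and tag-to-tag transition scores. It illustrates the decoding step only, not the training loss or the authors' implementation; the tag set and all scores are invented for the example.

```python
def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions: list of dicts, emissions[t][tag] = score of `tag` at position t.
    transitions: dict, transitions[(prev, cur)] = score of moving prev -> cur.
    """
    # best[t][tag] = best score of any path that ends in `tag` at position t.
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[(p, cur)])
            scores[cur] = best[-1][prev] + transitions[(prev, cur)] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

tags = ["O", "B-TS", "I-TS"]
# All transitions are neutral except O -> I-TS, which IOB-2 forbids.
transitions = {(p, c): 0.0 for p in tags for c in tags}
transitions[("O", "I-TS")] = -10.0
emissions = [
    {"O": 2.0, "B-TS": 1.5, "I-TS": 0.0},
    {"O": 0.0, "B-TS": 0.0, "I-TS": 2.0},
]
path = viterbi_decode(emissions, transitions, tags)
print(path)
```

Choosing the best tag independently at each position would give `O` followed by `I-TS`, which is not a valid IOB-2 sequence; decoding the whole chain jointly avoids this.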
In addition to the aforementioned architecture, we implemented the following baselines for comparison:
Dictionary-based approach: For a given token, output the most frequently assigned tag estimated on the training sets.
Support vector machine (SVM): Formulate the task as a token-based classification task and apply SVM with a polylinear kernel to learn a classification model.
CRF: The normalized word features with a context window of three, along with transition features, were used for training a CRF model.
BiLSTM: Similar to the aforementioned network architecture, but a linear layer was used instead of the CRF layer as the output layer.
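The dictionary-based baseline above can be sketched in a few lines. The fallback tag for tokens unseen in training is an assumption (the text does not specify it), and the function names and toy data are illustrative.

```python
from collections import Counter, defaultdict

def train_dictionary(tagged_sentences):
    """Map each token to the tag it was assigned most often in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for token, tag in sentence:
            counts[token][tag] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}

def predict(lexicon, tokens, default="O"):
    """Tag tokens by lookup; unseen tokens receive `default` (an assumption)."""
    return [lexicon.get(token, default) for token in tokens]

training = [
    [("pT3", "B-T"), ("adenocarcinoma", "B-H")],
    [("pT3", "B-T"), ("size", "O")],
    [("pT3", "O")],
]
lexicon = train_dictionary(training)
predicted = predict(lexicon, ["pT3", "unseen", "adenocarcinoma"])
print(predicted)
```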
All of the above neural networks were implemented in PyTorch and trained on NVIDIA Tesla P100 GPUs. The CRF model was implemented with CRF++, and scikit-learn was used for the remaining implementations.

Transfer Learning for Extracting Information between Different Hospitals
Transfer learning (Pan & Yang, 2009)

Experiment Configurations
We conducted three experiments to study the characteristics of the compiled corpora and the effectiveness of the developed models on them. The first compared the proposed model with the aforementioned baseline methods. The second examined the effectiveness of transfer learning, and the last checked the robustness of the developed models under cross-corpus evaluation. The standard micro-precision (P), recall (R) and F-measure (F) were used to evaluate the models' outputs against the gold annotations. For training the neural networks in all of our experiments, we randomly held out 50 reports from the training sets as validation sets to select the best-performing models during the training process. The validation sets were not used in training. Mini-batch stochastic gradient descent (with a learning rate of 0.1, a momentum of 0.9 and a weight decay of 10^-5) was used for optimizing the parameters. Unless otherwise specified, the batch size and the number of epochs were set to 2,048 and 150 respectively in the following experiments. Training was stopped early if the learning rate fell below 10^-5. For consistency, we used the same set of hyper-parameters and a fixed random seed across all experiments.
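For reference, the micro-averaged metrics can be computed as in the sketch below. It assumes that annotations are `(start, end, label)` tuples and that a prediction counts as correct only on an exact match with a gold annotation; the matching criterion is not stated in the text, and the data are invented.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F-measure.

    gold, predicted: lists (one entry per report) of sets of annotations.
    Counts are pooled over all reports and labels before the ratios are taken.
    """
    tp = sum(len(g & p) for g, p in zip(gold, predicted))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f

gold = [{(0, 2, "T"), (5, 6, "N")}, {(1, 2, "G"), (7, 9, "TS")}]
predicted = [{(0, 2, "T")}, {(1, 2, "G"), (4, 5, "H")}]
p, r, f = micro_prf(gold, predicted)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

In this toy case two of three predictions are correct and two of four gold annotations are found, giving P = 2/3, R = 1/2 and F = 4/7.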

Corpus Statistics
A total of 2,008 reports collected from the two hospitals were annotated. The final Kappa values estimated for CMUH and KMUH are 0.802 (substantial) and 0.914 (almost perfect) respectively. Table 3 shows the detailed statistics of the compiled corpora. Although the KMUH corpus is much larger than the CMUH corpus, it contains far fewer annotations for pathological M. This is because determining the pathological M stage requires other reports (e.g., imaging reports from other examination divisions), so the pathological M stage was frequently inconclusive in the current KMUH pathology data.

Performance Comparison with Different Methods
In the first experiment, we trained the developed models on the two training sets separately and evaluated their performance on the test sets of the two hospitals. The results are shown in Table 4. In general, the developed models performed better on the KMUH test set, which may be due to the larger number of training samples. The CRF model achieved a comparable F-score on the KMUH test set, but its F-score is lower than that of BiLSTM-CRF by 0.214 on the CMUH test set. Table 5 shows the detailed results of the BiLSTM-CRF model for the nine annotation types on the two test sets. Overall, the developed networks demonstrated promising F-scores for all items.

The Effect of Transfer Learning
In this experiment, we sought to understand to what extent transfer learning improves performance on the cross-hospital datasets. We used KMUH as the source dataset since it is larger than CMUH. We examined the effect of transferring knowledge learned from KMUH to CMUH by 1) analyzing the importance of each layer of the developed neural networks, and 2) quantifying the performance gain when varying the size (20%~100%) of the CMUH training set used to fine-tune the model pre-trained on KMUH. Note that because the 20% CMUH dataset is quite small, we reduced the batch size to 512 for this case. Figure 2 shows the results. Here, "Non-transfer" means that we used only the reduced CMUH training sets to develop the BiLSTM-CRF models, without relying on any pre-trained parameters. "FC1" initialized the FC1 layer of the BiLSTM-CRF model with the parameters pre-trained on the KMUH corpus, "BiLSTM" additionally included the learned parameters of the BiLSTM layer of the source model, and so on. Considering the comparable results achieved by the CRF models, we also included the configuration "Non-transfer-CRF", in which we trained CRF models on the corresponding reduced CMUH datasets.
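The cumulative layer-wise initialization described above can be sketched with plain dictionaries standing in for model parameters. The parameter names, the layer order beyond "FC1" and "BiLSTM" (the text says only "and so on"), and the scalar values are assumptions for illustration; in a PyTorch model the same idea would apply to the entries of a state dictionary.

```python
# Bottom-to-top layer order, following the architecture description (assumed names).
LAYER_ORDER = ["fc1", "bilstm", "fc2", "crf"]

def transfer_parameters(source_params, target_params, up_to):
    """Initialize target layers from the source model, from the bottom up to `up_to`.

    Both arguments map "layer.parameter" names to values. Layers above `up_to`
    keep the target model's own (e.g. random) initialization.
    """
    transferred = set(LAYER_ORDER[: LAYER_ORDER.index(up_to) + 1])
    merged = dict(target_params)
    for name, value in source_params.items():
        if name.split(".", 1)[0] in transferred:
            merged[name] = value
    return merged

# Hypothetical parameters: source = pre-trained on KMUH, target = fresh CMUH model.
source = {"fc1.weight": 1.0, "bilstm.weight": 2.0, "fc2.weight": 3.0, "crf.transitions": 4.0}
target = {name: 0.0 for name in source}

fc1_config = transfer_parameters(source, target, up_to="fc1")
bilstm_config = transfer_parameters(source, target, up_to="bilstm")
print(bilstm_config)
```

After this initialization, all parameters would be fine-tuned on the reduced CMUH training set; the "Non-transfer" configuration corresponds to using `target` unchanged.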
In Figure 2, we can observe that the performance of the 'Non-transfer' models improves markedly as more training samples are used. However, the improvement for the CRF models is relatively flat compared with that of the neural networks. On the other hand, even with only 20% of the CMUH training set, the models initialized with transferred parameters achieved satisfactory F-scores and outperformed the 'Non-transfer' models trained on up to 60% of the full CMUH training set. These results suggest that we can exploit the parameters of neural networks learned from source hospitals to rapidly develop a reliable system from a small annotated dataset, which can in turn speed up the annotation process in the new hospital for creating and evaluating a customized system.
The results shown in Figure 2 also reveal the importance of the parameters of each layer of the developed model for transfer learning. We can observe that transferring the parameters of all layers generally leads to slightly better F-scores, but transferring the parameters of the first layer only is almost as effective as transferring all of them. This result is consistent with the observations of previous works (Giorgi & Bader, 2018, 2020; Lee, Dernoncourt, & Szolovits, 2018) and with the hypothesis that the lower layers of a neural network learn generic features while the higher layers learn task-specific (or, in our case, hospital-specific) features.

Cross-corpus Evaluations
To assess the performance of the developed model in a more realistic setup, we conducted cross-corpus evaluations. Table 6 shows the results. Even though both corpora were annotated by the same annotators under the same annotation guideline, the developed models do not generalize well; a large drop in performance can be found on both datasets. The results show that the formats and writing styles of descriptive pathology in surgical biopsy reports are heterogeneous across hospitals in real-world scenarios.
We also report the performance of the transferred model on its source dataset in Table 6. The F-score drops markedly from 0.976 to 0.762 on the KMUH test set. This demonstrates that the developed systems suffer from the catastrophic forgetting problem (French, 1999), a known challenge for artificial neural networks trained sequentially on multiple tasks: the weights that are important for the original task are changed to meet the objectives of the new task.

Conclusions
In this work, we investigated the feasibility of applying transfer learning via neural networks to the task of extracting cancer registry information from cross-hospital pathology reports. Because the writing styles and formats of pathology reports differ between hospitals, we conducted experiments on datasets collected from two hospitals to quantify the impact of transfer learning, with the aim of estimating how many annotated reports are required when migrating from one hospital to another and of iteratively improving the effectiveness of the developed systems. The evaluation results confirmed that, when transfer learning is adopted, a model pre-trained on a source hospital can be trained with fewer annotations from the target hospital and achieve satisfactory performance comparable to that obtained when the full training set of the target hospital is used. The results suggest that transfer learning can be applied to develop a customized system for a new hospital with only a few annotations. In future work, we will develop a method to estimate the required number of annotations based on the language properties of the narrative reports and the characteristics of the developed neural networks. Furthermore, our experimental results reveal challenges that remain to be addressed, including generalizability and the catastrophic forgetting problem.