Joint Learning with Pre-trained Transformer on Named Entity Recognition and Relation Extraction Tasks for Clinical Analytics

In drug development, protocols define how clinical trials are conducted, and are therefore of paramount importance. They contain key patient-, investigator-, medication-, and study-related information, often elaborated in different sections of the protocol texts. Granular-level parsing of large quantities of existing protocols can accelerate clinical trial design and provide actionable insights into trial optimization. Here, we report our progress in using deep learning NLP algorithms to enable automated protocol analytics. In particular, we combined a pre-trained BERT transformer model with joint-learning strategies to simultaneously identify clinically relevant entities (i.e. Named Entity Recognition) and extract the syntactic relations between these entities (i.e. Relation Extraction) from the eligibility criteria section of protocol texts. Compared to standalone NER and RE models, our joint-learning strategy can effectively improve the performance of the RE task while retaining similarly high NER performance, likely due to the synergy of optimizing toward both tasks’ objectives via shared parameters. The derived NLP model provides an end-to-end solution for converting unstructured protocol texts into a structured data source, which will be embedded into a comprehensive clinical analytics workflow for downstream trial design missions such as patient population extraction, patient enrollment rate estimation, and protocol amendment prediction.


Introduction
Clinical trial protocols, often called "study protocols" or just "protocols", are the foundational documents that specify the detailed plans for conducting clinical trials to validate the safety and/or efficacy of drugs. They contain key information about the targeted disease indications, the eligible patients, the investigated medication, the visit schedules, the treatment endpoints, etc. Across the entire lifecycle of clinical trials, from study design & planning to data analysis & publication, it is critical to comprehend this information accurately and unambiguously. However, since protocols are mainly unstructured or semi-structured texts (i.e. natural language), the application of computer-aided information extraction is challenging and thus limited. Current protocol analytic practices are labour- and time-intensive, involving numerous manual resource checking and cross-referencing activities. The pressing need to reduce the costs and boost the speed of drug development has created an industry-wide demand for a more efficient, effective, and scalable mechanism to process text-based protocols.
To address the above demand, we present in this paper our efforts and progress in developing a deep learning Natural Language Processing (NLP) approach to extract clinically relevant information from protocols. In particular, we targeted two tasks, Named Entity Recognition (NER) and Relation Extraction (RE), and transferred the Bidirectional Encoder Representations from Transformers model (BERT, a pre-trained transformer NLP model) via a joint-learning strategy to extract clinically relevant entities and their syntactic relationships simultaneously, training on our in-house clinical trial protocol corpus.
In alignment with the industry's patient-centric business emphasis, we focused this work on extracting the patient eligibility information from the "Eligibility Criteria" section in the protocols, which unambiguously determines whether a patient could be included in or excluded from the clinical trial. This is particularly important because patient recruitment is an essential and currently rate-limiting step in clinical trials. Accurate parsing of this part of protocols can facilitate quick identification of eligible patients as well as other clinical analytics missions.
Clinical trial protocols are a type of professional document with rigorous and highly domain-specific terms associated via complex yet precise relations. Like other professionally developed documents, protocols have to pass multiple quality-control checkpoints, and thus require less preprocessing (e.g. text correcting/cleaning) than many other types of documents, such as social media posts, before being submitted to NLP models. On the other hand, the domain-specific nature of protocols requires extra attention when transferring models trained on generic or other professional domains. To elaborate, a protocol contains many clinical and medical terms (e.g. medications and diseases) that are not commonly seen in other domains, yet those are exactly the entities that our model needs to recognize; furthermore, entities may be connected in dramatically different ways under different domain-specific contexts. For example, in the clinical domain, the word "trial" refers to "clinical trial" and is associated with "patients", "diseases", and "medicines"; whereas in the legal or even generic domains, "trial" commonly means "legal trial" and is frequently connected to "jury", "prosecutor", and "defendant". Therefore, the success of the transfer learning largely relies on maximizing domain-specific "gradients" for fine-tuning the model parameters.
This presented work is continued from our recent study on clinical protocols, in which we developed standalone BERT-based NLP models for NER and RE tasks for processing the "Eligibility Criteria" section in protocols (Chen et al., 2020).
Based on the observation that different clinically relevant entities are not equally involved in all relations, we hypothesized that by combining the NER and RE tasks in the same BERT network via a joint-learning strategy, the textual patterns of clinical trials may become more visible during training and thus improve the performance of both tasks. As will be shown in the later sections of this paper, our results largely validated this hypothesis and showed that the joint-learning model can provide significant improvement over standalone models.
This improved model is being embedded into an automated pipeline that aims to accelerate the current manual process of identifying similar clinical trials from the historical protocols, and to streamline the querying process of identifying potentially eligible patients for clinical trials.

Related Work
NER and RE are two classic NLP tasks that have been studied separately for decades. In its earlier developments, NER, as a sequence labeling task, mainly employed probabilistic sequence labeling techniques such as conditional random fields (CRF), maximum entropy Markov models, and hidden Markov models (Lafferty et al., 2001; McCallum et al., 2000; Bikel et al., 1998). More recently, researchers have started using the deep learning family of algorithms to capture the transitions between hidden states for NER tasks, including recurrent neural networks (RNN) and bidirectional long short-term memory (BiLSTM) together with CRF. Lately, pre-trained transformer models, with BERT as a prominent example, have been developed to represent contextual embeddings of text and have gained great success in NER along with other NLP tasks (Devlin et al., 2018; Lee et al., 2019).
RE, in turn, is usually treated as a text classification task over entity pairs of interest. Many classification algorithms, such as support vector machines, logistic regression, and perceptrons, have been applied to this problem (Bach and Badaskar, 2007; Jurafsky, 2000). Similar to NER, the latest developments in RE have also employed deep learning algorithms, using neural network models to capture entity relations with components such as attention, biaffine classifiers, and bidirectional tree-structured LSTM-RNNs (Nguyen and Verspoor, 2019; Wang et al., 2019a; Miwa and Bansal, 2016). Pre-trained models have also been used to provide contextualized encoding information to the neural network layers for the RE task (Lee et al., 2019). Although they can be tackled independently, NER and RE are in fact synergistically connected: if we knew two entities and their types in a sentence, it would be easier to classify their relation; similarly, if we knew the relation between two phrases, it would be less challenging to label their entity types. This has naturally motivated efforts in joint or multi-task learning for NER and RE, hoping to achieve better performance on both tasks by simultaneously training the same neural network towards combined objectives. Despite differences in their details, the practices in NER and RE joint learning usually share a general high-level architecture: they sequentially stack the word and sequence embedding layers, the NER prediction layer, the NER entity embedding layer, and the relation representation/handling layers. For the word and sequence contextualized embedding layers, where many network variations exist, researchers have investigated BiLSTM, RNN, and BERT pre-trained transformers (Bekoulis et al., 2018b; Wang et al., 2019a; Giorgi et al., 2019; Huang et al., 2019b; Katiyar and Cardie, 2017).
These studies usually emphasized evaluating different joint models, leaving the comparison between joint and standalone models largely uninvestigated.
Pre-trained transformer models, e.g. BERT, XLNet, and GPT, have achieved state-of-the-art performance across a great number of benchmark NLP tasks (Devlin et al., 2018; Yang et al., 2019; Radford et al., 2018). They provide the benefits of representing bidirectional context and encoding text sequences with a series of attention layers. From the transfer learning standpoint, various NLP tasks can be treated as downstream tasks appended to the pre-trained models, and the pre-trained parameters (usually learned from large-scale corpora in a generic domain) together with the NLP-task-specific parameters are fine-tuned via continued training on a relatively smaller, task-specific training data set. To enhance domain specificity, BERT has also been customized and retrained on domain-specific corpora such as biomedical texts; examples include BioBERT, ClinicalBERT, and SciBERT (Lee et al., 2019; Alsentzer et al., 2019; Beltagy et al., 2019). There has also been a surge of studies applying BERT in specific NLP contexts for fine-tuning tasks such as predicting hospital re-admission, extracting bacteria-biotope relations, and biomedical named entity normalization (Huang et al., 2019a; Jettakul et al., 2019; Li et al., 2019).
We have previously investigated fine-tuning pre-trained BERT models on a protocol corpus for NER and RE tasks separately. Encouraged by many successful studies on pre-trained transformers and joint models, we continued to explore joint-learning strategies combined with BERT to co-train the NER and RE tasks on our in-house clinical protocol corpus. We abstracted a neural architecture from two popular joint models and experimented with a number of variations (Bekoulis et al., 2018b; Giorgi et al., 2019). We believe these continued efforts not only help select the best-performing model for our applications, but also provide a comprehensive understanding of how various transfer learning strategies perform under a real-world setup, shedding light on developing business-oriented AI applications for the healthcare and clinical trial industry.

Data Set
Data. Our data set comprises the eligibility criteria sections from 470 Covance in-house drug development study protocols (less than 2% of the total number of in-house protocols). The eligibility criteria section in a protocol explicitly and unambiguously defines the rules to include or exclude a patient, thus directly determining the patient population available for the trial. The corpus contains 30,183 criteria sentences in total. We randomly split the sentences into training and test sets with a 2:1 ratio, resulting in 20,122 sentences for training and 10,061 for testing.
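For concreteness, the 2:1 random split described above can be sketched as follows (a minimal illustration; the seed and the exact shuffling procedure are our own assumptions, not taken from the study):

```python
import random

def split_sentences(sentences, train_ratio=2 / 3, seed=42):
    """Randomly split criteria sentences into training and test sets."""
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# 30,183 criteria sentences split 2:1 -> 20,122 train / 10,061 test
corpus = [f"criterion {i}" for i in range(30183)]
train, test = split_sentences(corpus)
```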
The sentences were manually annotated by biomedical experts. Clinically relevant entities and the associated entity relations were labelled according to our annotation guideline. We used the BIO tag format to denote the beginning, inside, and outside of the entities (Ramshaw and Marcus, 1999). We focused on 15 types of entities and 7 types of syntactic relations (as shown in Table 1 and Table 2).

Joint Model Architecture

After reviewing the previous NER and RE joint models, we established a general network architecture that includes the key components for joint learning while allowing experiments with variations in local network designs. The main structure of the joint model is shown in Figure 1. We used the BERT pre-trained transformer as the embedding/encoding layer. The NER layer follows the BERT layer and uses softmax for NER classification. More specifically, it takes the BERT output vectors as its input, passes them through a fully connected layer, and then through an output layer where NER labels are classified using the softmax function (Goodfellow et al., 2016). The NER classification loss function based on cross-entropy is:

L_{NER} = -\sum_{i=1}^{n} \log \frac{e^{s_{i,l_i}}}{\sum_{j=1}^{k} e^{s_{i,c_j}}}

where n is the total number of NER tokens, l_i is the actual NER label for the i-th token, k is the number of NER label classes, c_j denotes any of the NER label classes, s_{i,l_i} is the linear score for the i-th token belonging to its actual class l_i, and s_{i,c_j} is the linear score for the i-th token belonging to entity class c_j. Following the NER layer, we appended an NER label embedding layer, whose output is concatenated with the output from the BERT layer to serve as the input for the subsequent RE task. Because an entity could be paired with other entities before or after it in a sentence, it should be mapped differently in these two cases.
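The NER cross-entropy loss above can be sketched in plain NumPy (illustrative only; in the actual model this is computed inside the network over BERT outputs, and the function name and array shapes here are our own):

```python
import numpy as np

def ner_loss(scores, labels):
    """Cross-entropy NER loss over linear (pre-softmax) scores.

    scores: (n, k) array of linear scores s_{i,c_j} for n tokens
            over k NER label classes.
    labels: (n,) array of gold label indices l_i.
    """
    # log-softmax with a max-shift for numerical stability
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # negative log-likelihood of each token's gold class, summed over tokens
    return -log_probs[np.arange(len(labels)), labels].sum()
```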
In our model, the entity vectors are processed in the relation pair handling layer by 1) mapping them through a fully connected layer to head vectors, representing entities as heads in a pair, and 2) mapping them through a parallel fully connected layer to tail vectors, representing entities as tails in a pair. An entity pair, composed of a head and a tail vector, is then passed through a classification function, either a softmax or a biaffine function, to produce the RE classification result, i.e., the relation type between the two entities. More details about the RE model variations can be found in section 4.2.1.
The RE loss function is also cross-entropy based:

L_{RE} = -\sum_{i=1}^{n} \log \frac{e^{s_{i,q_i}}}{\sum_{j=1}^{k} e^{s_{i,r_j}}}

where n is the total number of relations, q_i is the actual relation label for the i-th entity pair, s_{i,q_i} is the linear score for the i-th relation belonging to its actual class q_i, k is the number of relation types, and s_{i,r_j} is the linear score for the i-th relation belonging to relation class r_j. The overall joint model loss is derived by summing the NER and RE losses:

L_{joint} = L_{NER} + L_{RE}

Model Options
By keeping the NER layers unchanged in this general network architecture, we further experimented with different options for the RE sub-network and evaluated their effects on joint task performance.

RE Sub-network Options
For the RE task, we explored methods of representing entity pairs and classifying their relations, which are rendered as the relation handling and classification layers in Figure 1. We tested two options, denoted re m1 and re m2 respectively. Model option re m1 is based on (Bekoulis et al., 2018a,b): it passes entity vectors derived from the NER layer through one fully connected layer to obtain head entity representations and through another fully connected layer to obtain tail entity representations, and then adds the vectors of a pair of entities (i.e. head and tail entities) to serve as the relation vector for the pair:

h_{i,j} = h_i + h_j

where h_i and h_j are the vectors for the head and tail entities and h_{i,j} is the resulting vector from summing the two. We subsequently constructed a fully connected layer to classify the relation vectors. Differing from (Bekoulis et al., 2018a,b), in which the RE classes were assumed not to be mutually exclusive and RE classification was treated as a multi-label task using a sigmoid function, relations in our study are mutually exclusive, and hence we used a softmax function to classify the relations. Also note that (Bekoulis et al., 2018a,b) used bidirectional LSTM for embedding/encoding, which we replaced with BERT as described in 4.1. The other model option, re m2, is similar to the practice in (Giorgi et al., 2019; Nguyen and Verspoor, 2019): we first applied two parallel fully connected layers to derive the head and tail entity representations respectively, and then performed biaffine classification using the head and tail vectors in an entity pair.
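The re m1 option can be sketched in NumPy as follows (a simplified, hypothetical illustration: the weight shapes and function names are our own, and bias and activation terms are omitted):

```python
import numpy as np

def relation_vector(entity_vecs, i, j, W_head, W_tail):
    """re m1-style pair representation: project entity i through the head
    layer and entity j through the parallel tail layer, then add."""
    h_i = entity_vecs[i] @ W_head  # head entity representation
    h_j = entity_vecs[j] @ W_tail  # tail entity representation
    return h_i + h_j               # additive relation vector h_{i,j}

def classify_relation(h_ij, W_rel, b_rel):
    """Softmax over mutually exclusive relation types."""
    logits = h_ij @ W_rel + b_rel
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()
```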
The biaffine classification function is:

f(h_i, h_j) = h_i^T U h_j + W (h_i || h_j) + b

where h_i and h_j denote the head and tail entity vectors respectively, U is a tensor of size m × l × m, W is a matrix of size l × 2m, with m being the hidden size of the head/tail vector and l being the number of RE labels, h_i || h_j denotes concatenating the two vectors, and b is a bias vector of size l.
The above biaffine function has a bilinear term h_i^T U h_j and a linear term W(h_i || h_j), along with the bias term. We experimented with either including both the bilinear and linear terms (bilinear + linear) or including only the bilinear term (bilinear only).
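The biaffine scorer with its bilinear-only switch can be sketched as follows (a minimal NumPy illustration; the tensor layout and function name are our own assumptions):

```python
import numpy as np

def biaffine(h_i, h_j, U, W, b, bilinear_only=False):
    """Biaffine relation scores: h_i^T U h_j + W (h_i || h_j) + b.

    h_i, h_j: (m,) head/tail entity vectors.
    U: (m, l, m) tensor, W: (l, 2m) matrix, b: (l,) bias,
    where l is the number of RE labels.
    """
    # bilinear term h_i^T U h_j, yielding one score per relation label
    scores = np.einsum('a,alc,c->l', h_i, U, h_j)
    if not bilinear_only:
        # linear term over the concatenated pair (h_i || h_j)
        scores = scores + W @ np.concatenate([h_i, h_j])
    return scores + b
```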

RE Negative Sample Construction during Training
A common challenge in RE classification tasks is the overwhelming number of possible negative samples: in principle, any entity pair without a syntactic relation is a negative sample. To address this challenge, we evaluated two negative sample construction strategies. One strategy is to scan through all possible entity pairs and mark the pairs without syntactic relations as negative relation samples. Since this option relies on relation information from gold standard data, we denote it gs-based. The other strategy, denoted incremental, incrementally builds negative relation samples based on NER-predicted labels, as in (Giorgi et al., 2019; Nguyen and Verspoor, 2019). More specifically, in the incremental strategy, an entity pair is included as a negative sample if 1) the two entities in the pair are correctly predicted by the NER layer and have no relation, or 2) either entity in the pair is incorrectly predicted by the NER layer. Hence, the former way of constructing negative samples is static, as the samples remain unchanged throughout training, whereas the latter is dynamic, as whether an entity pair is included as a negative sample depends on the NER prediction results during training.
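The incremental strategy's two inclusion rules can be sketched as follows (a hypothetical illustration: the span/label data structures are our own, not the study's actual implementation):

```python
def incremental_negatives(gold_entities, pred_entities, gold_relations):
    """Build negative RE samples from NER predictions.

    gold_entities / pred_entities: {span: entity_label} dicts.
    gold_relations: set of (head_span, tail_span) pairs holding a relation.
    A pair is negative if 1) both entities are correctly predicted and the
    pair has no gold relation, or 2) either entity is mispredicted.
    """
    negatives = []
    spans = list(pred_entities)
    for head in spans:
        for tail in spans:
            if head == tail:
                continue
            head_ok = gold_entities.get(head) == pred_entities[head]
            tail_ok = gold_entities.get(tail) == pred_entities[tail]
            unrelated = (head, tail) not in gold_relations
            if (head_ok and tail_ok and unrelated) or not (head_ok and tail_ok):
                negatives.append((head, tail))
    return negatives
```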

Evaluation Options
We evaluated the joint-learning model's performance on the NER and RE tasks by reporting micro-level precision, recall, and f1-measure for both tasks. For RE, we evaluated on relations between gold standard entities without considering NER-predicted entities (the gs-based option), and also on relations yielded from NER-predicted entities (the end-to-end option). In other words, the gs-based option evaluates RE performance when we know which tokens are actual entities, and the end-to-end option evaluates performance in the scenario where we do not have actual entity information, which is the more realistic scenario when evaluating an RE model in production systems.
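Micro-level precision, recall, and f1 under either evaluation option reduce to set comparisons between predicted and gold items. A minimal sketch (the tuple encodings in the docstring are our own assumptions):

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision/recall/f1 over hashable labeled items,
    e.g. (sentence_id, span, entity_label) tuples for NER or
    (sentence_id, head_span, tail_span, relation_label) tuples for RE.
    Under the end-to-end option, `predicted` is built from NER-predicted
    entities, so NER errors propagate into the RE scores."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # true positives: exact matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```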

Standalone Models
To assess the effects of the joint-learning options, we built standalone NER and RE models from the corresponding sub-networks in the joint-learning architecture and evaluated their performance separately.
For the NER standalone model, following the BERT layer, we added a fully connected layer with softmax classification. For the RE standalone model, instead of having an intermediate NER layer, we appended two parallel fully connected layers directly on the BERT output to derive the head and tail entity representations, and then classified entity pair relations using a softmax function. For standalone model evaluations, we employed the same precision/recall/f1-measures as in the joint models by evaluating against the gold standard (the gs-based option).
It is worth noting that in real-world practice, we do not know which tokens are entities, so we have to use NER predictions as entity input for RE evaluation. Thus, we included a real-world-inspired end-to-end metric for RE that evaluates performance using the NER standalone model's predictions as inputs, which effectively takes into account the propagated NER prediction errors (the end-to-end evaluation option).

Pre-trained Models
For pre-trained models, we experimented with BERT base, a smaller version of BERT comprising 110 million parameters, and BioBERT, a model derived from retraining the original BERT on large-scale biomedical texts (Lee et al., 2019). We chose the uncased version of BERT base, in which all text is lower-cased; and since BioBERT is available only in a cased version, we used that model with the original casing preserved.

Hyperparameters
We used the same hyperparameter values across all the models, as shown in Table 3. For the BERT layer hyperparameters we used the same values as in the original BERT model. The models were implemented using the TensorFlow library.

Results and Analysis
Our results are summarized in Table 4 and we elaborate our findings below.

re m1 vs. re m2 RE sub-network option. For the NER task, the four re m1 models performed similarly to the eight re m2 models. The highest recall, precision, and f1-measure are achieved by re m2 with the gs-based negative sampling option (model #12), which performs only marginally better than the other re m1 and re m2 models. For the RE task, in the BERT scenario, the re m2 models greatly outperform the re m1 models in all three measures (P/R/F). For example, model #5, a re m2 model using gs-based negative sampling, achieved an end-to-end f1-measure of 58.14%, whereas its counterpart, model #1 in the re m1 model family, has an f1-measure of 44.25%, a 13.89% drop from model #5. This result demonstrates that the entity pair representation and biaffine classification used in re m2 can lead to much better RE performance than the softmax classification used in re m1. However, for the BioBERT pre-trained model, the result is not as decisive: re m2 does not consistently outperform re m1. For example, model #15 (re m2) has better RE performance than model #11 (re m1), yet model #12 (re m2) exhibits lower RE performance than model #11 (re m1).
Biaffine variations for the re m2 option. Within the re m2 model, we evaluated classifying relations using both the bilinear and linear parts of the biaffine function (bilinear + linear) or using only the bilinear part (bilinear only). The results are exhibited as models #3 to #6 (BERT) and #12 to #15 (BioBERT) in Table 4. The two strategies achieved similar results on the NER task for both the BERT and BioBERT cases. For RE end-to-end performance, the bilinear only strategy combined with BERT and gs-based negative sampling (model #5) achieved the best f1-measure and recall, and the bilinear + linear strategy together with BioBERT and gs-based negative sampling achieved the highest precision (model #12). Overall, we observed that the biaffine classification options play a less significant role in model performance compared to other modeling components such as the negative sampling strategies and pre-trained model options.
gs-based vs. incremental RE negative sampling. We tested the two negative sampling strategies on both the re m1 and re m2 models (models #1 to #6 and #10 to #15 in Table 4). When using the pre-trained BERT model, we observed that gs-based negative sampling outperforms the incremental option (models #1 vs. #2, #3 vs. #4, #5 vs. #6) by a significant margin. In particular, model #1 exceeded #2 by 4.45% in end-to-end f1-measure, and models #3 and #5 exceeded #4 and #6 by 4.91% and 5.47%, respectively. Interestingly, in contrast, for the BioBERT case, the incremental strategy is superior to the gs-based strategy by an even larger margin, e.g. model #11 exceeding #10 by 19.69%. Therefore, the effect of the RE negative sampling strategy is jointly determined with the pre-trained model option, and can be significant.
Joint-learning vs. standalone model. Our results show that joint-learning models generally improve RE performance over the standalone RE model but do not significantly affect the NER task (about a 1% drop in f1-measure). Because the incremental strategy requires the NER network, the standalone RE model could only be evaluated using the gs-based strategy, and we had to use gold standard entity information as RE input. The joint-learning models outperform the standalone RE model in most scenarios when measured with the gs-based evaluation option. When conducting the end-to-end RE task, the joint-learning models exhibit dramatic performance improvements over the standalone models, e.g. an f1-measure of 58.14% for model #5 (joint) vs. 48.15% for #9 (standalone), and 55.37% for model #15 (joint) vs. 26.41% for #18 (standalone). Despite the slightly weaker performance on the NER task, the large gain on the end-to-end RE task demonstrates that the joint-learning models are a better solution for real-world applications.
BERT vs. BioBERT. Comparing the two pre-trained models on the NER task, BioBERT yields a slightly better result (around 70%) than BERT (around 69%), possibly due to its additional language model pre-training on biomedical corpora. BERT-based joint-learning models using the re m2 sub-network option outperform those using re m1, but this trend does not hold for the BioBERT-based joint-learning models. The standalone RE model's performance is severely impacted when built on BioBERT (model #18). Although BioBERT-based models achieve reasonable performance under some joint-learning strategies (e.g. model #13), they fail under others (e.g. model #10). These results indicate that joint learning with BERT is more robust, with more stable performance, than with BioBERT.
In summary, our results demonstrate that joint learning is a superior strategy, thanks to its steady and significant performance gain on the end-to-end RE task. However, since no model achieves the best NER and RE performance simultaneously, it is still necessary to balance the two tasks' performances when choosing the proper joint-learning model to prioritize production needs.

Conclusion and Future Work
In this reported work, we employed joint-learning models to identify entities and relations in clinical protocols using pre-trained transformer NLP deep learning models. To the best of our knowledge, this is the first attempt to tackle the NER and RE tasks in a joint and pre-trained deep learning fashion on real-world protocols, which are inherently a corpus of high complexity. Our contribution is threefold: 1) we abstracted from the literature a neural network architecture combining a pre-trained transformer model with joint learning for the NER & RE tasks, 2) we experimented with different model options based on this joint-learning network architecture, and 3) we examined performance on a complex clinical corpus, a less studied but highly impactful domain for such tasks. Our results demonstrated that joint-learning models can greatly improve RE performance over standalone models despite a minor decrease in NER performance. Among all the evaluated joint-learning strategies, the biaffine RE model with gold-standard-based negative sampling, together with the BERT pre-trained model, led to generally better performance than the other strategies. These results provide evidence for the effectiveness of joint and deep learning in parsing clinical protocol text; for future work, we will continue exploring more sophisticated joint and multi-task learning network architectures to further enhance NER and RE parsing performance.