Trigger Word Detection and Thematic Role Identification via BERT and Multitask Learning

Predicting the relationships among diseases, genes, and their mutations is an important knowledge extraction task that can potentially help drug discovery. In this paper, we present our approaches for trigger word detection (task 1) and thematic role identification (task 2) in the AGAC track of the BioNLP Open Shared Task 2019. Task 1 can be regarded as traditional named entity recognition (NER), which captures molecular phenomena related to gene mutation. Task 2 can be regarded as relation extraction, which captures the thematic roles between entities. For both tasks, we exploit pre-trained biomedical language representation models (i.e., BERT) in an information extraction pipeline for collecting mutation-disease knowledge from PubMed. We also design a fine-tuning technique and extra features using multi-task learning. The experimental results show that our proposed approaches achieve $F_1$ scores of 0.60 (ranked 1st) and 0.25 (ranked 2nd) on task 1 and task 2, respectively.


Introduction
Using natural language processing methods to discover and mine drug-related knowledge from text has been a hot topic in recent years. For the goal of drug repurposing, an active gene annotation corpus (AGAC) was developed as a benchmark dataset (Wang et al., 2018b). The AGAC track, which is part of the BioNLP Open Shared Task 2019, aims to gather text mining approaches from the BioNLP community to propel drug-oriented knowledge discovery. It consists of three tasks for the extraction of mutation-disease knowledge from PubMed abstracts: trigger words NER, thematic roles identification, and mutation-disease knowledge discovery. We participated in the trigger words NER and thematic roles identification tasks.
Recently, pre-trained models have become the dominant paradigm in natural language processing. They have achieved state-of-the-art performance across a wide range of related tasks, such as textual entailment, natural language inference, and question answering. BERT, proposed by Devlin et al. (2019), has achieved remarkable results on the GLUE leaderboard (Wang et al., 2018a) with a deep Transformer architecture. BERT first trains a language model on a large-scale unlabeled corpus, and the pre-trained model is then fine-tuned to adapt to downstream tasks. This fine-tuning process can be seen as a form of transfer learning, in which BERT learns knowledge from the large-scale corpus and transfers it to downstream tasks. While BERT was built for general-purpose language understanding, there are also pre-trained models following the BERT architecture that leverage domain-specific knowledge from a large set of unannotated biomedical texts (e.g., PubMed abstracts, clinical notes), such as SciBERT (Beltagy et al., 2019), BioBERT, and NCBI BERT. These models can effectively transfer knowledge from a large amount of unlabeled text to biomedical text mining models with minimal task-specific architecture modifications.
In this paper, we investigate different methods to combine and transfer the knowledge from the three different sources and illustrate our results on the AGAC corpus. Our method is based on fine-tuning BERT base, NCBI BERT, and BioBERT using multi-task learning, which has been shown to be efficient for knowledge transfer (Liu et al., 2019), and on integrating the models for both tasks with ensembles. The proposed methods prove effective for natural language understanding in the biomedical domain: we rank first on task 1 (Trigger words NER) and second on task 2 (Thematic roles identification).

Figure 1: The pipeline of our approach. We first split PubMed abstracts into sentences, tokenize them into words, and extract features such as POS tags. We then apply a BERT-based method for NER offset and entity recognition, and finally predict relations for each potential entity pair.

Background
The model architecture of BERT (Devlin et al., 2019) is a multi-layer bidirectional Transformer encoder based on the original Transformer model (Vaswani et al., 2017). The input representation is the sum of the WordPiece embeddings (Wu et al., 2016), the positional embeddings, and the segment embeddings. A special classification token ([CLS]) is inserted as the first token and a special token ([SEP]) is added as the final token. The model is first pre-trained on large-scale unlabeled text with two strategies, i.e., masked language modeling and next sentence prediction. The pre-trained BERT model provides a powerful context-dependent sentence representation and can be used for various target tasks, e.g., text classification and machine comprehension, through the fine-tuning procedure.
Hence, the BERT model can be easily extended to an information extraction pipeline for the medical domain, which first extracts the trigger words and then determines the relationships between them, as illustrated in Figure 1.

Task 1: Trigger Words NER
Task 1 aims to identify trigger words in PubMed abstracts and annotate them with the correct trigger labels or entity types (Var, MPA, Interaction, Pathway, CPA, Reg, PosReg, NegReg, Disease, Gene, Protein, Enzyme). It can be seen as an NER task that involves identifying many domain-specific proper nouns in the biomedical corpus.
We first split each PubMed abstract into sentences on '\n' or '.', and convert each sentence into words with the NLTK tokenizer. The words are then further tokenized into word pieces $x = (x_1, \ldots, x_T)$, and we take the last-layer BERT representation $H = (h_1, \ldots, h_T)$. To make better use of word-level information, the POS tag and the word shape embedding (Liu et al., 2015) of each word are concatenated with the output of BERT. The result passes through a single projection layer, followed by a conditional random field (CRF) layer with a masking constraint, to calculate the token-level label probabilities $p = (p_1, \ldots, p_T)$. When fine-tuning BERT, we found that the BIO tagging scheme performed better than BIOES.

We further extend our model to multi-task learning, in which models are jointly trained by sharing the architecture and parameters. Here, multi-task learning means joint learning with other biomedical corpora, despite the differences among the data sets. The assumption is that this makes more efficient use of the data and encourages the models to learn more generalized representations. More specifically, the same token-level information and BERT encoder are shared, and each data set has its own output layer, e.g., a CRF layer. The final loss function is defined over the true tag sequence $y^{c_i}$ and the input tokens $x^{c_i}$ of each corpus $c_i$, with weighting parameters $\lambda_{c_i}$ and $\lambda_r$.
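The preprocessing steps described above (sentence splitting, word shape features, and BIO tagging) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the word shape mapping (uppercase to `X`, lowercase to `x`, digit to `d`, with repeated symbols collapsed) is one common scheme assumed here rather than the exact one from Liu et al. (2015), and entity spans are assumed to be word-level `(start, end, label)` triples with an exclusive end.

```python
import re
from typing import List, Tuple


def split_sentences(abstract: str) -> List[str]:
    """Split an abstract on newlines or periods, dropping empty pieces.

    Note: splitting on every '.' also breaks decimals such as "0.5";
    the sketch keeps the simple rule stated in the text.
    """
    parts = re.split(r"[\n.]", abstract)
    return [p.strip() for p in parts if p.strip()]


def word_shape(word: str) -> str:
    """Map a word to a coarse shape string (assumed scheme, see lead-in)."""
    shape: List[str] = []
    for ch in word:
        if ch.isupper():
            symbol = "X"
        elif ch.islower():
            symbol = "x"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch  # keep punctuation such as '-' unchanged
        # Collapse runs of the same symbol so shapes stay short.
        if not shape or shape[-1] != symbol:
            shape.append(symbol)
    return "".join(shape)


def bio_tags(num_words: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert word-level entity spans (end exclusive) into BIO tags."""
    tags = ["O"] * num_words
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


sentences = split_sentences("BRCA1 mutation impairs repair.\nIt is linked to cancer.")
words = sentences[0].split()  # whitespace split as a stand-in for the NLTK tokenizer
shapes = [word_shape(w) for w in words]
tags = bio_tags(len(words), [(0, 1, "Gene"), (1, 2, "Var")])
```

In a full model, the shape strings and POS tags would be mapped to embeddings and concatenated with the BERT output before the projection and CRF layers.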

Task 2: Thematic Roles Identification
Task 2 is to identify the thematic roles (ThemeOf, CauseOf) between trigger words. We treat it as a multi-label classification problem by introducing a "no relation (NA)" label. When constructing the training data for task 2, we use pairs of entities that are no more than one sentence apart. Examples for the NA label are obtained by random sampling. At test time, a pair is assigned the thematic role with the maximum probability if that probability is larger than a threshold; otherwise, it is predicted as no relation. We also anonymize each target named entity by replacing it with a predefined tag (such as %Disease), and we append the two concrete predicted entity words, separated by the [SEP] tag, after each sentence. Following Shi and Lin (2019), we also add to each token its token-level relative distance to the entities, i.e., 0 for a position $t$ between the two entities, $t - s$ for tokens before the first entity, and $t - e$ for tokens after the second entity, where $s$ and $e$ are the starting position of the first entity and the ending position of the second entity after tokenization, respectively. The relation logits for the two entities are computed by a single output layer on top of BERT, applied to $h_{cls}$, the hidden state of the first special token ([CLS]).
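The relative distance feature and the thresholded decision rule described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, tokens inside the two entity spans are assumed to receive distance 0 along with the tokens between them, and the threshold value of 0.5 is a placeholder, since the source does not report the value used.

```python
from typing import Dict, List


def relative_positions(num_tokens: int, s: int, e: int) -> List[int]:
    """Relative distance of each token position t to the entity pair.

    s: start position of the first entity; e: end position of the second.
    Tokens before s get t - s (negative), tokens after e get t - e
    (positive), and everything in [s, e] gets 0 (assumption: this
    includes the entity tokens themselves).
    """
    positions = []
    for t in range(num_tokens):
        if t < s:
            positions.append(t - s)
        elif t > e:
            positions.append(t - e)
        else:
            positions.append(0)
    return positions


def decide_relation(probs: Dict[str, float],
                    threshold: float = 0.5,  # hypothetical value
                    na_label: str = "NA") -> str:
    """Return the most probable role if it clears the threshold, else NA."""
    best = max(probs, key=probs.get)
    if best != na_label and probs[best] > threshold:
        return best
    return na_label


features = relative_positions(7, s=2, e=4)
confident = decide_relation({"ThemeOf": 0.7, "CauseOf": 0.2, "NA": 0.1})
uncertain = decide_relation({"ThemeOf": 0.4, "CauseOf": 0.35, "NA": 0.25})
```

In the second call the top role does not clear the threshold, so the pair falls back to NA, which trades recall for precision as the threshold rises.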

Experiments
In this section, we report the leaderboard performance and analyze the effect of different model settings.

Experimental Setup
The AGAC track organizers developed an active gene annotation corpus (AGAC) (Wang et al., 2018b; Gachloo et al., 2019) for knowledge discovery in drug repurposing. The track corpus consists of 1250 PubMed abstracts: 250 are public and 1000 are reserved for final evaluation. We randomly split the public texts into training and development sets with a ratio of 8:2. The training set is used to learn model parameters, and the development set is used to select optimal hyper-parameters. For evaluation, we measure trigger word recognition and thematic role extraction performance with the $F_1$ score. Table 1 shows the external data sets used in the joint learning method. The BIO labels of these data sets differ from those of task 1, hence we use different projection and CRF layers for them. However, more data sets do not necessarily yield better results. We found that the NCBI disease (Dogan et al., 2014) and BC5CDR (Li et al., 2016) data sets help the final results, whereas performance drops when using the BC2GM (Smith et al., 2008) and 2010 i2b2/VA (Uzuner et al., 2011) data sets.
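The 8:2 split and the $F_1$ measure mentioned above can be sketched as follows. The document identifiers, the random seed, and the function names are hypothetical, and the exact-match comparison of `(start, end, label)` triples is an assumption, since the official matching criteria of the shared task are not described here.

```python
import random
from typing import List, Sequence, Set, Tuple


def train_dev_split(doc_ids: Sequence[str],
                    train_ratio: float = 0.8,
                    seed: int = 13) -> Tuple[List[str], List[str]]:
    """Randomly split document ids into training and development sets."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)  # local RNG keeps the split reproducible
    cut = int(round(len(ids) * train_ratio))
    return ids[:cut], ids[cut:]


def f1_score(gold: Set[tuple], pred: Set[tuple]) -> float:
    """F1 over exactly matching items (assumed matching criterion)."""
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# 250 placeholder ids standing in for the public abstracts.
public_ids = [f"doc_{i:03d}" for i in range(250)]
train_ids, dev_ids = train_dev_split(public_ids)

gold_entities = {(0, 2, "Gene"), (5, 6, "Var")}
pred_entities = {(0, 2, "Gene"), (7, 8, "Disease")}
score = f1_score(gold_entities, pred_entities)
```

Splitting by document id rather than by sentence keeps all sentences of one abstract on the same side of the split.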

Implementation and Hyperparameters
We tried the original BERT (https://github.com/google-research/bert), BioBERT (https://github.com/dmis-lab/biobert), and NCBI BERT (https://github.com/ncbi-nlp/NCBI_BERT) pre-trained models. Each training example is truncated to at most 384 tokens for named entity recognition (NER) and 512 tokens for relation extraction (RE). We use a batch size of 5 for NER and 32 for RE. We also use a hierarchical learning rate during training, so that the pre-trained parameters and the newly added parameters are optimized with different learning rates. For fine-tuning, we train the models for 20 epochs using a learning rate of $2 \times 10^{-5}$ for the pre-trained weights and $3 \times 10^{-5}$ for the others. The learning parameters were selected based on the best performance on the development set. For NER, we ensemble 5 models from 5-fold cross-validation and 2 models trained with the normal training-validation approach. For RE, we ensemble 3 models that were trained on all the constructed data.

Table 2 compares the pre-trained models on trigger words NER and thematic roles identification, reporting the impact of using different pre-trained models. The results for task 1 are summarized in Table 3. The difference in performance across labels is partly caused by the imbalanced distribution of trigger labels in the corpus. Our method ranks first on the leaderboard, substantially improving upon previous state-of-the-art methods. The results for task 2 are summarized in Table 4. Our method ranks second on the leaderboard. There is a large discrepancy between its development set performance and its test set performance. This may be because the test set differs substantially from our constructed data set. It is also related to how we combine recognized entities, at the sentence level or at the document level.
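The hierarchical learning rate described in the implementation details above can be sketched as a grouping of parameters by name. This is an assumption-laden illustration: the `bert.` name prefix and the function name are hypothetical, and the source does not state which optimizer was used.

```python
from typing import Dict, Iterable, List, Tuple


def build_param_groups(named_params: Iterable[Tuple[str, object]],
                       encoder_prefix: str = "bert.",  # hypothetical naming
                       pretrained_lr: float = 2e-5,
                       new_lr: float = 3e-5) -> List[Dict]:
    """Group parameters so pre-trained and new layers get different rates."""
    pretrained, new = [], []
    for name, param in named_params:
        if name.startswith(encoder_prefix):
            pretrained.append(param)  # weights loaded from the checkpoint
        else:
            new.append(param)  # projection, CRF, or classifier layers
    return [
        {"params": pretrained, "lr": pretrained_lr},
        {"params": new, "lr": new_lr},
    ]


# Strings stand in for parameter tensors so the sketch runs without a framework.
toy_params = [
    ("bert.encoder.layer.0.weight", "w0"),
    ("projection.weight", "w1"),
    ("crf.transitions", "w2"),
]
groups = build_param_groups(toy_params)
```

The returned list follows the per-parameter-group format that PyTorch optimizers accept, so with a real model it could be passed as `torch.optim.AdamW(build_param_groups(model.named_parameters()))`, assuming the encoder parameters share the given prefix.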

Ablation Study
As shown in Table 5, adding a BiLSTM layer on top of the BERT encoder did not improve the performance of the model; it resulted in a 0.04 drop in $F_1$. For NER tasks, external features are effective for model performance. We therefore verified the efficacy of word shape and POS tags on task 1, and found that adding this information increases the $F_1$ of our model by more than 0.01.

Conclusion
In this paper, we have explored the value of integrating pre-trained biomedical language representation models into an information extraction pipeline for collecting mutation-disease knowledge from PubMed. In particular, we investigate the use of three pre-trained models, BERT base, NCBI BERT, and BioBERT, for fine-tuning on the new task and reducing the risk of overfitting. By considering the relationship between different data sets, we achieve better results. Experimental results on the AGAC benchmark corpus, an annotation of genes with active mutation-centric function changes, show that pre-trained representations help improve the baseline and attain state-of-the-art performance. In future work, we would like to train the entity recognition and relation extraction tasks simultaneously, reducing the cascading errors caused by the pipeline model in biomedical information extraction.