Multi-Task Learning for Chemical Named Entity Recognition with Chemical Compound Paraphrasing

We propose a method to improve named entity recognition (NER) for chemical compounds using multi-task learning, by jointly training a chemical NER model and a chemical compound paraphrase model. Our method enables the long short-term memory (LSTM) of the NER model to capture chemical compound paraphrases by sharing the parameters of the LSTM and the character embeddings between the two models. The experimental results on the BioCreative IV CHEMDNER task show that our method improves chemical NER and achieves state-of-the-art performance.


Introduction
Named Entity Recognition (NER) is one of the fundamental technologies for Natural Language Processing (NLP) applications such as Information Extraction and Entity Linking.
LSTM-CRF NER models, which combine a conditional random field (CRF) and a long short-term memory (LSTM), have achieved high performance (Lample et al., 2016; Ma and Hovy, 2016). The LSTM-CRF models that use a neural language model (NLM) pre-trained on a large-scale unlabeled corpus (Akbik et al., 2018; Peters et al., 2018) have shown state-of-the-art performance on the CoNLL 2003 shared task (Sang et al., 2003).
Chemical compound databases are widely used for investigating properties of chemical compounds and for developing new chemical products. However, updating the databases by hand is hard because new findings on chemical compounds are reported in scientific papers and patents every day. Hence, chemical NER has been studied to recognize chemical compounds in chemical documents.

* Taiki Watanabe belonged to Ehime University when this work was done.
One of the problems in chemical NER is notation variants of chemical compounds. For example, Phenylalanine has different notations such as L-β-phenylalanine and (S)-2-Benzylglycine. The average number of notations of each compound in PubChem is approximately 3.88. If these expressions are treated as distinct, the statistics of the same compound can be dispersed, especially for low-frequency compounds. In other words, the more distinct these representations are, the more difficult identifying chemical compound entities becomes. However, existing chemical NER methods do not deal with notation variants of chemical compounds, which derive from partial structures or from the notational fluctuation peculiar to chemical compounds.
We propose HanPaNE, which Handles Paraphrase in NER by utilizing chemical compound paraphrase pairs as multi-task learning.
To train expression identity across different notations, HanPaNE learns parameters shared between paraphrases using multi-task learning on NER and paraphrase generation. This contrasts with widely used approaches that automatically augment training data by replacing NEs with other NEs (Yi et al., 2004). To train paraphrase generation, we use an attention-based neural machine translation (ANMT) model (Luong et al., 2015; Bahdanau et al., 2015) whose encoder shares parameters with the NER model. The experimental results on the BioCreative IV CHEMDNER task (Krallinger et al., 2015) show that our method achieves the best accuracy (+1.43 F-score).

Proposed Method

Figure 1 shows an overview of HanPaNE. We first describe our NER model and our chemical compound paraphrase model, and then describe our multi-task learning of the two models.

NER using Character-level NLM
This section describes the NER model using a character-level NLM (Akbik et al., 2018), which is the baseline we compare our approach against. This model showed state-of-the-art performance on the CoNLL 2003 shared task. For this NER model, given an input sentence w = w_1, w_2, ..., w_N, each word w_i is converted into a vector x_i represented as the concatenation of the following three types of embeddings.
• 1) word embeddings obtained by looking up a parameter matrix.
• 2) CWSE: the hidden states of a BiLSTM (LSTM^{(wc)}) over the character sequence of a word (Lample et al., 2016).
• 3) CSE: the hidden states of a BiLSTM (LSTM^{(sc)}) over the character sequence of the whole sentence (Akbik et al., 2018).
CWSE for a word is the concatenation of the last hidden states of the forward and backward LSTMs of LSTM^{(wc)}. CSE for a word is the concatenation of the hidden state of the last character of the word in the forward LSTM of LSTM^{(sc)} and the hidden state of the first character of the word in the backward LSTM of LSTM^{(sc)}. We note that LSTM^{(sc)} is pretrained as a character-level NLM by maximizing log-likelihood. For example, the forward LSTM is trained with the following objective:

  sum_t log p(c_t | c_1, ..., c_{t-1}),

where c_t is the t-th character in a sentence. After obtaining the vector representations of the input words X = x_1, ..., x_N, the hidden state h_i of the BiLSTM and the scores of NE tags e_i for each word w_i are computed as:

  h_i = BiLSTM(X)_i,
  e_i = W_e h_i,

where the dimension of h_i is d, W_e ∈ R^{k×d} is a weight matrix, and k is the number of NE types to be identified.
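The three-way concatenation and tag scoring above can be sketched as follows. All dimensions are illustrative assumptions (not the paper's settings), and the random projection standing in for the BiLSTM is a deliberate simplification:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for the sketch
d_word, d_cwse, d_cse = 8, 4, 6
d_hidden, k_tags = 10, 5          # k = number of NE tag types
N = 3                             # sentence length

# Stand-ins for the three embedding types of each word w_i
word_emb = rng.normal(size=(N, d_word))   # looked-up word embeddings
cwse     = rng.normal(size=(N, d_cwse))   # CWSE: char BiLSTM over each word
cse      = rng.normal(size=(N, d_cse))    # CSE: char NLM over the sentence

# x_i is the concatenation of the three embeddings
X = np.concatenate([word_emb, cwse, cse], axis=1)   # shape (N, 18)

# A BiLSTM over X would produce hidden states h_i in R^d; here a
# random tanh projection stands in for it.
W_lstm = rng.normal(size=(X.shape[1], d_hidden))
H = np.tanh(X @ W_lstm)                             # h_1..h_N, shape (N, d)

# Tag scores e_i = W_e h_i, with W_e in R^{k x d}
W_e = rng.normal(size=(k_tags, d_hidden))
E = H @ W_e.T                                       # shape (N, k)
```

The score matrix E corresponds to P = (e_1, ..., e_N)^T fed into the CRF layer.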
The BiLSTM-CRF+CSE model predicts a tag sequence with the score matrix P = (e_1, ..., e_N)^T ∈ R^{N×k} using a CRF. Note that P_{i,j} is the score of the j-th tag for the i-th word. In particular, the probability of the output tag sequence y = y_1, ..., y_N is calculated as follows:

  s(w, y) = sum_{i=0}^{N} A_{y_i, y_{i+1}} + sum_{i=1}^{N} P_{i, y_i},
  p(y | w) = exp(s(w, y)) / sum_{ỹ ∈ Y_w} exp(s(w, ỹ)),

where Y_w is the set of all possible tag sequences, A is a matrix of transition scores, A_{i,j} represents the score of transiting from the i-th tag to the j-th tag, and y_0 and y_{N+1} are the special tags for the start and the end of the sentence, respectively. During training, the model maximizes the following log-likelihood over the correct tag sequences:

  log p(y | w),    (8)

where y is the correct tag sequence of w. When recognizing NEs, the model outputs the tag sequence that maximizes the score:

  y* = argmax_{ỹ ∈ Y_w} s(w, ỹ).
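The CRF scoring and decoding above can be sketched directly in NumPy. Appending two extra rows/columns to A for the start and end tags is a common convention we assume here, not something the paper specifies:

```python
import numpy as np

def crf_score(P, A, y):
    """s(w, y): transition scores A plus emission scores P along tag path y.

    P : (N, k) emission score matrix, A : (k+2, k+2) transitions,
    where indices k and k+1 are the special start/end tags.
    """
    N, k = P.shape
    start, end = k, k + 1
    s = A[start, y[0]] + P[0, y[0]]
    for i in range(1, N):
        s += A[y[i - 1], y[i]] + P[i, y[i]]
    return s + A[y[-1], end]

def viterbi(P, A):
    """y* = argmax_y s(w, y) by dynamic programming."""
    N, k = P.shape
    start, end = k, k + 1
    score = A[start, :k] + P[0]        # best score ending in each tag
    back = []
    for i in range(1, N):
        cand = score[:, None] + A[:k, :k] + P[i][None, :]
        back.append(cand.argmax(axis=0))
        score = cand.max(axis=0)
    score = score + A[:k, end]
    y = [int(score.argmax())]
    for bp in reversed(back):          # follow backpointers
        y.append(int(bp[y[-1]]))
    return y[::-1]
```

With zero transition scores, Viterbi reduces to a per-position argmax over P, which is a quick sanity check for the implementation.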

Paraphrase Model
We used the ANMT model (Luong et al., 2015; Bahdanau et al., 2015), a standard model in machine translation, as the chemical compound paraphrase model. The ANMT model converts an input sequence w into another sequence y^{trg} = y^{trg}_1, y^{trg}_2, ..., y^{trg}_T, which in our model is a paraphrase of the input, using an RNN encoder and an RNN decoder. The RNN encoder converts the input sequence into a set of fixed-length vectors, and the RNN decoder generates an output sequence from the converted vectors. We use a bidirectional LSTM defined by Eq. (2) as the encoder. At each decoding step j, the attentional hidden state s̃_j is computed as:

  s̃_j = tanh(W_e [s_j; o_j]),

where W_e is a weight matrix and tanh is the hyperbolic tangent function. The context vector o_j is a weighted average of the encoder's hidden states:

  o_j = sum_i α_{ji} h_i,   α_{ji} = exp(score(s_j, h_i)) / sum_{i'} exp(score(s_j, h_{i'})),

where exp is the natural exponential function. The decoder generates an output based on the probability distribution of the j-th token:

  p(y^{trg}_j | y^{trg}_{<j}, w) = softmax(W_s s̃_j),

where W_s is a weight matrix. The objective function is defined as follows:

  J(θ) = sum_{(w, y^{trg}) ∈ D} log p(y^{trg} | w; θ),    (11)

where D is the training data and θ is the set of model parameters.
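The context-vector computation above can be sketched as follows, using the dot-product score, one of the Luong et al. (2015) variants; the function name and the choice of score function are our assumptions:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(s_j, enc_h):
    """Context vector o_j as a weighted average of encoder hidden states.

    s_j   : decoder hidden state at step j, shape (d,)
    enc_h : encoder hidden states h_1..h_N, shape (N, d)
    Returns (o_j, alpha), where alpha are the attention weights.
    """
    scores = enc_h @ s_j            # dot-product score(s_j, h_i)
    alpha = softmax(scores)         # alpha_ji = exp(score_i) / sum_i' exp(score_i')
    o_j = alpha @ enc_h             # o_j = sum_i alpha_ji * h_i
    return o_j, alpha

# The attentional hidden state would then be s~_j = tanh(W_e @ concat(s_j, o_j)).
```

When the decoder state carries no information (all zeros here), the weights degenerate to uniform attention, which is a convenient sanity check.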

Handling Paraphrases in NER
Our method, HanPaNE, trains the NER model described in Section 2.1 and the chemical compound paraphrase model described in Section 2.2 at the same time through multi-task learning. In the multi-task learning, the character embedding weight matrix, the LSTM parameters in Eq. (3), and the parameters of LSTM^{(wc)} are shared between the two models. Through this parameter sharing, the LSTM part of the NER model is expected to convert the same compound with different notations into similar vector representations. The objective functions of the two models are Eq. (8) and Eq. (11), respectively.
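A minimal sketch of this parameter sharing: both tasks hold references to the very same arrays, so an update driven by either task's loss moves the shared encoder for both. The dictionary layout, dimensions, and the tanh projection standing in for the LSTM are our own simplifications:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared parameters: character embeddings and a (stand-in) LSTM weight matrix.
shared = {
    "char_emb": rng.normal(size=(100, 25)),   # hypothetical vocab size / dim
    "lstm_W":   rng.normal(size=(25, 50)),
}

class CharEncoder:
    """Both the NER tagger and the paraphrase encoder hold references to the
    SAME parameter arrays; nothing is copied."""
    def __init__(self, params):
        self.params = params   # reference, not a copy

    def encode(self, char_ids):
        emb = self.params["char_emb"][char_ids]
        return np.tanh(emb @ self.params["lstm_W"])

ner_encoder = CharEncoder(shared)
para_encoder = CharEncoder(shared)

# An update made while training one task...
shared["lstm_W"] += 0.01
# ...is immediately visible through the other task's encoder, since both
# hold the same array objects.
```

In a real framework this corresponds to passing the same embedding/LSTM modules to both model graphs, so gradients from both objectives accumulate into one parameter set.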

Experimental Settings
We used the BioCreative IV CHEMDNER data set, which was preprocessed by Luo et al. (2018). The dimensions of the embeddings and of the hidden states of the LSTMs and the NLM are set to 100, 25, 50, 200, 400, and 2048, respectively. We compared the following methods. "+P" indicates training with consideration of paraphrases in PubChemDic, and "-P" indicates training without it.
• Baseline is the BiLSTM-CRF+CSE model of Akbik et al. (2018) described in Section 2.1.
• VE-P is a baseline trained with virtual examples (VEs) created by randomly replacing NEs in the training data with chemical compounds in PubChemDic, similarly to Yi et al. (2004).
• VE+P is a baseline trained with VEs created by replacing NEs in the training data with their corresponding paraphrases in PubChemDic.
• HanPaNE-P is the multi-task model for NER and paraphrasing trained with sentence pairs randomly generated with PubChemDic.
• HanPaNE+P is the proposed method trained with sentence pairs generated by replacing NEs in sentences with their corresponding paraphrases in PubChemDic.
The baseline was trained with the 55,458 sentences of the CHEMDNER training data. VE-P was trained with 110,916 sentences in total, consisting of the original CHEMDNER training data and 55,458 sentences automatically generated from it. VE+P was trained with 59,033 sentences, consisting of the original CHEMDNER training data and 3,575 sentences automatically generated from it with paraphrases of chemical compounds. For HanPaNE-P and HanPaNE+P, we used 100,000 randomly selected sentences from PubMedAbs for training paraphrasing and the original CHEMDNER training data for NER.
For example, for +P, "... Phenylalanine is ..." is converted into "... L-β-phenylalanine is ..." and "... (S)-2-Benzylglycine is ...", where L-β-phenylalanine and (S)-2-Benzylglycine are paraphrases of Phenylalanine. For -P, "... Phenylalanine is ..." is converted into, for example, "... ethylene is ...", where ethylene is randomly sampled from the chemical compounds. Table 1 shows the experimental results. We can see that HanPaNE+P showed the highest accuracy, and that HanPaNE+P and VE+P, which consider paraphrases, showed higher accuracy than Baseline. In contrast, HanPaNE-P and VE-P, which do not consider paraphrases, did not. These results indicate that the use of paraphrases contributed to the improved accuracy.
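The +P conversion illustrated above can be sketched as a simple substitution over annotated spans; the toy dictionary and function name are hypothetical stand-ins for PubChemDic lookups:

```python
# Hypothetical paraphrase dictionary in the spirit of PubChemDic
paraphrases = {
    "Phenylalanine": ["L-β-phenylalanine", "(S)-2-Benzylglycine"],
}

def make_virtual_examples(tokens, entity_spans):
    """Replace each annotated NE span with its paraphrases (the +P setting).

    tokens       : list of word strings
    entity_spans : list of (start, end) token index pairs for the NEs
    Returns one new token list per (entity, paraphrase) pair.
    """
    out = []
    for start, end in entity_spans:
        name = " ".join(tokens[start:end])
        for alt in paraphrases.get(name, []):
            # Splice the paraphrase in place of the original entity mention
            out.append(tokens[:start] + [alt] + tokens[end:])
    return out

sent = ["...", "Phenylalanine", "is", "..."]
ves = make_virtual_examples(sent, [(1, 2)])
```

The -P setting would differ only in sampling the replacement uniformly from the full compound list instead of from the entity's own paraphrases.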

Experimental Results
We also conducted the following two types of hypothesis testing. The first is a McNemar paired test on the labeling disagreements of words assigned by HanPaNE and the other methods, as in Sha and Pereira (2003). All the results except for Baseline were significantly different (p < 0.01). The second is a binomial test used in Sasano and Kurohashi (2008). For this test, we count the number of entities correctly recognized only by HanPaNE and the number correctly recognized only by the other method. Then, assuming the outputs follow a binomial distribution, we apply a binomial test. All the results were significantly different under this test (p < 0.05). These results show that HanPaNE works better than merely augmenting the training data.
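The binomial (sign) test described above can be implemented directly: under the null hypothesis of no real difference, each entity recognized by exactly one system is equally likely to be won by either, so the win counts follow Binomial(n, 0.5). This is a generic sketch of that test (the function name is ours), not the paper's exact code:

```python
from math import comb

def binomial_two_sided_p(wins_a, wins_b):
    """Two-sided binomial (sign) test on entities recognized by only one system.

    wins_a : entities correctly recognized only by system A
    wins_b : entities correctly recognized only by system B
    Returns the two-sided p-value under X ~ Binomial(n, 0.5).
    """
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # One tail: P(X <= k); by symmetry of Binomial(n, 0.5), double it.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, a split of 8 wins versus 2 gives p ≈ 0.109, i.e. not significant at the 0.05 level, illustrating why fairly lopsided counts are needed for significance at small n.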
We also evaluated the accuracy on NEs not covered by the training data and PubChemDic. From Table 2, we can see that our method showed the best performance both for the NEs not covered by the training data and PubChemDic and for the covered NEs. The results indicate that our method contributes to recognizing NEs not covered by existing large data sets. The training data includes 8,508 types of chemical terms and covers 60.76% of the chemical terms in the test data. PubChemDic includes 337,289,536 types of chemical compound names and covers 10.02% of the chemical terms in the test data.

Related Work
BioCreative IV's CHEMDNER task

Table 3 shows a comparison with the previous best results. Leaman et al. (2015), among others, proposed feature-based approaches to improve chemical NER performance. Lin et al. (2018) proposed a neural network approach that uses document-level information to maintain tagging consistency across sentences. By learning paraphrasing, our method showed the best accuracy on the CHEMDNER task.

Multi-task learning
Multi-task learning is often utilized to improve the performance of NLP systems (Liu et al., 2015; Luong et al., 2016; Dong et al., 2015; Hashimoto et al., 2017), including NER. Rei (2017), among others, studied multi-task learning of sequence labeling with language models. Aguilar et al. (2018) and Cao et al. (2018) proposed multi-task learning of NER with word segmentation. Peng and Dredze (2017) leveraged multi-task learning for domain adaptation. Other work combines NER with several NLP tasks such as POS tagging and parsing in a multi-task setting. Crichton et al. (2017), among others, improve NER performance through multi-task learning over several biomedical NLP tasks.