Comparing CNN and LSTM character-level embeddings in BiLSTM-CRF models for chemical and disease named entity recognition

We compare the use of LSTM-based and CNN-based character-level word embeddings in BiLSTM-CRF models to approach chemical and disease named entity recognition (NER) tasks. Empirical results over the BioCreative V CDR corpus show that the use of either type of character-level word embeddings in conjunction with the BiLSTM-CRF models leads to comparable state-of-the-art performance. However, the models using CNN-based character-level word embeddings have a computational performance advantage: they increase training time over word-based models by only 25%, while the LSTM-based character-level word embeddings more than double the required training time.


Introduction
Bi-directional Long Short-Term Memory Conditional Random Field models (BiLSTM-CRF), in which a BiLSTM is coupled with a CRF layer to connect output tags, have been shown to achieve state-of-the-art performance in sequence tagging tasks including part of speech (POS) tagging, chunking, and NER (Huang et al., 2015). The combination of word embeddings and character-level word embeddings has been explored in this context, with Ma and Hovy (2016) using Convolutional Neural Networks (CNNs) to construct character-level word embeddings and Lample et al. (2016) applying LSTM networks. This work showed that the use of character-level word embeddings improves model performance by contributing the ability to generalize to unseen words.
Biomedical Named Entity Recognition (BNER) is a vital initial step for information extraction tasks in the biomedical domain, including the Chemical-Disease Relationship (CDR) extraction task where both chemical and disease entities must be identified (Li et al., 2016). Character-level word embeddings could be particularly significant in this context, given that new entity names are frequently created, and may follow consistent patterns including productive morphology such as common prefixes (e.g., di-) or suffixes (e.g., -ase).
Features that capture word-internal characteristics have been shown to be effective for BNER tasks in CRF models (Klinger et al., 2008). Lyu et al. (2017) applied a BiLSTM-CRF model with LSTM-based character-level word embeddings to a gene and protein NER task, demonstrating state-of-the-art performance that outperformed traditional feature-based models. Luo et al. (2018) further improved on this result on a chemical NER task by adding an attention layer between the BiLSTM and CRF layers (Att-BiLSTM-CRF).
In an experiment by Reimers and Gurevych (2017b), optimal hyper-parameters for LSTM networks in sequence tagging tasks were explored, with the finding that incorporation of character-level word embeddings significantly improved performance on NER tasks on general datasets including CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003). However, the choice of CNN-based (Ma and Hovy, 2016) or LSTM-based character-level word embeddings (Lample et al., 2016) did not affect performance significantly. Since a CNN has fewer parameters to train than a BiLSTM network, it trains more efficiently, and was recommended as the preferred approach.
In this paper, we implement and compare models with each type of word embedding to generate empirical results for the tasks of chemical and disease NER, using the BioCreative V CDR corpus (Li et al., 2016). These BNER categories are the most searched entities in the biomedical literature (Islamaj Dogan et al., 2009), and hence particularly important to study.
The results show that models with CNN-based character-level word embeddings achieve state-of-the-art results comparable to LSTM-based character-level word embeddings, while having the advantage of reduced training complexity, demonstrating that the prior results also hold for the BNER task.

Experimental methodology
This section presents our empirical approach to comparing state-of-the-art neural network models for chemical and disease NER.

Dataset
In our experiments, we use the BioCreative V CDR corpus (Li et al., 2016). This corpus provides a set of 1000 manually-annotated abstracts (9193 sentences) for training and development, and another set of 500 manually-annotated abstracts (4840 sentences) for test. In particular, we used a pre-processed version of the CDR corpus from Luo et al. (2018), which provides predicted POS-, chunking- and gazetteer-based tags:
• POS and chunking tags are predicted by the GENIA tagger (Tsuruoka et al., 2005).
• Gazetteer tags are encoded in the BIO tagging scheme based on matching to the external Jochem chemical dictionary (Hettne et al., 2009).
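As a concrete illustration of the gazetteer tags, the following sketch (our own, not the corpus preprocessing code; the entity label "Chem" and the toy dictionary are illustrative) encodes dictionary matches in the BIO scheme, where B- marks the first token of a match and I- the following tokens:

```python
def bio_gazetteer_tags(tokens, gazetteer):
    """Tag each token with B-Chem/I-Chem if a gazetteer entry
    (a tuple of tokens) matches at that position, else O."""
    tags = ["O"] * len(tokens)
    for i in range(len(tokens)):
        for entry in gazetteer:
            n = len(entry)
            if tuple(tokens[i:i + n]) == entry:
                tags[i] = "B-Chem"
                for j in range(i + 1, i + n):
                    tags[j] = "I-Chem"
    return tags

gazetteer = {("lithium",), ("valproic", "acid")}
tokens = ["treated", "with", "valproic", "acid", "and", "lithium"]
print(bio_gazetteer_tags(tokens, gazetteer))
# ['O', 'O', 'B-Chem', 'I-Chem', 'O', 'B-Chem']
```

The resulting tag sequence is then fed to the model as one extra token-level feature alongside the POS and chunking tags.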
Following Luo et al. (2018), we randomly sample 10% from the set of 1000 abstracts for development, and use the remaining for training.
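The 90/10 split can be reproduced with a sketch like the following (the seed and shuffling scheme are our own assumptions, not those of Luo et al. (2018)):

```python
import random

def train_dev_split(abstracts, dev_fraction=0.1, seed=42):
    """Randomly sample dev_fraction of the abstracts for
    development and keep the remainder for training."""
    ids = list(range(len(abstracts)))
    random.Random(seed).shuffle(ids)
    n_dev = int(len(abstracts) * dev_fraction)
    dev = [abstracts[i] for i in ids[:n_dev]]
    train = [abstracts[i] for i in ids[n_dev:]]
    return train, dev

abstracts = [f"abstract-{i}" for i in range(1000)]
train, dev = train_dev_split(abstracts)
print(len(train), len(dev))  # 900 100
```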

Models
We use the following BiLSTM-CRF-based sequence labeling models:
• Baseline BiLSTM model (Schuster and Paliwal, 1997; Hochreiter and Schmidhuber, 1997), which uses a softmax layer to predict the NER labels of input words.
• BiLSTM-CRF (Huang et al., 2015) extends the BiLSTM model with a CRF layer which allows the model to use sentence-level tag information for sequence prediction.
• BiLSTM-CRF + CNN-char (Ma and Hovy, 2016) extends the BiLSTM-CRF model with character-level word embeddings. For each word, its character-level word embedding is derived by applying a CNN to the character sequence in the word.

Table 2: Hyper-parameters for learning character-level word embeddings. "charEmbedSize" and "# of Params." denote the vector size of character embeddings and the total number of parameters, respectively.
• BiLSTM-CRF + LSTM-char also extends the BiLSTM-CRF model with character-level word embeddings which are derived by applying a BiLSTM to the character sequence in each word (Lample et al., 2016).
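To make the CNN-based variant concrete, the following sketch (our own; the layer sizes are arbitrary and not the settings of Table 2) derives a character-level word embedding by sliding convolution filters over character embeddings and max-pooling over window positions:

```python
import numpy as np

def cnn_char_embedding(word, char_emb, filters, width=3):
    """Character-level word embedding: embed each character,
    apply each convolution filter to every window of `width`
    characters, then max-pool over all window positions."""
    # (word_len, char_dim) matrix of character embeddings
    chars = np.stack([char_emb[c] for c in word])
    n = len(word) - width + 1  # number of windows
    # one convolution output per (window, filter) pair
    conv = np.array([[np.sum(chars[i:i + width] * f) for f in filters]
                     for i in range(n)])
    return conv.max(axis=0)  # max over time -> (num_filters,)

rng = np.random.default_rng(0)
char_emb = {c: rng.normal(size=25) for c in "abcdefghijklmnopqrstuvwxyz"}
filters = rng.normal(size=(30, 3, 25))  # 30 filters of width 3
vec = cnn_char_embedding("kinase", char_emb, filters)
print(vec.shape)  # (30,)
```

The LSTM-based alternative instead runs a BiLSTM over the same character embeddings and concatenates the two final hidden states; both variants yield one fixed-size vector per word, which is concatenated to the word embedding.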
Following Luo et al. (2018), we also consider the impact of extra features including syntactic features such as POS and chunking tags, and a chemical term feature based on matching to an external gazetteer. Figure 1 illustrates the general BiLSTM-CRF model architecture with character-level word embeddings and additional features, while Figure 2 illustrates the CNN-based and LSTM-based architectures for learning the character-level word embeddings.

Implementation details
We used a well-known implementation of BiLSTM-CRF-based models from Reimers and Gurevych (2017b). We used the training set to learn model parameters, the development set to select optimal hyper-parameters, and the test set to report final results. We tuned the model hyper-parameters using the performance across both NER categories ("Overall") on the development set.
We employed pre-trained 50-dimensional word vectors from Luo et al. (2018). These pre-trained vectors were derived by training the Word2Vec skip-gram model (Mikolov et al., 2013) on a large text collection of 2 million MEDLINE abstracts. Reimers and Gurevych (2017b) showed that the BiLSTM-CRF model achieved its best performance with 2 BiLSTM layers; therefore, we only evaluated models with up to 2 stacked BiLSTM layers. The size of the LSTM hidden states in each layer was selected from {100, 150, 200, 250}. We achieved the highest F1 score on the development set when using 250-dimensional LSTM hidden states for all models.
By default, each of the additional features (POS, chunking and gazetteer match tags) was incorporated into the model via a 10-dimensional embedding. Other hyper-parameters were fixed as in Reimers and Gurevych (2017b) during initialization. See Tables 1 and 2 for more details.
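Concretely, the input representation for each word concatenates the word embedding, the character-level word embedding, and one small embedding per additional feature; a minimal sketch (our own, with illustrative vocabulary and vector sizes) is:

```python
import numpy as np

def input_representation(word_vec, char_vec, feature_ids, feature_tables):
    """Concatenate the word embedding, the character-level word
    embedding, and a 10-dimensional embedding for each extra
    feature (POS tag, chunk tag, gazetteer tag)."""
    parts = [word_vec, char_vec]
    for name, idx in feature_ids.items():
        parts.append(feature_tables[name][idx])  # look up feature embedding
    return np.concatenate(parts)

rng = np.random.default_rng(1)
feature_tables = {
    "pos": rng.normal(size=(45, 10)),       # 45 POS tags (illustrative)
    "chunk": rng.normal(size=(20, 10)),     # 20 chunk tags (illustrative)
    "gazetteer": rng.normal(size=(3, 10)),  # B / I / O
}
x = input_representation(rng.normal(size=50),   # 50-dim word vector
                         rng.normal(size=30),   # 30-dim char-level vector
                         {"pos": 12, "chunk": 3, "gazetteer": 0},
                         feature_tables)
print(x.shape)  # (110,)
```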
In the training process, we used the overall F1 score on the development set to assess model improvement. Early stopping was applied if there was no improvement after 10 epochs. The frequency threshold for adding a word outside the pre-trained vocabulary to the embedding was set to 5. The average training time per epoch was also recorded.

Main results

Table 3 presents our empirical results. The first three rows show the performance of baseline models without the CRF layer, the next three rows show the performance of BiLSTM-CRF models without additional features, and the next three rows show the results for BiLSTM-CRF models with additional gazetteer features. As Table 3 shows, the model with CNN character-level embeddings (CNN-char) and the model with LSTM character-level embeddings (LSTM-char) achieved similar overall F1 scores (87.88% and 87.79%, respectively), outperforming BiLSTM-CRF by approximately 1% in absolute terms. In particular, on chemical NER, both BiLSTM-CRF-based models with character-level word embeddings obtained the same F1 score (91.94%), while on disease NER the model with CNN-char obtained slightly higher performance (83.01%) than the model with LSTM-char (82.83%). All models with the CRF layer outperformed their respective baseline BiLSTM models in F1 score for all entity categories.

Effect of additional features
When incorporating additional POS and chunking features into the three baseline BiLSTM-CRF-based models, we observed no performance improvement over the baseline models.
On chemical NER, the additional gazetteer feature improved the baseline BiLSTM-CRF by about 0.8%, while it improved BiLSTM-CRF + CNN-char and BiLSTM-CRF + LSTM-char by only about 0.3%, clearly indicating that character-level word embeddings already capture much of the unseen-word information the gazetteer provides. Considering both NER categories together ("Overall"), the best performance was also obtained when the gazetteer feature was added, reaching overall F1 scores of 88.02% and 87.99% for the CNN-based and LSTM-based character-level embedding models, respectively.

Comparison with prior work
The performance comparison between our BiLSTM-CRF-based models and other machine learning approaches to the two studied NER tasks is also shown in Table 3. The pattern of chemical NER outperforming disease NER is consistent across all tools.
The Att-BiLSTM-CRF model (Luo et al., 2018) used a BiLSTM-CRF model with LSTM character-level word embeddings and an additional attention layer. It achieved an F1 score of 91.96% on chemical NER without additional features. The positive effect of a gazetteer feature was also observed in their results; the model with syntactic and gazetteer features reached an F1 score of 92.57%. Note that the dataset splits used in that paper may not be exactly the same as ours due to random sampling.
The last three rows of Table 3 show results reported in prior work, where 950 of the abstracts were used for training and 50 for development (cf. our 900/100 split). DNorm (Leaman et al., 2013) is a pairwise learning-to-rank model for disease name normalization, which achieved an F1 score of 80.7% on disease NER. tmChem (Leaman et al., 2015) is based on a CRF; using numerous hand-crafted features, it reached an F1 score of 88.4% on chemical entities. TaggerOne, a semi-Markov model with a richer feature set for NER tasks, achieved F1 scores of 91.4% and 82.6% on chemical and disease entities, respectively.
Compared to previous non-deep-learning methods using CRFs, the BiLSTM-CRF models have a significant advantage in F1 score on both chemical and disease entities, primarily due to improved recall.

Discussion
In our experiment on the effect of additional features, we found that syntactic features such as POS and chunking information did not have a clear positive effect on performance. In contrast, the match or partial match between words and entries in the chemical gazetteer is a good indicator of the presence of chemical entities. Since the Jochem dictionary contains only chemical entities, it is not surprising that the performance on diseases was not substantially impacted by adding the gazetteer feature, although some small variations in performance can be observed, likely due to changed influences from neighboring terms.
The empirical results show that models using either CNN-char or LSTM-char achieve similar overall F1 scores on chemical and disease NER. These results are also comparable with other state-of-the-art models, indicating that both character-level models have sufficient capacity to learn the generalizable morphological and lexical patterns of biomedical named entity terms.
On the other hand, as shown by the substantial differences in the number of parameters in Table 2, the CNN (LeCun et al., 1989) has the advantage of reduced training complexity compared to the LSTM (Hochreiter and Schmidhuber, 1997) under similar experimental settings. In our experimental environment, the training time per epoch of the model with LSTM-char increased by 115% relative to the baseline BiLSTM-CRF model, while it increased by only about 25% for the model with CNN-char, as detailed in Table 4. Therefore, consistent with prior results on general NER, we conclude that CNN-based embeddings are preferable to LSTM-based embeddings for BNER.
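The training-complexity gap is visible from parameter counts alone. The sketch below (our own formulas, with illustrative sizes rather than the actual Table 2 settings) compares a width-3 convolutional character encoder against a BiLSTM character encoder of matched output dimensionality:

```python
def cnn_char_params(char_dim, num_filters, width=3):
    """Weights plus biases for `num_filters` convolution filters,
    each spanning `width` characters of dimension `char_dim`."""
    return num_filters * (width * char_dim + 1)

def bilstm_char_params(char_dim, hidden):
    """A BiLSTM is two LSTMs; each LSTM has 4 gates with input,
    recurrent, and bias weights."""
    per_lstm = 4 * (hidden * char_dim + hidden * hidden + hidden)
    return 2 * per_lstm

# Illustrative sizes: 25-dim char embeddings, 30-dim output vector
print(cnn_char_params(25, 30))     # 2280
print(bilstm_char_params(25, 15))  # 4920 (15 + 15 = 30-dim output)
```

Even at matched output size, the recurrent hidden-to-hidden matrices make the LSTM encoder markedly larger, and its sequential computation over characters forgoes the parallelism a convolution enjoys.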
Table 4: Average training time per epoch.
Model          Avg. runtime per epoch (seconds)   ∆
BiLSTM-CRF     106                                0
+ CNN-char     134                                +26%
+ LSTM-char    229                                +115%

We analyzed the error cases of the CNN-char and LSTM-char models without additional features: 3326 and 3271 words were incorrectly predicted using CNN-char and LSTM-char, respectively, with 2138 mistakes in common. Among the errors made by only one of the two models, CNN-char made more false positive and fewer false negative predictions, while LSTM-char made roughly equal numbers of the two kinds of errors.
The relationship between word length and these errors was also explored. For words of fewer than 20 characters, the error distributions of the two models are almost identical. However, for longer words the model with LSTM-char tends to make more mistakes. This supports prior observations that LSTMs can be difficult to apply to long input sequences (Bradbury et al., 2017). In approximately 50% of error cases the word is short, with fewer than 5 characters. Short biomedical named entities are usually abbreviations, tend to be out-of-vocabulary terms, and are therefore particularly difficult for the character-level word embedding models to capture (Habibi et al., 2017).
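The length-based breakdown above can be computed with a simple bucketing sketch (our own; the error words below are illustrative examples, not actual model outputs):

```python
from collections import Counter

def errors_by_length(error_words, buckets=(5, 10, 15, 20)):
    """Count erroneous words per length bucket: <5, <10 (i.e. 5-9),
    <15, <20, and >=20 characters."""
    counts = Counter()
    for w in error_words:
        for b in buckets:
            if len(w) < b:
                counts[f"<{b}"] += 1
                break
        else:  # no bucket matched: the word is long
            counts[f">={buckets[-1]}"] += 1
    return counts

errors = ["MPTP", "AZT", "dopamine", "hepatotoxicity",
          "thrombocytopenia", "N-methyl-D-aspartate-receptor"]
print(errors_by_length(errors))
```

Comparing such histograms for the two models' error lists is what reveals the LSTM-char model's disadvantage on words of 20 or more characters.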

Conclusion
We compared the performance of BiLSTM-CRF models with CNN-based and LSTM-based character-level word embeddings for biomedical named entity recognition. We confirmed previously published results on chemical and disease NER that demonstrate that character-level embeddings are helpful. We further show empirically, generalizing prior results for general NER to the biomedical context, that there is little difference between the two approaches: both types of character-level word embeddings achieved identical F1 score on the chemical NER task, and similar performance on disease NER (with CNN-char showing a slight performance advantage). However, the CNN embeddings show a substantial advantage in reduced training complexity.