Neural Language Modeling for Named Entity Recognition

Named entity recognition is a key component in various natural language processing systems, and neural architectures provide significant improvements over conventional approaches. Regardless of the word embedding and hidden layer structure of the network, a conditional random field layer is commonly used for the output. This work proposes to use a neural language model as an alternative to the conditional random field layer, which scales better to corpora with many entity types. Experimental results show that the proposed system has a significant advantage in terms of training speed, with only a marginal performance degradation.


Introduction
With the help of various neural network architectures, named entity recognition (NER) systems nowadays achieve very promising performance for tasks with a limited number of entity types (Akbik et al., 2018; Devlin et al., 2019; Jiang et al., 2019). Besides various types of embedding techniques and long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) recurrent neural networks (RNN), another important component of those systems is the conditional random field (CRF) layer (Lafferty et al., 2001). The CRF layer is very effective for tasks such as NER and part-of-speech (POS) tagging. However, the training time of such a model increases quadratically with the size of the output vocabulary, which in the case of NER is the number of different entity tags.
To tackle this problem, we use an LSTM-based neural language model (LM) over tags as an alternative to the CRF layer. With a separately trained LM (without using additional monolingual tag data), training the new system is about 2.5 to 4 times faster than training the standard CRF model, while the performance degradation is only marginal (less than 0.3%). Thanks to its time efficiency, our system can easily be applied to corpora containing hundreds or even thousands of entity types.
In addition, the LSTM-based LM can potentially perform better, as it encodes more contextual information than the CRF layer. To unlock the full power of the LM, we also train the tagging model and the LM jointly at the sequence level. In this case we lose the speed advantage, but the jointly trained system achieves performance comparable to state-of-the-art NER models on four different corpora.

Background
In a bidirectional LSTM (BLSTM) CRF-based NER system, the score of a named entity tag sequence $y_1^T$ for a given word sequence $x_1^T$ is defined as

$$s_\theta(x_1^T, y_1^T) = \sum_{t=1}^{T} \left( A_{y_{t-1}, y_t} + f_\theta(y_t, x_1^T, t) \right) \quad (1)$$

The first component of the score models the transition between the current tag $y_t$ at time $t$ and its predecessor $y_{t-1}$; it is therefore referred to as the transition score. The second term models the dependency of the current tag $y_t$ on the word sequence $x_1^T$ and can be called the emission score. The probability of the tag sequence is defined as

$$p_\theta(y_1^T \mid x_1^T) = \frac{\exp s_\theta(x_1^T, y_1^T)}{\sum_{\tilde{y}_1^T} \exp s_\theta(x_1^T, \tilde{y}_1^T)} \quad (2)$$

where $\theta$ denotes the trainable parameters of the model, including the transition matrix $A$ and the free parameters of the BLSTM. The training objective is to maximize the log-likelihood of the true tag sequence $\hat{y}_1^T$:

$$\theta = \operatorname*{arg\,max}_{\theta} \log p_\theta(\hat{y}_1^T \mid x_1^T) \quad (3)$$

Note that the denominator of the probability in Equation 2 sums over the scores of all possible tag sequences. If the total number of unique tags is $Y$, the computational complexity of the sequence probability is $O(Y^T)$ (ignoring the computational complexity of the BLSTM features). Since the CRF layer models only a first-order dependency, dynamic programming can be employed in training, reducing the time complexity to $O(TY^2)$. The objective of decoding is to find the tag sequence that maximizes the score for the given word sequence and model parameters:

$$\hat{y}_1^T = \operatorname*{arg\,max}_{y_1^T} s_\theta(x_1^T, y_1^T) \quad (4)$$

CRF-based models have long been applied to sequence tagging problems. Compared to non-CRF models, such as a pure BLSTM model, CRF models offer two advantages: first, explicit modeling of transitions between tags; second, optimization at the sequence level.
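To make the complexity argument concrete, the following is a minimal sketch of the gold-sequence score and the forward-algorithm partition function for a single sentence; `emissions`, `transitions`, and `crf_log_likelihood` are illustrative names, not the authors' code.

```python
import torch

def crf_log_likelihood(emissions, transitions, tags):
    """Log-likelihood of one tag sequence under a linear-chain CRF.

    emissions:   (T, Y) tensor, emission scores f_theta(y_t, x_1^T, t) from the BLSTM
    transitions: (Y, Y) tensor, transition matrix A with A[i, j] = score(i -> j)
    tags:        (T,)   long tensor, gold tag indices y_1^T
    """
    T, Y = emissions.shape

    # Numerator: score of the gold sequence (Equation 1).
    gold = emissions[0, tags[0]]
    for t in range(1, T):
        gold = gold + transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]

    # Denominator: forward algorithm, log-sum-exp over all Y^T paths in O(T * Y^2).
    alpha = emissions[0]                                   # (Y,)
    for t in range(1, T):
        # alpha'[j] = logsumexp_i(alpha[i] + A[i, j]) + emissions[t, j]
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + emissions[t]
    log_partition = torch.logsumexp(alpha, dim=0)

    # log p(y | x) as in Equation 2.
    return gold - log_partition
```

The broadcast `alpha.unsqueeze(1) + transitions` is the $O(Y^2)$ step executed $T$ times, which is exactly the quadratic dependence on the tag set size discussed above.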
Despite these benefits, the training time of BLSTM-CRF models increases quadratically with the size of the tag set. Moreover, the CRF layer models only a first-order dependency between tags. Intuitively, modeling a higher-order dependency might be helpful, although it would aggravate the training problem. To this end, we propose a hybrid system that models the tag sequence dependencies with an LSTM-based LM rather than a CRF.

Hybrid System with Neural Language Models
By taking the logarithm of the sequence probability, we obtain

$$\log p_\theta(y_1^T \mid x_1^T) = s_\theta(x_1^T, y_1^T) - \log \sum_{\tilde{y}_1^T} \exp s_\theta(x_1^T, \tilde{y}_1^T) \quad (5)$$

The score of the tag sequence can be replaced by the right-hand side of Equation 1. We observe that the summation of the transition scores is very similar to the log probability of the tag sequence as calculated by a first-order LM. If we replace the transition matrix $A$ with an LSTM-based tag LM, the sequence score can be defined as

$$s_\theta(x_1^T, y_1^T) = \sum_{t=1}^{T} \left( \lambda \log p_{LM}(y_t \mid y_1^{t-1}) + f_\theta(y_t, x_1^T, t) \right) \quad (6)$$

where $\lambda$ is a scale that we apply to the LM log probabilities in practice. However, due to the introduction of the LSTM-based LM, sequence training with dynamic programming is no longer feasible. The following sections discuss two different approaches to train the hybrid model.
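A small sketch can make the replacement concrete. The `TagLM` module and `hybrid_score` function below are hypothetical names, with layer sizes matching the configuration reported in the experiments section; the prior on the first tag is skipped for brevity.

```python
import torch
import torch.nn as nn

class TagLM(nn.Module):
    """LSTM language model over tag sequences (sizes follow the experiments section)."""
    def __init__(self, num_tags, emb_dim=10, hidden=50):
        super().__init__()
        self.emb = nn.Embedding(num_tags, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_tags)

    def log_probs(self, tags):
        # tags: (1, T) tag indices; returns (1, T-1, Y) with log p(y_t | y_1^{t-1}).
        h, _ = self.lstm(self.emb(tags[:, :-1]))
        return torch.log_softmax(self.out(h), dim=-1)

def hybrid_score(emissions, tag_lm, tags, lam=1.0):
    """Sequence score of Equation 6: scaled LM log probabilities replace A.

    emissions: (T, Y) emission scores from the BLSTM tagger
    tags:      (T,)   long tensor of tag indices
    """
    T = tags.shape[0]
    score = emissions[torch.arange(T), tags].sum()         # emission part
    # Transition part from the tag LM; the prior p(y_1) is omitted here
    # (a begin-of-sequence token would handle it).
    lm_lp = tag_lm.log_probs(tags.unsqueeze(0))            # (1, T-1, Y)
    return score + lam * lm_lp[0, torch.arange(T - 1), tags[1:]].sum()
```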

Separate Training
The training objective is to maximize the sequence score of the true tags, i.e. $s_\theta(x_1^T, \hat{y}_1^T)$. We note that the first term and the second term of the new sequence score are independent of each other, which eventually leads to separate cross-entropy training of the BLSTM-based tagging model and the LSTM-based LM:

$$\theta_{\text{tagger}} = \operatorname*{arg\,max}_{\theta_{\text{tagger}}} \sum_{t=1}^{T} \log p_{\theta_{\text{tagger}}}(\hat{y}_t \mid x_1^T), \qquad \theta_{\text{LM}} = \operatorname*{arg\,max}_{\theta_{\text{LM}}} \sum_{t=1}^{T} \log p_{\theta_{\text{LM}}}(\hat{y}_t \mid \hat{y}_1^{t-1}) \quad (7)$$

Since no additional monolingual data is used and the tag set is usually quite small (compared to a word vocabulary), training such a tag LM is very cheap. The computational complexity is therefore dominated by the BLSTM-based tagging model, whose training is much faster than that of the CRF version: compared to the $O(TY^2)$ CRF layer, the computational complexity of the softmax output is only $O(TY)$. During decoding, we apply beam search and calculate the transition scores with the LSTM-based LM.
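A hedged sketch of such a decoder follows, reusing the hypothetical `TagLM` from above; it recomputes the LM state at every step purely for brevity, whereas a practical implementation would cache the LSTM states per hypothesis.

```python
import torch

def beam_search_decode(emissions, tag_lm, beam_size=8, lam=1.0):
    """Decoding with the hybrid score of Equation 6 (illustrative sketch).

    emissions: (T, Y) per-position tag scores from the BLSTM tagger
    tag_lm:    the TagLM sketched above
    """
    T, Y = emissions.shape

    # Initialize the beam from the first position (the LM prior on y_1 is
    # omitted, consistent with the hybrid score sketch above).
    scores, idx = emissions[0].topk(min(beam_size, Y))
    beams = [([i.item()], s.item()) for i, s in zip(idx, scores)]

    for t in range(1, T):
        candidates = []
        for seq, score in beams:
            prefix = torch.tensor([seq])                   # (1, t)
            # LM distribution over the next tag given the hypothesis prefix.
            h, _ = tag_lm.lstm(tag_lm.emb(prefix))
            lm_lp = torch.log_softmax(tag_lm.out(h[0, -1]), dim=-1)
            for y in range(Y):
                new_score = score + emissions[t, y].item() + lam * lm_lp[y].item()
                candidates.append((seq + [y], new_score))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]

    return beams[0]  # highest-scoring tag sequence and its score
```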

Joint Training
In addition to the separate training approach, the LSTM-based LM can also be trained jointly with the tagging model. In this case, a single training loss is computed and back-propagated through both the LM and the BLSTM-based tagging model.
To jointly train the hybrid model at the sequence level, theoretically the scores for all possible tag sequences must be calculated. The dynamic programming algorithm used for the first-order transition model is infeasible for our joint training with the LSTM-based LM.
A number of publications have dealt with issues related to sequence training, particularly with optimizing the search process (Wiseman and Rush, 2016; Daumé III and Marcu, 2005). Inspired by them, we develop a straightforward training method. Looking at the denominator in Equation 2, although it is not feasible to enumerate all tag sequences, we can estimate the denominator by considering only the highest-scoring hypotheses generated by beam search. The log-likelihood of the true tags can then be approximated by the beam score:

$$\log p_\theta(\hat{y}_1^T \mid x_1^T) \approx s_\theta(x_1^T, \hat{y}_1^T) - \log \sum_{y_1^T \in \mathcal{B}} \exp s_\theta(x_1^T, y_1^T) \quad (8)$$

i.e. the difference between the gold score and the beam score, where $\mathcal{B}$ denotes the set of beam hypotheses. Beam search enables us to generate the top $K$ hypotheses and their sequence scores, from which the beam score can be calculated. The problem with this approximation is that maximizing the log-likelihood may penalize the beam score. This is inconsistent with decoding, because the best hypothesis is generated from the beam; we therefore want the best-scoring hypotheses to stay on the beam. To address this issue, we refer to Wiseman and Rush (2016), in which subsequent candidates are generated from the true token when it falls off the beam. We adopt this idea in a simplified form: at each time step, we replace the $K$-th candidate with the true tag if it is not included in the $K$-best list.
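As a sketch of how the approximated loss and the beam modification might look (illustrative helper names, not the authors' code; note that for training, hypothesis scores must remain tensors so gradients can flow, unlike the detached scores in the decoding sketch above):

```python
import torch

def beam_approx_loss(gold_score, beam_scores):
    """Negative approximate log-likelihood of Equation 8.

    gold_score:  scalar tensor, hybrid score of the true tag sequence
    beam_scores: (K,) tensor, hybrid scores of the K beam hypotheses
    """
    # The partition function is estimated from the beam hypotheses only.
    beam_score = torch.logsumexp(beam_scores, dim=0)
    return beam_score - gold_score

def force_gold_on_beam(top_k, gold_prefix, gold_prefix_score):
    """Replace the K-th candidate with the gold prefix if it fell off the beam.

    top_k:       list of (tag_prefix, score) pairs, sorted by score descending
    gold_prefix: list of true tag indices up to the current time step
    """
    if all(seq != gold_prefix for seq, _ in top_k):
        # Keep training consistent with decoding: the true prefix stays on the beam.
        top_k = top_k[:-1] + [(gold_prefix, gold_prefix_score)]
    return top_k
```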

Experiments
We experiment on four benchmark NER tasks in three languages: CoNLL 2003 English/Dutch, OntoNotes English and GermEval 2014 German.

Models
As we conduct all our experiments with Flair (Akbik et al., 2019a), we adopt the recommended setup for the baseline BLSTM-CRF models. We use GloVe word embeddings (Pennington et al., 2014) for the CoNLL03 English task, FastText embeddings (Mikolov et al., 2018) for CoNLL03 Dutch and GermEval, and FastText web crawl embeddings for OntoNotes. Pooled Flair embeddings (Akbik et al., 2019b) are used for all experiments except OntoNotes, for which the non-pooled version (Akbik et al., 2018) is used. The BLSTM has a single 256-unit forward and backward LSTM layer followed by a CRF layer. For the non-CRF baseline models, the CRF layer is simply removed, leaving all other parameters unchanged. For each task, we train four CRF and four non-CRF models.
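For reference, the baseline setup could look roughly as follows using Flair's standard API; parameter names follow the public Flair tutorials and may differ across Flair versions, and the corpus and embedding choices shown are those for CoNLL03 English.

```python
# Sketch of the baseline BLSTM-CRF setup in Flair (version-dependent API).
from flair.datasets import CONLL_03
from flair.embeddings import WordEmbeddings, PooledFlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

corpus = CONLL_03()
tag_dictionary = corpus.make_tag_dictionary(tag_type="ner")

embeddings = StackedEmbeddings([
    WordEmbeddings("glove"),               # GloVe for CoNLL03 English
    PooledFlairEmbeddings("news-forward"),
    PooledFlairEmbeddings("news-backward"),
])

tagger = SequenceTagger(
    hidden_size=256,                       # single 256-unit BLSTM layer
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="ner",
    use_crf=True,                          # set False for the non-CRF baseline
)

trainer = ModelTrainer(tagger, corpus)
trainer.train("resources/taggers/ner-crf", max_epochs=150)
```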
In all our hybrid model experiments, the BLSTM tagger component has the same configuration as the non-CRF baselines, while the LSTM-based tag LM has an embedding matrix of size |Tags| × 10 and an LSTM layer of 50 units.

Results
The F1 scores of all models are shown in Table 1. The baseline BLSTM models with and without the CRF layer are indicated by CRF and Non-CRF. Hybrid models are denoted by Hybrid, with separate and joint indicating the training strategy used. The joint training method that forces true tags into the beam is indicated by force. As shown in Table 1, the separately trained hybrid models significantly outperform the non-CRF baselines and perform only slightly worse than the CRF models with respect to the F1 score. The jointly trained hybrid models perform comparably to the baseline CRF models. The modified sequence training method with forced true tags does not help, probably because the impact of the problem described in Subsection 3.2 decreases with increasing beam size.

Training Time
To evaluate training time, we perform all relevant experiments on a single NVIDIA GeForce 1080 Ti. When pre-trained embeddings are used, training time is dominated by the CRF layer due to its low GPU utilization and high complexity, so CRF models require much more training time than non-CRF ones. Therefore, the separately trained hybrid model, which is intrinsically a non-CRF NER model plus an LSTM-based tag LM, has an advantage over the CRF models in terms of training time. Table 2 shows the total training time of the baseline BLSTM-CRF NER model and our separately trained hybrid model. Both models are trained to convergence with a similar number of epochs. The training of the hybrid model is about 2.5 to 4 times faster than that of the CRF model, and the more entity types a corpus contains, the greater the speed advantage.

Discussion and Future Work
In this work, we use an LSTM-based tag LM as an alternative to the CRF layer, building a hybrid of a BLSTM-based NER model and an LSTM-based LM. The LM can either be trained in advance and used only for decoding, or trained jointly with the NER model: the separately trained hybrid system speeds up training significantly with only a marginal performance degradation, while the jointly trained system, which has no speed advantage, achieves performance comparable to state-of-the-art baselines. In future work, the hybrid model could be modified so that the transition score carries more information. The current joint training approach is straightforward but primitive, and a more sophisticated training method could also help. In addition, we plan to test the hybrid system on larger corpora with more entity types.