Constrained Decoding for Computationally Efficient Named Entity Recognition Taggers

Current state-of-the-art models for named entity recognition (NER) are neural models with a conditional random field (CRF) as the final layer. Entities are represented as per-token labels with a special structure in order to decode them into spans. Current work eschews prior knowledge of how the span encoding scheme works, instead relying on the CRF to learn which transitions are illegal in order to facilitate global coherence. We find that by constraining the output to suppress illegal transitions, we can train a tagger with a cross-entropy loss twice as fast as a CRF, with differences in F1 that are statistically insignificant, effectively eliminating the need for a CRF. We analyze the dynamics of tag co-occurrence to explain when these constraints are most effective and provide open-source implementations of our tagger in both PyTorch and TensorFlow.


Introduction
Named entity recognition (NER) is the task of finding phrases of interest in text that map to real-world entities such as organizations ("ORG") or locations ("LOC"). This is normally cast as a sequence labeling problem where each token is assigned a label that represents its entity type. Multi-token entities are handled by having special "Beginning" and "Inside" indicators that specify which tokens start, continue, or change the type of an entity. Ratinov and Roth (2009) show that the IOBES tagging scheme, where entity spans must begin with a "B" token, end with an "E" token, and where single-token entities are labeled with an "S", performs better than the traditional BIO scheme. The IOBES tagging scheme dictates that some token sequences are illegal. For example, one cannot start an entity with an "E" tag (such as a transition from an "O", meaning it is outside of an entity, to "E-ORG"), nor can one change types in the middle of an entity, for example transitioning from "I-ORG" to "I-LOC". Most approaches to NER rely on the model learning which transitions are legal from the training data rather than injecting prior knowledge of how the encoding scheme works.
It is conventional wisdom that, for NER, models with a linear-chain conditional random field (CRF) (Lafferty et al., 2001) layer perform better than those without, yielding relative performance increases between 2 and 3 percent in F1 (Ma and Hovy, 2016; Lample et al., 2016). A CRF with Viterbi decoding promotes, but does not guarantee, global coherence; simple greedy decoding does not promote coherence at all (Collobert et al., 2011). Therefore, in a bidirectional LSTM (biLSTM) model with a CRF layer, illegal transitions are rare compared to models that select the best scoring tag for each token.
Due to the high variance observed in the performance of NER models (Reimers and Gurevych, 2017), it is important to have fast training times to allow for multiple runs of these models. However, the CRF forward algorithm is O(NT^2), where N is the length of the sentence and T is the number of possible tags, which slows down training significantly. Moreover, substantial effort is required to build an optimized, correct implementation of this layer. Alternately, training with a cross-entropy loss runs in O(N) for sparse labels, and popular deep learning toolkits provide an easy-to-use, parallel version of this loss which brings the runtime down to O(log N).
We believe that, due to the strong contextualized local features with infinite context created by today's neural models, global features used in the CRF do little more than enforce the rules of an encoding scheme. Instead of traditional CRF training, we propose training with a cross-entropy loss and using Viterbi decoding (Forney, 1973) with heuristically determined transition probabilities that prohibit illegal transitions. We call this constrained decoding and find that it allows us to train models in half the time while yielding F1 scores comparable to CRFs.

Method
Training a tagger with a CRF is normally done by minimizing the negative log likelihood of the sequence of gold tags y given the input x, parameterized by the model, where the probability of the sequence is given by

p(y | x) = exp(Σ_i Σ_j λ_j f_j(y_{i-1}, y_i, x, i)) / Σ_{y'} exp(Σ_i Σ_j λ_j f_j(y'_{i-1}, y'_i, x, i))

By creating a feature function, f_j, that is span-encoding-scheme-aware, we can introduce constraints that penalize any sequence that includes an illegal transition by returning a large negative value. Note the summation over all possible tag sequences y' in the denominator. While efficient dynamic programs exist to make this sum tractable for linear-chain CRFs with Markov assumptions, it is still a costly normalization factor to compute.
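To make the cost of this normalization concrete, the following is a minimal pure-Python sketch of the forward algorithm for the log partition function (names such as `log_partition` are illustrative, not the paper's released implementation). The nested loop over tag pairs at every position is where the O(NT^2) cost arises.

```python
import math

def log_sum_exp(xs):
    # Numerically stable log(sum(exp(x))) over a list of scores.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_partition(emissions, transitions):
    """Forward algorithm: log Z(x) over all tag sequences, O(N * T^2).

    emissions:   N x T per-token tag scores from the model
    transitions: T x T scores, transitions[i][j] = score of tag i -> tag j
    """
    T = len(emissions[0])
    alpha = list(emissions[0])  # log-scores of all length-1 prefixes
    for n in range(1, len(emissions)):
        alpha = [
            log_sum_exp([alpha[i] + transitions[i][j] for i in range(T)])
            + emissions[n][j]
            for j in range(T)
        ]
    return log_sum_exp(alpha)
```

With all-zero scores, every one of the T^N paths contributes equally, so log Z reduces to N * log T, a quick sanity check on the dynamic program.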
In neural models, these feature functions are represented as a transition matrix that holds the score of moving from one tag at index i to another at index i+1. We implement a mask that effectively eliminates invalid IOBES transitions by setting those scores to large negative values. By applying this mask to the transition matrix we can simulate feature functions that down-weight illegal transitions.
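As a concrete illustration, such a mask might be built as follows. This is a sketch assuming string tag names; `iobes_mask` is a hypothetical helper, not the released code, and start-of-sequence constraints are omitted for brevity.

```python
def iobes_mask(tags, neg=-1e4):
    """Build a T x T additive mask where mask[i][j] = 0 if tag i -> tag j
    is legal under IOBES and a large negative value otherwise.
    Tag names are strings like "O", "B-ORG", "I-ORG", "E-ORG", "S-LOC".
    """
    def parts(t):
        # Split a tag into (prefix, entity type); "O" has no type.
        return (t, None) if t == "O" else tuple(t.split("-", 1))

    def legal(a, b):
        pa, ea = parts(a)
        pb, eb = parts(b)
        if pa in ("O", "E", "S"):          # outside, or an entity just ended
            return pb in ("O", "B", "S")   # may only stay outside or start fresh
        # pa in ("B", "I"): an entity is open; must continue or close it
        # with the SAME entity type
        return pb in ("I", "E") and eb == ea

    return [[0.0 if legal(a, b) else neg for b in tags] for a in tags]
```

Adding this mask to per-transition scores leaves legal moves untouched while making illegal ones effectively impossible under a max or softmax.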
Contrast the CRF loss with the token-level cross-entropy loss, where y_i is the correct label at position i and ŷ_i is the model's predicted distribution over tags:

L(y, ŷ) = -Σ_i log ŷ_i[y_i]
Here we can see that the loss for each element in the input i can be computed independently due to the lack of a global normalization factor. This lack of a global view is potentially harmful, as we lose the ability to condition on the previous label decision to avoid making illegal transitions. We hypothesize that, using our illegal transition heuristics, we can create feature functions that do not have to be trained, but can be applied at test time and allow for contextual coherence while using a cross-entropy loss.
We can use the mask directly as the transition matrix to calculate the maximum probability sequence while avoiding illegal transitions for models that were not trained with a CRF. Using these transitions scores in conjunction with cross-entropy trained models, we can achieve comparable models that train more quickly. We call this method constrained decoding.
Constrained decoding is relatively easy to implement: given a working CRF implementation, all one needs to do is apply the transition mask to the CRF transition parameters to create a constrained CRF. Replacing the transition parameters with the mask yields our constrained decoding model. Starting from scratch, one only needs to implement Viterbi decoding, using the mask as transition parameters, to implement the constrained decoding model, avoiding the need for the CRF forward algorithm and the CRF loss.
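A minimal pure-Python sketch of this Viterbi decoder follows (`constrained_viterbi` is an illustrative name; the released implementations are in PyTorch and TensorFlow). The mask built from the encoding scheme is passed directly where a CRF would supply its learned transition matrix.

```python
def constrained_viterbi(emissions, mask):
    """Viterbi decode using the IOBES legality mask as the transition
    scores. `emissions` are N x T per-token tag scores from a model
    trained with plain cross-entropy (no CRF); `mask` is T x T with 0
    for legal transitions and a large negative value for illegal ones.
    Returns the highest-scoring legal tag index sequence.
    """
    T = len(emissions[0])
    score = list(emissions[0])
    back = []  # backpointers for each position after the first
    for n in range(1, len(emissions)):
        prev, step = [], []
        for j in range(T):
            best_i = max(range(T), key=lambda i: score[i] + mask[i][j])
            prev.append(best_i)
            step.append(score[best_i] + mask[best_i][j] + emissions[n][j])
        back.append(prev)
        score = step
    best = max(range(T), key=lambda j: score[j])
    path = [best]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    return path[::-1]
```

In the test below, greedy decoding would pick an "O" followed by an "E-" tag, an illegal pair; the masked Viterbi pass instead recovers the legal "B-", "E-" sequence.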
For constrained decoding, we leverage the IOBES tagging scheme rather than BIO tagging, allowing us to inject more structure into the decoding mask. Early experiments with BIO tagging failed to show the large gains we realized using IOBES tagging for the reasons mentioned in Section 4.

Experiments & Results
To test if we can replace the CRF with constrained decoding, we use two sequential prediction tasks: NER (CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003), WNUT-17 (Derczynski et al., 2017), and OntoNotes (Hovy et al., 2006)) and slot-filling (Snips (Coucke et al., 2018)). For each (task, dataset) pair we use common embeddings and hyperparameters from the literature. The baseline models are biLSTM-CRFs with character compositional features based on convolutional neural networks (Dos Santos and Zadrozny, 2014); our models are identical except that we train with a cross-entropy loss and use the encoding scheme constraints as transition probabilities instead of learning them with a CRF. Our hyper-parameters mostly follow Ma and Hovy (2016), except that we use multiple pre-trained word embeddings concatenated together (Lester et al., 2020). For OntoNotes we follow Chiu and Nichols (2016). See Section A.7 or the configuration files in our implementation for more details.
OntoNotes is the only dataset where the difference in performance between the CRF and constrained decoding is statistically significant (p < 0.05); all scores are entity-level F1, reported across 10 runs. We explore this discrepancy in Section 4. Similarly, Table 2 shows that when we apply constrained decoding to a variety of internal datasets, which span a diverse set of specific domains, we do not observe any statistically significant differences in F1 between CRF and constrained decoding models.
The models were trained using Mead-Baseline (Pressel et al., 2018), an open-source framework for creating, training, evaluating, and deploying NLP models. The constrained decoding tagger is much faster at training time: even when compared to the optimized, batched CRF provided by Mead-Baseline, it trained in 51.2% of the time required by the CRF.
In addition to faster training times, training our constrained models produces only 65% of the CO2 emissions that the CRF does. While GPU computations for the constrained model draw 1.3 times more power than for the CRF, due to the greater degree of possible parallelism in the cross-entropy loss function, the reduction in training time results in smaller carbon emissions, as calculated following Strubell et al. (2019).
Constrained decoding can also be applied to a CRF. The CRF does not always learn the rules of a transition scheme, especially in early training iterations, and applying the constraints to the CRF can improve both F1 and convergence speed. We establish this by training biLSTM-CRF models with and without constraints on CoNLL 2003. We find that the constraint mask yields a small (albeit statistically insignificant) boost in F1, as shown in Table 4. Our experiments suggest that injecting prior knowledge of the transition scheme helps the model focus on learning the features for sequence tagging (and not the transition rules themselves) and train faster. Table 5 shows that our constrained model converged faster on CoNLL 2003, on average, than an unconstrained CRF.

Table 3: Results on well-known datasets, presented as relative differences to help frame the results in Table 2.
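Constraining a CRF in this way amounts to a single additive operation on its learned transition scores. A sketch, under the assumption that transitions and mask are plain T x T lists of floats (`constrain_crf_transitions` is an illustrative name):

```python
def constrain_crf_transitions(transitions, mask):
    """Add the legality mask to learned CRF transition scores, yielding
    a 'constrained CRF': illegal moves stay suppressed regardless of
    what the CRF has learned so far, while legal moves (mask value 0)
    keep their learned scores unchanged."""
    return [[t + m for t, m in zip(t_row, m_row)]
            for t_row, m_row in zip(transitions, mask)]
```

Because the mask is 0 for legal transitions, the learned scores pass through untouched; only illegal transitions are pushed toward negative infinity.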

Analysis
The relatively poor performance of constrained decoding on OntoNotes suggests that there are several classes of transition that it cannot model, for example, the transition distribution between entity types or the prior distribution of entities. We analyzed the datasets to identify the characteristics that cause constrained decoding to fail. One presumably obvious characteristic is the number of entity types. However, our experiments suggest that the number of entity types does not affect performance: Snips has more entity types than OntoNotes, yet constrained decoding works better for Snips.
We define an ambiguous token as a token whose type has multiple tag values in the dataset. For example, the token "Chicago" could be "I-LOC" or "I-ORG" in the phrases "the Chicago River" and "the Chicago Bears" respectively. Such ambiguous tokens are the ones for which we expect global features to be particularly useful. A "strictly dominated token" is defined as a token that can only take on a single value due to the legality of the transition from the previous tag. In the above example, given that "the" was a "B-LOC", "Chicago" is strictly dominated and forced to be an "I-LOC". Contrast this with a non-strictly dominated token that can still have multiple possible tag values when conditioned on the previous tag. As constrained decoding eliminates illegal transitions, we would expect it to perform well on datasets where a large proportion of ambiguous tokens are strictly dominated. This tends to hold true: only 15.9% of OntoNotes' ambiguous tokens are strictly dominated, while 70.7% of CoNLL's and 73.6% of WNUT-17's are.
We believe that the ambiguity of the first and last token of an entity also plays a role. Once we start an entity, constrained decoding vastly narrows the scope of decisions that need to be made. Instead of making a decision over the entire set of tags, we only decide if we should continue the entity with an "I-" or end it with an "E-". Therefore, we expect constrained decoding to work well on datasets that have fairly unambiguous entity starts and ends. We quantify this by finding the proportion of entities that begin (or end) with an unambiguous type, that is, where the first token of an entity only has a single label throughout the dataset; for example, "Kuwait" is only labeled with "S-LOC" in the CoNLL dataset. We call these metrics "Easy First" and "Easy Last" respectively and find that datasets with higher constrained decoding performance also have higher percentages of entities with an easy first or last token. A summary of these characteristics for each dataset is found in Table 6.
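The corpus statistics above can be sketched as follows. This is a hypothetical helper illustrating the definitions of ambiguous tokens and "Easy First", not the paper's exact analysis code; "Easy Last" and strict domination would be computed analogously.

```python
from collections import defaultdict

def tag_ambiguity_stats(sentences):
    """Compute (ambiguous_tokens, easy_first) for a tagged corpus.

    sentences: list of sentences, each a list of (token, IOBES tag) pairs.
    ambiguous_tokens: tokens observed with more than one tag value.
    easy_first: fraction of entities whose first token has only a single
    label anywhere in the corpus.
    """
    labels = defaultdict(set)
    for sent in sentences:
        for tok, tag in sent:
            labels[tok].add(tag)
    ambiguous = {t for t, ts in labels.items() if len(ts) > 1}

    firsts, easy = 0, 0
    for sent in sentences:
        for tok, tag in sent:
            if tag.startswith(("B-", "S-")):  # first token of an entity
                firsts += 1
                if len(labels[tok]) == 1:     # unambiguous entity start
                    easy += 1
    return ambiguous, (easy / firsts if firsts else 0.0)
```

On the toy corpus in the test, "Chicago" is ambiguous (it starts both a LOC and an ORG), so only one of the three entity starts ("Kuwait") counts as easy.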
This also explains why constrained decoding doesn't work as well for BIO-encoded CoNLL as it does for IOBES. When using the IOBES format, more tokens are strictly dominated. The other stark difference is the proportion of "Easy Last" entities. Without the "E-" token, much less structure can be injected into the model, resulting in decreased performance of constrained decoding. These trends also hold true in internal datasets, where the Automotive dataset had the fewest incidences of each of these phenomena.
While not perfect predictors for the performance of constrained decoding, the metrics chosen are good proxies and can be used as a prescriptive measure for new datasets.

Previous Work
Our approach is similar in spirit to previous work in NLP where constraints are introduced at training and inference time (Roth and Yih, 2005; Punyakanok et al., 2005) to lighten the computational load, and to Strubell et al. (2018), where prior knowledge is injected into the model by manual manipulation. In our approach, however, we focus specifically on manipulating the model weights themselves rather than model features.
There have been attempts to eliminate the CRF layer. Notably, Shen et al. (2017) found that an additional LSTM greedy decoder layer is competitive with the CRF layer, though their baseline is much weaker than the models found in other work. Additionally, their decoder has an auto-regressive relationship that is difficult to parallelize and, in practice, still incurs significant overhead at training time. Chiu and Nichols (2016) mention good results with a similar technique but do not provide in-depth analysis or metrics, nor do they test its generality.

Conclusion
For sequence tagging tasks, a CRF layer introduces substantial computational cost. We propose replacing it with a lightweight technique, constrained decoding, which doubles the speed of training while achieving comparable F1 performance. We analyze the algorithm to understand where it might work or fail and propose prescriptive measures for using it. The broad theme of this work is to find simple, computationally efficient modifications to current networks and to identify possible failure cases. While larger models have shown significant improvements, we believe there is still value in investigating small, targeted changes. In the future, we want to explore similar techniques in other common NLP tasks.

Table 6: Analysis of the tag dynamics and co-occurrence. We see that OntoNotes is an outlier in the percentage of ambiguous tokens that are strictly dominated by their context, the entities that have easy-to-spot starting tokens, and entities with clearly defined ends. All of these quirks of the data help explain why we only see a statistically significant performance drop for OntoNotes.

A.1 Hyperparameters
Mead-Baseline is a configuration-file-driven model training framework. All hyperparameters are fully specified in the configuration files included with the source code for our experiments.

A.2 Statistical Significance
For all claims of statistical significance, we use a t-test as implemented in SciPy (Virtanen et al., 2020) with an alpha value of 0.05.

A.3 Computational Resources
All models were trained on a single NVIDIA 1080Ti. While multiple GPUs were used for training many models in parallel to facilitate testing many datasets and to estimate the variability of the method, the actual model can easily be trained on a single GPU.

A.4 Evaluation
To calculate metrics, entity-level F1 is used for both NER and slot-filling. In entity-level F1, entities are created from the token-level labels and compared to the gold entities. Entities that match on both type and boundaries are considered correct, while a mismatch in either causes an error. The F1 score is then calculated over these entities. We use the evaluation code that ships with Mead-Baseline, which we have bundled with the source code for our experiments.
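The evaluation described above can be sketched in a few lines. This is an illustrative reimplementation (`iobes_to_spans` and `entity_f1` are hypothetical names), not the bundled Mead-Baseline evaluation code.

```python
def iobes_to_spans(tags):
    """Decode an IOBES tag sequence into (type, start, end) spans,
    with `end` exclusive. Malformed sequences simply drop the open span."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("S-"):
            spans.append((tag[2:], i, i + 1))
        elif tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("E-") and etype == tag[2:]:
            spans.append((etype, start, i + 1))
            start, etype = None, None
        elif not tag.startswith("I-"):
            start, etype = None, None  # "O" or malformed continuation
    return spans

def entity_f1(gold_spans, pred_spans):
    """Entity-level F1: a predicted span counts as correct only when
    both its type AND its boundaries match a gold span exactly."""
    correct = len(set(gold_spans) & set(pred_spans))
    p = correct / len(pred_spans) if pred_spans else 0.0
    r = correct / len(gold_spans) if gold_spans else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

Note that a prediction with the right type but wrong boundaries (or vice versa) contributes to both a false positive and a false negative, which is why boundary errors are penalized twice in this style of evaluation.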

A.5 Model Size
The number of parameters in different models can be found in Table 7.

A.6 Dataset Information
Relevant information about datasets can be found in Table 8. The majority of data is used as distributed, except we convert NER and slot-filling datasets to the IOBES format. All public datasets are included in the supplementary material. A quick overview of each dataset follows: CoNLL: A NER dataset based on news text. We converted the IOB labels into the IOBES format. There are 4 entity types: MISC, LOC, PER, and ORG.
WNUT-17: A NER dataset of new and emerging entities based on noisy user text. We converted the BIO labels into the IOBES format. There are 6 entity types, corporation, creative-work, group, location, person, and product.
OntoNotes: A much larger NER dataset. We converted the labels into the IOBES format. There are 18 entity types, CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, and WORK OF ART.
Snips: A slot-filling dataset focusing on commands one would give a virtual assistant. We converted the dataset from its original format of two associated files, one containing surface terms and one containing labels, into the more standard CoNLL file format, and converted the labels into the IOBES format.

"Character Filter Size" is the number of tokens the character compositional convolutional neural network covers in a single window, "Character Feature Size" is the number of convolutional feature maps used, and "Character Embed Size" is the dimensionality of the vectors each character is mapped to before it is input to the convolutional network. The "RNN Size" is the size of the output after the RNN, which means that bidirectional RNNs are composed of two RNNs, one in each direction, where each is half the "RNN Size". "Drop In" is the probability that an entire token will be dropped from the input, while "Drop Out" is the probability that individual neurons are dropped out (Srivastava et al., 2014).