Masked Conditional Random Fields for Sequence Labeling

Conditional Random Field (CRF) based neural models are among the most performant methods for solving sequence labeling problems. Despite their great success, CRF models have the shortcoming of occasionally generating illegal sequences of tags, e.g. sequences containing an “I-” tag immediately after an “O” tag, which is forbidden by the underlying BIO tagging scheme. In this work, we propose Masked Conditional Random Field (MCRF), an easy-to-implement variant of CRF that imposes restrictions on candidate paths during both the training and decoding phases. We show that the proposed method thoroughly resolves this issue and brings significant improvement over existing CRF-based models at near-zero additional cost.


Introduction
Sequence labeling problems such as named entity recognition (NER), part-of-speech (POS) tagging and chunking have long been considered fundamental NLP tasks and have drawn researchers' attention for many years.
Traditional work is based on statistical approaches such as Hidden Markov Models (Baum and Petrie, 1966) and Conditional Random Fields (Lafferty et al., 2001), where handcrafted features and task-specific resources are used. With advances in deep learning, neural network based models have achieved dominance in sequence labeling tasks in an end-to-end manner. These models typically consist of a neural encoder that maps the input tokens to embeddings capturing global sequence information, and a CRF layer that models dependencies between neighboring labels. Popular choices of neural encoder have been the convolutional neural network (Collobert et al., 2011) and the bidirectional LSTM (Huang et al., 2015). Recently, pretrained language models such as ELMo or BERT (Devlin et al., 2019) have proven far superior as sequence encoders, achieving state-of-the-art results on a broad range of sequence labeling tasks.
Most sequence labeling models adopt a BIO or BIOES tag encoding scheme (Ratinov and Roth, 2009), which forbids certain tag transitions by design. Occasionally, a model may yield a sequence of predicted tags that violates the rules of the scheme. Such predictions, subsequently referred to as illegal paths, are erroneous and must be dealt with. Existing methods rely on hand-crafted post-processing procedures to resolve this problem, typically by retaining the illegal segments and re-tagging them. But as we shall show in this work, such treatment is arbitrary and leads to suboptimal performance.
The main contribution of this paper is to give a principled solution to the illegal path problem. More precisely: 1. We show that in the neural-CRF framework the illegal path problem is intrinsic and may account for a non-negligible proportion (up to 40%) of total errors. To the best of our knowledge, we are the first to conduct this kind of study.
2. We propose Masked Conditional Random Field (MCRF), a constrained version of the CRF that is by design immune to the illegal paths problem. We also devise an algorithm for MCRF that incurs almost zero overhead and requires only a few lines of code to implement. Further, we provide a theoretical justification of the proposed method.
3. We show in comprehensive experiments that MCRF performs significantly better than its CRF counterpart, and that its performance is on par with, and sometimes better than, more sophisticated models. We achieve new state-of-the-art results on two Chinese NER datasets.
The remainder of the paper is organized as follows. Section 2 describes the illegal path problem and existing strategies that resolve it. In Section 3 we propose MCRF, its motivation and an approximate implementation. Section 4 is devoted to numerical experiments. We conclude the current work in Section 5.

[Table 1: Statistics of illegal segments, with illegal segments handled as in (Sang et al., 2000); see Section 2.2 for more details. "TP" and "FP" refer to "True Positive" and "False Positive" respectively. The column "illegal & TP / illegal" indicates the proportion of illegal segments that are correct predictions. The column "illegal & FP / FP" indicates the proportion of erroneous predictions that are due to illegal segments. The column "illegal / total" stands for the proportion of illegal segments over all predictions.]
The Illegal Path Problem

Problem Statement

As a common practice, most sequence labeling models utilize a certain tag encoding scheme to distinguish the boundary and the type of the text segments of interest. An encoding scheme makes this possible by introducing a set of tag prefixes and a set of tag transition rules. For instance, the popular BIO scheme distinguishes the Beginning, the Inside and the Outside of the chunks of interest, imposing that any I-* tag must be preceded by a B-* tag or another I-* tag of the same type. Thus "O O O I-LOC I-LOC O" is a forbidden sequence of tags, because the transition O → I-LOC directly violates the BIO scheme design. Hereafter we shall refer to a sequence of tags that contains at least one illegal transition as an illegal path.
As another example, the BIOES scheme further identifies the Ending of text segments and the Singleton segments, thereby introducing more transition restrictions than BIO: e.g., an I-* tag must always be followed by an E-* tag of the same type, an S-* tag can only be preceded by an O, an E-* or another S-* tag, etc. For a comparison of the performance of these encoding schemes, we refer to (Ratinov and Roth, 2009) and references therein.
When training a sequence labeling model with an encoding scheme, it is generally our hope that the model will learn the semantics and the transition rules of the tags from the training data. However, even if the dataset is noiseless, a properly trained model may still occasionally make predictions that contain illegal transitions. This is especially the case for CRF-based models, as there is no built-in hard mechanism to enforce those rules. The CRF ingredient by itself is only a soft mechanism that encourages legal transitions and penalizes illegal ones.
The hard transition rules might be violated when the model deems it necessary. To see this, let us consider a toy corpus where every occurrence of the token "America" is within the context of "North America", thus the token is always labeled as I-LOC. Then, during training, the model may well establish the rule "America ⇒ I-LOC" (Rule 1), among many other rules such as "an I-LOC tag does not follow an O tag" (Rule 2), etc. Now consider the test sample "Nathan left America last month", which contains a stand-alone "America" labeled as B-LOC. During inference, as the model never saw a stand-alone "America" before, it must generalize. If the model is more confident on Rule 1 than Rule 2, then it may yield an illegal output "O O I-LOC O O".

Strategies
The phenomenon of illegal paths has already been noticed, but has somehow been regarded as a trivial matter.
For the BIO format, Sang et al. (2000) have stated that "The output of a chunk recognizer may contain inconsistencies in the chunk tags in case a word tagged I-X follows a word tagged O or I-Y, with X and Y being different. These inconsistencies can be resolved by assuming that such I-X tags starts a new chunk."
This simple strategy was adopted by CoNLL-2000 as a standard post-processing procedure¹ for the evaluation of model performance, and has gained popularity ever since.
We argue that such treatment is not only arbitrary, but also suboptimal. In preliminary experiments we studied the impact of the illegal path problem using the BERT-CRF model on a number of tasks and datasets. Our findings (see Table 1) suggest that although the illegal segments only account for a small fraction (typically around 1%) of total predicted segments, they constitute approximately a quarter of the false positives. Moreover, we found that only a few illegal segments are actually true positives. This raises the question of whether retaining the illegal segments is beneficial. As a matter of fact, as we will subsequently show, a much higher macro F1-score can be obtained if we simply discard every illegal segment.
Although the strategy of discarding the illegal segments may be superior to that of (Sang et al., 2000), it is nonetheless a hand-crafted, crude rule that lacks flexibility. To see this, let us take the example in Fig. 1. The prediction for the text segment World Boxing Council is (B-MISC, I-ORG, I-ORG), which contains an illegal transition B-MISC → I-ORG. Clearly, neither of the post-processing strategies discussed above is capable of resolving the problem. Ideally, an optimal solution would convert the predicted tags to either (B-MISC, I-MISC, I-MISC) or (B-ORG, I-ORG, I-ORG), whichever is more likely. This is exactly the starting point of MCRF, which we introduce in the next section.
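To make the two post-processing strategies concrete, the following minimal sketch (our own helper functions, not taken from the paper's released code) detects orphan I-* tags under the BIO scheme and applies either the "retain" fix of Sang et al. (2000) or the "discard" fix:

```python
def illegal_bio_transitions(tags):
    """Return indices i where the transition tags[i-1] -> tags[i] violates BIO."""
    bad, prev = [], "O"
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            # An I-X tag must follow a B-X or I-X tag of the same type X.
            if not (prev.startswith(("B-", "I-")) and prev[2:] == tag[2:]):
                bad.append(i)
        prev = tag
    return bad

def retain(tags):
    """CoNLL-style fix: assume an orphan I-X starts a new chunk (retag as B-X)."""
    fixed, prev = [], "O"
    for tag in tags:
        if tag.startswith("I-") and not (prev.startswith(("B-", "I-")) and prev[2:] == tag[2:]):
            tag = "B-" + tag[2:]
        fixed.append(tag)
        prev = tag
    return fixed

def discard(tags):
    """Alternative fix: drop every illegal segment, i.e. retag orphan I-X runs as O."""
    fixed, prev = [], "O"
    for tag in tags:
        if tag.startswith("I-") and not (prev.startswith(("B-", "I-")) and prev[2:] == tag[2:]):
            tag = "O"
        fixed.append(tag)
        prev = tag
    return fixed
```

On the Fig. 1 example, `retain(["B-MISC", "I-ORG", "I-ORG"])` yields (B-MISC, B-ORG, I-ORG): neither strategy can recover a single coherent entity, which is precisely what motivates MCRF.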

Approach
In this section we introduce the motivation and implementation of MCRF. We first go over conventional neural CRF models in Section 3.1. We then introduce MCRF in Section 3.2. Its implementation is given in Section 3.3.

¹ We are referring to the conlleval script, available from https://www.clips.uantwerpen.be/conll2000/chunking/.

Neural CRF Models
Conventional neural CRF models typically consist of a neural network and a CRF layer. The neural network component serves as an encoder that usually first maps the input sequence of tokens to a sequence of token encodings, which is then transformed (e.g. via a linear layer) into a sequence of token logits. Each logit therein models the emission scores of the underlying token. The CRF component introduces a transition matrix that models the transition score from tag i to tag j for any two consecutive tokens. By aggregating the emission scores and the transition scores, deep CRF models assign a score for each possible sequence of tags.
Before going any further, let us introduce some notation. In the sequel, we denote by x = {x_1, x_2, ..., x_T} a sequence of input tokens, by y = {y_1, ..., y_T} their ground-truth tags and by l = {l_1, ..., l_T} the logits generated by the encoder network of the model. Let d be the number of distinct tags and denote by [d] := {1, ..., d} the set of tag indices. Then y_i ∈ [d] and l_i ∈ R^d for 1 ≤ i ≤ T. We denote by W the set of all trainable weights in the encoder network, and by A = (a_ij) ∈ R^{d×d} the transition matrix introduced by the CRF, where a_ij is the transition score from tag i to tag j. For convenience we call a sequence of tags a path. For given input x, encoder weights W and transition matrix A, we define the score of a path p = {n_1, ..., n_T} as

s(p, x) = Σ_{i=1}^T l_{i,n_i} + Σ_{i=2}^T a_{n_{i−1},n_i},    (1)

where l_{i,j} denotes the j-th entry of l_i. Let S be the set of all training samples, and P be the set of all possible paths. Then the loss function of the neural CRF model is the average negative log-likelihood over S:

L(W, A) = −(1/|S|) Σ_{(x,y)∈S} log [ exp s(y, x) / Σ_{p∈P} exp s(p, x) ],    (2)

where we have omitted the dependence of s(·, ·) on (W, A) for conciseness. One can easily minimize L(W, A) using any popular first-order method such as SGD or Adam. Let (W_opt, A_opt) be a minimizer of L. During the decoding phase, the predicted path for a test sample x_test is the path having the highest score, i.e.

y_pred = argmax_{p∈P} s(p, x_test),    (3)

where the score is computed with (W_opt, A_opt). The decoding problem can be efficiently solved by the Viterbi algorithm.

[Figure 1: The CRF decoding selects the illegal transition B-MISC → I-ORG, which results in two erroneous predictions: MISC for "World" and ORG for "Boxing Council". When using MCRF instead, the decoding algorithm has to search for an alternative path (red arrows), as all illegal transitions are blocked. In this example, MCRF correctly predicts MISC for the entity "World Boxing Council".]
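For concreteness, the path score and the Viterbi decoder can be written down in a few lines of pure Python. This is an illustrative sketch with our own naming, omitting the separate start/stop scores that some implementations add:

```python
def path_score(logits, trans, path):
    """s(p, x): emission scores logits[i][n_i] plus transitions trans[n_{i-1}][n_i]."""
    s = logits[0][path[0]]
    for i in range(1, len(path)):
        s += logits[i][path[i]] + trans[path[i - 1]][path[i]]
    return s

def viterbi(logits, trans):
    """Return the highest-scoring of the d^T candidate paths in O(T * d^2) time."""
    d = len(logits[0])
    score = list(logits[0])  # best score of any path ending in tag j at step t
    back = []                # backpointers for path recovery
    for t in range(1, len(logits)):
        new, ptr = [], []
        for j in range(d):
            best = max(range(d), key=lambda i: score[i] + trans[i][j])
            ptr.append(best)
            new.append(score[best] + trans[best][j] + logits[t][j])
        score = new
        back.append(ptr)
    j = max(range(d), key=lambda k: score[k])
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    return path[::-1]
```

With two tags and strongly negative cross-tag transition scores, for instance, the decoder prefers to stay in a single tag even when one emission score disagrees.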

Masked CRF
Our major concern with conventional neural CRF models is that no hard mechanism exists to enforce the transition rules, resulting in the occasional occurrence of illegal predictions. Our solution to this problem is very simple. Denote by I the set of all illegal paths. We propose to constrain the path space of the CRF model to the space of all legal paths P/I, instead of the entire space of all possible paths P. To this end, 1. during training, the normalization term in (2) should be the sum of the exponential scores of the legal paths only; 2. during decoding, the optimal path should be searched over the space of all legal paths.
The first modification above leads to the following new loss function:

L̃(W, A) = −(1/|S|) Σ_{(x,y)∈S} log [ exp s(y, x) / Σ_{p∈P/I} exp s(p, x) ],    (4)

which is obtained by replacing the P in (2) by P/I. Similarly, the second modification leads to

ỹ_pred = argmax_{p∈P/I} s(p, x_test),    (5)

obtained by replacing the P in (3) by P/I, where the score is computed with a minimizer (W_opt, A_opt) of (4). Note that the decoding objective (5) alone is enough to guarantee the complete elimination of illegal paths. However, this would create a mismatch between training and inference, as the model would attribute non-zero probability mass to the ensemble of the illegal paths. In Section 4.1, we will see that a naive solution based on (5) alone leads to suboptimal performance compared to a proper solution based on both (4) and (5). Although in principle it is possible to directly minimize (4), thanks to the following proposition we can also achieve this by reusing the existing tools originally designed for minimizing (2), thereby saving us from making extra engineering efforts.
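In practice, both modifications reduce to masking the transition matrix. Below is a minimal sketch of building the illegal-transition set Ω for the BIO scheme and masking A; the helper names are ours, and the mask value −10⁴ is an arbitrary "negatively large" choice:

```python
NEG = -1e4  # the transition mask c: any sufficiently negative constant works

def illegal_transitions(tags):
    """Omega: index pairs (i, j) such that the transition tags[i] -> tags[j]
    violates the BIO scheme."""
    omega = set()
    for i, src in enumerate(tags):
        for j, dst in enumerate(tags):
            if dst.startswith("I-"):
                same_chunk = src.startswith(("B-", "I-")) and src[2:] == dst[2:]
                if not same_chunk:
                    omega.add((i, j))
    return omega

def mask(trans, omega, c=NEG):
    """The masked matrix A-bar(c): entries in Omega replaced by c."""
    d = len(trans)
    return [[c if (i, j) in omega else trans[i][j] for j in range(d)]
            for i in range(d)]
```

Running Viterbi over `mask(A, omega)` realizes constrained decoding; pinning the same entries to c throughout training approximates the restricted training objective, as formalized below.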

Algorithm
Denote by Ω ⊂ [d] × [d] the set of all illegal transitions. For a given transition matrix A, we denote by Ā(c) = (ā_ij(c)) the masked transition matrix of A, defined as (see Fig. 2)

ā_ij(c) = c if (i, j) ∈ Ω,  and  ā_ij(c) = a_ij otherwise,

where c ≪ 0 is the transition mask.

Proposition 1. For arbitrary model weights W_0 and transition matrix A_0,

lim_{c→−∞} L(W_0, Ā_0(c)) = L̃(W_0, A_0),

where L̃ denotes the restricted loss (4). Moreover, for negatively large enough c, the gradients of the two losses with respect to (W, A) can be made arbitrarily close.

Proof. See Appendix.

Proposition 1 states that for any given model state (W, A), if we mask the entries of A that correspond to illegal transitions (see Figure 2) with a negatively large enough constant c, then the two objectives (2) and (4), as well as their gradients, can be arbitrarily close. This suggests that the task of minimizing (4) can be accomplished by minimizing (2) while keeping A masked (i.e. holding a_ij = c constant for all (i, j) ∈ Ω) throughout the optimization process.
Intuitively, the purpose of transition masking is to penalize the illegal transitions in such a way that they will never be selected during Viterbi decoding, and the illegal paths as a whole constitute only negligible probability mass during training.
Based on Proposition 1, we propose the Masked CRF approach, formally described in Algorithm 1.
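The resulting training loop differs from ordinary CRF training by a single extra step: re-pinning the masked entries after every optimizer update. A framework-free sketch follows, in which the gradient matrices are a toy stand-in for what backpropagation through the CRF loss would produce:

```python
NEG = -1e4  # the mask value c

def masked_crf_train(A, omega, grad_batches, lr=0.1):
    """Sketch of the masked training loop. A is the d x d transition matrix,
    omega the set of illegal (i, j) pairs, and grad_batches an iterable of
    per-step gradient matrices standing in for autodiff output."""
    d = len(A)
    for (i, j) in omega:          # apply the mask once at initialization
        A[i][j] = NEG
    for grad in grad_batches:
        for i in range(d):        # ordinary first-order update ...
            for j in range(d):
                A[i][j] -= lr * grad[i][j]
        for (i, j) in omega:      # ... then maintain the mask
            A[i][j] = NEG
    return A
```

With the mask maintained, a standard CRF library minimizing the usual loss effectively minimizes the restricted one, per Proposition 1; the encoder weights W are updated as usual and are omitted here.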

Experiments
In this section, we run a series of experiments² to evaluate the performance of MCRF. The datasets used in our experiments are listed as follows:

² Our code is available at https://github.com/DandyQi/MaskedCRF.
Algorithm 1: MCRF training
1: Input: training set S, illegal transition set Ω, mask value c ≪ 0
2: Initialize encoder weights W and transition matrix A
3: while not converged do
4: sample a mini-batch from S
5: update (W, A) by a first-order step on the loss (2)
6: for all (i, j) ∈ Ω do
7: a_ij ← c ▹ maintain the mask
8: end for
9: end while
10: Output: Optimized W and A.
• Chinese NER: OntoNotes 4.0 (Weischedel et al., 2011), MSRA (Levow, 2006), Weibo (Peng and Dredze, 2015) and Resume (Zhang and Yang, 2018).

The statistics of these datasets are summarized in Table 2. In preliminary experiments, we found that the discriminative fine-tuning approach (Howard and Ruder, 2018) works better than the standard fine-tuning recommended by Devlin et al. (2019). In discriminative fine-tuning, one uses a different learning rate for each layer. Let r_L be the learning rate for the last (L-th) layer and η be the decay factor. Then the learning rate for the (L − n)-th layer is given by r_{L−n} = r_L η^n. In our experiments, we use r_L ∈ {1e−4, 5e−5} and η ∈ {1/2, 2/3} depending on the dataset. The standard Adam optimizer is used throughout, and the mini-batch size is fixed at 32. We always fine-tune for 5 epochs or 10,000 iterations, whichever is longer.
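The layer-wise schedule r_{L−n} = r_L · η^n can be computed as follows (a small helper of ours, not from the released code):

```python
def discriminative_lrs(r_last, eta, num_layers):
    """Learning rate per layer, ordered first layer to last:
    r_{L-n} = r_L * eta**n, so later layers get larger rates
    (discriminative fine-tuning, Howard and Ruder, 2018)."""
    return [r_last * eta ** (num_layers - 1 - k) for k in range(num_layers)]
```

For example, with r_L = 1e−4 and η = 1/2, three consecutive layers would get rates 2.5e−5, 5e−5 and 1e−4.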

Main results
In this section we present the MCRF results on 8 sequence labeling datasets. The baseline models are the following: • BERT-tagger: The final hidden representation of each token is fed into a classification layer over the label set, without a CRF. This is the approach recommended by Devlin et al. (2019).
• BERT-CRF: BERT followed by a CRF layer, as is described in Section 3.1.
We use the following strategies to handle the illegal segments (see Table 4 for an example): • retain: Keep and retag the illegal segments. This strategy agrees with (Sang et al., 2000). • discard: Drop the illegal segments entirely, i.e. retag them as O.

[Table 4: An example illustrating the difference between the "retain" strategy and the "discard" strategy.]
We distinguish two versions of MCRF: • MCRF-decoding: A naive version of MCRF that does masking only in decoding. The training process is the same as that in conventional CRF.
• MCRF-training: The proper MCRF approach proposed in this work. The masking is maintained in the training, as is described in Section 3.3. We also refer to it as the MCRF for simplicity.
For each dataset and each model, we ran the training 10 times with different random initializations, selecting in each run the checkpoint that performed best on the dev set. We report the best and the average test F1-scores as the final results. If a dataset does not provide an official development set, we randomly split the training set and use 10% of the samples as the dev set.

Results on Chinese NER
The results on the Chinese NER tasks are presented in Table 3. It can be seen that the MCRF-training approach significantly outperforms all baseline models and establishes new state-of-the-art results on the Resume and Weibo datasets. From these results we can assert that the improvement brought by MCRF is mainly due to the effect of masking in training, not in decoding. Besides, we notice that the "discard" strategy substantially outperforms the "retain" strategy, which agrees with the statistics presented in Table 1.
We also plot in Fig. 3 the loss curves of CRF and MCRF on the development set of MSRA. It can be clearly seen that MCRF incurs a much lower loss during training. This confirms our hypothesis that the CRF model attributes non-zero probability mass to the ensemble of the illegal paths, as otherwise the denominators in (4) and in (2) would be equal, and in that case the loss curves of CRF and MCRF would converge to the same level. Note that some of the results listed in Table 3 are based on models that utilize additional resources. Zhang and Yang (2018) and Ma et al. (2020) utilized Chinese lexicon features to enrich the token representations. Other work combined Chinese glyph information with BERT pre-training. In contrast, the proposed MCRF approach is simple yet performant. It achieves comparable or better results without relying on additional resources.

Results on Slot Filling
One of the main features of the ATIS and SNIPS datasets is the large number of slot labels (79 and 39 respectively) combined with relatively small training sets (4.5k and 13k utterances respectively). This requires the sequence labeling model to learn the transition rules in a sample-efficient manner. Both ATIS and SNIPS provide an intent label for each utterance, but in our experiments we did not use this information and relied solely on the slot labels.
The results are reported in Table 5. It can be seen that MCRF-training outperforms the baseline models and achieves competitive results compared to previous published results.

Results on Chunking
The results on the CoNLL2000 chunking task are reported in Table 6. The proposed MCRF-training outperforms the CRF baseline by 0.4 in F1-score.

Ablation Studies
In this section, we investigate the influence of various factors that may impact the performance of MCRF. In particular, we are interested in the MCRF gain, denoted by ∆ and defined simply as the difference between the F1-score of MCRF-training and that of the conventional CRF (with either the "retain" or the "discard" strategy).

Effect of Tagging Scheme
In the previous experiments we have always used the BIO scheme. It is of interest to explore the performance of MCRF under other tagging schemes such as BIOES. The BIOES scheme is considered more expressive than BIO as it introduces more labels and more transition restrictions.
We have re-run the experiments of Section 4.1.1 using the BIOES scheme. Our results are reported in Fig. 4.

Effect of Sample Size
One may hypothesize that the occurrence of illegal paths is due to the scarcity of training data, i.e. a model should be less prone to illegal paths if the training dataset is larger. To test this hypothesis, we randomly sample 10% of the training data from MSRA and OntoNotes, creating a smaller version of each dataset. We compare the proportion of illegal segments produced by BERT-CRF trained on the original dataset with that of the model trained on the smaller dataset. We also report the performance gain brought by MCRF in these two scenarios. Our findings are summarized in Table 8.

Effect of Encoder Architecture
So far we have experimented with BERT-based models. We now explore the effect of the encoder architecture. We trained a number of models on CoNLL2003 with varying encoder architectures.
The key components are listed as follows: • ELMo: a pretrained language model⁴ that serves as a sequence encoder.
• CNN: CNN-based character embedding layer, with weights extracted from pretrained ELMo. It is used to generate word embeddings for arbitrary input tokens.
The results of our experiments are given in Table 9. We observe that the encoder architecture has a large impact on the occurrence of illegal paths: the BERT-based models appear to generate many more illegal paths than the ELMo-based ones. This is probably due to the fact that transformer encoders are not sequential in nature. A further study is needed to investigate this phenomenon, but it is beyond the scope of the current work. We also notice that the MCRF gain seems to be positively correlated with the proportion of illegal paths generated by the underlying model. This is expected, since the transition-blocking mechanism of MCRF will (almost) never take effect if the most probable path estimated by the underlying CRF model is already legal.

[Table 9: Ablation over the encoder models. The column named "err." indicates the proportion of erroneous predictions that are due to illegal segments.]

Related Work
Some models solve sequence labeling tasks without relying on a BIO/BIOES-type tagging scheme to distinguish the boundary and the type of the text segments of interest, and thus do not suffer from the illegal path problem. For instance, Semi-Markov CRF (Sarawagi and Cohen, 2005) uses an additional loop to search over segment spans, directly yielding a sequence of segments along with their types. The downside of Semi-Markov CRF is that it incurs a higher time complexity than the conventional CRF approach. Recently, Li et al. (2020b) proposed a Machine Reading Comprehension (MRC) framework to solve NER tasks. Their model uses two separate binary classifiers to predict whether each token is the start or the end of an entity, with an additional module to determine which start and end tokens should be matched. We notice that the CRF implemented in PyTorch-Struct (Rush, 2020) has a different interface from the usual CRF libraries, in that it takes not two tensors for emission and transition scores, but a single score tensor of shape (batch size, sentence length, number of tags, number of tags). This allows one to incorporate even more prior knowledge into the structured prediction by setting a constraint mask as a function not only of a pair of tags, but also of the words to which the tags are assigned. Such a feature may be exploited in future work.
Finally, we acknowledge that the naive version of MCRF that performs constrained decoding has already been implemented in AllenNLP⁵. As shown in Section 4.1, this approach is suboptimal compared to the proposed MCRF-training method.

Conclusion
Our major contribution is the proposal of MCRF, a constrained variant of CRF that masks illegal transitions during CRF training, eliminating illegal outcomes in a principled way.
We have justified MCRF from a theoretical perspective, and shown empirically in a number of datasets that MCRF consistently outperforms the conventional CRF. As MCRF is easy to implement and incurs zero additional overhead, we advocate always using MCRF instead of CRF when applicable. and for all (i, j) ∈ Ω lim c→−∞ ∇ a ij L(W 0 ,Ā 0 (c)) = ∇ a ij L (W 0 , A 0 ). (14) Proof. First, we recall that

A Appendices
and the masked transition matrixĀ(c) = ā ij (c) is defined as where Ω is the set of illegal transitions. SinceĀ(c) differs from A only on entries corresponding to illegal transitions and a legal path contains only legal transitions, it follows from (15) that ∀p ∈ P/I s(p, x, W 0 ,Ā 0 (c)) = s(p, x, W 0 , A 0 ). (17) Thus N (W 0 ,Ā 0 (c)) = N (W 0 , A 0 ). By (10) (11) and (17), it suffices to demonstrate for any illegal path p ∈ I lim c→−∞ exp s(p, x, W 0 ,Ā 0 (c)) = 0. (20) To achieve this, we rewrite s(p, x, W 0 ,Ā 0 (c)) as a product of three terms: Now that the terms in the parenthesis do not depend on c and e cE(p) vanishes as c → −∞, we achieve (20). Then (12) of Lemma 2 is proved. Now we turn to the proof of (13). By elementary calculus we have By (18) and (19), it remains to show By the same argument as in the proof of (17) and (20), it is easily seen that for p ∈ P/I ∇ W exp s(p, x, W, A) Thus (21) is achieved and (13) follows.
Finally, the proof of (14) is similar to that of (13).
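The limit (12) can also be checked numerically on a toy instance by brute-force path enumeration (our own sanity check, not part of the paper):

```python
import math
from itertools import product

def score(logits, trans, path):
    """s(p, x) as in (15): emissions plus transitions along the path."""
    s = logits[0][path[0]]
    for i in range(1, len(path)):
        s += logits[i][path[i]] + trans[path[i - 1]][path[i]]
    return s

def nll(logits, trans, gold, paths):
    """-log p(gold | x), with the partition computed over `paths` by enumeration."""
    z = math.log(sum(math.exp(score(logits, trans, p)) for p in paths))
    return z - score(logits, trans, gold)

# toy instance: 2 tags, the transition 0 -> 1 is declared illegal
logits = [[0.5, 0.2], [0.1, 0.9]]
trans = [[0.3, 0.4], [0.2, 0.1]]
omega = {(0, 1)}
T, d = len(logits), len(logits[0])
all_paths = list(product(range(d), repeat=T))
legal = [p for p in all_paths
         if all((p[i], p[i + 1]) not in omega for i in range(T - 1))]
masked = [[-30.0 if (i, j) in omega else trans[i][j] for j in range(d)]
          for i in range(d)]
gold = (0, 0)

loss_restricted = nll(logits, trans, gold, legal)   # restricted loss (4)
loss_masked = nll(logits, masked, gold, all_paths)  # full loss (2) with mask
assert abs(loss_restricted - loss_masked) < 1e-9    # they agree in the limit
```

Making the mask more negative shrinks the gap further, consistent with (12).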