Uncertainty-Aware Label Refinement for Sequence Labeling

Conditional random field (CRF) label decoding has become ubiquitous in sequence labeling tasks. However, the CRF captures only local label dependencies and relies on inefficient Viterbi decoding. In this work, we introduce a novel two-stage label decoding framework that models long-term label dependencies while being far more computationally efficient. A base model first predicts draft labels; a novel two-stream self-attention model then refines these draft predictions based on long-range label dependencies, decoding all positions in parallel for faster prediction. In addition, to mitigate the side effects of incorrect draft labels, Bayesian neural networks are used to indicate the labels with a high probability of being wrong, which greatly helps prevent error propagation. Experimental results on three sequence labeling benchmarks demonstrate that the proposed method not only outperforms CRF-based methods but also greatly accelerates inference.


Introduction
Linguistic sequence labeling is one of the fundamental tasks in natural language processing. Its goal is to predict a linguistic label for each word, as in part-of-speech (POS) tagging, text chunking, and named entity recognition (NER). Benefiting from representation learning, neural network-based approaches can achieve state-of-the-art performance without massive handcrafted feature engineering (Ma and Hovy, 2016; Lample et al., 2016; Strubell et al., 2017; Peters et al., 2018; Devlin et al., 2019).
Figure 1: Schematic of the label refinement framework (Cui and Zhang, 2019). The goal is to refine the label of "Arab" using contextual labels and words, while the refinement of other, correct labels may be negatively impacted by incorrect draft labels.

Although the use of representation learning to obtain better text representations has been very successful, creating better models of label dependencies has always been a focus of sequence labeling tasks (Collobert et al., 2011; Ye and Ling, 2018; Zhang et al., 2018). Among them, the CRF layer integrated with neural encoders to capture label transition patterns (Zhou and Xu, 2015; Ma and Hovy, 2016) has become ubiquitous. However, the CRF only captures neighboring label dependencies and must rely on inefficient Viterbi decoding. Many recent methods introduce label embeddings to manage longer ranges of dependencies, such as two-stage label refinement (Krishnan and Manning, 2006; Cui and Zhang, 2019) and seq2seq (Vaswani et al., 2016; Zhang et al., 2018) frameworks. In particular, Cui and Zhang (2019) introduced a hierarchically-refined representation of marginal label distributions, which predicts a sequence of draft labels in advance and then uses word-label interactions to refine them.
Although these methods can model longer label dependencies, they are vulnerable to error propagation: if a label is mistakenly predicted during inference, the error is propagated and the other labels conditioned on it are impacted (Bengio et al., 2015). As shown in Figure 1, the label attention network (LAN) (Cui and Zhang, 2019) can negatively impact correct predictions in the refinement stage: 39 correct tokens were incorrectly modified (Table 1). Hence, the model should selectively correct only the labels with high probabilities of being incorrect, not all of them. Fortunately, we find that uncertainty values estimated by Bayesian neural networks (Kendall and Gal, 2017) can effectively indicate the labels with a high probability of being incorrect. As shown in Table 1, the average uncertainty value of incorrect predictions on the draft labels is 29 times larger than that of correct predictions. Hence, we can set an uncertainty threshold to refine only the potentially incorrect labels and prevent side effects on the correct ones.
In this work, we propose a novel two-stage Uncertainty-Aware label refinement Network (UANet). At the first stage, Bayesian neural networks take a sentence as input and yield all of the draft labels together with corresponding uncertainties. At the second stage, a two-stream self-attention model performs attention over label embeddings to explicitly model label dependencies, and over context vectors to model context representations. All of these features are fused to refine the potentially incorrect draft labels. These refinement operations are processed in parallel, avoiding the Viterbi decoding of the CRF for faster prediction. Experimental results on three sequence labeling benchmarks demonstrate that the proposed method not only outperforms CRF-based methods but also significantly accelerates inference.
The main contributions of this paper can be summarized as follows: 1) we propose the use of Bayesian neural networks to estimate the uncertainty of predictions and indicate the potentially incorrect labels that should be refined; 2) we propose a novel two-stream self-attention refining framework to better model different ranges of label dependencies and word-label interactions; 3) the proposed parallel decoding process can greatly speed up the inference process; and 4) the experimental results across three sequence labeling datasets indicate that the proposed method outperforms the other label decoding methods.
2 Related Work and Background

Sequence Labeling
Traditional sequence labeling models use statistical approaches such as Hidden Markov Models (HMM) and Conditional Random Fields (CRF) (Passos et al., 2014; Cuong et al., 2014; Luo et al., 2015) with handcrafted features and task-specific resources. With advances in deep learning, neural models achieve competitive performance without massive handcrafted feature engineering (Chiu and Nichols, 2016; Santos and Zadrozny, 2014). In recent years, modeling label dependencies has been the other focus of sequence labeling tasks, for example by integrating a CRF layer with neural encoders to capture label transition patterns (Zhou and Xu, 2015; Ma and Hovy, 2016), or by introducing label embeddings to manage longer ranges of dependencies (Vaswani et al., 2016; Zhang et al., 2018; Cui and Zhang, 2019). Our work extends the label embedding methods: it applies label dependencies and word-label interactions to refine only the labels with high probabilities of being incorrect. The probability of making a mistake is estimated using Bayesian neural networks, described in the next subsection.

Bayesian Neural Networks
The predictive probabilities obtained from the softmax output are often erroneously interpreted as model confidence. However, a model can be uncertain in its predictions even with a high softmax output (Gal and Ghahramani, 2016a). Gal and Ghahramani (2016a) show that simply using predictive probabilities to estimate uncertainty yields extrapolations with unjustifiably high confidence for points far from the training data. They verified that modeling a distribution over the parameters through Bayesian neural networks can effectively reflect the uncertainty, and that Bernoulli Dropout is exactly one example of a regularization technique corresponding to an approximate variational distribution. Typical examples of using the Bernoulli distribution to estimate uncertainty are the Bayesian CNN (Gal and Ghahramani, 2015) and the variational RNN (Gal and Ghahramani, 2016b).
Given a dataset D with training inputs X = {x_1, ..., x_n} and corresponding outputs Y = {y_1, ..., y_n}, Bayesian inference looks for the posterior distribution of the parameters given the dataset, p(W|D). This makes it possible to predict an output for a new input point x* by marginalizing over all possible parameters:

p(y* | x*, D) = ∫ p(y* | x*, W) p(W|D) dW.  (1)

Bayesian inference is intractable for many models because of their complex nonlinear structures and the high dimension of the model parameters. Recent advances in variational inference have introduced new techniques into the field. Among these, Monte Carlo Dropout (Gal and Ghahramani, 2016a) requires minimal modification to the original model. The variational inference approach finds an approximation q*_θ(W) to the true posterior p(W|D), parameterized by a different set of weights θ, by minimizing the Kullback-Leibler (KL) divergence between the two distributions. The integral can then be approximated as

p(y* | x*, D) ≈ ∫ p(y* | x*, W) q*_θ(W) dW ≈ (1/M) Σ_{j=1}^{M} p(y* | x*, W_j),  W_j ∼ q*_θ(W).  (2)

In contrast to non-Bayesian networks, Dropout remains activated at test time. As a result, model uncertainty can be approximately evaluated by summarizing the variance of the model outputs over multiple forward passes.
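To make the Monte Carlo approximation above concrete, the following is a minimal NumPy sketch of MC Dropout inference, not a production implementation: a toy two-layer classifier keeps dropout active at prediction time and averages the class distributions from M stochastic forward passes. All dimensions and weight names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate, rng):
    """Bernoulli dropout mask, kept active at test time (MC Dropout)."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mc_predict(x, W1, W2, rate=0.25, M=8, rng=rng):
    """Monte Carlo approximation of Eq. 2: average the class
    distributions from M forward passes, sampling a fresh dropout
    mask (i.e., a fresh W_j ~ q*_theta(W)) each pass."""
    probs = [softmax(dropout(np.tanh(x @ W1), rate, rng) @ W2)
             for _ in range(M)]
    return np.mean(probs, axis=0)

# Toy input and weights (illustrative sizes).
x = rng.normal(size=(1, 16))
W1 = rng.normal(size=(16, 32))
W2 = rng.normal(size=(32, 5))
p = mc_predict(x, W1, W2)
assert np.allclose(p.sum(axis=-1), 1.0)
```

Averaging over passes is what distinguishes this from a single deterministic forward pass; the spread across the M individual distributions is what later yields an uncertainty estimate.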
3 Uncertainty-Aware Label Refinement

In this work, we propose a novel sequence labeling framework that incorporates Bayesian neural networks to estimate the epistemic uncertainty of the draft labels. The uncertain labels that have a high probability of being wrong are refined by a two-stream self-attention model using long-term label dependencies and word-label interactions. The proposed model is shown in Figure 2.

Variational LSTM for Uncertainty Estimation
Long short-term memory (LSTM) stands at the forefront of many recent developments in sequence labeling tasks. To facilitate comparison with LSTM-based models, variational LSTMs (Gal and Ghahramani, 2016b), a special case of Bayesian neural networks, are used to encode sentences and identify the labels with a high probability of being wrong. The uncertainty estimation methods can also be easily applied to other sequence labeling models, such as CNNs and Transformers.
Word Representation. Following Santos and Zadrozny (2014) and Lample et al. (2016), we use character information to enhance the word representation. Given a word sequence S = {w_1, w_2, ..., w_n}, the product of the one-hot encoded vector with an embedding matrix gives a word embedding: w_i = e_w(w_i), where e_w denotes a word embedding lookup table. Each word is made up of a sequence of characters c_1, c_2, ..., c_l. We adopt CNNs for character encoding, and x^c_i denotes the output of the character-level encoding. A word is then represented by concatenating its word embedding and its character-level encoding: x_i = [w_i; x^c_i]. All the word representations make up an embedding matrix E ∈ R^{V×D}, where D is the embedding dimensionality of x and V is the number of words in the vocabulary.
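The word representation described above can be sketched as follows. This is an illustrative NumPy mock-up under assumed toy dimensions, not the paper's implementation: a lookup into a word embedding table, a width-3 character convolution with max-over-time pooling, and concatenation of the two vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D_w = 100, 50        # vocab size, word-embedding dim (toy values)
C, D_c, F = 30, 16, 20  # char vocab, char-embedding dim, conv filters

E_word = rng.normal(size=(V, D_w))
E_char = rng.normal(size=(C, D_c))
conv_W = rng.normal(size=(3, D_c, F))  # width-3 character convolution

def char_cnn(char_ids):
    """Character-level encoding x^c_i: width-3 convolution over the
    character embeddings, then max-over-time pooling."""
    emb = E_char[char_ids]                # (len, D_c)
    emb = np.pad(emb, ((1, 1), (0, 0)))   # pad so output length == input length
    windows = np.stack([emb[i:i + 3] for i in range(len(char_ids))])
    feats = np.einsum('lwd,wdf->lf', windows, conv_W)  # (len, F)
    return feats.max(axis=0)              # (F,)

def word_repr(word_id, char_ids):
    """x_i = [w_i ; x^c_i]: word embedding concatenated with the
    character-level encoding."""
    return np.concatenate([E_word[word_id], char_cnn(char_ids)])

x = word_repr(7, np.array([1, 4, 9]))
assert x.shape == (D_w + F,)
```

The design choice mirrors Santos and Zadrozny (2014): the character CNN contributes sub-word features (prefixes, suffixes, capitalization patterns) that the word embedding alone cannot capture.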
Variational LSTM. Common practice applies Dropout only to the inputs and outputs of the LSTM. In contrast, the variational LSTM additionally applies Dropout to the recurrent connections, repeating the same mask at each time step. Hence, the variational LSTM can model the uncertainty more accurately.
As shown in Figure 2, we use the same Dropout vectors z_x and z_h on the four gates "input modulation", "input", "forget", and "output":

g_t = φ(W_g · [x_t ∘ z_x; h_{t-1} ∘ z_h])
i_t = σ(W_i · [x_t ∘ z_x; h_{t-1} ∘ z_h])
f_t = σ(W_f · [x_t ∘ z_x; h_{t-1} ∘ z_h])
o_t = σ(W_o · [x_t ∘ z_x; h_{t-1} ∘ z_h])
c_t = f_t ∘ c_{t-1} + i_t ∘ g_t
h_t = o_t ∘ φ(c_t)  (3)

where φ denotes the tanh function, σ is the sigmoid function, and ∘ and · represent the Hadamard product and the matrix product, respectively. We write W_t generically for the gate weight matrices, with t one of {g, i, f, o}. Then θ = {E, W_t} and the Dropout rate r are the parameters of the variational LSTM.
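The defining trick of the variational LSTM, sampling the dropout masks z_x and z_h once per sequence and reusing them at every time step and on all four gates, can be sketched as follows. This is a minimal NumPy illustration with assumed toy shapes, not the paper's (NCRF++-based) implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def variational_lstm(xs, W, U, b, rate, rng):
    """One stochastic forward pass of a variational LSTM.
    The dropout masks z_x and z_h are sampled ONCE per sequence and
    reused at every time step, on the inputs to all four gates."""
    d_in, d_h = W.shape[0], U.shape[0]
    z_x = (rng.random(d_in) >= rate) / (1.0 - rate)  # tied input mask
    z_h = (rng.random(d_h) >= rate) / (1.0 - rate)   # tied recurrent mask
    h, c, hs = np.zeros(d_h), np.zeros(d_h), []
    for x_t in xs:
        z = (x_t * z_x) @ W + (h * z_h) @ U + b      # pre-activations (4*d_h)
        g = np.tanh(z[:d_h])                         # input modulation
        i = sigmoid(z[d_h:2 * d_h])                  # input gate
        f = sigmoid(z[2 * d_h:3 * d_h])              # forget gate
        o = sigmoid(z[3 * d_h:])                     # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(0)
d_in, d_h, T = 6, 4, 5
W = rng.normal(size=(d_in, 4 * d_h))
U = rng.normal(size=(d_h, 4 * d_h))
b = np.zeros(4 * d_h)
xs = rng.normal(size=(T, d_in))
hs = variational_lstm(xs, W, U, b, rate=0.25, rng=rng)
assert hs.shape == (T, d_h)
```

Calling the function repeatedly corresponds to drawing repeated weight samples W_j ∼ q*_θ(W); each call uses a fresh pair of tied masks.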
Draft Labels and Uncertainty Estimation. Assuming training is complete and we have the optimized approximate posterior q*_θ(W) (the optimization is described in § 3.3), at inference time we predict an output for a new input by Monte Carlo integration of Eq. 2:

p_i = p(y_i | x, D) ≈ (1/M) Σ_{j=1}^{M} p(y_i | x, W_j),  (4)

with M sampled masked model weights W_j ∼ q*_θ(W), where q*_θ(W) is the Dropout distribution. To make the model with multiple samples run at the same speed as a standard LSTM, we repeat the same input M times to form a batch and run it in parallel on the GPU. Hence, the M samples are computed concurrently in the forward passes, resulting in a constant running time identical to that of standard Dropout (Gal and Ghahramani, 2016a), as verified in Table 6.
As in classic sequence labeling models, the model applies y*_i = argmax(p_i) to obtain the draft label. The uncertainty of the probability vector p_i is then summarized by its entropy:

u_i = H(p_i) = - Σ_c p_i^{(c)} log p_i^{(c)}.  (5)

In this way, we obtain the draft labels Y* = {y*_1, y*_2, ..., y*_n} coupled with the corresponding epistemic uncertainties U = {u_1, u_2, ..., u_n} for each input sentence. We find that when the epistemic uncertainty u_i is larger than some threshold value Γ, the draft label y*_i has a high probability of being wrong. Hence, we utilize a novel two-stream self-attention model to refine these uncertain labels using long-term label dependencies and word-label interactions.
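The draft-label and entropy computation above reduces to a few lines. A sketch over simulated per-pass class distributions follows; the Dirichlet sampling merely stands in for the M stochastic forward passes of the variational LSTM.

```python
import numpy as np

def draft_and_uncertainty(prob_samples):
    """prob_samples: (M, n, C) -- class distributions from M stochastic
    forward passes for a sentence of n tokens.
    Returns draft labels y*_i = argmax(p_i) and entropies u_i = H(p_i)."""
    p = prob_samples.mean(axis=0)                 # (n, C): averaged p_i
    draft = p.argmax(axis=-1)                     # y*_i per token
    u = -(p * np.log(p + 1e-12)).sum(axis=-1)     # entropy per token
    return draft, u

M, n, C = 8, 4, 3
rng = np.random.default_rng(1)
samples = rng.dirichlet(np.ones(C), size=(M, n))  # stand-in for MC passes
draft, u = draft_and_uncertainty(samples)
assert draft.shape == (n,) and u.shape == (n,)
```

Tokens on which the M passes disagree end up with an averaged p_i closer to uniform, and hence a higher entropy u_i; these are exactly the tokens later selected for refinement.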

Two-Stream Self-Attention for Label Refinement
Given the draft labels and corresponding epistemic uncertainties, we use label dependencies and word-label interactions to refine the uncertain labels. In order to refine the draft labels in parallel, we use the Transformer (Vaswani et al., 2017) with relative position encoding (Dai et al., 2019) to model the words and draft labels.
In the standard Transformer, the attention score with absolute position encoding between query q_i and key vector k_j can be decomposed as

A^abs_{i,j} = E^T_{x_i} W^T_q W_k E_{x_j} + E^T_{x_i} W^T_q W_k U_j + U^T_i W^T_q W_k E_{x_j} + U^T_i W^T_q W_k U_j,  (6)

where U ∈ R^{L_max×d} provides a set of positional encodings; the ith row U_i corresponds to the ith absolute position, and L_max prescribes the maximum length to be modeled. The relative position between labels is very important for modeling label dependencies. Inspired by Dai et al. (2019), we modify Eq. 6 using relative position encoding to model words and the corresponding labels simultaneously, but with a different derivation, arriving at a new form of two-stream relative positional encodings: we provide not only word-to-word interactions but also the corresponding word-to-label interactions. The attention scores are reparameterized as

A^{x2x}_{i,j} = E^T_{x_i} W^T_q W_{k,E} E_{x_j} + E^T_{x_i} W^T_q W_{k,R} R_{i-j} + u^T W_{k,E} E_{x_j} + v^T W_{k,R} R_{i-j},
A^{x2l}_{i,m} = E^T_{x_i} W^T_q W_{k,E} E_{y*_m} + E^T_{x_i} W^T_q W_{k,R} R_{i-m} + u^T W_{k,E} E_{y*_m} + v^T W_{k,R} R_{i-m},  (7)

where A^{x2x}_{i,j} and A^{x2l}_{i,m} denote the attention from the ith word (x_i) to the jth word (x_j) and from the ith word (x_i) to the mth label (y*_m), respectively. R_{i-j} is the encoding of the relative distance between positions i and j, and R is a sinusoid matrix as in Dai et al. (2019). ϕ = {W, u, v} are learnable parameters.
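The two streams share one scoring function and differ only in which embedding plays the key: a word embedding for A^{x2x} or a draft-label embedding for A^{x2l}. A minimal NumPy sketch of a single Transformer-XL-style relative attention score follows; matrix names and dimensions are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_q, W_kE, W_kR = (rng.normal(size=(d, d)) for _ in range(3))
u, v = rng.normal(size=d), rng.normal(size=d)  # global content/position biases

def rel_pos(k, d):
    """Sinusoidal encoding R_k of a (signed) relative distance k."""
    inv = 1.0 / 10000 ** (np.arange(0, d, 2) / d)
    return np.concatenate([np.sin(k * inv), np.cos(k * inv)])

def rel_attn_score(e_i, e_key, i, j):
    """One relative-position attention score: content-content +
    content-position + global-content + global-position terms."""
    q = W_q @ e_i
    k = W_kE @ e_key
    r = W_kR @ rel_pos(i - j, d)
    return q @ k + q @ r + u @ k + v @ r

E_x = rng.normal(size=(5, d))  # word embeddings
E_l = rng.normal(size=(5, d))  # draft-label embeddings
a_x2x = rel_attn_score(E_x[2], E_x[4], 2, 4)  # word-to-word stream
a_x2l = rel_attn_score(E_x[2], E_l[4], 2, 4)  # word-to-label stream
```

Because the score depends on the distance i − j rather than absolute positions, the same learned dependency pattern (e.g., "E-PER often follows B-PER at distance 1") transfers to any position in the sentence.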
Equipping the transformer with our proposed relative positional encoding, we finally arrive at the two-stream self-attention architecture.We summarize the computational procedure for one layer with a single attention head here: (8)

Training and Decoding
There are two networks to be optimized: the variational LSTM for draft labels and uncertainty estimation, and the two-stream self-attention model for label refinement. Our training goal is to minimize the total loss over the two models:

L = L_vlstm + L_refine.  (9)

The variational LSTM performs approximate variational inference. We use a simple Bernoulli (Dropout) distribution q*_θ(W) in a tractable family and minimize the KL divergence to the true model posterior p(W|D). The minimization objective is given by (Jordan et al., 1999):

L_vlstm ≈ -(1/N) Σ_{i=1}^{N} log p(y_i | x_i, W_j) + ((1 - r)/(2N)) ||θ||^2,  W_j ∼ q*_θ(W),  (10)

where N is the number of data points and r is the Dropout probability used to sample W_j ∼ q*_θ(W). For the two-stream self-attention model, we use the concatenation of H^x and H^l for the final prediction, ŷ_i = f(H^x, H^l | E_x, E_{y*_m}), and optimize the model with the cross-entropy loss

L_refine = - Σ_{i=1}^{n} y_i · log ŷ_i,  (11)

where y_i is the one-hot vector of the label corresponding to w_i. When training is complete, we obtain the draft labels Y* = {y*_1, y*_2, ..., y*_n} and corresponding uncertainties U = {u_1, u_2, ..., u_n} from the variational LSTM, and refined labels Ŷ = {ŷ_1, ŷ_2, ..., ŷ_n} from the two-stream self-attention model. To avoid correct labels being incorrectly modified, we set an uncertainty threshold Γ to decide which labels should be used: we use the refined label when u_i > Γ and the draft label otherwise (for example, given u_1 > Γ, u_2 ≤ Γ, and u_n > Γ, the decoded labels become {ŷ_1, y*_2, ..., ŷ_n}).
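The final decoding rule, keep the draft label unless its uncertainty exceeds Γ, is a one-line merge. A sketch with hypothetical example values (labels and uncertainties invented for illustration; 0.35 is the threshold value used in Figure 2):

```python
def merge_by_uncertainty(draft, refined, uncertainty, threshold=0.35):
    """Final decoding: use the refined label y-hat_i only where the
    draft's epistemic uncertainty u_i exceeds the threshold Gamma;
    otherwise keep the draft label y*_i."""
    return [r if u > threshold else d
            for d, r, u in zip(draft, refined, uncertainty)]

# Hypothetical example: only the first token is uncertain enough to refine.
draft   = ["B-PER", "O", "E-LOC"]
refined = ["B-ORG", "O", "E-ORG"]
unc     = [0.90, 0.10, 0.20]
assert merge_by_uncertainty(draft, refined, unc) == ["B-ORG", "O", "E-LOC"]
```

Note how the third token keeps its (possibly correct) draft label even though the refiner disagreed, which is exactly the mechanism that prevents the refiner's errors from overwriting confident draft predictions.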

Experimental Setup
In this section, we describe the datasets across different sequence labeling tasks, including two English NER datasets and one POS tagging dataset. We also detail the baseline models used for comparison. Finally, we clarify the hyperparameter configuration of our uncertainty-aware refinement network.

Datasets
We conduct experiments on three sequence labeling datasets.The statistics are listed in Table 2.
CoNLL2003. The CoNLL2003 shared task dataset (Tjong Kim Sang and De Meulder, 2003) for named entity recognition is collected from the Reuters Corpus. The dataset divides named entities into four types: persons, locations, organizations, and miscellaneous entities. We use the BIOES tag scheme instead of standard BIO2, the same as Ma and Hovy (2016).
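For readers unfamiliar with the tag schemes, the standard BIO2-to-BIOES conversion used here can be sketched as follows (a generic implementation of the usual convention, not code from the paper):

```python
def bio_to_bioes(tags):
    """Convert BIO2 tags to BIOES: single-token entities become S-,
    and entity-final I- tags become E-."""
    out = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag.startswith("B-"):
            # B- stays B- only if the entity continues; else it is a singleton.
            out.append(tag if nxt == "I-" + tag[2:] else "S-" + tag[2:])
        elif tag.startswith("I-"):
            # I- stays I- only if the entity continues; else it ends here.
            out.append(tag if nxt == "I-" + tag[2:] else "E-" + tag[2:])
        else:
            out.append(tag)
    return out

assert bio_to_bioes(["B-PER", "I-PER", "O", "B-LOC"]) == \
       ["B-PER", "E-PER", "O", "S-LOC"]
```

BIOES makes entity boundaries explicit in the label itself, which tends to sharpen the label-dependency patterns (e.g., E- must follow B- or I-) that the refinement model exploits.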

Compared Methods
In this work, we mainly focus on improving decoding efficiency and enhancing label dependencies. Thus, we compare against classic methods with different decoding layers, such as the Softmax, CRF, and LAN frameworks. We also compare against recent competitive methods, such as the Transformer, IntNet (Xin et al., 2018), and BERT (Devlin et al., 2019).

BiLSTM-Softmax. This baseline uses a bidirectional LSTM to represent a sequence. The BiLSTM concatenates the forward hidden state →h_i and the backward hidden state ←h_i to form an integral representation h_i = [→h_i; ←h_i]. Finally, the sentence representation H = {h_1, ..., h_n} is fed to a softmax layer for prediction.

BiLSTM-CRF. A CRF layer is used on top of the hidden vectors H (Ma and Hovy, 2016). The CRF models bigram interactions between two successive labels (Lample et al., 2016) instead of making independent labeling decisions for each output. At decoding time, the Viterbi algorithm is used to find the highest-scoring label sequence for an input word sequence.

BiLSTM-Seq2seq. To model longer label dependencies, Zhang et al. (2018) predict a sequence of labels as a sequence-to-sequence problem.

BiLSTM-LAN. The label attention network (LAN) (Cui and Zhang, 2019) introduces label embeddings and uses consecutive attention layers over them to refine the draft labels. It achieves state-of-the-art results on several sequence labeling tasks.

Rel-Transformer. This baseline adopts a self-attention mechanism with relative position representations (Vaswani et al., 2017; Dai et al., 2019).

Hyper-parameter Settings
Following Ma and Hovy (2016), we use the same 100-dimensional GloVe embeddings as initialization. We use a 1-layer variational LSTM with a hidden size of 400 to create the draft labels.
The vanilla dropout after the embedding layer and the variational dropout are set to 0.5 and 0.25, respectively. We use 2 layers of multi-head Transformer for WSJ and CoNLL2003 and 3 for the OntoNotes dataset to refine the labels. The number of heads is chosen from {5, 7, 9} and the dimension of each head from {80, 120, 160} via grid search. We use SGD as the optimizer for the variational LSTM and Adam (Kingma and Ba, 2014) for the Transformer. The learning rate for SGD is set to 0.015 on the CoNLL2003 and OntoNotes datasets and 0.2 on the WSJ dataset. The learning rate for Adam is set to 0.0001 for all datasets. The F1 score and accuracy are used as metrics for NER and POS tagging, respectively. All experiments are implemented in NCRF++ (Yang and Zhang, 2018) and conducted on a GeForce GTX 1080Ti with 11 GB of memory. More details are given in our code.

Results and Analysis
In this section, we present the experimental results of the proposed and baseline models. We show that the proposed method not only achieves better performance but also has a significant speed advantage. Since our contribution mainly concerns the label decoding layer, the proposed model can also be combined with the latest pretrained models to further improve performance.

Main Results
Table 3 reports model performance on the CoNLL2003, OntoNotes, and WSJ datasets, showing that the proposed method not only achieves state-of-the-art results on NER but is also effective on other sequence labeling tasks, such as POS tagging. Previous methods leverage rich handcrafted features (Huang et al., 2015; Chiu and Nichols, 2016), CRF decoding (Strubell et al., 2017), and longer-range label dependencies (Zhang et al., 2018; Cui and Zhang, 2019). Compared with these methods, our UANet model gives better results. Benefiting from its strong capability to model long-term label dependencies, UANet outperforms models with a CRF inference layer by a large margin. Moreover, unlike the seq2seq and LAN models that also leverage label dependencies, our UANet model integrates model uncertainty into the refinement stage to avoid side effects on correct draft labels. As a result, it outperforms the LAN and seq2seq models on all three datasets.

Ablation Study
To study the contribution of each component of BiLSTM-UANet, we conducted ablation experiments on the three datasets and display the results in Table 4. The model's performance is degraded if the draft label information is removed, indicating that label dependencies are useful in the refinement. We also find that both the variational LSTM and the two-stream self-attention play an important role in label refinement. Replacing either component with a CRF layer seriously hurts performance.
We also give our model more complex character representations (IntNet) or use a pretrained model (BERT) to replace the GloVe embeddings. We fine-tune BERT for each task. The results are shown in Table 5. We find that the contributions of our model and of more complex word representations appear to be orthogonal: whether or not UANet uses IntNet or BERT, our method yields similar improvements, owing to its better modeling of label dependencies.

Efficiency Advantage
Table 6 shows a comparison of inference speeds. BiLSTM-UANet processes 1,630, 1,262, and 1,192 sentences per second on the CoNLL2003, OntoNotes, and WSJ development data, respectively, outperforming BiLSTM-CRF by 13.7%, 32.8%, and 48.8%. For datasets with a longer average sentence length, the speed advantage is larger. Because the model calculates uncertainties by sampling the same input multiple times in parallel, the inference time of BiLSTM-UANet (M = 8) increases only slightly.
To further investigate the influence of sentence length, we analyze the inference speed of UANet and the CRF on the CoNLL2003 development set, split into five parts according to sentence length. We ruled out the influence of the text encoder and counted only the label decoding time. The left subfigure of Figure 3 shows the decoding speed for the different sentence lengths. The results reveal that as sentence length increases, the speed of UANet remains relatively stable, while the speed of the CRF decreases substantially. Owing to UANet's parallelism, when processing sentences longer than 30 tokens, UANet is nearly 3 times faster than the CRF. In addition, the right subfigure shows the F1 scores for the different sentence lengths. It is worth noting that UANet outperforms the CRF by a large margin when the sentence length is greater than 15, verifying UANet's advantage in capturing long-term label dependencies.

Discussion
Uncertainty Threshold. To investigate the influence of the uncertainty threshold Γ, we evaluate the performance under different thresholds on the three datasets, as shown in Figure 4. Γ = 0 means the model uses all of the refined labels as the final predictions. As the threshold grows, the performance of UANet improves by reducing the negative effects on correct draft labels. However, when Γ is too large, the model mainly uses the draft labels as final predictions, resulting in performance degradation. This verifies our motivation that a reasonable uncertainty threshold can avoid side effects on correct draft labels.

Number of Samples. We also investigate the influence of the number of samples in the variational LSTM, as shown in Figure 4. The results meet our expectation that more samples lead to better performance, because a larger number of samples better approximates the posterior p(W|D).
Case Study. Table 7 shows two cases from the CoNLL2003 NER dataset. The first reflects the necessity of modeling higher-order dependencies in NER: UANet can learn the label consistency of the two phrases around the word "and". Moreover, the seq2seq decoding model (Zhang et al., 2018) refines labels in a left-to-right manner and cannot refine the earlier labels in this case. The second case shows the effectiveness of the uncertainty threshold in mitigating the side effects of incorrect refinement. Here, the refinement model is misled by the incorrect label of "Yangon" (E-LOC) when predicting the word "University". Since the uncertainty value of "University" is lower than the threshold, our model keeps the correct result.

Conclusions
In this work, we introduced a novel sequence labeling framework that incorporates Bayesian neural networks to estimate model uncertainty. We find that model uncertainty can effectively indicate the labels with a high probability of being wrong. The proposed method selectively refines the uncertain labels to avoid side effects of the refinement on correct labels. In addition, the proposed model captures different ranges of label dependencies and word-label interactions in parallel, avoiding the Viterbi decoding of the CRF for faster prediction. Experimental results across three sequence labeling datasets demonstrate that the proposed method significantly outperforms previous methods.

Figure 2 :
Figure 2: Graphical illustration of the architecture and inference process of the proposed UANet. The variational LSTM outputs draft labels and model uncertainties simultaneously. The refinement only applies to draft labels with uncertainty greater than the threshold (0.35).

Figure 4 :
Figure 4: F1 variation under different uncertainty thresholds and numbers of samples in the variational LSTM, respectively. The results are evaluated on the development sets. ∆F1 represents the F1 score at each step minus the initial result.

Table 1 :
Results of LAN with uncertainty estimation evaluated on the CoNLL2003 test set. The check mark refers to correct predictions, and the cross mark refers to incorrect predictions.

Table 2 :
Statistics of the CoNLL2003, OntoNotes, and WSJ datasets, where # represents the number of tokens. The class numbers of the NER datasets are counted under the BIOES tag scheme.

Table 3 :
Main results on the three sequence labeling datasets. * indicates results obtained by running Cui and Zhang (2019)'s released code.

Table 5 :
Results on the CoNLL2003 test set. We implement BERT for the NER task without document-level information. The original BERT result in (Devlin et al., 2019) was not achieved with the current version of the library; see the discussion in (Stanislawek et al., 2019) and the reported results in (Zhang et al., 2019).

Table 4 .
The results show that the model's performance is degraded if the draft

Table 6 :
Comparison of inference speed. M represents the number of samples. We show how many sentences each model can process per second.

Table 7 :
NER case analysis. Contents in bold red and italic blue represent incorrect and correct entities, respectively. Draft labels with uncertainty greater than 0.35 are refined.