AIN: Fast and Accurate Sequence Labeling with Approximate Inference Network

The linear-chain Conditional Random Field (CRF) model is one of the most widely used neural sequence labeling approaches. Exact probabilistic inference algorithms, such as the forward-backward and Viterbi algorithms, are typically applied in the training and prediction stages of the CRF model. However, these algorithms require sequential computation, which makes parallelization impossible. In this paper, we propose to employ a parallelizable approximate variational inference algorithm for the CRF model. Based on this algorithm, we design an approximate inference network that can be connected with the encoder of the neural CRF model to form an end-to-end network, which is amenable to parallelization for faster training and prediction. The empirical results show that our proposed approaches achieve a 12.7-fold speedup in decoding on long sentences while achieving accuracy competitive with the traditional CRF approach.


Introduction
Sequence labeling assigns a label to each token in a sequence. Tasks such as Named Entity Recognition (NER) (Sundheim, 1995), Part-Of-Speech (POS) tagging (DeRose, 1988) and chunking (Tjong Kim Sang and Buchholz, 2000) can all be formulated as sequence labeling tasks. BiLSTM-CRF (Huang et al., 2015; Lample et al., 2016; Ma and Hovy, 2016) is one of the most successful neural sequence labeling architectures. It feeds pretrained (contextual) word representations into a single-layer bi-directional LSTM (BiLSTM) encoder to extract contextual features and then feeds these features into a CRF (Lafferty et al., 2001) decoder layer to produce the final predictions. The CRF layer has a linear-chain structure that models the relations between neighboring labels. In the traditional CRF approach, exact probabilistic inference algorithms, namely the forward-backward and Viterbi algorithms, are applied for training and prediction respectively. In many sequence labeling tasks, the CRF layer leads to better results than the simpler method of predicting each label independently.
In practice, we sometimes require very fast sequence labelers for training (e.g., on huge datasets like WikiAnn (Pan et al., 2017)) and prediction (e.g., for low-latency online serving). The BiLSTM encoder and the CRF layer both contain sequential computation and require O(n) time over n input words even when parallelized on a GPU. Common practices to improve the speed of the encoder are to replace the BiLSTM with a CNN structure (Collobert et al., 2011; Strubell et al., 2017), to distill larger encoders into smaller ones (Tsai et al., 2019; Mukherjee and Awadallah, 2020), or to apply other techniques (Tu and Gimpel, 2018; Yang et al., 2018; Tu and Gimpel, 2019; Cui and Zhang, 2019). The CRF layer, however, is more difficult to replace because of its superior accuracy compared with faster alternatives in many tasks.
In order to achieve sublinear time complexity in the CRF layer, we must parallelize CRF prediction over the tokens. In this paper, we apply Mean-Field Variational Inference (MFVI) to approximately decode the linear-chain CRF. MFVI iteratively passes messages among neighboring labels to update their distributions locally. Unlike the exact probabilistic inference algorithms, MFVI can be parallelized over different positions in the sequence, achieving time complexity that is constant in n with full parallelization. Previous work (Zheng et al., 2015) showed that such an algorithm can be unfolded as an RNN for grid CRF structures. We adapt this idea to the linear-chain CRF structure and unfold the algorithm as an RNN that can be connected with the encoder to form an end-to-end neural network amenable to parallelization for both training and prediction. We call the unfolded RNN an approximate inference network (AIN). In addition to linear-chain CRFs, we also apply AIN to factorized second-order CRF models, which consider relations among more neighboring labels. Our empirical results show that AIN significantly improves the speed and achieves competitive accuracy compared with the traditional CRF approach on 4 tasks with 15 datasets.

Approaches
Given an input sequence with n tokens x = [x_1, x_2, ..., x_n] and a corresponding label sequence y = [y_1, y_2, ..., y_n] with a label set of size L, the conditional probability of y given x specified by a CRF with position-wise factorization is:

P(y|x) = \exp(\sum_{i=1}^{n} \psi(x, y, i)) / \sum_{y' \in Y(x)} \exp(\sum_{i=1}^{n} \psi(x, y', i))   (1)

where Y(x) is the set of all possible label sequences for x and \psi(x, y, i) is a potential function.
In the simplest case, the potential function is just a softmax function that outputs the distribution of each label independently. We call this the MaxEnt approach. In a typical linear-chain CRF, the potential function is decomposed into a unary potential \psi_u and a binary potential \psi_b (called the emission and transition functions respectively):

\psi(x, y, i) = \psi_u(x, y_i) + \psi_b(y_{i-1}, y_i),  \psi_u(x, y_i) = (W^T r_i)^T v_{y_i},  \psi_b(y_{i-1}, y_i) = v_{y_{i-1}}^T U v_{y_i}   (2)

where r_i is the contextual feature of x_i output by the CNN or BiLSTM encoder with dimension d, v_{y_i} is a one-hot vector for label y_i, W is a d × L matrix and U is an L × L matrix containing the transition scores between two labels. The factor graph of a linear-chain CRF is shown at the top of Figure 1. The exact probabilistic inference algorithms (Viterbi and forward-backward) for the CRF layer are significantly slower than the MaxEnt approach: they take O(nL^2) time on CPU and O(n log L) time on GPU, while the MaxEnt decoder takes O(nL) and O(log L) respectively.
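To make the potential functions and the CRF probability concrete, the following NumPy sketch scores a label sequence with emission and transition potentials and normalizes by brute-force enumeration. All sizes and the random parameters are illustrative assumptions; enumeration is only feasible for tiny examples, whereas the paper's CRF uses the forward algorithm for the same quantity.

```python
import itertools
import numpy as np

# Illustrative sizes: n tokens, encoder dimension d, label set size L.
n, d, L = 4, 8, 3
rng = np.random.default_rng(0)

r = rng.standard_normal((n, d))   # contextual features r_i from the encoder
W = rng.standard_normal((d, L))   # emission parameters (d x L)
U = rng.standard_normal((L, L))   # transition scores between label pairs (L x L)

# Unary (emission) potentials: psi_u[i, y] = (W^T r_i)^T v_y with one-hot v_y.
psi_u = r @ W                     # shape (n, L)

def seq_score(y):
    """Total log-potential of a label sequence: emissions plus transitions."""
    s = sum(psi_u[i, y[i]] for i in range(n))
    s += sum(U[y[i - 1], y[i]] for i in range(1, n))
    return s

# Brute-force normalizer over all L^n label sequences (Eq. 1's denominator);
# the forward algorithm computes the same quantity in O(n L^2) time.
Z = sum(np.exp(seq_score(y)) for y in itertools.product(range(L), repeat=n))
probs = [np.exp(seq_score(y)) / Z for y in itertools.product(range(L), repeat=n)]
print(round(sum(probs), 6))  # → 1.0
```

The enumeration also makes the exponential cost of naive inference visible: the loop runs over L^n sequences, which is exactly what dynamic programming (or, in this paper, approximate inference) avoids.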

AIN on Linear-Chain CRF
In order to speed up training and prediction with the CRF layer, we propose the approximate inference network (AIN), a neural network derived from MFVI for approximate decoding in the linear-chain CRF.
MFVI approximates the distribution P(y|x) with a factorized distribution

Q(y|x) = \prod_{i=1}^{n} Q_i(y_i|x)

and iteratively updates each factor with messages from its neighbors:

Q_i^{(k)}(y_i|x) \propto \exp(\psi_u(x, y_i) + s(i-1, i, k-1) + s(i+1, i, k-1))

s(i-1, i, k) = \sum_{y_{i-1}} Q_{i-1}^{(k)}(y_{i-1}|x) \psi_b(y_{i-1}, y_i),  s(i+1, i, k) = \sum_{y_{i+1}} Q_{i+1}^{(k)}(y_{i+1}|x) \psi_b(y_i, y_{i+1})

where s(i, j, k) represents the message from node i to node j at iteration k. Q_i^{(0)}(y_i|x) is set by normalizing the unary potential \psi_u(x, y_i). Upon convergence, the label sequence with the highest approximate probability Q(y|x) can be found by maximizing Q_i(y_i|x) at each position i:

\hat{y}_i = \arg\max_{y_i} Q_i^{(M)}(y_i|x)

Similar to Zheng et al. (2015), we unfold the MFVI algorithm as a recurrent neural network that is parameterized by the linear-chain CRF potential functions. We fix the number of iterations to M and call the resulting network AIN. AIN can be connected with the encoding network that computes the potential functions, and together they form an end-to-end neural network. However, different from previous work (Krähenbühl and Koltun, 2011; Zheng et al., 2015), which uses the MFVI algorithm on intractable grid-structured probabilistic models to obtain better accuracy, we employ the MFVI algorithm to accelerate tractable inference in sequence-structured probabilistic models.
The time complexity of each iteration of the MFVI algorithm is O(nL^2), which is on par with the time complexity of the exact probabilistic inference algorithms. However, in each iteration, the update of each distribution Q_i(y_i|x) depends only on its two neighboring distributions from the previous iteration, so each iteration can be parallelized over positions. A comparison between the Viterbi algorithm and the MFVI algorithm is shown in Figure 2. The time complexity of our AIN decoder with full GPU parallelization is O(M log L), while the time complexity of the exact probabilistic inference algorithms with GPU parallelization is O(n log L). We set the value of M to 3, which is much smaller than the typical sequence length n.
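The parallel structure of the updates can be sketched in a few lines of NumPy (function names are our own; the paper's implementation is in PyTorch). Each iteration updates every position at once from the previous iteration's neighbor distributions, so it reduces to batched matrix products:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mfvi_decode(psi_u, U, M=3):
    """Approximate linear-chain CRF decoding by M mean-field iterations.

    psi_u: (n, L) unary log-potentials; U: (L, L) transition log-potentials.
    Each iteration is parallel over all n positions: the messages from the
    left and right neighbors are two matrix products over the whole sequence.
    """
    Q = softmax(psi_u)                # Q^0: normalized unary potentials
    n, L = psi_u.shape
    for _ in range(M):
        msg = np.zeros((n, L))
        msg[1:] += Q[:-1] @ U         # expected message from left neighbor i-1
        msg[:-1] += Q[1:] @ U.T       # expected message from right neighbor i+1
        Q = softmax(psi_u + msg)      # update every position simultaneously
    return Q.argmax(axis=-1), Q       # position-wise argmax decoding

# Toy usage with random potentials (illustrative values only).
rng = np.random.default_rng(0)
psi_u = rng.standard_normal((6, 4))
U = rng.standard_normal((4, 4))
labels, Q = mfvi_decode(psi_u, U, M=3)
print(labels.shape)  # (6,)
```

In contrast, a Viterbi implementation must walk the sequence position by position, because each step's scores depend on the completed previous step.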

AIN on Factorized Second-Order CRF
We can extend AIN to the second-order CRF, which has a ternary potential function over every three consecutive labels. In the second-order CRF, the potential function in Eq. 1 becomes:

\psi(x, y, i) = \psi_u(x, y_i) + \psi_b(y_{i-1}, y_i) + \psi_t(y_{i-2}, y_{i-1}, y_i)

However, the second-order CRF has space and time complexity that is cubic in L. Therefore, we factorize its ternary potential function and reduce the complexity to be quadratic in L:

\psi(x, y, i) = \psi_u(x, y_i) + \psi_b(y_{i-1}, y_i) + \tilde{\psi}_b(y_{i-2}, y_i),  \tilde{\psi}_b(y_{i-2}, y_i) = v_{y_{i-2}}^T \tilde{U} v_{y_i}

where the matrix \tilde{U} has the same shape as U in Eq. 2. The factor graph of our factorized second-order CRF is shown at the bottom of Figure 1. The update formula is similar to that of our first-order approach but with more neighbors:

Q_i^{(k)}(y_i|x) \propto \exp(\psi_u(x, y_i) + s(i-1, i, k-1) + s(i+1, i, k-1) + \tilde{s}(i-2, i, k-1) + \tilde{s}(i+2, i, k-1))

where \tilde{s}(i, j, k) has a similar definition to s(i, j, k) with \psi_b replaced by \tilde{\psi}_b. The time complexity of this approach is also O(nL^2) per iteration and O(M log L) with full GPU parallelization for M iterations. Following the first-order approach, we also unfold the MFVI updates of this approach as an AIN.
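The skip messages change little in code. A NumPy sketch under the same illustrative assumptions as before, where `U2` stands for the matrix written as U-tilde in the text:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mfvi_decode_f2o(psi_u, U, U2, M=3):
    """Mean-field iterations for the factorized second-order CRF.

    U scores adjacent label pairs (y_{i-1}, y_i); U2 scores the skip pair
    (y_{i-2}, y_i) via the factorized ternary potential. Updates remain
    fully parallel over positions: the skip messages are just two more
    shifted matrix products per iteration.
    """
    n, L = psi_u.shape
    Q = softmax(psi_u)
    for _ in range(M):
        msg = np.zeros((n, L))
        msg[1:] += Q[:-1] @ U         # from i-1
        msg[:-1] += Q[1:] @ U.T       # from i+1
        if n > 2:
            msg[2:] += Q[:-2] @ U2    # from i-2 via the factorized potential
            msg[:-2] += Q[2:] @ U2.T  # from i+2
        Q = softmax(psi_u + msg)
    return Q.argmax(axis=-1)

# Toy usage with random potentials (illustrative values only).
rng = np.random.default_rng(1)
psi_u = rng.standard_normal((6, 4))
U = rng.standard_normal((4, 4))
U2 = rng.standard_normal((4, 4))
labels = mfvi_decode_f2o(psi_u, U, U2)
print(labels.shape)  # (6,)
```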

Learning
Given a sequence x with corresponding gold labels y^* = [y_1^*, y_2^*, ..., y_n^*], the learning objective of our approaches is the negative log-likelihood of the gold labels under the approximate marginals:

L = -\sum_{i=1}^{n} \log Q_i^{(M)}(y_i^*|x)

Since AINs are end-to-end neural networks, the objective function can be optimized by any gradient-based method in an end-to-end manner.
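Assuming the objective is the token-level negative log-likelihood under the approximate marginals Q_i (the natural reading of training an AIN end to end), a minimal sketch is below; the function name and the small example values are our own:

```python
import numpy as np

def ain_loss(Q, gold):
    """Token-level negative log-likelihood under the approximate marginals.

    Q: (n, L) distributions Q_i(y_i|x) from the final AIN iteration;
    gold: (n,) gold label indices. Because the encoder and the unfolded
    MFVI iterations are all differentiable, gradients of this loss flow
    through the whole network.
    """
    n = len(gold)
    return -np.log(Q[np.arange(n), gold] + 1e-12).mean()

# Tiny example: two tokens, three labels.
Q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
gold = np.array([0, 1])
print(round(ain_loss(Q, gold), 4))  # → 0.2899
```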

Settings
Datasets We evaluate our approaches on four tasks: NER, POS tagging, chunking and slot filling. For NER, we use the corpora from the CoNLL 2002 and CoNLL 2003 shared tasks (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003). For POS tagging, we use universal POS tag annotations for 8 languages from the Universal Dependencies (UD) (Nivre et al., 2018) dataset.
For chunking, we use the corpora from the CoNLL 2003 shared task.We use the Air Travel Information System (ATIS) (Hemphill et al., 1990) dataset for slot filling.
Encoder In our experiments, we use three types of encoders. The first is a BiLSTM fed with word and character embeddings, which captures contextual information globally. The second is a single-layer CNN with only word embeddings as input, which captures contextual information locally. The third is a single linear layer with word embeddings as input, which does not capture any contextual information. These settings let us examine how the decoders perform on each task when the encoders capture different levels of contextual information.
Decoder We use the MaxEnt approach, the traditional CRF approach, and AINs with the first-order and factorized second-order CRFs for decoding. We denote these approaches by MaxEnt, CRF, AIN-1O and AIN-F2O respectively. We set the number of iterations M to 3 in AINs because we find that more iterations do not further improve accuracy.

Results
Speed We report the relative speed improvements over the CRF model based on our PyTorch (Paszke et al., 2019) implementation run on a GPU server with an Nvidia Tesla V100. Following Tsai et al. (2019), we report the training and prediction speed with 10,000 sentences of 32 and 128 words, respectively. The results (Table 1) show that AINs are significantly faster than CRF in terms of both the full model speed and the decoder speed. The speed advantage of AINs is especially prominent on long sentences, suggesting their usefulness in tasks like document-level NER.
Accuracy We run each approach on each dataset 5 times and report the average accuracy. Because of the space limit, we report the accuracy averaged over all the datasets for each task in Table 2; please refer to the supplementary material for the complete results. AINs achieve overall accuracy competitive with CRF while taking significantly less time. With the BiLSTM encoder, which captures global contextual information, AINs achieve almost the same average accuracy as CRF, demonstrating that approximate inference with local contextual information is competitive with globally exact decoding. With the CNN encoder, which encodes local contextual information, AINs are inferior to CRF because both the CNN layer and our approaches utilize only local information. Without any contextual encoder (Word Only), the accuracy of the decoders varies significantly over tasks. For NER and chunking, CRF is the strongest, but our approaches only marginally underperform CRF while significantly outperforming MaxEnt. For POS tagging and slot filling, our approaches outperform CRF, which implies that local information may be more beneficial for these tasks. Comparing AIN-1O and AIN-F2O, AIN-F2O is stronger when the encoder is weak, but the gap shrinks and eventually disappears as the encoder gets stronger.

Conclusion
In this paper, we propose approximate inference networks (AIN), which use Mean-Field Variational Inference (MFVI) instead of exact probabilistic inference algorithms such as the forward-backward and Viterbi algorithms for training and prediction on the conditional random field for sequence labeling. The MFVI algorithm can be unfolded as a recurrent neural network and connected with the encoder to form an end-to-end neural network. AINs can be parallelized over different positions in the sequence. Empirical results show that AINs are significantly faster than the traditional CRF approach and do very well in tasks that rely on more local information. Our approaches achieve competitive accuracy on 4 tasks with 15 datasets over three encoder types.

A Appendix
A.1 Datasets
Named Entity Recognition (NER) We use the corpora from the CoNLL 2002 and CoNLL 2003 shared tasks (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003), which contain four languages in total. We use the standard training/development/test split for experiments.
Chunking The chunking datasets are also from the CoNLL 2003 shared task (Tjong Kim Sang and De Meulder, 2003), which contains English and German datasets. We use the same standard split as in NER.
Part-Of-Speech (POS) Tagging Universal Dependencies (UD) (Nivre et al., 2018) contains syntactically annotated corpora of over 70 languages. We use universal POS tag annotations of 8 languages for experiments; the list of treebanks is shown in Table 3. We use the standard training/development/test split for experiments.
Slot Filling Slot filling is a task that interprets user commands by extracting relevant slots, which can be formulated as a sequence labeling task.We use the Air Travel Information System (ATIS) (Hemphill et al., 1990) dataset for the task and use the same dataset split as this repository.

A.2 Settings
Embeddings For word embeddings in the NER, chunking and slot filling experiments, we use the same word embeddings as Lample et al. (2016), except that we use fastText (Bojanowski et al., 2017) embeddings for Dutch, which we find significantly improves the accuracy (by more than 5 F1 points on CoNLL NER). We use fastText embeddings for all UD tagging experiments. For character embeddings, we use a single-layer character CNN with a hidden size of 50, because Yang et al. (2018) empirically showed that it has competitive performance with a character LSTM. We concatenate the word embedding and the character CNN output to form the final word representation.
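A sketch of how the character CNN output and word embedding might be combined; apart from the character-CNN hidden size of 50 stated above, all dimensions and parameters here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
word_dim, char_hidden, kernel = 100, 50, 3   # word_dim and kernel are assumed

def char_cnn(char_embs):
    """Single-layer character CNN followed by max-over-time pooling."""
    T, c = char_embs.shape
    Wc = rng.standard_normal((kernel * c, char_hidden))   # conv filter bank
    # Unfold character windows of width `kernel`, convolve, ReLU, max-pool.
    windows = np.stack([char_embs[t:t + kernel].ravel()
                        for t in range(T - kernel + 1)])
    return np.maximum(windows @ Wc, 0).max(axis=0)        # (char_hidden,)

char_embs = rng.standard_normal((7, 30))   # a 7-character word, char-emb dim 30
word_emb = rng.standard_normal(word_dim)
word_repr = np.concatenate([word_emb, char_cnn(char_embs)])
print(word_repr.shape)  # → (150,)
```

The concatenated vector is what the BiLSTM (or CNN) encoder consumes for each token.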
Hyper-parameters For the hyper-parameters, we follow the settings of previous work (Akbik et al., 2018). We use Stochastic Gradient Descent for optimization with a fixed learning rate of 0.1 and a batch size of 32. We fix the hidden sizes of the CNN and BiLSTM layers to 512 and 256 respectively, and the kernel size of the CNN to 3. We anneal the learning rate by 0.5 if there is no improvement on the development set for 10 epochs during training. For the maximum number of iterations M, we tried values from 1 to 5 and compared the accuracy on the English NER dataset over 5 runs for each value of M.
Evaluation We use the F1 score to evaluate the NER, slot filling and chunking tasks and use accuracy to evaluate the POS tagging task. We convert the BIO format into the BIOES format for the NER, slot filling and chunking datasets and use the official release of the CoNLL evaluation script to compute the F1 score.
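For illustration, a minimal sketch of the BIO-to-BIOES conversion (our own helper, not the authors' script; it assumes well-formed BIO input):

```python
def bio_to_bioes(tags):
    """Re-encode a BIO tag sequence as BIOES.

    Single-token entities become S-, entity-final tokens become E-;
    B- and I- are kept for entities that continue to the next token.
    """
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt.startswith("I-") and nxt[2:] == label
        if prefix == "B":
            out.append(("B-" if continues else "S-") + label)
        else:  # prefix == "I"
            out.append(("I-" if continues else "E-") + label)
    return out

print(bio_to_bioes(["B-PER", "I-PER", "O", "B-LOC"]))
# → ['B-PER', 'E-PER', 'O', 'S-LOC']
```

BIOES gives the decoder explicit entity-boundary labels, which is why it is the standard choice for span-based F1 evaluation.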

A.3 Detailed Results
The detailed results for the four tasks are shown in Tables 4 and 5. We use ISO 639-1 codes to represent each language for simplicity.

Figure 2 :
Figure 2: Illustration of the computation graphs for Viterbi decoding and one iteration of our MFVI inference on the CRF model. Y_i is the random variable representing the i-th label with three possible values. The illustrated vectors represent the Viterbi scores and the Q_i distributions respectively.

Table 1 :
Relative speedup over the CRF model with 10,000 sentences of 32/128 words. All represents the speed of the full model; Dec. represents the speed of the decoder. Marked entries are listed for reference.

Table 2 :
Averaged F1 score and accuracy on the four tasks. SF represents the slot filling task. Marked entries are listed for reference.

Table 3 :
The list of treebanks used in UD POS tagging.

Table 5 :
Averaged F1 scores on NER, chunking and slot filling for each language. SF represents the slot filling task. Marked entries are listed for reference.