Hierarchically-Refined Label Attention Network for Sequence Labeling

CRF has been used as a powerful model for statistical sequence labeling. For neural sequence labeling, however, BiLSTM-CRF does not always lead to better results compared with BiLSTM-softmax local classification. This can be because the simple Markov label transition model of CRF does not give much information gain over strong neural encoding. For better representing label sequences, we investigate a hierarchically-refined label attention network, which explicitly leverages label embeddings and captures potential long-term label dependency by giving each word incrementally refined label distributions with hierarchical attention. Results on POS tagging, NER and CCG supertagging show that the proposed model not only improves the overall tagging accuracy with similar number of parameters, but also significantly speeds up the training and testing compared to BiLSTM-CRF.


Introduction
Conditional random fields (CRF) (Lafferty et al., 2001) is a state-of-the-art model for statistical sequence labeling (Toutanova et al., 2003;Peng et al., 2004;Ratinov and Roth, 2009). Recently, CRF has been integrated with neural encoders as an output layer to capture label transition patterns (Zhou and Xu, 2015;Ma and Hovy, 2016). This, however, sees mixed results. For example, previous work (Reimers and Gurevych, 2017; has shown that BiLSTM-softmax gives better accuracies compared to BiLSTM-CRF for part-of-speech (POS) tagging. In addition, the state-of-the-art neural Combinatory Categorial Grammar (CCG) supertaggers do not use CRF Lewis et al., 2016).
One possible reason is that the strong representation power of neural sentence encoders such as BiLSTMs allow models to capture implicit long-range label dependencies from input  Figure 1: Visualization of hierarchically-refined Label Attention Network for POS tagging. The numbers indicate the label probability distribution for each word.
word sequences alone (Kiperwasser and Goldberg, 2016;Dozat and Manning, 2016;Teng and Zhang, 2018), thereby allowing the output layer to make local predictions. In contrast, though explicitly capturing output label dependencies, CRF can be limited by its Markov assumptions, particularly when being used on top of neural encoders. In addition, CRF can be computationally expensive when the number of labels is large, due to the use of Viterbi decoding. One interesting research question is whether there is neural alternative to CRF for neural sequence labeling, which is both faster and more accurate. To this question, we investigate a neural network model for output label sequences. In particular, we represent each possible label using an embedding vector, and aim to encode sequences of label distributions using a recurrent neural network. One main challenge, however, is that the number of possible label sequences is exponential to the size of input. This makes our task essentially to represent a full-exponential search space without making Markov assumptions.
We tackle this challenge using a hierarchicallyrefined representation of marginal label distributions. As shown in Figure 1, our model consists of a multi-layer neural network. In each layer, each input words is represented together with its marginal label probabilities, and a sequence neural network is employed to model unbounded dependencies. The marginal distributions space are refined hierarchically bottom-up, where a higher layer learns a more informed label sequence distribution based on information from a lower layer.
For instance, given a sentence "They 1 can 2 fish 3 and 4 also 5 tomatoes 6 here 7 ", the label distributions of the words can 2 and fish 3 in the first layer of Figure 1 have higher probabilities on the tags MD (modal verb) and VB (base form verb), respectively, though not fully confidently. The initial label distributions are then fed as the inputs to the next layer, so that long-range label dependencies can be considered. In the second layer, the network can learn to assign a noun tag to fish 3 by taking into account the highly confident tagging information of tomatoes 6 (NN), resulting in the pattern "can 2 (VB) fish 3 (NN)". Figure 2, our model consists of stacked attentive BiLSTM layers, each of which takes a sequence of vectors as input and yields a sequence of hidden state vectors together with a sequence of label distributions. The model performs attention over label embeddings (Wang et al., 2015; for deriving a marginal label distributions, which are in turn used to calculate a weighted sum of label embeddings. Finally, the resulting packed label vector is used together with input word vectors as the hidden state vector for the current layer. Thus our model is named label attention network (LAN). For sequence labeling, the input to the whole model is a sentence and the output is the label distributions of each word in the final layer.

As shown in
BiLSTM-LAN can be viewed as a form of multi-layered BiLSTM-softmax sequence labeler. In particular, a single-layer BiLSTM-LAN is identical to a single-layer BiLSTM-softmax model, where the label embedding table serves as the softmax weights in BiLSTM-softmax, and the label attention distribution is the softmax distribution in BiLSTM-softmax. The traditional way of making a multi-layer extention to BiLSTM-softmax is to stack multiple BiLSTM encoder layers before the softmax output layer, which learns a deeper input representation. In contrast, a multi-layer BiLSTM-LAN stacks both the BiLSTM encoder layer and the softmax output layer, learning a deeper representation of both the input and candidate output sequences.
On standard benchmarks for POS tagging, NER and CCG supertagging, our model achieves significantly better accuracies and higher efficiencies than BiLSTM-CRF and BiLSTM-softmax with similar number of parameters. It gives highly competitive results compared with topperformance systems on WSJ, OntoNotes 5.0 and CCGBank without external training. In addition to accuracy and efficiency, BiLSTM-LAN is also more interpretable than BiLSTM-CRF thanks to visualizable label embeddings and label distributions. Our code and models are released at https://github.com/Nealcly/LAN.

Related Work
Neural Attention. Attention has been shown useful in neural machine translation (Bahdanau et al., 2014), sentiment classification (Chen et al., 2017;Liu and Zhang, 2017), relation classification (Zhou et al., 2016), read comprehension (Hermann et al., 2015), sentence summarization (Rush et al., 2015), parsing (Li et al., 2016), question answering (Wang et al., 2018b) and text understanding (Kadlec et al., 2016). Self-attention network (SAN) (Vaswani et al., 2017) has been used for semantic role labeling (Strubell et al., 2018), text classification (Xu et al., 2018;Wu et al., 2018) and other tasks. Our work is similar to Vaswani et al. (2017) in the sense that we also build a hierarchical attentive neural network for sequence representation. The difference lies in that our main goal is to investigate the encoding of exponential label sequences, whereas their work focuses on encoding of a word sequence only.
Label Embeddings. Label embedding was first used in the field of computer vision for facilitating zero-shot learning (Palatucci et al., 2009;Socher et al., 2013;Zhang and Saligrama, 2016). The basic idea is to improve the performance of classifying previously unseen class instances by learning output label knowledge. In NLP, label embeddings have been exploited for better text classification (Tang et al., 2015;Nam et al., 2016;. However, relatively little work has been done investigating label embeddings for sequence labeling. One exception is Vaswani et al. (2016b), who use supertag embeddings in the output layer of a CCG supertagger, through a combination of local classification model rescored using a supertag language model. In contrast, we model deep label interactions using a dynamically refined sequence representation network. To our knowledge, we are the first to investigate a hierarchical attention network over a label space.
Neural CRF. There has been methods that aim to speed up neural CRF (Tu and Gimpel, 2018), and to solve the Markov constraint of neural CRF. In particular,  predicts a sequence of labels as a sequence to sequence problem; Guo et al. (2019) further integrates global input information in encoding. Capturing non-local dependencies between labels, these methods, however, are slower compared with CRF. In contrast to these lines of work, our method is both asymptotically faster and empirically more accurate compared with neural CRF.

Baseline
We implement BiLSTM-CRF and BiLSTMsoftmax baseline models with character-level features (Dos Santos and Zadrozny, 2014;Lample et al., 2016), which consist of a word representation layer, a sequence representation layer and a local softmax or CRF inference layer.

Word Representation Layer
Following Dos Santos and Zadrozny (2014) and Lample et al. (2016), we use character information for POS tagging. Given a word sequence w 1 , w 2 , ..., w n , where c ij denotes the jth character in the ith word, each c ij is represented using where e c denotes a character embedding lookup table.
We adopt BiLSTM for character encoding. x c i denotes the output of character-level encoding. A word is represented by concatenating its word embedding and its character representation: where e w denotes a word embedding lookup table.

Sequence Representation Layer
For sequence encoding, the input is a sentence x = {x 1 , · · ·, x n }. Word representations are fed into a BiLSTM layer, yielding a sequence of for- h w n }, respectively. Finally, the two hidden states are concatenated for a final representation

CRF.
A CRF layer is used on top of the hidden vectors H w . The conditional probabilities of label distribution sequences y = {y 1 , · · ·, y n } is Here y represents an arbitrary label distribution sequence, W l i CRF is a model parameter specific to l i , and b is a bias specific to l i−1 and l i . The first-order Viterbi algorithm is used to find the highest scored label sequence over an input word sequence during decoding.
Softmax. Independent local softmax classification can give competitive result on sequence labeling (Reimers and Gurevych, 2017). For each node, h w i is fed to a softmax layer to find whereŷ i is the predicted label for w i ; W and b are the parameters for softmax layer.

Label Attention Network
The structure of our refined label attention network is shown in Figure 2. We denote our model as BiLSTM-LAN (namely BiLSTM-label attention network) for the remaining of this paper. We use the same input word representations for BiLSTM-softmax, BiLSTM-CRF and BiLSTM-LAN, as described in Section 3.1. Compared with the baseline models, BiLSTM-LAN uses a set of BiLSTM-LAN layers for both encoding and label prediction, instead of a set of traditional sequence encoding layers and an inference layer.

Label Representation.
Given the set of candidates output labels L = {l 1 , · · ·, l |L| }, each label l k is represented using an embedding vector: x l k = e l (l k ) where e l denotes a label embedding lookup table. Label embeddings are randomly initialized and tuned during model training.

BiLSTM-LAN Layer
The model in Figure 2 consists of 2 BiLSTM-LAN layers. As discussed earlier, each BiLSTM-LAN layer is composed of a BiLSTM encoding sublayer and a label-attention inference sublayer. In paticular, the former is the same as the BiLSTM layer in the baseline model, while the latter uses multihead attention (Vaswani et al., 2017) to jointly encode information from the word representation subspace and the label representation subspace. For each head, we apply a dot-product attention with a scaling factor to the inference component, deriving label distributions for BiLSTM hidden states and label embeddings.
BiLSTM Encoding Sublayer. Denote the input to each layer as x = {x 1 , x 2 , ..., x n }. BiL-STM (Section 3.2) is used to calculate H w ∈ R n×d h , where n and d h denote the word sequence length and BiLSTM hidden size (same as the dimension of label embedding), respectively.
Label-Attention Inference Sublayer. For the label-attention inference sublayer, the attention mechanism produces an attention matrix α consisting of a potential label distribution for each word. We define Instead of the standard attention mechanism above, it can be beneficial to use multi-head for capturing multiple possible of potential label distributions in parallel.
k are parameters to be learned during the training, k is the number of parallel heads.
The final representation for each BiLSTM-LAN layer is the concatenation of BiLSTM hidden states and attention outputs : H is then fed to a subsequent BiLSTM-LAN layer as input, if any.
Output. In the last layer, BiLSTM-LAN directly predicts the label of each word based on the attention weights.
whereŷ i j denotes the jth candidate label for the ith word andŷ i is the predicted label for ith word in the sequence.

Training
BiLSTM-LAN can be trained by standard backpropagation using the log-likelihood loss. The training object is to minimize the cross-entropy between y i andŷ i for all labeled gold-standard sentences. For each sentence, where i is the word index, j is the index of labels for each word.

Complexity
For decoding, the asymptotic time complexity is O(|L| 2 n) and O(|L|n) for BiLSTM-CRF and BiLSTM-LAN, respectively, where |L| is the total number of labels and n is the sequence length. Compared with BiLSTM-CRF, BiLSTM-LAN can significantly increase the speed for sequence labeling, especially for CCG supertagging, where the sequence length can be much smaller than the total number of labels.

BiLSTM-LAN and BiLSTM-softmax
As mentioned in the introduction, a singlelayer BiLSTM-LAN is identical to a single-layer BiLSTM-softmax sequence labeling model. In particular, the BiLSTM-softmax model is given by Eq 1. In a BiLSTM-LAN model, we arrange the set of label embeddings into a matrix as follows: A naive attention model over X has: It is thus straightforward to see that the label embedding matrix x l corresponds to the weight matrix W is Eq 1, and the distribution α corresponds to y in Eq 1.

Experiments
We empirically compare BiLSTM-CRF, BiLSTM-softmax and BiLSTM-LAN using different sequence labelling tasks, including English POS tagging, multilingual POS tagging, NER and CCG supertagging.     Evaluation. F-1 score are used for NER. Other tasks are evaluated based on the accuracy. We repeat the same experiment five times with the same hyperparameters and report the max accuracy, average accuracy and standard deviation for POS tagging. For fair comparison, all experiments are implemented in NCRF++  and conducted using a GeForce GTX 1080Ti with 11GB memory.

Development Experiments
We report a set of WSJ development experiments to show our investigation on key configurations of BiLSTM-LAN and BiLSTM-CRF.
Label embedding size. Table 2 shows the effect of the label embedding size. Notable improvement can be achieved when the label embedding size increases from 200 to 400 in our model, but the accuracy does not further increase when the size increases beyond 400. We fix the label embedding size to 400 for our model.
Number of Layers. For BiLSTM-CRF, previous work has shown that one BiLSTM layer is the most effecitve for POS tagging (Ma and Hovy, 2016;. Table 2 compares different numbers of BiLSTM layers and hidden sizes.   As can be seen, a multi-layer model with larger hidden sizes does not give significantly better results compared to a 1-layer model with a hidden size of 400. We thus chose the latter for the final model. For BiLSTM-LAN, each layer learns a more abstract representation of word and label distribution sequences. As shown in Table 2, for POS tagging, it is effective to capture label dependencies using two layers. More layers do not empirically improve the performance. We thus set the final number of layers to 3. The effectiveness of Model Structure. To evaluate the effect of BiLSTM-LAN layers, we conduct ablation experiments as shown in Table 3. In BiLSTM-LAN w/o attention, we remove the attention inference sublayers from BiLSTM-LAN except for the last BiLSTM-LAN layer. This model is reminiscent to BiLSTM-softmax except that the output is based on label embeddings. It gives an accuracy slightly higher than that of LSTM-softmax, which demonstrates the advantage of label embeddings. On the other hand, it significantly underperforms BiLSTM-LAN (p-value<0.01), which shows the advantage of hierarchically-refined label distribution sequence encoding.
Model Size vs CRF.   the model size increases, both BiLSTM-CRF and BiLSTM-LAN see a peak point beyond which further increase of model size does not bring better results, which is consistent with observations from prior work, demonstrating that the number of parameters is not the decisive factor to model accuracy; and (2) the best-performing BiLSTM-LAN model size is comparable to that of the BiLSTM-CRF model size, which indicates that the model structure is more important for the better accuracy of BiLSTM-LAN. Speed vs CRF. Table 4 shows a comparison of training and decoding speeds. BiLSTM-LAN processes 805 and 714 sentences per second on the WSJ and CCGBank development data, respectively, outperforming BiLSTM-CRF by 3% and 19%, respectively. The larger speed improvement on CCGBank shows the benefit of lower asymptotic complexity, as discussed in Section 4.
Training vs CRF. Figure 3 presents the training curves on the WSJ development set. At the beginning, BiLSTM-LAN converges slower than BiLSTM-CRF, which is likely because BiLSTM-LAN has more complex layer structures for label embedding and attention. After around 15 training iterations, the accuracy of BiLSTM-LAN on the development sets becomes increasingly higher than BiLSTM-CRF. This demonstrates the effect of label embeddings, which allows more structured knowledge to be learned in modeling.

Final Results
WSJ. Table 5 shows the final POS tagging results on WSJ. Each experiment is repeated 5 times. BiLSTM-LAN gives significant accuracy improvements over both BiLSTM-CRF and BiLSTM-softmax (p <0.01), which is consistent with observations on development experiments. Table 6  Universal Dependencies(UD) v2.2. We design a multilingual experiment to compare BiLSTMsoftmax, BiLSTM-CRF (strictly following Yasunaga et al. (2018) 1 , which is the state-of-theart on multi-lingual POS tagging) and BiLSTM-LAN. The accuracy and training speeds are shown in Table 7. Our model outperforms all the baselines on all the languages. The improvements are statistically significant for all the languages (p <0.01), suggesting that BiLSTM-LAN is generally effective across languages.
OntoNotes 5.0. In NER, BiLSTM-CRF is widely used, because local dependencies between neighboring labels relatively more important that POS tagging and CCG supertagging. BiLSTM-LAN also significantly outperforms BiLSTM-CRF by 1.17 F1-score (p <0.01). Table 8  CCGBank. In CCG supertagging, the major challenge is a larger set of lexical tags |L| and supertag constraints over long distance dependencies. As shown in Table 9, BiLSTM-LAN significantly outperforms both BiLSTMsoftmax and BiLSTM-CRF (p <0.01), showing the advantage of LAN.  and Vaswani et al. (2016a) explore BiRNN-softmax and BiLSTM-softmax, respectively. Søgaard and Goldberg (2016) present a multi-task learning architecture with BiRNN to improve the performance. Lewis et al. (2016) train BiLSTM-softmax using tri-training. Vaswani et al. (2016b) combine a LSTM language model and BiLSTM over the supertags. Tu and Gimpel (2019) introduce the inference network (Tu and Gimpel, 2018) in CCG supertagging to speed up the training and decoding for BiLSTM-CRF. Compared with these methods, BiLSTM-LAN obtains new state-of-theart results on CCGBank, matching the tri-training performance of Lewis et al. (2016), without training on external data.

Discussion
Visualization. A salient advantage of LAN is more interpretable models. We visualize the label embeddings as well as label attention weights for POS tagging. We use t-SNE to visualize the 45 different English POS tags (WSJ) on a 2D map after 5, 15, 38 training iteration, respectively. Each dot represents a label embedding. As can be seen in Figure 4, label embeddings are increasingly more meaningful during training. Initially, the vectors sit in random locations in the space. After 5 iterations, small clusters emerge, such as   "NNP" and "NNPS", "VBD" and "VBN", "JJS" and "JJR" etc. The clusters grow absorbing more related tags after more training iterations. After 38 training iterations, most similar POS tags are grouped together, such as "VB", "VBD", "VBN", "VBG" and "VBP". More attention visualization are shown in Appendix B. Supercategory Complexity. We also measure the complexity of supercategories by the number of basic categories that they contain. According to this definition, "S", "S/NP" and "(S \NP)/NP" have complexities of 1, 2 and 3, respectively. Figure 5 shows the accuracy of BiLSTM-softmax, BiLSTM-CRF and BiLSTM-LAN against the supertag complexity. As the complexity increases, the performance of all the models decrease, which conforms to intuition. BiLSTM-CRF does not show obvious advantages over BiLSTM-softmax on complex categories. In contrast, BiLSTM-LAN outperforms both models on complex categories, demonstrating its advantage in capturing more sophisticated label dependencies.
Case Study. Some predictions of BiLSTMsoftmax, BiLSTM-CRF and BiLSTM-LAN are shown in Table 10. The sentence contains two prepositional phrases "with ..." and "at ...", thus exemplifies the PP-attachment problem, one of the hardest sub-problems in CCG supertagging. As can be seen, BiLSTM-softmax fails to learn the long-range relation between "settled" and "at". BiLSTM-CRF strictly follows the hard constraint between neighbor categories thanks to Markov label transition. However, it predicts "with" incorrectly as "PP/NP" with the former supertag ending with "/PP". In contrast, BiLSTM-LAN can capture potential long-term dependency and better determine the supertags based on global label information. In this case, our model can effectively represent a full label search space without making Markov assumptions.

Conclusion
We investigate a hierarchically-refined label attention network (LAN) for sequence labeling, which leverages label embeddings and captures potential long-range label dependencies by deep attentional encoding of label distribution sequences.