An Investigation of Potential Function Designs for Neural CRF

The neural linear-chain CRF model is one of the most widely-used approaches to sequence labeling. In this paper, we investigate a series of increasingly expressive potential functions for neural CRF models, which not only integrate the emission and transition functions, but also explicitly take the representations of the contextual words as input. Our extensive experiments show that the decomposed quadrilinear potential function based on the vector representations of two neighboring labels and two neighboring words consistently achieves the best performance.


Introduction
Sequence labeling is the task of assigning a label to each token of a sequence. It is an important task in natural language processing with many applications such as Part-of-Speech (POS) tagging (DeRose, 1988; Toutanova et al., 2003; Xin et al., 2018), Named Entity Recognition (NER) (Ritter et al., 2011; Akbik et al., 2019), and Chunking (Tjong Kim Sang and Buchholz, 2000; Suzuki et al., 2006). The neural CRF model is one of the most widely-used approaches to sequence labeling and can achieve superior performance on many tasks (Collobert et al., 2011; Chen et al., 2015; Ling et al., 2015; Ma and Hovy, 2016; Lample et al., 2016a). It often employs an encoder such as a BiLSTM to compute the contextual vector representation of each word in the input sequence. The potential function at each position of the input sequence in a neural CRF is typically decomposed into an emission function (of the current label and the vector representation of the current word) and a transition function (of the previous and current labels) (Liu et al., 2018; Yang et al., 2018).

* Kewei Tu and Yong Jiang are the corresponding authors.

In this paper, we design a series of increasingly expressive potential functions for neural CRF models. First, we compute the transition function from label embeddings (Ma et al., 2016; Nam et al., 2016; Cui and Zhang, 2019) instead of label identities. Second, we use a single potential function over the current word and the previous and current labels, instead of decomposing it into the emission and transition functions, leading to more expressiveness. We also employ tensor decomposition in order to keep the potential function tractable. Third, we take the representations of additional neighboring words as input to the potential function, instead of solely relying on the BiLSTM to capture contextual information.
To empirically evaluate different approaches, we conduct experiments on four well-known sequence labeling tasks: NER, Chunking, and coarse- and fine-grained POS tagging. We find that it is beneficial for the potential function to take representations of neighboring words as input, and a quadrilinear potential function with a decomposed tensor parameter leads to the best overall performance.
Our work is related to Reimers and Gurevych (2017); Yang et al. (2018), which also compared different network architectures and configurations and conducted empirical analyses on different sequence labeling tasks. However, our focus is on the potential function design of neural CRF models, which has not been sufficiently studied before.

Figure 2: Factor graphs of different models. The solid circles and hollow circles indicate random variables of word encodings and labels respectively. The black squares represent factors.

Models
Our overall neural network architecture for sequence labeling is shown in Figure 1. It contains three parts: a word representation layer, a bi-directional LSTM (BiLSTM) encoder, and an inference layer. The BiLSTM encoder produces a sequence of output vectors h_1, h_2, ..., h_M ∈ R^{D_h}, which are utilized by the inference layer to predict the label sequence. The inference layer typically defines a potential function s(x, y, i) for each position i of the input sequence x and label sequence y and computes the conditional probability of the label sequence given the input sequence as follows:

p(y | x) = exp( Σ_{i=1}^{M} s(x, y, i) ) / Σ_{y'} exp( Σ_{i=1}^{M} s(x, y', i) )

where M is the length of the sequence. The simplest inference layer assumes independence between labels. It applies a linear transformation to h_i followed by a Softmax function to predict the distribution of label y_i at each position i (Figure 2(a)). In many scenarios, however, it makes sense to model the dependency between neighboring labels, which leads to linear-chain CRF models.
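The conditional probability above can be computed exactly with the forward algorithm. The following is a minimal numpy sketch (not the authors' code; the array layout and the dummy start label are assumptions for illustration): `pot[i, a, b]` holds the potential s(x, y, i) when y_{i-1} = a and y_i = b, and position 0 is read with a dummy "start" label fixed to index 0.

```python
import numpy as np

def log_partition(pot):
    """Forward algorithm: log of the sum of exp-scores over all label sequences."""
    M, L, _ = pot.shape
    alpha = pot[0, 0]  # scores of ending in each label at position 0
    for i in range(1, M):
        # log-sum-exp over the previous label, for every current label
        alpha = np.logaddexp.reduce(alpha[:, None] + pot[i], axis=0)
    return np.logaddexp.reduce(alpha)

def log_prob(pot, y):
    """log p(y | x) for one label sequence y (a list of label indices)."""
    score = pot[0, 0, y[0]]
    score += sum(pot[i, y[i - 1], y[i]] for i in range(1, len(y)))
    return score - log_partition(pot)
```

By construction, exponentiating `log_prob` and summing over all L^M label sequences yields exactly 1.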
Vanilla CRF In most previous work on neural CRFs, the potential function is decomposed into an emission function and a transition function (Figure 2(b)), and the transition function is represented by a table φ maintaining the transition scores between labels:

s(x, y, i) = h_i^T W_h v_{y_i} + φ_{y_{i-1}, y_i}

where v_{y_i} is a one-hot vector for label y_i and W_h ∈ R^{D_h × D_t} is a weight matrix.
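Since v_{y_i} is one-hot, the emission term simply selects one column of W_h. A minimal numpy sketch of this potential (illustrative names and shapes, not the paper's code):

```python
import numpy as np

def vanilla_potential(h_i, W_h, phi, y_prev, y_cur):
    """Vanilla CRF potential: emission score plus tabular transition score.

    h_i:  (D_h,) word representation at position i
    W_h:  (D_h, L) emission weight matrix (L = number of labels)
    phi:  (L, L) transition score table
    """
    emission = h_i @ W_h  # (L,): the one-hot label vector just picks an entry
    return emission[y_cur] + phi[y_prev, y_cur]
```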
TwoBilinear Instead of one-hot vectors, we may use dense vectors to represent labels, which has the benefit of encoding similarities between labels. Accordingly, the emission and transition functions are modeled by two bilinear functions:

s(x, y, i) = h_i^T W_h t_{y_i} + t_{y_{i-1}}^T W_t t_{y_i}

where W_t ∈ R^{D_t × D_t} is a weight matrix and t_{y_i} ∈ R^{D_t} is the embedding of label y_i. The factor graph remains the same as for vanilla CRF (Figure 2(b)).
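The same score, written as a numpy sketch with dense label embeddings looked up from an assumed table (names are illustrative):

```python
import numpy as np

def twobilinear_potential(h_i, label_emb, W_h, W_t, y_prev, y_cur):
    """Sum of two bilinear scores over dense label embeddings.

    label_emb: (L, D_t) table of label embeddings t_y
    W_h:       (D_h, D_t) emission weights; W_t: (D_t, D_t) transition weights
    """
    t_prev, t_cur = label_emb[y_prev], label_emb[y_cur]
    return h_i @ W_h @ t_cur + t_prev @ W_t @ t_cur
```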
ThreeBilinear Figure 2(c) depicts the structure of ThreeBilinear. Compared with TwoBilinear, ThreeBilinear has an extra bilinear emission function between the current word representation and the previous label:

s(x, y, i) = h_i^T W_h t_{y_i} + h_i^T W_h' t_{y_{i-1}} + t_{y_{i-1}}^T W_t t_{y_i}

where W_h' ∈ R^{D_h × D_t} is an additional weight matrix.
Trilinear Instead of three bilinear functions, we may use a trilinear function to model the correlation between h_i, t_{y_{i-1}}, and t_{y_i}. It has strictly more representational power than the sum of three bilinear functions:

s(x, y, i) = Σ_{a=1}^{D_h} Σ_{b=1}^{D_t} Σ_{c=1}^{D_t} U_{abc} (h_i)_a (t_{y_{i-1}})_b (t_{y_i})_c

where U ∈ R^{D_h × D_t × D_t} is an order-3 weight tensor. Figure 2(d) presents the structure of Trilinear.
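The triple sum is a full contraction of the order-3 tensor with the three input vectors, which is a one-line `einsum` (a sketch with illustrative names, not the paper's code):

```python
import numpy as np

def trilinear_potential(U, h_i, t_prev, t_cur):
    """Contract the order-3 tensor U (D_h, D_t, D_t) with all three inputs."""
    return np.einsum('abc,a,b,c->', U, h_i, t_prev, t_cur)
```

Note that both storing U and evaluating this contraction cost O(D_h · D_t · D_t), which motivates the decomposition below.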

D-Trilinear
Despite the increased representational power of Trilinear, its space and time complexity becomes cubic. To reduce the computational complexity without compromising the representational power too much, we assume that U has rank D_r and can be decomposed into the product of three matrices U_{t1}, U_{t2} ∈ R^{D_t × D_r} and U_h ∈ R^{D_h × D_r}:

U_{abc} = Σ_{r=1}^{D_r} (U_h)_{ar} (U_{t1})_{br} (U_{t2})_{cr}

Then the trilinear function can be rewritten as

s(x, y, i) = 1^T ( (U_h^T h_i) ∘ (U_{t1}^T t_{y_{i-1}}) ∘ (U_{t2}^T t_{y_i}) )

where ∘ denotes element-wise product and 1 is the all-ones vector in R^{D_r}. We call the resulting model D-Trilinear. The factor graph of D-Trilinear is the same as that of Trilinear (Figure 2(d)).
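In code, the decomposed score is just three matrix-vector products, an element-wise product, and a sum, costing O(D_r · (D_h + D_t)) instead of cubic (a numpy sketch under the decomposition above; names are illustrative):

```python
import numpy as np

def d_trilinear_potential(U_h, U_t1, U_t2, h_i, t_prev, t_cur):
    """Rank-D_r decomposed trilinear score.

    U_h: (D_h, D_r); U_t1, U_t2: (D_t, D_r).
    Equivalent to contracting the full tensor
    U = sum_r U_h[:, r] (outer) U_t1[:, r] (outer) U_t2[:, r].
    """
    return np.sum((h_i @ U_h) * (t_prev @ U_t1) * (t_cur @ U_t2))
```

The test below rebuilds the full tensor from the factors and checks that both routes give the same score.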

D-Quadrilinear
We may take the representation of the previous word as an additional input and use a quadrilinear function as the potential function:

s(x, y, i) = Σ_{a,b,c,d} U_{abcd} (h_{i-1})_a (h_i)_b (t_{y_{i-1}})_c (t_{y_i})_d

where U ∈ R^{D_h × D_h × D_t × D_t} is an order-4 weight tensor. However, the computational complexity of this function becomes quartic. Hence we again decompose the tensor into the product of four matrices U_{h1}, U_{h2} ∈ R^{D_h × D_r} and U_{t1}, U_{t2} ∈ R^{D_t × D_r} and rewrite the potential function as follows:

s(x, y, i) = 1^T ( (U_{h1}^T h_{i-1}) ∘ (U_{h2}^T h_i) ∘ (U_{t1}^T t_{y_{i-1}}) ∘ (U_{t2}^T t_{y_i}) )
We call the resulting model D-Quadrilinear and its factor graph is shown in Figure 2(e).
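The decomposed quadrilinear score extends the D-Trilinear sketch with one more projected factor for the previous word's representation (a numpy sketch with illustrative names):

```python
import numpy as np

def d_quadrilinear_potential(U_h1, U_h2, U_t1, U_t2,
                             h_prev, h_cur, t_prev, t_cur):
    """Rank-D_r decomposed quadrilinear score.

    U_h1, U_h2: (D_h, D_r); U_t1, U_t2: (D_t, D_r).
    Four projections, an element-wise product, and a sum over the rank dimension.
    """
    return np.sum((h_prev @ U_h1) * (h_cur @ U_h2)
                  * (t_prev @ U_t1) * (t_cur @ U_t2))
```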

D-Pentalinear
Following the same idea, we extend D-Quadrilinear to D-Pentalinear by taking the representation of the next word as an additional input. Figure 2(f) shows the structure of D-Pentalinear.


Experiments

We conduct our experiments with pretrained word embeddings, character embeddings, and BERT embeddings (Devlin et al., 2019a). For NER and Chunking, we use the BIOES scheme for its better performance than the BIO scheme (Ratinov and Roth, 2009; Dai et al., 2015; Yang et al., 2018). We use F1-score as the evaluation metric for both NER and Chunking. We run each model 5 times with different random seeds for each experiment and report the average score and the standard deviation. More details can be found in the supplementary material.
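The BIOES scheme can be derived mechanically from BIO tags: a one-token entity becomes S-, and the last token of a multi-token entity becomes E-. A small sketch (the function name is ours, not from the paper):

```python
def bio_to_bioes(tags):
    """Convert a BIO tag sequence (e.g. ['B-PER', 'I-PER', 'O']) to BIOES."""
    out = []
    for i, tag in enumerate(tags):
        if tag == 'O':
            out.append(tag)
            continue
        prefix, label = tag.split('-', 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else 'O'
        continues = nxt == 'I-' + label  # does the entity span go on?
        if prefix == 'B':
            out.append(('B-' if continues else 'S-') + label)
        else:  # prefix == 'I'
            out.append(('I-' if continues else 'E-') + label)
    return out
```

For example, `['B-PER', 'I-PER', 'O', 'B-LOC']` becomes `['B-PER', 'E-PER', 'O', 'S-LOC']`.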

Results
We show the detailed results on NER and Chunking with BERT embeddings in Table 1 and the averaged results on all the tasks in Table 2 (the complete results can be found in the supplementary materials). We make the following observations. Firstly, D-Quadrilinear has the best overall performance in all the tasks. Its advantage over D-Trilinear is somewhat surprising because the BiLSTM output h i in D-Trilinear already contains information of both the current word and the previous word. We speculate that: 1) information of the previous word is useful in evaluating the local potential in sequence labeling (as shown by traditional feature-based approaches); and 2) information of the previous word is obfuscated in h i and hence directly inputting h i−1 into the potential function helps. Secondly, D-Quadrilinear greatly outperforms BiLSTM-LAN (Cui and Zhang, 2019), one of the state-of-the-art sequence labeling approaches which employs a hierarchically-refined label attention network. Thirdly, D-Trilinear clearly outperforms both ThreeBilinear and Trilinear. This suggests that tensor decomposition could be a viable way to both regularize multilinear potential functions and reduce their computational complexity.

Analysis
Small training data We train four of our models on randomly selected 10% or 30% of the training data on the NER and Chunking tasks. We run each experiment 5 times. Figure 3 shows the average difference in F1-scores between each model and Vanilla CRF. It can be seen that with small data, the advantages of D-Trilinear and D-Quadrilinear over Vanilla CRF and Softmax become even larger.
Multi-layer BiLSTM As discussed in section 3.1, D-Quadrilinear outperforms D-Trilinear probably because h i , the BiLSTM output at position i, does not contain sufficient information of the previous word. Here we study whether increasing the number of BiLSTM layers would inject more information into h i and hence reduce the performance gap between the two models. Table 3 shows the results on the NER and Chunking tasks with word embeddings. D-Quadrilinear still outperforms D-Trilinear, but by comparing Table 3 with Table 2, we see that their difference indeed becomes smaller with more BiLSTM layers. Another observation is that more BiLSTM layers often lead to lower scores. This is consistent with previous findings (Cui and Zhang, 2019) and is probably caused by overfitting.
Speed We test the training and inference speed of our models. Our decomposed multilinear approaches are only a few percent slower than Vanilla CRF during training and as fast as Vanilla CRF during inference, which suggests their practical usefulness. The details can be found in the supplementary material.

Conclusion
In this paper, we investigate several potential functions for neural CRF models. The proposed potential functions not only integrate the emission and transition functions, but also take into consideration representations of additional neighboring words. Our experiments show that D-Quadrilinear achieves the best overall performance. Our proposed approaches are simple and effective and could facilitate future research in neural sequence labeling.

Acknowledgement
This work was supported by Alibaba Group through Alibaba Innovative Research Program. This work was also supported by the National Natural Science Foundation of China (61976139).

A.2 Word representations
We use three different versions of word representations:

• Word Embedding. We use pretrained word embeddings such as GloVe (Pennington et al., 2014) and FastText (Grave et al., 2018).

• Word Embedding and Character Embedding. We use the same character LSTMs as in Lample et al. (2016b) and set the hidden size accordingly.

• BERT Embedding. We use the respective BERT embedding from Devlin et al. (2019b) for each language. If there is no pretrained BERT embedding for a language, we use multilingual BERT (M-BERT) instead. The word representation is taken from the last four layers of BERT.
We fine-tune the word embeddings and character embeddings during training. We do not fine-tune the BERT embeddings.

A.3 Hyperparameters setting
We tune the following hyperparameters in our experiments.

Rank D_r In D-Trilinear, D-Quadrilinear, and D-Pentalinear, D_r is a hyperparameter that controls the representational power of the multilinear functions. We select its value from {64, 128, 256, 384, 600}.
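The rank also controls parameter count. As a rough illustration (the sizes D_h = 400 and D_t = 100 below are our assumptions, not the paper's exact settings), even the largest candidate rank keeps the factorized D-Trilinear an order of magnitude smaller than the full order-3 tensor:

```python
def dtrilinear_params(D_h, D_t, D_r):
    """Parameters of the three factor matrices U_h, U_t1, U_t2 in D-Trilinear."""
    return D_r * (D_h + 2 * D_t)

D_h, D_t = 400, 100                      # illustrative sizes (assumed)
full = D_h * D_t * D_t                   # full Trilinear tensor U: 4,000,000
for D_r in (64, 128, 256, 384, 600):
    print(D_r, dtrilinear_params(D_h, D_t, D_r))  # at most 360,000 at D_r = 600
```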

LSTM hidden size
Other hyperparameter settings are listed in Table 5.

A.4 Additional Analysis
Multilinear vs. Concatenation Our best-performing models are based on multilinear functions with decomposed parameter tensors. An alternative to multilinear functions is to apply an MLP with nonlinear activations to the concatenated input vectors. We run the comparison on the NER task with word embeddings and tune the tag embedding size from {20, 50, 100, 200} and the hidden size of the MLP from {64, 128, 256, 384}. As shown in Table 6, the two concatenation-based models underperform their decomposed multilinear counterparts, but they do outperform TwoBilinear and ThreeBilinear.
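The concatenation baseline for the order-3 case can be sketched as a one-hidden-layer MLP over the stacked inputs (illustrative names and shapes; not the paper's code):

```python
import numpy as np

def concat_mlp_potential(h_i, t_prev, t_cur, W1, b1, w2):
    """MLP alternative to the multilinear potential.

    W1: (H, D_h + 2 * D_t) first-layer weights; b1: (H,); w2: (H,) output weights.
    Returns a scalar potential score.
    """
    z = np.concatenate([h_i, t_prev, t_cur])  # stack instead of multiplying
    return w2 @ np.tanh(W1 @ z + b1)
```

Unlike the multilinear forms, this score is not multilinear in its inputs, so it does not correspond to contracting any fixed weight tensor.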
Transformer vs. BiLSTM As discussed in section 3.1, information of the previous word may be obfuscated in h i . Transformer-like encoders, which can model long-range context, may alleviate the obfuscation. We use a 6-layer Transformer encoder and compare vanilla CRF and D-Quadrilinear on the NER task with word embeddings. As shown in Table 7, with the Transformer encoder, D-Quadrilinear outperforms the vanilla CRF by 1.31%. In comparison, with the BiLSTM encoder, D-Quadrilinear outperforms the vanilla CRF by 0.63%. So the advantage of our approach over the vanilla CRF becomes even larger when using the Transformer encoder.
Speed We use an Nvidia Titan V GPU to test the training and inference speed of the 8 models on the NER English dataset. Figure 4 shows the training and inference time averaged over 10 epochs. Softmax is much faster than all the other approaches because it does not need to run Forward-Backward and Viterbi and can parallelize the predictions at all the positions of a sequence. Our decomposed multilinear approaches are not significantly slower than Vanilla CRF but generally have better performance, which suggests their practical usefulness.