An Exploration of Arbitrary-Order Sequence Labeling via Energy-Based Inference Networks

Many tasks in natural language processing involve predicting structured outputs, e.g., sequence labeling, semantic role labeling, parsing, and machine translation. Researchers are increasingly applying deep representation learning to these problems, but the structured component of these approaches is usually quite simplistic. In this work, we propose several high-order energy terms to capture complex dependencies among labels in sequence labeling, including several that consider the entire label sequence. We use neural parameterizations for these energy terms, drawing from convolutional, recurrent, and self-attention networks. We use the framework of learning energy-based inference networks (Tu and Gimpel, 2018) to address the difficulties of training and inference with such models. We empirically demonstrate that this approach yields substantial improvements with a variety of high-order energy terms on four sequence labeling tasks, while having the same decoding speed as simple, local classifiers. We also find that high-order energies help in noisy data conditions.


Introduction
Conditional random fields (CRFs; Lafferty et al., 2001) have been shown to perform well in various sequence labeling tasks. Recent work uses rich neural network architectures to define the "unary" potentials, i.e., terms that only consider a single position's label at a time (Collobert et al., 2011; Lample et al., 2016; Ma and Hovy, 2016; Strubell et al., 2018). However, "binary" potentials, which consider pairs of adjacent labels, are usually quite simple and may consist solely of a parameter or parameter vector for each unique label transition.
Models with unary and binary potentials are generally referred to as "first order" models.
A major challenge with CRFs is the complexity of training and inference, which are quadratic in the number of output labels for first order models and grow exponentially when higher order dependencies are considered. This explains why the most common type of CRF used in practice is a first order model, also referred to as a "linear chain" CRF.
One promising alternative to CRFs is structured prediction energy networks (SPENs; Belanger and McCallum, 2016), which use deep neural networks to parameterize arbitrary potential functions for structured prediction. While SPENs also pose challenges for learning and inference, Tu and Gimpel (2018) proposed a way to train SPENs jointly with "inference networks", neural networks trained to approximate structured arg max inference.
In this paper, we leverage the frameworks of SPENs and inference networks to explore high-order energy functions for sequence labeling. Naively instantiating high-order energy terms can lead to a very large number of parameters to learn, so we instead develop concise neural parameterizations for high-order terms. In particular, we draw from vectorized Kronecker products, convolutional networks, recurrent networks, and self-attention. We also consider "skip-chain" connections (Sutton and McCallum, 2004) with various skip distances, along with ways of reducing their total parameter count for increased learnability.
Our experimental results on four sequence labeling tasks show that a range of high-order energy functions can yield performance improvements. While the optimal energy function varies by task, we find strong performance from skip-chain terms with short skip distances, convolutional networks with filters that consider label trigrams, and recurrent networks and self-attention networks that consider large subsequences of labels.
We also demonstrate that modeling high-order dependencies can lead to significant performance improvements in the setting of noisy training and test sets. Visualizations of the high-order energies show various methods capture intuitive structured dependencies among output labels.
Throughout, we use inference networks that share the same architecture as unstructured classifiers for sequence labeling, so test time inference speeds are unchanged between local models and our method. Enlarging the inference network architecture by adding one layer leads consistently to better results, rivaling or improving over a BiLSTM-CRF baseline, suggesting that training efficient inference networks with high-order energy terms can make up for errors arising from approximate inference. While we focus on sequence labeling in this paper, our results show the potential of developing high-order structured models for other NLP tasks in the future.

Structured Energy-Based Learning
We denote the input space by X. For an input x ∈ X, we denote the structured output space by Y(x). The entire space of structured outputs is denoted Y = ∪_{x∈X} Y(x). We define an energy function (LeCun et al., 2006; Belanger and McCallum, 2016) E_Θ, parameterized by Θ, that computes a scalar energy for an input/output pair: E_Θ : X × Y → R. At test time, for a given input x, prediction is done by choosing the output with lowest energy:

ŷ = argmin_{y ∈ Y(x)} E_Θ(x, y)   (1)

Inference Networks
Inference. Solving equation (1) requires combinatorial algorithms because Y is a structured, discrete space. This becomes intractable when E_Θ does not decompose into a sum over small "parts" of y. Belanger and McCallum (2016) relax this problem by allowing the discrete vector y to be continuous. Let Y_R denote the relaxed output space. They solve the relaxed problem by using gradient descent to iteratively minimize the energy with respect to y. Tu and Gimpel (2018) propose an alternative that replaces gradient descent with a neural network trained to do inference, i.e., to mimic the function performed in equation (1). This "inference network" A_Ψ : X → Y_R is parameterized by Ψ and trained with the goal that

A_Ψ(x) ≈ argmin_{y ∈ Y_R(x)} E_Θ(x, y)   (2)

Tu and Gimpel (2019) show that inference networks achieve a better speed/accuracy/search-error trade-off than gradient descent given pretrained energy functions.
Joint training of energy functions and inference networks. Belanger and McCallum (2016) proposed a structured hinge loss for learning the energy function parameters Θ, using gradient descent for the "cost-augmented" inference step required during learning. Tu and Gimpel (2018) replaced the cost-augmented inference step in the structured hinge loss with training of a "cost-augmented inference network" F_Φ(x) trained with the following goal:

F_Φ(x) ≈ argmax_{y ∈ Y_R(x)} (△(y, y*) − E_Θ(x, y))

where △ is a structured cost function that computes the distance between its two arguments and y* is the gold standard output. The new optimization objective becomes:

min_Θ max_Φ Σ_{(x,y)∈D} [△(F_Φ(x), y) − E_Θ(x, F_Φ(x)) + E_Θ(x, y)]_+

where D is the set of training pairs and [h]_+ = max(0, h). Tu and Gimpel (2018) alternately optimized Θ and Φ, which is similar to training in generative adversarial networks (Goodfellow et al., 2014).
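As a concrete sketch (not the authors' released code), the per-example margin-rescaled hinge can be computed as below, using the expected Hamming distance as the structured cost △ and treating the energy function and the cost-augmented prediction as given. The names `hamming_cost` and `structured_hinge` are ours.

```python
import numpy as np

def hamming_cost(y_hat, y_gold):
    # Expected Hamming distance between a relaxed prediction (T x L,
    # rows on the probability simplex) and a one-hot gold matrix.
    return float(np.sum(1.0 - np.sum(y_hat * y_gold, axis=1)))

def structured_hinge(energy_fn, x, y_gold, y_cost_aug):
    # Margin-rescaled hinge for one example:
    #   [cost(F(x), y*) - E(x, F(x)) + E(x, y*)]_+
    h = (hamming_cost(y_cost_aug, y_gold)
         - energy_fn(x, y_cost_aug)
         + energy_fn(x, y_gold))
    return max(0.0, h)
```

When the cost-augmented prediction equals the gold output, the hinge is zero; the loss grows as the cost-augmented network finds outputs that are both far from gold and low in energy.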

An Objective for Joint Learning of Inference Networks
One challenge with the optimization problem above is that it still requires training an inference network A_Ψ for test-time prediction. Tu et al. (2020a) proposed a "compound" objective that avoids this by training two inference networks jointly (with shared parameters), F_Φ for cost-augmented inference and A_Ψ for test-time inference:

min_Θ max_{Φ,Ψ} Σ_{(x,y)∈D} [△(F_Φ(x), y) − E_Θ(x, F_Φ(x)) + E_Θ(x, y)]_+   (margin-rescaled loss)
                + λ [−E_Θ(x, A_Ψ(x)) + E_Θ(x, y)]_+                         (perceptron loss)

As indicated, this loss can be viewed as the sum of the margin-rescaled and perceptron losses. Θ, Φ, and Ψ are alternately optimized. The objective for the energy function parameters Θ is:

min_Θ Σ_{(x,y)∈D} [△(F_Φ(x), y) − E_Θ(x, F_Φ(x)) + E_Θ(x, y)]_+ + λ [−E_Θ(x, A_Ψ(x)) + E_Θ(x, y)]_+

The objective for the other parameters is:

max_{Φ,Ψ} Σ_{(x,y)∈D} [△(F_Φ(x), y) − E_Θ(x, F_Φ(x)) + E_Θ(x, y)]_+ + λ [−E_Θ(x, A_Ψ(x)) + E_Θ(x, y)]_+ − τ (ℓ_token(F_Φ(x), y) + ℓ_token(A_Ψ(x), y))

where ℓ_token is a supervised token-level loss that is added to aid in training the inference networks.
In this paper, we use the standard cross entropy summed over all positions as ℓ_token. Like Tu et al. (2020a), we drop the zero truncation (max(0, ·)) when updating the inference network parameters to improve stability during training, which also lets us remove the terms that do not involve the inference networks. We use two independent networks, but with the same architecture, for the two inference networks.
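The token-level loss ℓ_token (cross entropy summed over positions) can be sketched as follows; `token_loss` is a hypothetical name and the relaxed predictions are assumed to be row-stochastic (each row a distribution over labels).

```python
import numpy as np

def token_loss(y_hat, y_gold, eps=1e-12):
    # Supervised token-level loss: cross entropy between relaxed
    # predictions (T x L) and one-hot gold labels, summed over positions.
    # eps guards against log(0) for labels assigned zero probability.
    return float(-np.sum(y_gold * np.log(y_hat + eps)))
```

For a perfect one-hot prediction the loss is (numerically) zero; a uniform prediction over L labels costs T·log L.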

Energy Functions
Our experiments in this paper consider sequence labeling tasks, so the input x is a length-T sequence of tokens, where x_t denotes the token at position t. The output y is a sequence of labels, also of length T. We use y_t to denote the output label at position t, where y_t is a vector of length L (the number of labels in the label set) and where y_{t,j} is the jth entry of the vector y_t. In the original output space Y(x), y_{t,j} is 1 for a single j and 0 for all others. In the relaxed output space Y_R(x), y_{t,j} can be interpreted as the probability of the tth position being labeled with label j. We use the following energy:

E_Θ(x, y) = − Σ_{t=1}^{T} Σ_{j=1}^{L} y_{t,j} (U_j^⊤ b(x, t)) + E_W(y)   (3)

where U_j ∈ R^d is a parameter vector for label j and E_W(y) is a structured energy term parameterized by parameters W. In a linear chain CRF, W is a transition matrix for scoring two adjacent labels. Different instantiations of E_W will be detailed in the sections below. Also, b(x, t) ∈ R^d denotes the "input feature vector" for position t. We define it to be the d-dimensional hidden vector at position t of a BiLSTM (Hochreiter and Schmidhuber, 1997). The full set of energy parameters Θ includes the U_j vectors, W, and the parameters of the BiLSTM. The above energy functions are trained with the objective in Section 2.3.

Table 1 shows the training and test-time inference requirements of our method compared to previous methods. For different formulations of the energy function, the inference network architecture is the same (e.g., a BiLSTM), so our inference complexity matches that of standard neural approaches that do not use structured prediction, which is linear in the label set size. By contrast, even for the first order model (the linear-chain CRF), Viterbi inference is quadratic in the label set size, and the time complexity of higher-order CRFs grows exponentially with the order.
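A minimal sketch of the unary part of this energy, assuming NumPy arrays for the relaxed labels, input features, and label parameters (the function and variable names are ours, not from the paper's code):

```python
import numpy as np

def unary_energy(y, b, U):
    # First term of Eq. (3): - sum_t sum_j y[t, j] * (U_j . b(x, t)).
    # y: T x L relaxed labels; b: T x d input features; U: L x d.
    scores = b @ U.T  # T x L unary label scores
    return float(-np.sum(y * scores))
```

Lower energy corresponds to higher total unary score, so a prediction that concentrates probability on high-scoring labels minimizes this term.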

Linear Chain Energies
Our first choice for a structured energy term is the relaxed linear chain energy defined for sequence labeling by Tu and Gimpel (2018):

E_W(y) = − Σ_{t=2}^{T} y_{t−1}^⊤ W y_t

where W ∈ R^{L×L} is the transition matrix used to score pairs of adjacent labels. If this linear chain energy is the only structured energy term in use, exact inference can be performed efficiently using the Viterbi algorithm.
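A sketch of this term on relaxed labels (our own illustrative code): for one-hot rows the bilinear form y_{t−1}^⊤ W y_t simply picks out the transition entry for the adjacent label pair.

```python
import numpy as np

def linear_chain_energy(y, W):
    # E_W(y) = - sum_{t=2}^{T} y_{t-1}^T W y_t over relaxed labels (T x L).
    T = y.shape[0]
    return float(-sum(y[t - 1] @ W @ y[t] for t in range(1, T)))
```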

Skip-Chain Energies
We also consider an energy inspired by "skip-chain" conditional random fields (Sutton and McCallum, 2004). In addition to consecutive labels, this energy also considers pairs of labels appearing within a given window of size M + 1:

E_W(y) = − Σ_{i=1}^{M} Σ_{t=1}^{T−i} y_t^⊤ W_i y_{t+i}

where each W_i ∈ R^{L×L} and the max window size M is a hyperparameter. While linear chain energies allow efficient exact inference, using skip-chain energies causes exact inference to require time exponential in M.
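The skip-chain term can be sketched as below (our own code): one transition matrix per skip distance, applied to every label pair at that distance.

```python
import numpy as np

def skip_chain_energy(y, Ws):
    # E_W(y) = - sum_{i=1}^{M} sum_t y_t^T W_i y_{t+i}; Ws[i-1] is the
    # L x L matrix scoring label pairs at distance i (M = len(Ws)).
    T = y.shape[0]
    total = 0.0
    for i, Wi in enumerate(Ws, start=1):
        for t in range(T - i):
            total += y[t] @ Wi @ y[t + i]
    return float(-total)
```

With M = 1 this reduces to the linear chain energy above.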

High-Order Energies
We also consider Mth-order energy terms. We use a function F to score the M + 1 consecutive labels y_{t−M}, . . . , y_t, then sum over positions:

E_W(y) = Σ_{t=M+1}^{T} F(y_{t−M}, . . . , y_t)   (4)

We consider several different ways to define the function F, detailed below.

Vectorized Kronecker Product (VKP):
A naive way to parameterize a high-order energy term would involve using a parameter tensor W ∈ R^{L^{M+1}} with an entry for each possible label sequence of length M + 1. To avoid this exponentially large number of parameters, we define a more efficient parameterization as follows.
We first define a label embedding lookup table in R^{L×n_l} and denote the embedding for label j by e_j. We consider M = 2 as an example. Then, for a tensor W ∈ R^{L×L×L}, its value at indices (i, j, k) is

W_{i,j,k} = v^⊤ MLP([e_i ; e_j ; e_k])

where v ∈ R^{(M+1)n_l} is a parameter vector and ; denotes vector concatenation. MLP expects and returns vectors of dimension (M + 1) × n_l and is parameterized as a multilayer perceptron. Then, the energy is computed as:

F(y_{t−2}, y_{t−1}, y_t) = VKP(y_{t−2}, y_{t−1}, y_t)^⊤ vec(W)

The operator VKP is somewhat similar to the Kronecker product of the k vectors v_1, . . . , v_k. However, it returns a vector, not a tensor:

VKP(v_1, . . . , v_k) = vec(v_1 ⊗ v_2 ⊗ · · · ⊗ v_k)

where vec is the operation that vectorizes a tensor into a (column) vector. Prior work (Lei et al., 2014; Srikumar and Manning, 2014; Yu et al., 2016) has used Kronecker products for higher-order feature combinations with low-rank tensors. Here we use this form to express the computation when scoring consecutive labels.
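A sketch of the VKP operator and the resulting trigram score (our own illustrative code; for one-hot inputs the score reduces to a single tensor entry):

```python
import numpy as np

def vkp(*vs):
    # VKP(v1, ..., vk): vectorized Kronecker product of k vectors,
    # returning a flat vector of length prod(len(v)) rather than a tensor.
    out = vs[0]
    for v in vs[1:]:
        out = np.kron(out, v)
    return out

def vkp_score(y1, y2, y3, W):
    # Score three consecutive relaxed labels against the L x L x L
    # tensor W by flattening both sides (C-order on both).
    return float(vkp(y1, y2, y3) @ W.reshape(-1))
```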
CNN: Convolutional neural networks (CNNs) are frequently used in NLP to extract features based on words or characters (Collobert et al., 2011; Kim, 2014). We apply CNN filters over the sequence of M + 1 consecutive labels. The F function is computed as follows:

F(y_{t−M}, . . . , y_t) = Σ_n g(W_n^⊤ [y_{t−M} ; . . . ; y_t] + b_n)

where g is a ReLU nonlinearity and the vector W_n ∈ R^{L(M+1)} and scalar b_n ∈ R are the parameters for filter n. The size of every filter is the same as the window size, namely M + 1, and the F function sums over all CNN filters. When viewing this high-order energy as a CNN, we can think of the summation in Eq. 4 as corresponding to sum pooling over time of the feature map outputs.
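The per-window CNN term can be sketched as below (our own code); each filter is a dot product with the concatenated label window, followed by ReLU, summed over filters.

```python
import numpy as np

def cnn_energy_term(window, filters, biases):
    # F = sum_n ReLU(W_n . [y_{t-M}; ...; y_t] + b_n), where window is
    # the concatenation of M+1 relaxed label vectors (length L*(M+1))
    # and filters has one row of length L*(M+1) per filter.
    z = filters @ window + biases
    return float(np.sum(np.maximum(z, 0.0)))
```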
Tag Language Model (TLM): Tu and Gimpel (2018) defined an energy term based on a pretrained "tag language model", which computes the probability of an entire sequence of labels. We also use a TLM, scoring a sequence of M + 1 consecutive labels in a way similar to Tu and Gimpel (2018); however, the parameters of the TLM are trained in our setting:

F(y_{t−M}, . . . , y_t) = − Σ_{i=t−M+1}^{t} log ( y_i^⊤ TLM(y_{t−M}, . . . , y_{i−1}) )

where TLM(·) returns the distribution over the next label given the preceding (relaxed) label vectors.

Self-Attention (S-Att): We adopt the multi-head self-attention formulation from Vaswani et al. (2017). Given a matrix C of the M + 1 consecutive labels, we compute query, key, and value matrices Q, K, and V as linear transformations of C, and then

H = attention(Q, K, V) = softmax(QK^⊤ / √d_k) V

where attention is the general attention mechanism: the weighted sum of the value vectors V using query vectors Q and key vectors K (Vaswani et al., 2017). The energy on the M + 1 consecutive labels is defined as the sum of entries in the feature map H ∈ R^{L×(M+1)} after the self-attention transformation.
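A single-head sketch of the self-attention energy (our own simplification of the multi-head version; all names and the identity projections in the test are ours):

```python
import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_att_energy(C, Wq, Wk, Wv):
    # C is the (M+1) x L matrix of consecutive relaxed labels; the
    # energy is the sum of entries of the attended feature map
    # H = softmax(Q K^T / sqrt(d_k)) V.
    Q, K, V = C @ Wq, C @ Wk, C @ Wv
    H = softmax(Q @ K.T / np.sqrt(Q.shape[1])) @ V
    return float(H.sum())
```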

Fully-Connected Energies
We can simulate a "fully-connected" energy function by setting a very large value for M in the skip-chain energy (Section 3.2). For efficiency and learnability, we use a low-rank parameterization for the many transition matrices W_i that result from increasing M. We first define a matrix S ∈ R^{L×d} that all W_i will use. Each i has a learned parameter matrix D_i ∈ R^{L×d}, and together S and D_i are used to compute W_i:

W_i = D_i S^⊤

where d is a tunable hyperparameter that affects the number of learnable parameters.
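The factorization can be sketched as follows (our own code): each L × L transition matrix is the product of a per-distance factor and a shared factor, so its rank is at most d and the parameter count grows as O(M·L·d) rather than O(M·L²).

```python
import numpy as np

def low_rank_transitions(S, Ds):
    # W_i = D_i S^T: each L x L matrix is built from a shared L x d
    # matrix S and a per-distance L x d matrix D_i, so rank(W_i) <= d.
    return [Di @ S.T for Di in Ds]
```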

Related Work
Linear chain CRFs (Lafferty et al., 2001), which consider dependencies between at most two adjacent labels or segments, are commonly used in practice (Sarawagi and Cohen, 2005; Lample et al., 2016; Ma and Hovy, 2016). There have been several efforts to develop efficient algorithms for handling higher-order CRFs. Qian et al. (2009) developed an efficient decoding algorithm under the assumption that all high-order features have non-negative weights. Some work has shown that high-order CRFs can be handled relatively efficiently if particular patterns of sparsity are assumed (Ye et al., 2009; Cuong et al., 2014). Mueller et al. (2013) proposed an approximate CRF using coarse-to-fine decoding and early updating. Loopy belief propagation (Murphy et al., 1999) has been used for approximate inference in high-order CRFs, such as skip-chain CRFs (Sutton and McCallum, 2004), which form the inspiration for one category of energy function in this paper.
CRFs are typically trained by maximizing conditional log-likelihood. Even assuming that the graph structure underlying the CRF admits tractable inference, it is still time-consuming to compute the partition function. Margin-based methods have been proposed (Taskar et al., 2004; Tsochantaridis et al., 2004) to avoid the summation over all possible outputs. Similar losses are used when training SPENs (Belanger and McCallum, 2016; Belanger et al., 2017), including in this paper.
The energy-based inference network learning framework has been used for multi-label classification (Tu and Gimpel, 2018), non-autoregressive machine translation (Tu et al., 2020b), and previously for sequence labeling (Tu and Gimpel, 2019).
Moving beyond CRFs and sequence labeling, there has been a great deal of work in the NLP community in designing non-local features, often combined with the development of approximate algorithms to incorporate them during inference. These include n-best reranking (Och et al., 2004), beam search (Lowerre, 1976), loopy belief propagation (Sutton and McCallum, 2004;Smith and Eisner, 2008), Gibbs sampling (Finkel et al., 2005), stacked learning (Cohen and de Carvalho, 2005;Krishnan and Manning, 2006), sequential Monte Carlo algorithms (Yang and Eisenstein, 2013), dynamic programming approximations like cube pruning (Chiang, 2007;Huang and Chiang, 2007), dual decomposition (Rush et al., 2010;Martins et al., 2011), and methods based on black-box optimization like integer linear programming (Roth and Yih, 2004). These methods are often developed or applied with particular types of non-local energy terms in mind. By contrast, here we find that the framework of SPEN learning with inference networks can support a wide range of high-order energies for sequence labeling.

Datasets
POS. We use the annotated data from Gimpel et al. (2011) and Owoputi et al. (2013), which contains 25 POS tags. We use the 100-dimensional skip-gram embeddings from Tu et al. (2017), which were trained on a dataset of 56 million English tweets using word2vec (Mikolov et al., 2013). The evaluation metric is tagging accuracy.
NER. We use the CoNLL 2003 English data (Tjong Kim Sang and De Meulder, 2003). We use the BIOES tagging scheme, so there are 17 labels. We use 100-dimensional pretrained GloVe (Pennington et al., 2014) embeddings. The task is evaluated with micro-averaged F1 score.
CCG. We use the standard splits from CCGbank (Hockenmaier and Steedman, 2002). During training, we only keep sentences with length less than 50 from the original training data. The training data contains 1,284 unique labels, but because the label distribution has a long tail, we use only the 400 most frequent labels, replacing the others by a special tag *. The percentages of * in train/development/test are 0.25/0.23/0.23%. When the gold standard tag is *, the prediction is always evaluated as incorrect. We use the same GloVe embeddings as in NER. The task is evaluated with per-token accuracy.
SRL. We use the standard split from CoNLL 2005 (Carreras and Màrquez, 2005). The gold predicates are provided as part of the input. We use the official evaluation script from the CoNLL 2005 shared task for evaluation. We again use the same GloVe embeddings as in NER. To form the inputs to our models, an embedding of a binary feature indicating whether the word is the given predicate is concatenated to the word embedding.

Training
Local Classifiers. We consider local baselines that use a BiLSTM trained with the local loss ℓ_token. For POS, NER, and CCG, we use a 1-layer BiLSTM with hidden size 100, and the word embeddings are fixed during training. For SRL, we use a 4-layer BiLSTM with hidden size 300, and the word embeddings are fine-tuned.
BiLSTM-CRF. We also train BiLSTM-CRF models with the standard conditional log-likelihood objective. A 1-layer BiLSTM with hidden size 100 is used for extracting input features. The CRF part uses a linear chain energy with a single tag transition parameter matrix. We do early stopping based on development sets. The usual dynamic programming algorithms are used for training and inference, e.g., the Viterbi algorithm is used for inference. The same pretrained word embeddings as for the local classifiers are used.
Inference Networks. When defining architectures for the inference networks, we use the same architectures as the local classifiers. However, the objective of the inference networks is different, which is shown in Section 2.3. λ = 1 and τ = 1 are used for training. We do early stopping based on the development set.
Energy Terms. The unary terms are parameterized using a one-layer BiLSTM with hidden size 100. For the structured energy terms, the VKP operation uses n_l = 20, the number of CNN filters is 50, and the tag language model is a 1-layer LSTM with hidden size 100. For the fully-connected energy, we use d = 20 for the low-rank approximation of the transition matrices and M = 20.
Hyperparameters. For inference network training, the batch size is 100. We update the energy function parameters using the Adam optimizer (Kingma and Ba, 2014) with learning rate 0.001. For POS, NER, and CCG, we train the inference network parameters with stochastic gradient descent with momentum as the optimizer. The learning rate is 0.005 and the momentum is 0.9. For SRL, we train the inference networks using Adam with learning rate 0.001.

Results
Parameterizations for High-Order Energies. We first compare several choices for energy functions within our inference network learning framework. In Section 3.3, we considered several ways to define the high-order energy function F . We compare performance of the parameterizations on three tasks: POS, NER, and CCG. The results are shown in Table 2.
For VKP high-order energies, there are small differences between 2nd and 3rd order models; however, 4th order models are consistently worse. The CNN high-order energy is best with M = 2 for all three tasks; increasing M does not consistently help. The tag language model (TLM) works best when scoring the entire label sequence, so in the following experiments with TLM energies, we always use this "all" setting. Self-attention (S-Att) also shows better performance with larger M; however, its results for NER are not as high overall as for other energy terms.
Overall, there is no clear winner among the four types of parameterizations, indicating that a variety of high-order energy terms can work well on these tasks, once appropriate window sizes are chosen. We do note differences among tasks: NER benefits more from larger window sizes than POS.
Comparing Structured Energy Terms. Above we compared parameterizations of the high-order energy terms. In Table 3, we compare instantiations of the structured energy term E_W(y): linear-chain energies, skip-chain energies, high-order energies, and fully-connected energies. We also compare to local classifiers (BiLSTM). The models with structured energies typically improve over the local classifiers, even with just the linear chain energy.
The richer energy terms tend to perform better than linear chain, at least for most tasks and energies. The skip-chain energies benefit from relatively large M values, i.e., 3 or 4 depending on the task. These tend to be larger than the optimal VKP M values. We note that S-Att high-order energies work well on SRL. This points to the benefits of self-attention on SRL, which has been found in recent work (Tan et al., 2018; Strubell et al., 2018). Both the skip-chain and high-order energy models achieve substantial improvements over the linear chain CRF, notably a gain of 0.8 F1 for NER. The fully-connected energy is not as strong as the others, possibly due to the energies from label pairs spanning a long range. These long-range energies do not appear helpful for these tasks.

Table 4: Test results when inference networks have 2 layers (so the local classifier baseline also has 2 layers).
Comparison using Deeper Inference Networks. Table 4 compares methods when using 2-layer BiLSTMs as inference networks. The deeper inference networks reach higher performance across all tasks compared to 1-layer inference networks.
We observe that inference networks trained with skip-chain energies and high-order energies achieve better results than the BiLSTM-CRF on the three datasets (the Viterbi algorithm is used for exact inference in the BiLSTM-CRF). This indicates that adding richer energy terms can make up for approximate inference during training and at test time. Moreover, a 2-layer BiLSTM is much cheaper computationally than Viterbi, especially for tasks with large label sets.

Results on Noisy Datasets
We now consider the impact of our structured energy terms in noisy data settings. Our motivation for these experiments stems from the assumption that structured energies will be more helpful when there is a weaker relationship between the observations and the labels. One way to achieve this is by introducing noise into the observations. So, we create new datasets: for any given sentence, we randomly replace a token x with an unknown word symbol "UNK" with probability α. From previous results, we see that NER shows more benefit from structured energies, so we focus on NER and consider two settings: UnkTest: train on clean text, evaluate on noisy text; and UnkTrain: train on noisy text, evaluate on noisy text.

Table 7: Test results for NER when using BERT. When using energy-based inference networks (our framework), BERT is used in both the energy function and as the inference network architecture.

Table 5 shows results for UnkTest. CNN energies are best among all structured energy terms, including the different parameterizations. Increasing M improves F1, showing that high-order information helps the model recover from the high degree of noise. Table 6 shows results for UnkTrain. The CNN high-order energies again yield large gains: roughly 2 points compared to the local classifier and 1.8 points compared to the linear chain energy.
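The UNK corruption used to build these noisy datasets can be sketched as follows (our own code; the function name and seed handling are illustrative):

```python
import random

def corrupt(tokens, alpha, unk="UNK", seed=0):
    # Independently replace each token with the unknown symbol with
    # probability alpha, using a fixed seed for reproducibility.
    rng = random.Random(seed)
    return [unk if rng.random() < alpha else tok for tok in tokens]
```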

Incorporating BERT
Researchers have recently been applying large-scale pretrained transformers like BERT (Devlin et al., 2019) to many tasks, including sequence labeling. To explore the impact of high-order energies on BERT-like models, we now consider experiments that use BERT BASE in various ways. We use two baselines: (1) BERT fine-tuned for NER using a local loss, and (2) a CRF using BERT features ("BERT-CRF"). Within our framework, we also experiment with using BERT in both the energy function and the inference network architecture. That is, the "input feature vector" in Equation 3 is replaced by the features from BERT. The energy and inference networks are trained with the objective in Section 2.3. For training the energy function and inference networks, we use Adam with learning rate 5e−5, a batch size of 32, and L2 weight decay of 1e−5. The results are shown in Table 7. There is a slight improvement when moving from BERT trained with the local loss to using BERT within the CRF (92.13 to 92.34). There is little difference (92.13 vs. 92.14) between the locally-trained BERT model and using the linear-chain energy function within our framework. However, with the higher-order energies, the difference is larger (92.13 to 92.46).
Figure 1: Learned pairwise potential matrices W_1 (panel a) and W_3 (panel b) for NER with skip-chain energy. The rows correspond to earlier labels and the columns correspond to subsequent labels.

Analysis of Learned Energies
In this section, we visualize our learned energy functions for NER to see what structural dependencies among labels have been captured. Figure 1 visualizes two matrices in the skip-chain energy with M = 3. We can see strong associations among labels in neighborhoods from W_1. For example, B-ORG and I-ORG are more likely to be followed by E-ORG. The W_3 matrix shows a strong association between I-ORG and E-ORG, which suggests that organization names are often long in this dataset.
For the VKP energy with M =3, Figure 2 shows the learned matrix when the first label is B-PER, showing that B-PER is likely to be followed by "I-PER E-PER", "E-PER O", or "I-PER I-PER".
In order to visualize the learned CNN filters, we calculate the inner product between the filter weights and consecutive labels. For each filter, we select the sequence of consecutive labels with the highest inner product. Table 8 shows the 10 filters with the highest inner products and the corresponding label trigrams. All filters give high scores to label sequences with strong local dependencies, such as "B-MISC I-MISC E-MISC" and "B-LOC I-LOC E-LOC". Figure 3 in the appendix shows these inner product scores for 50 CNN filters on a sampled NER label sequence. We observe that the filters learn a sparse set of label trigrams with strong local dependencies.
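The selection step can be sketched as follows (our own code): brute-force search over all one-hot label trigrams for the one with the highest inner product with a filter's weight vector.

```python
import numpy as np
from itertools import product

def best_trigram(filter_w, L):
    # For one CNN filter over label trigrams (weight vector of length
    # 3L), return the one-hot label trigram with the highest inner
    # product and its score.
    best, best_score = None, -np.inf
    for labels in product(range(L), repeat=3):
        v = np.concatenate([np.eye(L)[j] for j in labels])
        score = float(filter_w @ v)
        if score > best_score:
            best, best_score = labels, score
    return best, best_score
```

Since a one-hot trigram selects one weight per position, the search is equivalent to taking the argmax within each length-L segment of the filter; the brute force above just makes that explicit.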

Conclusion
We explore arbitrary-order models with different neural parameterizations for sequence labeling tasks via energy-based inference networks. This approach achieves substantial improvements using high-order energy terms, especially in noisy data conditions, while having the same decoding speed as simple local classifiers.

Table 9: Results on all tasks for local classifiers and different structured energy functions: linear-chain energies, Kronecker product high-order energies, skip-chain energies, and fully-connected energies. The metrics for POS, NER, CCG, and SRL are accuracy, F1, accuracy, and F1, respectively. The inference network architecture is a one-layer BiLSTM.