Speed-Accuracy Tradeoffs in Tagging with Variable-Order CRFs and Structured Sparsity

We propose a method for learning the structure of variable-order CRFs, a more flexible variant of higher-order linear-chain CRFs. Variable-order CRFs achieve faster inference by including features for only some of the tag n-grams. Our learning method discovers the useful higher-order features at the same time as it trains their weights, by maximizing an objective that combines log-likelihood with a structured-sparsity regularizer. An active-set outer loop allows the feature set to grow as far as needed. On part-of-speech tagging in 5 randomly chosen languages from the Universal Dependencies dataset, our method of shrinking the model achieved a 2–6x speedup over a baseline, with no significant drop in accuracy.


Introduction
Conditional Random Fields (CRFs) (Lafferty et al., 2001) are a convenient formalism for sequence labeling tasks common in NLP. A CRF defines a feature-rich conditional distribution over tag sequences (output) given an observed word sequence (input).
The key advantage of the CRF framework is the flexibility to consider arbitrary features of the input, as well as enough features over the output structure to encourage it to be well-formed and consistent. However, inference in CRFs is fast only if the features over the output structure are limited. For example, an order-k CRF (or "k-CRF" for short, with k > 1 being "higher-order") allows expressive features over a window of k+1 adjacent tags (as well as the input), and then inference takes time $O(n \cdot |Y|^{k+1})$, where $Y$ is the set of tags and $n$ is the length of the input.
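To make the growth in $|Y|^{k+1}$ concrete, the following minimal sketch counts the tag windows an order-k CRF must score at each position, using the tag-set size $|Y| = 17$ from the experiments. The function name is ours and purely illustrative.

```python
# Illustrative only: number of distinct windows of k+1 adjacent tags that an
# order-k CRF scores at each input position, for a tag set of size 17.
NUM_TAGS = 17


def kcrf_ngrams_per_position(num_tags: int, k: int) -> int:
    """Return |Y|^(k+1), the per-position factor in the O(n * |Y|^(k+1)) cost."""
    return num_tags ** (k + 1)


costs = {k: kcrf_ngrams_per_position(NUM_TAGS, k) for k in range(3)}
for k, count in costs.items():
    print(f"k={k}: {count} tag n-grams per position")
```

Each increment of k multiplies the per-position work by $|Y|$, which is why pruning higher-order n-grams can pay off quickly.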
How large does k need to be? Typically k = 2 works well, with big gains from 0 → 1 and modest gains from 1 → 2 (Fig. 1). Small k may be sufficient when there is enough training data to allow the model to attend to many fine-grained features of the input (Toutanova et al., 2003; Liang et al., 2008). For example, when predicting POS tags in morphologically rich languages, certain words are easily tagged based on their spelling without considering the context (k = 0). In fact, such languages tend to have a freer word order, making tag context less useful.
* Equal contribution
Figure 1: Speed-accuracy tradeoff curves on test data for the 5 languages. Large dark circles represent the k-CRFs of ascending orders along the x-axis (marked for Slovenian). Smaller triangles each represent a VoCRF discovered by sweeping the speed parameter γ. We find faster models at similar accuracy to the best k-CRFs (§5).
We investigate a hybrid approach that gives the accuracy of higher-order models while reducing runtime. We build on variable-order CRFs (VoCRFs; Ye et al., 2009), which support features on tag subsequences of mixed orders. Since only modest gains are obtained from moving to higher-order models, we posit that only a small fraction of the higher-order features are necessary. We introduce a hyperparameter γ that discourages the model from using many higher-order features (yielding faster inference) and a hyperparameter λ that encourages generalization. Thus, sweeping a range of values for γ and λ gives rise to a number of operating points along the speed-accuracy curve (triangle points in Fig. 1).
We present three contributions: (1) We give a simplified exposition of VoCRFs, including an algorithm for computing gradients that is asymptotically more efficient than prior art (Cuong et al., 2014). (2) We develop a structure learning algorithm for discovering the essential set of higher-order dependencies so that inference is fast and accurate. (3) We investigate the effectiveness of our approach on POS tagging in five diverse languages. We find that the amount of required context for accurate prediction is highly language-dependent. In all languages, however, our approach meets the accuracy of fixed-order models at a fraction of the runtime.

Variable-Order CRFs
An order-k CRF (k-CRF, for short) is a conditional probability distribution of the form
$$p_\theta(y \mid x) = \frac{1}{Z_\theta(x)} \exp\left( \sum_{t=1}^{n+1} \theta \cdot f(x, t, y_{t-k} \ldots y_t) \right)$$
where $n$ is the length of the input $x$, $\theta \in \mathbb{R}^d$ is the model parameter vector, and $f$ is an arbitrary user-defined function that computes a vector in $\mathbb{R}^d$ of features of the tag substring $s = y_{t-k} \ldots y_t$ when it appears at position $t$ of input $x$. We define $y_i$ to be a distinguished boundary tag # when $i \notin [1, n]$.
A variable-order CRF or VoCRF is a refinement of the k-CRF, in which $f$ may not always depend on all k + 1 of the tags that it has access to. The features of a particular tag substring $s$ may sometimes be determined by a shorter suffix of $s$.
To be precise, a VoCRF specifies a finite set $W \subset Y^*$ that is sufficient for feature computation (where $Y^*$ denotes the set of all tag sequences). The VoCRF's featurization function $f(x, t, s)$ is then defined as $f'(x, t, w(s))$, where $f'$ can be any function and $w(s) \in Y^*$ is the longest suffix of $s$ that appears in $W$ (or $\varepsilon$ if none exists). The full power of a k-CRF can be obtained by specifying $W = Y^{k+1}$, but smaller $W$ will in general allow speedups.
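The map from a tag substring to its longest suffix in W can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: tag strings are represented as Python tuples, the empty tuple stands for ε, and the toy set `W` below is hypothetical.

```python
from typing import Set, Tuple

Tags = Tuple[str, ...]


def longest_suffix_in(s: Tags, W: Set[Tags]) -> Tags:
    """Return w(s): the longest suffix of s that is in W, or () for epsilon."""
    for start in range(len(s)):      # try the longest suffix first
        if s[start:] in W:
            return s[start:]
    return ()                        # no suffix of s appears in W


# Hypothetical toy W: three unigrams plus a single bigram.
W = {("DET",), ("NOUN",), ("VERB",), ("DET", "NOUN")}

print(longest_suffix_in(("VERB", "DET", "NOUN"), W))
print(longest_suffix_in(("NOUN", "VERB"), W))
```

With this W, only the context DET before NOUN is distinguished from the unigram case; every other trigram falls back to a unigram feature.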
To support our algorithms, we define $\overline{W}$ to be the closure of $W$ under prefixes and last-character substitution. Formally, $\overline{W}$ is the smallest nonempty superset of $W$ such that if $hy \in \overline{W}$ for some $h \in Y^*$ and $y \in Y$, then $h \in \overline{W}$ and also $hy' \in \overline{W}$ for all $y' \in Y$. This implies that we can factor $\overline{W}$ as $H \times Y$, where $H \subset Y^*$ is called the set of histories.
Algorithm 1 FORWARD: Compute $\log Z_\theta(x)$.
  Initialization: $\alpha(\cdot, \cdot) = 0$; $\alpha(0, \#) = 1$
  for $t = 1$ to $n + 1$: …
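The closure and its factorization into histories and final tags can be sketched as follows. This is an illustration under one stated assumption: we treat the empty string ε as a history in H but do not count it as a member of the closure itself, so that the closure is exactly H × Y. The helper names and the toy inputs are ours.

```python
from typing import Iterable, Sequence, Set, Tuple

Tags = Tuple[str, ...]


def histories(W: Iterable[Tags]) -> Set[Tags]:
    """Histories H: every prefix of w[:-1] for w in W, plus the empty history ()."""
    H: Set[Tags] = {()}
    for w in W:
        for end in range(len(w)):    # w[:0], ..., w[:len(w)-1]
            H.add(tuple(w[:end]))
    return H


def closure(W: Iterable[Tags], Y: Sequence[str]) -> Set[Tags]:
    """Closure under prefixes and last-character substitution, built as H x Y."""
    return {h + (y,) for h in histories(W) for y in Y}


# Hypothetical toy example: three tags and a single bigram in W.
Y = ["A", "B", "C"]
W = {("A", "B")}
W_bar = closure(W, Y)
print(sorted(W_bar))
```

Here H = {ε, A}, so the closure holds the three unigrams plus the three bigrams that start with A.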
We now define NEXT($h$, $y$) to return the longest suffix of $hy$ that is in $H$ (which may be $hy$ itself, or even $\varepsilon$). We may regard NEXT as the transition function of a deterministic finite-state automaton (DFA) with state set $H$ and alphabet $Y$. If this DFA is used to read any tag sequence $y \in Y^*$, then the arc that reads $y_t$ comes from a state $h$ such that $hy_t$ is the longest suffix of $s = y_{t-k} \ldots y_t$ that appears in $\overline{W}$; thus $w(hy_t) = w(s) \in W$, which provides sufficient information to compute $f(x, t, s)$.
For a given $x$ of length $n$ and given parameters $\theta$, the log-normalizer $\log Z_\theta(x)$, which will be needed to compute the log-probability in eq. (1) below, can be found in time $O(|\overline{W}| \, n)$ by dynamic programming. Concise pseudocode is in Alg. 1. In effect, this runs the forward algorithm on the lattice of taggings given by length-$n$ paths through the DFA.
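The following is a minimal sketch of NEXT and of a forward pass over the resulting DFA, written in log space. It is not a transcription of Alg. 1. The callback `arc_score(t, h, y)` is a hypothetical stand-in for the log-potential $\theta \cdot f'(x, t, w(hy))$, and H is assumed to be prefix-closed and to contain the empty history.

```python
import math
from typing import Callable, Dict, Sequence, Set, Tuple

Tags = Tuple[str, ...]
BOUNDARY = "#"  # distinguished boundary tag


def next_state(h: Tags, y: str, H: Set[Tags]) -> Tags:
    """NEXT(h, y): the longest suffix of h+y that is a history in H."""
    hy = h + (y,)
    for start in range(len(hy) + 1):     # the last candidate is the empty history
        if hy[start:] in H:
            return hy[start:]
    raise ValueError("H must contain the empty history ()")


def log_add(a: float, b: float) -> float:
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def forward_log_z(n: int, Y: Sequence[str], H: Set[Tags],
                  arc_score: Callable[[int, Tags, str], float]) -> float:
    """log Z over taggings y_1..y_n, followed by the boundary tag at position n+1."""
    # alpha[h] = log of the total weight of all partial taggings ending in state h.
    alpha: Dict[Tags, float] = {next_state((), BOUNDARY, H): 0.0}
    for t in range(1, n + 2):
        tags = Y if t <= n else [BOUNDARY]
        new_alpha: Dict[Tags, float] = {}
        for h, a in alpha.items():
            for y in tags:
                h2 = next_state(h, y, H)
                new_alpha[h2] = log_add(new_alpha.get(h2, -math.inf),
                                        a + arc_score(t, h, y))
        alpha = new_alpha
    log_z = -math.inf
    for a in alpha.values():
        log_z = log_add(log_z, a)
    return log_z


# With all scores 0, Z simply counts the |Y|^n taggings.
log_z = forward_log_z(3, ["A", "B"], {(), ("A",)}, lambda t, h, y: 0.0)
```

Each position touches at most |H| · |Y| arcs, which matches the stated bound on the work per position.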
For finding the parameters $\theta$ that minimize eq. (1) below, we want the gradient $\nabla_\theta \log Z_\theta(x)$. By applying algorithmic differentiation to Alg. 1, we obtain Alg. 2, which uses back-propagation to compute the gradient (asymptotically) as fast as Alg. 1 and $|H|$ times faster than the algorithm of Cuong et al. (2014). This is a significant speedup, since $|H|$ is often quite large (up to 300 in our experiments). Algs. 1-2 together effectively run the forward-backward algorithm on the lattice of taggings. It is straightforward to modify Alg. 1 to obtain a Viterbi decoder that finds the most-likely tag sequence under $p_\theta(\cdot \mid x)$. It is also straightforward to modify Alg. 2 to compute the marginal probabilities of tag substrings occurring at particular positions.
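As an illustration of the Viterbi variant mentioned above, the sketch below replaces the sum over incoming paths with a max and keeps the best tag sequence per DFA state. It is our own simplified rendering: `arc_score` is a hypothetical stand-in for the log-potential of an arc, and for brevity each state stores its whole best path rather than back-pointers.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Tags = Tuple[str, ...]
BOUNDARY = "#"  # distinguished boundary tag


def next_state(h: Tags, y: str, H: Set[Tags]) -> Tags:
    """NEXT(h, y): the longest suffix of h+y that is a history in H."""
    hy = h + (y,)
    for start in range(len(hy) + 1):
        if hy[start:] in H:
            return hy[start:]
    raise ValueError("H must contain the empty history ()")


def viterbi(n: int, Y: Sequence[str], H: Set[Tags],
            arc_score: Callable[[int, Tags, str], float]) -> Tuple[float, List[str]]:
    """Return (best score, best tag sequence y_1..y_n) over the DFA lattice."""
    # best[h] = (score, tags so far) of the highest-scoring path ending in state h.
    best: Dict[Tags, Tuple[float, List[str]]] = {next_state((), BOUNDARY, H): (0.0, [])}
    for t in range(1, n + 2):
        tags = Y if t <= n else [BOUNDARY]
        new_best: Dict[Tags, Tuple[float, List[str]]] = {}
        for h, (score, path) in best.items():
            for y in tags:
                h2 = next_state(h, y, H)
                cand = score + arc_score(t, h, y)
                if h2 not in new_best or cand > new_best[h2][0]:
                    new_best[h2] = (cand, path + [y])
        best = new_best
    score, path = max(best.values(), key=lambda sp: sp[0])
    return score, path[:-1]          # drop the final boundary tag


# Toy scoring: reward reading A directly after A.
toy_score = lambda t, h, y: 1.0 if (h == ("A",) and y == "A") else 0.0
best_score, best_tags = viterbi(2, ["A", "B"], {(), ("A",)}, toy_score)
```

A production decoder would store back-pointers instead of copying paths, but the recurrence is the same.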

Structured Sparsity and Active Sets
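We begin with a k-CRF model whose feature vector combines non-stationary zeroth-order features $f^{(1)}(x, t, y_t)$ with stationary features $f^{(2)}$. The latter include, for each tag string $w$, an indicator feature that is 1 if $w$ is a suffix of $y_{t-k} \ldots y_t$ and is 0 otherwise. To obtain the advantages of a VoCRF, we merely have to choose a sparse weight vector $\theta$. The set $W$ can then be defined to be the set of strings in $Y^*$ whose features have nonzero weight. Prior work (Cuong et al., 2014) has left the construction of $W$ to domain experts or "one size fits all" strategies (e.g., k-CRF). Our goal is to choose $\theta$, and thus $W$, so that inference is accurate and fast.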
Our approach is to modify the usual $L_2$-regularized log-likelihood training criterion with a carefully defined runtime penalty scaled by a parameter $\gamma$ to balance competing objectives, namely likelihood on the $m$ training pairs $(x^{(i)}, y^{(i)})$ versus runtime:
$$-\sum_{i=1}^{m} \log p_\theta(y^{(i)} \mid x^{(i)}) + \lambda \|\theta\|_2^2 + \gamma R(\theta) \qquad (1)$$
Recall that the runtime of inference on a given sentence is proportional to the size of $\overline{W}$, the closure of $W$ under prefixes and last-character replacement. (Any tag strings in $\overline{W} \setminus W$ can get nonzero weight without increasing runtime.) Thus, $R(\theta)$ would ideally measure $|\overline{W}|$, or proportionately, $|H|$. Experimentally, we find that $|\overline{W}|$ has > 99% Pearson correlation with wallclock time, making it an excellent proxy for wallclock time while being more replicable.
We relax this regularizer to a convex function, a tree-structured group lasso objective (Yuan and Lin, 2006; Nelakanti et al., 2013). For each string $h \in Y^*$, we have a group $G_h$ consisting of the indicator features (in $f^{(2)}$) for all strings $w \in W$ that have $h$ as a proper prefix. Fig. 2 gives a visual depiction. We now define $R(\theta) = \sum_{h \in Y^*} \|\theta_{G_h}\|_2$. This penalty encourages each group of weights to remain all at zero (thereby conserving runtime, in our setting, because it means that $h$ does not need to be added to $H$). Once a single weight in a group becomes nonzero, the "initial inertia" induced by the group lasso penalty is overcome, and other features in the group can be more cheaply adjusted away from zero.
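The penalty $R(\theta)$ can be computed directly from its definition. The sketch below is illustrative: it stores the stationary weights in a dictionary keyed by tag string, and it takes the sum over $h \in Y^*$ literally, so the empty history also indexes a group. Groups whose weights are all zero contribute nothing, so only prefixes of stored strings need to be visited.

```python
import math
from typing import Dict, Tuple

Tags = Tuple[str, ...]


def group_lasso_penalty(theta: Dict[Tags, float]) -> float:
    """R(theta) = sum over h of the L2 norm of the weights in group G_h,
    where G_h holds the weights of all strings that have h as a proper prefix."""
    sq_norm: Dict[Tags, float] = {}
    for w, weight in theta.items():
        for end in range(len(w)):            # proper prefixes w[:0], ..., w[:len(w)-1]
            h = w[:end]
            sq_norm[h] = sq_norm.get(h, 0.0) + weight * weight
    return sum(math.sqrt(v) for v in sq_norm.values())


# Hypothetical weights: one unigram and one bigram extending it.
theta = {("A",): 3.0, ("A", "B"): 4.0}
penalty = group_lasso_penalty(theta)
```

In this toy case the group of the empty history contains both weights (norm 5) and the group of A contains only the bigram weight (norm 4). A deeper string therefore sits in more groups and is penalized more, which is what makes long histories expensive to introduce.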
Although eq. (1) is now convex, directly optimizing it would be expensive for large k, since $\theta$ then contains very many parameters. We thus use a heuristic optimization algorithm, the active set method (Schmidt, 2010), which starts with a low-dimensional $\theta$ and incrementally adds features to the model. This also frees us from needing to specify a limit k. Rather, $W$ grows until further extensions are unhelpful, and then implicitly $k = \max_{w \in W} |w| - 1$.
The method defines $f^{(2)}$ to include indicator features for all tag sequences $w$ in an active set $W_{\text{active}}$. Thus, $\theta^{(2)}$ is always a vector of $|W_{\text{active}}|$ real numbers. Initially, we take $W_{\text{active}} = Y$ and $\theta = 0$. At each active set iteration, we fully optimize eq. (1) to obtain a sparse $\theta$ and a set $W = \{w \in W_{\text{active}} \mid \theta^{(2)}_w \neq 0\}$ of features that are known to be "useful." We then update $W_{\text{active}}$ to $\{wy \mid w \in W, y \in Y\}$, so that it includes single-tag extensions of these useful features; this expands $\theta$ to consider additional features that plausibly might prove useful. Finally, we complete the iteration by updating $W_{\text{active}}$ to its closure $\overline{W}_{\text{active}}$, simply because this further expansion of the feature set will not slow down our algorithms. When eq. (1) is re-optimized at the next iteration, some of these newly added features in $W_{\text{active}}$ may acquire nonzero weights and thus enter $W$, allowing further extensions. We can halt once $W$ no longer changes.
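The outer loop described above can be sketched as follows. This is a schematic only: the `optimize` callback is a hypothetical placeholder for fully optimizing eq. (1) over the active features, and `fake_optimize` below is a toy stand-in that simply marks a fixed set of strings as useful so that the loop can be run.

```python
from typing import Callable, Dict, Iterable, Sequence, Set, Tuple

Tags = Tuple[str, ...]


def closure(W: Iterable[Tags], Y: Sequence[str]) -> Set[Tags]:
    """Closure under prefixes and last-character substitution, built as H x Y."""
    H: Set[Tags] = {()}
    for w in W:
        for end in range(len(w)):
            H.add(tuple(w[:end]))
    return {h + (y,) for h in H for y in Y}


def active_set(Y: Sequence[str],
               optimize: Callable[[Set[Tags]], Dict[Tags, float]],
               max_iters: int = 3) -> Set[Tags]:
    """Grow the active set until the set W of nonzero-weight strings stops changing."""
    W_active: Set[Tags] = {(y,) for y in Y}       # start from the unigrams
    W: Set[Tags] = set()
    for _ in range(max_iters):
        theta = optimize(W_active)                # hypothetical: returns sparse weights
        new_W = {w for w in W_active if theta.get(w, 0.0) != 0.0}
        if new_W == W:
            break                                 # W no longer changes
        W = new_W
        extended = {w + (y,) for w in W for y in Y}
        W_active = closure(extended, Y)           # free extra features from the closure
    return W


# Toy stand-in for training: only these strings ever receive nonzero weight.
USEFUL = {("A",), ("B",), ("A", "B")}


def fake_optimize(W_active: Set[Tags]) -> Dict[Tags, float]:
    return {w: (1.0 if w in USEFUL else 0.0) for w in W_active}


learned_W = active_set(["A", "B"], fake_optimize)
```

In the toy run, the bigram AB can only be discovered in the second iteration, after the unigram A has entered W and been extended.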
As a final step, we follow common practice by running "debiasing" (Martins et al., 2011a), where we fix our f (2) feature set to be given by the final W, and retrain θ without the group lasso penalty term.
In practice, we optimized eq. (1) using the online proximal gradient algorithm SPOM (Martins et al., 2011b) and Adagrad (Duchi et al., 2011) with η = 0.01 and 15 inner epochs. We limited training to 3 active set iterations, and as a result, our final $W$ contained at most tag trigrams.
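Proximal methods for group lasso rely on a closed-form shrinkage step for each group. The sketch below shows only that standard single-group building block, the proximal operator of $\tau \|v\|_2$. It is not SPOM itself, and handling the nested tree-structured groups used here would require applying such steps in a suitable order, which is not shown.

```python
import math
from typing import List


def group_soft_threshold(v: List[float], tau: float) -> List[float]:
    """Proximal operator of tau * ||v||_2: shrink the whole group toward zero,
    and set it exactly to zero when its norm is at most tau."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= tau:
        return [0.0] * len(v)
    scale = 1.0 - tau / norm
    return [scale * x for x in v]


zeroed = group_soft_threshold([3.0, 4.0], 5.0)   # norm 5 does not exceed tau
shrunk = group_soft_threshold([3.0, 4.0], 2.5)   # norm 5 is scaled down by half
```

The all-or-nothing behavior at the threshold is the source of the "initial inertia" of a group: a group stays at exactly zero until the gradient signal is strong enough to push its norm past the threshold.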

Related Work
Our paper can be seen as transferring methods of Cotterell and Eisner (2015) to the CRF setting. They too used tree-structured group lasso and active set to select variable-order n-gram features W for globally-normalized sequence models (in their case, to rapidly and accurately approximate beliefs during message-passing inference). Similarly, Nelakanti et al. (2013) used tree-structured group lasso to regularize a variable-order language model (though their focus was training speed). Here we apply these techniques to conditional models for tagging.
Our work directly builds on the variable-order CRF of Cuong et al. (2014), with a speedup in Alg. 2, but our approach also learns the VoCRF structure. Our method is also related to the generative variable-order tagger of Schütze and Singer (1994).
Our static feature selection chooses a single model that permits fast exact marginal inference, similar to learning a low-treewidth graphical model (Bach and Jordan, 2001; Elidan and Gould, 2008). This contrasts with recent papers that learn to do approximate 1-best inference using a sequence of models, whether by dynamic feature selection within a greedy inference algorithm (Strubell et al., 2015), or by gradually increasing the feature set of a 1-best global inference algorithm and pruning its hypothesis space after each increase (Weiss and Taskar, 2010; He et al., 2013). Schmidt (2010) explores the use of group lasso penalties and the active set method for learning the structure of a graphical model, but does not consider learning repeated structures (in our setting, $W$ defines a structure that is reused at each position). Steinhardt and Liang (2015) learn a variable-order model that dynamically determines how much context to use in a beam search decoder.

Experiments
Data: We conduct experiments on multilingual POS tagging. The task is to label each word in a sentence with one of $|Y| = 17$ labels. We train on five typologically diverse languages from the Universal Dependencies (UD) corpora (Petrov et al., 2012): Basque, Bulgarian, Hindi, Norwegian and Slovenian. For each language, we start with the original train / dev / test split in the UD dataset, then move random sentences from train into dev until the dev set has 3000 sentences. This ensures more stable hyperparameter tuning. We use these new splits below.
Eval: We train models with $(\lambda, \gamma) \in \{10^{-4} m, 10^{-3} m, 10^{-2} m\} \times \{0, 0.1 m, 0.2 m, \ldots, m\}$, where $m$ is the number of training sentences. To tag a dev or test sentence, we choose its most probable tag sequence. For each of several model sizes, Table 1 selects the model of that size that achieved the highest per-token tagging accuracy on the dev set, and reports that model's accuracy on the test set.
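The hyperparameter grid can be enumerated as in the sketch below. It assumes that the elided γ values continue in steps of 0.1·m, which is the natural reading of the range but is our assumption, and the training-set size used in the example is made up.

```python
from typing import List, Tuple


def hyperparameter_grid(m: int) -> List[Tuple[float, float]]:
    """All (lambda, gamma) pairs, both scaled by the number of training sentences m."""
    lambdas = [10.0 ** e * m for e in (-4, -3, -2)]
    gammas = [i / 10 * m for i in range(11)]     # assumed step of 0.1 * m
    return [(lam, gam) for lam in lambdas for gam in gammas]


grid = hyperparameter_grid(m=1000)               # m = 1000 is a made-up size
print(len(grid), "settings")
```

Under this reading each language is trained at 3 × 11 = 33 settings, and γ = 0 recovers training with no runtime penalty.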
Features: Recall from §3 that our features include non-stationary zeroth-order features $f^{(1)}$ as well as the stationary features based on $W$. For $f^{(1)}(x, t, y_t)$ we consider the following language-agnostic properties of $(x, t)$:
• The identities of the tokens $x_{t-3}, \ldots, x_{t+3}$, and the token bigrams $(x_{t+1}, x_t)$, $(x_t, x_{t-1})$, $(x_{t-1}, x_{t+1})$. We use special boundary symbols for tokens at positions beyond the start or end of the sentence.
• Prefixes and suffixes of $x_t$, up to 4 characters long, that occur ≥ 5 times in the training data.
• Indicators for whether $x_t$ is all caps, is lowercase, or has a digit.
• Word shape of $x_t$, which maps each character of the token to a character class (uppercase, lowercase, number) and leaves punctuation unmodified (e.g., VoCRF-like ⇒ AaAAA-aaaa, \$5,432.10 ⇒ \$8,888.88).
For efficiency, we hash these properties into $2^{22}$ bins. The $f^{(1)}$ features are obtained by conjoining these bins with $y_t$ (Weinberger et al., 2009): e.g., there is a feature that returns 0 unless $y_t$ = NOUN, in which case it counts the number of bin 1234567's properties that $(x, t)$ has. (The $f^{(2)}$ features are not hashed.)
Table 1 (caption): Each row's best results are in boldface, where ties in accuracy are broken in favor of faster models. Superscript k indicates that the accuracy is significantly different from the k-CRF (paired permutation test, p < 0.05) and this superscript is in blue/red if the accuracy is higher/lower than the k-CRF. In all cases, we find a VoCRF (underlined) that is about as accurate as the 2-CRF (i.e., not significantly less accurate) and far faster, since the 2-CRF has $|\overline{W}| = 4913$. Fig. 1 plots the Pareto frontiers.
Results: Our results are presented in Fig. 1 and Table 1. We highlight two key points: (i) Across all languages we learned a tagger about as accurate as a 2-CRF, but much faster. (ii) The size of the set $W$ required is highly language-dependent. For many languages, learning a full k-CRF is wasteful; our method resolves this problem.
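Returning to the word-shape and hashing properties listed under Features, the sketch below shows one way they could be computed. The mapping to the characters A, a and 8 follows the two examples given in the text. The choice of CRC32 as the hash function and the property-string format are our assumptions, since the text does not specify them.

```python
import zlib

NUM_BINS = 2 ** 22   # number of hash bins stated in the text


def word_shape(token: str) -> str:
    """Map uppercase to 'A', lowercase to 'a', digits to '8'; keep everything else."""
    out = []
    for ch in token:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("8")
        else:
            out.append(ch)
    return "".join(out)


def hash_property(prop: str) -> int:
    """Hash a property string into one of NUM_BINS bins (CRC32 is an assumed choice)."""
    return zlib.crc32(prop.encode("utf-8")) % NUM_BINS


shape = word_shape("VoCRF-like")
bin_id = hash_property("shape=" + shape)     # hypothetical property-string format
```

Conjoining `bin_id` with the tag $y_t$ then yields one weight per (bin, tag) pair, so collisions between properties share a weight rather than causing an error.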
In each language, the fastest "good" VoCRF is substantially faster than the fastest "good" k-CRF (where "good" means statistically indistinguishable from the 2-CRF). These two systems are underlined; the underlined VoCRF systems are smaller than the underlined k-CRF systems (for the 5 languages respectively) by factors of 1.9, 6.4, 3.4, 1.9, and 2.9. In every language, we learn a VoCRF with $|\overline{W}| \leq 850$ that is not significantly worse than a 2-CRF with $|\overline{W}| = 17^3 = 4913$.
We also notice an interesting language-dependent effect, whereby certain languages require only a small number of tag strings in order to perform well. For example, Hindi has a competitive model that ignores the previous tag $y_{t-1}$ unless it is in {NOUN, VERB, ADP, PROPN}: thus the stationary features are 17 unigrams plus 4 × 17 bigrams, for a total of $|\overline{W}| = 85$. At the other extreme, the Slavic languages Slovenian and Bulgarian seem to require more expressive models over the tag space, remembering as many as 98 useful left-context histories (unigrams and bigrams) for the current tag. An interesting direction for future research would be to determine which morpho-syntactic properties of a language tend to increase the complexity of tagging.
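The Hindi count can be checked with a few lines of arithmetic. The sketch assumes, as in the factorization of the closure into histories and final tags, that the histories are the empty history plus the four remembered tags.

```python
NUM_TAGS = 17
remembered = ["NOUN", "VERB", "ADP", "PROPN"]

num_histories = 1 + len(remembered)          # empty history plus four one-tag histories
closure_size = num_histories * NUM_TAGS      # 17 unigrams plus 4 * 17 bigrams
full_2crf_size = NUM_TAGS ** 3               # every tag trigram

print(closure_size, "vs", full_2crf_size)
```

The per-position work of this Hindi model is thus a small fraction of what a full 2-CRF requires.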

Conclusion
We presented a structured sparsity approach for structure learning in VoCRFs, which achieves the accuracy of higher-order CRFs at a fraction of the runtime. Additionally, we derived an algorithm for the gradients needed to train a VoCRF that is asymptotically faster than prior work. Our method provides an effective speed-accuracy tradeoff for POS tagging across five languages, confirming that significant speedups are possible with little to no loss in accuracy.