Learning the Structure of Variable-Order CRFs: a finite-state perspective

The computational complexity of linear-chain Conditional Random Fields (CRFs) makes it difficult to deal with very large label sets and long-range dependencies. Such situations are not rare and arise when dealing with morphologically rich languages or joint labelling tasks. We extend recent proposals to consider variable-order CRFs. Using an effective finite-state representation of variable-length dependencies, we propose new ways to perform feature selection at large scale and report experimental results in which we outperform strong baselines on a tagging task.


Introduction
Conditional Random Fields (CRFs) (Lafferty et al., 2001) are a method of choice for many sequence labelling tasks such as Part-of-Speech (PoS) tagging, Text Chunking, or Named Entity Recognition. Linear-chain CRFs are easy to train by solving a convex optimization problem, can accommodate rich feature patterns, and enjoy polynomial-time exact inference procedures. They also deliver state-of-the-art performance for many tasks, sometimes surpassing seq2seq neural models (Schnober et al., 2016).
A major issue with CRFs is the complexity of the training and inference procedures, which are quadratic in the number of possible output labels for first-order models and grow exponentially when higher-order dependencies are considered. This is problematic for tasks such as precise PoS tagging for Morphologically Rich Languages (MRLs), where the number of morphosyntactic labels is in the thousands (Hajič, 2000; Müller et al., 2013). Large label sets also naturally arise when joint labelling tasks (e.g., simultaneous PoS tagging and text chunking) are considered. For such tasks, processing first-order models is demanding, and full-size higher-order models are out of the question. Attempts to overcome this difficulty are based on a greedy approach which starts with first-order dependencies between labels and iteratively increases the scope of dependency patterns, under the constraint that a high-order dependency is selected only if it extends an existing lower-order feature (Müller et al., 2013). As a result, feature selection may retain only a few higher-order features, motivating the need for an effective variable-order CRF (VoCRF) training procedure (Ye et al., 2009). The latest implementation of this idea (Vieira et al., 2016) relies on (structured) sparsity-promoting regularization (Martins et al., 2011) and on finite-state techniques, handling high-order features at a small extra cost (see § 2). In this approach, the sparse set of label dependency patterns is represented in a finite-state automaton, which arises as the result of the feature selection process. In this paper, we in a sense reverse the perspective and consider VoCRF training primarily as an automaton inference problem. This leads us to consider alternative techniques for learning the finite-state machine representing the dependency structure of sparse VoCRFs (see § 3).
Two lines of enquiry are explored: (a) taking into account the internal structure of large tag sets in order to learn better and/or leaner feature sets; (b) detecting unconditional structural dependencies in label sequences in order to speed up the discovery of useful features. These ideas are implemented in six feature selection strategies, allowing us to explore a large set of dependency structures. Relying on lazy finite-state operations, we train VoCRFs up to order 5 and achieve PoS tagging performance that surpasses strong baselines for two MRLs (see § 4).

Variable-order CRFs
In this section, we recall the basics of CRFs and VoCRFs and introduce some notation.

Basics
First-order CRFs use the following model:

p_θ(y|x) = (1 / Z_θ(x)) exp( Σ_j θ_j F_j(x, y) ),

where x = (x_1, ..., x_T) and y = (y_1, ..., y_T) are the input (in X^T) and output (in Y^T) sequences and Z_θ(x) is a normalizer. Each component F_j(x, y) of the global feature vector decomposes as a sum of local features Σ_{t=1}^{T} f_j(y_{t−1}, y_t, x_t) and is associated with the parameter θ_j. Local features typically use binary tests and take the form

f_j(y_{t−1}, y_t, x_t) = I(y_{t−1} = y' ∧ y_t = y) · g(x, t),

where I(·) is an indicator function and g(·) tests a local property of x around x_t. In this setting, the number of parameters is |Y|² × |X|_train, where |A| denotes the cardinality of A and |X|_train is the number of distinct values of g(x, t) observed in the training set. Even in moderate-size applications, the parameter set can be very large and contain dozens of millions of features, owing to the introduction of sequential dependencies in the model.
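To make the feature template concrete, here is a minimal sketch (plain Python, with names of our own choosing) of such a binary local feature, conjoining a label-bigram test with an observation test g:

```python
def make_local_feature(prev_label, label, obs_test):
    """Binary local feature f(y_{t-1}, y_t, x_t): fires when the label
    bigram matches AND the observation test g(x, t) holds."""
    def f(y_prev, y, x, t):
        return 1.0 if (y_prev == prev_label and y == label
                       and obs_test(x, t)) else 0.0
    return f

# hypothetical observation test: the current word is capitalized
is_cap = lambda x, t: x[t][:1].isupper()
f = make_local_feature("DET", "NOUN", is_cap)
```

One such feature is instantiated for every label pair and every observed value of g, which is where the |Y|² × |X|_train parameter count comes from.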
Estimation is based on the minimization of the negated conditional log-likelihood ℓ(θ). Optimizing this objective requires computing its gradient, and hence repeatedly evaluating the conditional expectation of the feature vector. This can be done with a forward-backward algorithm whose complexity grows quadratically with |Y|. ℓ(θ) is usually complemented with a regularization term so as to avoid overfitting and stabilize the optimization. Common regularizers use the ℓ1- or the ℓ2-norm of the parameter vector, the former having the benefit of promoting sparsity, thereby performing automatic feature selection (Tibshirani, 1996).
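The quadratic cost in |Y| can be seen in a small sketch of the forward pass (plain Python, not the implementation used here; `score` stands for the sum of weighted local features θ·f at one position):

```python
import math

def log_partition(T, labels, score):
    """Forward algorithm for log Z_theta(x): score(y_prev, y, t) is the
    local log-potential; y_prev is None at t = 0.  Each step loops over
    all (y_prev, y) pairs, hence the |Y|^2 cost per position."""
    alpha = {y: score(None, y, 0) for y in labels}      # forward values at t=0
    for t in range(1, T):
        alpha = {y: math.log(sum(math.exp(alpha[yp] + score(yp, y, t))
                                 for yp in labels))
                 for y in labels}
    return math.log(sum(math.exp(a) for a in alpha.values()))
```

Summing exp(α) over end states recovers the normalizer that a brute-force sum over all |Y|^T label sequences would compute.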

Variable-order CRFs (VoCRFs)
When the label set is large, many pairs of labels never occur in the training data, and the sparsity of label n-grams quickly increases with the order p of the model. In the variable-order CRF model, it is assumed that only a small number of n-grams (out of |Y|^p) are associated with a non-zero parameter value. Denoting by W the set of such n-grams, a generic feature function for w ∈ W is then

f_{w,g}(y_{t−|w|+1} ... y_t, x) = I(y_{t−|w|+1} ... y_t = w) · g(x, t).

In (order-p) VoCRFs, the computational cost of training and inference is proportional to the size of a finite-state automaton A[W] encoding the patterns in W, which can be much smaller than |Y|^p. Our procedure for building A[W] is sketched in Algorithm 1, where TrieInsert inserts a string in a trie, Pref(W) computes the set of prefixes of the strings in W, LgSuff(v, U) returns the longest suffix of v in U, and FailureTrans is a special ε-transition used only when no labelled transition exists (Allauzen et al., 2003). Each state (or pattern prefix) v in A[W] is associated with a set of feature functions {f_{u,g}, ∀u ∈ Suff(v), ∀g}. The forward step of the gradient computation maintains one value α(v, t) per state and time step, which is recursively accumulated over all paths ending in v at time t.
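The construction can be sketched as follows (a plain-Python rendering of Algorithm 1 under our reading of it; failure transitions are stored in a map and, as in the text, would be taken only when no labelled transition exists):

```python
def build_pattern_automaton(W):
    """Build A[W]: states are the prefixes of patterns in W (a trie),
    labelled transitions extend a prefix by one label, and each state's
    failure transition points to its longest suffix that is also a state."""
    states = {()}
    for w in W:
        for i in range(1, len(w) + 1):      # TrieInsert: add all of Pref(W)
            states.add(tuple(w[:i]))
    trans = {}                              # (state, label) -> state
    for p in states:
        if p:
            trans[(p[:-1], p[-1])] = p
    failure = {}
    for p in states:
        if not p:
            continue
        for k in range(1, len(p) + 1):      # LgSuff(p, states)
            if p[k:] in states:
                failure[p] = p[k:]
                break
    return states, trans, failure
```

The automaton has |Pref(W)| states, which is what the training and inference cost is proportional to.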
The next question is how to identify W. The simplest method keeps all the n-grams observed in training, additionally filtering out rare patterns (Cuong et al., 2014). However, frequency-based feature selection does not take feature interactions into account and is not the best solution. Ideally, one would like to train a complete order-p model with a sparsity-promoting penalty, a technique that only works for small label sets. The greedy algorithm of Schmidt and Murphy (2010) and Vieira et al. (2016) is more scalable: it starts with all unigram patterns and iteratively grows W by extending the n-grams that have been selected in the simpler model. At each round of training, feature selection is performed using an ℓ1 penalty and identifies the patterns that will be further extended.

Learning patterns
We now introduce several alternatives for learning W. Our motivation for doing so is twofold: (a) to take the internal structure of large label sets into account; (b) to identify more abstract patterns in label sequences, possibly containing gaps or iterations, which could yield a smaller A[W]. As discussed below, both motivations can be combined.

Greedy ℓ1
The greedy strategy iteratively grows patterns up to order p. Considering all possible unigram and bigram patterns, we train a sparse model to select a first set of useful bigrams. In subsequent iterations, each pattern w selected at order k is extended in all possible ways to specify the pattern set at order k + 1, which will be filtered during the next training round. This approach is close to, yet simpler than, the group-lasso approach of Vieira et al. (2016), and experimentally yields slightly smaller pattern sets (see Table 2). This is because we do not enforce closure under last-character replacement: once pattern w is pruned, longer patterns ending in w are never considered (cf. the discussion in Vieira et al., 2016, § 4).
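One round of this strategy can be sketched as follows (hypothetical helper names; a pattern covers positions y_{t−k} ... y_t, so growing it to order k + 1 adds a label at the older end):

```python
def extend_patterns(selected, labels):
    """One greedy round: every surviving order-k pattern is extended by
    every label at its older end to form the order-(k+1) candidate set."""
    return {(y,) + w for w in selected for y in labels}

def prune(candidates, weights):
    """Keep only patterns whose l1-regularized weight survived training;
    pruned patterns are never extended again."""
    return {w for w in candidates if abs(weights.get(w, 0.0)) > 0.0}
```

Because extension only applies to survivors, a pruned pattern w silently removes all longer patterns ending in w from consideration, which is the non-closure property noted above.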

Component-wise training
Large tag sets often occur in joint tasks, where multiple levels of information are encoded in one compound tag. For instance, the fine-grained labels in the Tiger corpus (Brants et al., 2002) combine PoS and morphological information in tags such as NN.Dat.Sg.Fem for a feminine singular dative noun. In the sequel, we refer to each piece of information as a tag component. We assume that all tags contain the same components, using a "non-applicable" value whenever needed. Using features that test arbitrary combinations of tag components would make feature selection much more difficult, as the number of possible patterns grows combinatorially with the number of components. We keep things simple by allowing features to evaluate only one single component at a time: this allows us to identify dependencies of different orders for each component.
Assuming that each tag y contains K components y = [z_1, z_2, ..., z_K], with z_k ∈ Y_k, W is then computed as in § 3.1, except that we now consider one distinct set of patterns W_k for each component k. At each training round, each set W_k is extended and pruned independently from the others. Note that all these automata are trained simultaneously using a common set of features. This process results in K automata, which are intersected on the fly using "lazy" composition. In our experiments, we also consider the case where we additionally combine these with the automaton representing complete tag sequences: this has the beneficial effect of restricting the combinations of subtags to values that actually exist in the data.
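The lazy intersection can be sketched as a memoized product construction (a simplification of lazy automaton composition; names are ours): a joint state is the tuple of the K component states, and a transition on a compound tag is computed only when first requested.

```python
from functools import lru_cache

def lazy_intersection(step_fns):
    """On-the-fly product of K component automata.  Each step_fn(state, tag)
    returns that component's next state, or None if the move is disallowed.
    Joint transitions are computed lazily and memoized, so unreachable
    joint states are never materialized."""
    @lru_cache(maxsize=None)
    def step(joint_state, tag):
        nxt = tuple(fn(s, tag) for fn, s in zip(step_fns, joint_state))
        return None if None in nxt else nxt
    return step
```

Only the joint states actually visited during forward-backward are ever built, which is what keeps the product of K automata tractable.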

Pruned language models
Another approach for computing W assumes that useful dependencies between tags can be identified using an auxiliary language model (LM) trained without paying any attention to the observation sequences. A pattern w will then be deemed useful for the labelling task only if w is a useful history in an LM of tag sequences. This strategy is implemented by first training a compact p-gram LM with entropy pruning (Stolcke, 1998) and including all the surviving histories in W. In a second step, we train the complete CRF as usual, with all observation features and an ℓ1 penalty to further prune the parameter set.
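As an illustration, a simplified history-selection criterion in this spirit (not Stolcke's exact pruning formula) keeps a history when conditioning on it, rather than on its shortened backoff history, moves the predictive distribution by enough weighted KL divergence:

```python
from collections import Counter, defaultdict
import math

def pruned_histories(seqs, order, threshold):
    """Keep a tag history h (length 1 to order-1) when predicting the next
    tag from h, rather than from its backoff h[1:], changes the conditional
    distribution by more than `threshold` bits of count-weighted KL."""
    nxt = defaultdict(Counter)              # history -> next-tag counts
    for s in seqs:
        for t in range(len(s)):
            for n in range(order):          # histories of length 0..order-1
                if t - n >= 0:
                    nxt[tuple(s[t - n:t])][s[t]] += 1
    total = sum(nxt[()].values())
    keep = set()
    for h, counts in nxt.items():
        if not h:
            continue
        back, ch = nxt[h[1:]], sum(counts.values())
        cb = sum(back.values())
        d = sum((c / ch) * math.log2((c / ch) / (back[y] / cb))
                for y, c in counts.items())
        if (ch / total) * d > threshold:    # weighted divergence test
            keep.add(h)
    return keep
```

Histories whose conditional distribution matches their backoff are dropped, mirroring how entropy pruning discards n-grams that barely change the LM's perplexity.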

Maximum entropy language models
Another technique, which combines the two previous ideas, relies on Maximum Entropy LMs (MELMs) (Rosenfeld, 1996). MELMs decompose the probability of a sequence y_1 ... y_T using the chain rule, where each term p_λ(y_t | y_<t) is a locally normalized exponential model including all possible n-gram features up to order p:

p_λ(y_t | y_<t) ∝ exp( Σ_w λ_w I(y_{t−|w|+1} ... y_t = w) ).

In contrast to globally normalized models, the complexity of training remains linear wrt. |Y|, irrespective of p. It is also straightforward both to (a) use an ℓ1 penalty to perform feature selection, and (b) include features that only test specific components of a complex tag. For an order-p model, our feature functions evaluate all n-grams (for n ≤ p) of complete tags or of one specific component. Once a first round of feature selection has been performed, we compute A[W] as explained above. The last step of training reintroduces the observations and estimates the CRF parameters. A variant of this approach adds extra gappy features to the n-gram features. Gappy features at order p test whether some label u occurs in the remote past, anywhere between positions t − p + 1 and t − n. They take the following form:

G_{w,u}(y_1, ..., y_t) = I(y_{t−n+1} ... y_t = w ∧ u ∈ {y_{t−p+1}, ..., y_{t−n}}),

and likewise for features testing components.
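The gappy feature test can be transcribed directly from its definition (0-indexed Python lists standing in for y_1 ... y_t; names are ours):

```python
def gappy_feature(w, u, p):
    """Gappy feature G_{w,u}: fires when the last n labels equal the pattern
    w AND label u occurs in the gap between positions t-p+1 and t-n."""
    n = len(w)
    def G(y, t):                            # y is the label sequence up to t
        if t + 1 < p or tuple(y[t - n + 1:t + 1]) != tuple(w):
            return 0.0
        return 1.0 if u in y[t - p + 1:t - n + 1] else 0.0
    return G
```

Such a feature reaches back p positions while only pinning down the last n labels, so it captures long-distance dependencies without the |Y|^p blow-up of a full order-p pattern.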

Training protocol
The following protocol is used throughout: (a) identify W (§ 3); note that this may involve tuning a regularization parameter; (b) train a full model (including tests on the observations for each pattern in W) using ℓ1 regularization and a very small ℓ2 term to stabilize convergence. The best regularization in (a) and (b) is selected on development data and targets either perplexity (for LMs) or label accuracy (for CRFs).

Datasets and Features
Experiments are run on two MRLs: for Czech, we use the CoNLL 2009 data set (Hajič et al., 2009), and for German, the Tiger Treebank with the split of Fraser et al. (2013). Both datasets include rich morphological attributes (cf. Table 1). All the patterns in W are combined with lexical features testing the current word x_t, its prefixes and suffixes of length 1 to 4, its capitalization, and the presence of digits or punctuation symbols. Additional contextual features also test words in a local window around position t. These tests greatly increase the feature count and are not provided for all label patterns: for unigram patterns, we test the presence of all unigrams and bigrams of words in a window of 5 words; for bigram patterns, we only test all unigrams in a window of 3 words. Contextual features are not used for larger patterns.
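The word-level tests can be sketched as follows (feature names are ours; in the model, these tests are conjoined with the label patterns in W):

```python
def lexical_features(x, t):
    """Observation tests for position t: word identity, capitalization,
    digit/punctuation flags, and prefixes/suffixes of length 1 to 4."""
    w = x[t]
    feats = [f"word={w}",
             f"cap={w[:1].isupper()}",
             f"has_digit={any(c.isdigit() for c in w)}",
             f"has_punct={any(not c.isalnum() for c in w)}"]
    for k in range(1, 5):                   # affixes of length 1..4
        feats.append(f"pref{k}={w[:k]}")
        feats.append(f"suf{k}={w[-k:]}")
    return feats
```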

Results
We consider several baselines: Maxent and MEMM models, neither of which considers label dependencies in training; a linear-chain CRF, using the implementation of Lavergne et al. (2010); and our own implementation of the group lasso of Vieira et al. (2016). For the latter, we contrast two setups: one where each pattern in W gives rise to one single feature, and one where it is conjoined with tests on the observations, as suggested by the authors themselves in their fn. 4. All scores in Table 2 are label accuracies on unseen test data.
As expected, Maxent and MEMM are outperformed by almost all variants of CRFs, and their scores are only reported for completeness. Group-lasso results demonstrate the effectiveness of using contextual information with high-order features: the gain is ≈ 0.7 points for both languages and all values of p. Greedy ℓ1 achieves accuracy results similar to the group lasso, suggesting that an ℓ1 penalty alone is effective at selecting high-order features. It also yields slightly smaller models and very comparable training time across the board: indeed, greedy parameter selection strategies imply multiple rounds of training, which are overall quite costly due to the size of the full label set. Testing individual subtags (§ 3.2) results in a slight improvement (≈ +0.3) in accuracy over Greedy ℓ1. When using an additional automaton for the full tag, we get a larger gain of ≈ 0.6 points for Czech, slightly less for German: including a model for complete tags also prevents the generation of invalid combinations of subtags. These models represent different trade-offs between accuracy and training time: the 4-gram Component-wise experiment only took 14 hrs to complete on the German data and outperforms the corresponding Greedy ℓ1 setup while containing approximately 100 times fewer features. Component-wise+Full is more comparable in size and training time to Greedy ℓ1, but yields a larger improvement in performance. The last set of experiments, with LMs, yields even better operating points, as the first stage of pattern selection is performed with a cheap model. They are our best trade-off to date, yielding the best performance for all values of p.

Conclusion
In this work, we have explored ways to take advantage of the flexibility offered by implementations of VoCRFs based on finite-state techniques. We have proposed strategies to include tests on subparts of complex tags, as well as to select useful label patterns with auxiliary unconditional LMs. Experiments on two MRLs with large tag sets yielded consistent improvements (≈ +0.8 points) over strong baselines. These strategies offer new perspectives on feature selection in high-order CRFs. In future work, we intend to explore how to complement ℓ1 penalties with terms that penalize processing time more explicitly; we also wish to study how these ideas can be used in combination with neural models.