Transducer Disambiguation with Sparse Topological Features

We describe a simple and efficient algorithm to disambiguate non-functional weighted finite-state transducers (WFSTs), i.e. to generate a new WFST that contains a unique, best-scoring path for each input string, along with its best output labels. The algorithm uses topological features combined with a tropical sparse tuple vector semiring. We empirically show that our algorithm is more efficient than previous work on a PoS-tagging disambiguation task. We use our method to rescore very large translation lattices with a bilingual neural network language model, obtaining gains in line with the literature.


Introduction
Weighted finite-state transducers (WFSTs), or lattices, are used in speech and language processing to compactly represent and manipulate large numbers of strings. Applying a finite-state operation (e.g. PoS tagging) to a lattice via composition produces a WFST that maps input strings (e.g. words) onto output strings (e.g. PoS tags) and preserves the arc-level alignment between each input and output symbol (e.g. each arc is labeled with a word-tag pair and has a weight). Typically, the result of such an operation is a WFST that is ambiguous, because it contains multiple paths with the same input string, and non-functional, because it contains multiple output strings for a given input string (Mohri, 2009).
Disambiguating such WFSTs is the task of creating a WFST that encodes only the best-scoring path of each input string, while still maintaining the arc-level mapping between input and output symbols. This is a non-trivial task (unless one enumerates all possible input strings in the lattice, searches for the best output string for each, and converts the resulting sequences back into a WFST, which is clearly inefficient), and so far only one algorithm has been described. Its main steps are: (a) map the WFST into an equivalent weighted finite-state automaton (WFSA) using weights that contain both the WFST weight and the output symbols (using a special semiring); (b) apply WFSA determinization under this semiring to ensure that only one unique path per input string survives; (c) expand the result back into a WFST that preserves arc-level alignments. We present a new disambiguation algorithm that accomplishes this efficiently. In Section 2 we describe how the tropical sparse tuple vector semiring can keep track of individual arcs in the original WFST as topological features during the mapping step (a). This allows us to describe in Section 3 an efficient expansion algorithm for step (c). In Section 4 we show empirical evidence that our algorithm is more efficient than previous work on the same PoS-tagging task. We also show how our method can be applied to rescoring translation lattices under a bilingual neural network model (Devlin et al., 2014), obtaining BLEU score gains consistent with the literature. Section 5 reviews related work and concludes.

Semiring Definitions
A WFST T = (Σ, ∆, Q, I, F, E, ρ) over a semiring (K, ⊕, ⊗, 0̄, 1̄) has input and output alphabets Σ and ∆, a set of states Q, an initial state I ∈ Q, a set of final states F ⊆ Q, a set of transitions (edges) E ⊆ (Q × Σ × ∆ × K × Q), and a final state function ρ : F → K. We focus on extensions to the tropical semiring (R ∪ {±∞}, min, +, +∞, 0).
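As a concrete reference, the tuple definition above can be sketched in plain Python. The names Edge and Wfst are our own illustrative choices, not part of any described implementation:

```python
from dataclasses import dataclass

# Illustrative sketch of the WFST definition above.
@dataclass(frozen=True)
class Edge:
    prev: int       # source state p[e] in Q
    ilabel: str     # input symbol in Sigma
    olabel: str     # output symbol in Delta
    weight: float   # weight in K (here: tropical)
    next: int       # destination state n[e] in Q

@dataclass
class Wfst:
    start: int      # initial state I
    finals: dict    # final state -> rho(state)
    edges: list     # list of Edge

# Tropical semiring (min, +) with identities (+inf, 0):
def t_plus(a: float, b: float) -> float:
    return min(a, b)

def t_times(a: float, b: float) -> float:
    return a + b
```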

Tropical Sparse Vector Semiring
Let f̄[e_i] = f̄_i ∈ R^N be the unweighted feature vector associated with edge e_i, and let ᾱ ∈ R^N be a global feature weight vector. The tropical weight is then found as w_i = ᾱ · f̄_i. Given a fixed ᾱ, we define the operators for the tropical vector semiring as:

f̄_i ⊕_ᾱ f̄_j = f̄_i if ᾱ · f̄_i ≤ ᾱ · f̄_j, and f̄_j otherwise
f̄_i ⊗ f̄_j = f̄_i + f̄_j

The tropical weights are maintained correctly by the vector semiring, as ᾱ · (f̄_i ⊕_ᾱ f̄_j) = w_i ⊕ w_j and ᾱ · (f̄_i ⊗ f̄_j) = w_i ⊗ w_j. Finally, we define the element-wise times operator as:

(f̄_i ∗ f̄_j)_n = f_{i,n} · f_{j,n}

When dealing with high-dimensional feature vectors that have few non-zero elements, it is convenient in practice (for computational efficiency) to use a sparse representation for vectors, storing only (index, value) pairs. The semiring that operates on sparse feature vectors, which we call the tropical sparse tuple vector semiring, uses conceptually identical operators to the non-sparse version defined above, so it also maintains the tropical weights w correctly.
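A minimal sketch of these operators over sparse vectors, represented here as {index: value} dicts (the dict representation and the function names are our own assumptions, not the paper's implementation):

```python
# Sparse vectors as {index: value} dicts; alpha is the global weight vector.
def dot(alpha, f):
    """Tropical weight w = alpha . f of a sparse feature vector."""
    return sum(alpha.get(i, 0.0) * v for i, v in f.items())

def splus(alpha, f_i, f_j):
    """Tropical-vector plus: keep the vector with the smaller weight."""
    return f_i if dot(alpha, f_i) <= dot(alpha, f_j) else f_j

def stimes(f_i, f_j):
    """Tropical-vector times: element-wise vector addition."""
    out = dict(f_i)
    for i, v in f_j.items():
        out[i] = out.get(i, 0.0) + v
        if out[i] == 0.0:
            del out[i]   # cancelled features drop out, keeping it sparse
    return out
```

Note that under stimes a feature (k, 1) is cancelled by a later (k, −1), which is exactly the behaviour exploited by the expansion algorithm of Section 3.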

A Disambiguation Algorithm
We now describe how we use the semiring described in Section 2 for steps (a) and (b), and describe an expansion algorithm for step (c) that efficiently converts the output of determinization into an unambiguous WFST with arc-level alignments.

WFSA with Sparse Topological Features
Let T be a tropical-weight WFST with K edges. T is topologically sorted so that if an edge e_k precedes an edge e_{k'} on a path, then k < k'. We now use tropical sparse tuple vector weights to create a WFSA A that maintains (in its weights) pointers to specific edges in T. These 'pointers' are sparse topological features. (We implement this semiring as an extension to the sparse tuple weights of the OpenFst library (Allauzen et al., 2007).)
For each edge e_k = (p_k, i_k, o_k, w_k, n_k) of T, we create an edge e'_k = (p_k, i_k, i_k, f̄_k, n_k) in A, where f̄_k = [w_k, 0, . . . , 0, 1, 0, . . . , 0]; the 1 is in the k-th position. In other words, f_{k,0} is the tropical weight of the k-th edge in T, and f_{k,k} = 1 indicates that this tropical weight belongs to edge k in T. In sparse notation, f̄_k = [(0, w_k), (k, 1)]. In this way, the non-deterministic transducer T of our running example (figure omitted) is mapped to an acceptor A with topological features. Given ᾱ = [1, 0, . . . , 0], operations on A yield the same path weights as in the usual tropical semiring.
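The mapping from T to A can be sketched as follows, using edge tuples (p, i, o, w, n) and 1-based edge indexing as in the text (the function name and sparse-dict weights are illustrative assumptions):

```python
def to_acceptor(t_edges):
    """Map WFST edges (p, i, o, w, n) to WFSA arcs whose sparse
    topological-feature weights are f_k = [(0, w_k), (k, 1)]."""
    arcs = []
    for k, (p, i, o, w, n) in enumerate(t_edges, start=1):
        f_k = {0: w, k: 1.0}             # component 0: tropical weight;
        arcs.append((p, i, i, f_k, n))   # component k: pointer to edge k in T;
    return arcs                          # input label kept on both tapes
```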

WFSA determinization
We now apply the standard determinization algorithm to A, which yields A_D (figure omitted). A_D accepts only one best-scoring path for each input string, and its weights 'point' to the sequence of arcs traversed by the relevant path in T. In turn, this reveals the best output string for the given input string. For example, the path-level features associated with 'a b' are [(0, 3), (1, 1), (3, 1)], indicating a path π = e_1 e_3 with tropical weight 3 through T (and hence output string 'AB').
The topology of A_D is compact because multiple input strings may share arcs while still encoding different output strings in their weights. This is achieved by 'cancelling' topological features on subsequent arcs and 'replacing' them with new ones as one traverses the path. For example, the string 'a c' initially has feature (1, 1), but this gets cancelled later in the path by (1, −1) and replaced by [(2, 1), (5, 1)], indicating a path π = e_2 e_5 through T with output string 'ZD'.
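Reading off a path from such determinized weights can be sketched like this, assuming the sparse-dict representation used above (the function name is illustrative):

```python
def decode_path(features, t_edges):
    """Given path-level features {0: w, k: 1, ...}, recover the tropical
    weight, the edge indices of the path through T, and the output string.
    t_edges are the original (p, i, o, w, n) tuples of T, 1-indexed."""
    w = features.get(0, 0.0)
    ks = sorted(k for k, v in features.items() if k != 0 and v != 0)
    outs = ''.join(t_edges[k - 1][2] for k in ks)   # o_k of each edge
    return w, ks, outs
```

On the running example's features [(0, 3), (1, 1), (3, 1)], this recovers the path e_1 e_3, its tropical weight 3, and the output string 'AB'.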

Expansion Algorithm
We now describe an expansion algorithm to convert A_D into an unambiguous WFST T' that maintains the arc-level input-output alignments of the original transducer T. In our example, T' should be identical to T except for edge 4, which is removed.
Due to the WFSA determinization algorithm, we observe empirically that the cancelling features in A_D tend to appear in a path after the features they cancel. This allows us to define an algorithm that traverses A_D in reverse (from its final states to its initial state) and creates an equivalent acceptor with the topology of T'.
The algorithm is described in Figure 1. It performs a forward pass through A_r (the reverse of A_D). The intuition is that, for each arc, we create a new arc where we 'pop' the highest topological feature (as it will not be cancelled later) and its tropical weight. The new states encode the original state q and the residual features that have not been 'popped' yet. For each edge e ∈ E(q), the auxiliary POPTFEA(f̄, e) returns a (w', t', f̄') tuple, where t' is the index of the highest topological feature of f̄ ∗ f̄[e], w' is the tropical weight associated with that feature, and f̄' is the residual; if f̄ ∗ f̄[e] has only one topological feature, the residual is 0̄. The residual in all final states of B_r will be 0̄ (no topological features remain to be popped).
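Under the same sparse-dict representation, POPTFEA might be sketched as follows. This is an illustration under our own assumptions, in particular that the original edge weights of T are available to look up the popped arc's tropical weight:

```python
def pop_tfea(f_res, f_edge, t_weights):
    """Pop the highest topological feature of f_res (*) f_edge.
    t_weights[k] is the tropical weight of edge k in T (assumption).
    Returns (w', t', residual)."""
    f = dict(f_res)
    for k, v in f_edge.items():           # element-wise times (*)
        f[k] = f.get(k, 0.0) + v
        if f[k] == 0.0:
            del f[k]                      # cancellation
    t = max(k for k in f if k != 0)       # highest feature: safe to pop
    del f[t]
    f.pop(0, None)                        # tropical component leaves with the pop
    return t_weights[t], t, f
```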
Graphically, in our running example, A_r is as shown (figure omitted). Reversing B_r yields an acceptor B (still in the sparse tuple vector semiring) which has the same topology as our goal T' and can be trivially mapped to T' in linear time: each arc takes its tropical weight via ᾱ and has only one topological feature, which points to the arc in T containing the required output symbol.

Two-pass Expansion
As mentioned earlier, our algorithm relies on 'cancelling' topological features appearing after the features they cancel in a given path. In general, consider a weighted transducer T and its equivalent automaton A with sparse topological features, as described here. Let A^p be the result of applying standard WFST operations, such as determinization, minimization or shortest path, and assume as a final step that the weights have been pushed towards the final states. It is worth noting the following property: two topological features in a path accepted by A will never get reordered in A^p, although they can appear together on the same edge, as shown in our running example. Indeed, if A^p contained only one single path, all the topological features would appear on the final state.
Let us define a function d A (e) as the minimum number of edges on any path in A from the start state to n[e] through edge e.
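This distance can be computed with a breadth-first search; the following is a sketch in which states are integers and arcs are (source, destination) pairs (representation and names are our own assumptions):

```python
from collections import deque

def d_A(arcs, start, edge):
    """Minimum number of edges on any path from `start` to n[edge]
    through `edge`: the shortest arc-distance to p[edge], plus one."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for p, n in arcs:
            if p == s and n not in dist:
                dist[n] = dist[s] + 1
                queue.append(n)
    p, n = edge
    return dist[p] + 1
```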
Consider all edges e_i in A and e^p in A^p with f[e_i]_i = 1 and f[e^p]_i ≠ 0, i.e. we are interested in the topological feature contribution on e^p due to the edge e_i in A. If d_A(e_i) ≤ d_{A^p}(e^p) is always satisfied, then EXPANDTFEA will yield the correct answer, because the residual at each state, together with the weight of the current edge, contains all the necessary information to pop the next correct topological feature.
However, many deterministic WFSAs will not exhibit this behaviour (e.g. minimised WFSAs), even after pushing the weights towards the final states. Consider, for example, an acceptor A (figure omitted) where d_A(e_3) = d_A(e_4) = 2 and d_{A^p}(e^p_2) = 1: the distance test fails for both topological features (3, −1) and (4, 1). Running B_r = EXPANDTFEA((A^p)_r) will not cancel feature (3, 1) along the path 'z x' and will pop (4, 1) instead, storing the remaining non-0̄ residual in a final state of B_r.
As mentioned before, two topological features along the same path in A will not reorder in A^p. In this example, as (4, 1) appears on edge e^p_2, feature (2, 1) must also appear on this edge (or on an earlier edge, in a more complicated machine). In general, any remaining topological features along the path back to the start state of A^p will all be popped after their correct edges in B_r. All edges in B_r pass the distance test compared to A_r, the reversed form of A: for all edges e_i with f[e_i]_i = 1 in A_r and e_q in B_r such that f[e_q]_i ≠ 0, we have d_{A_r}(e_i) ≤ d_{B_r}(e_q). Edges in these machines are now reverse sorted, i.e. if an edge e_k precedes an edge e_{k'} on a path, then k' < k.
We can therefore perform a second pass with the same algorithm over B, with the only minor modification that t' is now the index of the lowest topological feature of f̄ ∗ f̄[e]. This expands the acceptor correctly. Because correct expansions yield 0̄ residuals on the final states, the algorithm can be trivially modified to trigger the second pass automatically whenever the residual on any final state is not 0̄.

Experiments
We evaluate our algorithm, henceforth called topological, in two ways: we empirically contrast disambiguation times against previous work, and then apply it to rescore translation lattices with bilingual neural network models.

PoS Transducer Disambiguation
We apply our algorithm to the 4,664 PoS-tagged lattices from the NIST English CTS RT Dev04 set used by Sproat et al. (2014); these were generated with a speech recognizer similar to (Soltau et al., 2005) and tagged with a WFST-based HMM tagger. The average number of states is 493. We contrast with the lexicographic tropical categorial semiring implementation of previous work, henceforth referred to as the categorial method. Figure 2 (left) shows the number of disambiguated WFSTs as processing time increases. The topological algorithm proves much faster (and we observe no differences in memory footprint). In 50 ms it disambiguates 3,540 transducers, as opposed to the 2,771 completed by the categorial procedure; the slowest WFST to disambiguate takes 230 seconds with the categorial procedure and 60 seconds with our method. Using sparse topological features with our semiring disambiguates WFSTs faster in 99.8% of cases.

Neural Network Bilingual Rescoring
We use the disambiguation algorithm to apply the bilingual neural network language model (BiLM) of Devlin et al. (2014) to the output lattices of the CUED OpenMT12 Arabic-English hierarchical phrase-based translation system, built using HiFST (de Gispert et al., 2010). We use a development set, mt0205tune (2,075 sentences), and a validation set, mt0205test (2,040 sentences), drawn from the NIST MT02 through MT05 evaluation sets.
The edges in these WFSTs are of the form t:i/w, where t is the target word, i is the source sentence position t aligns to, and w contains the translation and language model scores. HiFST outputs these WFSTs by using a standard hiero grammar (Chiang, 2007) augmented with target-side heuristic alignments, or affiliations, to the source (Devlin et al., 2014).
In a rule over source and target words, X → ⟨s_1 X s_2 s_3, t_1 X t_2⟩ / 2,1, the feature '2,1' indicates that the target word t_1 is aligned to source word s_2 and that t_2 aligns to s_1. As rules are applied in translation, this information can be used to link target words to absolute positions within the source sentence. Allowing for admissible pruning, all possible affiliation sequences under the grammar for every translation are available in the WFSTs; disambiguation keeps the best affiliation sequence for each translation hypothesis, which allows the rescoring of very large lattices with the BiLM.
This disambiguation task involves much bigger lattices than the POS-tagging task: the average number of states of the HiFST lattices is 38,200. Figure 2 (right) shows the number of mt0205tune disambiguated WFSTs over time compared to the categorial method. As with the PoS disambiguation task, the topological method is always much faster than the categorial one. After 10 seconds, our method has disambiguated 1953 lattices out of 2075, whereas the categorial method has only finished 1405. The slowest WFST to disambiguate takes 6700 seconds with the categorial procedure, which compares to 1000 seconds in our case.
The BiLM is trained with NPLM (Vaswani et al., 2013) using a context of 3 source and 4 target words. Lattice rescoring with this model requires a special variation of standard WFST composition that looks at both input and output labels on a transducer arc; we use KenLM (Heafield, 2011) to retrieve neural network scores for on-the-fly composition. We retune the parameters with lattice MERT (Macherey et al., 2008). Results are shown in Table 1.
Acknowledging the task differences with respect to (Devlin et al., 2014), we find BLEU gains consistent with rescoring results reported in their Table 5.

Conclusions and Related Work
We have described a tagging disambiguation algorithm that supports non-functional WFSTs, which cannot be handled directly by either WFST determinization (Mohri, 1997) or WFST disambiguation (Mohri, 2012). We have shown that it is faster than an implementation based on a lexicographic tropical categorial semiring, and described a use case in a practical rescoring task of an MT system with bilingual neural networks that yields a 1.0 BLEU gain. Povey et al. (2012) also use a special semiring that allows non-functional WFSTs to be mapped into WFSAs by inserting the tag into a string weight. However, in contrast to our implementation, no expansion into a WFST with aligned input/output is described.
Lexicographic semirings, used for PoS-tagging disambiguation, have also been shown to be useful in other tasks (Sproat et al., 2014), such as optimized epsilon encoding for backoff language models, and hierarchical phrase-based decoding with pushdown automata (Allauzen et al., 2014).
The tools for disambiguation and WFST composition with bilingual models, along with a tutorial to replicate Section 4.2, are all available at http://ucam-smt.github.io.