On the Compression of Lexicon Transducers

In finite-state language processing pipelines, a lexicon is often a key component. It needs to be comprehensive to ensure accuracy, reducing out-of-vocabulary misses. However, in memory-constrained environments (e.g., mobile phones), the size of the component automata must be kept small. Indeed, a delicate balance between comprehensiveness, speed, and memory must be struck to conform to device requirements while providing a good user experience. In this paper, we describe a compression scheme for lexicons when represented as finite-state transducers. We efficiently encode the graph of the transducer while storing transition labels separately. The graph encoding scheme is based on the LOUDS (Level Order Unary Degree Sequence) tree representation, which has constant time tree traversal for queries while being information-theoretically optimal in space. We find that our encoding is near the theoretical lower bound for such graphs and substantially outperforms more traditional representations in space while remaining competitive in latency benchmarks.


Introduction
Modern finite-state language processing pipelines often consist of several finite-state transducers in composition. For example, a virtual keyboard pipeline, used for decoding on mobile devices, can consist of a context dependency transducer C, a lexicon L, and an n-gram language model G (Ouyang et al., 2017). A bikey C transducer is used to encode context in gesture decoding, the lexicon transducer L maps from a character string to the corresponding word ID, and the language model G gives the a priori probability of a word sequence. A similar decomposition is often used in speech recognition decoding (Mohri et al., 1996).
These models are then composed as C ◦ L ◦ G. The application of this combined model to an input character string outputs the corresponding word string and probability. Unfortunately, in order to be accurate, these models may need to be large. This problem is aggravated when the composition is performed statically since the state space grows with the product of the input automata sizes. In practice, on-the-fly composition is often used to save space (Mohri et al., 1996; Hori et al., 2004; Caseiro and Trancoso, 2006). Additionally, it is of practical importance to have compact and efficient finite-state language model component representations.
There are a variety of compression schemes available for automata (Daciuk, 2000). These range from general compression algorithms, which do not depend on a specific underlying structure (Daciuk and van Noord, 2001; Daciuk and Weiss, 2011; Mohri et al., 2015), to schemes that try to heavily exploit specific structural properties of the inputs (Watanabe et al., 2009; Sorensen and Allauzen, 2011). Another important consideration is whether the automata can be decompressed just for a queried portion or need to be more fully decompressed. Generic compression algorithms often have relatively good compression ratios over a wide class of machines, but they sacrifice speed and space in use since they often do not admit such selective decompression. In contrast, structurally-specific compression algorithms can have an attractive balance between the compression ratio and query performance, but are limited to precise subclasses of machines. In real-time production systems, the latter method often proves more desirable since a user should not have to wait long or waste space when a query is answered.
Among the transducers mentioned above, the context-dependency transducer C can be represented implicitly (in code) and structurally-specific compression algorithms for the n-gram language model G have previously been developed (Sorensen and Allauzen, 2011). This leads us to investigate the compression of the lexicon L.
This paper is organized as follows. Section 2 introduces the formal algebraic structures and notation that we will use. Section 3 describes different representations for these algebraic structures. In Section 4, we formally define a lexicon and explore its possible representations. Section 5 develops an information-theoretic bound on the number of bits needed to encode a lexicon, Section 6 presents our encoding, and Section 7 presents experiments on the quality of that encoding. Finally, we offer concluding remarks in Section 8.

Graphs and Trees
A directed graph (or digraph) G = (V, A) has a finite set of nodes (or vertices) V and a finite set of directed arcs (or edges) A ⊆ V × V. An arc a = (p[a], n[a]) spans from a source node p[a] to a destination node n[a]. A path π is a non-empty list of consecutive arcs a_1, a_2, . . . , a_n where p[a_{i+1}] = n[a_i]. We write p[π] = p[a_1] and n[π] = n[a_n]. A cycle is a path π with p[π] = n[π]. A digraph is acyclic if it has no cycles. The out-degree of a node v ∈ V is |{w ∈ V | (v, w) ∈ A}| and the in-degree is |{w ∈ V | (w, v) ∈ A}|.
We distinguish several specific digraph cases:
• An out-tree (V, A, i) is an acyclic digraph for which the in-degree of every node is 1 except for the distinguished root node i ∈ V, which has in-degree 0. The nodes with out-degree 0 are called leaves.
• An in-tree (V, A, f) is an acyclic digraph for which the out-degree of every node is 1 except for the distinguished root node f ∈ V, which has out-degree 0. The nodes with in-degree 0 are called leaves.
• A directed bipartite digraph (V_1 ∪ V_2, A) partitions the nodes into two disjoint sets V_1 and V_2 such that every arc spans from a node in V_1 to a node in V_2.

Finite-State Transducers
A finite-state transducer T = (Σ, Γ, Q, E, i, F) has a finite input alphabet Σ, a finite output alphabet Γ, a finite set of states Q, a finite set of transitions E ⊆ Q × (Σ ∪ {ε}) × (Γ ∪ {ε}) × Q, an initial state i ∈ Q, and a set of final states F ⊆ Q. A transition e = (p[e], i[e], o[e], n[e]) ∈ E spans from a source state p[e] to a destination state n[e], reading the input label i[e] and writing the output label o[e]. The graph G(T) of the transducer is the digraph obtained by mapping each transition e to the arc (p[e], n[e]). Thus, there is a 1:1 correspondence between states and nodes but there may be multiple transitions, with different labelings, that correspond to the same digraph arc. In that case, we say the transition is a digraph multiarc. A path π = e_1, . . . , e_n, a cycle, p[π] and n[π] are defined analogously to digraphs, and we define i[π] = i[e_1] · · · i[e_n] and o[π] = o[e_1] · · · o[e_n]. P(q, q′) denotes the set of all paths in T from state q to q′. We extend this to sets in the obvious way: P(q, R) denotes the set of all paths from state q to some q′ ∈ R, and so forth. A path π is successful if it is in P(i, F), and in that case the transducer is said to accept the input string i[π] and output o[π].
A finite-state transducer is subsequential if it is input deterministic, that is, no two outgoing transitions at the same state share the same input label, and the destination state of any epsilon transition is a final state with no outgoing transitions.

Graph and Tree Representations
Basic Graph and Tree Representation. A simple digraph representation uses adjacency lists: denote the nodes V by integers from 1 to N, let a be an array indexed by the node number, and let a[q] = (q_1, . . . , q_n) be a list of the nodes {q_j ∈ V : (q, q_j) ∈ A}. An in-tree and out-tree can use this representation where a distinguished integer such as 1 or |V| is used to denote the root. A directed bipartite graph can also use this representation where it may be convenient to number the nodes in V_1 from 1 to |V_1| and V_2 from |V_1| + 1 to |V|.
Compact Tree Representation. In the case of trees, there is a particularly compact representation known as LOUDS (Level Order Unary Degree Sequence). We can quantify compactness as follows.
For a finite set with M elements, we require at least N = ⌈log M⌉ bits to uniquely encode each element. We call an encoding scheme succinct if it takes at most N + o(N) bits to encode any element uniquely.
The LOUDS tree encoding is a succinct representation of ordinal trees (where a node's children have a total ordering). Given an ordinal tree of N nodes, it encodes the tree in 2N + 1 bits, while the information-theoretic lower bound is 2N − O(log N) (Jacobson, 1989). Moreover, O(1)-time parent-child traversals can be implemented using o(N) extra bits of storage (Geary et al., 2004).
Let b be a bitstring where b[i] is the element at index i when starting from 0. Then, we define Rank_x and Select_x, where x ∈ {0, 1}, as

Rank_x(b, i) = |{j ≤ i : b[j] = x}|,
Select_x(b, j) = the smallest index i such that Rank_x(b, i) = j, i.e., the position of the j-th occurrence of x in b.

These operations can be performed in constant time using o(|b|) extra bits of space (Vigna, 2008). The LOUDS encoding is then constructed as follows. We start with the bitstring 10. Then, from the root in breadth-first order, we append 1^d 0, where d is the number of children of the current node. Here, we assume the tree is labeled in breadth-first order. Then, a node n corresponds to the n-th 1 in the bitstring (or, equivalently, the (n + 1)-th 0, which terminates the description of its children). We can find the parent or first/last child (if any) using a combination of Rank and Select queries:

Parent(n) = Rank_0(b, Select_1(b, n)),
FirstChild(n) = Rank_1(b, Select_0(b, n) + 1),
LastChild(n) = Rank_1(b, Select_0(b, n + 1) − 1),

where node n has no children (is a leaf) when b[Select_0(b, n) + 1] = 0. From these, we can retrieve the number of children of a node, the i-th child, whether or not a node is a leaf, and many other operations in a constant number of queries (Geary et al., 2004; Delpratt et al., 2006). It is known that Select and Rank can be performed in constant time in the length of the bitstring by augmenting the bitstring with o(N) additional bits of information (thus retaining any succinctness properties) (Kim et al., 2005; Vigna, 2008).
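To make the construction concrete, the following sketch implements the LOUDS bitstring and the traversal formulas above, using naive linear-time Rank and Select (the function names louds_encode, parent, and first_child are ours, not the paper's; a production implementation would use the o(N)-bit auxiliary structures cited above to make the queries O(1)):

```python
def louds_encode(children):
    """Encode an ordinal tree in LOUDS form. children[v] lists the ordered
    children of node v; nodes are assumed numbered 1..N in BFS order with
    node 1 the root."""
    bits = [1, 0]  # super-root with a single child: the root
    order = [1]
    for v in order:  # iterating while appending yields BFS order
        kids = children.get(v, [])
        bits.extend([1] * len(kids) + [0])
        order.extend(kids)
    return bits

def rank(bits, x, i):
    """Number of occurrences of x in bits[0..i] (0-indexed, inclusive)."""
    return sum(1 for b in bits[: i + 1] if b == x)

def select(bits, x, j):
    """0-based index of the j-th occurrence (j >= 1) of x in bits."""
    seen = 0
    for i, b in enumerate(bits):
        if b == x:
            seen += 1
            if seen == j:
                return i
    raise ValueError("no such occurrence")

def parent(bits, n):
    """Parent of node n, where node n is the n-th 1 in the bitstring."""
    return rank(bits, 0, select(bits, 1, n))

def first_child(bits, n):
    """First child of node n, or None if n is a leaf."""
    pos = select(bits, 0, n) + 1
    return rank(bits, 1, pos) if pos < len(bits) and bits[pos] == 1 else None
```

For the four-node tree with root 1, children 2 and 3, and node 4 a child of 3, louds_encode yields the 2N + 1 = 9-bit string 101100100, on which parent and first_child recover the tree structure.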

Transducer Representations
Basic Transducer Representation. A simple transducer representation uses adjacency lists as well, stored in an array a indexed by states that are denoted by integers from 0 to |Q| − 1. The value a[q] = ((i_1, o_1, q_1), . . . , (i_n, o_n, q_n)) is a list of the transitions leaving state q, where i_j is an input label, o_j an output label, and q_j a destination state. The initial state can be denoted by 0 and the final states can be stored separately. We will call this representation AdjList in our experiments, where we use 32 bits for each of the input label, output label, and destination state of each transition.
Compact Transducer Representation. A more compact transducer representation stores the |Q| adjacency lists across 2 global arrays as follows.
First, an array I, indexed by integers from 0 to |Q|, holds the values

I[q] = Σ_{0 ≤ i < q} |a[i]|.

Second, an array A, indexed by integers from 0 to |E| − 1, holds the concatenation of the adjacency lists a[0] · · · a[|Q| − 1]. The adjacency list for a given state q can be recovered from I and A as

a[q] = (A[I[q]], . . . , A[I[q + 1] − 1]).

Observe that I stores a monotonic nondecreasing sequence of integers, hence we encode it using a differential coding approach similar to PForDelta (Zukowski et al., 2006). We store A using a variable-length encoding that ensures that log |Q| + log |Σ| + log |Γ| bits are used per entry in A on average. Final states are stored as superfinal transitions. We will call this representation CmpAdjList in our experiments.
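A minimal sketch of this two-array layout (our illustration; it ignores the differential and variable-length coding, which are orthogonal to the indexing):

```python
def pack(adj):
    """Pack per-state adjacency lists into offset array I and dense array A.
    adj[q] is the transition list of state q (states numbered 0..|Q|-1)."""
    I, A = [0], []
    for q in range(len(adj)):
        A.extend(adj[q])
        I.append(len(A))  # I[q+1] = total length of lists a[0..q]
    return I, A

def unpack(I, A, q):
    """Recover the adjacency list of state q: A[I[q] .. I[q+1]-1]."""
    return A[I[q]: I[q + 1]]
```

Since I is nondecreasing, its consecutive differences are small nonnegative integers, which is what makes the PForDelta-style differential coding mentioned above effective.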

Lexicons
Lexicon Definition. We define a lexicon as a finite binary relation L ⊂ Σ+ × Γ that pairs nonempty character strings over the finite alphabet Σ with word symbols in the finite alphabet Γ. This terminology matches our keyboard application described above. For the speech recognition application, the Σ alphabet represents phonemes. We will assume the relation L is functional and one-to-one. In other words, each character string in the domain of L maps to only one word (i.e., no homonyms) and each word maps to only one character string (i.e., unique spellings). This is natural for the keyboard application.

Lexicon Representation. While there are many ways to represent a lexicon, we focus on using a character-to-word finite-state transducer. An advantage of this approach is that we can use transducer determinization and minimization to put the transducer into a minimal canonical form (possible since L is finite and thus has an acyclic transducer representation) (Mohri et al., 2002). Figure 1 gives an example of a character-to-word lexicon transducer in this canonical form. (As discussed below, removing its bridge arcs disconnects the graph while leaving two tree structures.) Each word in a canonical lexicon corresponds to exactly one successful path (by subsequentiality) and every successful path has exactly one transition with a non-ε output label (by definition of a lexicon). Further, there is only one final state (by acyclicity and minimality), which we will denote by f. What remains is to store this representation compactly. We will do so by storing the transducer graph and its labels separately.
Given a minimal lexicon transducer T , we will now show that we can decompose the graph G(T ) into three sub-graphs: a prefix graph G p (T ), a suffix graph G s (T ), and a bridge graph G b (T ). We further show that G p (T ) is an out-tree, G s (T ) is an in-tree, and G b (T ) is a directed bipartite graph. We will use this decomposition in our stored representation.
Formally, let Q_p be the set of states reachable from the initial state i by paths whose transitions all have ε output labels, and let Q_s = Q \ Q_p. The prefix graph G_p(T) = (Q_p, A_p) consists of the arcs between states in Q_p, the suffix graph G_s(T) = (Q_s, A_s) of the arcs between states in Q_s, and the bridge graph G_b(T) = (Q_p ∪ Q_s, A_b) of the arcs from Q_p to Q_s. In other words, the prefix graph corresponds to transitions on paths in T before the output label, the suffix graph to those after the output label, and the bridge graph to those carrying the output label. It is easy to see that a transition in T corresponds to an arc in exactly one of these sub-graphs. Further, Q_p and Q_s partition Q.
The prefix graph is an out-tree rooted at i ∈ Q p . Suppose there are two arcs entering some state q ∈ Q p . Then there must be two successful paths in T that pass through q with the same word label, which is a contradiction.
Similarly, the suffix graph is an in-tree rooted at f ∈ Q_s. Suppose there are two arcs leaving some state q ∈ Q_s. Then again there must be two successful paths in T that pass through q with the same word label, which is a contradiction.
Finally, the bridge graph is a directed bipartite graph with arcs that span from Q p to Q s because for any successful path in T the transition with a non-output label is preceded by a subpath with all output labels from the initial state i and followed by a subpath with all output labels to the final state f . Observe that only bridge arcs in A b can be multiarcs of G(T ) since L is one-to-one. Figure 1 shows this decomposition for our example with the bridge arcs specially marked.
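The decomposition itself is straightforward to compute. The sketch below (our illustration, not the paper's code) classifies transitions by first finding Q_p, the states reachable from the initial state via ε-output transitions, with None standing in for ε:

```python
from collections import deque

def decompose(transitions, initial=0):
    """Split a minimal lexicon transducer's transitions (src, in, out, dst)
    into prefix, suffix, and bridge arcs. out == None means epsilon output."""
    # Q_p: states reachable from the initial state via epsilon-output paths.
    eps_adj = {}
    for (p, _, o, n) in transitions:
        if o is None:
            eps_adj.setdefault(p, []).append(n)
    q_p, queue = {initial}, deque([initial])
    while queue:
        q = queue.popleft()
        for n in eps_adj.get(q, []):
            if n not in q_p:
                q_p.add(n)
                queue.append(n)
    prefix, suffix, bridge = [], [], []
    for t in transitions:
        if t[0] in q_p and t[3] in q_p:
            prefix.append(t)       # out-tree arc, epsilon output
        elif t[0] in q_p:
            bridge.append(t)       # carries the word output label
        else:
            suffix.append(t)       # in-tree arc, epsilon output
    return q_p, prefix, suffix, bridge
```

For instance, for a two-word lexicon {"ab" → X, "ac" → Y} with states 0 (initial), 1, and 2 (final), the transitions (1, b, X, 2) and (1, c, Y, 2) form the bridge graph and (0, a, ε, 1) the prefix out-tree, with an empty suffix in-tree.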

The Optimal Graph Encoding
Now that we have described the canonical form of our lexicon transducer and its graph decomposition, we can begin to devise a compression scheme. We first wish to find the information-theoretic bound on the number of bits required to uniquely encode any lexicon graph. That is, among all lexicon transducers with given prefix out-tree and suffix in-tree sizes (and a given number of leaves in each) and k bridge arcs, how many bits are sufficient to encode them so that they are all pairwise distinguishable?
In this section, we let n and n′ be the number of nodes and leaves in the prefix out-tree and m and m′ be the same for the suffix in-tree.
The LOUDS tree encoding is optimal for all n-node ordinal trees up to lower order terms (Jacobson, 1989). This is because there are (1/n)(2n−2 choose n−1) ordinal trees with n nodes, so at least the logarithm of this quantity, 2n − O(log n) bits, is needed to distinguish them. This is compared to the 2n + 1 bits used by LOUDS. However, when the number of leaves is known, this bound can be reduced: there are

(1/n′)(n−2 choose n′−1)(n−1 choose n′−1)

ordinal trees with n nodes and n′ leaves (Yamanaka et al., 2012).
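The leaf-constrained count can be checked numerically: summing it over all possible leaf counts must recover the Catalan number that counts all ordinal trees on n nodes. A small sketch (function names are ours):

```python
from math import comb

def num_ordinal_trees(n, leaves):
    """(1/leaves) * C(n-2, leaves-1) * C(n-1, leaves-1): the number of
    ordinal trees with n nodes and the given number of leaves
    (Yamanaka et al., 2012). The division is always exact."""
    return comb(n - 2, leaves - 1) * comb(n - 1, leaves - 1) // leaves

def catalan(k):
    """The k-th Catalan number; C_{n-1} counts ordinal trees on n nodes."""
    return comb(2 * k, k) // (k + 1)
```

For n = 4 nodes the counts by leaves are 1, 3, and 1, summing to the Catalan number C_3 = 5, matching the unconstrained lower bound discussed above.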
We are left with the task of counting the number of valid bridge graphs with k arcs. Each bridge graph is uniquely defined by choosing a set of k bridge arcs, i.e., a k-element subset of Q_p × Q_s. Every state in a minimal lexicon transducer must belong to a successful path, hence every node in its graph must belong to a path from the root of Q_p to the root of Q_s. A leaf in Q_p (resp. Q_s) belongs to such a path if and only if it is the origin (resp. destination) of a bridge arc. Hence, a set of k bridge arcs A_b ∈ P_k(Q_p × Q_s), where P_k(S) denotes the set of k-element subsets of S, is valid iff for every leaf q there exists a bridge arc a ∈ A_b such that q = p[a] or q = n[a]. Let Q′_p and Q′_s be the sets of leaves in the prefix and suffix graphs respectively, and Q′ = Q′_p ∪ Q′_s.
Let A_q denote the set of sets of k bridge arcs where the leaf q ∈ Q′ is not part of an arc:

A_q = P_k((Q_p \ {q}) × Q_s) if q ∈ Q′_p, and A_q = P_k(Q_p × (Q_s \ {q})) otherwise.

A set of bridge arcs is valid if and only if it does not belong to any of the A_q. Hence, the number of valid sets of bridge arcs is

(nm choose k) − |∪_{q ∈ Q′} A_q|.

We can now apply the inclusion-exclusion principle to compute the cardinality of the union in that last term:

|∪_{q ∈ Q′} A_q| = Σ_{∅ ≠ X ⊆ Q′} (−1)^{|X|+1} |∩_{q ∈ X} A_q|.

Observe that, for a non-empty subset X of Q′,

∩_{q ∈ X} A_q = P_k((Q_p \ X) × (Q_s \ X)),

and the cardinality of that intersection is:

((n − |X ∩ Q′_p|)(m − |X ∩ Q′_s|) choose k).

Hence, the cardinality of the intersection defined by a given X depends only on the number of leaves from Q′_p and Q′_s in X. We can continue the inclusion-exclusion computation by grouping the subsets X by the counts i = |X ∩ Q′_p| and j = |X ∩ Q′_s|:

|∪_{q ∈ Q′} A_q| = Σ_{(i,j) ≠ (0,0)} (−1)^{i+j+1} (n′ choose i)(m′ choose j)((n−i)(m−j) choose k),

the last derivation following from the fact that there are (n′ choose i)(m′ choose j) subsets X with those leaf counts. We can now complete the computation of the number of valid bridge graphs:

B(n, m, n′, m′, k) = Σ_{i=0}^{n′} Σ_{j=0}^{m′} (−1)^{i+j} (n′ choose i)(m′ choose j)((n−i)(m−j) choose k).

We are unaware of any asymptotic analysis of this summation or a way to closely estimate its logarithm. To compare it with our encoding, we use the loose upper bound (nm choose k) and Stirling's approximation to get

log B(n, m, n′, m′, k) ≤ log (nm choose k) ≈ k log(nm/k) + O(k).

Overall, the number of possible lexicon graphs given n, m, k, n′, and m′ can be found by multiplying the number of n-node, n′-leaf out-trees, the number of m-node, m′-leaf in-trees, and the number of valid bridge graphs. Finally, we note that by choosing any out-tree as a prefix graph, any in-tree as a suffix graph, and any valid bridge graph, we obtain a graph that is a valid lexicon graph. A minimal lexicon transducer can be derived from that graph by labeling each non-bridge arc with a unique input label (and ε output) and each bridge arc with a unique input and output label.

Compact Lexicon Encoding

Encoding the Graph

We encode the prefix, suffix, and bridge graphs separately. Encoding the prefix out-tree and suffix in-tree using LOUDS leads to a natural numbering of the nodes in Q: nodes in Q_p are numbered from 0 to n − 1 in BFS order and nodes in Q_s from n to |Q| − 1 in BFS order using the reverse of A_s, {(q′, q) | (q, q′) ∈ A_s}, with 0 and n denoting the roots of Q_p and Q_s, respectively.
The LOUDS representation of the prefix and suffix graphs consists of two bitstrings, b_p of length 2n + 1 and b_s of length 2m + 1, using 2(n + m + 1) bits combined.
We represent the bridge graph using a compact adjacency list approach. We use an array A_b indexed from 0 to n − 1 holding the concatenation of the bridge-arc adjacency lists of the prefix nodes a_b[0] · · · a_b[n − 1]. We use a bitmap b_b with n + k bits, one for each prefix node and bridge arc, to implement an index into A_b as follows. The bitmap b_b is encoded by concatenating 1^d 0 for each prefix node q, where d is the number of bridge arcs originating at q. We retrieve the number of bridge arcs originating at a node q ∈ Q_p by computing

Select_0(b_b, q + 1) − Select_0(b_b, q) − 1 (taking Select_0(b_b, 0) = −1),

and the index in the dense array A_b to the position where the adjacency list for q starts by

Select_0(b_b, q) − q + 1.

The variable-length encoding mentioned in Section 3.2 is used to compress A_b in k log m bits, since the k entries in A_b can take at most m values.
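The bitmap arithmetic can be illustrated with naive Select (our sketch; the o(N)-bit auxiliary structures from Section 3 make these queries O(1) in practice):

```python
def build_bitmap(degrees):
    """b_b: concatenate 1^d 0 per prefix node, d its bridge-arc out-degree."""
    bits = []
    for d in degrees:
        bits.extend([1] * d + [0])
    return bits

def select0(bits, j):
    """0-based position of the j-th 0 (j >= 1); j = 0 maps to position -1."""
    if j == 0:
        return -1
    seen = 0
    for i, b in enumerate(bits):
        if b == 0:
            seen += 1
            if seen == j:
                return i
    raise ValueError("no such occurrence")

def num_bridge_arcs(bits, q):
    """Number of bridge arcs originating at prefix node q (0-indexed)."""
    return select0(bits, q + 1) - select0(bits, q) - 1

def list_start(bits, q):
    """Index into the dense array A_b where node q's adjacency list begins."""
    return select0(bits, q) - q + 1
```

For bridge out-degrees (2, 0, 1) the bitmap is 110010 (n + k = 3 + 3 bits), and the two queries recover the degrees and the start offsets 0, 2, 2 into A_b.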
It is possible to reduce the bridge arc adjacency list and multiplicity encoding to min(n + k log m, m + k log n) + k + 1 bits by noting that the bridge arcs travel unidirectionally from the prefix out-tree to suffix in-tree so we can represent them in either the forward or reverse direction, depending on which uses less space. However, we choose not to do this as it would incur an additional traversal time cost.
In total, our encoding uses 2(n + m + 1) + n + k + k log m bits to store the graph. We note that this is asymptotically worse than the best possible from Section 5. Nevertheless, in Section 7, we show empirically that it performs substantially better than the CmpAdjList format and is useful in practice.

Encoding the Labeling
We now encode the arc labels for each of the three component graphs using four ancillary arrays. The arrays L_p and L_s store the input labels for each of the n − 1 prefix arcs and m − 1 suffix arcs. For q ∈ Q_p \ {0}, L_p[q − 1] holds the input label for the unique incoming prefix arc to q. Likewise, for q ∈ Q_s \ {n}, L_s[q − n − 1] holds the input label for the unique outgoing suffix arc from q. Recall that arcs in the prefix out-tree or suffix in-tree always have output label ε.
The arrays L^i_b and L^o_b store the input and output labels for each of the k bridge arcs, using the same indexing as A_b: the bridge arc corresponding to the j-th entry in A_b has input label L^i_b[j] and output label L^o_b[j]. Each of the arrays L_p, L_s, L^i_b, and L^o_b is compressed using the same variable-length encoding scheme as CmpAdjList. This allows us to directly compare the effect of encoding the graph separately from the arc label data. Encoding finality is simple: only one node, the root of the suffix in-tree, is final.

An overview of the memory layout of our encoding (prefix tree, bridge arcs, suffix tree, and label arrays) is given in Figure 2. We discuss the practical space savings in Section 7. All together, we call this representation the LOUDS lexicon format in our experiments.

Traversing the Transducer
We traverse the transducer by constructing the transitions originating at a given state q on demand.
When q ∈ Q_p, the set E[q] of outgoing transitions in q can be decomposed as

E[q] = E_p[q] ∪ E_b[q],

where E_p[q] represents the transitions corresponding to prefix arcs and E_b[q] the ones corresponding to bridge arcs. The first component can be computed from the prefix LOUDS tree by

E_p[q] = {(q, L_p[q′ − 1], ε, q′) | q′ ranges over the children of q in b_p},

and the second component can be recovered from the compact adjacency representation of the bridge graph as

E_b[q] = {(q, L^i_b[j], L^o_b[j], A_b[j]) | j ranges over the indices of the adjacency list of q in A_b}.

When q ∈ Q_s \ {n}, there is a single outgoing transition in q that can be computed from the suffix LOUDS tree as (q, L_s[q − n − 1], ε, n + Parent_{b_s}(q − n)).
Finally, when q = n, q is the root of the suffix in-tree. There are no outgoing transitions in q but q is final.

Closure
In practice, we often use a modified lexicon transducer representing its closure T+, which accepts one or more words from the lexicon. For this, an ε-labeled transition from the final state to the initial state can be added to the canonical transducer.

Experiments
We compare our lexicon encoding to the two other transducer representations in Section 3.2. We measure the memory size of the resulting machines as well as their runtimes on a decoding task. We prepare a set of lexicons using the most common 500k words in the Google keyboard (GBoard) Russian language model. We extract the 50k, 100k, . . . , 500k most frequent words to create a total of 10 lexicons.
We first compare the space used by the AdjList, CmpAdjList, and LOUDS lexicon formats. The results are shown in Figure 3. The LOUDS lexicon outperforms the other two formats in every case. On the 500k word lexicon, it is 90.8% smaller than the AdjList format and 58.8% smaller than the CmpAdjList format.

Figure 4 shows the number of bits required to encode the Russian lexicons using our encoding and the upper bound of the optimal encoding. We use the parameters from Table 1 along with the upper bound described in Section 5 and the number of bits for our representation from Section 6. Our graph encoding nearly matches the upper bound approximation in all situations. For the 500k lexicon, the difference between our encoding and the upper bound is less than 2%. In contrast, the standard adjacency list graph format requires ten times more space across all test cases.

# Words         50k     100k    150k    200k    250k    300k    350k    400k    450k    500k
Prefix Nodes    76207   148026  219494  292499  360911  429080  494766  558619  620429  670232
Suffix Nodes    7867    12548   15964   18187   20634   22454   23850   25059   25955   26977
Prefix Leaves   13402   26371   39300   52288   64894   77462   89881   102067  114228  124097
Suffix Leaves   1602    2560    3266    3732    4182    4512    4754    4989    5221    5421

Table 1: The size of the prefix out-tree and suffix in-tree as well as the number of leaves in each for all of the Russian lexicons. Note that the number of bridge arcs is the same as the number of words.

We now consider the performance of our encoding on a benchmark decoding task consisting of on-the-fly composition with an n-gram language model followed by shortest path computation, which simulates a typical pipeline in applications. For the language model, we use a 244k state n-gram model trained on Russian language data. Figure 5 shows the speed of this benchmark for each of the lexicon formats. At its worst, the LOUDS format was ∼20% slower than the CmpAdjList format. However, for the 500k word case, the difference between the LOUDS format and the CmpAdjList format was only 8.6%. In these experiments, no pre-processing (transition sorting, caching, etc.) of the transducers was done so that the raw access time for each format could be measured more accurately.

Conclusion
In this paper, we described a compact encoding for character-to-word lexicon transducers in canonical minimal form. The transducer graph is decomposed into simpler subgraphs that are exploited in the encoding, and the arc label data is encoded separately using variable-length compression schemes. We presented an information-theoretic lower bound for the graph encoding and compared our encoding to an asymptotic upper bound approximation of that bound. We also compared our encoding to two alternative formats, adjacency lists with and without variable-length compression: ours is more than 58% smaller while being only ∼9% slower in tests on a decoding benchmark. Furthermore, this encoding is very close to the upper bound approximation on all the test cases.