Failure Transducers and Applications in Knowledge-Based Text Processing

Finite-state devices encoding lexica and related knowledge bases often become very large. A well-known technique for reducing the size of finite-state automata is the use of failure transitions. Here we generalize the concept of failure transitions for finite-state automata to the case of subsequential transducers. Failure transitions in the new sense do not have input but may produce output. As an application field for failure transducers we consider text rewriting with large rewrite lexica under the leftmost-longest replacement strategy. It is shown that using failure transducers leads to a huge space reduction compared to the use of standard subsequential transducers. As a concrete example we show how all Wikipedia concepts in an input text can be linked in an online manner with the Wikipedia pages of the concepts using failure transducers.


Introduction
A well-known technique for reducing the size of large finite-state automata is the use of failure transitions (Aho and Corasick, 1975; Mohri, 1995; Crochemore and Hancart, 1997; Kourie et al., 2012; Björklund et al., 2014). While automata help to find strings in text, more advanced text processing tasks are often based on knowledge bases that provide information on characteristic portions of input texts (endings, words, phrases, etc.). Using this information, given input texts are translated into a new output form. Examples of this form of "text rewriting" include various forms of tagging, stemming, and (linguistic, semantic, ...) annotation (KESA, 2016).
There are a number of efficient techniques for representing a finite dictionary of string entries with their corresponding mappings as finite-state machines and transducers (Mihov and Maurel, 2001; Daciuk et al., 2010). These techniques can produce a very compact representation of the dictionary. But in order to perform text rewriting based on the dictionary one has to traverse the dictionary starting from each text position and in addition apply a conflict resolution strategy. Therefore the time complexity for text rewriting is given by the length of the text multiplied by the maximal length of a dictionary entry. Deterministic finite-state transducers offer an elegant framework to solve such a text processing task in a more efficient way. In (Mihov and Schulz, 2007) we considered "rewriting dictionaries", i.e. collections of strings where for each entry a replacement value (another string) is specified. We showed how to translate a given rewriting dictionary into a subsequential finite-state transducer that replaces all occurrences of dictionary entries in a text by their replacements in a single traversal of the text. Using this solution, the time complexity for text rewriting is linear in the length of the text and does not depend on the size of the dictionary. For resolving conflicts between overlapping entries the leftmost-longest rewriting strategy is used. A similar technique is used in (Schmitz, 2011) for constructing subsequential transducers that represent part-of-speech rules. However, when using large rewriting dictionaries the size of the resulting subsequential transducer can become very large.
In this paper we introduce f-transducers, a new kind of deterministic transducer with failure transitions. A failure transition in our sense does not consume input, but it is essential that it may produce output. We show how to translate a given rewriting dictionary into an f-transducer that has the same functionality as the subsequential finite-state transducer obtained in (Mihov and Schulz, 2007). In this way, a huge space reduction is obtained for large rewriting dictionaries. Since transitions in transducers come with output, saving transitions has an even larger benefit than in the automaton case. As a concrete application we consider a rewriting dictionary with 8 million entries where each title of a page of the English Wikipedia is mapped to a link text with an anchor pointing to the corresponding page of the concept. The f-transducer obtained from the translation runs over a text and replaces every mention of a Wikipedia concept by a link to the Wikipedia page. In this way, texts can be linked to Wikipedia in an online manner.
We start with formal preliminaries in Section 2. In Section 3 we introduce failure transducers. Section 4 presents the construction of f-transducers for text rewriting. The algorithm and the complexity analysis of our construction are given in Section 5. Section 6 describes the annotation of concept names with Wikipedia. We finish with a short conclusion in Section 7.

Formal Preliminaries
An alphabet is a finite set $\Sigma$ of symbols. Words of length $n \geq 0$ over an alphabet $\Sigma$ are introduced as usual and written $a_1 \ldots a_n$ ($a_i \in \Sigma$). The unique word of length 0 is written $\varepsilon$. As usual, $\Sigma^*$ denotes the set of all words over $\Sigma$. The concatenation of two words $u, v \in \Sigma^*$ is written $u \cdot v$ or $uv$.
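The language accepted by a deterministic finite-state automaton $A = \langle \Sigma, Q, q_0, F, \delta \rangle$ is $L(A) := \{ w \in \Sigma^* \mid \delta^*(q_0, w) \in F \}$, where $\delta^*$ denotes the generalized transition function of $A$.
Definition 2.2 A failure automaton or f-automaton is a tuple $FA = \langle \Sigma, Q, q_0, F, \delta, f \rangle$ where $\langle \Sigma, Q, q_0, F, \delta \rangle$ is a deterministic finite-state automaton and $f : Q \to Q$ is a partial function called the failure function. Let $FA = \langle \Sigma, Q, q_0, F, \delta, f \rangle$ be an f-automaton. The completed transition function $\delta_f : Q \times \Sigma \to Q$ is defined as
$$\delta_f(q, \sigma) := \begin{cases} \delta(q, \sigma) & \text{if } \delta(q, \sigma) \text{ is defined,} \\ \delta_f(f(q), \sigma) & \text{if } \delta(q, \sigma) \text{ is undefined and } f(q) \text{ is defined.} \end{cases}$$
Like $\delta$ and $f$, $\delta_f$ is a partial function. The generalized completed transition function is the (partial) function $\delta_f^* : Q \times \Sigma^* \to Q$ inductively defined as
• $\delta_f^*(q, \varepsilon) := q$ for all $q \in Q$,
• $\delta_f^*(q, w\sigma) := \delta_f(\delta_f^*(q, w), \sigma)$ for all $q \in Q$, $w \in \Sigma^*$, $\sigma \in \Sigma$ such that $\delta_f^*(q, w)$ and $\delta_f(\delta_f^*(q, w), \sigma)$ are defined.
The language of the f-automaton $FA$ is defined as $L(FA) := \{ w \in \Sigma^* \mid \delta_f^*(q_0, w) \in F \}$.
Definition 2.3 A subsequential transducer is a tuple $T = \langle \Sigma, Q, q_0, F, \delta, \lambda, \Psi \rangle$ where $\langle \Sigma, Q, q_0, F, \delta \rangle$ is a deterministic finite-state automaton, $\lambda : Q \times \Sigma \to \Sigma^*$ is a partial function called the transition output function and $\Psi : F \to \Sigma^*$ is a total function called the state output function. The domains of $\delta$ and $\lambda$ must coincide. The generalized transition function $\delta^*$ is defined as above. The generalized output function is the partial function $\lambda^* : Q \times \Sigma^* \to \Sigma^*$ defined as
• $\lambda^*(q, \varepsilon) := \varepsilon$ for all $q \in Q$,
• $\lambda^*(q, w\sigma) := \lambda^*(q, w) \cdot \lambda(\delta^*(q, w), \sigma)$ for all $q \in Q$, $w \in \Sigma^*$, $\sigma \in \Sigma$ such that $\delta^*(q, w)$ and $\delta(\delta^*(q, w), \sigma)$ are defined.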
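The completed transition function of an f-automaton can be illustrated with a minimal Python sketch that follows failure links until a transition for the current symbol exists. The class name `FAutomaton`, the dictionary-based encoding of the partial functions, and the three-state example automaton are assumptions made for this illustration and are not taken from the paper. The sketch also assumes that the failure function contains no cycles.

```python
class FAutomaton:
    """Hypothetical encoding of an f-automaton <Sigma, Q, q0, F, delta, f>."""

    def __init__(self, q0, finals, delta, fail):
        self.q0 = q0
        self.finals = finals  # set of final states F
        self.delta = delta    # partial function as dict: (state, symbol) -> state
        self.fail = fail      # partial failure function as dict: state -> state

    def delta_f(self, q, sigma):
        """Completed transition function; returns None where it is undefined."""
        while (q, sigma) not in self.delta:
            if q not in self.fail:
                return None
            q = self.fail[q]  # failure step: no input symbol is consumed
        return self.delta[(q, sigma)]

    def delta_f_star(self, q, word):
        """Generalized completed transition function on a whole word."""
        for sigma in word:
            q = self.delta_f(q, sigma)
            if q is None:
                return None
        return q

    def accepts(self, word):
        return self.delta_f_star(self.q0, word) in self.finals


# Assumed toy automaton: accepts the words over {a, b} that end in "ab".
fa = FAutomaton(
    q0=0,
    finals={2},
    delta={(0, "a"): 1, (0, "b"): 0, (1, "b"): 2},
    fail={1: 0, 2: 0},
)
print(fa.accepts("bbab"), fa.accepts("aba"))
```

In this toy automaton only three ordinary transitions are stored; the missing ones (for example the a-transition of state 2) are recovered through the failure links.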
The notion of paths in a finite-state device and the length of a path are introduced as usual.
Definition 2.4 A position of $t \in \Sigma^*$ is a pair $\langle u, v \rangle$ such that $t = uv$. An infix occurrence (of the infix $v$) in $t \in \Sigma^*$ is a triple $\langle u, v, w \rangle$ such that $t = uvw$. An infix occurrence $\langle u_1, v_1, w_1 \rangle$ of the text $t$ blocks another infix occurrence $\langle u_2, v_2, w_2 \rangle$ of $t$ if $|u_1| < |u_2| < |u_1 v_1|$. In this case we write $\langle u_1, v_1, w_1 \rangle <_{ov} \langle u_2, v_2, w_2 \rangle$ and say that the two infix occurrences overlap. A set $A$ of infix occurrences of the text $t$ is said to be non-overlapping if no two distinct infix occurrences of $A$ overlap.
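The notions of Definition 2.4 translate directly into a few lines of Python. The function names `infix_occurrences`, `blocks` and `non_overlapping` are hypothetical helpers introduced only for this sketch; occurrences are encoded as plain tuples `(u, v, w)`.

```python
def infix_occurrences(t, v):
    """All infix occurrences (u, v, w) of v in t, i.e. all splits t == u + v + w."""
    return [(t[:i], v, t[i + len(v):])
            for i in range(len(t) - len(v) + 1)
            if t.startswith(v, i)]


def blocks(occ1, occ2):
    """True if occ1 blocks occ2, i.e. |u1| < |u2| < |u1 v1|."""
    (u1, v1, _), (u2, _, _) = occ1, occ2
    return len(u1) < len(u2) < len(u1) + len(v1)


def non_overlapping(occurrences):
    """True if no occurrence in the collection blocks another one."""
    return not any(blocks(a, b) for a in occurrences for b in occurrences)


t = "abcbbbabccb"
occ_ab = infix_occurrences(t, "ab")
occ_babc = infix_occurrences(t, "babc")
print(occ_ab)
print(non_overlapping(occ_ab), non_overlapping(occ_ab + occ_babc))
```

Note that, with the strict inequalities of the definition, two occurrences that start at the same position do not block each other.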
Definition 2.5 Let A, B be two sets of infix occurrences of the text t. We define

Failure transducers
Failure transducers, or f-transducers, are deterministic transducers with a failure transition function. When a failure transition is applied during text traversal, no input is consumed. However, a non-empty output string may be produced. We start with an illustrative example. When reading, say, symbol a in State 2, we follow the failure link to State 1, producing output D. Then we use the a-transition from State 1 to arrive at State 3. The total output is DA. The example might appear artificial since with other outputs we could not use the same technique. However, we shall see later that moving parts of the output to the failure transitions is often possible.
Transducers that are used for rewriting texts need to accept arbitrary strings. For this reason we do not introduce a special set of final states in our formalization of failure-transducers. As a matter of fact, generalizations are possible.
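A failure transducer (f-transducer) is a tuple $FT = \langle \Sigma, Q, q_0, \delta, \lambda, \varphi, f \rangle$ where $\Sigma$ is a finite alphabet, $Q$ is a set of states, $q_0 \in Q$ is the start state, $\delta : Q \times \Sigma \to Q$ is the deterministic transition function, $\lambda : Q \times \Sigma \to \Sigma^*$ is the transition output function, $\varphi : Q \to \Sigma^*$ is the failure transition output function, and $f : Q \to Q$ is the failure transition function. The following conditions must hold:
1. $\varphi$ and $f$ are partial functions such that $\mathrm{dom}(\varphi) = \mathrm{dom}(f)$.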
Informally, the way an f-transducer processes a text $t$ can be described as follows. Starting from the start state and using the deterministic transition function $\delta$, we read the symbols of the text. The output at each transition is defined by the transition output function. When reaching a state $p$ and a text symbol $\sigma$ such that $\delta(p, \sigma)$ is not defined, we apply a series of failure transitions until we arrive at a state $p_n$ such that $\delta(p_n, \sigma) = p'$ is defined. The output produced on this intermediate walk has two parts. The first part is the concatenation of the failure transition outputs of the failure transitions taken on the way from $p$ to $p_n$. The second part is given by the transition output of the final $\sigma$-transition. Finally, when arriving at state $q$ at the end of the text, we apply a series of failure transitions, producing failure transition outputs, until we reach a state for which the failure function is not defined.
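The informal description above can be turned into a short Python sketch. The function name `run_f_transducer`, the dictionary-based encoding, and the three-state toy transducer (which replaces every occurrence of ab by X over the alphabet {a, b}) are assumptions made for illustration; they are not the example of the paper. The sketch assumes that every input symbol can eventually be read after finitely many failure steps.

```python
def run_f_transducer(q0, delta, lam, fail, phi, text):
    """Process text with an f-transducer and return the produced output.

    delta, lam: dicts (state, symbol) -> state / output string
    fail, phi:  dicts state -> failure state / failure transition output
    """
    out = []
    q = q0
    for sigma in text:
        # Failure steps: produce output, consume no input.
        while (q, sigma) not in delta:
            out.append(phi[q])
            q = fail[q]
        # Ordinary transition: consume sigma and emit its transition output.
        out.append(lam[(q, sigma)])
        q = delta[(q, sigma)]
    # End of text: follow failure transitions until f is undefined.
    while q in fail:
        out.append(phi[q])
        q = fail[q]
    return "".join(out)


# Assumed toy transducer: state 1 = "a read", state 2 = "ab read".
delta = {(0, "a"): 1, (0, "b"): 0, (1, "b"): 2}
lam = {(0, "a"): "", (0, "b"): "b", (1, "b"): ""}
fail = {1: 0, 2: 0}
phi = {1: "a", 2: "X"}

print(run_f_transducer(0, delta, lam, fail, phi, "aabb"))  # prints aXb
```

The pending symbol a of state 1 and the replacement X of state 2 are stored only once, on the failure transitions, instead of being repeated on every outgoing transition.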
The following lemma shows that an f-transducer $FT$ in a natural way defines a corresponding subsequential transducer with the same output function. The proof is a direct consequence of our definition of the output function $O_{FT}$.
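Lemma 3.4 Let $FT = \langle \Sigma, Q, q_0, \delta, \lambda, \varphi, f \rangle$ be an f-transducer, and let $\delta_f$, $\lambda_f$, and $\Psi_f$ be as above. Then $T = \langle \Sigma, Q, q_0, Q, \delta_f, \lambda_f, \Psi_f \rangle$ is a subsequential transducer with $O_T = O_{FT}$.
In what follows we consider failure transducers $FT$ where each state $q$ can be reached from the start state. The depth of a state $q \in Q$, denoted $d(q)$, is the minimal length of a path from the start state $q_0$ to $q$.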

Definition 3.5 A backwards f-transducer is an f-transducer
Proposition 3.6 The time complexity (assuming a random access machine) for rewriting a word α of length n to a word β of length m by a backwards f-transducer is O(n + m) and does not depend on the size of the transducer.
The simple proof is omitted.

From rewrite dictionaries to f-transducers
As an application field for f-transducers we now look at text rewriting using dictionaries of a particular form.
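Definition 4.1 (Mihov and Schulz, 2007) A rewrite dictionary is a pair $\mathcal{D} = (D, \Sigma)$ where $\Sigma$ is an alphabet and $D$ is a finite mapping of words over $\Sigma$. The mapping can be represented in the form $D = \{ \langle \alpha_1, \beta_1 \rangle, \ldots, \langle \alpha_k, \beta_k \rangle \}$. Each string $\alpha_i$ is called an entry or an original of $\mathcal{D}$, and $\beta_i$ is called the replacement value for $\alpha_i$ ($i = 1, \ldots, k$).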
Definition 4.2 Let $t \in \Sigma^*$ be a text and $\mathcal{D} = (D, \Sigma)$ be a rewrite dictionary. A rewrite occurrence of $\mathcal{D}$ in $t$ is an infix occurrence $\langle u, v, w \rangle$ of $t$ such that $v \in \mathrm{dom}(D)$. By $C_t^{\mathcal{D}}$ we denote the set of all rewrite occurrences of $\mathcal{D}$ in the text $t$. The global rewriting function associated with $\mathcal{D}$ is the mapping $L(\mathcal{D}) : \Sigma^* \to \Sigma^*$ that, given an input text $t$, replaces each leftmost-longest infix occurrence of $C_t^{\mathcal{D}}$ in $t$ by the corresponding replacement value.
Example 4.3 (From (Mihov and Schulz, 2007)). Let $\mathcal{D}$ denote the rewriting dictionary with alphabet $\Sigma := \{a, b, c, 1, 2, 3, 4, 5\}$ and mapping $D$ of the form
The leftmost-longest infix occurrences of $C_t^{\mathcal{D}}$ in the text $t = abcbbbabccb$ are
We now describe a procedure for translating a rewrite lexicon $\mathcal{D}$ into a backwards f-transducer $FT$ such that the global rewriting function $L(\mathcal{D})$ associated with $\mathcal{D}$ and $O_{FT}$ are identical. The construction is a variant of the construction presented in (Mihov and Schulz, 2007) for translating rewrite dictionaries into standard subsequential transducers. As in (Mihov and Schulz, 2007) we proceed in two steps.
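As a point of reference, the global rewriting function can be sketched as a naive greedy scan that, at each text position, tries the longest candidate first. This is the baseline whose cost grows with the maximal entry length, not the transducer construction of this paper. The function name `rewrite_naive` is hypothetical, and the mapping in `ASSUMED_DICTIONARY` is an assumption: it is chosen so that the text abcbbbabccb is rewritten to 25bb45b, the output stated in Example 4.4. The sketch assumes that the dictionary contains no empty entry.

```python
def rewrite_naive(text, dictionary):
    """Greedy leftmost-longest rewriting by direct dictionary lookup."""
    max_len = max(map(len, dictionary), default=0)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate infix starting at position i first.
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if candidate in dictionary:
                out.append(dictionary[candidate])
                i += n
                break
        else:
            # No entry starts at position i: copy the symbol unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


# Assumed mapping, consistent with the output reported in Example 4.4.
ASSUMED_DICTIONARY = {"a": "1", "ab": "2", "abcc": "3", "babc": "4", "c": "5"}

print(rewrite_naive("abcbbbabccb", ASSUMED_DICTIONARY))  # prints 25bb45b
```

At every position this baseline inspects up to `max_len` candidate infixes, which is the dependence on the entry length that the transducer-based approach removes.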
Step 1. Given the rewrite lexicon $\mathcal{D}$, as in (Mihov and Schulz, 2007) we build a trie transducer $T_D$ representing the domain of the lexicon mapping $D$. The final states of $T_D$ correspond to the entries ("originals") of $\mathcal{D}$, and the failure transition output of each final state is defined as the image of the entry. The transition output for each transition is the empty string $\varepsilon$. The trie transducer thus represents the finite mapping $D$ given by $\mathcal{D}$.
Step 2. The second step, where we build the failure transducer $FT$ representing the global rewriting function for $\mathcal{D}$, is based on a procedure where we visit the states of the trie transducer in a breadth-first manner, starting at the initial state $q_0$. We first complete the initial state $q_0$ by adding loop transitions for every symbol $\sigma \in \Sigma$ such that there is no outgoing $\sigma$-transition from $q_0$ in the trie. The transition output for a loop transition with symbol $\sigma$ is $\sigma$. For all states $q$ which are direct successors of $q_0$, i.e. such that $q = \delta(q_0, \sigma)$, we define $f(q) := q_0$, and $\varphi(q) := \sigma$ if $q$ is not final. In case $q$ is final, $\varphi(q)$ is already defined. Assume now that for the state $q \neq q_0$ we have already defined $f(q) = p_q$ and $\varphi(q) = \gamma_q$. Let $q' = \delta(q, \sigma)$ be a direct successor of $q$ in the trie.
Case a. If $q'$ is a final state of the trie transducer, i.e., if the failure transition output $\varphi(q')$ for $q'$ is already defined, we just define $f(q') := q_0$.
Case b. Otherwise we define $f(q') := \delta_f(p_q, \sigma)$ and $\varphi(q') := \gamma_q \cdot \lambda_f(p_q, \sigma)$, where $\lambda_f(p_q, \sigma)$ is the output produced when reading $\sigma$ from $p_q$ (the failure transition outputs followed by the output of the final $\sigma$-transition). The definition of $\delta_f$ shows that we find state $f(q')$ by starting from $f(q) = p_q$ and applying failure transitions until we arrive at a state $p_n$ such that $\delta(p_n, \sigma) = f(q')$ is defined.
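Steps 1 and 2 can be sketched in Python as follows. This is an illustrative reading of the construction, not the authors' Algorithm 1. Two points are assumptions: the rule used for non-final successors, namely f(q') = δ_f(f(q), σ) with failure output ϕ(q) followed by the output produced on that walk, and the example mapping, which is chosen so that the sketch reproduces the output 25bb45b stated in Example 4.4. The names `step`, `build_f_transducer` and `rewrite` are hypothetical. The sketch assumes non-empty entries and texts over the given alphabet.

```python
from collections import deque


def step(delta, lam, fail, phi, q, sigma):
    """Return (delta_f(q, sigma), lambda_f(q, sigma)) by following failure links."""
    out = []
    while (q, sigma) not in delta:
        out.append(phi[q])  # failure transition: output only, no input consumed
        q = fail[q]
    out.append(lam[(q, sigma)])
    return delta[(q, sigma)], "".join(out)


def build_f_transducer(dictionary, alphabet):
    """Sketch of Steps 1 and 2; states are integers, 0 is the start state."""
    delta, lam, fail, phi = {}, {}, {}, {}
    final, children, n_states = set(), {}, 1
    # Step 1: trie of the entries; every trie transition outputs the empty word.
    for entry, replacement in dictionary.items():
        q = 0
        for sigma in entry:
            if (q, sigma) not in delta:
                delta[(q, sigma)], lam[(q, sigma)] = n_states, ""
                children.setdefault(q, []).append((sigma, n_states))
                n_states += 1
            q = delta[(q, sigma)]
        final.add(q)
        phi[q] = replacement  # failure output of a final state: the replacement
    # Step 2: complete the start state with loops that copy the symbol.
    for sigma in alphabet:
        if (0, sigma) not in delta:
            delta[(0, sigma)], lam[(0, sigma)] = 0, sigma
    queue = deque()
    for sigma, q in children.get(0, []):
        fail[q] = 0
        if q not in final:
            phi[q] = sigma
        queue.append(q)
    while queue:  # breadth-first traversal of the trie
        q = queue.popleft()
        for sigma, r in children.get(q, []):
            if r in final:  # Case a: restart at the start state
                fail[r] = 0
            else:           # assumed rule for non-final successors
                fail[r], out = step(delta, lam, fail, phi, fail[q], sigma)
                phi[r] = phi[q] + out
            queue.append(r)
    return delta, lam, fail, phi


def rewrite(transducer, text):
    """Run the f-transducer over text, flushing failure outputs at the end."""
    delta, lam, fail, phi = transducer
    q, out = 0, []
    for sigma in text:
        q, piece = step(delta, lam, fail, phi, q, sigma)
        out.append(piece)
    while q in fail:
        out.append(phi[q])
        q = fail[q]
    return "".join(out)


# Assumed mapping, consistent with the output reported in Example 4.4.
ASSUMED = {"a": "1", "ab": "2", "abcc": "3", "babc": "4", "c": "5"}
ft = build_f_transducer(ASSUMED, "abc12345")
print(rewrite(ft, "abcbbbabccb"))  # prints 25bb45b
```

Apart from the loops at the start state, the sketch adds exactly one failure transition per trie state, so no ordinary transitions are created for the missing symbols of non-initial states.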
Example 4.4 As an illustration of the translation of rewrite lexica into f-transducers we use the rewrite lexicon from Example 4.3. The resulting trie transducer (Step 1) and f-transducer (Step 2) are shown in Figure 2. When processing the text $t = abcbbbabccb$ we first read the prefix abc with no output. Then two failure transitions produce output 25 before we can read the next letter b from the start. After reading $t$ we have produced output 25bb45 and the current state is the b-successor of the start. We have to add the final state output for this state, which is given by the failure transition output b (cf. Def. 3.3). The total output is 25bb45b as in Example 4.3.
Remark 4.5 For the following correctness proof we sketch Step 2 of the parallel construction of a subsequential transducer $T$ in (Mihov and Schulz, 2007). Recall that in this case for each state $q$ and each symbol $\sigma \in \Sigma$ such that $q$ does not have a $\sigma$-transition in the transducer trie $T_D$, a new $\sigma$-transition with suitable output needs to be added. For $q_0$ the procedure is as above (like every other state, $q_0$ is made final). Consider a state $q \neq q_0$ processed during Step 2. Let $plab(q) = a_1 \ldots a_r$ denote the label of the path $\pi$ in the trie from $q_0$ to $q$. The skip part of $plab(q) = a_1 \ldots a_r$, denoted $u_1$, is:
1. $a_1$ if the state sequence of $\pi$ does not contain any final state of $T_D$, and
2. $a_1 \cdots a_f$ ($f \leq r$) if this prefix of $plab(q)$ leads from $q_0$ to the last final state of $\pi$.
The read part of $plab(q)$ (denoted $u_2$) is the remaining part $u_2$ of $plab(q) = u_1 u_2$. Note that the read part is the empty word if $q$ is a final state of $T_D$ or if $q$ is a direct successor of the start state $q_0$. The failure state for $q$ is the state $p'$ obtained when traversing the transducer $T$ with the read part $u_2$, starting from $q_0$. The construction order guarantees that $u_2$ can be completely read in the preliminary version of the subsequential transducer $T$ computed up to this point since the length of the read part $u_2$ is smaller than the length of $plab(q)$.
The output prefix for $q$ is the string $\gamma$ that represents the concatenation of (i) the output of the transducer $T$ for the skip part $u_1$ (either $a_1$ or the substitute for the lexical entry $a_1 \cdots a_f$, cf. cases above) and (ii) the transition output of the transducer $T$ for the read part $u_2$ when starting from $q_0$. With these notions, the processing of $q$ can be described in the following way:
1. The state output for $q$ is the concatenation of the output prefix $\gamma$ with the state output of the failure state $p'$.
2. If a new $\sigma$-transition from $q$ is needed, the target state is the $\sigma$-successor of the failure state for $q$ (it always exists since the failure state has smaller depth than $q$). The transition output for the new $\sigma$-transition from $q$ is the concatenation of the output prefix $\gamma$ with the transition output of the transition with label $\sigma$ from the failure state $p'$.
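The decomposition of a path label into skip part and read part can be made concrete with a small Python helper. The function name `skip_and_read` is hypothetical, and the set of entries below is an assumed example that is consistent with the output reported in Example 4.4; it is not reproduced from the paper.

```python
def skip_and_read(path_label, entries):
    """Split the path label of a non-initial trie state into (skip part, read part).

    The skip part is the longest prefix of the label that is a dictionary
    entry (the prefix leading to the last final state on the path), or the
    first symbol if the path contains no final state.
    """
    for k in range(len(path_label), 0, -1):
        if path_label[:k] in entries:
            return path_label[:k], path_label[k:]
    return path_label[:1], path_label[1:]


# Assumed entries, consistent with the output reported in Example 4.4.
entries = {"a", "ab", "abcc", "babc", "c"}
print(skip_and_read("abc", entries))  # prints ('ab', 'c')
print(skip_and_read("bab", entries))  # prints ('b', 'ab')
```

For the state with path label abc the entry ab is skipped (and replaced), while the remaining symbol c is read again from the start state; for a final state such as abcc the read part is empty.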
Correctness proof. Because of space limitations, an independent and fully self-contained proof cannot be given here. However, using the correctness of the parallel construction in (Mihov and Schulz, 2007) (shown in that paper) and Remark 4.5 we can prove correctness of the new construction. Let $FT = \langle \Sigma, Q, q_0, \delta, \lambda, \varphi, f \rangle$ denote the f-transducer obtained. Let $T = \langle \Sigma, Q, q_0, Q, \delta_T, \lambda_T, \Psi_T \rangle$ denote the subsequential transducer obtained from the construction described in (Mihov and Schulz, 2007). The following lemma captures some parallelisms between the two devices. We use the notation introduced in Definition 3.3.
Lemma 4.6
1. For any state $q \neq q_0$ the state $f(q)$ is the failure state of $q$ in the sense of Remark 4.5. We have $d(f(q)) < d(q)$.
3. For any state $q \neq q_0$ the failure transition output $\varphi(q)$ is the output prefix $\gamma(q)$ in the sense of Remark 4.5.
The proof of Lemma 4.6 is given below. Looking at Part 3 of Lemma 4.6 it is simple to see that for each state $q$ we have $\Psi_T(q) = \Psi_f(q)$.
When we now compare the outputs produced for a text $t$ by $T$ and $FT$ respectively we find, using Lemma 4.6, that $O_T(t) = O_{FT}(t)$. In (Mihov and Schulz, 2007) it has been shown that $O_T$ represents the global rewrite function $L(\mathcal{D})$ for the rewrite lexicon $\mathcal{D}$ under the leftmost-longest strategy in the sense of Definition 4.2.
Hence the same holds for $O_{FT}$, which shows that the new construction is correct.
Proof of Lemma 4.6. Recall that for a state $q$, the length of the unique path from $q_0$ to $q$ in the transducer trie is denoted $d(q)$. The proof is by induction on $d(q)$. If $d(q) = 0$ we have $q = q_0$. In this case, Claims 2 and 4 are obvious since $q_0$ is completed in the same way in both constructions, so that $\delta_T(q_0, \sigma) = \delta_f(q_0, \sigma)$. There is nothing to show as to Claims 1 and 3.
For the induction step consider a successor state $q = \delta(q', \sigma')$. Let $p'$ denote the failure state of $q'$ in the sense of Remark 4.5. Let $p := \delta_T(p', \sigma')$. By induction hypothesis we have $p' = f(q')$.
We show Claim 1. If $q$ is final, then $q_0 = f(q)$ is the failure state of $q$ in the sense of Remark 4.5. In the other case the failure state of $q$ in the sense of Remark 4.5 is $p = \delta_T(p', \sigma')$.
We show Claim 2. Let $\sigma \in \Sigma$. If $\delta(q, \sigma)$ is defined we have $\delta_T(q, \sigma) = \delta(q, \sigma) = \delta_f(q, \sigma)$. In the other case, first consider the case where $q$ is final. Then $q_0$ is the failure state for $q$ and $\delta_T(q, \sigma) = \delta_T(q_0, \sigma) = \delta_f(q_0, \sigma) = \delta_f(f(q), \sigma) = \delta_f(q, \sigma)$. By induction hypothesis $d(\delta_f(q_0, \sigma)) \leq 1$ and thus $d(\delta_T(q, \sigma)) \leq 1 < d(q) + 1$. If $q$ is not final, then $p = \delta_T(p', \sigma')$ is the failure state for $q$. We have seen that $d(p) < d(q)$. By induction hypothesis $\delta_T(p, \sigma) = \delta_f(p, \sigma)$. Claim 1 shows that $p = f(q)$. Hence $\delta_T(q, \sigma) = \delta_T(p, \sigma) = \delta_f(p, \sigma) = \delta_f(f(q), \sigma) = \delta_f(q, \sigma)$. We have $d(\delta_T(q, \sigma)) = d(\delta_T(p, \sigma)) \leq d(p) + 1 \leq d(q) + 1$.

Implementation and complexity analysis
Algorithm 1 presents the pseudo-code of the construction. Clearly, the number of states of the f-transducer is bounded by $\|D_I\| + 1$ and the number of transitions is bounded by $2\|D_I\|$ and does not depend on the alphabet size, where $\|D_I\| = \sum_{\langle \alpha, \beta \rangle \in D} |\alpha|$. The complexity of the trie construction is $O(\|D\|)$, where $\|D\| = \sum_{\langle \alpha, \beta \rangle \in D} (|\alpha| + |\beta|)$. If each output string is represented in the standard way as a sequence of symbols, the space complexity of $\varphi$ can get cubic in $\|D\|$. Using the technique for tree-based output string representation introduced in (Mihov and Schulz, 2007) the space complexity remains linear in $\|D\|$. In that case the space and time complexity of Algorithm 1 is $O(\|D\|)$. In contrast, the space and time complexity of the construction presented in (Mihov and Schulz, 2007) is $O(\|D\| \cdot |\Sigma|)$ for an alphabet $\Sigma$. The complexity for rewriting a text $t$ to $t'$ is $O(|t| + |t'|)$.
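The size measures used in these bounds are easy to compute for a concrete dictionary. The following Python sketch uses a hypothetical helper `size_measures` and an assumed five-entry mapping (chosen to be consistent with the output reported in Example 4.4); it only evaluates the two sums and the stated bounds and makes no claim about the actual size of any constructed device.

```python
def size_measures(dictionary):
    """Compute ||D_I||, ||D|| and the bounds on states and transitions."""
    d_i = sum(len(alpha) for alpha in dictionary)
    d = sum(len(alpha) + len(beta) for alpha, beta in dictionary.items())
    return {
        "D_I": d_i,                   # total length of the entries
        "D": d,                       # total length of entries and replacements
        "state_bound": d_i + 1,       # bound on the number of states
        "transition_bound": 2 * d_i,  # bound on the number of transitions
    }


# Assumed mapping, used only to illustrate the arithmetic.
ASSUMED = {"a": "1", "ab": "2", "abcc": "3", "babc": "4", "c": "5"}
print(size_measures(ASSUMED))
```

For this mapping the entry lengths sum to 12 and the replacement lengths to 5, so the bounds are 13 states and 24 transitions; none of these quantities involves the size of the alphabet.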

Application for Online Hyperlinking Using Link Databases
Document repositories often come with a large number of internal or external links that lead from concepts (entities, references, etc.) mentioned in the text to other web pages. A well-known example is Wikipedia. On a Wikipedia page, all concepts of the text that are described in more detail in some other Wikipedia article are highlighted. When clicking on a concept the visitor is led to the relevant Wikipedia page, using an internal link (wikilink). In the same way, arbitrary texts can be linked with Wikipedia.
To create a wikilink, concept names such as "United Kingdom" or "Kingdom University" mentioned in the text are replaced by anchor elements of the form
<a href="/wiki/United_Kingdom">United Kingdom</a>
<a href="/wiki/Kingdom_University">Kingdom University</a>
In order to automate this form of hyperlinking, rewriting dictionaries may be used that list relevant concept names (e.g. "United Kingdom") together with the corresponding anchor elements.
Algorithm 1 Construction of an f-transducer for a given alphabet Σ and rewrite dictionary D.
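A rewriting dictionary of this kind could be generated from a list of page titles as in the following Python sketch. The function name `wikilink_dictionary` is hypothetical, and the rule used to derive the link target (spaces replaced by underscores, followed by percent-encoding) is an assumption about the URL scheme rather than something specified in the text.

```python
from html import escape
from urllib.parse import quote


def wikilink_dictionary(titles):
    """Map each concept name to an anchor element linking to its page."""
    dictionary = {}
    for title in titles:
        # Assumed URL scheme: /wiki/ followed by the title with underscores.
        href = "/wiki/" + quote(title.replace(" ", "_"))
        dictionary[title] = f'<a href="{href}">{escape(title)}</a>'
    return dictionary


links = wikilink_dictionary(["United Kingdom", "Kingdom University"])
print(links["United Kingdom"])
```

In a text containing "United Kingdom University" both entries occur and overlap, which is exactly the kind of conflict that the leftmost-longest strategy resolves in favour of "United Kingdom".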
Algorithm 1 has been applied for constructing a rewrite f-transducer for annotating texts with the anchor elements of practically all Wikipedia concept names in English (there can be multiple concept names for one Wikipedia page). The rewrite dictionary has 8,083,029 entries and occupies 831 MB. The corresponding f-transducer was constructed by a non-optimized implementation of Algorithm 1 using 64 bits for pointer, number and character representation. The resulting f-transducer has 69,037,940 states and occupies 11.5 GB. On an Intel Xeon CPU at 2.40 GHz the construction time is 63 minutes. Using the constructed f-transducer, 100 MB of text are rewritten in approximately 19 seconds. This makes it possible to maintain huge knowledge bases in memory and rewrite texts in an online manner. In the same way, texts could be linked with any other collection of linked open data.
In order to compare the new construction with the construction presented in (Mihov and Schulz, 2007) we used a 5 MB English correction dictionary with 220,231 entries. All words in the correction dictionary are over the 26 lower case English letters. The rewrite f-transducer is constructed in 12 seconds and occupies 79 MB. The corresponding subsequential transducer is constructed in 283 seconds and occupies 1170 MB (cf. Table 1).

                      time     size
subseq. transducer    283 s    1170 MB
f-transducer          12 s     79 MB
Table 1: Translation of a 5 MB rewrite dictionary into a text rewriting device. Construction times and sizes when using a subsequential transducer and an f-transducer.
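The reported figures can be put into relation with a few lines of arithmetic. The Python snippet below only derives ratios from the numbers stated in the text; the variable names are chosen for this illustration.

```python
# Figures reported in the text (Table 1 and the Wikipedia experiment).
subseq_time_s, subseq_size_mb = 283, 1170
ft_time_s, ft_size_mb = 12, 79

construction_speedup = subseq_time_s / ft_time_s
size_reduction = subseq_size_mb / ft_size_mb
throughput_mb_per_s = 100 / 19  # 100 MB of text in approximately 19 seconds

print(f"construction: {construction_speedup:.1f}x faster")
print(f"size: {size_reduction:.1f}x smaller")
print(f"rewriting throughput: {throughput_mb_per_s:.1f} MB/s")
```

For the correction dictionary the f-transducer is thus built roughly 24 times faster and is roughly 15 times smaller than the subsequential transducer, and the Wikipedia f-transducer rewrites text at roughly 5 MB per second.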

Conclusion
In this paper we introduced the concept of failure transducers. We presented in detail the construction of f-transducers for dictionary-based text rewriting under the leftmost-longest match strategy for conflict resolution. The advantage of the new construction compared to the method presented in (Mihov and Schulz, 2007) is space and construction time economy. The new construction avoids the increase in complexity which is caused by the enormous number of transitions needed for subsequential transducers when using rewrite dictionaries over large alphabets. In our construction the number of ordinary transitions and the number of failure transitions of the f-transducer are bounded by the sum of the lengths of input words in the dictionary and do not depend on the alphabet size. This made it possible to construct an f-transducer for annotating all concept names of the English Wikipedia in a text in an online manner on a personal computer. The same technique can be applied to similar forms of knowledge-based text rewriting. In this way, interesting items in input texts can be linked "on demand" in a user-driven online manner to distinct targets such as lexica, authority pages, product catalogues and other resources.