Finite State Transducer Calculus for Whole Word Morphology

Research on machine learning of morphology often involves formulating morphological descriptions directly on surface forms of words. As the established two-level morphology paradigm requires knowledge of the underlying structure, it is not widely used in such settings. In this paper, we propose a formalism describing structural relationships between words based on theories of morphology that reject the notions of internal word structure and morpheme. The formalism covers a wide variety of morphological phenomena (including non-concatenative ones like stem vowel alternation) without the need for workarounds and extensions. Furthermore, we show that morphological rules formulated in this way can be easily translated to FSTs, which enables us to derive performant approaches to morphological analysis, generation and automatic rule discovery.


Introduction
In computational linguistics, morphological analysis is usually understood as segmenting words into smaller meaningful units, called morphs. There exists a well-established computational model for such analysis, called two-level morphology (Koskenniemi, 1983; Beesley and Karttunen, 2003). It models the mapping between the surface forms of words and the morph sequences using handwritten rules, which are compiled to Finite State Transducers. This allows lexicon and rules to be composed into an efficient morphological analyzer. Examples of such analyzers include Omorfi for Finnish (Pirinen, 2015), Morphisto for German (Zielinski and Simon, 2008) and TRMorph for Turkish (Çöltekin, 2010).
However, research coming from the machine learning side often requires models that describe string transformations between surface forms directly, without referring to any underlying structures which cannot be observed in the data and are difficult to infer by a learning algorithm. Such transformations can also be described and implemented as finite-state transducers. Despite that, a standardized model of this kind of morphological description seems to be lacking. Instead, many authors develop their own models and implementations for the purpose of a concrete learning algorithm. With some exceptions, the design, implementation and performance of the string processing algorithms are usually not described in detail and the approaches used for that are sometimes suboptimal.
In this paper, we present a finite-state computational model of string transformations on surface forms based on a linguistic theory called Whole Word Morphology. We first review research on machine learning of morphology which motivates the need for such a model (Sec. 2). In Sec. 3, we describe the formalism and its linguistic foundations, and in Sec. 4, we present the implementation of the formalism within the FST calculus. Sec. 5 contains a procedure for automatic rule discovery from unannotated data, while in Sec. 6, we measure the performance of our implementation of the model.

Motivation and Related Work
Recent research on machine learning of morphology increasingly favors models that describe transformations on whole words instead of representing words as concatenations of morphs. Arguably the most important reason for this is that morph boundaries are often not clearly visible in surface forms due to morphophonology and orthography. In the following, we review some of the papers utilizing such transformational models. Our focus here is not the learning algorithm (which is usually the main focus of the respective paper), but the assumed model of morphology, together with its linguistic and computational foundations.
Neuvel and Fulop (2002) present a computational model of morphology based on Whole Word Morphology (Ford et al., 1997). Morphology is described in terms of patterns which summarize structural similarities and differences between pairs of words. The patterns consist of constant elements and wildcards: for example, the relationship within the pair (receive, reception) would be expressed as /Xceive/ ↔ /Xception/. In order to discover such rules automatically, the authors use rather simple string processing algorithms: they try matching every word to every other and check whether the beginnings or the ends of the words match. They subsequently compute an alignment by anchoring the words either at their beginning or end.
Wicentowski (2002) proposes a transformational model designed for learning mappings between inflected forms and lemmas. It is based on splitting words into seven parts and describing the changes in each part separately. In addition to prefixation and suffixation, it aims to cover phenomena such as internal vowel changes or changes at the boundary between stem and prefix/suffix (e.g. hop ∼ hopping), which are attributed to a separate segment.
Lindén (2008, 2009) likewise attempts to model the transformation between base and inflected form part by part, but adopts a simpler, three-way split into prefix, stem and suffix. Lindén (2009) mentions that the model was implemented as a cascade of Finite State Transducers.
Botha and Blunsom (2013) propose a model of morphology aimed especially at capturing the templatic morphology found in Semitic languages. The model is based on Simple Range Concatenating Grammars (SRCGs), which are a mildly context-sensitive class of formal grammars. It is conceived as an extension of the purely concatenative model, which can be represented by a context-free (or perhaps even regular) grammar.
[...] designed with the goal of efficient implementation of handwritten grammars and, despite some research in this direction (Theron and Cloete, 1997; Koskenniemi, 2013), is not well suited to the machine learning scenario.
Durrett and DeNero (2013) and Ahlberg et al. (2014) present two different approaches to learning inflection from complete paradigm tables. The input data in such a setting are lists of tuples (b, w, t), where b is the base word (lemma), w the inflected word and t a tag, i.e. a bundle of inflectional features. An important component of learning algorithms for this task is an appropriate model of string transformations from b to w. Durrett and DeNero (2013) use a semi-Markov log-linear model to model the probability of application of individual transformations (like prefix, stem or suffix change) independently, while Ahlberg et al. (2014) model string transformations on whole words in the form of patterns with wildcards. We note that the string transformation model of Durrett and DeNero (2013) is tightly coupled to the machine learning method applied by the authors, while the model of Ahlberg et al. (2014) is more general and independent of the classification method (in this case, memory-based classification).
With works like (Soricut and Och, 2015; Narasimhan et al., 2015; Luo et al., 2017), we can observe a shift from segmentation to word-based string transformations also in the area of unsupervised learning of morphology. Currently, these works appear to adopt very simple transformation models that only involve affixation. On the other hand, Janicki (2015) and Sumalvico (2017) present a probabilistic model suitable for unsupervised learning, which is based on Whole Word Morphology and describes morphology in terms of whole-word transformation patterns.
As a conclusion from the above literature review, we recognize a need for a standardized model of morphological relationships between surface forms of words. As most of the models presented above are motivated by the need to cover non-concatenative phenomena, especially internal vowel changes and Semitic templatic morphology, the model we aim at should be able to handle those phenomena in a natural and general fashion. Following Neuvel and Fulop (2002), we see Whole Word Morphology as the right linguistic foundation for such a formalism, and following Lindén (2008, 2009), we consider FSTs to be the right tool for implementing string transformations efficiently. Thus, the contribution of the present paper is twofold:
1. A formal definition of a transformational model of morphology, similar to the ones employed by (Neuvel and Fulop, 2002; Ahlberg et al., 2014; Janicki, 2015),
2. An implementation of the model based on the FST calculus.
The Formalism

Definitions
We base our formalism on the linguistic theory of Whole Word Morphology (henceforth, WWM) introduced by Ford et al. (1997) and Neuvel and Singh (2002). It models structural similarities in form and meaning between words in the form of rules, which are expressed as patterns containing wildcards.
For example, the relationship within the French pair (chanteur, chanteuse) can be expressed by the following rule:
$$/X\text{œr}/_{\text{N.MASC}} \leftrightarrow /X\text{øz}/_{\text{N.FEM}} \tag{1}$$
In the above rule, $X$ denotes a variable which can be instantiated with any string of phonemes and represents the common part of both words. The units inside slashes refer to whole words in their surface forms.
In general, we express a morphological rule with $n$ variables as follows:
$$/a_0 X_1 a_1 X_2 \ldots X_n a_n/ \rightarrow /b_0 X_1 b_1 X_2 \ldots X_n b_n/ \tag{2}$$
The elements $a_i$ and $b_i$ are constants (literal strings), which usually represent the differing parts of words on the left-hand and right-hand side of the rule. The elements $X_i$ are variables (wildcards), which represent the part that is preserved by the rule, but varies from pair to pair. Additionally, the following conditions must be satisfied:
1. The variables must be retained in the same order on both sides of the rule.
2. For $0 < i < n$, either $a_i$ or $b_i$ has to be nonempty.
Because of the first condition, we can represent such a rule as a vector of $2n + 2$ strings: $a_0, a_1, \ldots, a_n, b_0, b_1, \ldots, b_n$. Contrary to (1), which is a relational description and thus uses a bidirectional arrow, we formulate our rules as having a privileged direction. Although most rules can be applied in both directions, the productivity of back-formation is usually much lower, so that specifying a direction seems linguistically plausible. Rules which are similarly productive in both directions can be modeled by including the reverse rule separately in the grammar.
As an illustration of (2), the rule expressing the relationship between the German pairs (singen, gesungen), (klingen, geklungen), (trinken, getrunken) could have the following form:
$$/X_1 \text{i} X_2 \text{en}/ \rightarrow /\text{ge} X_1 \text{u} X_2 \text{en}/ \tag{3}$$
The rule could also contain more constant elements to express the necessary conditions for its application:
$$/X_1 \text{in} X_2 \text{en}/ \rightarrow /\text{ge} X_1 \text{un} X_2 \text{en}/ \tag{4}$$
With each rule, we can associate a function $r$, which transforms a word matching the left-hand side of the rule into the set of corresponding words matching the right-hand side:
$$r(v) = \{ b_0 X_1 b_1 \ldots X_n b_n : X_1, \ldots, X_n \in \Sigma^+ \wedge v = a_0 X_1 a_1 \ldots X_n a_n \} \tag{5}$$
Note that the outcome of the rule application is a set of words, rather than a single word. In general, the rule application might result in multiple different words, because there might be different ways of splitting the word into the sequence $a_0 X_1 a_1 \ldots X_n a_n$. For example, the application of the rule $/X_1 \text{a} X_2/ \rightarrow /X_1 \text{ä} X_2 \text{e}/$ to the German word Kanal results in the set {Känale, Kanäle}.
If the word does not match the left-hand side of the rule, the rightmost condition in (5) is never fulfilled and the result is the empty set. Thus, the function $r$ is defined on the whole of $\Sigma^+$.
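The rule application function can be sketched in a few lines of Python. This is only an illustration of the definition, not the FST-based implementation discussed later; the function name `apply_rule` and the representation of a rule as two lists of constants are assumptions made for this example.

```python
from typing import Sequence, Set


def apply_rule(a: Sequence[str], b: Sequence[str], word: str, i: int = 0) -> Set[str]:
    """Apply /a0 X1 a1 ... Xn an/ -> /b0 X1 b1 ... Xn bn/ to `word`.

    `a` and `b` hold the constants of the left- and right-hand side.
    Every variable must be instantiated with a nonempty string, so all
    possible splits of the word are enumerated.
    """
    if not word.startswith(a[i]):
        return set()
    rest = word[len(a[i]):]
    if i == len(a) - 1:
        # Last constant: the whole word must be consumed.
        return {b[i]} if rest == "" else set()
    # Try every nonempty instantiation rest[:k] of the next variable.
    return {
        b[i] + rest[:k] + tail
        for k in range(1, len(rest) + 1)
        for tail in apply_rule(a, b, rest[k:], i + 1)
    }


if __name__ == "__main__":
    print(apply_rule(["", "a", ""], ["", "ä", "e"], "Kanal"))
    print(apply_rule(["", "i", "en"], ["ge", "u", "en"], "singen"))
```

A word that does not match the left-hand side simply yields an empty set, which mirrors the remark that the function is total on nonempty strings.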

Coverage of Morphological Phenomena
In addition to covering affixation, circumfixation and stem vowel alternations, as shown already in the previous section, the following further morphological phenomena can be handled by the formalism:
Templatic morphology. A relationship between pairs like the Arabic (kataba, kutiba) can be generalized as the following rule:
$$/X_1 \text{a} X_2 \text{a} X_3 \text{a}/ \rightarrow /X_1 \text{u} X_2 \text{i} X_3 \text{a}/ \tag{6}$$
Although the formalism does not provide a way to restrict the instantiations of variables to a single consonant, it could be easily extended to express such restrictions on variables in a form similar to regular expressions.
Compounding. The proponents of Whole Word Morphology and similar theories explicitly reject the analysis of compounds as 'words composed of multiple words' (Singh and Dasgupta, 2003; Starosta, 2003). In consequence, compounds are also analyzed as related to a single word, while the other part is considered to be a morphological constant. For example, the English word blackberry would be related to black via the following rule:
$$/X/_{\text{N/ADJ}} \rightarrow /X\text{berry}/_{\text{N}} \tag{7}$$
According to (Singh and Dasgupta, 2003; Starosta, 2003), the relationship between a rule like (7) and the word 'berry' is purely etymological and thus not a part of a synchronic description of morphology. This claim is supported by the fact that newly coined compounds (in languages that exhibit compounding) virtually always involve at least one part that is already known as 'compound-forming', rather than combining two arbitrary words. Indeed, in morphological analyzers based on two-level morphology, the cyclicity used to model compounding often causes massive overgeneration.

WWM Rules as FSTs
A rule defined as in (2) can be easily converted to an FST. The general scheme for that is given in Fig. 1. The arrows represent concatenation and each rectangular block represents a transducer. There are two kinds of blocks: transducers mapping corresponding constants, like $a_0 : b_0$, and transducers representing the variables. The latter are simply identity transducers accepting $\Sigma^+$. Figure 2 shows a concrete FST corresponding to the rule (4).
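To make the block scheme concrete, the following sketch builds a toy transducer as a plain list of arcs and simulates it directly. It does not use HFST; the arc representation, the `ANY` wildcard label and the function names `rule_to_fst` and `transduce` are assumptions introduced for this illustration only.

```python
from itertools import zip_longest

ANY = object()  # label of an identity arc that matches any single symbol


def rule_to_fst(a, b):
    """Build arcs (src, input, output, dst) for the rule with constants a, b.

    Constant blocks a_i:b_i map symbols pairwise, padding the shorter side
    with epsilons (empty strings). Variable blocks are identity transducers
    accepting Sigma^+: one obligatory ANY arc followed by an ANY loop.
    Returns the arc list and the single final state.
    """
    arcs, state = [], 0
    for i, (ai, bi) in enumerate(zip(a, b)):
        if i > 0:
            arcs.append((state, ANY, ANY, state + 1))
            arcs.append((state + 1, ANY, ANY, state + 1))
            state += 1
        for x, y in zip_longest(ai, bi, fillvalue=""):
            arcs.append((state, x, y, state + 1))
            state += 1
    return arcs, state


def transduce(fst, word):
    """Return all outputs of the transducer for the input `word`."""
    arcs, final = fst
    results, stack = set(), [(0, 0, "")]
    while stack:
        state, pos, out = stack.pop()
        if state == final and pos == len(word):
            results.add(out)
        for src, x, y, dst in arcs:
            if src != state:
                continue
            if x is ANY:
                if pos < len(word):
                    stack.append((dst, pos + 1, out + word[pos]))
            elif x == "":
                # Epsilon input: emit output without consuming a symbol.
                stack.append((dst, pos, out + y))
            elif word.startswith(x, pos):
                stack.append((dst, pos + 1, out + y))
    return results


if __name__ == "__main__":
    fst = rule_to_fst(["", "in", "en"], ["ge", "un", "en"])
    print(transduce(fst, "trinken"))
```

The simulation terminates because epsilon arcs only ever lead forward to a new state, while the loops on variable states consume one input symbol each.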

Analysis
There is no concept of a 'morphological analysis' in WWM. Each word is treated as an independent unit of language. However, given a word, we might be interested in its structural relationships to other words.
Let $R$ be the set of rules found in the morphology of a language of interest and let $T_r$ be a transducer corresponding to rule $r$. The disjunction of all rules, $T_R$, yields a transducer accepting morphologically related pairs:
$$T_R = \bigcup_{r \in R} T_r$$
Further, let $V$ denote a vocabulary and $T_V$ the identity transducer corresponding to $V$. With the following composition, we obtain a transducer capable of mapping all words from $V$ to all their possible derivations:
$$T_A = T_V \circ T_R$$
$T_A$ can be called a 'WWM analyzer'. A lookup of an unknown word $v$ in $T_A$ yields all words from the known vocabulary from which $v$ can be derived. Furthermore, the three-way composition $T_V \circ T_R \circ T_V$ gives us all pairs of related words from $V$.

Generation
Another common question of morphology is: given a vocabulary $V$ and a set of rules $R$, what further words can be postulated? The identity transducer for such new words, $T_N$, is obtained from the following formula:
$$T_N = T_A{\downarrow} \setminus T_V$$
where $T_A{\downarrow}$ denotes the output projection of $T_A$ and $\setminus$ denotes subtraction.
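For a small finite vocabulary, the analysis and generation operations can be imitated with ordinary Python sets, which may help to see what each transducer expression computes. The sketch below is an assumption-laden stand-in: relations are sets of string pairs rather than transducers, and the names `apply_rule`, `analyzer`, `related_pairs` and `new_words` are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Rule = Tuple[List[str], List[str]]  # (a_0..a_n, b_0..b_n)


def apply_rule(rule: Rule, word: str, i: int = 0) -> Set[str]:
    """Rule application r(word); variables must be nonempty."""
    a, b = rule
    if not word.startswith(a[i]):
        return set()
    rest = word[len(a[i]):]
    if i == len(a) - 1:
        return {b[i]} if rest == "" else set()
    return {
        b[i] + rest[:k] + tail
        for k in range(1, len(rest) + 1)
        for tail in apply_rule(rule, rest[k:], i + 1)
    }


def analyzer(vocab: Iterable[str], rules: Iterable[Rule]) -> Set[Tuple[str, str]]:
    """Finite stand-in for T_A: pairs (known word, derived word)."""
    rules = list(rules)
    return {(v, w) for v in vocab for r in rules for w in apply_rule(r, v)}


def related_pairs(vocab: Set[str], rules: Iterable[Rule]) -> Set[Tuple[str, str]]:
    """Stand-in for the three-way composition: both words are in the vocabulary."""
    return {(v, w) for v, w in analyzer(vocab, rules) if w in vocab}


def new_words(vocab: Set[str], rules: Iterable[Rule]) -> Set[str]:
    """Stand-in for T_N: output projection of T_A minus the vocabulary."""
    return {w for _, w in analyzer(vocab, rules)} - set(vocab)


if __name__ == "__main__":
    vocab = {"Kanal", "Kanäle", "Hund"}
    rules = [(["", "a", ""], ["", "ä", "e"])]
    print(related_pairs(vocab, rules))
    print(new_words(vocab, rules))
```

On this toy input the generated set contains implausible forms alongside the attested plural, which illustrates why postulated words need further statistical filtering.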

Automatic Rule Discovery
As shown in Sec. 3, our definition of a rule is general enough to capture many morphological phenomena, including some important non-concatenative ones. On the other hand, the resulting computational model is simple enough to allow for completely unsupervised rule discovery without prior linguistic knowledge. In this section, we show how to achieve this in two stages: first, we identify pairs of string-similar words in the vocabulary. Then, we extract candidate rules from each such pair. Frequent patterns are good candidates for rules, which can be passed to a further statistical model, like the one of (Janicki, 2015; Sumalvico, 2017).

Finding Pairs of Similar Words
A plausible and widely used string similarity measure is edit distance (Levenshtein, 1966). Using the Fast Similarity Search algorithm (Bocek et al., 2007), we are able to identify pairs of words with edit distance at most $k$ without comparing each word to every other. The algorithm works by generating a deletion neighborhood of each word, consisting of strings that can be obtained from that word by deleting up to $k$ characters. The resulting list of pairs (word, substring) is sorted according to the substring. Observe that words with edit distance $\leq k$ are guaranteed to share a common substring in their deletion neighborhoods, although words sharing a common substring might also have edit distance $> k$. Thus, we treat pairs of words sharing a common substring as candidates, for which edit distance has to be computed by the usual means.
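The candidate-and-verify idea can be sketched as follows. This is a simplified illustration in the spirit of FastSS rather than a faithful reimplementation: it groups words in an in-memory dictionary instead of sorting a list, and all function names are assumptions for this example.

```python
from collections import defaultdict
from itertools import combinations


def deletion_neighborhood(word, k):
    """All strings obtained from `word` by deleting up to k characters."""
    result = set()
    for d in range(min(k, len(word)) + 1):
        for positions in combinations(range(len(word)), d):
            drop = set(positions)
            result.add("".join(c for i, c in enumerate(word) if i not in drop))
    return result


def candidate_pairs(words, k):
    """Pairs of words whose deletion neighborhoods intersect."""
    index = defaultdict(set)
    for w in words:
        for s in deletion_neighborhood(w, k):
            index[s].add(w)
    pairs = set()
    for bucket in index.values():
        pairs.update(combinations(sorted(bucket), 2))
    return pairs


def edit_distance(s, t):
    """Levenshtein distance by row-wise dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def similar_pairs(words, k):
    """Candidates filtered by the actual edit distance."""
    return {(v, w) for v, w in candidate_pairs(words, k) if edit_distance(v, w) <= k}


if __name__ == "__main__":
    print(similar_pairs(["sing", "sang", "song", "ring", "table"], 1))
```

The pair (ab, ba) shows why verification is needed: both words reduce to the shared string "a" with one deletion, yet their edit distance is 2.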
For the purpose of discovering potential morphological rules, it is reasonable to modify the notion of edit distance. Firstly, morphological rules usually operate on groups of consecutive letters, rather than single letters independently, so deletion or substitution of a segment of consecutive letters should yield higher similarity than deletion or substitution of the same number of non-consecutive letters. Secondly, although we are going to permit word-internal alternations, more change should be permitted at the beginning and at the end of words, since that is where most morphological rules operate. Bearing in mind the representation (2), let $l_{\text{affix}}$ denote the maximum length of a morphological constant at the beginning or the end of a word ($a_0, b_0, a_n, b_n$ in (2)), $l_{\text{infix}}$ the maximum length of a morphological constant inside the word ($a_i, b_i$ for $0 < i < n$ in (2)) and $k_{\max}$ the maximum number of variables. In order to generate pairs which are related by a rule satisfying these bounds, we obtain the following constraints on the deletion neighborhood: deleting up to $l_{\text{affix}}$ consecutive letters at the beginning and at the end of the word, and up to $l_{\text{infix}}$ consecutive letters in at most $k_{\max} - 1$ slots inside the word. The usual setting for those parameters, which covers the vast majority of morphological rules encountered in practice, is
$$l_{\text{affix}} = 5, \quad l_{\text{infix}} = 3, \quad k_{\max} = 2$$
Such settings allow for the deletion of up to 13 letters in total, so that even for medium-length words all pairs would be considered similar. In order to prevent this, we introduce an additional constraint: the total number of deleted characters must be smaller than half of the word's length. In this way, we can consider long affixes, but only if enough of the word is still left to form a recognizable stem.
With all those constraints, computing a deletion neighborhood of a word becomes a complex operation. It is therefore helpful to visualize and implement it using transducers. We will construct the transducer $S$ mapping words to their deletion neighborhoods as a composition of two simpler transducers: $S = S_1 \circ S_2$. The transducer $S_1$ (Fig. 3) performs the deletions, substituting a special symbol $\delta$ for each deleted character. The transducer consists of segments, corresponding to the deleted sequences: states 0-5 represent the prefix, 10-15 the suffix and 7-9 the infix. Between each pair of segments, an arbitrary number of identity mappings is performed (state sequences 5-6 and 9-10). The epsilon transitions, for example from states 0-4 to 5, correspond to a less-than-maximum number of deletions in a given slot. It can easily be seen that changing e.g. the parameter $l_{\text{affix}}$ simply corresponds to altering the length of the top and bottom chains, just as $l_{\text{infix}}$ corresponds to the length of the middle chain and $k_{\max} - 1$ to the number of such middle chains.
The transducer $S_2$ (Fig. 4) takes the output of $S_1$ and checks whether the number of deletions is smaller than the number of remaining characters. As the general formulation of this problem cannot be solved by a finite-state machine, it requires a bound on word length. In our implementation, we restrict the maximum word length to 20 characters, but it is easy to change this parameter. The states of $S_2$ correspond to the difference between the number of letters and the number of deletions seen so far. The states above the initial state correspond to positive, and the ones below to negative values. Furthermore, $S_2$ removes the deletion symbols and returns the substring consisting of the remaining letters.
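The combined effect of the two transducers can be approximated without any finite-state machinery. The sketch below enumerates the constrained deletion neighborhood directly; it assumes that at least one letter is retained between two deleted segments (in line with variables being nonempty), and the function names are hypothetical. Since no transducer is involved, no bound on word length is needed here.

```python
def infix_variants(s, slots, l_infix):
    """Yield (remaining string, number of deletions) for deleting up to
    `slots` internal segments of at most `l_infix` letters from `s`.
    At least one letter is kept before, between and after the segments."""
    yield s, 0
    if slots == 0:
        return
    for i in range(1, len(s) - 1):
        for length in range(1, l_infix + 1):
            if i + length >= len(s):
                break
            for rest, d in infix_variants(s[i + length:], slots - 1, l_infix):
                yield s[:i] + rest, d + length


def constrained_neighborhood(word, l_affix=5, l_infix=3, k_max=2):
    """Deletion neighborhood under the affix, infix and half-length constraints."""
    n = len(word)
    result = set()
    for p in range(min(l_affix, n) + 1):          # letters deleted at the beginning
        for s in range(min(l_affix, n - p) + 1):  # letters deleted at the end
            middle = word[p:n - s]
            for rest, d in infix_variants(middle, k_max - 1, l_infix):
                # Counterpart of the second transducer's check:
                # fewer deletions than remaining letters.
                if 2 * (p + s + d) < n:
                    result.add(rest)
    return result


if __name__ == "__main__":
    shared = constrained_neighborhood("singen") & constrained_neighborhood("gesungen")
    print(sorted(shared))
```

With the default parameters, singen and gesungen share the string "sngen", so the pair becomes a candidate for rule extraction.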
We can now generate all pairs of similar words from a lexicon automaton $L$ by performing the following composition:
$$L \circ S \circ S^{-1} \circ L$$
There are various ways to implement this in practice. Computing the composition directly is usually not feasible because of high memory complexity. One possibility is to use $S$ for substring generation, but otherwise proceed as in the original FastSS algorithm: store the words and substrings in an index structure, either on disk or in memory, then retrieve words for each substring.
Another possibility is to use $S$ to generate substrings for a given word and then look the substrings up in the transducer $(L \circ S)^{-1}$ to obtain similar words. The latter composition can be computed statically. We additionally convert the resulting transducer to the HFST optimized lookup format (Silfverberg and Lindén, 2009). While the lookup approach is still significantly slower, it has the advantage of providing a way to retrieve all words $w'$ similar to a given word $w$ at once. It is thus better suited for parallelization, especially in case the pairs $(w, w')$ are subject to further processing.

Extraction of Rule Candidates
Given a pair $(w, w')$ of string-similar words, we want to extract morphological rules modeling the difference between those words. For this purpose, we first align the words on a character-to-character basis using the well-known dynamic programming algorithm for computing edit distance (Wagner and Fischer, 1974). Then, we attribute each character mapping either to a morphological constant or a variable, in a way that fulfills the constraints on $l_{\text{affix}}$, $l_{\text{infix}}$ and $k_{\max}$. The candidate rules are constructed incrementally while iterating over the alignment and unfinished rules are stored in a priority queue. In case an aligned character pair can be attributed either to a constant or to a variable, both possibilities are stored in the queue, so that at the end we obtain multiple rules with varying degrees of generality. For example, the rules extracted from the German pair (trifft, getroffen) include $/X_1 \text{i} X_2 \text{t}/ \rightarrow /\text{ge} X_1 \text{o} X_2 \text{en}/$ (the most general rule), as well as e.g. $/X\text{ifft}/ \rightarrow /\text{ge}X\text{offen}/$.
Table 1 shows example rules extracted from a word list coming from German Wikipedia. While the top of the list consists entirely of morphological patterns, the bottom of the table shows that patterns resulting from accidental word similarities can also become frequent enough to be confused with morphological rules. Thus, this approach identifies rule candidates, which have to be further filtered based on criteria other than mere frequency.
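A much reduced version of this step can be sketched with the Python standard library. Note the substitutions: instead of a Wagner-Fischer alignment and a priority queue, the sketch uses the matching blocks of `difflib.SequenceMatcher`, and it returns only one rule per pair (the one in which every matched block becomes a variable). The function names and default parameters are assumptions for this example.

```python
from collections import Counter
from difflib import SequenceMatcher


def most_general_rule(w1, w2, l_affix=5, l_infix=3, k_max=2):
    """Return (a, b), the constants of a rule relating w1 to w2, or None
    if the pair shares no material or violates the length constraints."""
    blocks = SequenceMatcher(None, w1, w2, autojunk=False).get_matching_blocks()
    a, b, i, j = [], [], 0, 0
    for blk in blocks:  # the last block is a zero-length sentinel
        a.append(w1[i:blk.a])
        b.append(w2[j:blk.b])
        i, j = blk.a + blk.size, blk.b + blk.size
    n = len(a) - 1  # number of variables
    if n == 0 or n > k_max:
        return None
    if any(len(c) > l_affix for c in (a[0], b[0], a[-1], b[-1])):
        return None
    if any(len(c) > l_infix for c in a[1:-1] + b[1:-1]):
        return None
    return a, b


def format_rule(a, b):
    """Render a rule in the /.../ -> /.../ notation with numbered variables."""
    def side(consts):
        out = consts[0]
        for idx, c in enumerate(consts[1:], 1):
            out += f"X{idx}" + c
        return "/" + out + "/"
    return side(a) + " -> " + side(b)


if __name__ == "__main__":
    pairs = [("singen", "gesungen"), ("klingen", "geklungen"), ("trifft", "getroffen")]
    counts = Counter(format_rule(*most_general_rule(v, w)) for v, w in pairs)
    print(counts.most_common())
```

Counting the rendered rules over many pairs gives the frequency ranking from which rule candidates are selected; the more specific variants described above would require enumerating alternative attributions of matched characters to constants.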

Experiments
We have implemented the algorithms described in the previous section using the HFST library (Lindén et al., 2011). Furthermore, we conducted experiments realizing the algebraic operations described in Sec. 4 and the rule discovery procedure described in Sec. 5. The results demonstrate that our model is suitable for building analyzers based on the Whole Word Morphology paradigm and that the required computational resources are readily attainable.
First, we run the rule discovery procedure on word lists extracted from German Wikipedia. The generation of pairs of similar words and the subsequent rule extraction are implemented in a parallelized fashion. Table 2 shows the computation times for various sizes of input vocabulary and numbers of processes. The results demonstrate that this step is feasible for input data of as much as 150,000 words (and probably even somewhat larger). In our view, this is enough to discover the vast majority of productive morphological rules.
We take the disjunction of the several thousand most frequent rules to construct a rule transducer $T_R$, which is used in the algebraic operations shown in Table 3. Most operations are realized within at most several minutes, the longest one being the construction of the largest generator in slightly above 11 minutes.
Note that the computation times reported in Table 3 are much shorter than the ones in Table 2. Moreover, the former appear to increase linearly in both $|V|$ and $|R|$. Thus, although the limits on the vocabulary size in the rule discovery procedure are quite tight, once we have discovered the rules (or obtained them in another way, e.g. manually written), we can apply the transducer to find pairs of related words in much larger lexica. Using 3-way composition (Allauzen and Mohri, 2008) for computing $T_A \circ T_V$ could probably further improve the analysis of a new lexicon.

Conclusion
We have presented a formalism allowing for the description of morphological regularities as transformational patterns on whole words in their surface forms. The formalism is grounded in linguistic theories rejecting the notion of internal structure of words and can be especially useful in the context of machine learning, where descriptions of such underlying structures are not available. It captures non-concatenative phenomena naturally and allows for representing rules as FSTs, so that performant algorithms for morphological analysis and generation are readily available as algebraic operations on transducers. We suggest that such a standardized formalism can serve as an alternative to the models of morphology and string processing algorithms developed for a specific machine learning method, which are common in the literature.

Figure 3: The transducer $S_1$ for generating a deletion neighborhood.