Neural Graphical Models over Strings for Principal Parts Morphological Paradigm Completion

Many of the world’s languages contain an abundance of inflected forms for each lexeme. A critical task in processing such languages is predicting these inflected forms. We develop a novel statistical model for the problem, drawing on graphical modeling techniques and recent advances in deep learning. We derive a Metropolis-Hastings algorithm to jointly decode the model. Our Bayesian network draws inspiration from principal parts morphological analysis. We demonstrate improvements on 5 languages.


Introduction
Inflectional morphology modifies the form of words to convey grammatical distinctions (e.g. tense, case, and number), and is an extremely common and productive phenomenon throughout the world's languages (Dryer and Haspelmath, 2013). For instance, the Spanish verb poner may surface as one of over fifty unique inflected forms depending on context, e.g. the 1st person present form is pongo, but the 2nd person present form is pones. These variants cause data sparsity, which is problematic for machine learning since many word forms will not occur in training corpora. Thus, a prerequisite for improving NLP on morphologically rich languages is the ability to analyze all inflected forms of any lexical entry. One way to do this is through paradigm completion, which generates all the inflected forms associated with a given lemma.
Until recently, paradigm completion has been narrowly construed as the task of generating a full paradigm (e.g. a noun declension or verb conjugation) based on a single privileged form, the lemma (i.e. the citation form, such as poner). While recent work (Durrett and DeNero, 2013; Hulden, 2014; Nicolai et al., 2015; Ahlberg et al., 2015; Faruqui et al., 2016) has made tremendous progress on this narrower task, paradigm completion is not only broader in scope, but is better solved without privileging the lemma over other forms. Forcing string-to-string transformations from one inflected form to another to go through the lemma often makes the transformation problem more complex than allowing transformations to happen directly, or through a different intermediary form. This interpretation is inspired by ideas from linguistics and language pedagogy, namely principal parts morphology, which argues that forms in a paradigm are best derived from a set of citation forms rather than a single form (Finkel and Stump, 2007a; Finkel and Stump, 2007b).
Directed graphical models provide a natural formalism for principal parts morphology, since a graph topology can represent relations between inflected forms and principal parts. Specifically, we apply string-valued graphical models (Dreyer and Eisner, 2009) to the problem. We develop a novel, neural parameterization of string-valued graphical models in which the conditional probabilities in the Bayesian network are given by a sequence-to-sequence model (Sutskever et al., 2014). However, under such a parameterization, exact inference and decoding are intractable, so we derive a sampling-based decoding algorithm. We experiment on 5 languages (Arabic, German, Latin, Russian, and Spanish), showing that our model outperforms a baseline approach that privileges the lemma form.

A Generative Model of Principal Parts
We first formally define the task of paradigm completion and relate it to research in principal parts morphology. Let Σ be a discrete alphabet of characters in a language. Formally, for a given lemma ℓ ∈ Σ*, we define the complete paradigm of that lemma as π(ℓ) = ⟨m_1, . . . , m_N⟩, where each m_i ∈ Σ* is an inflected form.1 For example, the paradigm for the English lemma to run is π(RUN) = ⟨run, runs, ran, running⟩. While the size of a typical English verbal paradigm is comparatively small (|π| = N = 4), in many languages paradigms can be very large (Kibrik, 1998). The task of paradigm completion is to predict all elements of the tuple π given one or more observed forms m_i. Paradigm completion solely from the lemma form m_ℓ, however, largely ignores the linguistic structure of the paradigm. Given certain inflected word forms, termed principal parts, the construction of a set of other word forms in the same paradigm is fully deterministic. Latin verbs are famous for having four such principal parts (Finkel and Stump, 2009).
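Concretely, a paradigm can be represented as a mapping from paradigm cells to surface strings. The sketch below is a toy illustration for the English verb run; the slot labels are our own shorthand, not the tags used in the data.

```python
# Toy representation of the paradigm π(RUN) as a mapping from
# paradigm cells to inflected forms; slot labels are illustrative.
paradigm = {
    "LEMMA": "run",       # citation form
    "3SG.PRS": "runs",
    "PST": "ran",
    "GERUND": "running",
}

# |π| = N = 4 for a typical English verbal paradigm.
N = len(paradigm)
```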
Inspired by the concept of principal parts, we present a solution to the paradigm completion task in which target inflected forms are predicted from other forms in the paradigm, rather than only from the lemma. We implement this solution as a generative probabilistic model of the paradigm. We define a joint probability distribution over the entire paradigm:

p(π) = ∏_{i=1}^{N} p(m_i | m_{pa_T(i)})    (1)

where pa_T(·) is a function that returns the parent of node i with respect to the tree T, which encodes the source form from which each target form is predicted. In terms of graphical modeling, this p(π) is a Bayesian network over string-valued variables. Trees provide a natural formalism for encoding the intuition behind principal parts theory, and give a fixed paradigm structure prior to inference. We construct a graph with a node for each cell in the paradigm, as in Figure 1. The parent of each node is another form in the paradigm that best predicts that node.
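As a minimal sketch of this factorization, the joint log-probability of a paradigm under a fixed tree is a sum of edge-wise conditional scores. The LSTM scorer is replaced here by a toy stand-in, and all names are illustrative, not the authors' implementation.

```python
import math

def log_joint(forms, tree, log_cond):
    """log p(π) = Σ_i log p(m_i | m_{pa_T(i)}) for a tree-structured
    Bayesian network. `tree` maps each cell to its parent (the root,
    e.g. the lemma, maps to None and is taken as observed here);
    `log_cond(target, source)` is the black-box conditional scorer."""
    total = 0.0
    for cell, parent in tree.items():
        if parent is not None:
            total += log_cond(forms[cell], forms[parent])
    return total

# Toy stand-in for the LSTM conditional: a fixed probability per edge.
def toy_log_cond(target, source):
    return math.log(0.25)
```

With the baseline topology, every non-lemma cell's parent is the lemma; swapping in a different `tree` changes only which edges are scored, not the factorization.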

Paradigm Trees
Baseline Network. Predicting inflected forms only from the lemma corresponds to a particular graphical model in which all the forms are leaves attached to the lemma. We treat this network as a baseline; it is depicted in Figure 1a.
Heuristic Network. We heuristically induce a paradigm tree with the following procedure. For each ordered pair (i, j) of inflected forms in a paradigm π, we count the number of distinct edit scripts that map m_i to m_j, using a procedure similar to that described in Chrupała et al. (2008); this count serves as the weight w_{i→j} on the edge from i to j. Empirically, w_{i→j} is a good proxy for how deterministic a mapping is. We use Edmonds' algorithm (Edmonds, 1967) to find the minimum-weight directed spanning tree. The intuition behind this procedure is that the number of deterministic arcs should be maximized.
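The edge weight w_{i→j} — the number of distinct minimal edit scripts from m_i to m_j — can be computed by counting minimum-cost paths in the standard edit-distance lattice. The sketch below is our own reconstruction of that idea, not the authors' code; the resulting weights would then feed a directed minimum-spanning-tree solver such as Edmonds' algorithm.

```python
def num_min_edit_scripts(src, tgt):
    """Count distinct minimum-cost edit scripts from src to tgt under
    unit-cost insertion/deletion/substitution (copies are free).
    A low count suggests a near-deterministic mapping."""
    n, m = len(src), len(tgt)
    # d[i][j]: edit distance between prefixes; c[i][j]: number of
    # distinct minimal edit paths realizing that distance.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    c = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0], c[i][0] = i, 1
    for j in range(1, m + 1):
        d[0][j], c[0][j] = j, 1
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = d[i - 1][j - 1] + (src[i - 1] != tgt[j - 1])
            best = min(diag, d[i - 1][j] + 1, d[i][j - 1] + 1)
            cnt = 0
            if diag == best:
                cnt += c[i - 1][j - 1]   # copy or substitute
            if d[i - 1][j] + 1 == best:
                cnt += c[i - 1][j]       # delete src[i-1]
            if d[i][j - 1] + 1 == best:
                cnt += c[i][j - 1]       # insert tgt[j-1]
            d[i][j], c[i][j] = best, cnt
    return c[n][m]
```

A single minimal script (e.g. run → runs, one insertion) indicates a highly regular arc, while many competing minimal scripts suggest an unpredictable one.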
Gold Network. Finally, for Latin verbs we consider a graph that matches the classic pedagogical derivation of Latin verbs from four principal parts.

Inflection Generation with RNNs
RNNs have recently achieved state-of-the-art results on many sequence-to-sequence mapping problems, and paradigm completion is no exception. Given the success of LSTM-based (Hochreiter and Schmidhuber, 1997) and GRU-based (Cho et al., 2014) morphological inflectors (Faruqui et al., 2016), we choose a neural parameterization for our Bayesian network, i.e. the conditional probability p(m_i | m_{pa_T(i)}) is computed by an RNN. Our graphical modeling approach, as well as the inference algorithms discussed in §4.2, is agnostic to the minutiae of any one parameterization, i.e. the conditional p(m_i | m_{pa_T(i)}) is treated as a black box.

LSTMs with Hard Monotonic Attention
We define the conditional distributions p(m_i | m_{pa_T(i)}) in our Bayesian network as LSTMs with hard monotonic attention, which we briefly overview. These networks map one inflection to another, e.g. the English gerund running to the past tense ran, using an encoder-decoder architecture (Sutskever et al., 2014) run over an augmented alignment alphabet consisting of copy, substitution, deletion, and insertion operations, as in Dreyer et al. (2008). For strings x, y ∈ Σ*, the alignment is extracted from the minimal-weight edit path using the BioPython toolkit (Cock et al., 2009). Crucially, as the model is locally normalized, we may sample strings from the conditional p(m_i | m_{pa_T(i)}) efficiently using forward sampling. This network stands in contrast to soft attention models (Bahdanau et al., 2015), in which the alignments are soft and not necessarily monotonic. We refer the reader to the model's original description for exact implementation details, as we use the authors' code out-of-the-box.2
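The monotonic copy/substitute/delete/insert action sequence over the alignment alphabet can be recovered by backtracing a minimum-cost edit path. The function below is a simplified stand-in for the paper's BioPython-based aligner; the function and action labels are ours.

```python
def edit_actions(src, tgt):
    """Recover one minimum-cost monotonic edit script (COPY/SUB/DEL/INS)
    by backtracing the edit-distance DP table; a toy stand-in for the
    alignment extraction step."""
    n, m = len(src), len(tgt)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j - 1] + (src[i - 1] != tgt[j - 1]),
                          d[i - 1][j] + 1, d[i][j - 1] + 1)
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and d[i][j] == d[i - 1][j - 1] + (src[i - 1] != tgt[j - 1])):
            ops.append(("COPY" if src[i - 1] == tgt[j - 1] else "SUB",
                        src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("DEL", src[i - 1], ""))
            i -= 1
        else:
            ops.append(("INS", "", tgt[j - 1]))
            j -= 1
    return ops[::-1]
```

For run → runs the script is three copies followed by one insertion, the kind of near-deterministic mapping the model learns easily.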

Neural Graphical Models over Strings
Our Bayesian network defined in Equation (1) is a graphical model over multiple string-valued random variables, a framework formalized in Dreyer and Eisner (2009). In contrast to previous work, which considered conditional distributions encodable by finite-state machines, we offer the first neural parameterization for such graphical models. With the increased expressivity come computational challenges: inference becomes intractable. Thus, we fashion an efficient sampling algorithm.

Parameter Estimation
Following previous work (Faruqui et al., 2016), we train our model in the fully observed setting, with complete paradigms as training data. As our model is directed, parameter estimation is relatively straightforward: we may estimate the parameters of each LSTM independently, without performing joint inference during training. We follow the training procedure of the hard monotonic attention model, using a maximum of 300 epochs of SGD.

Approximate Joint Decoding
In a Bayesian network, the maximum-a-posteriori (MAP) inference problem refers to finding the most probable configuration of the unobserved variables given some evidence. In our case, this means finding the best set of inflections to complete the paradigm given an observed set of inflected forms. Returning to the English verbal paradigm, given the past tense form ran and the 3rd person present singular runs, the goal of MAP inference is to return the most probable assignment to the remaining forms, i.e. the bare form and the gerund (the correct assignment is run and running). In many Bayesian networks, e.g. models with finite support, exact MAP inference can be performed efficiently with the max-product belief propagation algorithm (Pearl, 1988) when the model has a tree structure. Despite our model's tree structure, the LSTM parameterization makes exact inference intractable. Thus, we resort to an approximate scheme.

Simulated Annealing
The Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970) is a popular Markov chain Monte Carlo (MCMC) (Robert and Casella, 2013) algorithm for approximate sampling from intractable distributions. As with all MCMC algorithms, the goal is to construct a Markov chain whose stationary distribution is the target distribution; after the chain has mixed, taking a random walk on it is equivalent to sampling from the intractable distribution. Here, we are interested in sampling from p(π), where part of π may be observed. Simulated annealing (Kirkpatrick et al., 1983; Andrieu et al., 2003) is a slight modification of the Metropolis-Hastings algorithm suitable for MAP inference. We add a temperature hyperparameter τ, which we decrease on a schedule; we recover the MAP estimate as τ → 0. The algorithm works as follows: Given a paradigm with tree T, we sample a latent node i in the tree uniformly at random. We then sample a new string m_i′ from the proposal distribution q_i (see §4.4), which we accept (replacing m_i) with probability

α = min{1, (p(π′) / p(π))^{1/τ} · q_i(m_i) / q_i(m_i′)}    (2)

where π′ denotes the paradigm with m_i′ substituted for m_i. We iterate until convergence and take the final configuration of values as our approximate MAP estimate. We give pseudocode in Algorithm 1 for clarity.
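To make the sampler concrete, here is a minimal, self-contained sketch of annealed Metropolis-Hastings over the latent cells; the interfaces `propose` and `proposal_logp` (sampling from and scoring under q_i) and all other names are illustrative stand-ins, not the authors' implementation.

```python
import math
import random

def annealed_mh(values, latent, log_joint, propose, proposal_logp,
                steps=2000, tau0=1.0, cooling=0.999):
    """Simulated-annealing Metropolis-Hastings over the latent cells of a
    paradigm. `values` maps cells to current strings; `latent` lists the
    unobserved cells. `propose(i, values)` draws a candidate string for
    cell i and `proposal_logp(i, s, values)` scores a string under q_i."""
    tau = tau0
    for _ in range(steps):
        i = random.choice(latent)        # pick a latent node uniformly
        new = propose(i, values)
        cand = dict(values)
        cand[i] = new
        # Annealed acceptance ratio; the q_i terms correct for the
        # asymmetry of the proposal distribution.
        log_alpha = ((log_joint(cand) - log_joint(values)) / tau
                     + proposal_logp(i, values[i], cand)
                     - proposal_logp(i, new, values))
        if math.log(random.random() + 1e-300) < log_alpha:
            values = cand                # accept the move
        tau *= cooling                   # cooling schedule; tau -> 0
    return values
```

With a sharply peaked joint, the chain quickly locks onto the high-probability configuration as the temperature falls, which is exactly the MAP behavior the annealing schedule is meant to induce.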

Proposal Distribution
We define a tractable proposal distribution for our neural graphical model over strings using a procedure similar to the stochastic inverse method of Stuhlmüller et al. (2013) for probabilistic programming. In addition to estimating the parameters of an LSTM defining the distribution p(m_i | m_{pa_T(i)}), we also estimate the parameters of an LSTM defining the inverse distribution p(m_{pa_T(i)} | m_i). As we observe only complete paradigms at training time, we train these networks as in §4.1. First, we define the neighborhood N(i) of a node i as all those nodes adjacent to i (connected by an incoming or outgoing edge). We then define the proposal distribution as a uniform mixture of all conditional distributions in the neighborhood:

q_i(m_i′) = (1 / |N(i)|) Σ_{j ∈ N(i)} p(m_i′ | m_j)

Crucially, some of the mixture components are stochastic inverses. Sampling from q_i is tractable: we sample a mixture component uniformly at random and then sample a string from it.
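The sampling step above can be sketched as follows. For the parent of node i the forward network is used, and for each child its stochastic inverse; `sample_fwd` and `sample_inv` are hypothetical interfaces standing in for forward sampling from the trained LSTMs.

```python
import random

def sample_proposal(i, values, tree, sample_fwd, sample_inv):
    """Draw a string from the mixture proposal q_i: choose a neighbor of
    node i uniformly at random, then forward-sample from the associated
    conditional. `tree` maps each cell to its parent (root -> None)."""
    parent = tree[i]
    children = [j for j, p in tree.items() if p == i]
    neighbors = ([("fwd", parent)] if parent is not None else []) \
        + [("inv", c) for c in children]
    kind, j = random.choice(neighbors)
    if kind == "fwd":
        return sample_fwd(values[j])   # p(m_i | m_parent), forward net
    return sample_inv(values[j])       # p(m_i | m_child), inverse net
```

Because each component is a locally normalized sequence model, both the uniform component choice and the subsequent string draw are cheap, keeping each MH step tractable.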

Related Work
Our effort is closest to Faruqui et al. (2016), who proposed the first neural paradigm completer. Many neural solutions were also proposed in the SIGMORPHON shared task on morphological reinflection; notably, the winning system used an encoder-decoder architecture. Neural networks have been used in other areas of computational morphology, e.g. morpheme segmentation (Wang et al., 2016; Cotterell and Schütze, 2017), morphological tagging (Heigold et al., 2016), and language modeling (Botha and Blunsom, 2014).

Experiments and Results
Our proposed model generalizes previous efforts in paradigm completion since all previously proposed models take the form of Figure 1a, i.e. a graphical model where all leaves connect to the lemma. Unfortunately, in that configuration, observing additional forms cannot help at test time since information must flow through the lemma, which is always observed. We conjecture that principal parts-based topologies will outperform the baseline topology for that reason. We propose a controlled experiment in which we consider identical training and testing conditions and vary only the topology.
Data. Data for training, development, and testing is randomly sampled from the UniMorph dataset (Sylak-Glassman et al., 2015).3 We run experiments on Arabic, German, and Russian nominal paradigms, and Latin and Spanish verbal paradigms. The sizes of the resulting data splits are given in Table 2. For the development and test splits, we always include the lemma (as is standard) while sampling additional observed forms; on average, one third of all forms are observed.
Evaluation. Evaluation on the held-out sets proceeds as follows: Given the observed forms in the paradigm, we jointly decode the remaining forms as discussed in §4.2. For the baseline, Algorithm 1 is unnecessary: since the lemma is always observed, each remaining form is decoded directly from it. We measure accuracy (macro-averaged) on the held-out forms.
Results. In general, we find that our principal parts-inspired networks outperform the lemma-centered baseline networks. We find the largest gains in Arabic, German, and Latin (for Latin, our heuristic topology closely matches the gold tree, validating our heuristic). We attribute the gains to the ability to use knowledge from attested forms that are otherwise difficult to predict, e.g. forms based on the Arabic broken plural, the German plural, and any of the Latin present perfect forms. In paradigms with portions that are difficult to predict without knowledge of a representative form, knowing multiple principal parts is a boon, given a proper tree. For Spanish, we see only a slight improvement; we attribute this to the fact that almost all of the test examples were regular -ar verbs and, thus, fully predictable. Finally, in the case of Russian we see only minor improvements; this stems from the need to maintain a different optimal topology for each declension. Because our model assumes a fixed paradigmatic structure in the form of a tree, using multiple topologies is not possible.

Conclusion
We have presented a directed graphical model over strings with an RNN parameterization for principal-parts-inspired morphological paradigm completion. This approach gives us the best of both worlds: we can exploit state-of-the-art neural morphological inflectors while injecting linguistic insight into the structure of the graphical model itself. Due to the expressivity of our parameterization, exact decoding is intractable; to address this, we derive an efficient MCMC approach to approximately decode the model. We validate our model experimentally and show gains over a baseline that represents the topology used in nearly all previous research.