A Structured Variational Autoencoder for Contextual Morphological Inflection

Statistical morphological inflectors are typically trained on fully supervised, type-level data. One remaining open research question is the following: How can we effectively exploit raw, token-level data to improve their performance? To this end, we introduce a novel generative latent-variable model for the semi-supervised learning of inflection generation. To enable posterior inference over the latent variables, we derive an efficient variational inference procedure based on the wake-sleep algorithm. We experiment on 23 languages, using the Universal Dependencies corpora in a simulated low-resource setting, and find improvements of over 10% absolute accuracy in some cases.


Introduction
The majority of the world's languages overtly encode syntactic information on the word form itself, a phenomenon termed inflectional morphology (Dryer et al., 2005). In English, for example, the verbal lexeme with lemma talk has four forms: talk, talks, talked and talking. Other languages, such as Archi (Kibrik, 1998), distinguish more than a thousand verbal forms. Despite the cornucopia of unique variants a single lexeme may mutate into, native speakers can flawlessly predict the correct variant that the lexeme's syntactic context dictates. Thus, in computational linguistics, a natural question is the following: Can we estimate a probability model that can do the same?
The topic of inflection generation has received a flurry of attention of late and, moreover, has been the subject of two shared tasks (Cotterell et al., 2016, 2017). Most work, however, has focused on the fully supervised case: a source lemma and the morpho-syntactic properties are fed into a model, which is asked to produce the desired inflection. In contrast, our work focuses on the semi-supervised case, where we wish to make use of unannotated raw text, i.e., a sequence of inflected tokens.
* All authors contributed equally.
Concretely, we develop a generative directed graphical model of inflected forms in context. A contextual inflection model works as follows: Rather than just generating the proper inflection for a single given word form out of context (for example walking as the gerund of walk), our generative model is actually a fully-fledged language model. In other words, it generates sequences of inflected words. The graphical model is displayed in Fig. 1, and examples of words it may generate are overlaid on the graphical model notation. That our model is a language model enables it to exploit both inflected lexicons and unlabeled raw text in a principled semi-supervised way. In order to train using raw-text corpora (which is useful when we have less annotated data), we marginalize out the unobserved lemmata and morpho-syntactic annotation from unlabeled data. In terms of Fig. 1, this refers to marginalizing out $m_1, \ldots, m_4$ and $\ell_1, \ldots, \ell_4$. As this marginalization is intractable, we derive a variational inference procedure that allows for efficient approximate inference. Specifically, we modify the wake-sleep procedure of Hinton et al. (1995). It is the inclusion of raw text in this fashion that makes our model token level, a novelty in the camp of inflection generation, as much recent work in inflection generation (Dreyer et al., 2008; Durrett and DeNero, 2013; Nicolai et al., 2015; Ahlberg et al., 2015; Faruqui et al., 2016) trains a model on type-level lexicons.
We offer empirical validation of our model's utility with experiments on 23 languages from the Universal Dependencies corpus in a simulated low-resource setting. Our semi-supervised scheme improves inflection generation by over 10% absolute accuracy in some cases.

Inflectional Morphology
To properly discuss models of inflectional morphology, we require a formalization. We adopt the framework of word-based morphology (Aronoff, 1976; Spencer, 1991). Note that in the present paper we omit derivational morphology.
We define an inflected lexicon as a set of 4-tuples consisting of a part-of-speech tag, a lexeme, an inflectional slot, and a surface form. A lexeme is a discrete object that indexes the word's core meaning and part of speech. In place of such an abstract lexeme, lexicographers will often use a lemma, denoted by $\ell$, which is a designated surface form of the lexeme (such as the infinitive). For the remainder of this paper, we will use the lemma as a proxy for the lexeme, wherever convenient, although we note that lemmata may be ambiguous: bank is the lemma for at least two distinct nouns and two distinct verbs. For inflection, this ambiguity will rarely play a role; for instance, all senses of bank inflect in the same fashion.
A part-of-speech (POS) tag, denoted $t$, is a coarse syntactic category such as VERB. Each POS tag allows some set of lexemes, and also allows some set of inflectional slots, denoted as $\sigma$, such as $\langle$TNS=PAST, PERSON=3$\rangle$. Each allowed $\langle$tag, lexeme, slot$\rangle$ triple is realized, in only one way, as an inflected surface form, a string over a fixed phonological or orthographic alphabet $\Sigma$. (In this work, we take $\Sigma$ to be an orthographic alphabet.)
Additionally, we will define the term morphological tag, denoted by $m$, which we take to be the POS-slot pair $m = \langle t, \sigma \rangle$. We will further define $T$ as the set of all POS tags and $M$ as the set of all morphological tags. A paradigm $\pi(t, \ell)$ is the mapping from tag $t$'s slots to the surface forms that "fill" those slots for lexeme/lemma $\ell$. For example, in the English paradigm $\pi$(VERB, talk), the past-tense slot is said to be filled by talked, meaning that the lexicon contains the tuple $\langle$VERB, talk, PAST, talked$\rangle$.
A cheat sheet for the notation is provided in Tab. 2.
We will specifically work with the UniMorph annotation scheme (Sylak-Glassman, 2016). Here, each slot specifies a morpho-syntactic bundle of inflectional features such as tense, mood, person, number, and gender. For example, the German surface form Wörtern is listed in the lexicon with tag NOUN, lemma Wort, and a slot specifying the feature bundle $\langle$NUM=PL, CASE=DAT$\rangle$. The full paradigms $\pi$(NOUN, Wort) and $\pi$(NOUN, Herr) are found in Tab. 1.

Morphological Inflection
Now, we formulate the task of context-free morphological inflection using the notation developed in §2. Given a set of $N$ form-tag-lemma triples $\{\langle f_i, m_i, \ell_i \rangle\}_{i=1}^{N}$, the goal of morphological inflection is to map the pair $\langle m_i, \ell_i \rangle$ to the form $f_i$. As the definition above indicates, the task is traditionally performed at the type level. In this work, however, we focus on a generalization of the task to the token level: we seek to map a bisequence of lemma-tag pairs to the sequence of inflected forms in context. Formally, we will denote the lemma-morphological tag bisequence as $\langle \ell, m \rangle$ and the form sequence as $f$. Foreshadowing, the primary motivation for this generalization is to enable the use of raw text in a semi-supervised setting.

Generating Sequences of Inflections
The primary contribution of this paper is a novel generative model over sequences of inflected words in their sentential context. Following the notation laid out in §2.2, we seek to jointly learn a distribution over sequences of forms $f$, lemmata $\ell$, and morphological tags $m$. The generative procedure is as follows: First, we sample a sequence of tags $m$, each morphological tag coming from a language model over morphological tags: $m_i \sim p_\theta(\cdot \mid m_{<i})$. Next, we sample the sequence of lemmata $\ell$ given the previously sampled sequence of tags $m$; these are sampled conditioned only on the corresponding morphological tag: $\ell_i \sim p_\theta(\cdot \mid m_i)$. Finally, we sample the sequence of inflected words $f$, where, again, each word is chosen conditionally independent of other elements of the sequence: $f_i \sim p_\theta(\cdot \mid \ell_i, m_i)$. This yields the factorized joint distribution:
$$p_\theta(f, \ell, m) = \prod_{i=1}^{|f|} p_\theta(f_i \mid \ell_i, m_i)\, p_\theta(\ell_i \mid m_i)\, p_\theta(m_i \mid m_{<i}). \tag{1}$$
We depict the corresponding directed graphical model in Fig. 1.
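The three-step generative procedure above can be sketched as ancestral sampling. This is a minimal illustration only: the probability tables, the tag strings, and the helper names (`TAG_LM`, `LEMMA_GIVEN_TAG`, `inflect`, `sample_sentence`) are all invented for the example, and a bigram table stands in for the LSTM tag language model, which in the model conditions on all previous tags.

```python
import random

# Toy stand-ins for the three conditionals; every number here is made up.
TAG_LM = {  # p(m_i | m_{i-1}); a bigram simplification of p(m_i | m_{<i})
    "<s>": {"NOUN;PL": 0.5, "VERB;PST": 0.5},
    "NOUN;PL": {"VERB;PST": 0.7, "</s>": 0.3},
    "VERB;PST": {"NOUN;PL": 0.4, "</s>": 0.6},
}
LEMMA_GIVEN_TAG = {  # p(l_i | m_i)
    "NOUN;PL": {"dog": 0.5, "cat": 0.5},
    "VERB;PST": {"talk": 0.5, "walk": 0.5},
}

def inflect(lemma, tag):
    """Deterministic toy inflector standing in for p(f_i | l_i, m_i)."""
    return lemma + ("s" if tag == "NOUN;PL" else "ed")

def draw(dist, rng):
    """Sample one key from a dict that maps outcomes to probabilities."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes], k=1)[0]

def sample_sentence(rng, max_len=10):
    """Ancestral sampling: a tag, then a lemma given the tag, then a form given both."""
    forms, lemmata, tags = [], [], []
    prev = "<s>"
    for _ in range(max_len):
        tag = draw(TAG_LM[prev], rng)
        if tag == "</s>":
            break
        lemma = draw(LEMMA_GIVEN_TAG[tag], rng)
        forms.append(inflect(lemma, tag))
        lemmata.append(lemma)
        tags.append(tag)
        prev = tag
    return forms, lemmata, tags
```

Calling `sample_sentence(random.Random(0))` returns three aligned lists, one entry per token, mirroring the form, lemma and tag sequences of the joint distribution.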
Relation to Other Models in NLP. As the graphical model drawn in Fig. 1 shows, our model is quite similar to a Hidden Markov Model (HMM) (Rabiner, 1989). There are two primary differences. First, we remark that an HMM directly emits a form $f_i$ conditioned on the tag $m_i$. Our model, in contrast, emits a lemma $\ell_i$ conditioned on the morphological tag $m_i$ and, then, conditioned on both the lemma $\ell_i$ and the tag $m_i$, we emit the inflected form $f_i$. In this sense, our model resembles the hierarchical HMM of Fine et al. (1998) with the difference that we do not have interdependence between the lemmata $\ell_i$. The second difference is that our model is non-Markovian: we sample the $i$-th morphological tag $m_i$ from a distribution that depends on all previous tags, using an LSTM language model (§4.1). This yields richer interactions among the tags, which may be necessary for modeling long-distance agreement phenomena.
Why a Generative Model? What is our interest in a generative model of inflected forms? Eq. (1) is a syntax-only language model in that it only allows for interdependencies between the morpho-syntactic tags in $p_\theta(m)$. However, given a tag sequence $m$, the individual lemmata and forms are conditionally independent. This prevents the model from learning notions such as semantic frames and topicality. So what is this model good for? Our chief interest is the ability to train a morphological inflector on unlabeled data, which is a boon in a low-resource setting. As the model is generative, we may consider the latent-variable model:
$$p_\theta(f) = \sum_{\langle \ell, m \rangle} p_\theta(f, \ell, m), \tag{2}$$
where we marginalize out the latent lemmata and morphological tags from raw text. The sum in Eq. (2) is unabashedly intractable: given a sequence $f$, it involves consideration of an exponential (in $|f|$) number of tag sequences and an infinite number of lemmata sequences. Thus, we will fall back on an approximation scheme (see §5).

Recurrent Neural Parameterization
The graphical model from §3 specifies a family of models that obey the conditional independence assumptions dictated by the graph in Fig. 1. In this section we define a specific parameterization using long short-term memory (LSTM) recurrent neural network (Hochreiter and Schmidhuber, 1997) language models (Sundermeyer et al., 2012).

LSTM Language Models
Before proceeding, we review the modeling of sequences with LSTM language models. Given some alphabet $\Delta$, the distribution over sequences $x \in \Delta^*$ can be defined as follows:
$$p(x) = \prod_{j=1}^{|x|} p(x_j \mid x_{<j}), \tag{3}$$
where $x_{<j} = x_1, \ldots, x_{j-1}$. The prediction at time step $j$ of a single element $x_j$ is then parametrized by a neural network:
$$p(\cdot \mid x_{<j}) = \mathrm{softmax}(W h_j + b), \tag{4}$$
where $W \in \mathbb{R}^{|\Delta| \times d}$ and $b \in \mathbb{R}^{|\Delta|}$ are learned parameters (for some number of hidden units $d$) and the hidden state $h_j \in \mathbb{R}^d$ is defined through the recurrence given by Hochreiter and Schmidhuber (1997) from the previous hidden state and an embedding of the previous character (assuming some learned embedding function $e : \Delta \to \mathbb{R}^c$ for some number of dimensions $c$):
$$h_j = \mathrm{LSTM}(h_{j-1}, e(x_{j-1})). \tag{5}$$
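As a rough illustration of how such a language model scores a sequence, the sketch below applies the chain rule with a softmax output layer over a recurrent hidden state. It is not the paper's network: a plain tanh recurrence stands in for the LSTM cell, the weights are random and untrained, and the class name `TinyLM`, the boundary symbols `<s>` and `</s>`, and the default sizes are assumptions made for the example.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

class TinyLM:
    """Chain-rule sequence model; a tanh recurrence stands in for the LSTM cell."""

    def __init__(self, alphabet, d=8, c=4, seed=0):
        rng = random.Random(seed)
        def rand(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
        self.symbols = ["<s>"] + list(alphabet) + ["</s>"]
        self.idx = {s: i for i, s in enumerate(self.symbols)}
        n = len(self.symbols)
        self.E = rand(n, c)   # embedding table, one row of size c per symbol
        self.U = rand(d, d)   # hidden-to-hidden weights
        self.V = rand(d, c)   # input-to-hidden weights
        self.W = rand(n, d)   # output projection
        self.b = [0.0] * n    # output bias
        self.d = d

    def step(self, h_prev, prev_symbol):
        """New hidden state from the previous state and the previous symbol's embedding."""
        emb = self.E[self.idx[prev_symbol]]
        pre = [u + v for u, v in zip(matvec(self.U, h_prev), matvec(self.V, emb))]
        return [math.tanh(x) for x in pre]

    def log_prob(self, seq):
        """Sum of log p(x_j | x_<j), including the end-of-sequence symbol."""
        h, prev, total = [0.0] * self.d, "<s>", 0.0
        for sym in list(seq) + ["</s>"]:
            h = self.step(h, prev)
            logits = [s + bias for s, bias in zip(matvec(self.W, h), self.b)]
            total += math.log(softmax(logits)[self.idx[sym]])
            prev = sym
        return total
```

For instance, `TinyLM("ab").log_prob("ab")` returns the log-probability of the string `ab` under the untrained toy model; swapping `step` for a real LSTM cell would leave `log_prob` unchanged.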

Our Conditional Distributions
We discuss each of the factors in Eq. (1) in turn.
1 Morphological Tag Language Model: We define $p_\theta(m)$ as an LSTM language model, as defined in §4.1, where we take $\Delta = M$, i.e., the elements of the sequence that are to be predicted are tags like $\langle$POS=V, TNS=GERUND$\rangle$. Note that the embedding function $e$ does not treat them as atomic units, but breaks them up into individual attribute-value pairs that are embedded individually and then summed to yield the final vector representation. To be precise, each tag is first encoded by a multi-hot vector, where each component corresponds to an attribute-value pair in the slot, and then this multi-hot vector is multiplied with an embedding matrix.
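The multi-hot encoding described above can be sketched as follows. The helper name `build_tag_embedder`, the attribute-value inventory and the random embedding matrix are assumptions for illustration; the point is only that multiplying a multi-hot vector by an embedding matrix equals summing the rows of the active attribute-value pairs.

```python
import random

def build_tag_embedder(attribute_values, dim=4, seed=0):
    """Return helpers that embed a tag given as a set of attribute-value strings."""
    rng = random.Random(seed)
    index = {av: i for i, av in enumerate(sorted(attribute_values))}
    # One embedding row per attribute-value pair (random here, learned in practice).
    matrix = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in index]

    def multi_hot(tag):
        """1.0 at the position of every attribute-value pair present in the tag."""
        vec = [0.0] * len(index)
        for av in tag:
            vec[index[av]] = 1.0
        return vec

    def embed(tag):
        """Multi-hot row vector times the embedding matrix."""
        hot = multi_hot(tag)
        return [sum(hot[i] * matrix[i][k] for i in range(len(index)))
                for k in range(dim)]

    return multi_hot, embed, matrix, index
```

With an inventory such as `["POS=V", "POS=N", "TNS=GERUND", "TNS=PST"]`, the tag `{"POS=V", "TNS=GERUND"}` is embedded as the sum of its two rows, so tags that share attribute-value pairs also share parameters.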
2 Lemma Generator: The next distribution in our model is a lemma generator, which we define to be a conditional LSTM language model over characters (we take $\Delta = \Sigma$), i.e., each $x_i$ is a single (orthographic) character. The language model is conditioned on $t_i$ (the part-of-speech information contained in the morphological tag $m_i = \langle t_i, \sigma_i \rangle$), which we embed into a low-dimensional space and feed to the LSTM by concatenating its embedding with that of the current character. Thus, we obtain the new recurrence relation for the hidden state:
$$h_j = \mathrm{LSTM}\left(h_{j-1}, \left[e\left([\ell_i]_{j-1}\right); e'(t_i)\right]\right), \tag{6}$$
where $[\ell_i]_j$ denotes the $j$-th character of the generated lemma $\ell_i$ and $e' : T \to \mathbb{R}^{c'}$ for some $c'$ is a learned embedding function for POS tags. Note that we embed only the POS tag, rather than the entire morphological tag, as we assume the lemma depends on the part of speech exclusively.
</w>), one LSTM running left-to-right, the other right-to-left. Concatenating the hidden states of both RNNs at each time step results in hidden states $h^{(\mathrm{enc})}_j$. The decoder, again, takes the form of an LSTM language model (we take $\Delta = \Sigma$), producing the inflected form character by character, but at each time step not only the previous hidden state and the previously generated token are considered, but also attention (a convex combination) over all encoder hidden states $h^{(\mathrm{enc})}_j$, with the distribution given by another neural network; see Luong et al. (2015).
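To make the phrase "a convex combination over all encoder hidden states" concrete, here is a minimal sketch of attention with a plain dot-product score. The function name `attend` is hypothetical, and the dot-product score is a simplifying assumption: the scoring network actually used follows Luong et al. (2015) and has learned parameters that this sketch omits.

```python
import math

def attend(decoder_state, encoder_states):
    """Softmax-normalized dot-product scores give convex weights over encoder states."""
    scores = [sum(d * h for d, h in zip(decoder_state, enc)) for enc in encoder_states]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]          # non-negative and summing to one
    dim = len(encoder_states[0])
    # Context vector: the weighted average of the encoder hidden states.
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context
```

Because the weights are non-negative and sum to one, every coordinate of the context vector lies between the smallest and largest value that coordinate takes across the encoder states.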

Semi-Supervised Wake-Sleep
We train the model with the wake-sleep procedure, which requires us to perform posterior inference over the latent variables. However, the exact computation in the model is intractable: it involves a sum over all possible lemmatizations and taggings of the sentence, as shown in Eq. (2). Thus, we fall back on a variational approximation (Jordan et al., 1999). We train an inference network $q_\phi(\ell, m \mid f)$ that approximates the true posterior over the latent variables $p_\theta(\ell, m \mid f)$. The variational family we choose in this work will be detailed in §5.5. We fit the distribution $q_\phi$ using a semi-supervised extension of the wake-sleep algorithm (Hinton et al., 1995; Dayan et al., 1995; Bornschein and Bengio, 2014). We derive the algorithm in the following subsections and provide pseudo-code in Alg. 1.
Note that the wake-sleep algorithm shows structural similarities to the expectation-maximization (EM) algorithm (Dempster et al., 1977), and, presaging the exposition, we note that the wake-sleep procedure is a type of variational EM (Beal, 2003).
The key difference is that the E-step minimizes an inclusive KL divergence, rather than the exclusive one typically found in variational EM.

Data Requirements of Wake-Sleep
We emphasize again that we will train our model in a semi-supervised fashion. Thus, we will assume a set of labeled sentences, $D_{\mathrm{labeled}}$, represented as a set of triples $\langle f, \ell, m \rangle$, and a set of unlabeled sentences, $D_{\mathrm{unlabeled}}$, represented as a set of surface form sequences $f$.

The Sleep Phase
Wake-sleep first dictates that we find an approximate posterior distribution $q_\phi$ that minimizes the KL divergences for all form sequences:
$$\mathbb{E}_{f \sim p_\theta(\cdot)}\left[ D_{\mathrm{KL}}\left( p_\theta(\cdot, \cdot \mid f) \,\|\, q_\phi(\cdot, \cdot \mid f) \right) \right] \tag{7}$$
with respect to the parameters $\phi$, which control the variational approximation $q_\phi$. Because $q_\phi$ is trained to be a variational approximation for any input $f$, it is called an inference network. In other words, it will return an approximate posterior over the latent variables for any observed sequence. Importantly, note that computation of Eq. (7) is still hard: it requires us to normalize the distribution $p_\theta$, which, in turn, involves a sum over all lemmatizations and taggings. However, it does lend itself to an efficient Monte Carlo approximation. As our model is fully generative and directed, we may easily take samples from the complete joint. Specifically, we will take $K$ samples $\langle \tilde{f}, \tilde{\ell}, \tilde{m} \rangle \sim p_\theta(\cdot, \cdot, \cdot)$ by forward sampling and define them as $D_{\mathrm{sleep}}$. We remark that we use a tilde to indicate that a form, lemma or tag is sampled, rather than human annotated. Using $K$ samples, we obtain the objective
$$S_{\mathrm{unsup}} = \frac{1}{K} \sum_{\langle \tilde{f}, \tilde{\ell}, \tilde{m} \rangle \in D_{\mathrm{sleep}}} \log q_\phi(\tilde{\ell}, \tilde{m} \mid \tilde{f}), \tag{8}$$
which we can maximize by fitting the model $q_\phi$ through backpropagation (Rumelhart et al., 1986), as one would during maximum likelihood estimation.
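The Monte Carlo approximation in the sleep phase can be illustrated on a deliberately tiny example. Everything below is a toy assumption: sentences have one token, the generative model emits the ambiguous form `walks` with one of two made-up tags, and the "inference network" has a single free parameter. The sketch only shows that averaging log q over dreamt samples favors the q that matches the true posterior.

```python
import math
import random

def sample_joint(rng):
    """Forward-sample one (form, lemma, tag) triple from a toy generative model."""
    tag = "N;PL" if rng.random() < 0.75 else "V;3;SG"
    return ("walks", "walk", tag)   # both tags yield the same surface form

def dream(k, rng):
    """Build the sleep-phase data set by drawing k triples from the joint."""
    return [sample_joint(rng) for _ in range(k)]

def s_unsup(log_q, d_sleep):
    """Monte Carlo objective: mean of log q(lemma, tag | form) over dreamt triples."""
    return sum(log_q(lemma, tag, form) for form, lemma, tag in d_sleep) / len(d_sleep)

def make_log_q(prob_noun):
    """Toy inference network with one parameter: q(tag = N;PL | form = 'walks')."""
    def log_q(lemma, tag, form):
        return math.log(prob_noun if tag == "N;PL" else 1.0 - prob_noun)
    return log_q
```

On a few thousand dreamt samples, `s_unsup(make_log_q(0.75), d_sleep)` is larger than the same objective at 0.5 or 0.95, because the true posterior probability of the noun reading in this toy model is 0.75.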

The Wake Phase
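Now, given our approximate posterior $q_\phi(\ell, m \mid f)$, we are in a position to re-estimate the parameters of the generative model $p_\theta(f, \ell, m)$. Given a set of unannotated sentences $D_{\mathrm{unlabeled}}$, we again first consider the objective
$$W_{\mathrm{unsup}} = \frac{1}{|D_{\mathrm{wake}}|} \sum_{\langle f, \tilde{\ell}, \tilde{m} \rangle \in D_{\mathrm{wake}}} \log p_\theta(f, \tilde{\ell}, \tilde{m}), \tag{9}$$
where $D_{\mathrm{wake}}$ is a set of triples $\langle f, \tilde{\ell}, \tilde{m} \rangle$ with $f \in D_{\mathrm{unlabeled}}$ and $\langle \tilde{\ell}, \tilde{m} \rangle \sim q_\phi(\cdot, \cdot \mid f)$, maximizing with respect to the parameters $\theta$ (we may stochastically backprop through the expectation simply by backpropagating through this sum). Note that Eq. (9) is a Monte Carlo approximation of the inclusive divergence of the data distribution of $D_{\mathrm{unlabeled}}$ times $q_\phi$ with $p_\theta$.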

Adding Supervision to Wake-Sleep
So far we presented a purely unsupervised training method that makes no assumptions about the latent lemmata and morphological tags. In our case, however, we have a very clear idea what the latent variables should look like. For instance, we are quite certain that the lemma of talking is talk and that it is in fact a GERUND. And, indeed, we have access to annotated examples $D_{\mathrm{labeled}}$ in the form of an annotated corpus. In the presence of these data, we optimize the supervised sleep phase objective,
$$S_{\mathrm{sup}} = \frac{1}{|D_{\mathrm{labeled}}|} \sum_{\langle f, \ell, m \rangle \in D_{\mathrm{labeled}}} \log q_\phi(\ell, m \mid f), \tag{10}$$
which is a Monte Carlo approximation of $D_{\mathrm{KL}}(D_{\mathrm{labeled}} \,\|\, q_\phi)$. Thus, when fitting our variational approximation $q_\phi$, we will optimize a joint objective $S = S_{\mathrm{sup}} + \gamma_{\mathrm{sleep}} \cdot S_{\mathrm{unsup}}$, where $S_{\mathrm{sup}}$, to repeat, uses actual annotated lemmata and morphological tags; we balance the two parts of the objective with a scaling parameter $\gamma_{\mathrm{sleep}}$. Note that on the first sleep phase iteration, we set $\gamma_{\mathrm{sleep}} = 0$ since taking samples from an untrained $p_\theta(\cdot, \cdot, \cdot)$ when we have available labeled data is of little utility. We will discuss the provenance of our data in §7.2. Likewise, in the wake phase we can neglect the approximation $q_\phi$ in favor of the annotated latent variables found in $D_{\mathrm{labeled}}$; this leads to the following supervised objective,
$$W_{\mathrm{sup}} = \frac{1}{|D_{\mathrm{labeled}}|} \sum_{\langle f, \ell, m \rangle \in D_{\mathrm{labeled}}} \log p_\theta(f, \ell, m), \tag{11}$$
which is a Monte Carlo approximation of $D_{\mathrm{KL}}(D_{\mathrm{labeled}} \,\|\, p_\theta)$. As in the sleep phase, we will maximize $W = W_{\mathrm{sup}} + \gamma_{\mathrm{wake}} \cdot W_{\mathrm{unsup}}$, where $\gamma_{\mathrm{wake}}$ is, again, a scaling parameter.
Algorithm 1 (excerpt). Sleep phase: maximize $\log q_\phi$ on $D_{\mathrm{labeled}} \cup D_{\mathrm{sleep}}$; this corresponds to Eq. (10) + Eq. (8). Wake phase (step 10): maximize $\log p_\theta$ on $D_{\mathrm{labeled}} \cup D_{\mathrm{wake}}$; this corresponds to Eq. (11) + Eq. (9).
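The alternation between the two phases can be sketched as a short training loop. This is a skeleton under stated assumptions, not the authors' Alg. 1: the callables `fit_q`, `fit_p`, `sample_p` and `sample_q` are hypothetical placeholders for the actual tagger-lemmatizer training, generative-model training and sampling routines, and the defaults (two iterations, both scaling parameters at 0.25) are taken from the appendix.

```python
def wake_sleep(labeled, unlabeled, fit_q, fit_p, sample_p, sample_q,
               iterations=2, gamma_sleep=0.25, gamma_wake=0.25, k=100):
    """Alternate sleep (fit q on labeled + dreamt data) and wake (fit p on labeled + tagged data)."""
    p = q = None
    for it in range(iterations):
        # Sleep phase: dream k triples from p; skipped on the first pass,
        # where the scaling parameter is set to zero because p is untrained.
        g_sleep = 0.0 if it == 0 else gamma_sleep
        d_sleep = [sample_p(p) for _ in range(k)] if g_sleep > 0 else []
        q = fit_q(labeled, d_sleep, g_sleep)      # maximize S_sup + g_sleep * S_unsup

        # Wake phase: annotate raw sentences with samples from q, then refit p.
        d_wake = [(forms, *sample_q(q, forms)) for forms in unlabeled]
        p = fit_p(labeled, d_wake, gamma_wake)    # maximize W_sup + gamma_wake * W_unsup
    return p, q
```

Passing the scaling parameter to the fitting routine, rather than subsampling the data, is a design choice of this sketch; an implementation could equally realize the weighting by controlling how many dreamt or tagged samples are mixed in.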

Our Variational Family
How do we choose the variational family $q_\phi$? In terms of NLP nomenclature, $q_\phi$ represents a joint morphological tagger and lemmatizer. The open-source tool LEMMING (Müller et al., 2015) represents such an object. LEMMING is a higher-order linear-chain conditional random field (CRF; Lafferty et al., 2001) that is an extension of the morphological tagger of Müller et al. (2013). Interestingly, LEMMING is a linear model that makes use of simple character n-gram feature templates. On both the tasks of morphological tagging and lemmatization, neural models have supplanted linear models in terms of performance in the high-resource case (Heigold et al., 2017). However, we are interested in producing an accurate approximation to the posterior in the presence of minimal annotated examples and potentially noisy samples produced during the sleep phase, where linear models still outperform non-linear approaches (Cotterell and Heigold, 2017). We note that our variational approximation is compatible with any family.

Interpretation as an Autoencoder
We may also view our model as an autoencoder, following Kingma and Welling (2013), who saw that a variational approximation to any generative model naturally has this interpretation. The crucial difference between Kingma and Welling (2013) and this work is that our model is a structured variational autoencoder in the sense that the space of our latent code is structured: the inference network encodes a sentence into a pair of lemmata and morphological tags $\langle \ell, m \rangle$. This bisequence is then decoded back into the sequence of forms $f$ through a morphological inflector. The reason the model is called an autoencoder is that we arrive at an auto-encoding-like objective if we combine $p_\theta$ and $q_\phi$ as so:
$$p(\hat{f} \mid f) = \sum_{\langle \ell, m \rangle} p_\theta(\hat{f} \mid \ell, m)\, q_\phi(\ell, m \mid f), \tag{12}$$
where $\hat{f}$ is a copy of the original sentence $f$. Note that this choice of latent space sadly precludes us from making use of the reparametrization trick that makes inference in VAEs particularly efficient. In fact, our whole inference procedure is quite different, as we do not perform gradient descent on both $q_\phi$ and $p_\theta$ jointly but optimize the two in alternation (using wake-sleep). We nevertheless call our model a VAE to uphold the distinction between the VAE as a model (essentially a specific Helmholtz machine (Dayan et al., 1995), justified by variational inference) and the end-to-end inference procedure that is commonly used.
Another way of viewing this model is that it tries to force the words in the corpus through a syntactic bottleneck. Spiritually, our work is close to the conditional random field autoencoder of Ammar et al. (2014).
We remark that many other structured NLP tasks can be "autoencoded" in this way and, thus, trained by a similar wake-sleep procedure. For instance, any two tasks that effectively function as inverses, e.g., translation and backtranslation, or language generation and parsing, can be treated with a similar variational autoencoder. While this work only focuses on the creation of an improved morphological inflector $p_\theta(f \mid \ell, m)$, one could imagine a situation where the encoder was also a task of interest. That is, the goal would be to improve both the decoder (the generation model) and the encoder (the variational approximation).

Related Work
Closest to our work is Zhou and Neubig (2017), who describe an unstructured variational autoencoder. However, the exact use case of our respective models is distinct. Our method models the syntactic dynamics with an LSTM language model over morphological tags. Thus, in the semi-supervised setting, we require token-level annotation. Additionally, our latent variables are interpretable, as they correspond to well-understood linguistic quantities. In contrast, Zhou and Neubig (2017) infer latent lemmata as real vectors. To the best of our knowledge, ours is only the second attempt, after Zhou and Neubig (2017), at semi-supervised learning for a neural inflection generator. Other non-neural attempts at semi-supervised learning of morphological inflectors include Hulden et al. (2014). Models in this vein are non-neural and often focus on exploiting corpus statistics, e.g., token frequency, rather than explicitly modeling the forms in context. All of these approaches are designed to learn from a type-level lexicon, rendering direct comparison difficult.

Experiments
While we estimate all the parameters in the generative model, the purpose of this work is to improve the performance of morphological inflectors through semi-supervised learning with the incorporation of unlabeled data.

Low-Resource Inflection Generation
The development of our method was primarily aimed at the low-resource scenario, where we observe a limited number of annotated data points. Why low-resource? When we have access to a preponderance of data, morphological inflection is close to being a solved problem, as evinced in SIGMORPHON's 2016 shared task. However, the CoNLL-SIGMORPHON 2017 shared task showed there is much progress to be made in the low-resource case. Semi-supervision is a clear avenue.

Data
As our model requires token-level morphological annotation, we perform our experiments on the Universal Dependencies (UD) dataset (Nivre et al., 2017). As this stands in contrast to most work on morphological inflection (which has used the UniMorph (Sylak-Glassman et al., 2015) datasets), we use a converted version of UD data, in which the UD morphological tags have been deterministically converted into UniMorph tags.
For each of the treebanks in the UD dataset, we divide the training portion into three chunks consisting of the first 500, 1000 and 5000 tokens, respectively. These labeled chunks will constitute three unique sets $D_{\mathrm{labeled}}$. The remaining sentences in the training portion will be used as unlabeled data $D_{\mathrm{unlabeled}}$ for each language, i.e., we will discard those labels. The development and test portions will be left untouched.
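One way to realize this split is sketched below. The function name `split_treebank` is hypothetical, and the handling of sentence boundaries is an assumption: the text does not say whether a sentence straddling the token budget is kept whole, so this sketch keeps whole sentences until the budget is reached and strips the annotation from everything after.

```python
def split_treebank(sentences, budget):
    """Keep annotation for sentences covering the first `budget` tokens; strip the rest.

    Each sentence is a list of (form, lemma, tag) triples.
    """
    labeled, unlabeled, used = [], [], 0
    for sent in sentences:
        if used < budget:
            labeled.append(sent)          # whole sentence kept, may overshoot slightly
            used += len(sent)
        else:
            # Discard lemmata and tags, keeping only the surface forms.
            unlabeled.append([form for form, _, _ in sent])
    return labeled, unlabeled
```

Running it with budgets of 500, 1000 and 5000 on the same training portion would give the three labeled sets, each paired with its own pool of unlabeled sentences.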
Languages. We explore a typologically diverse set of languages of various stocks: Indo-European, Afro-Asiatic, Turkic and Finno-Ugric, as well as the language isolate Basque. We have organized our experimental languages in Tab. 3 by genetic grouping, highlighting sub-families where possible. The Indo-European languages mostly exhibit fusional morphologies of varying degrees of complexity. The Basque, Turkic, and Finno-Ugric languages are agglutinative. Both of the Afro-Asiatic languages, Arabic and Hebrew, are Semitic and have templatic morphology with fusional affixes.

Evaluation
The end product of our procedure is a morphological inflector, whose performance is to be improved through the incorporation of unlabeled data. Thus, we evaluate using the standard metric, accuracy. We will evaluate at the type level, as is traditional in the morphological inflection literature, even though the UD treebanks on which we evaluate are token-level resources. Concretely, we compile an incomplete type-level morphological lexicon from the token-level resource. To create this resource, we gather all unique form-lemma-tag triples $\langle f, \ell, m \rangle$ present in the UD test data.
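The type-level evaluation described above amounts to deduplicating the test tokens and scoring exact matches. A minimal sketch follows; the function name `type_level_accuracy` and the `inflect(lemma, tag)` callable are placeholders for whichever inflector is being evaluated.

```python
def type_level_accuracy(test_sentences, inflect):
    """Exact-match accuracy over the unique (form, lemma, tag) triples in the test data."""
    # Collapse token-level data into an (incomplete) type-level lexicon.
    triples = sorted({(form, lemma, tag)
                      for sent in test_sentences
                      for form, lemma, tag in sent})
    correct = sum(inflect(lemma, tag) == form for form, lemma, tag in triples)
    return correct / len(triples), triples
```

A triple that occurs many times in the running text counts once here, so frequent forms do not dominate the score.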

Baselines
As mentioned before, most work on morphological inflection has considered the task of estimating statistical inflectors from type-level lexicons. Here, in contrast, we require token-level annotation to estimate our model. For this reason, there is neither a competing approach whose numbers we can make a fair comparison to nor an open-source system we could easily run in the token-level setting. This is why we treat our token-level data as a list of "types" and then use two simple type-based baselines.
First, we consider the probabilistic finite-state transducer used as the baseline for the CoNLL-SIGMORPHON 2017 shared task. We consider this a relatively strong baseline, as we seek to generalize from a minimal amount of data. As described by Cotterell et al. (2017), the baseline performed quite competitively in the task's low-resource setting. Note that the finite-state machine is created by heuristically extracting prefixes and suffixes from the word forms, based on an unsupervised alignment step. The second baseline is our neural inflector $p(f \mid \ell, m)$ given in §4 without the semi-supervision; this model is state-of-the-art on the high-resource version of the task.
We will refer to our systems as follows: FST is the probabilistic transducer, NN is the neural sequence-to-sequence model without semi-supervision, and SVAE is the structured variational autoencoder, which is equivalent to NN but also trained using wake-sleep and unlabeled data.

Results
We ran the three models on 23 languages with the hyperparameters and experimental details described in App. A. We present our results in Fig. 2 and in Tab. 3. We also provide sample output of the generative model created using the dream step in App. B. The high-level take-away is that we are able to exploit the unlabeled data to improve the sequence-to-sequence model, i.e., SVAE outperforms the NN model on all languages across all training scenarios. However, in many cases, the FST model is a better choice: the FST can sometimes generalize better from a handful of supervised examples than the neural network, even with semi-supervision (SVAE). We highlight three finer-grained observations below.
Observation 1: FST Good in Low-Resource. As clearly evinced in Fig. 2, the baseline FST is still competitive with the NN, or even our SVAE, when data is extremely scarce. Our neural architecture is quite general, and lacks the prior knowledge and inductive biases of the rule-based system, which become more pertinent in low-resource scenarios. Even though our semi-supervised strategy clearly improves the performance of NN, we cannot always recommend SVAE for the case when we only have 500 annotated tokens, but on average it does slightly better. The SVAE surpasses the FST when moving up to 1000 annotated tokens, with the gap becoming even more pronounced at 5000 annotated tokens.
Observation 2: Agglutinative Languages. The next trend we remark upon is that languages of an agglutinating nature tend to benefit more from the semi-supervised learning. Why should this be? Since in our experimental set-up every language sees the same number of tokens, it is naturally harder to generalize on languages that have more distinct morphological variants. Also, by the nature of agglutinative languages, relevant morphemes could be arbitrarily far from the edges of the string, making the (NN and) SVAE's ability to learn more generic rules even more valuable.
One interesting advantage that the neural models have over the FSTs is the ability to learn non-concatenative phenomena. The FST model is based on prefix and suffix rewrite rules and, naturally, struggles when the correctly reinflected form is more than the concatenation of these parts. Thus we see that for the two Semitic languages, the SVAE is the best method across all resource settings.

Conclusion
We have presented a novel generative model for morphological inflection generation in context. The model allows us to exploit unlabeled data in the training of morphological inflectors. As the model's rich parameterization prevents tractable inference, we craft a variational inference procedure, based on the wake-sleep algorithm, to marginalize out the latent variables. Experimentally, we provide empirical validation on 23 languages. We find that, especially in the lower-resource conditions, our model improves by large margins over the baselines.

A Hyperparameters and Experimental Details
Here, we list all the hyperparameters and other experimental details necessary for the reproduction of the numbers presented in Tab. 3. The final experiments were produced with the following settings. We performed a modest grid search over various configurations in search of the best option on development data for each component.
LSTM Morphological Tag Language Model.
The morphological tag language model is a 2-layer vanilla LSTM with a hidden size of 200. It is trained for 40 epochs using SGD with a cross-entropy loss objective and an initial learning rate of 20, where the learning rate is quartered during any epoch where the loss on the validation set reaches a new minimum. We regularize using dropout of 0.2 and clip gradients to 0.25. The morphological tags are embedded (both for input and output) with a multi-hot encoding into $\mathbb{R}^{200}$, where any given tag has an embedding that is the sum of the embedding for its constituent POS tag and each of its constituent slots.
Lemmata Generator. The lemma generator is a single-layer vanilla LSTM, trained for 10000 epochs using SGD with a learning rate of 4, using a batch size of 20000. The LSTM has 50 hidden units, embeds the POS tags into $\mathbb{R}^5$ and each token (i.e., character) into $\mathbb{R}^5$. We regularize using weight decay (1e-6), no dropout, and clip gradients to 1. When sampling lemmata from the model, we cool the distribution using a temperature of 0.75 to generate more "conservative" values. The hyperparameters were manually tuned on Latin data to produce sensible output and fit development data and then reused for all languages of this paper.
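The effect of cooling can be seen in a small sketch. How the temperature is applied is not spelled out in the text; the version below assumes the common convention of dividing log-probabilities by the temperature and renormalizing, and the function name `cool` is hypothetical.

```python
import math

def cool(probs, temperature):
    """Rescale a categorical distribution to p_i ** (1 / T), renormalized.

    Assumes every probability is strictly positive.
    """
    scaled = [math.log(p) / temperature for p in probs]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]
```

With a temperature below 1, such as 0.75, probability mass shifts toward the characters that were already most likely, which matches the stated aim of more "conservative" samples; a temperature of 1 leaves the distribution unchanged.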
Morphological Inflector. The reinflection model is a single-layer GRU-cell seq2seq model with a bidirectional encoder and multiplicative attention in the style of Luong et al. (2015), which we train for 250 iterations of AdaDelta (Zeiler, 2012). Our search over the remaining hyperparameters was as follows (optimal values in bold): input embedding size of [50, 100, 200, 300], hidden size of [50, 100, 150, 200], and a dropout rate of [0.0, 0.1, 0.2, 0.3, 0.4, 0.5].
Lemmatizer and Morphological Tagger. The joint lemmatizer and tagger is LEMMING as described in §5.5. It is trained with default parameters, the pretrained word vectors from Bojanowski et al. (2016) as type embeddings, and beam size 3.

Wake-Sleep
We run two iterations ($I = 2$) of wake-sleep. Note that each of the subparts of wake-sleep, estimating $p_\theta$ and estimating $q_\phi$, is trained to convergence and uses the hyperparameters described in the previous paragraphs. We set $\gamma_{\mathrm{wake}}$ and $\gamma_{\mathrm{sleep}}$ to 0.25, so we observe roughly 1/4 as many dreamt samples as true samples. The samples from the generative model often act as a regularizer, helping the variational approximation (as measured on morphological tagging and lemmatization accuracy) on the UD development set, but sometimes the noise lowers performance a mite. Due to a lack of space in the initial paper, we did not deeply examine the performance of the tagger-lemmatizer outside the context of improving inflection prediction accuracy. Future work will investigate the question of how much tagging and lemmatization can be improved through the incorporation of samples from our generative model. In short, our efforts will evaluate the inference network in its own right, rather than just as a variational approximation to the posterior.
B Fake Data from the Sleep Phase

Figure 1: A length-4 example of our generative model factorized as in Eq. (1) and overlaid with example values of the random variables in the sequence. We highlight that all the conditionals in the Bayesian network are recurrent neural networks, e.g., we note that $m_i$ depends on $m_{<i}$ because we employ a recurrent neural network to model the morphological tag sequence.

Figure 2: Violin plots showing the distribution over accuracies. The structured variational autoencoder (SVAE) always outperforms the neural network (NN), but only outperformed the FST-based approach when trained on 5000 annotated tokens. Thus, while semi-supervised training helps neural models reduce their sample complexity, roughly 5000 annotated tokens are still required to boost their performance above more symbolic baselines.

Table 1: As an exhibit of morphological inflection, full paradigms (two numbers and four cases, 8 slots total) for the German nouns Wort ("word") and Herr ("gentleman"), with abbreviated and tabularized UniMorph annotation.

Table 2: Notational cheat sheet for the paper.

Table 3: Type-level morphological inflection accuracy across different models, training scenarios, and languages.