Multi-space Variational Encoder-Decoders for Semi-supervised Labeled Sequence Transduction

Labeled sequence transduction is a task of transforming one sequence into another sequence that satisfies desiderata specified by a set of labels. In this paper we propose multi-space variational encoder-decoders, a new model for labeled sequence transduction with semi-supervised learning. The generative model can use neural networks to handle both discrete and continuous latent variables to exploit various features of data. Experiments show that our model not only provides a powerful supervised framework but also can effectively exploit unlabeled data. On the SIGMORPHON morphological inflection benchmark, our model outperforms single-model state-of-the-art results by a large margin for the majority of languages.


Introduction
This paper proposes a model for labeled sequence transduction tasks: tasks where we are given an input sequence and a set of labels, from which we are expected to generate an output sequence that reflects the content of the input sequence and the desiderata specified by the labels. Several examples of these tasks exist in prior work: using labels to moderate politeness in machine translation results (Sennrich et al., 2016), modifying the output language of a machine translation system (Johnson et al., 2016), or controlling the length of a summary in summarization (Kikuchi et al., 2016). In particular, however, we are motivated by the task of morphological reinflection (Cotterell et al., 2016), which we will use as an example in our description and as a test bed for our models. (An implementation of our model is available at https://github.com/violet-zct/MSVED-morph-reinflection.) In morphologically rich languages, different affixes (i.e., prefixes, infixes, suffixes) can be combined with the lemma to reflect various syntactic and semantic features of a word. The ability to accurately analyze and generate morphological forms is crucial to creating applications such as machine translation (Chahuneau et al., 2013; Toutanova et al., 2008) or information retrieval (Darwish and Oard, 2007) in these languages. As shown in Fig. 1, re-inflection of an inflected form given the target linguistic labels is a challenging subtask of handling morphology as a whole, in which we take as input an inflected form (in the example, "playing") and labels representing the desired form ("pos=Verb, tense=Past") and must generate the desired form ("played").
Approaches to this task include those utilizing hand-crafted linguistic rules and heuristics (Taji et al., 2016), as well as learning-based approaches using alignment and extracted transduction rules (Durrett and DeNero, 2013; Alegria and Etxeberria, 2016; Nicolai et al., 2016). There have also been methods proposed using neural sequence-to-sequence models (Faruqui et al., 2016; Kann et al., 2016; Ostling, 2016), and currently ensembles of attentional encoder-decoder models (Kann and Schütze, 2016a,b) have achieved state-of-the-art results on this task. One feature of these neural models, however, is that they are trained in a largely supervised fashion (top of Fig. 1), using data explicitly labeled with the input sequence and labels, along with the output representation. Needless to say, the ability to obtain this annotated data for many languages is limited. However, for most languages we can obtain large amounts of unlabeled surface forms, which may allow for semi-supervised learning over this unlabeled data (entirety of Fig. 1). In this work, we propose a new framework for labeled sequence transduction problems: multi-space variational encoder-decoders (MSVED, §3.3). MSVEDs employ continuous and discrete latent variables belonging to multiple separate probability distributions (analogous to multi-space hidden Markov models; Tokuda et al., 2002) to explain the observed data. In the example of morphological reinflection, we introduce a vector of continuous random variables that represents the lemma of the source and target words, and one discrete random variable for each of the labels on the source or target side.
This model has the advantage of both providing a powerful modeling framework for supervised learning and allowing for learning in an unsupervised setting. For labeled data, we maximize the variational lower bound on the marginal log likelihood of the data and annotated labels. For unlabeled data, we train an auto-encoder to reconstruct a word conditioned on its lemma and morphological labels. Since these labels are unavailable, we associate a set of discrete latent variables with each unlabeled word; we can then perform posterior inference on these latent variables and maximize the variational lower bound on the marginal log likelihood of the data.
Experiments on the SIGMORPHON morphological reinflection task (Cotterell et al., 2016) find that our model beats the state of the art for a single model in the majority of languages, and is particularly effective for languages with more complicated inflectional phenomena. Further, we find that semi-supervised learning allows for significant further gains. Finally, qualitative evaluation of lemma representations finds that our model is able to learn lemma embeddings that match human intuition. (Faruqui et al. (2016) have attempted a limited form of semi-supervised learning by re-ranking with a standard n-gram language model, but this is not integrated with the learning process for the neural model and gains are limited.)
Labeled Sequence Transduction

In this section, we first present some notation regarding labeled sequence transduction problems in general, then describe a particular instantiation for morphological reinflection.

Notation: Labeled sequence transduction problems involve transforming a source sequence x^(s) into a target sequence x^(t), with some labels describing the particular variety of transformation to be performed. We use K discrete variables y^(t)_1, ..., y^(t)_K to denote the labels associated with each target sequence, where K is the total number of labels, and let y^(t) = (y^(t)_1, ..., y^(t)_K) denote the vector of these discrete variables. Each discrete variable y^(t)_k represents a categorical feature pertaining to the target sequence and has a set of possible labels. In later sections, we also use y^(t) and y^(t)_k to denote discrete latent variables corresponding to these labels.
Given a source sequence x^(s) and a set of associated target labels y^(t), our goal is to generate a target sequence x^(t) that exhibits the features specified by y^(t), using a probabilistic model p(x^(t) | x^(s), y^(t)). The best target sequence x̂^(t) is then given by:

x̂^(t) = argmax_{x^(t)} p(x^(t) | x^(s), y^(t)).   (1)

Morphological Reinflection Problem: In morphological reinflection, the source sequence x^(s) consists of the characters in an inflected word (e.g., "playing"), while the associated labels y^(t) describe some linguistic features (e.g., y^(t)_pos = Verb, y^(t)_tense = Past) that we hope to realize in the target. The target sequence x^(t) is therefore the characters of the re-inflected form of the source word (e.g., "played") that satisfies the linguistic features specified by y^(t). For this task, each discrete variable y^(t)_k has a set of possible labels (e.g., pos=V, pos=ADJ, etc.) and follows a multinomial distribution.

Preliminaries: Variational Autoencoder
As mentioned above, our proposed model uses probabilistic latent variables in a model based on neural networks. The variational autoencoder (VAE; Kingma and Welling, 2014) is an efficient way to handle (continuous) latent variables in neural models. We describe it briefly here; interested readers can refer to Doersch (2016) for details. The VAE learns a generative model of the probability p(x|z) of observed data x given a latent variable z, and simultaneously uses a recognition model q(z|x) at learning time to estimate z for a particular observation x (Fig. 2(a)). q(·) and p(·) are modeled using neural networks parameterized by φ and θ respectively, and these parameters are learned by maximizing the variational lower bound on the marginal log likelihood of the data:

log p_θ(x) ≥ E_{q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z|x) || p(z)) = L(x).   (2)

The KL-divergence term (a standard feature of variational methods) ensures that the distribution estimated by the recognition model q_φ(z|x) does not deviate far from our prior probability p(z) over the values of the latent variables. To optimize the parameters with gradient descent, Kingma and Welling (2014) introduce a reparameterization trick that allows for training using simple backpropagation w.r.t. the Gaussian latent variables z. Specifically, we can express z as a deterministic variable z = g_φ(ε, x), where ε is an independent Gaussian noise variable, ε ~ N(0, 1). The mean µ and the variance σ² of z are parameterized by differentiable functions w.r.t. φ. Thus, instead of generating z from q_φ(z|x), we sample the auxiliary variable ε and obtain z = µ_φ(x) + σ_φ(x) ⊙ ε, which enables gradients to backpropagate through φ.
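The reparameterization trick and the Gaussian KL term above can be sketched in a few lines of NumPy (an illustrative sketch, not the paper's implementation; the function names are ours):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    Gradients can flow through mu and log_var because all the
    randomness is isolated in the auxiliary noise term eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian,
    summed over the latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

rng = np.random.default_rng(0)
z = reparameterize(np.zeros(4), np.zeros(4), rng)  # q = N(0, I) here
```

Note that when q_φ(z|x) collapses onto the prior N(0, I), the KL term is exactly zero, which is the degenerate behavior discussed in §4.2.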

Multi-space Variational Autoencoders
As an intermediate step to our full model, we next describe a generative model for a single sequence with both continuous and discrete latent variables: the multi-space variational autoencoder (MSVAE). MSVAEs combine two threads of previous work: deep generative models with both continuous and discrete latent variables for classification problems (Kingma et al., 2014; Maaløe et al., 2016) and VAEs with only continuous variables for sequential data (Bowman et al., 2016; Chung et al., 2015; Zhang et al., 2016; Fabius and van Amersfoort, 2014; Bayer and Osendorfer, 2014). In MSVAEs, we have an observed sequence x, continuous latent variables z as in the VAE, and discrete latent variables y.
In the case of the morphology example, x can be interpreted as an inflected word to be generated. y is a vector representing its linguistic labels, either annotated by an annotator in the observed case, or unannotated in the unobserved case. z is a vector of latent continuous variables, e.g. a latent embedding of the lemma that captures all the information about x that is not already represented in labels y.

MSVAE:
Because inflected words can be naturally thought of as "lemma + morphological labels", we interpret a word using discrete and continuous latent variables that represent the linguistic labels and the lemma, respectively. In the case when the labels y of the sequence are not observed, we perform inference over possible linguistic labels, and these inferred labels are referenced in generating x.
The generative model p_θ(x, y, z) = p(z) p_π(y) p_θ(x|y, z) is defined as:

p(z) = N(z; 0, I),   (3)
p_π(y) = ∏_k Cat(y_k | π_k),   (4)
p_θ(x|y, z) = f(x; y, z, θ).   (5)

Like the standard VAE, we assume the prior of the latent variable z is a diagonal Gaussian distribution with zero mean and unit variance. We assume that each variable in y is independent, resulting in the factorized distribution in Eq. 4, where Cat(y_k|π_k) is a multinomial distribution with parameters π_k. For the purposes of this study, we set these to a uniform distribution, π_{k,j} = 1/|π_k|. f(x; y, z, θ) calculates the likelihood of x and is parameterized by deep neural networks; specifically, we employ an RNN decoder to generate the target word conditioned on the lemma variable z and linguistic labels y, detailed in §5.
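Ancestral sampling from these priors (Eqs. 3-4) is straightforward; the NumPy sketch below omits the decoder f itself, and the label-set sizes are illustrative:

```python
import numpy as np

def sample_latents(latent_dim, label_sizes, rng):
    """Ancestral sampling from the MSVAE priors:
    z ~ N(0, I) (Eq. 3), and each label y_k drawn from a uniform
    categorical distribution, pi_{k,j} = 1/|pi_k| (Eq. 4)."""
    z = rng.standard_normal(latent_dim)
    y = [int(rng.integers(n_k)) for n_k in label_sizes]
    return z, y

# illustrative sizes: a 150-dim lemma vector and three label types
rng = np.random.default_rng(1)
z, y = sample_latents(150, [12, 3, 6], rng)
```

The sampled z and y would then be fed to the RNN decoder to generate the character sequence x.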
When inferring the latent variables from the given data x, we assume the joint distribution of latent variables z and y has a factorized form, i.e. q(z, y|x) = q(z|x)q(y|x) as shown in Fig. 2(c).

The inference model is defined as follows:

q_φ(z|x) = N(z; µ_φ(x), diag(σ²_φ(x))),   (6)
q_φ(y|x) = ∏_k Cat(y_k | π_φ(x)),   (7)

where the inference distribution over z is a diagonal Gaussian distribution with mean and variance parameterized by neural networks. The inference model q(y|x) on labels y has the form of a discriminative classifier that generates a set of multinomial probability vectors π_φ(x) over all labels for each tag y_k. We represent each multinomial distribution q(y_k|x) with an MLP. The MSVAE is trained by maximizing the following variational lower bound U(x) on the objective for unlabeled data:

U(x) = E_{q_φ(y|x) q_φ(z|x)}[log p_θ(x|y, z)] − KL(q_φ(z|x) || p(z)) − KL(q_φ(y|x) || p_π(y)).   (8)

Note that this introduction of discrete variables requires more sophisticated optimization algorithms, which we will discuss in §4.1.

Labeled MSVAE: When y is observed, as shown in Fig. 2(b), we maximize the following variational lower bound on the marginal log likelihood of the data and the labels:

L(x, y) = E_{q_φ(z|x)}[log p_θ(x|y, z)] + log p_π(y) − KL(q_φ(z|x) || p(z)),   (9)

which is a simple extension of Eq. 2. Note that when labels are not observed, the inference model q_φ(y|x) has the form of a discriminative classifier; thus we can use observed labels as a supervision signal to learn a better classifier. In this case we also minimize the following cross entropy as the classification loss:

L_cls = E_{p_l(x,y)}[−log q_φ(y|x)],   (10)

where p_l(x, y) is the distribution of labeled data. This is a form of multi-task learning, as this additional loss also informs the learning of our representations.

Multi-space Variational Encoder-Decoders
Finally, we discuss the full proposed method: the multi-space variational encoder-decoder (MSVED), which generates the target x^(t) from the source x^(s) and labels y^(t). Again, we discuss two cases of this model: one in which the labels of the target sequence are observed, and one in which they are not.
MSVED: The graphical model for the MSVED is given in Fig. 2(e). Because the labels of the target sequence are not observed, once again we treat them as discrete latent variables and perform inference on these labels conditioned on the target sequence. The generative process for the MSVED is very similar to that of the MSVAE, with one important exception: while the standard MSVAE conditions the recognition model q(z|x) on x, then generates x itself, the MSVED conditions the recognition model q(z|x^(s)) on the source x^(s), then generates the target x^(t). Because only the recognition model is changed, the generative equations for p_θ(x^(t), y^(t), z) are exactly the same as Eqs. 3-5 with x^(t) swapped for x and y^(t) swapped for y. The variational lower bound on the conditional log likelihood, however, is affected by the recognition model, and is computed as:

L_u(x^(t)|x^(s)) = E_{q_φ(y^(t)|x^(t)) q_φ(z|x^(s))}[log p_θ(x^(t)|y^(t), z)] − KL(q_φ(z|x^(s)) || p(z)) − KL(q_φ(y^(t)|x^(t)) || p_π(y^(t))).   (11)

Labeled MSVED: When the complete form of x^(s), y^(t), and x^(t) is observed in our training data, the graphical model of the labeled MSVED is illustrated in Fig. 2(d). We maximize the variational lower bound on the conditional log likelihood of observing x^(t) and y^(t):

L_l(x^(t), y^(t)|x^(s)) = E_{q_φ(z|x^(s))}[log p_θ(x^(t)|y^(t), z)] + log p_π(y^(t)) − KL(q_φ(z|x^(s)) || p(z)).   (12)

Learning MSVED

Now that we have described our overall model, we discuss details of the learning process that prove useful to its success.

Learning Discrete Latent Variables
One challenge in training our model is that it is not trivial to perform back-propagation through discrete random variables, and thus it is difficult to learn in models containing discrete tags such as the MSVAE or MSVED. To alleviate this problem, we use the recently proposed Gumbel-Softmax trick (Jang et al., 2017; Maddison et al., 2017) to create a differentiable estimator for categorical variables. The Gumbel-Max trick (Gumbel and Lieblein, 1954; Maddison et al., 2014) offers a simple way to draw samples from a categorical distribution with class probabilities π_1, π_2, ... by using the argmax operation:

one_hot(argmax_i [g_i + log π_i]),

where g_1, g_2, ... are i.i.d. samples drawn from the Gumbel(0, 1) distribution. When making inferences on the morphological labels y_1, y_2, ..., the Gumbel-Max trick can be approximated by the continuous softmax function with temperature τ to generate a sample vector ŷ_i for each label i:

ŷ_{i,j} = exp((log π_{i,j} + g_j)/τ) / Σ_{j'=1}^{N_i} exp((log π_{i,j'} + g_{j'})/τ),

where N_i is the number of classes of label i. When τ approaches zero, the generated sample ŷ_i becomes a one-hot vector; when τ > 0, ŷ_i is smooth w.r.t. π_i. In experiments, we start with a relatively large temperature and decrease it gradually.
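A minimal NumPy sketch of a single Gumbel-Softmax sample (the function name and use of NumPy are our illustrative choices, not the paper's implementation):

```python
import numpy as np

def gumbel_softmax_sample(log_pi, tau, rng):
    """Draw one relaxed sample y-hat from class log-probabilities log_pi:
    add Gumbel(0, 1) noise, divide by the temperature tau, and take a
    (numerically stable) softmax. Small tau -> nearly one-hot sample."""
    g = rng.gumbel(size=log_pi.shape)      # i.i.d. Gumbel(0, 1) noise
    logits = (log_pi + g) / tau
    e = np.exp(logits - logits.max())      # stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
log_pi = np.log(np.array([0.7, 0.2, 0.1]))
sharp = gumbel_softmax_sample(log_pi, tau=0.1, rng=rng)   # near one-hot
smooth = gumbel_softmax_sample(log_pi, tau=5.0, rng=rng)  # near uniform
```

Because the sample is a deterministic, differentiable function of log_pi given the noise, gradients can flow back into the classifier parameters.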

Learning Continuous Latent Variables
The MSVED aims to generate the target sequence conditioned on the latent variable z and the target labels y^(t). This requires the encoder to produce an informative representation z encoding the content of x^(s). However, the variational lower bound in our loss function contains the KL-divergence between the approximate posterior q_φ(z|x) and the prior p(z), which is relatively easy to minimize compared with learning to generate output from a latent representation. We observe that with the vanilla implementation the KL cost quickly decreases to near zero, setting q_φ(z|x) equal to the standard normal distribution. In this case, the RNN decoder can easily rely on the true output of the last time step during training to decode the next token, and it degenerates into an RNN language model. Hence, the latent variables are ignored by the decoder and cannot encode any useful information: the latent variable z learns an undesirable distribution that coincides with the imposed prior distribution but makes no contribution to the decoder. To force the decoder to use the latent variables, we take the following two approaches, similar to Bowman et al. (2016). KL-Divergence Annealing: We add a coefficient λ to the KL cost and gradually anneal it from zero to a predefined threshold λ_m. At the early stage of training, we set λ to zero and let the model first figure out how to project the representation of the source sequence to a roughly right point in the space, and then regularize it with the KL cost.
Although we are not optimizing the tight variational lower bound, the model balances well between generation and regularization. This technique can also be seen in (Kočiskỳ et al., 2016; Miao and Blunsom, 2016). Input Dropout in the Decoder: Besides annealing the KL cost, we also randomly drop out the input token with probability β at each time step of the decoder during learning. The previous ground-truth token embedding is replaced with a zero vector when dropped. In this way, the RNN decoder cannot fully rely on the ground-truth previous token, which ensures that the decoder uses the information encoded in the latent variables.
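Both tricks can be sketched compactly. The linear shape of the annealing schedule below is an illustrative assumption (the text only specifies gradual annealing to a threshold λ_m, set to 0.2 in our experiments, with β = 0.4 for the decoder input dropout):

```python
import numpy as np

def kl_weight(step, warmup_steps, lam_max=0.2):
    """KL-annealing coefficient: rises linearly from 0 to lam_max,
    then stays at lam_max. The linear shape is an illustrative choice."""
    return lam_max * min(1.0, step / warmup_steps)

def drop_decoder_inputs(prev_token_embs, beta, rng):
    """Decoder input dropout: with probability beta, replace the previous
    ground-truth token embedding with a zero vector, independently at
    each time step (rows of prev_token_embs)."""
    keep = rng.uniform(size=(prev_token_embs.shape[0], 1)) >= beta
    return prev_token_embs * keep
```

The annealed weight multiplies only the KL term of the lower bound, so early training focuses on reconstruction before the prior regularization takes effect.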

Architecture for Morphological Reinflection
Training details: For the morphological reinflection task, our supervised training data consists of source x^(s), target x^(t), and target tags y^(t). We test three variants of our model trained using different types of data and different loss functions. First, the single-directional supervised model (SD-Sup) is purely supervised: it only decodes the target word from the given source word, with the loss function L_l(x^(t), y^(t)|x^(s)) from Eq. 12. Second, the bi-directional supervised model (BD-Sup) is trained in both directions, decoding the target word from the source word and decoding the source word from the target word, which corresponds to the loss function L_l(x^(t), y^(t)|x^(s)) + L_u(x^(s)|x^(t)) using Eqs. 11-12. Finally, the semi-supervised model (Semi-sup) is trained to maximize the full semi-supervised objective in Eq. 14, which augments the bidirectional supervised losses with the autoencoding lower bound over unlabeled words.

(Table 1: † marks the best single supervised model score, ‡ marks the best model including semi-supervised models, and bold marks the best score overall. #LD and #ULD are the numbers of labeled examples and unlabeled words, respectively.)
The weight α controls the relative weight between the loss from unlabeled data and labeled data. We use Monte Carlo methods to estimate the expectations over the posterior distributions q(z|x) and q(y|x) inside the objective function in Eq. 14. Specifically, we draw one sample of Gumbel noise and Gaussian noise at a time to compute the latent variables y and z.
The overall model architecture is shown in Fig. 3. Each character and each label is associated with a continuous vector. We employ Gated Recurrent Units (GRUs) for the encoder and decoder. Let →h_t and ←h_t denote the hidden states of the forward and backward encoder RNNs at time step t. The hidden representation of x^(s) is u, the concatenation of the final hidden states of both directions, where T is the word length. u is used as the input to the inference model on z. We represent µ(u) and σ²(u) as MLPs and sample z from N(µ(u), diag(σ²(u))) using z = µ + σ ⊙ ε, where ε ~ N(0, I). Similarly, we obtain the hidden representation of x^(t) and use it as input to the inference model on each label y^(t)_i, which is an MLP followed by a softmax layer that generates the categorical probabilities of the target labels.
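The inference path from the bidirectional encoder states to z can be sketched as follows (single linear maps stand in for the MLPs, and all shapes and names are illustrative):

```python
import numpy as np

def infer_z_params(h_fwd, h_bwd, W_mu, b_mu, W_lv, b_lv):
    """Build u by concatenating the final forward state h_fwd[-1] with
    the final backward state (stored at index 0 here), then map u to
    the posterior mean and log-variance of z."""
    u = np.concatenate([h_fwd[-1], h_bwd[0]])
    mu = W_mu @ u + b_mu
    log_var = W_lv @ u + b_lv
    return mu, log_var
```

A sample of z is then drawn with the reparameterization z = µ + σ ⊙ ε described in §3.1.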
In decoding, we use three types of information in calculating the probability of the next character: (1) the current decoder state, (2) a tag context vector computed using attention (Bahdanau et al., 2015) over the tag embeddings, and (3) the latent variable z. The intuition behind this design is that we would like the model to constantly consider the lemma represented by z, and also to reference the tag corresponding to the morpheme currently being generated. We do not marginalize over the latent variable z, however; instead we use the mode µ of z as its latent representation. We use beam search with a beam size of 8 to perform search over the character vocabulary at each decoding time step.

Other experimental setups: All hyperparameters are tuned on the validation set and include the following: for KL cost annealing, λ_m is set to 0.2 for all language settings; for character dropout at the decoder, we empirically set β to 0.4 for all languages. We set the dimension of character embeddings to 300, tag label embeddings to 200, the RNN hidden state to 256, and the latent variable z to 150. We set α, the weight for the unsupervised loss, to 0.8. We train the model with Adadelta (Zeiler, 2012) and use early stopping with a patience of 10.

Experiments

Background: SIGMORPHON 2016

SIGMORPHON 2016 is a shared task on morphological inflection over 10 different morphologically rich languages. There are a total of three tasks, the most difficult of which is task 3, which requires the system to output the reinflection of an inflected word. The training data format in task 3 is triples of (source word, target labels, target word). In the test phase, the system is asked to generate the target word given a source word and the target labels.
There are a total of three tracks for each task, divided based on the amount of supervised data that can be used to solve the problem; among these, track 2 has the strictest limitation of only using data for the corresponding task. As this is an ideal testbed for our method, which can learn from unlabeled data, we choose track 2 and task 3 to test our model's ability to exploit this data.
As a baseline, we compare our results with the MED system (Kann and Schütze, 2016a), which achieved state-of-the-art results in the shared task. This system uses an encoder-decoder model with attention over the concatenated source word and target labels. Its best result is obtained from an ensemble of five RNN encoder-decoders (Ensemble). To make a fair comparison with our models, which do not use ensembling, we also calculate single-model results (Single).
All models are trained using the labeled training data provided for task 3. For our semi-supervised model (Semi-sup), we also leverage unlabeled data from the training and validation data for tasks 1 and 2 to train variational auto-encoders.

Results and Analysis
From the results in Tab. 1, we can glean a number of observations. First, comparing the results of our full Semi-sup model, we see that for all languages except Spanish it achieves accuracies better than the single MED system, often by a large margin. Even compared to the ensembled MED model, our single-model system is quite competitive, achieving higher accuracies for Hungarian, Navajo, Maltese, and Arabic, as well as state-of-the-art average accuracy.
Next, comparing the different varieties of our proposed models, we see that the semi-supervised model consistently outperforms the bidirectional model for all languages, and similarly, the bidirectional model consistently outperforms the single-direction model. From these results, we conclude that the unlabeled data is beneficial for learning latent variables that are useful for decoding the corresponding word.
Examining the linguistic characteristics of the languages on which our model performs well provides even more interesting insights. The shared task organizers (Cotterell et al., 2016) estimate how often the inflection process involves prefix changes, stem-internal changes, or suffix changes; the results are shown in Tab. 2. Among the languages, the inflection processes of Arabic, Maltese, and Navajo are relatively diverse, containing a large amount of all three forms of inflection. Examining the experimental results together with the morphological inflection processes of the different languages, we find that Navajo, Maltese, and Arabic obtain the largest gains in performance compared with the ensembled MED model. This strongly suggests that our model is agnostic to different morphological inflection forms, whereas the conventional encoder-decoder with attention on the source input tends to perform better on suffixing-oriented morphological inflection. We hypothesize that for languages whose inflection mostly comes from suffixing, transduction is relatively easy because the source and target words share the same prefix and the decoder can copy the prefix of the source word via attention. However, for languages in which different inflections of a lemma go through different morphological processes, the inflected word and the target word may differ greatly, and it is thus crucial to first analyze the lemma of the inflected word before generating the corresponding reinflected form based on the target labels. This is precisely what our model does, by extracting the lemma representation z learned by the variational inference model.

Analysis on Tag Attention
To analyze how the decoder attends to the linguistic labels associated with the target word, we randomly pick two words from the Arabic and Navajo test sets and plot the attention weights in Fig. 5. The Arabic word "al-'imārātiyyātu" is an adjective meaning "Emirati", and its source word in the test data is "'imārātiyyin" (https://en.wiktionary.org/wiki/%D8%A5%D9%85%D8%A7%D8%B1%D8%A7%D8%AA%D9%8A). Both are declensions of "'imārātiyy". The source word is singular, masculine, genitive, and indefinite, while the required inflection is plural, feminine, nominative, and definite. We can see from the left heat map that the attention weights are turned on at several positions of the word when generating the corresponding inflections. For example, "al-" in Arabic is the definite article that marks definite nouns. The same phenomenon can also be observed in the Navajo example, as well as in other languages, but due to space limitations we do not provide detailed analysis here.

Visualization of Latent Lemmas
To investigate the learned latent representations, in this section we visualize the z vectors, examining whether the latent space groups together words with the same lemma. Each sample in SIGMORPHON 2016 contains a source word and a target word which share the same lemma. We run a heuristic process to assign pairs of words to groups that likely share a lemma, grouping together word pairs for which at least one of the words in each pair shares a surface form. This process is not error free: errors may occur when multiple lemmas share the same surface form. But the groupings will generally reflect lemmas except in these rare erroneous cases, so we dub each of these groups a pseudo-lemma.
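This grouping heuristic amounts to a union-find over word pairs; the sketch below (with English stand-in words) is illustrative rather than the exact procedure we used:

```python
def group_pseudo_lemmas(word_pairs):
    """Union-find over (source, target) pairs: any two pairs that share
    a surface form end up merged into one pseudo-lemma group."""
    parent = {}

    def find(w):
        parent.setdefault(w, w)
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path compression
            w = parent[w]
        return w

    for src, tgt in word_pairs:
        parent[find(src)] = find(tgt)      # union the pair

    groups = {}
    for w in list(parent):
        groups.setdefault(find(w), set()).add(w)
    return list(groups.values())

pairs = [("playing", "played"), ("played", "plays"), ("ran", "running")]
groups = group_pseudo_lemmas(pairs)
```

As noted above, homographs of distinct lemmas would be incorrectly merged by this procedure, which is the rare failure mode we accept.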
In Fig. 6, we randomly pick 1500 words from Maltese and visualize their continuous latent vectors. We compute the latent vectors as µ_φ(x) from the variational posterior inference (Eq. 6), without adding the variance. As expected, words that belong to the same pseudo-lemma (shown in the same color) are projected to adjacent points in the two-dimensional space. This demonstrates that the continuous latent variable captures the canonical form of a set of words, confirming the effectiveness of the proposed representation.

Table 3: Randomly picked output examples on the test data. Within each block, the first, second, and third lines are outputs where ours is correct and MED's is wrong, ours is wrong and MED's is correct, and both are wrong, respectively.

Analyzing Effects of Size of Unlabeled Data
From Tab. 1, we can see that semi-supervised learning always performs better than supervised learning without unlabeled data. In this section, we investigate to what extent the amount of unlabeled data helps performance. We process a German corpus from a 2017 Wikipedia dump and obtain more than 100,000 German words, ranked in order of occurrence frequency in Wikipedia. The data contains a certain amount of noise, since we did not apply any special processing. We shuffle all unlabeled data from both Wikipedia and the shared-task data used in previous experiments, increase the number of unlabeled words used in learning by 10,000 each time, and finally use all the unlabeled data (more than 150,000 words) to train the model. Fig. 7 shows that performance on the test data improves as the amount of unlabeled data increases, which implies that unsupervised learning continues to improve the model's latent lemma representations even as we scale to a noisy, real, and relatively large-scale dataset. Note that the performance gains grow smaller as more data is added: although the amount of unlabeled data is increasing, the model has already seen most word patterns in a relatively small vocabulary.

Case Study on Reinflected Words
In Tab. 3, we examine some model outputs on the test data from the MED system and our model. Most errors of both MED and our models can be ascribed to either over-copying or under-copying of characters. In particular, from the complete outputs we observe that our model tends to be more aggressive in its changes, performing more complicated transformations, both successfully (such as Maltese "ndammhomli" to "tindammhiex") and unsuccessfully ("tqożżx" to "qażżejtx"). In contrast, the attentional encoder-decoder model is more conservative in its changes, likely because it is less effective at learning an abstracted representation of the lemma, and instead copies characters directly from the input.

Conclusion and Future Work
In this work, we propose a multi-space variational encoder-decoder framework for labeled sequence transduction problems. The MSVED performs well on the task of morphological reinflection, outperforming the state of the art and further improving with the addition of external unlabeled data. Future work will adapt this framework to other sequence transduction scenarios such as machine translation, dialogue generation, and question answering, where continuous and discrete latent variables can be abstracted to guide sequence generation.