A Cross-Sentence Latent Variable Model for Semi-Supervised Text Sequence Matching

We present a latent variable model for predicting the relationship between a pair of text sequences. Unlike previous auto-encoding–based approaches that consider each sequence separately, our proposed framework utilizes both sequences within a single model by generating a sequence that has a given relationship with a source sequence. We further extend the cross-sentence generating framework to facilitate semi-supervised training. We also define novel semantic constraints that lead the decoder network to generate semantically plausible and diverse sequences. We demonstrate the effectiveness of the proposed model from quantitative and qualitative experiments, while achieving state-of-the-art results on semi-supervised natural language inference and paraphrase identification.


Introduction
Text sequence matching is a task whose objective is to predict the degree of match between two or more text sequences. For example, in natural language inference, a system has to infer the relationship between a premise and a hypothesis sentence, and in paraphrase identification a system should determine whether one sentence is a paraphrase of the other. Since various natural language processing problems, including answer sentence selection, text retrieval, and machine comprehension, involve text sequence matching components, building a high-performance text matching model plays a key role in enhancing the quality of systems for these problems (Tan et al., 2016; Rajpurkar et al., 2016; Wang and Jiang, 2017; Tymoshenko and Moschitti, 2018).
With the emergence of large-scale corpora, end-to-end deep learning models are achieving remarkable results on text sequence matching; these include architectures that are linguistically motivated (Bowman et al., 2016a; Chen et al., 2017a; Kim et al., 2019), that introduce external knowledge (Chen et al., 2018), and that use attention mechanisms (Parikh et al., 2016; Shen et al., 2018b). Recent deep neural network-based work on text matching can roughly be categorized into two subclasses: i) methods that exploit inter-sentence features and ii) methods based on sentence encoders. In this work, we focus on the latter, where sentences are separately encoded using a shared encoder and then fed to a classifier network, due to its efficiency and general applicability across tasks.
Meanwhile, despite the success of deep neural networks in natural language processing, the fact that they require abundant training data can be problematic, as constructing labeled data is a time-consuming and labor-intensive process. To mitigate the data scarcity problem, several semi-supervised learning paradigms, which take advantage of unlabeled data when only some of the data examples are labeled (Chapelle et al., 2010), have been proposed. Unlabeled data are much easier to collect, so utilizing them could be a good option; for example, in text matching, possibly related sentence pairs could be retrieved from a database of text via simple heuristics such as word overlap.
In this paper, we propose a cross-sentence latent variable model for semi-supervised text sequence matching. The proposed framework is based on deep probabilistic generative models (Kingma and Welling, 2014; Rezende et al., 2014) and is extended to make use of unlabeled data. As it is trained to generate a sentence that has a given relationship with a source sentence, both sentences in a pair are utilized together, and thus training objectives are defined more naturally than in other models that consider each sentence separately (Zhao et al., 2018; Shen et al., 2018a). To further regularize the model to generate more plausible and diverse sentences, we define semantic constraints and use them for fine-tuning.
Our experiments show that the proposed method significantly outperforms previous work on semi-supervised text sequence matching. We also conduct extensive qualitative analyses to validate the effectiveness of the proposed model.
The rest of the paper is organized as follows. In §2, we briefly introduce the background for our work. We describe the proposed cross-sentence latent variable model in §3, and give results from experiments in §4. We study the prior work related to ours in §5 and conclude in §6.

Background

Variational Auto-Encoders
The variational auto-encoder (VAE, Kingma and Welling, 2014) is a deep generative model for modeling the data distribution $p_\theta(x)$. It assumes that a data point $x$ is generated by the following random process: (1) $z$ is sampled from $p(z)$ and (2) $x$ is generated from $p_\theta(x|z)$.
Thus the natural training objective would be to directly maximize the marginal log-likelihood $\log p_\theta(x) = \log \int_z p_\theta(x|z)\,p(z)\,dz$. However, it is intractable to compute the marginal log-likelihood without using a simplifying assumption such as the mean-field approximation (Blei et al., 2017). Therefore the following variational lower bound $-\mathcal{L}$ is used as a surrogate objective:

$$-\mathcal{L}(\theta, \phi; x) = -D_{KL}(q_\phi(z|x) \,\|\, p(z)) + \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] \le \log p_\theta(x),$$

where $q_\phi(z|x)$ is a variational approximation to the unknown $p_\theta(z|x)$, and $D_{KL}(q \,\|\, p)$ is the Kullback-Leibler (KL) divergence between $q$ and $p$. Maximizing the surrogate objective $-\mathcal{L}$ is proven to minimize $D_{KL}(q_\phi(z|x) \,\|\, p_\theta(z|x))$, and it can also be seen as maximizing the expected data log-likelihood with respect to $q_\phi$ while using $D_{KL}(q_\phi(z|x) \,\|\, p(z))$ as a regularization term.
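The relationship between the lower bound and the marginal log-likelihood can be checked numerically on a toy model. The sketch below is only an illustration under simplifying assumptions: the latent variable is discrete with two values (so the marginal is a finite sum rather than an integral), and the function name `elbo_and_gap` and all probabilities are made up for the example.

```python
import math

def elbo_and_gap(prior, likelihood, q):
    """Toy discrete-latent check of the variational lower bound.

    prior[k]      = p(z = k)
    likelihood[k] = p(x | z = k) for one fixed observation x
    q[k]          = q(z = k | x), an arbitrary approximate posterior
    Returns (log p(x), lower bound, KL(q || true posterior)).
    """
    log_px = math.log(sum(p * l for p, l in zip(prior, likelihood)))
    # Expected log-likelihood under q (reconstruction term).
    recon = sum(qk * math.log(l) for qk, l in zip(q, likelihood) if qk > 0)
    # KL(q || prior), the regularization term.
    kl_prior = sum(qk * math.log(qk / p) for qk, p in zip(q, prior) if qk > 0)
    bound = recon - kl_prior
    # True posterior p(z | x) by Bayes' rule, and the gap KL(q || p(z | x)).
    posterior = [p * l / math.exp(log_px) for p, l in zip(prior, likelihood)]
    kl_post = sum(qk * math.log(qk / pk) for qk, pk in zip(q, posterior) if qk > 0)
    return log_px, bound, kl_post

log_px, bound, gap = elbo_and_gap([0.5, 0.5], [0.9, 0.2], [0.7, 0.3])
```

In this toy setting the bound plus the gap recovers the marginal log-likelihood exactly, which is why maximizing the bound with respect to the approximate posterior shrinks its KL divergence to the true posterior.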
VAEs for text pair modeling. The simplest approach to modeling text pairs using the VAE framework is to consider two text sequences separately (Zhao et al., 2018; Shen et al., 2018a). That is, a generator is trained to reconstruct a single input sequence rather than integrating both sequences, and the two latent representations encoded from a variational posterior are given to a classifier network. When label information is not available, only the reconstruction objective is used for training. This means that the classifier parameters are not updated in the unsupervised setting, and thus the interaction between the variational posterior (or encoder) and the classifier could be restricted.

von Mises-Fisher Distribution
Since the advent of deep generative models with variational inference, the typical choice for prior and variational posterior distribution has been the Gaussian, likely due to its well-studied properties and ease of reparameterization. However, it often leads a model to face the posterior collapse problem, where a model ignores latent variables by pushing the KL divergence term to zero (Chen et al., 2017b; van den Oord et al., 2017), especially in text generation models where powerful decoders are used (Bowman et al., 2016b; Yang et al., 2017).
A von Mises-Fisher (vMF) distribution is a probability distribution on the $(d-1)$-sphere; samples are therefore compared according to their directions, reminiscent of cosine similarity. It has two parameters: the mean direction $\mu \in \mathbb{R}^d$ and the concentration $\kappa \in \mathbb{R}$. As the KL divergence between $\mathrm{vMF}(\mu, \kappa)$ and the hyperspherical uniform distribution $U(S^{d-1}) = \mathrm{vMF}(\cdot, 0)$ depends only on $\kappa$, the KL divergence is a constant if the concentration parameter is fixed. Therefore, when $\mathrm{vMF}(\mu, \kappa)$ with fixed $\kappa$ and $\mathrm{vMF}(\cdot, 0)$ are used as posterior and prior, posterior collapse inherently does not occur.
To the best of our knowledge, Guu et al. (2018) were the first to use vMF as posterior and prior for VAEs, and Xu and Durrett (2018) empirically demonstrated the effectiveness of vMF-VAE in natural language generation. Davidson et al. (2018) generalized the vMF-VAE and proposed the reparameterization trick for vMF. We refer readers to Appendix A for a detailed description of the vMF distribution we used.
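To make the "constant KL" property concrete, the following sketch evaluates the KL divergence between a vMF distribution and the uniform distribution in the special case of the 2-sphere (three-dimensional vectors), where the Bessel functions reduce to hyperbolic functions. The restriction to three dimensions and the function name `kl_vmf_uniform_s2` are assumptions made for illustration; the general case requires modified Bessel functions.

```python
import math

def kl_vmf_uniform_s2(kappa):
    """KL( vMF(mu, kappa) || Uniform(S^2) ) for 3-dimensional unit vectors.

    On S^2 the vMF density is kappa / (4*pi*sinh(kappa)) * exp(kappa * mu.x)
    and the mean of mu.x is coth(kappa) - 1/kappa, which gives
        KL = kappa*coth(kappa) - 1 + log(kappa / sinh(kappa)).
    The mean direction mu does not appear anywhere in the result.
    """
    if kappa == 0.0:
        return 0.0  # vMF(., 0) is the uniform distribution itself
    return kappa / math.tanh(kappa) - 1.0 + math.log(kappa / math.sinh(kappa))

kl_at_one = kl_vmf_uniform_s2(1.0)
kl_at_five = kl_vmf_uniform_s2(5.0)
```

Because the function takes only the concentration as input, fixing the concentration turns the KL term of the training objective into a constant that the encoder cannot drive to zero.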

Proposed Framework
In this section, we describe the proposed framework in detail. We formally define the cross-sentence latent variable model (CS-LVM) and describe the optimization objectives. We also introduce semantic constraints to keep learned representations in a semantically plausible region. Fig. 1 illustrates the entire framework.

Cross-Sentence Latent Variable Model
Though the auto-encoding frameworks described in §2.1 have intriguing properties, they may hinder the possibility of training an encoder to extract rich features for text pair modeling, because the generative modeling process is confined within a single sequence. Therefore the interaction between a generative model and a discriminative classifier is restricted, since the two sequences are separately modeled and the pair-wise information is only considered through the classifier network.
Our proposed CS-LVM addresses this problem by cross-sentence generation of text given a text pair and its label. As the sentences in a pair are directly related within a generative model, the training objectives are defined in a more principled way than in VAE-based semi-supervised text matching frameworks. Notably, it also mimics the dataset construction process of some corpora: a worker generates a target text given a label and a source text (e.g. Bowman et al., 2015; Williams et al., 2018).
Given a pair $(x_1, x_2)$, let $x_s, x_t \in \{x_1, x_2\}$ be a source and a target sequence respectively. Then we assume $x_t$ is generated according to the following process (see Fig. 2a):
1. a latent variable $z_s$ that contains the content of a source sequence is sampled from $p(z_s)$,
2. a variable $y$ that determines the relationship between a target and the source sequence is sampled from $p(y)$,
3. $x_t$ is generated from a conditional distribution $p_\theta(x_t | z_s, y)$.
In the above process, the class label y is treated as a hidden variable in the unsupervised case and an observed variable in the supervised case.
Accordingly, when the label information is available, the optimization objective for a generative model is the marginal log-likelihood of the observed variables $x_t$ and $y$:

$$\log p_\theta(x_t, y) = \log \int p_\theta(x_t | y, z_s)\, p(z_s)\, p(y)\, dz_s. \quad (1)$$

To address the intractability we instead optimize the lower bound of Eq. 1:

$$\log p_\theta(x_t, y) \ge -D_{KL}(q_\phi(z_s|x_s) \,\|\, p(z_s)) + \mathbb{E}_{q_\phi(z_s|x_s)}[\log p_\theta(x_t | y, z_s)] + \log p(y), \quad (2)$$

where $q_\phi(z_s|x_s)$ is a variational approximation of the posterior $p_\theta(z_s | x_t, y)$. Though Eq. 2 holds for any $q_\phi$ having the same support as $p(z_s)$, we choose this form of variational posterior with the following motivation: since $x_s$ is related to $x_t$ by the label information $y$, $x_s$ would have an influence on the space of $z_s$ in a similar way to $(x_t, y)$. Due to this particular choice of $q_\phi$, $z_s$ depends only on $x_s$ and is independent of the label information possibly permeated in $x_t$. In other words, this design induces $q_\phi$ to extract the features needed for controlling the semantics only from $x_s$, while preventing $q_\phi$ from encoding other biases.
To extend the objective to the unsupervised setup, we marginalize out $y$ from Eq. 2 using a classifier distribution. We provide a more detailed explanation of the optimization objectives in §3.3.

Architecture
Now we describe the architectures we used for constructing CS-LVM. We first encode a source sequence into a fixed-length representation using a recurrent neural network (RNN): $g^{enc}(x_s) = m_s$. From $m_s$ we obtain a variational approximate distribution $q_\phi(z_s|x_s) = g^{code}(m_s)$ and sample a latent representation $z_s \sim q_\phi(z_s|x_s)$. In our experiments, a long short-term memory (LSTM) recurrent network and a feed-forward network are used as $g^{enc}$ and $g^{code}$ respectively. Since the mean direction parameter $\mu_s$ of $\mathrm{vMF}(\mu_s, \kappa)$ should be a unit vector, $g^{code}$ additionally normalizes the output of the feed-forward network so that $\|g^{code}(m_s)\|_2 = 1$.
Then we generate the target sequence $x_t$ from $z_s$ and $y$. Similarly to the encoder network, we use an LSTM for the decoder, thus the distribution is factorized as follows:

$$p_\theta(x_t | y, z_s) = \prod_{i=1}^{N_{x_t}+1} p_\theta(w_{t,i} | w_{t,<i}, y, z_s), \quad (3)$$

where $x_t = (w_{t,1}, \ldots, w_{t,N_{x_t}})$ and $w_{t,0} = \text{<s>}$, $w_{t,N_{x_t}+1} = \text{</s>}$ are special tokens indicating the start and the end of a sequence.
We project the word index $w_{t,i}$ and label index $y$ into embedding spaces to obtain the word embedding $\mathbf{w}_{t,i}$ and label embedding $\mathbf{y}$. Then, to construct the decoder input $v_{t,i}$ for the $i$-th time step, we concatenate the $i$-th target word embedding, the label embedding, and the latent representation $z_s$: $v_{t,i} = [\mathbf{w}_{t,i}; \mathbf{y}; z_s]$. The distribution over the next word is obtained by applying $g^{out}$ to the decoder state computed by $g^{dec}_i$ from $v_{t,i}$, where $g^{out}$ is a feed-forward network and $g^{dec}_i$ is the state transition function of the decoder LSTM at the $i$-th time step.
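The normalization of the mean direction and the construction of the decoder input can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the experiments: vectors are plain lists, the helper names `unit_normalize` and `decoder_input` are hypothetical, and the toy dimensions do not correspond to the actual settings.

```python
import math

def unit_normalize(vec):
    """Scale a vector to unit L2 norm, as required for a vMF mean direction."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [v / norm for v in vec]

def decoder_input(word_emb, label_emb, z_s):
    """Concatenate [word embedding; label embedding; latent code] for one step."""
    return list(word_emb) + list(label_emb) + list(z_s)

# Toy values (hypothetical): 4-d code, 3-d word embedding, 2-d label embedding.
mu_s = unit_normalize([0.5, -1.0, 2.0, 0.0])  # mean direction of q(z_s | x_s)
z_s = mu_s  # stand-in for a sample z_s ~ vMF(mu_s, kappa)
v = decoder_input([0.1, 0.2, 0.3], [1.0, 0.0], z_s)
```

Since the label embedding and the latent code are repeated at every time step, the decoder sees the conditioning information throughout generation rather than only in its initial state.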
For a discriminative classifier network we follow the siamese architecture, as mentioned in §1.
$x_s$ and $x_t$ are fed to a shared LSTM network $f^{enc}$ to obtain sentence vectors $h_1 = f^{enc}(x_s)$ and $h_2 = f^{enc}(x_t)$. Then $h_1$ and $h_2$ are combined by the function $f^{fuse}$ to form a single fused vector, and the fused representation is given to a feed-forward network $f^{disc}$ to infer the relationship: $q_\psi(y | x_s, x_t) = f^{disc}(f^{fuse}(h_1, h_2))$.
To learn from data more efficiently and to reduce the number of trainable parameters, we tie the weights of the two encoders, i.e. the encoder of the generative model and that of the discriminative classifier: $g^{enc} = f^{enc}$. This mitigates the problem that only source sequences are used for training $g^{enc}$ and enhances the interaction between the generative model and the classifier. We will see from experiments that tying encoder weights improves performance and stabilizes optimization (§4.3).
Also note that the functions $g$ are only used in training, and the model has the same test-time computational complexity as typical classification models.

Algorithm 1 Training procedure of CS-LVM.
5: Compute $L_l(\theta, \phi, \psi; x_{l,s}, x_{l,t}, y_l)$ by (6)
6: Compute $L_u(\theta, \phi, \psi; x_{u,s}, x_{u,t})$ by (9)
7: Update $\theta, \phi, \psi$ by gradient descent on $L_l + L_u$
8: until stop criterion is met
9: procedure FINETUNE($X_l$, $X_u$, $\theta$, $\phi$, $\psi$)
10: repeat
11: Update $\theta, \phi, \psi$ following lines 3-7
12: Update $\theta$ by gradient descent on (11-14)
13: until stop criterion is met

Optimization
In this subsection we describe how the entire model is optimized. We first define optimization objectives for supervised and unsupervised training, and then introduce constraints to regularize the model to generate sequences with intended semantic characteristics. The entire optimization procedure is summarized in Algorithm 1.

Supervised Objective
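In the supervised setting, a data sample is assumed to contain label information: $(x_1, x_2, y) \in X_l$. Without loss of generality let us assume $(x_s, x_t) = (x_1, x_2)$. [3] Since $y$ is an observed variable in this case, we can directly use Eq. 2 in training. From Eqs. 2 and 3, the objective for the generative model is defined by:

$$-\mathcal{L}^{g}_{l}(\theta, \phi; x_s, x_t, y) = -D_{KL}(q_\phi(z_s|x_s) \,\|\, p(z_s)) + \sum_{i=1}^{N_{x_t}+1} \log p_\theta(w_{t,i} | w_{t,<i}, y, z_s) + \log p(y), \quad (4)$$

where $z_s \sim q_\phi(z_s|x_s)$ and $p(y)$, $p(z_s)$ are prior distributions of $y$, $z_s$. Since we assume $p(y)$ to be a fixed uniform distribution over labels, the $\log p(y)$ term can be ignored in training: $\|\nabla_{\theta,\phi} \log p(y)\|_2 = 0$.
For training, the typical teacher forcing method is used; i.e. ground-truth words are used as input words. We use $\mathrm{vMF}(g^{code}(m_s), \kappa)$ ($\kappa$: hyperparameter) for the variational posterior $q_\phi(z_s|x_s)$ and $\mathrm{vMF}(\cdot, 0)$ for the prior $p(z_s)$.
The discriminator objective is defined as a conventional maximum likelihood:

$$\mathcal{L}^{d}_{l}(\psi; x_s, x_t, y) = -\log q_\psi(y | x_s, x_t). \quad (5)$$

Finally, the two objectives are combined to construct the objective for supervised training:

$$\mathcal{L}_l(\theta, \phi, \psi; x_s, x_t, y) = \mathcal{L}^{g}_{l}(\theta, \phi; x_s, x_t, y) + \lambda\, \mathcal{L}^{d}_{l}(\psi; x_s, x_t, y), \quad (6)$$

where $\lambda$ is a hyperparameter.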

Unsupervised Objective
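In this case, the model does not have access to label information; a data point is represented by $(x_s, x_t) \in X_u$ and thus $y$ is a hidden variable. To facilitate unsupervised training, we marginalize $y$ out as below and derive the lower bound:

$$\log p_\theta(x_t) = \log \sum_y \int p_\theta(x_t | y, z_s)\, p(y)\, p(z_s)\, dz_s \ge \mathbb{E}_{q(y, z_s | x_s, x_t)}\left[\log \frac{p_\theta(x_t | y, z_s)\, p(y)\, p(z_s)}{q(y, z_s | x_s, x_t)}\right]. \quad (7)$$

And from the assumption presented in the graphical model (Fig. 2b),

$$q(y, z_s | x_s, x_t) = q_\phi(z_s | x_s)\, q_\psi(y | x_s, x_t). \quad (8)$$

Finally we obtain the following lower bound for $\log p_\theta(x_t)$ from Eqs. 7 and 8:

$$\log p_\theta(x_t) \ge -D_{KL}(q_\phi(z_s|x_s) \,\|\, p(z_s)) + \mathbb{E}_{q_\phi(z_s|x_s)}\big[\mathbb{E}_{q_\psi(y|x_s,x_t)}[\log p_\theta(x_t | y, z_s)]\big] - D_{KL}(q_\psi(y|x_s,x_t) \,\|\, p(y)) = -\mathcal{L}_u(\theta, \phi, \psi; x_s, x_t). \quad (9)$$

Here the second expectation, taken over $y$, can be computed either by enumeration or sampling, and we used the former as the datasets we used have relatively small label sets (2 or 3) and it is known to yield better results than sampling (Xu et al., 2017). We will compare the two methods in §4.3.

[3] … on the characteristics of a task. For some experiments we additionally used swapped data examples, $(x_s, x_t) = (x_2, x_1)$, for training. We explain more on this in §4.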
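The difference between enumeration and sampling for the expectation over labels can be illustrated with a small sketch. The classifier probabilities and per-label log-likelihoods below are made-up numbers, and the function names are hypothetical; in the actual model each log-likelihood would come from a decoder pass conditioned on the corresponding label.

```python
import random

def expected_ll_enumerate(q_y, log_lik):
    """Exact expectation: weight the log-likelihood of every label by q(y)."""
    return sum(q * ll for q, ll in zip(q_y, log_lik))

def expected_ll_sample(q_y, log_lik, n_samples, rng):
    """Monte Carlo estimate: draw labels from q(y) and average."""
    labels = rng.choices(range(len(q_y)), weights=q_y, k=n_samples)
    return sum(log_lik[y] for y in labels) / n_samples

# Hypothetical classifier output over three labels and decoder log-likelihoods.
q_y = [0.7, 0.2, 0.1]
log_lik = [-12.0, -15.0, -20.0]

exact = expected_ll_enumerate(q_y, log_lik)
approx = expected_ll_sample(q_y, log_lik, 20000, random.Random(0))
```

Enumeration needs one decoder evaluation per label, so its cost grows with the size of the label set, whereas a single-sample estimate needs only one evaluation at the price of a noisier gradient signal.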
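To sum up, at every training iteration, given a labeled and an unlabeled data sample $(x_{l,s}, x_{l,t}, y_l)$, $(x_{u,s}, x_{u,t})$, we optimize the following objective:

$$\mathcal{L}(\theta, \phi, \psi) = \mathcal{L}_l(\theta, \phi, \psi; x_{l,s}, x_{l,t}, y_l) + \mathcal{L}_u(\theta, \phi, \psi; x_{u,s}, x_{u,t}). \quad (10)$$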

Fine-Tuning with Semantic Constraints
Since the generator is trained via maximum likelihood training, which considers all words in a sentence equivalently, the label information may not be reflected enough in generation owing to high-frequency words. For example, in natural language inference, the word occurrences of the following three hypothesis sentences highly overlap, but they should have different relations with the premise.
P: A man is cutting metal with a tool .
H1: A man is cutting metal .
H2: A man is cutting metal with the wrong tool .
H3: A man is cutting metal with his mind .
Thus for some data points, a strategy that only predicts words that overlap across hypotheses could receive a fairly high score, which might weaken the integration of $y$ into the generator. To mitigate this, we fine-tune the trained generator using the semantic constraint $R_y$, which is computed with $\tilde{y} \sim p(y)$, $z_s \sim q_\phi(z_s|x_s)$, and $\tilde{x}_t = \arg\max_{x_t} p_\theta(x_t | \tilde{y}, z_s)$. This constraint enforces the sequence $\tilde{x}_t$ generated by conditioning on $\tilde{y}$ and $z_s$ to actually have the relationship $\tilde{y}$ with $x_s$.
We also introduce a constraint $R_z$ on $z$ that keeps the distributions of $\tilde{z}_t$ (the latent content variable obtained by encoding the generated sequence $\tilde{x}_t$) and $z_s$ close, where $\tilde{z}_t \sim q_\phi(\tilde{z}_t | \tilde{x}_t)$. In other words, it pushes the generated sequence $\tilde{x}_t$ to be in a similar semantic space to the ground-truth target sequence $x_t$. Consequently, it can help alleviate the generator collapse problem, where a generator produces only a handful of simple neutral patterns independent of the input sequence, by relating $\tilde{z}_t$ to $z_t$. [7]
From a similar motivation, we also add a constraint that encourages the generated sentences originating from different source sentences to be dissimilar. To reflect this, we define a minibatch-level constraint $R_\mu$ that penalizes the mean direction vectors encoded from the generated sentences for being too close. Values related to the $i$-th sample of a minibatch $B$ are denoted using the superscript $(i)$, $\tilde{\mu}^{(i)}$ is the mean direction vector encoded from the $i$-th generated sentence, and $d(\cdot, \cdot)$ is a distance measure between vectors. The mean direction vector $\mu$ of $\mathrm{vMF}(\mu, \kappa)$ is on a unit hypersphere, so we use the cosine distance:

$$d(a, b) = 1 - \frac{a^\top b}{\|a\|_2\, \|b\|_2}.$$

As the sequence generation process is not differentiable, the gradients from the semantic constraints cannot propagate to the generator parameters. To relax the discreteness, we use the Gumbel-Softmax reparameterization (Jang et al., 2017; Maddison et al., 2017). Using the Gumbel-Softmax trick, we obtain a continuous probability vector that approximates a sample from the categorical distribution of words at each step, and use the probability vector to compute the expected word embedding for the subsequent step.
When multiple constraints are used, they are combined using homoscedastic uncertainty weighting (Kendall et al., 2018).

[7] The basic assumption behind this constraint is that a source and a target sequence are associated in a certain aspect, and it generally holds in most of the available pair classification datasets, e.g. SNLI, SICK, SciTail, QQP, MRPC.
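The Gumbel-Softmax relaxation and the expected word embedding can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the vocabulary has three words, the embedding table and logits are made up, and the helper names `gumbel_softmax` and `expected_embedding` are hypothetical rather than taken from any library.

```python
import math
import random

def gumbel_softmax(logits, temperature, rng):
    """Relaxed categorical sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    top = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def expected_embedding(probs, embedding_table):
    """Probability-weighted average of the rows of the embedding table."""
    dim = len(embedding_table[0])
    return [sum(p * row[j] for p, row in zip(probs, embedding_table))
            for j in range(dim)]

# Hypothetical 3-word vocabulary with 2-dimensional embeddings.
table = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
probs = gumbel_softmax([2.0, 1.0, 0.1], temperature=0.5, rng=random.Random(0))
next_input = expected_embedding(probs, table)
```

As the temperature decreases, the probability vector approaches a one-hot vector and the expected embedding approaches the embedding of a single sampled word, while every operation stays differentiable with respect to the logits.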

Experiments
We evaluate the proposed model on two semi-supervised tasks: natural language inference and paraphrase identification. We also implement a strong baseline that has a similar architecture to LSTM-VAE (Shen et al., 2018a) but uses the vMF distribution for prior and posterior, named LSTM-vMF-VAE. To further explore the proposed model, we conduct extensive qualitative analyses. For detailed settings and hyperparameters, please refer to Appendix C.

Natural Language Inference
Natural language inference (NLI) is the task of predicting the relationship between a premise and a hypothesis sentence. We use the SNLI dataset for experiments. Following Shen et al. (2018a), we consider scenarios where 28k, 59k, and 120k labeled data samples are available. Also, for fair comparison with prior work, we set the size of the word vocabulary to 20,000 and do not utilize pre-trained word embeddings such as GloVe (Pennington et al., 2014).
To combine the representations of a premise and a hypothesis and to construct an input to $f^{disc}$, we use the heuristic-based fusion proposed by Mou et al. (2016), which concatenates the two sentence vectors with their element-wise difference and element-wise product; here $[a; b]$ indicates concatenation of vectors $a$, $b$ and $\odot$ is the element-wise product. Table 1 summarizes the results of the experiments. We can clearly see that the proposed CS-LVM architecture substantially outperforms other models based on auto-encoding. Also, the semantic constraints bring an additional boost in performance, achieving the new state of the art in semi-supervised classification on the SNLI dataset.
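A heuristic fusion in the style of Mou et al. (2016) can be sketched as below. This is an illustrative assumption rather than the exact function used in the experiments: the sketch uses the signed element-wise difference, whereas some implementations take its absolute value, and the name `fuse_heuristic` is hypothetical.

```python
def fuse_heuristic(h1, h2):
    """Return [h1; h2; h1 - h2; h1 * h2] using element-wise operations."""
    if len(h1) != len(h2):
        raise ValueError("sentence vectors must have the same dimension")
    diff = [a - b for a, b in zip(h1, h2)]
    prod = [a * b for a, b in zip(h1, h2)]
    return list(h1) + list(h2) + diff + prod

# Two toy 2-dimensional sentence vectors produce an 8-dimensional fused vector.
fused = fuse_heuristic([1.0, 2.0], [3.0, -1.0])
```

Because the raw vectors and the signed difference both depend on argument order, this fusion treats the premise and the hypothesis asymmetrically, which suits directional relations such as entailment.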

Paraphrase Identification
Paraphrase identification (PI) is a task whose objective is to infer whether two sentences have the same semantics. We use the Quora Question Pairs dataset (QQP, Wang et al., 2017) for experiments. QQP consists of over 400k sentence pairs, each of which has label information indicating whether the sentences in a pair paraphrase each other or not. We experiment with the cases where the number of labeled data is 1k, 5k, 10k, and 25k, and set the vocabulary size to 10,000, following Shen et al. (2018a). Unlike auto-encoding-based models that treat sentences in a pair equivalently, the CS-LVM processes them asymmetrically because of its cross-sentence generating property. This property is useful when some relationships are asymmetric (e.g. NLI); however, the paraphrase relationship is bidirectional, so we also use swapped text pairs in training. To fuse sentence representations, a symmetric function is used, as in Ji and Eisenstein (2013). The results of the experiments on QQP are summarized in Table 2. Again, the proposed CS-LVM consistently outperforms other supervised and semi-supervised models by a large margin, setting the new state-of-the-art result on the QQP dataset in the semi-supervised setting.
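The requirement that the fusion be symmetric can be illustrated with one possible choice, the concatenation of the element-wise sum and the absolute element-wise difference. This particular form and the name `fuse_symmetric` are assumptions made for the example and are not necessarily the function used in the experiments.

```python
def fuse_symmetric(h1, h2):
    """Return [h1 + h2; |h1 - h2|], which is unchanged when inputs are swapped."""
    if len(h1) != len(h2):
        raise ValueError("sentence vectors must have the same dimension")
    total = [a + b for a, b in zip(h1, h2)]
    abs_diff = [abs(a - b) for a, b in zip(h1, h2)]
    return total + abs_diff

a = [1.0, 2.0, -0.5]
b = [3.0, -1.0, 0.25]
fused_ab = fuse_symmetric(a, b)
fused_ba = fuse_symmetric(b, a)
```

With a symmetric fusion, the classifier output does not depend on which sentence is treated as the source, which matches the bidirectional nature of the paraphrase relation.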

Ablation Study
To assess the effect of each element, we experiment with model variants where some of the components are removed. Specifically, we conduct an ablation study for the following variants: (i) without cross-sentence generation (i.e. auto-encoding setup), (ii) replacing the vMF distribution with a Gaussian, (iii) computing the expectation term of Eq. 9 by sampling, and (iv) without encoder weight sharing (i.e. $f^{enc} \neq g^{enc}$). The SNLI dataset is used for the model ablation experiments, and trained models are not fine-tuned in order to focus only on the efficacy of each model component.
Results of the ablation study are presented in Table 3. As expected, cross-sentence generation is the most critical factor for performance, except for the 28k setting where encoder weight tying brought the biggest gain. In the 59k and 120k settings, all other variants that maintain the cross-generating property outperform the VAE-based models (see (ii), (iii), (iv)).
Replacing the vMF with a Gaussian does not severely harm the accuracy; however, it requires the additional process of finding a KL cost annealing rate. When sampling is used instead of enumeration for computing Eq. 9, about a 1.2x speedup is observed in exchange for slight performance degradation, and thus sampling could be a good option when the number of label classes is large.
Finally, as mentioned in §3.2, variants whose encoder weights are untied do not work well.We conjecture this is because g enc receives the error signal only from a source sentence and could not fully benefit from both sentences.The fact that the performance degradation is larger when the number of labeled data is small also agrees with our hypothesis, since unlabeled data affect the classifier encoder only by the entropy term when encoder weights are not shared.

Generated Sentences
We give examples of generated sentences to validate that the proposed model learns to generate text having the desired properties. From Table 4, we can see that sentences generated from the identical input sentence properly reflect the given label information. More generated examples are presented in Appendix D.
Further, to quantitatively measure the quality of generated sentences, we construct artificial datasets, where each premise and label in the SNLI development set is used as input to our trained generator and generated hypotheses are collected. Then we prepare an LSTM classifier that is trained on the original SNLI dataset as a surrogate for the ideal classifier, and use it for measuring the quality of the generated datasets. We also compute the diversity of the generated hypotheses using the metrics proposed by Li et al. (2016), to verify the effect of the diversity-promoting semantic constraints.
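The diversity metrics of Li et al. (2016), distinct-1 and distinct-2, are the ratio of the number of unique unigrams or bigrams to the total number of generated tokens. A minimal sketch follows, assuming whitespace tokenization and counting n-grams within each sentence; the function name `distinct_n` is hypothetical.

```python
def distinct_n(sentences, n):
    """Ratio of unique n-grams to the total number of generated tokens."""
    total_tokens = 0
    unique_ngrams = set()
    for sentence in sentences:
        tokens = sentence.split()
        total_tokens += len(tokens)
        # Collect every contiguous n-gram of this sentence.
        for i in range(len(tokens) - n + 1):
            unique_ngrams.add(tuple(tokens[i:i + n]))
    return len(unique_ngrams) / total_tokens if total_tokens else 0.0

generated = ["the kids are sleeping .", "the men are sleeping ."]
d1 = distinct_n(generated, 1)
d2 = distinct_n(generated, 2)
```

A generator that keeps producing the same short patterns yields few unique n-grams relative to the number of tokens, so lower values indicate less diverse output.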
Results of the evaluation on the artificial datasets are presented in Table 5. The classifier trained on the original dataset predicts the generated data fairly well, from which we verify that the generated sentences contain the desired semantics. Also, as expected, fine-tuning with $R_y$ in-

Related Work
Semi-supervised learning for text classification. Using unlabeled data for text classification is an important subject and there exists much previous research (Zhu et al., 2003; Nigam et al., 2006; Zhu, 2008, to name but a few). Notably, the work of Xu et al. (2017) applies the semi-supervised VAE (Kingma et al., 2014) to the single-sentence text classification problem. Zhao et al. (2018); Shen et al. (2018a) present VAE models for semi-supervised text sequence matching, while their models have drawbacks as mentioned in §3.
When the use of external corpora is allowed, the performance can be further increased. Dai and Le (2015); Ramachandran et al. (2017) train an encoder-decoder network on large corpora and fine-tune the learned encoder on a specific task. Recently, there have been remarkable improvements in pre-trained language representations (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018), where language models trained on extremely large data brought a huge performance boost. These methods are orthogonal to our work, and additional enhancements are expected when they are used together with our model.

Cross-sentence generating LVMs.
There exists some prior work on cross-sentence generating LVMs. Shen et al. (2017) introduce a similar data generation assumption to ours and apply the idea to unaligned style transfer and natural language generation. Zhang et al. (2016); Serban et al. (2017) use latent variable models for machine translation and dialogue generation. Kang et al. (2018) propose a data augmentation framework for natural language inference that generates a sentence; however, unlabeled data are not considered in their work. Deudon (2018) builds a sentence-reformulating deep generative model whose objective is to measure the semantic similarity between a sentence pair. However, their work cannot be applied to a multi-class classification problem, and the generative objective is only used in pre-training, not considering the joint optimization of the generative and the discriminative objective. To the best of our knowledge, our work is the first to introduce the concept of a cross-sentence generating LVM to the semi-supervised text matching problem.

Table 5: Results of evaluation of generated artificial datasets. distinct-1 and distinct-2 compute the ratio of the number of unique unigrams or bigrams to that of the total generated tokens (Li et al., 2016).

Conclusion
In this work, we proposed a cross-sentence latent variable model (CS-LVM) for semi-supervised text sequence matching. Given a pair of text sequences and the corresponding label, it uses one of the sequences and the label as input and generates the other sequence. Due to the use of cross-sentence generation, the generative model and the discriminative classifier interact more strongly, and our experiments showed that the CS-LVM outperforms other models by a large margin. We also defined multiple semantic constraints to further regularize the model, and observed that fine-tuning with them gives an additional increase in performance.
For future work, we plan to focus on generating more realistic text and to use the generated text in other tasks, e.g. data augmentation and addressing adversarial attacks. Although the current model makes fairly plausible sentences, it tends to prefer relatively short and safe sentences, as the main goal of the training is to accurately predict the relationship between sentences. We expect the model could perform more natural generation by applying recent advancements in deep generative models.

A von Mises-Fisher Distribution
A von Mises-Fisher (vMF) distribution is a distribution defined on the unit hypersphere in $\mathbb{R}^m$, i.e. the $(m-1)$-sphere. It is parameterized by two parameters: the mean direction $\mu \in \mathbb{R}^m$ and the concentration $\kappa \in \mathbb{R}$. The probability density function (pdf) of $\mathrm{vMF}(\mu, \kappa)$ is defined by

$$f(x; \mu, \kappa) = C_m(\kappa) \exp(\kappa \mu^\top x),$$

where

$$C_m(\kappa) = \frac{\kappa^{m/2-1}}{(2\pi)^{m/2}\, I_{m/2-1}(\kappa)}$$

and $I_v(\kappa)$ is the modified Bessel function of the first kind at order $v$. Eq. 17 is used in the computation of $R_z$.
A sample from a vMF distribution is drawn using the acceptance-rejection scheme presented in Algorithm 1 of Davidson et al. (2018). In their algorithm, the stochastic variable obtained from the acceptance-rejection sampling does not depend on $\mu$, thus the sampling process can be rewritten as a deterministic function that accepts the stochastic variable as input (i.e. the reparameterization trick).
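The idea that the noise is independent of the mean direction can be illustrated in three dimensions, where no rejection step is needed. Note that the sketch below swaps in a different technique from the acceptance-rejection scheme of Davidson et al. (2018): it uses the closed-form inverse CDF available on the 2-sphere, followed by a Householder reflection that maps the first coordinate axis onto the mean direction. The function name `sample_vmf_s2` and the restriction to three dimensions are assumptions made for the example.

```python
import math
import random

def sample_vmf_s2(mu, kappa, rng):
    """Draw one sample from vMF(mu, kappa) on the 2-sphere (kappa > 0)."""
    # Noise that does not depend on mu: a uniform draw and a uniform angle.
    u = rng.random()
    phi = 2.0 * math.pi * rng.random()
    # Inverse CDF of the component w along the mean direction, density ~ exp(kappa*w).
    w = 1.0 + math.log(u + (1.0 - u) * math.exp(-2.0 * kappa)) / kappa
    r = math.sqrt(max(0.0, 1.0 - w * w))
    x = [w, r * math.cos(phi), r * math.sin(phi)]  # sample centred on e1 = (1, 0, 0)
    # Deterministic map of the noise: Householder reflection sending e1 to mu.
    h = [1.0 - mu[0], -mu[1], -mu[2]]
    norm = math.sqrt(sum(c * c for c in h))
    if norm < 1e-12:
        return x  # mu already equals e1
    h = [c / norm for c in h]
    dot = sum(a * b for a, b in zip(h, x))
    return [a - 2.0 * dot * b for a, b in zip(x, h)]

sample = sample_vmf_s2([0.0, 0.0, 1.0], 5.0, random.Random(0))
```

Since the mean direction enters only through the final deterministic transformation, gradients with respect to it can flow through the sample, which is the property the reparameterization trick relies on.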
Note that the KL divergence between $\mathrm{vMF}(\mu, \kappa)$ and $\mathrm{vMF}(\cdot, 0)$ does not depend on $\mu$, thus the KL divergence is a constant if $\kappa$ is fixed. Intuitively, this is because the hyperspherical uniform distribution has equal probability density at every point on the unit hypersphere, and $D_{KL}(\mathrm{vMF}(\mu, \kappa) \,\|\, \mathrm{vMF}(\cdot, 0))$ should not change under rotations.

B Derivation of Lower Bounds
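Let $q_\phi(z_s|\cdot)$ be a distribution that has the same support as $p(z_s)$. Then the KL divergence between $q_\phi(z_s|\cdot)$ and $p_\theta(z_s | x_t, y)$ can be written as

$$D_{KL}(q_\phi(z_s|\cdot) \,\|\, p_\theta(z_s|x_t, y)) = \mathbb{E}_{q_\phi(z_s|\cdot)}[\log q_\phi(z_s|\cdot) - \log p_\theta(z_s|x_t, y)] = D_{KL}(q_\phi(z_s|\cdot) \,\|\, p(z_s)) - \mathbb{E}_{q_\phi(z_s|\cdot)}[\log p_\theta(x_t|y, z_s)] - \log p(y) + \log p_\theta(x_t, y).$$

Since the KL divergence is non-negative, rearranging the terms gives the lower bound of Eq. 2.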

C Implementation Details
We used the PyTorch and AllenNLP libraries for implementation. The default weight initialization scheme of the AllenNLP library is used unless explicitly stated. For all CS-LVM experiments, the size of word embeddings and hidden dimensions of LSTMs are set to 300, and the size of label embeddings is 50. $g^{code}$ is implemented as a linear projection of the last hidden state of the encoder LSTM followed by normalization. $g^{out}$ is a linear projection followed by the softmax function, and we reuse the word embeddings as its weight matrix (Press and Wolf, 2017; Inan et al., 2017). The discriminative classifier is a feed-forward network with a single hidden layer and the ReLU activation function, and the hidden dimension is set to 1200. We apply dropout on word embeddings and the classifier with probabilities $p_w$ and $p_c$ respectively.
When multiple semantic constraints are used, to keep the uncertainty weights always positive and to optimize them stably, we instead use $\log \sigma_i^2$ as the model parameter, as in Kendall et al. (2018). Each $\log \sigma_i^2$ is initialized with zero. The temperature parameter of the Gumbel-Softmax is linearly annealed during training, where $r = 10^{-4}$ is the annealing rate and $t$ is the training step.
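The benefit of parameterizing the uncertainty weights through their log-variance can be sketched as follows. The combination rule below follows the general form of Kendall et al. (2018) for homoscedastic uncertainty, with each term scaled by the inverse variance plus a log-variance penalty; the exact form used for the constraints here is an assumption, and the name `combine_constraints` is hypothetical.

```python
import math

def combine_constraints(values, log_vars):
    """Weighted sum  sum_i 0.5 * exp(-s_i) * R_i + 0.5 * s_i,  with s_i = log sigma_i^2.

    exp(-s_i) is positive for any real s_i, so the weights stay positive
    without constraining the optimizer; the + 0.5 * s_i term discourages
    the trivial solution of inflating every variance.
    """
    return sum(0.5 * math.exp(-s) * r + 0.5 * s
               for r, s in zip(values, log_vars))

# With log-variances initialized to zero, every constraint gets the same weight.
initial = combine_constraints([2.0, 4.0], [0.0, 0.0])
# Raising one log-variance down-weights the corresponding constraint.
reweighted = combine_constraints([2.0, 4.0], [0.0, 1.0])
```

Optimizing the unconstrained log-variance instead of the variance itself avoids divisions by values near zero and removes the need for an explicit positivity constraint.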
For the LSTM-vMF-VAE experiments, we used the same hyperparameters and grid search scheme as those of the CS-LVM, except that we perform an additional search on the dimension of the latent code $d \in \{50, 150, 300\}$.
The Adam optimizer (Kingma and Ba, 2015) with learning rate $\gamma = 10^{-3}$ is used for all experiments, except for the 1k QQP experiments where the stochastic gradient descent optimizer is used. When fine-tuning the model, we set $\gamma$ to $10^{-4}$. For other hyperparameters, we follow the configuration suggested by the authors. The best hyperparameter configurations found for the SNLI and QQP datasets are presented in Tables 6 and 7.

D Generated Examples
We used beam search with B = 10 when generating sentences, and the length normalization (Wu et al., 2016) is applied with α = 0.7.
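The effect of length normalization on beam search scores can be sketched with the length penalty of Wu et al. (2016), in which the log-probability of a hypothesis is divided by $((5 + |Y|) / 6)^{\alpha}$. The sketch below is only an illustration: the coverage penalty from the same work is omitted, the example scores are made up, and the function names are hypothetical.

```python
def length_penalty(length, alpha=0.7):
    """Length penalty ((5 + length) / 6) ** alpha; equals 1 for length 1."""
    return ((5.0 + length) / 6.0) ** alpha

def normalized_score(log_prob, length, alpha=0.7):
    """Log-probability of a hypothesis divided by its length penalty."""
    return log_prob / length_penalty(length, alpha)

# Hypothetical beam candidates: (total log-probability, number of tokens).
short_hyp = (-4.0, 4)
long_hyp = (-5.0, 9)
short_score = normalized_score(*short_hyp)
long_score = normalized_score(*long_hyp)
```

Without normalization the shorter hypothesis wins because it accumulates fewer negative log-probabilities, whereas after normalization the longer hypothesis in this example receives the higher score.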
Examples are presented in Tables 8-11. Though almost all generated hypotheses are realistic, we see that they lack diversity and fail to encode label information in some cases. For example, the phrase 'is/are sleeping' appears frequently in generated sentences when conditioned on the 'contradiction' label, likely because generating a set of simple patterns could be a shortcut to the objective. In Table 5, we verified from experiments that adding constraints helps enhance accuracy and diversity; however, the model still favors generating 'easy' sentences. We conjecture that the problem has its root in the fact that the primary objective of our model is to correctly classify the input, not to generate diverse outputs.

Figure 2: Illustration of the graphical models. (a) The generative process of the output $x_t$; (b) the approximate inference of $z_s$ and the discriminative classifier for $y$.
Premise: …
Entailment: kids are playing in water .
Neutral: the kids are having fun .
Contradiction: the kids are sleeping .

Premise: blurry people walking in the city at night .
Entailment: people are walking .
Neutral: the people are walking to work .
Contradiction: the people are inside .

Premise: a woman sits in a chair under a tree and plays an acoustic guitar .
Entailment: a woman is playing an instrument .
Neutral: the woman is a musician .
Contradiction: a woman is playing the drums .

Premise: three men converse in a crowd .
Entailment: three men are talking .
Neutral: three men are talking about politics .
Contradiction: the men are sleeping .

Premise: a woman in a yellow shirt seated at a table .
Entailment: a woman is sitting .
Neutral: a woman is sitting at a table .
Contradiction: the woman is standing

Premise: a woman hugs a fluffy white dog .
Entailment: a woman is holding a dog .
Neutral: a woman is playing with her dog .
Contradiction: a woman is sleeping .

Premise: a crowd of people in colorful dresses .
Entailment: people are wearing costumes .
Neutral: the people are in a parade .
Contradiction: the people are sitting down .

Premise: a clown making a balloon animal for a pretty lady .
Entailment: a clown performs .
Neutral: the clown is a clown .
Contradiction: the clown is sleeping .

Table 4: Selected samples generated from the model trained on the SNLI dataset.

Table 8: Sentences generated from the CS-LVM model trained on the SNLI dataset. Failure cases are denoted by strikethrough text.

Table 9: Sentences generated from the CS-LVM + $R_y$ model trained on the SNLI dataset. Note that failed examples in Table 8 are corrected due to the use of $R_y$.

Table 10: Sentences generated from the CS-LVM + $R_z$ model trained on the SNLI dataset. Failure cases are denoted by strikethrough text.

Table 11: Sentences generated from the CS-LVM + $R_\mu$ model trained on the SNLI dataset. Failure cases are denoted by strikethrough text.