Learning Implicit Text Generation via Feature Matching

Generative feature matching network (GFMN) is an approach for training implicit generative models for images by performing moment matching on features from pre-trained neural networks. In this paper, we present new GFMN formulations that are effective for sequential data. Our experimental results show the effectiveness of the proposed method, SeqGFMN, for three distinct generation tasks in English: unconditional text generation, class-conditional text generation, and unsupervised text style transfer. SeqGFMN is stable to train and outperforms various adversarial approaches for text generation and text style transfer.


Introduction
Generative feature matching networks (GFMNs) (dos Santos et al., 2019) were recently proposed for learning implicit generative models by performing moment matching on features from pre-trained neural networks. This approach demonstrated that GFMN can produce state-of-the-art image generators while avoiding the instabilities associated with adversarial learning. As with training generative adversarial networks (GANs) (Goodfellow et al., 2014), GFMN training requires backpropagating through the generated data to update the model parameters. This backpropagation through generated data, combined with adversarial learning instabilities, has proven to be a significant challenge when applying GANs to discrete data such as text. However, it remains unknown whether this is also an issue for feature matching networks, since the effectiveness of GFMN for sequential discrete data has not yet been studied.
In this work, we investigate the effectiveness of GFMN for different text generation tasks. As a first contribution, we propose a new formulation of GFMN for unconditional sequence generation, which we name Sequence-GFMN, or SeqGFMN for short, that performs token-level feature matching. SeqGFMN training is stable because it does not concurrently train a discriminator, which in principle could easily learn to distinguish between one-hot and soft one-hot representations. As a result, we can use the soft one-hot representations that the generator outputs during training without resorting to the Gumbel softmax or the REINFORCE algorithm, as is needed in GANs for text. Additionally, unlike GANs (Zhu et al., 2018), SeqGFMN can produce meaningful text without pre-training the generator with maximum likelihood estimation (MLE). We perform experiments using Bidirectional Encoder Representations from Transformers (BERT), GloVe, and FastText as our feature extractor networks. We use two different corpora, and assess both the quality and diversity of the generated texts with three different quantitative metrics: BLEU, Self-BLEU, and Fréchet InferSent Distance (FID). Additionally, we show that the latent space induced by SeqGFMN contains semantic and syntactic structure, as evidenced by interpolations in the z space.
Our second contribution consists in proposing a new strategy for class-conditional generation with GFMN. The key idea here is to perform class-wise feature matching. We apply SeqGFMN to perform sentiment-based conditional generation using the Yelp Reviews dataset, and assess its performance using classification accuracy, BLEU, and Self-BLEU.
Finally, as a third contribution, we demonstrate that the feature matching loss is an effective approach to perform distribution matching in the context of unsupervised text style transfer (UTST). Most previous work on UTST adapts the autoencoder framework by adding an additional loss term: adversarial loss or back-translation loss. Our method consists in replacing the adversarial and back-translation loss with style-wise feature matching. Our experimental results indicate that the feature matching loss produces better results than the traditionally used losses.
Feature Matching Nets for Text

SeqGFMN
Let $G$ be a sequence generator implemented as a neural network with parameters $\theta$, and let $E$ be a pre-trained NLP feature extractor network with $L$ hidden layers that produces token-level features for each token in a sequence of length $T$. The method consists of training $G$ by minimizing the following token-level feature matching loss function:

$$\min_\theta \sum_{j=1}^{M} \sum_{t=1}^{T} \left\| \mu_{j,t}^{p_{data}} - \mu_{j,t}^{p_G} \right\|_2^2 + \left\| \sigma_{j,t}^{2,\,p_{data}} - \sigma_{j,t}^{2,\,p_G} \right\|_2^2 \qquad (1)$$

where $\mu_{j,t}^{p_{data}} = \mathbb{E}_{x \sim p_{data}}\, E_{j,t}(x)$ and $\mu_{j,t}^{p_G} = \mathbb{E}_{z \sim \mathcal{N}(0, I_{n_z})}\, E_{j,t}(G(z))$; $\|\cdot\|_2$ is the $L_2$ loss; $x$ is a real data point sampled from the data distribution $p_{data}$; $z \in \mathbb{R}^{n_z}$ is a noise vector sampled from the normal distribution $\mathcal{N}(0, I_{n_z})$; $E_{j,t}(x)$ denotes the feature map for token $t$ at hidden layer $j$ of $E$; $M \leq L$ is the number of hidden layers used to perform feature matching; $T$ is the maximum sequence length; and $\sigma_{j,t}^{2,\,p_{data}}$ and $\sigma_{j,t}^{2,\,p_G}$ are the variances of the features for real and generated data, respectively. Note that this loss function is quite different from both the MLE loss used in regular language models and the adversarial loss used in GANs.
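To make Eq. 1 concrete, here is a minimal PyTorch sketch of the loss (our illustration, not the authors' code); `fake_feats` is assumed to be a list of per-layer token-level feature tensors produced by E on generated data.

```python
import torch

def feature_matching_loss(real_mu, real_var, fake_feats):
    """real_mu / real_var: lists of (T, d_j) tensors precomputed on the
    whole training set; fake_feats: list of (batch, T, d_j) tensors
    extracted from E(G(z)) for the current minibatch."""
    loss = torch.tensor(0.0)
    for j, feats in enumerate(fake_feats):
        fake_mu = feats.mean(dim=0)                    # per-token feature means
        fake_var = feats.var(dim=0, unbiased=False)    # diagonal covariance only
        loss = loss + ((real_mu[j] - fake_mu) ** 2).sum()    # mean matching
        loss = loss + ((real_var[j] - fake_var) ** 2).sum()  # variance matching
    return loss
```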
In order to train $G$, we first precompute $\mu_{j,t}^{p_{data}}$ and $\sigma_{j,t}^{p_{data}}$ on the entire training data. During training, we generate a minibatch of fake data by passing the Gaussian noise vector through the generator. The fixed feature extractor $E$ is used to extract features from the output of the generator at a per-token level. The loss is then computed as in Eq. 1, and the parameters $\theta$ of the generator $G$ are optimized using stochastic gradient descent. Note that the network $E$ is used for feature extraction only and is kept fixed during the training of $G$. Similar to (dos Santos et al., 2019), we use Adam moving averages, which allow us to use small minibatch sizes. Fig. 1 illustrates SeqGFMN training; the figure shows mean matching only for brevity, but in practice we match both means and diagonal covariances.
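A self-contained toy version of this training loop, reusing `feature_matching_loss` from the sketch above; the linear generator and embedding-only "extractor" are stand-ins for the real deconvolutional G and frozen BERT, and we omit the Adam moving averages (AMA) applied in the actual method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_z, T, V, d, batch = 100, 16, 1000, 64, 128
G = nn.Sequential(nn.Linear(n_z, T * V), nn.Unflatten(1, (T, V)))  # toy generator
E_emb = nn.Embedding(V, d)                  # stand-in for the fixed extractor E
E_emb.weight.requires_grad_(False)          # E is never updated

def extractor_feats(soft_onehot):           # (batch, T, V) -> [(batch, T, d)]
    return [soft_onehot @ E_emb.weight]     # soft embeddings as the only "layer"

# Precompute per-token feature statistics on (toy) real data.
real_tokens = torch.randint(0, V, (5000, T))
real_feats = E_emb(real_tokens)             # (5000, T, d)
real_mu = [real_feats.mean(0)]
real_var = [real_feats.var(0, unbiased=False)]

opt = torch.optim.Adam(G.parameters(), lr=1e-4)
for step in range(1000):
    z = torch.randn(batch, n_z)             # z ~ N(0, I_nz)
    fake = F.softmax(G(z), dim=-1)          # soft one-hot sequences
    loss = feature_matching_loss(real_mu, real_var, extractor_feats(fake))
    opt.zero_grad()
    loss.backward()                         # backprop through the generated data
    opt.step()
```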
In our SeqGFMN framework, the output of the generator $G$ is a sequence $\hat{x}$ of soft one-hot representations, $\{\hat{w}_1, \hat{w}_2, \ldots, \hat{w}_T\}$, where each element $\hat{w}_i$ is the output of the softmax function at token position $i$. In the feature extractor $E$, these soft one-hot representations are multiplied by an embedding matrix to produce soft embeddings, which are then fed to the subsequent layers of $E$.
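The difference between an ordinary embedding lookup and a soft one is just a matrix product, as in this small illustration:

```python
import torch
import torch.nn as nn

V, d = 1000, 64
emb = nn.Embedding(V, d)
hard = emb(torch.tensor([[3, 7]]))                # usual lookup: token ids -> embeddings
probs = torch.softmax(torch.randn(1, 2, V), -1)   # generator's soft one-hot output
soft = probs @ emb.weight                         # soft embeddings: convex combination of
                                                  # embedding rows, differentiable in probs
```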

Class-Conditional SeqGFMN
Conditional generation is motivated by the assumption that if the training data can be clustered into distinct and meaningful classes, knowledge of these classes at training time should improve the overall performance of the model. For class-based text generation, some datasets provide this opportunity by labeling the training data with relevant classes (e.g., positive/negative sentiment in the Yelp Reviews dataset), information that our model can leverage to condition the generation.
For this to be effective, the features used by SeqGFMN need to be sufficiently representative of the generated text while still differing between classes. To account for the knowledge of latent classes, we extend the loss from Eq. 1 to the case of two distinct classes:

$$\min_\theta \sum_{c} \sum_{j=1}^{M} \sum_{t=1}^{T} \left\| \mu_{j,t,c}^{p_{data}} - \mu_{j,t,c}^{p_G} \right\|_2^2 + \left\| \sigma_{j,t,c}^{2,\,p_{data}} - \sigma_{j,t,c}^{2,\,p_G} \right\|_2^2 \qquad (2)$$

which follows the same definitions of means and variances as Eq. 1, except that they are now class-dependent. Given a class $c$, we allow for conditional generation by conditioning the noise vector $z$ on $c$. Indeed, if $z \sim \mathcal{N}(0, I_{n_z})$, applying a class-dependent linear transformation $z_c = A_c z + b_c$ changes the noise distribution such that $z_c \sim \mathcal{N}(b_c, A_c A_c^\top)$. $A_c$ and $b_c$ are learned at training time so as to minimize our loss. This enables the model to effectively sample a new input noise from distinct distributions, conditioned on the class $c$. Since the model can update the linear transformation parameters $A_c$ and $b_c$ to minimize its loss, it can learn transformations that naturally separate or disentangle the different classes $c$. For example, conditioning on sentiment, where $c = 0$ is the negative sentiment class and $c = 1$ the positive class, amounts simply to learning two transformations $(A_0, b_0)$ and $(A_1, b_1)$. This approach can be extended beyond linear transformations to allow deep neural networks to be employed.

Figure 1: For each training iteration, the generator $G$ outputs $N$ sentences from noise signals $z_1 \cdots z_N$. A fixed feature extractor is used to extract token-level features $E_{j,t}$ for the generated data. $\mathcal{L}$ is the $L_2$ norm of the difference between the extracted feature means of the generated data and those of the real data, $\mu_{j,t}^{p_{data}}$, which is then backpropagated to update the parameters of $G$. The same strategy is used for the variance terms in $\mathcal{L}$ (omitted here for brevity).

During training, a minibatch is composed of input noise samples conditioned on class $c$. Within our generator, we use conditional batch normalization (condBN) from (Dumoulin et al., 2016). Conditional BN is a two-stage process: first, we perform a standard BN of a minibatch regardless of $c$, $y_i = BN_{\gamma,\beta}(x_i)$, using the notation of (Ioffe and Szegedy, 2015). Then $y_i$ enters a second stage, $w_i = \gamma_c y_i + \beta_c$, which brings in the dependency on class $c$, as proposed in (Dumoulin et al., 2016). This allows the influence of class conditioning to carry over the whole model wherever conditional BN is used. Our models can have three distinct configurations: conditional input noise, conditional BN, or both conditional input noise and conditional BN.
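The two conditioning mechanisms can be sketched as follows (module names and initializations are ours; the paper does not prescribe them):

```python
import torch
import torch.nn as nn

class ConditionalNoise(nn.Module):
    """z_c = A_c z + b_c: one learned affine map per class."""
    def __init__(self, n_classes, n_z):
        super().__init__()
        self.A = nn.Parameter(torch.eye(n_z).repeat(n_classes, 1, 1))
        self.b = nn.Parameter(torch.zeros(n_classes, n_z))

    def forward(self, z, c):                  # z: (batch, n_z), c: (batch,) long
        return torch.einsum('bij,bj->bi', self.A[c], z) + self.b[c]

class ConditionalBatchNorm1d(nn.Module):
    """Stage 1: standard BN without affine terms; stage 2: class-specific
    gamma_c, beta_c, as in (Dumoulin et al., 2016)."""
    def __init__(self, n_classes, n_features):
        super().__init__()
        self.bn = nn.BatchNorm1d(n_features, affine=False)
        self.gamma = nn.Parameter(torch.ones(n_classes, n_features))
        self.beta = nn.Parameter(torch.zeros(n_classes, n_features))

    def forward(self, x, c):                  # x: (batch, n_features)
        y = self.bn(x)                        # class-agnostic normalization
        return self.gamma[c] * y + self.beta[c]
```

Initializing $A_c$ to the identity and $b_c$ to zero makes every class start from the unconditional prior $\mathcal{N}(0, I)$, so the class-specific transformations only drift apart as far as the loss requires.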

Unsupervised Text Style Transfer (UTST) with SeqGFMN
Text style transfer consists of rewriting a sentence from a given style $s_i$ (e.g., informal) into a different style $s_j$ (e.g., formal) while maintaining the content and keeping the sentence fluent. The major challenge for this task is the lack of parallel data, and many recent approaches adapt the encoder-decoder framework to work with non-parallel data (Shen et al., 2017; Fu et al., 2018). This adaptation normally consists of using: (1) a reconstruction loss in an autoencoding fashion, which is intended to learn a conditional language model (decoder $D$) while providing content preservation; together with (2) a classification loss produced by a style classifier $C$, which is intended to guarantee the correct transfer. Balancing these two losses while generating good-quality sentences is difficult, and several approaches, such as adversarial discriminators (Shen et al., 2017) and cycle-consistency losses (Melnyk et al., 2017), have been employed in recent work. Here, we use feature matching as a way to alleviate this problem. Essentially, our unsupervised text style transfer approach is an encoder-decoder trained with the following three losses.

Reconstruction loss: Given an input sentence $x_{s_i}$ from set $X$ and its decoded sentence $\hat{x}_{s_i}$, the reconstruction loss measures how well the decoder $D$ is able to reconstruct it:

$$\mathcal{L}_{rec} = \mathbb{E}_{x_{s_i} \sim X}\left[ -\log p_D\!\left(x_{s_i} \mid E(x_{s_i}), s_i\right) \right]$$

Classification loss: This loss is formulated as:

$$\mathcal{L}_{class} = \mathbb{E}_{x_{s_i} \sim X}\left[ -\log p_C(s_i \mid x_{s_i}) \right] + \mathbb{E}_{\hat{x}_{s_j} \sim \hat{X}}\left[ -\log p_C(s_j \mid \hat{x}_{s_j}) \right]$$

where $\hat{X}$ is the set of style-transferred sentences generated by the current model. For the classifier, the first term provides a supervised signal for style classification, and the second term gives an additional training signal from the transferred data, enabling the classifier to be trained in a semi-supervised regime. For the encoder-decoder, the second term gives feedback on the current generator's effectiveness at transferring sentences to a different style.

Feature matching loss: This loss is computed in a similar way as the class-conditional loss (Eq. 2), matching the statistics of the features for each style separately. This means that, when transferring from style $s_i$ to $s_j$, we match the features of the resulting sentence with the features of real data from the target style $s_j$.
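A hedged sketch of how the three losses could be combined in one training step; `enc`, `dec`, `clf`, and the assumption that `clf` accepts both token ids and soft one-hot inputs are placeholders, not the paper's exact modules:

```python
import torch.nn.functional as F

def utst_losses(x, s_i, s_j, enc, dec, clf, real_stats_j):
    """x: (batch, T) token ids in style s_i; s_i, s_j: (batch,) style labels."""
    h = enc(x)                                       # content representation
    recon_logits = dec(h, s_i)                       # (batch, T, V)
    l_rec = F.cross_entropy(recon_logits.transpose(1, 2), x)  # reconstruction

    x_hat = dec(h, s_j).softmax(-1)                  # soft transfer to style s_j
    l_cls = F.cross_entropy(clf(x), s_i) + F.cross_entropy(clf(x_hat), s_j)

    # Style-wise feature matching against precomputed statistics of real
    # target-style data (feature_matching_loss / extractor_feats as above).
    l_fm = feature_matching_loss(*real_stats_j, extractor_feats(x_hat))
    return l_rec + l_cls + l_fm
```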

Related Work

TextGAN (Zhang et al., 2017a) also uses a feature matching loss, but it matches global sentence-level features of a jointly trained adversarial discriminator, which is different from our setup, as our discriminator is not learned and our feature matching is per token and not on a global sentence level. Sequence GAN (SeqGAN) (Yu et al., 2017), MaliGAN (Che et al., 2017), and RankGAN (Lin et al., 2017) pre-train the generator with an MLE loss and use a per-token reward discriminator trained with reinforcement learning. SeqGFMN is similar to SeqGAN in the sense that it has a per-token reward (the per-token feature matching loss). Still, it alleviates the need for pre-training the generator and the cumbersome training of a discriminator by relying on a fixed, state-of-the-art text feature extractor such as BERT.
For unsupervised text style transfer, different adaptations of the encoder-decoder framework have been proposed recently. (Shen et al., 2017) and (Fu et al., 2018) use adversarial classifiers to decode into a different style/language. (Melnyk et al., 2017) and (Nogueira dos Santos et al., 2018) proposed methods that combine a collaborative classifier with a back-transfer loss. (Prabhumoye et al., 2018) presented an approach that trains different encoders, one per style, by combining the encoder of a pre-trained NMT model with style classifiers. The main difference between our approach and these previous works is that we use the feature matching loss to perform distribution matching.

Experiments and Results
Datasets: We evaluate our proposed approach on three different English datasets: MSCOCO (Lin et al., 2014), the EMNLP 2017 WMT News dataset (Bojar et al., 2017), and the Yelp Reviews dataset (Shen et al., 2017). The COCO and WMT News datasets are used for unconditional models, while Yelp Reviews is employed to evaluate class-conditional generation and unsupervised text style transfer.

Feature Extractors for Textual Data: We experiment with different feature extractors that generate token-level representations. We use word embeddings from GloVe (Pennington et al., 2014) and FastText (Bojanowski et al., 2017) as representatives of shallow (cheap-to-train) architectures. As a representative of large, deep feature extractors, we use BERT (Devlin et al., 2018). Devlin et al. (2018) demonstrated that the features extracted by BERT can boost the performance of diverse NLP tasks. Our hypothesis is that BERT features are informative enough to allow the training of (cross-domain) text generators with the help of feature matching.

Metrics: To evaluate the diversity and quality of texts from the unconditional generators, we use three metrics: BLEU (Papineni et al., 2002), Self-BLEU (Zhu et al., 2018), and Fréchet InferSent Distance (FID) (Heusel et al., 2017). Additionally, for class-conditional generation and unsupervised text style transfer, we report accuracy scores from a CNN sentiment classifier trained on Yelp.
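As an illustration of the diversity metric, Self-BLEU treats each generated sentence as a hypothesis and all other generated sentences as references (our sketch using NLTK, not the authors' evaluation code):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(samples, n=3):
    """samples: list of tokenized sentences. Lower Self-BLEU = more diversity."""
    smooth = SmoothingFunction().method1
    weights = tuple(1.0 / n for _ in range(n))   # uniform up-to-n-gram weights
    scores = []
    for i, hyp in enumerate(samples):
        refs = samples[:i] + samples[i + 1:]      # all other samples as references
        scores.append(sentence_bleu(refs, hyp, weights=weights,
                                    smoothing_function=smooth))
    return sum(scores) / len(scores)
```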

Experimental Results
Unconditional Text Generation: In Tab. 1, we show quantitative results for SeqGFMN trained on COCO and WMT News using different feature extractors. As expected, BERT as a feature extractor gives better performance because its features are richer and more informative.
We also present a comparison with other implicit generative models for text generation from scratch. We compare SeqGFMN with five different GAN approaches: SeqGAN (Yu et al., 2017), MaliGAN (Che et al., 2017), RankGAN (Lin et al., 2017), TextGAN (Zhang et al., 2017a), and RelGAN (Nie et al., 2019). We do not use generator pre-training for any of the models. As reported in Tab. 1, SeqGFMN outperforms all GAN models in terms of BLEU and FID. The combination of low BLEU and low Self-BLEU for the different GANs indicates that the learned models generate random n-grams that do not appear in the test set. All GANs fail to learn reasonable models due to the challenges of learning a discrete-data generator from scratch under the min-max game, whereas SeqGFMN can learn suitable generators without generator pre-training.

Class-conditional Generation: Conditional generation experiments were conducted on the Yelp Reviews dataset with sentiment labels (178K negative, 268K positive). For this experiment, we first pre-trained the generator using a conditional denoising autoencoder in which class labels are provided only to the decoder D. The architecture of the encoder is the same as in (Zhang et al., 2017b), with three strided convolutional layers. Once pre-trained, D is used as initialization for our generator G, and training then proceeds as in the unconditional case. Tab. 2 presents results for our regular model (baseline) and the three conditional generators: Cond. Noise, Cond. Batch Normalization (BN), and Cond. Noise+BN. We use 10K generated sentences per sentiment class to compute classification accuracy. In terms of accuracy and BLEU-3 score, the Cond. Noise+BN model provides the best generator, as it is able to capture and leverage the class information.

Unsupervised Text Style Transfer (UTST): In Table 3, we report BLEU and accuracy scores for SeqGFMN and six baselines: BackTranslation (Prabhumoye et al., 2018), which uses a back-transfer loss; CrossAligned (Shen et al., 2017), MultiDecoder (Fu et al., 2018), and StyleEmbedding (Fu et al., 2018), which use an adversarial loss; and TemplateBased (Li et al., 2018) and Del-Retrieval (Li et al., 2018), which use rule-based methods. The BLEU score is computed between the transferred sentences and the human-annotated transferred references, similar to (Li et al., 2018), and the accuracy is based on our pre-trained classifier. Compared to the other models, SeqGFMN produces the best balance between BLEU and accuracy. Additionally, when we use the back-transfer loss together with the feature matching loss (SeqGFMN + BT), our model achieves a significant improvement on both metrics.

Conclusion
We presented new implicit generative models based on a feature matching loss that are suitable for unconditional and conditional text generation. Our results demonstrate that backpropagating through discrete data is not an issue when training by matching distributions at the token level: SeqGFMN can be trained from scratch without the need for RL or the Gumbel softmax. This approach allowed us to create effective models for unconditional generation, class-conditional generation, and unsupervised text style transfer. We believe this work opens a new competitive avenue in the area of implicit generative models for sequential data.

Appendices

A Experimental Setup
SeqGFMN Generator: We use a deconvolutional generator that extends the decoder architecture proposed in (Zhang et al., 2017b). It consists of three strided deconvolutional layers followed by a cosine similarity between the generated token embeddings and an embedding matrix. Our adaptations are as follows: (1) we add two convolutional layers after the second deconvolution; (2) we add a self-attention layer before the last deconvolutional layer; (3) we add a convolutional layer after the last deconvolutional layer; (4) after the final convolution, we multiply the resulting token embeddings by the embedding matrix and apply the softmax function to generate a probability distribution over the vocabulary. We use the embedding matrix from the BERT model, and this matrix is not updated during the training of SeqGFMN. The number of convolutional filters is 400, with a kernel size of 5.

SeqGFMN Training: SeqGFMNs are trained with the Adam optimizer, for which most hyperparameters are kept fixed across datasets. We use $n_z = 100$ and a minibatch size of 128. We use learning rates of $10^{-4}$ for updating G and $10^{-3}$ for the Adam Moving Averages (AMA).
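The description above can be turned into a rough PyTorch sketch; apart from the stated 400 filters and kernel size 5, the strides, seed length, activation choices, and attention configuration are our guesses, not the paper's specification:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeqGFMNGenerator(nn.Module):
    def __init__(self, emb_matrix, n_z=100, n_filters=400):
        super().__init__()
        d_emb = emb_matrix.size(1)
        self.fc = nn.Linear(n_z, n_filters * 4)         # seed of length 4 (assumed)
        self.deconv1 = nn.ConvTranspose1d(n_filters, n_filters, 5, stride=2)
        self.deconv2 = nn.ConvTranspose1d(n_filters, n_filters, 5, stride=2)
        self.conv_a = nn.Conv1d(n_filters, n_filters, 5, padding=2)  # adaptation (1)
        self.conv_b = nn.Conv1d(n_filters, n_filters, 5, padding=2)
        self.attn = nn.MultiheadAttention(n_filters, num_heads=4,    # adaptation (2)
                                          batch_first=True)
        self.deconv3 = nn.ConvTranspose1d(n_filters, n_filters, 5, stride=2)
        self.conv_out = nn.Conv1d(n_filters, d_emb, 5, padding=2)    # adaptation (3)
        self.register_buffer('emb', emb_matrix)         # frozen (V, d_emb) matrix

    def forward(self, z):
        h = torch.relu(self.fc(z)).view(z.size(0), -1, 4)
        h = torch.relu(self.deconv1(h))
        h = torch.relu(self.deconv2(h))
        h = torch.relu(self.conv_b(torch.relu(self.conv_a(h))))
        s = h.transpose(1, 2)                           # (batch, T', C) for attention
        a, _ = self.attn(s, s, s)
        h = torch.relu(self.deconv3(a.transpose(1, 2)))
        tok = self.conv_out(h).transpose(1, 2)          # generated token embeddings
        # adaptation (4): cosine similarity vs. the embedding matrix, then softmax
        logits = F.normalize(tok, dim=-1) @ F.normalize(self.emb, dim=-1).t()
        return logits.softmax(-1)                       # soft one-hot distributions

G = SeqGFMNGenerator(emb_matrix=torch.randn(30522, 768))  # e.g., BERT-sized vocabulary
```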

B Unconditional Text Generation
An interesting comparison would be between SeqGFMN and GANs that use BERT as a pre-trained discriminator. However, GANs fail to train when a very deep network is used as the discriminator. Moreover, SeqGFMN also outperforms GAN generators even when shallow word embeddings (GloVe / FastText) are used to perform feature matching; pre-trained word embeddings are commonly used in GANs for text. In Tab. 4, we present randomly selected samples generated by SeqGFMN and RelGAN. These samples corroborate the quantitative results and show that SeqGFMN can generate good text when trained from scratch, while the state-of-the-art method RelGAN is unable to generate reasonable text without pre-training.

C Class-Conditional Generation
In Tab. 5, we present cherry-picked examples of generated text. Interestingly, since our input noise z is transformed according to the sentiment c, we implicitly have a pairing between z_0 and z_1: texts generated from z_0 and z_1 are related to the same z. The effect of this implicit pairing can be seen in the examples, where sentences seem somehow related but of opposite sentiment. Qualitatively, conditional SeqGFMN models can leverage class information to improve generation.
In Table 6, we present samples of original and sentiment-transferred sentences. For each original sentence, we show the reference transferred sentence from the test set (produced by a human) and the sentence transferred by SeqGFMN. Similar to other recently proposed UTST methods, the most successful cases of sentiment transfer are those where the transfer can be done by removing and replacing a few words of the sentence. In Table 6, the last example of each block is a case where SeqGFMN does not do a good job because significant changes to the original sentence are required for a more fluent sentiment transfer.

D Unsupervised Text Style Transfer
The baselines are calculated with the data collected by Luo et al. (2019) and using unsupervised NMT methods (Zhang et al., 2018).

E Interpolation
We interpolate in the latent space z of SeqGFMN and check whether the sentences generated along the interpolation are syntactically and/or semantically related. In detail, we sample two vectors z_0 and z_1 from the prior distribution p_z and build intermediate points $z_\lambda = \lambda z_1 + (1 - \lambda) z_0$. In Tab. 7, we show samples from two interpolations, from models trained on the COCO and WMT News datasets. In both cases, we notice that there exists some syntactic and/or semantic relationship between the sentences along the interpolating path. This is supporting evidence that the latent space induced by SeqGFMN is meaningful and that related sentences are close together in this latent space.

Table 7: Interpolation in the latent space z of SeqGFMN models trained on COCO Image Captions and WMT News.

COCO:
- a group of people sleeps in the street
- a group of people standing in the street
- a toy of people warming a street sidewalk
- an automobile car lies on an short parking road
- an automobile car lies on an green parking road
- an automobile car lies on an green bike field
- the automobile car lies on an green parking field
- the automobile car is on an green parking field

WMT News:
- "although that might do nothing -i admit it-and i've invested time time at work," i tend to say it doesn do nothing.
- "although the odds do it -i get it-and ross hasn always conceded his chance at it," i tend to say our odds are there.
- reportedly upon the call to court, i get it, while romney has promised that his ban did nothing but say voters had better announce...
- reportedly upon the call at court and i get it, while voters didn ##rem realize the ban was there.
- the said pledge would take on one another day, sexually claiming to top the worst in your period at the academy.
- the us has to feed two-thirds in one month, typically in the best ##quest best ##gist at the in & in millions in.
- this will cover two-thirds billion trillion in this period, possibly two-thirds -63 0 in one months.
- in addition, regulators selected millions in one years, potentially billions in another decade, possibly the bottom-profile economies ...
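A minimal sketch of this interpolation procedure, where G is a trained generator as sketched above and `decode_fn` (mapping token ids back to strings) is a hypothetical helper:

```python
import torch

n_z = 100
z0, z1 = torch.randn(1, n_z), torch.randn(1, n_z)   # two samples from the prior p_z
for lam in torch.linspace(0, 1, steps=8):
    z_lam = lam * z1 + (1 - lam) * z0               # z_lambda = lambda*z1 + (1-lambda)*z0
    probs = G(z_lam)                                # soft one-hot sequence (1, T, V)
    print(decode_fn(probs.argmax(-1)[0]))           # greedy decode to a sentence
```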