Control, Generate, Augment: A Scalable Framework for Multi-Attribute Text Generation

We introduce CGA, a conditional VAE architecture, to control, generate, and augment text. CGA generates natural English sentences controlling multiple semantic and syntactic attributes by combining adversarial learning with a context-aware loss and a cyclical word-dropout routine. We demonstrate the value of the individual model components in an ablation study. The scalability of our approach is ensured by a single discriminator, regardless of the number of attributes. We show high quality, diversity and attribute control in the generated sentences through a series of automatic and human assessments. As the main application of our work, we test the potential of this new NLG model in a data augmentation scenario. In a downstream NLP task, the sentences generated by our CGA model yield significant improvements over a strong baseline, and a classification performance often comparable to adding the same amount of additional real data.


Introduction
Recently, natural language generation (NLG) has become a prominent research topic in NLP due to its diverse applications, ranging from machine translation (e.g., Sennrich et al. (2016)) to dialogue systems (e.g., Budzianowski and Vulić (2019)). The common goal of these applications using automatic text generation is the augmentation of datasets used for supervised NLP tasks. To this end, one of the key demands of NLG is controlled text generation, more specifically, the ability to systematically control semantic and syntactic aspects of generated text.
Most previous approaches simplify this problem by approximating NLG with the control of one single attribute of the text, such as sentiment or formality (e.g., Li et al. (2018), Fu et al. (2018), and John et al. (2019)). However, the problem of controlled generation typically relies on multiple components such as lexical, syntactic, semantic and stylistic aspects. Therefore, the simultaneous control of multiple attributes becomes vital to generate natural sentences suitable for downstream tasks. Methods such as the ones presented by Hu et al. (2017) and Subramanian et al. (2018) succeed in simultaneously controlling various attributes. However, these methods depend on the transformation of input reference sentences, or do not scale easily to more than two attributes due to architectural complexities, such as the requirement for separate discriminators for each additional attribute.
In light of these challenges, we propose the Control, Generate, Augment framework (CGA), a powerful model to synthesize additional labeled data sampled from a latent space. The accurate multi-attribute control of our approach offers significant performance gains on downstream NLP tasks. We provide the code and all generated English sentences to facilitate future research 1.
The main contribution of this paper is a scalable model which learns to control multiple semantic and syntactic attributes of a sentence. The CGA model requires only a single discriminator for simultaneously controlling multiple attributes. To the best of our knowledge, we are the first to incorporate techniques such as cyclical word-dropout and a context-aware loss, which allow the CGA model to generate natural sentences given a latent representation and an attribute vector, without requiring an input reference sentence during training. We present automatic and human assessments to confirm the multi-attribute control and high quality of the generated sentences. Further, we provide a thorough comparison to previous work.
We use CGA as a natural language generation method for data augmentation, which boosts the performance of downstream tasks. We present data augmentation experiments on various English datasets, where we significantly outperform a strong baseline and achieve a performance often comparable to adding the same amount of additional real data.

Method
We now present our model for controlled text generation. Our model is based on the Sentence-VAE framework by Bowman et al. (2016). However, we modify this model to allow the generation of sentences conditioned not only on the latent code but also on an attribute vector. We achieve this by disentangling the latent code from the attribute vector, similar to the Fader Networks (Lample et al., 2017), originally developed for computer vision tasks. As we will see, this simple adaptation is not sufficient, and we introduce further techniques to improve multi-attribute sentence generation.

Model Architecture
We assume access to a corpus of sentences X = {x_i}_{i=1}^{N} and a set of K categorical attributes of interest. For each sentence x_i, we use an attribute vector a_i to represent these K associated attributes. Example attributes include the sentiment or verb tense of a sentence.
Given a latent representation z, which encodes the context information of the corpus, and an attribute vector a, our goal is to construct an ML model which generates a new sentence x containing the attributes of a.

Sentence Variational Autoencoder
The main component of our model is a Variational Auto-Encoder (Kingma and Welling, 2013). The encoder network E_θenc, parameterized by trainable parameters θ_enc, takes as input a sentence x and defines a probabilistic distribution over the latent code z:

z ∼ q_E(z | x) = E_θenc(x).   (1)

The decoder G_θdec, parameterized by trainable parameters θ_dec, tries to reconstruct the input sentence x from a latent code z and its attribute vector a. We assume that the reconstructed sentence x̂ has the same number of tokens as the input sentence x:

p_G(x̂ | z, a) = Π_{t=1}^{T} p_G(x̂_t | x̂_{<t}, z, a),   (2)

where T is the length of the input sentence and x̂_t is the t-th token. Here we use p_G to denote both the sentence-level probability and the word-level conditional probability.
To train the encoder and decoder, we use the following VAE loss:

L_VAE(θ_enc, θ_dec) = −E_{q_E(z|x)}[ log p_G(x | z, a) ] + KL( q_E(z | x) || p(z) ),   (3)

where p(z) is a standard Gaussian distribution.
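For concreteness, the VAE objective can be sketched with the closed-form Gaussian KL term. This is a minimal pure-Python illustration, not the paper's implementation; `recon_nll` stands in for the reconstruction negative log-likelihood term of Equation 3.

```python
import math

def kl_standard_normal(mu, logvar):
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims.
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))

def vae_loss(recon_nll, mu, logvar):
    # Equation 3: reconstruction negative log-likelihood plus the KL term;
    # recon_nll stands in for -log p_G(x | z, a).
    return recon_nll + kl_standard_normal(mu, logvar)
```

When the posterior equals the prior (mu = 0, logvar = 0), the KL term vanishes and the loss reduces to the reconstruction term alone, which is exactly the posterior-collapse regime discussed next.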
When we try to optimize the loss in Equation 3, the KL term often vanishes. This problem is known in text generation as posterior collapse (Bowman et al., 2016). To mitigate this problem we follow Bowman et al. (2016) and add a weight λ kl to the KL term in Equation 3. At the start of training, we set the weight to zero, so that the model learns to encode as much information in z as possible. Then, as training progresses, we gradually increase this weight, as in the standard KL-annealing technique.
Moreover, the posterior collapse problem occurs partially because, during training, our decoder G_θdec predicts each token conditioned on the previous ground-truth token. We aim to make the model rely more on z. A natural way to achieve this is to weaken the decoder by removing some or all of this conditional information during the training process. Previous works (Bowman et al., 2016; Hu et al., 2017) replace a randomly selected, significant portion of the ground-truth tokens with UNKNOWN tokens. However, this can severely affect the decoder and deteriorate the generative capacity of the model. Therefore, we define a new word-dropout routine, which aims at both accommodating the posterior collapse problem and preserving the decoder capacity. Instead of fixing the word-dropout rate to a large constant value as in Bowman et al. (2016), we use a cyclical word-dropout rate ζ:

ζ(s) = k_min + (1/2) (k_max − k_min) (1 + cos(2π s / τ)),   (4)

where s is the current training iteration, k_max and k_min are fixed constant values we define as upper and lower thresholds, and τ defines the period of the cyclical word-dropout rate schedule (see Suppl. Section A.2).
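One plausible instantiation of such a schedule, together with the word-dropout step itself, can be sketched as follows. The cosine form and the `<unk>` token are assumptions; the paper specifies only the bounds k_min, k_max and the period τ.

```python
import math
import random

def cyclical_word_dropout(s, k_min, k_max, tau):
    # Cosine cycle: starts at k_max (s = 0), reaches k_min at s = tau / 2,
    # and returns to k_max after a full period tau.
    return k_min + 0.5 * (k_max - k_min) * (1.0 + math.cos(2.0 * math.pi * s / tau))

def drop_words(tokens, rate, unk="<unk>", rng=random):
    # Replace each ground-truth token fed to the decoder with UNK at the given rate.
    return [unk if rng.random() < rate else t for t in tokens]
```

At each training iteration, the current rate ζ(s) would be passed to `drop_words` before feeding the ground-truth tokens to the decoder.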
Disentangling Latent Code z and Attribute Vector a To be able to generate sentences given any attribute vector a, we have to disentangle the attribute vector from the latent code. In other words, we seek a z that is attribute-invariant: given two sentences x_1 and x_2 that differ only in their attributes (e.g., two versions of the same review expressing opposite sentiment), they should result in the same latent representation z = E_θenc(x_1) = E_θenc(x_2).
To achieve this, we use a concept from predictability minimization (Schmidhuber, 1992) and adversarial training for domain adaptation (Ganin et al., 2016; Louppe et al., 2017), which was recently applied in the Fader Networks by Lample et al. (2017). We apply adversarial learning directly on the latent code z of the input sentence x. We set up a min-max game and introduce a discriminator D_θdisc(z), which takes as input the latent code and tries to predict the attribute vector a. Specifically, D_θdisc(z) outputs, for each attribute k, a probability distribution p_D^k over all its possible values. To train the discriminator, we optimize the following loss:

L_disc(θ_disc) = −E_x [ Σ_{k=1}^{K} log p_D^k( a_k | E_θenc(x) ) ],   (5)

where a_k is the ground truth of the k-th attribute.
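The per-sentence discriminator objective reduces to a sum of negative log-likelihoods over the K attributes. In this minimal illustration, the probability dictionaries stand in for the discriminator's softmax outputs p_D^k:

```python
import math

def discriminator_loss(attr_probs, attr_labels):
    # attr_probs[k]: dict mapping each value of attribute k to its predicted
    # probability (standing in for the discriminator's softmax output p_D^k).
    # attr_labels[k]: ground-truth value of attribute k.
    # Returns the negative log-likelihood summed over the K attributes.
    return -sum(math.log(attr_probs[k][attr_labels[k]]) for k in range(len(attr_labels)))
```

Because the loss is a sum over attributes, adding a fourth or fifth attribute only adds one term, which is the architectural reason a single discriminator scales with the number of attributes.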
Simultaneously, we hope to learn an encoder and decoder which (1) combined with the attribute vector a, allow the decoder to reconstruct the input sentence x, and (2) do not allow the discriminator to infer the correct attribute vector corresponding to x. We optimize for:

L_adv(θ_enc, θ_dec) = L_VAE − λ_disc · L_disc.   (6)

Context-Aware Loss Equation 6 forces our model to choose which information the latent code z should retain or disregard. However, this approach comes with the risk of deteriorating the quality of the latent code itself. Therefore, inspired by Sanakoyeu et al. (2018), we propose an attribute-aware context loss, which tries to preserve the context information by comparing the sentence latent representation and its back-context representation:

L_ctx(θ_enc, θ_dec) = || E_θenc(x) − E_θenc( G_θdec( E_θenc(x), a ) ) ||²,   (7)

where we use a "stop-gradient" procedure, i.e., we compute the gradient only w.r.t. the first term E_θenc(x), which makes the function in Equation 7 differentiable.
The latent vector z = E θenc (x) can be seen as a contextual representation of the input sentence x. This latent representation is changing during the training process and hence adapts to the attribute vector. Thus, when measuring the similarity between z and the back-context representation E θenc (G θ dec (E θenc (x))), we focus on preserving those aspects which are profoundly relevant for the context representation.
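A minimal sketch of the back-context comparison, assuming a squared L2 distance (the exact distance function is our assumption). In an autograd framework the back-context code would be detached to realize the stop-gradient, so gradients flow only through z:

```python
def context_loss(z, z_back):
    # z: latent code E(x); z_back: back-context code E(G(E(x), a)).
    # Squared L2 distance is an assumed choice of distance.
    # In an autograd framework, z_back would be detached ("stop-gradient"),
    # so gradients flow only through z.
    return sum((a - b) ** 2 for a, b in zip(z, z_back))
```

The loss is zero exactly when encoding the reconstruction recovers the original latent code, i.e., when the context is perfectly preserved.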
Finally, when training the encoder and decoder (given the current discriminator), we optimize for the following loss:

L_CGA(θ_enc, θ_dec) = L_VAE − λ_disc · L_disc + δ · L_ctx.   (8)

Evaluation
To assess our newly proposed model for controlled sentence generation, we perform the evaluations described in this section: an automatic and a human evaluation to analyze the quality of the new sentences with multiple controlled attributes; an examination of sentence embedding similarity to assess the diversity of the generated samples; downstream classification experiments with data augmentation on two different datasets to prove the effectiveness of the new sentences in a pertinent application scenario; and, finally, a comparison of our results to previous work to specifically contrast our model against other single- and multi-attribute models.

Sentence                                                          Attributes
it was a great time to get the best in town and i loved it.       Past / Positive
it was a great time to get the food and it was delicious.         Past / Positive
it is a must!                                                     Present / Positive
they're very reasonable and they are very friendly and helpful.   Present / Positive
i had a groupon and the service was horrible.                     Past / Negative
this place was the worst experience i've ever had.                Past / Negative
it is not worth the money.                                        Present / Negative
there is no excuse to choose this place.                          Present / Negative

[Table 3: Mean attribute matching accuracy of the 30K generated sentences (in %); standard deviation reported in brackets.]
Datasets We conduct all experiments on two datasets, YELP and IMDB reviews. Both contain sentiment labels for the reviews. From the YELP business reviews dataset (YELP, 2014), we use reviews only from the category restaurants, which results in a dataset of approx. 600,000 sentences. The IMDB movie reviews dataset (Maas et al., 2011) contains approx. 150,000 sentences. For reproducibility purposes, details about training splits and vocabulary sizes can be found in the supplementary materials (A.1.1).
Attributes For our experiments we use three attributes: sentiment as a semantic attribute; verb tense and person number as syntactic attributes.
SENTIMENT: We labeled each review as positive or negative following Shen et al. (2017). VERB TENSE: We detect past and present verb tenses using SpaCy's part-of-speech tagging model 2. We define a sentence as present if it contains more present than past verbs. We provide the specific PoS tags used for the labeling in the supplementary materials (A.1.2). PERSON NUMBER: We also use SpaCy to detect singular or plural pronouns and nouns. Consequently, we label a sentence as singular if it contains more singular than plural pronouns or nouns, as plural in the opposite case, and as balanced otherwise.

We train our model to generate sentences of at most 20 tokens while controlling one, two or three attributes simultaneously. The sentences are generated by the decoder as described in Equation 2. We chose to set the maximum sentence length to 20 tokens in this work, since (a) it is considerably more than previous approaches (e.g., Hu et al. (2017) presented a max length of 15 tokens), and (b) it covers more than the 99th percentile of the sentence lengths in the datasets used, which is 16.4 tokens per sentence for YELP and 14.0 for IMDB.

[Table 4: Human evaluation results; columns: Attribute, Sentences, Accuracy (κ).]

Table 2 shows sentences where the model controls three attributes simultaneously. Sentences with single controlled attributes can be found in the supplementary material (A.4).
Experimental Setting The encoder and decoder are single-layer GRUs with a hidden dimension of 256 and a maximum sample length of 20. The discriminator is a single-layer LSTM. To avoid a vanishingly small KL term in the VAE (Bowman et al., 2016), we use a KL term weight annealing that increases from 0 to 1 during training according to a logistic schedule. λ_disc increases linearly from 0 to 20. Finally, we set the back-translation weight δ to 0.5. All hyper-parameters are provided in the supplementary material (A.2).

Quality of Generated Sentences
We quantitatively assess the sentence attribute control of our CGA model by measuring how accurately the generated sentences contain the designated attributes, using both automatic and human evaluations.
Attribute Matching For this automatic evaluation, we generate sentences given the attribute vector a as described in Section 2. To assign SENTIMENT attribute labels to the newly generated sentences, we apply a pre-trained TextCNN (Kim, 2014). To assign the VERB TENSE and PERSON NUMBER labels we use SpaCy's part-of-speech tagging. We calculate the attribute matching accuracy as the percentage of the predictions of these pre-trained models on the generated sentences that match the attribute labels our CGA model was expected to generate. Table 3 shows the averaged results over five balanced sets of 6000 sentences generated by CGA models trained on YELP and IMDB, respectively.
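The attribute matching metric itself reduces to a simple agreement rate between the labels predicted by the pre-trained models and the labels the generator was conditioned on; a sketch (function name is ours):

```python
def attribute_matching_accuracy(predicted, intended):
    # predicted: labels assigned by the pre-trained classifier/tagger
    #            to the generated sentences.
    # intended:  attribute labels the CGA model was conditioned on.
    matches = sum(p == t for p, t in zip(predicted, intended))
    return 100.0 * matches / len(intended)
```

The same function applies to each attribute (SENTIMENT, VERB TENSE, PERSON NUMBER) separately, with a different pre-trained labeler supplying `predicted`.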
Human Evaluation To further understand the quality of the generated sentences we go beyond the automatic attribute evaluation and perform a human judgement analysis. We provide all generated sentences including the human judgements 3 . One of our main contributions is the generation of sentences with up to three controlled attributes. Therefore, we randomly select 120 sentences generated by the CGA model trained on YELP, which controls all three attributes. Two human annotators labeled these sentences by marking which of the attributes are included correctly in the sentence.
In addition to the accuracy, we report inter-annotator agreement rates with Cohen's κ. In 80% of the sentences all three attributes are included correctly, and in 100% of the sentences at least two of the three attributes are present. Finally, the annotators also judged whether the sentences are grammatically correct, complete and coherent English sentences. Most of the incorrect sentences contain repeated words or incomplete endings. The results are shown in Table 4.

Ablation Study
We conduct an ablation study testing the key components of the CGA model on both the YELP and IMDB datasets. We separately trained four different versions of CGA to assess the impact on multi-attribute control of the disjoint and joint usage of the cyclical word-dropout (Equation 4) and of the context-aware loss (Equation 7). We computed the attribute matching score following the same approach described above. As shown in Table 5, both techniques are beneficial for attribute control, especially for SENTIMENT and PERSON NUMBER. When the model is trained using at least one of these techniques, it already shows significant improvements in all cases except for VERB TENSE on the IMDB data. Moreover, when the cyclical word-dropout and the context-aware loss are used jointly during training, the model shows a performance increase of 1-6% w.r.t. the model trained without these techniques.
[Table 5: Ablation study of the key components, reporting attribute matching scores for three features on the YELP and IMDB datasets. L_ADV + standard WD is trained with the word-dropout of Bowman et al. (2016); L_ADV + cyclical WD with our cyclical word-dropout; L_CTX + standard WD with the standard word-dropout and the context-aware loss; L_CTX + cyclical WD with both cyclical word-dropout and context-aware loss.]

Sentence Embedding Similarity Although generative models have been shown to produce outstanding results, in many circumstances they risk producing extremely repetitive examples (e.g., Zhao et al. (2017)). In this experiment, we qualitatively assess the capacity of our model to generate diversified sentences, further strengthening the results obtained in this work. We sample 10K sentences from YELP (D_real) and from our generated sentences (D_gen), respectively, both labeled with the SENTIMENT attribute. We retrieve the sentence embedding for each of the sentences in D_real and D_gen using the Universal Sentence Encoder (Cer et al., 2018). Then, we compute the cosine similarity between the embeddings of all sentences of D_real and, analogously, between the embeddings of our generated sentences D_gen. Consequently, we obtain two similarity matrices M_real and M_gen (see Figure 2). Both matrices show a four-cluster structure: top-left, similarity scores between negative reviews (C_nn); top-right and bottom-left, similarity scores between negative and positive reviews (C_np); and bottom-right, similarity scores between positive reviews (C_pp).
Further, for each sample of D_real and D_gen we compute a similarity score as follows:

score(s_i) = (1/k) Σ_{s_j ∈ N_{k,c}(s_i)} cos(s_i, s_j),

where c ∈ {C_nn, C_np, C_pp}, s_i is the i-th sample of D_real or D_gen, and c is the cluster to which s_i belongs. N_{k,c}(s_i) is the set of the k most similar neighbours of s_i in cluster c, and k = 50. To gain a qualitative understanding of the generation capacities of our model, we assume that an ideal generative model should produce samples that have similarity scores comparable to those of the real data. Figure 3 contrasts the similarity scores of D_real and D_gen, computed on each cluster separately.
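The per-sample score can be sketched as the mean cosine similarity over the k nearest within-cluster neighbours. This is a pure-Python illustration; in the paper the embeddings come from the Universal Sentence Encoder, and the function names are ours:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def knn_similarity_score(embeddings, i, cluster, k=50):
    # Mean cosine similarity between sample i and its k most similar
    # neighbours within its own cluster (self excluded).
    sims = sorted((cosine(embeddings[i], embeddings[j]) for j in cluster if j != i),
                  reverse=True)
    top = sims[:k]
    return sum(top) / len(top)
```

Comparing the distributions of this score over D_real and D_gen, cluster by cluster, yields the histograms contrasted in Figure 3.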
Although our generated sentences are clearly more similar between themselves than to the original ones, our model is able to produce samples clustered according to their labels. This highlights the good attribute control abilities of our CGA model and shows that it is able to generate diverse sentences which robustly mimic the structure of the original dataset. Hence, the generated sentences are good candidates for augmenting existing datasets.
We generalized this experiment to the multi-attribute case. The similarity matrices and the corresponding histograms can be found in the supplementary material (A.3.1).

Data Augmentation
The main application of our work is to generate sentences for data augmentation purposes. Simultaneously, the data augmentation experiments presented in this section reinforce the high quality of the sentences generated by our model. As described, we conduct all experiments on two datasets, YELP and IMDB reviews. We train an LSTM sentiment classifier on both datasets, each with three different training set sizes. We run all experiments for training sets of 500, 1000 and 10000 sentences. These training sets are then augmented with different percentages of generated sentences (10, 20, 30, 50, 70, 100, 120, 150 and 200%). This allows us to analyze the effect of data augmentation on varying original training set sizes, as well as varying increments of additionally generated data. In all experiments we average the results over 5 random seeds and we report the corresponding standard deviation.
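The augmentation procedure itself can be sketched as follows; the function and variable names are ours, and random sampling of the generated pool is an assumed policy:

```python
import random

def augment(train, generated, pct, rng=random):
    # Add pct% of the original training-set size in sampled generated
    # sentences (with their attribute labels) to the training set.
    n = int(len(train) * pct / 100)
    return train + rng.sample(generated, n)
```

Running this for each percentage in {10, 20, 30, 50, 70, 100, 120, 150, 200} and each base size in {500, 1000, 10000} reproduces the experimental grid described above.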
To evaluate how beneficial our generated sentences are for the performance of downstream tasks, we compare training sets augmented with sentences generated from our CGA model to (a) real sentences from the original datasets, and (b) sentences generated with the Easy Data Augmentation (EDA) method by Wei and Zou (2019). EDA applies a transformation (e.g., synonym replacement or random deletion) to a given sentence of the training set and provides a strong baseline.
The results are presented in Figures 4 and 5, for YELP and IMDB, respectively. They show the performance of the classifiers augmented with sentences from our CGA model, from EDA and from the original datasets. Our augmentation method proved favorable in all six scenarios. Our model clearly outperforms EDA in all scenarios, especially with larger augmentation percentages. The performance of the classifiers augmented with CGA sentences is on par with adding real data, and only begins to diverge when augmenting the training set with more than 100% of generated data.
In Table 6, we report the best average test accuracy as well as the percentage of data increment of real data, EDA and our CGA model for all three training set sizes and both datasets. Numerical results for all augmentation percentages, including validation performance, can be found in the supplementary materials (A.3.2).

[Table 6: Best performance for each method independent of the augmentation percentage used. For each method we report accuracy, standard deviation, and augmentation percentage.]

Comparison to Previous Work
As a final analysis, we compare our results with previous state-of-the-art models for both single-attribute and multi-attribute control. Single-attribute models are specifically designed to control one attribute by approximating the style of a sentence with its sentiment. Shen et al. (2017) achieve better sentiment matching accuracy (96% and 93.4%, respectively) in the automatic evaluation than our CGA model trained for a single attribute (93.1%). However, our CGA model obtains 96.3% in human evaluation, which is comparable with these works. Moreover, CGA offers a strong competitive advantage because it guarantees high sentiment matching accuracy while controlling additional attributes and, thus, offers greater control over multiple stylistic aspects of a sentence. Subsequent works follow the approach of the CAAE with a two-phase training procedure. Unlike our CGA model, these works enforce content preservation and require input reference sentences. Hence, it is not straightforward to directly compare the results. However, their reported attribute matching accuracies for the SENTIMENT and VERB TENSE attributes are considerably lower than ours (91.1% and 96.6%, respectively). CGA also yields significantly better performance in the human evaluation. Recently, Wang et al. (2019) proposed an architecture for multi-attribute control. However, they focus merely on sentiment-related attributes, while our CGA model is able to control both semantic and syntactic attributes.

Multi-Attribute Control
These previous works reported content preservation as an additional evaluation metric. It is important to note that this metric is of no interest for our work since, differently from these previous models, CGA generates sentences directly from an arbitrary hidden representation and does not need a reference input sentence. Moreover, our CGA model is scalable to more attributes, while the previous architectures require multiple discriminators for controlling the attributes. Although we provide extensive evaluation analyses, it remains an open research question to define an appropriate evaluation metric for text generation that allows for neutral comparisons.

Conclusion
To the best of our knowledge, we propose the first framework for controlled natural language generation which (1) generates coherent sentences with multiple semantic and syntactic attributes by sampling from a smooth latent space; (2) works within a lean and scalable architecture; and (3) improves downstream tasks by synthesizing additional labeled data.
To sum up, our CGA model, which combines a context-aware loss function with a cyclical word-dropout routine, achieves state-of-the-art results with improved accuracy on sentiment, verb tense and person number attributes in automatic and human evaluations. Moreover, our experiments show that our CGA model can be used effectively as a data augmentation framework to boost the performance of downstream classifiers.
A thorough investigation of the quality of the attribute-invariant representation in terms of independence between the context and the attribute vector will provide further insights. Additionally, a benchmark study of the maximum possible length of the generated sentences and the number of controllable attributes will deepen our understanding of the capabilities and limitations of CGA.

A.1.1 Datasets

We use YELP and IMDB for the training, validation and testing of our CGA models. The label distributions for all attributes are described in Table 7.
From the YELP business reviews dataset (YELP, 2014) 4, we use reviews only from the category restaurants. We use the same splits for training, validation and testing as John et al. (2019), which contain 444,101, 63,483 and 126,670 sentences, respectively. The vocabulary contains 9,304 words.
We further evaluate our models on the IMDB dataset of movie reviews (Maas et al., 2011) 5. We use reviews with less than 20 sentences and we select only sentences with less than 20 tokens. Our final dataset contains 122,345, 12,732 and 21,224 sentences for train, validation and test, respectively. The vocabulary size is 15,362 words.

A.1.2 Attribute Labeling
In this work we simultaneously control three attributes: SENTIMENT, VERB TENSE and PERSON NUMBER.
We use SpaCy's Part-of-Speech tagging to assign the VERB TENSE labels. Specifically, we use the tags VBP and VBZ to identify present verbs, and the tag VBD to identify past verbs.
Analogously, we use SpaCy's PoS tags and the personal pronouns to assign PERSON NUMBER labels. In particular, we use the tag NN, which identifies singular nouns, and the list of pronouns {i, he, she, it, myself} to identify a singular sentence. We use NNS and the list of pronouns {we, they, themselves, ourselves} to identify a plural sentence.
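Given Penn Treebank PoS tags for a sentence (as produced by SpaCy), the labeling rules above can be sketched as follows; tie-breaking behavior is an assumption where the paper leaves it unspecified:

```python
SINGULAR_PRONOUNS = {"i", "he", "she", "it", "myself"}
PLURAL_PRONOUNS = {"we", "they", "themselves", "ourselves"}

def label_verb_tense(pos_tags):
    # VBP/VBZ mark present verbs, VBD past verbs (Penn Treebank tags).
    present = sum(t in ("VBP", "VBZ") for t in pos_tags)
    past = sum(t == "VBD" for t in pos_tags)
    # Tie-breaking toward "past" is an assumption; the paper only defines
    # "present" as containing more present than past verbs.
    return "present" if present > past else "past"

def label_person_number(tokens, pos_tags):
    # NN marks singular nouns, NNS plural nouns; pronoun lists as above.
    sing = sum(t == "NN" for t in pos_tags) + sum(w.lower() in SINGULAR_PRONOUNS for w in tokens)
    plur = sum(t == "NNS" for t in pos_tags) + sum(w.lower() in PLURAL_PRONOUNS for w in tokens)
    if sing > plur:
        return "singular"
    if plur > sing:
        return "plural"
    return "balanced"
```

In practice the tags would come from running SpaCy's tagger over each sentence before applying these counting rules.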

A.2 Training Details
All hyper-parameters were manually tuned. We report the tested ranges in square brackets.
VAE architecture Our VAE has one GRU encoder and one GRU decoder. The encoder has a hidden layer of 256 dimensions [64, 128, 256, 512], linearly transformed into a content vector of 32 dimensions (for one or two attributes) or 50 dimensions (for three attributes) [32, 50, 64, 128]. For training the decoder we set the initial hidden state as h = Linear(z ⊕ a). Moreover, we use teacher forcing combined with the cyclical word-dropout described in Equation 4.

4 Retrieved from https://github.com/shentianxiao/language-style-transfer
5 Retrieved from https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
Discriminator The discriminator is used to create the attribute-invariant content vectors. We experimented with two architectures for the discriminator, which yielded similar results: a two-layer (64 dimensions each) fully-connected architecture with batch normalization; and a single-layer LSTM with 50 dimensions (for one or two attributes) or 64 dimensions (for three attributes).
KL-Annealing One of the challenges during the training process was the posterior collapse of the KL term. Similar to Bowman et al. (2016), we used a logistic KL annealing:

λ_kl(x) = 1 / (1 + exp(−K (x − x_0 / 2))),

where x is the current training step and x_0 indicates how many training steps are needed to set λ_kl = 1. K is a constant value given by:

K = (2 / x_0) · ln((1 − ε) / ε),

where ε is a constant we set to 10^-4. We set x_0 = 1000 for YELP and x_0 = 5000 for IMDB.
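A sketch of a logistic annealing schedule consistent with this description; the exact parameterization of K, chosen here so that λ_kl rises from ε at step 0 to 1 − ε at step x_0, is an assumption:

```python
import math

def kl_weight(x, x0, eps=1e-4):
    # Logistic annealing from ~eps at step 0 to 1 - eps at step x0.
    # K controls the steepness of the sigmoid; this parameterization
    # is an assumed reconstruction of the paper's schedule.
    K = (2.0 / x0) * math.log((1.0 - eps) / eps)
    return 1.0 / (1.0 + math.exp(-K * (x - x0 / 2.0)))
```

With x_0 = 1000 (the YELP setting), the weight is essentially zero for the first few hundred steps and saturates near 1 shortly after step 1000.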

Discriminator Weight
The interaction between the VAE and the discriminator is a crucial factor for our model. Thus, we decide to linearly increase the discriminator weight λ_disc during the training process:

λ_disc(x) = t · min(1, max(0, x − k_1) / x_0),

where t is the maximum value that λ_disc can have, x_0 indicates after how many training steps λ_disc = t, x is the current training step, and k_1 is the warm-up value: it indicates after how many training steps L_disc is included in L_CGA. We set t = 20, x_0 = 6K and k_1 = 12K for YELP, or x_0 = 3K and k_1 = 5K for IMDB.

[Figure 6: Similarity matrices for (a) real data and (b) data generated by our CGA model controlling the SENTIMENT and VERB TENSE attributes.]
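A sketch of the resulting weight schedule, under the assumption that the ramp starts at the warm-up step k_1 and reaches t after a further x_0 steps (YELP settings shown as defaults):

```python
def disc_weight(x, t=20.0, x0=6000, k1=12000):
    # 0 during the warm-up (first k1 steps), then a linear ramp that
    # reaches t after a further x0 steps. The exact anchoring of the
    # ramp is an assumed reconstruction of the paper's schedule.
    if x < k1:
        return 0.0
    return min(t, t * (x - k1) / x0)
```

Keeping λ_disc at zero during warm-up lets the VAE first learn a useful latent code before the adversarial pressure is applied.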

A.3.1 Sentence Embedding Similarity
Following the approach described in the main paper, we report the results of the sentence embedding similarities for the multi-attribute case (SENTIMENT and VERB TENSE). Similarly to the similarity matrices for the single-attribute case, in Figure 6 we recognize the clustered structure of the similarities. These matrices can be divided into the following clusters:

• Intra-class clusters: the clusters placed along the diagonal of the matrices, showing high cosine similarity scores. They contain similarity scores between the embeddings of samples with the same labels.

• Cross-class clusters: the clusters located off the diagonal. They contain the similarity scores between embeddings of samples with different labels and, indeed, show lower similarity scores.
To gain a qualitative understanding of the generation capacities of our model, we assume that an ideal generative model should produce samples that have comparable similarity scores to the ones of the real data. We contrast the similarity scores computed on each cluster separately in the histograms in Figures 7 and 8.

A.3.2 Data Augmentation
For the data augmentation experiments we use a bidirectional LSTM with input size 300 and hidden size 256 [64, 128, 256, 512]. The input size is given by the 300-dimensional pre-trained GloVe embeddings (Pennington et al., 2014). We set dropout to 0.8 [0.5, 0.6, 0.7, 0.8]. For the training we use early stopping; specifically, we stop the training process after 8 epochs without improvement in the validation loss. Tables 8 and 9 show the detailed results for the data augmentation experiments on IMDB and YELP, respectively. The standard deviation of all results over 5 random seeds was always <0.025. Table 10 shows the corresponding performance on the validation set for the results on the test set presented in Table 6.

A.3.3 TextCNN
For the attribute matching results presented in Section 3 we use a pre-trained TextCNN (Kim, 2014). This network uses 100-dimensional GloVe word embeddings (Pennington et al., 2014) and 3 convolutional layers with 100 filters each. The dropout rate is set to 0.5 during training.

A.4 Generated Sentences
Tables 11 to 13 provide example sentences generated by the CGA model for the three individual attributes. Moreover, the code repository provides all generated sentences 6 .

A.5 Computing Infrastructure
All models presented in this work were implemented in PyTorch, and trained and tested on single Titan XP GPUs with 12GB memory. The average runtime was 07:26:14 for the model trained on YELP and 04:09:54 for the model trained on IMDB.

6 Link omitted for review

[Table 9: Detailed accuracy numbers for IMDB data augmentation results presented in Figure 5.]

Sentence                                                          Sentiment
but i'm very impressed with the food and the service is great.    Positive
i love this place for the best sushi!                             Positive
it is a great place to get a quick bite and a great price.        Positive
it's fresh and the food was good and reasonably priced.           Positive
not even a good deal.                                             Negative
so i ordered the chicken and it was very disappointing.           Negative
by far the worst hotel i have ever had in the life.               Negative
the staff was very rude and unorganized.                          Negative

Table 11: Examples of generated sentences controlling the SENTIMENT attribute.

Sentence                                                          Tense
i love the fact that they have a great selection of wines.        Present
they also have the best desserts ever.                            Present
the food is good , but it's not worth the wait for it.            Present
management is rude and doesn't care about their patients.         Present
my family and i had a great time.                                 Past
when i walked in the door , i was robbed.                         Past
had the best burger i've ever had.                                Past
my husband and i enjoyed the food.                                Past

Table 12: Examples of generated sentences controlling the VERB TENSE attribute.

Sentence                                                          Person
it was a little pricey but i ordered the chicken teriyaki.        Singular
she was a great stylist and she was a sweetheart.                 Singular
worst customer service i've ever been to.                         Singular
this is a nice guy who cares about the customer service.          Singular
they were very friendly and eager to help.                        Plural
these guys are awesome!                                           Plural
the people working there were so friendly and we were very nice.  Plural
we stayed here for NUM nights and we will definitely be back.     Plural

Table 13: Examples of generated sentences controlling the PERSON NUMBER attribute.