A Dual-generator Network for Text Style Transfer Applications

We propose DGST, a novel and simple Dual-Generator network architecture for text Style Transfer. Our model employs two generators only, and does not rely on any discriminators or parallel corpus for training. Both quantitative and qualitative experiments on the Yelp and IMDb datasets show that our model gives competitive performance compared to several strong baselines with more complicated architecture designs.


Introduction
Attribute style transfer is a task which seeks to change a stylistic attribute of text while preserving its attribute-independent information. Sentiment transfer is a typical example of this kind, which focuses on controlling the sentiment polarity of the input text (Shen et al., 2017). Given a review "the service was very poor", a successful sentiment transferrer should convert the negative sentiment of the input to positive (e.g., replacing the phrase "very poor" with "pretty good"), while keeping all other information unchanged (e.g., the aspect "service" should not be changed to "food"). Without supervised signals from parallel data, a transferrer must be supervised in a way that ensures the generated text belongs to a certain style category (i.e., transfer intensity). There is a growing body of studies that intensify the target style by means of adversarial training (Fu et al., 2018), variational autoencoders (John et al., 2019; Fang et al., 2019), generative adversarial nets (Shen et al., 2017; Yang et al., 2018), or subspace matrix projection (Li et al., 2020). Furthermore, in order to boost the preservation of non-attribute information during style transformation, some works explicitly focus on modifying sentiment words, the so-called "pivot words" (Wu et al., 2019). There are also works which add extra components to constrain the content from being changed too much. These include models like autoencoders (Lample et al., 2019; Dai et al., 2019), part-of-speech preservation, and the content conditional language model (Tian et al., 2018). In order to achieve high-quality style transfer, existing works normally resort to adding additional inner or outer structures, such as additional adversarial networks or data pre-processing steps (e.g., generating pseudo-parallel corpora). This inevitably increases the complexity of the model and raises the bar on training data requirements.
In this paper, we propose a novel and simple model architecture for text style transfer, which employs two generators only. In contrast to some of the dominant approaches to style transfer, such as CycleGAN (Zhu et al., 2017), our model does not employ any discriminators and yet can be trained without requiring any parallel corpus. We achieve this by developing a novel sentence noisification approach called neighbourhood sampling, which introduces noise to each input sentence dynamically. The noisified sentences are then used to train our style transferrers in a way similar to the training of denoising autoencoders (Vincent et al., 2008). Both quantitative and qualitative evaluation on the Yelp and IMDb benchmark datasets shows that DGST gives competitive performance compared to several strong baselines which have more complicated model designs. The code of DGST is available at: https://xiao.ac/proj/dgst.

Methodology
Suppose we have two non-parallel corpora X and Y with styles S_x and S_y. The goal is to train two transferrers, each of which can (i) transfer a sentence from one style (either S_x or S_y) to the other (i.e., transfer intensity); and (ii) preserve the style-independent content during the transformation (i.e., preservation). Specifically, we denote the two transferrers f and g: f : X → Y transfers a sentence x ∈ X with style S_x to y* with style S_y; likewise, g : Y → X transfers a sentence y ∈ Y with style S_y to x* with style S_x. To obtain good style transfer performance, f and g need to achieve both a high transfer intensity and a high preservation, which can be formulated as follows:

    f(x) ∈ Y and g(y) ∈ X,  ∀x ∈ X, ∀y ∈ Y    (1)

    minimise D(x || f(x)) and D(y || g(y)),  ∀x ∈ X, ∀y ∈ Y    (2)

Here D(x||y) is a function that measures the abstract distance between sentences in terms of the minimum edit distance, where the editing operations Φ include word-level replacement, insertion, and deletion (i.e., the Hamming distance or the Levenshtein distance). On the one hand, Eq. 1 requires that the transferred text fall within the target style space (i.e., X or Y). On the other hand, Eq. 2 constrains the transferred text from changing too much, i.e., it preserves the style-independent information.
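The distance D above can be instantiated as a token-level Levenshtein distance over the editing operations in Φ. The following is a minimal, self-contained sketch (the function name and the dynamic-programming formulation are illustrative, not taken from the paper):

```python
from typing import List

def token_edit_distance(x: List[str], y: List[str]) -> int:
    """Minimum number of word-level replacements, insertions and deletions
    (i.e. token-level Levenshtein distance) needed to turn x into y."""
    # prev[j] holds the distance between the current prefix of x and y[:j]
    prev = list(range(len(y) + 1))
    for i, xw in enumerate(x, 1):
        curr = [i] + [0] * len(y)
        for j, yw in enumerate(y, 1):
            cost = 0 if xw == yw else 1
            curr[j] = min(prev[j] + 1,         # delete xw
                          curr[j - 1] + 1,     # insert yw
                          prev[j - 1] + cost)  # replace (or keep if equal)
        prev = curr
    return prev[len(y)]

print(token_edit_distance("the service was very poor".split(),
                          "the service was pretty good".split()))  # → 2
```

With this instantiation, Eq. 2 asks that the transferred sentence stays within a small edit distance of its source.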
Inspired by CycleGAN (Zhu et al., 2017), our model (sketched in Figure 1) is trained by a cyclic process: for each transferrer, a text is transferred to the target style, and then back-transferred to the source style using the other transferrer. In order to transfer a sentence to a target style while preserving the style-independent information, we formulate two sets of training objectives: one set ensures that the content of the input sentence is preserved as much as possible (detailed in §2.1), and the other set is responsible for transferring the input text to the target style (detailed in §2.2).

Preserving the Content of Input Text
This section discusses our loss function which enforces our transferrers to preserve the style-independent information of the input. A common solution to this problem is to use the reconstruction loss of autoencoders (Dai et al., 2019), which is also known as the identity loss (Zhu et al., 2017). However, too much emphasis on preserving the content would hinder the style transferring ability of the transferrers. To balance our model's capability in content preservation and transfer intensity, we instead first train our transferrers in the manner of denoising autoencoders (DAE; Vincent et al., 2008), which has been shown to help preserve the style-independent content of the input text (Shen et al., 2020). More specifically, we train f (or g; we use f as an example in the rest of this section) by feeding it a noisy sentence ẙ as input, where ẙ is noisified from y ∈ Y, and f is expected to reconstruct y.
Different from previous works which use DAE in style transfer or MT (Artetxe et al., 2018; Lample et al., 2019), we propose a novel sentence noisification approach, named neighbourhood sampling, which introduces noise to each sentence dynamically. For a sentence y, we define U_α(y, γ) as a neighbourhood of y, which is a set of sentences consisting of y and all variations of noisified y with the same noise intensity γ (which will be explained later). The size of the neighbourhood U_α(y, γ) is determined by the proportion (denoted by m) of tokens in y that are modified using the editing operations in Φ. Here the proportion m is sampled from a folded normal distribution F. We hereby define that the average value of m (i.e., the mean of F) is the noise intensity γ. Formally, m is defined as:

    m ∼ F(γ, σ²)    (3)

That said, a neighbourhood U_α(y, γ) would be constructed using y and all sentences that are created by modifying (m × length(y)) words in y, from which we sample ẙ, i.e., a noisified sentence of y: ẙ ∼ U_α(y, γ). Analogously, we could also construct a neighbourhood U_β(x, γ) for x ∈ X and sample x̊ from it. Using these noisified data as inputs, we then train our transferrers f and g in the manner of DAE by optimising the following reconstruction objectives:

    L_rec(f) = Σ_{y∈Y} D(y || f(ẙ)),  L_rec(g) = Σ_{x∈X} D(x || g(x̊))    (4)

With Eq. 4, we essentially encourage the generators to preserve the input as much as possible.
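Neighbourhood sampling can be sketched as follows. This is a simplified illustration, not the paper's implementation: it uses replacement only (insertions and deletions would be handled analogously), treats γ as the mean of the underlying normal before folding (the folded mean only approximates γ when γ is well above σ), and the function names and the `vocab` list are hypothetical:

```python
import random

def sample_noise_proportion(gamma: float, sigma: float = 0.1) -> float:
    """Draw the edit proportion m from a folded normal distribution,
    realised here as the absolute value of a Gaussian draw."""
    return abs(random.gauss(gamma, sigma))

def neighbourhood_sample(tokens, gamma, vocab, sigma=0.1):
    """Return one noisified sentence from the neighbourhood U(y, gamma):
    a proportion m of token positions is replaced by random vocabulary
    words (replacement-only simplification of the operations in Phi)."""
    m = min(1.0, sample_noise_proportion(gamma, sigma))
    n_edits = round(m * len(tokens))
    noisy = list(tokens)
    for idx in random.sample(range(len(tokens)), n_edits):
        noisy[idx] = random.choice(vocab)
    return noisy

random.seed(0)
vocab = ["good", "bad", "food", "place", "<unk>"]  # toy vocabulary
print(neighbourhood_sample("the service was very poor".split(), 0.3, vocab))
```

Because m is redrawn for every sentence, the amount of noise varies dynamically around the intensity γ, as described above.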

Transferring Text Styles
Making use of non-parallel datasets, we train f and g in an iterative process. Let M = {g(y) | y ∈ Y} be the range of g when the input is all sentences in the training set Y. Similarly, we can define N = {f(x) | x ∈ X}. During the training cycle of f, g is kept unchanged. We first feed each sentence y ∈ Y to g, which tries to transfer y to the target style X (i.e., ideally x* = g(y) ∈ X). In this way, we obtain M, which is composed of all x* for each y ∈ Y. Next, we sample x̊* (a noised version of x*) via neighbourhood sampling, i.e., x̊* ∼ U_α(x*, γ) = U_α(g(y), γ). We use M̊ to represent the collection of all x̊*. Similarly, we obtain N and N̊ using the aforementioned procedures during the training cycle for g. Instead of directly using the sentences from X for training, we use M̊ to train f by forcing f to transfer each x̊* back to the corresponding original y. In parallel, N̊ is utilised to train g. We represent the aforementioned operation as the transfer objective:

    L_tran(f) = Σ_{y∈Y} D(y || f(x̊*)),  L_tran(g) = Σ_{x∈X} D(x || g(ẙ*))    (5)

where x̊* ∼ U_α(g(y), γ) and ẙ* ∼ U_β(f(x), γ).
The main difference between Eq. 4 and Eq. 5 is how U_α(·, γ) and U_β(·, γ) are constructed, i.e., U_α(y, γ) and U_β(x, γ) in Eq. 4 compared to U_α(g(y), γ) and U_β(f(x), γ) in Eq. 5. Finally, the overall loss of DGST is the sum of the four partial losses:

    L = L_rec(f) + L_rec(g) + L_tran(f) + L_tran(g)    (6)

During optimisation, we freeze g when optimising f, and vice versa. Note also that, for the transfer objective, x* must be generated first and then noisified into x̊* before being passed to f; in contrast, it is not necessary to noisify y before obtaining x* = g(y).
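The training procedure above can be summarised in a schematic loop. This is a structural sketch only: `DummyTransferrer`, `noisify`, and the dummy loss are hypothetical stand-ins (a real implementation would use seq2seq models and backpropagate through `step`), and freezing is expressed by simply not calling the frozen model's `step`:

```python
import random

class DummyTransferrer:
    """Stand-in for a seq2seq transferrer: `generate` maps a token list to
    a token list; `step` takes (noisy_input, target) and returns a loss.
    Both are placeholders for real BiLSTM encoder-decoders."""
    def generate(self, tokens):
        return list(tokens)            # identity stand-in for transfer
    def step(self, noisy, target):
        return float(len(target))      # dummy "loss"; real code backprops here

def noisify(tokens, gamma):
    """Crude neighbourhood sampling: drop roughly a gamma-fraction of tokens."""
    return [t for t in tokens if random.random() > gamma] or list(tokens)

def train_epoch(f, g, X, Y, gamma=0.3):
    """One epoch of DGST-style training: reconstruction + transfer
    objectives; g is frozen while f is updated, and vice versa."""
    total = 0.0
    for y in Y:  # update f: reconstruct y from noisy y and from noisy g(y)
        total += f.step(noisify(y, gamma), y)       # reconstruction (Eq. 4)
        x_star = g.generate(y)                      # g is frozen here
        total += f.step(noisify(x_star, gamma), y)  # transfer (Eq. 5)
    for x in X:  # update g symmetrically, with f frozen
        total += g.step(noisify(x, gamma), x)
        y_star = f.generate(x)
        total += g.step(noisify(y_star, gamma), x)
    return total

random.seed(1)
X = [["the", "food", "was", "bad"]]
Y = [["the", "food", "was", "good"]]
print(train_epoch(DummyTransferrer(), DummyTransferrer(), X, Y))  # → 16.0
```

Note how, in line with the text, x* = g(y) is computed from the clean y and only then noisified before being fed to f.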

Setup
Dataset. We evaluated our model on two benchmark datasets: the Yelp review dataset (Yelp), which consists of restaurant and business reviews together with their sentiment polarity (i.e., positive or negative), and the IMDb Movie Review Dataset (IMDb), which consists of online movie reviews. For Yelp, we split the dataset following prior work, which also provides human-produced reference sentences for evaluation. For IMDb, we follow the pre-processing and data splitting protocol of Dai et al. (2019). Detailed dataset statistics are given in Table 1.

Evaluation Protocol. Following standard evaluation practice, we evaluate the performance of our model on the textual style transfer task from two aspects: (1) Transfer Intensity: a style classifier is employed for quantifying the intensity of the transferred text. In our work, we use FastText (Joulin et al., 2017) trained on the training set of Yelp; (2) Content Preservation: to validate whether the style-independent content is preserved by the transferrer, we calculate self-BLEU, which computes a BLEU score (Papineni et al., 2002) by comparing the inputs and outputs of a system. A higher self-BLEU score indicates that more tokens from the source are retained, and hence better preservation of the content. In addition, we also use ref-BLEU, which compares the system outputs against human-written references.
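For concreteness, self-BLEU can be sketched as a plain sentence-level BLEU between a system's input and its own output. The following is a minimal, simplified BLEU (uniform n-gram weights, brevity penalty, no smoothing) rather than the exact scorer used in the paper's evaluation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, hypothesis, max_n=4):
    """A minimal sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty. A simplification
    of Papineni et al. (2002); no smoothing is applied."""
    precisions = []
    for n in range(1, max_n + 1):
        ref, hyp = ngrams(reference, n), ngrams(hypothesis, n)
        overlap = sum((ref & hyp).values())      # clipped matches
        total = max(1, sum(hyp.values()))
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return bp * math.exp(log_avg)

# self-BLEU: compare a system output against its own input
src = "the garlic bread was bland and cold".split()
out = "the garlic bread was tasty and fresh".split()
print(round(bleu(src, out), 3))  # → 0.435
```

An unchanged output would score 1.0, so higher self-BLEU indicates stronger content preservation; ref-BLEU uses the same computation against the human reference instead of the input.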

Experimental Results
In our experiments, the two transferrers (f and g) are stacked BiLSTM-based sequence-to-sequence models, i.e., 4-layer BiLSTMs for both the encoder and the decoder. The noise intensity γ is set to 0.3 for the first 50 epochs and 0.03 in the following epochs. As shown in Table 2, on the Yelp dataset our model beats all baseline models (apart from StyleTransformer (Multi-Class)) on both ref-BLEU and self-BLEU. In addition, our model works remarkably well on both transfer intensity and preservation without requiring adversarial training, reinforcement learning, or external offline sentiment classifiers (as in Dai et al. (2019)). Besides, the current version of our model is built upon fundamental BiLSTMs, which is a likely explanation of why we fall behind the SOTA (i.e., StyleTransformer (Multi-Class)) by a small margin, as the latter is based on the Transformer architecture (Vaswani et al., 2017) with much higher capacity. On the IMDb dataset, compared to other systems, our model obtains moderate accuracy but a competitive self-BLEU score (70.2), i.e., slightly lower than StyleTransformer.

Table 3 lists several examples of sentiment style transfer on both datasets:

Yelp positive → negative
  input: this golf club is one of the best in my opinion .
  output: this golf club is one of the worst in my opinion .
  input: i definitely recommend this place to others !
  output: i do not recommend this to anyone !

Yelp negative → positive
  input: the garlic bread was bland and cold .
  output: the garlic bread was tasty and fresh .
  input: my dish was pretty salty and could barely taste the garlic crab .
  output: my dish was pretty good and could even taste the garlic crab .

IMDb positive → negative
  input: a timeless classic , one of the best films of all time .
  output: a complete disaster , one of the worst films of all time .
  input: and movie is totally backed up by the excellent music both in background and in songs by monty .
  output: the movie is totally messed up by the awful music both in background and in songs by chimps .

IMDb negative → positive
  input: this one is definitely one for my " worst movies ever " list .
  output: this one is definitely one of my " best movies ever " list .
  input: i found this movie puerile and silly , as well as predictable .
  output: i found this movie credible and funny , as well as tragic .

By examining the results, we can see that DGST is quite effective in transferring the sentiment polarity of the input sentence while maintaining the non-sentiment information.

Ablation Study
To confirm the validity of our model, we conducted an ablation study on Yelp by eliminating or modifying a certain component (e.g., objective functions or the sampling neighbourhood). We tested the following variants: 1) full-model: the proposed model; 2) no-tran: the model without the transfer objective; 3) no-rec: the model without the reconstruction objective; 4) rec-no-noise: the model adding no noise when optimising the reconstruction objective; 5) tran-no-noise: the model adding no noise when optimising the transfer objective; 6) pre-noise: the model trained by adding noise to y first and then feeding the noisified sentences ẙ to g (or x̊ to f) in Eq. 5. In this study, the transferrers are the simplest LSTM-based sequence-to-sequence models. The hidden size and γ are set to 256 and 0.3, respectively.

Table 4 shows the outputs of each variant on two example pairs:

Example 1, positive → negative
  input: it is a cool place , with lots to see and try .
  full-model: it is a sad place , with lots to see and something .
  no-rec: no no , , num .
  rec-no-noise: it is a cool place , with me to see and try .
  no-tran: it is a loud place , with lots to try and see .
  tran-no-noise: it is a noisy place , with lots to try and see .
  pre-noise: it is a cool place , with lots to see to try .

Example 1, negative → positive
  input: so , that was my one and only time ordering the benedict there .
  full-model: so , that was my one and best time in the shopping there .
  no-rec: so , that was my one and time time over the there there .
  rec-no-noise: service was very friendly .
  no-tran: so , that was my only and first visit ordering the there ) .
  tran-no-noise: so , that was my one and time time ordering the ordering there .
  pre-noise: so , that 's one one and my only the the day there .

Example 2, positive → negative
  input: it is the most authentic thai in the valley .
  full-model: it is the most overrated thai in the valley .
  no-rec: i was in the the the the food .
  rec-no-noise: it is the most authentic thai in the valley .
  no-tran: it is the most authentic thai in the valley .
  tran-no-noise: it is the most common thai in the valley .
  pre-noise: it is the most thai thai in the valley .

Example 2, negative → positive
  input: even if i was insanely drunk , i could n't force this pizza down .
  full-model: even if i was n't hungry , i 'll definitely enjoy this pizza here .
  no-rec: she was perfect .
  rec-no-noise: even if i was n't , , i could n't recommend this pizza . .
  no-tran: even if i was n't , , i could n't get this pizza down .
  tran-no-noise: even if i was hungry hungry , i could n't love this pizza shop .
  pre-noise: even if i was n't hungry , i could n't recommend this pizza down .
Results. Table 5 depicts the results of the ablation study. As we can see, eliminating the reconstruction or transfer objectives damages preservation and transfer intensity, respectively. As for the use of noise, the results of the rec-no-noise model show that the noise in the reconstruction objective helps balance our model's ability in content preservation and transfer intensity. For the transfer objective, omitting noise (tran-no-noise) reduces the transfer intensity, while placing noise in the wrong position (pre-noise) reduces it further.

Case Study. Transferred sentences produced by each model variant in the ablation study are listed in Table 4. The model without the reconstruction objective (no-rec) collapsed, and as a result it generates sentences irrelevant to the inputs most of the time. When neighbourhood sampling is dropped from either the reconstruction or the transfer objective, the transfer intensity is reduced. These models, including rec-no-noise, tran-no-noise, and pre-noise, tend to substitute random words, resulting in reduced transfer intensity (i.e., style words are either not modified or still express the same sentiment after modification) and reduced preservation. For example, when transferring from negative to positive, rec-no-noise replaces "force" with "recommend", producing "I couldn't recommend this pizza", which is still a negative review.

Conclusion
In this paper, we propose a novel and simple dual-generator network architecture for text style transfer, which does not rely on any discriminators or parallel corpora for training. Extensive experiments on two public datasets show that our model yields competitive performance compared to several strong baselines, despite our simpler model architecture design.