Semi-supervised Text Style Transfer: Cross Projection in Latent Space

Text style transfer requires a model to transfer a sentence of one style to another style while retaining its original content meaning, a challenging problem that has long suffered from the shortage of parallel data. In this paper, we first propose a semi-supervised text style transfer model that combines small-scale parallel data with large-scale nonparallel data. With these two types of training data, we introduce a projection function between the latent spaces of different styles and design two constraints to train it. We also introduce two other simple but effective semi-supervised methods for comparison. To evaluate the performance of the proposed methods, we build and release a novel style transfer dataset that alters sentences between the style of ancient Chinese poetry and modern Chinese.


Introduction
Recently, natural language generation (NLG) tasks have been attracting growing attention from researchers, including response generation (Vinyals and Le, 2015), machine translation (Bahdanau et al., 2014), automatic summarization (Chopra et al., 2016), question generation (Gao et al., 2019), etc. Among these generation tasks, one interesting but challenging problem is text style transfer (Shen et al., 2017; Fu et al., 2018; Logeswaran et al., 2018). Given a sentence from one style domain, a style transfer system is required to convert it to another style domain while keeping its content meaning unchanged. As a fundamental attribute of text, style can have a broad and ambiguous scope, such as ancient poetry style v.s. modern language style and positive sentiment v.s. negative sentiment.

* This work was done when Mingyue Shang was an intern at Tencent AI Lab.
† Corresponding author: Rui Yan (ruiyan@pku.edu.cn)

Building such a style transfer model has long suffered from the shortage of parallel training data, since constructing a parallel corpus that aligns the content meaning of different styles is costly and laborious, which makes it difficult to train in a supervised way. Even when some parallel corpora are built, they are still of a deficient scale for neural network based models. To tackle this issue, previous works utilize nonparallel data to train the model in an unsupervised way. One commonly used method is disentangling the style and content of the source sentence (John et al., 2018; Shen et al., 2017; Hu et al., 2017). For the input, these methods learn representations of the style and of the style-independent content, expecting the latter to keep only the content information. The content-only representation is then coupled with the representation of a style that differs from the input to produce a style-transferred sentence. The crucial part of such methods is that the encoder should accurately disentangle the style and content information. However, Lample et al. (2019) illustrated that disentanglement is not a facile thing and that the existing methods are not adequate to learn style-independent representations.
Considering the above-discussed issues, in this paper, instead of disentangling the input, we propose a differentiable encoder-decoder based model that contains a projection layer to build a bridge between the latent spaces of different styles. Concretely, for texts in different styles, the encoder converts them into latent representations in different latent spaces. We introduce a projection function between two latent spaces that projects the representation in the latent space of one style to another style. Then the decoder generates an output using the projected representation.
In the majority of cases, nonparallel corpora of different styles are accessible. Based on these datasets, it is feasible to build small-scale parallel datasets. Therefore, we design two kinds of objective functions for our model so that it can be trained under both supervised and unsupervised settings. With the parallel data, the model learns the projection relationship and the standard representation of the target latent space from the ground-truth. Without the parallel data, we train the model by back-projection between the source and target latent spaces so that it can learn from itself. During training, we incorporate these two kinds of signals to train the model in a semi-supervised way.
We conduct experiments on an English formality transfer task that alters the sentence between formal and informal styles, and a Chinese literal style transfer task that alters between the ancient Chinese poem sentences and modern Chinese sentences. We evaluate the performance of models from the degree of content preservation, the accuracy of the transferred styles, and the fluency of sentences using both automatic metrics and human annotations. Experimental results show that our proposed semi-supervised method can generate more preferred output.
In summary, our contributions are manifold:
• We design a semi-supervised model that crossly projects the latent spaces of two styles onto each other. In addition, this model is flexible in alternating the training mode between supervised and unsupervised.
• We introduce another two semi-supervised methods that are simple but effective in leveraging both the nonparallel and parallel data.
• We build a small-scale parallel dataset that contains ancient Chinese poem style and modern Chinese style sentences. We also collect two large nonparallel datasets of these styles.

Related Works
Recently, text style transfer has stimulated great interest among researchers in natural language processing, and some encouraging results have been obtained (Shen et al., 2017; Rao and Tetreault, 2018; Prabhumoye et al., 2018; Hu et al., 2017; Jin et al., 2019).
Unsupervised Learning Methods. In the early stage, due to the lack of parallel corpora, most methods employed the unsupervised learning paradigm to conduct semantic modeling and transfer, and some adopted adversarial training (Goodfellow et al., 2014) to improve the performance of the basic models (Shen et al., 2017). To sum up, most of the existing unsupervised frameworks for text style transfer focus on obtaining disentangled representations of style and content. However, Lample et al. (2019) illustrated that disentanglement is not adequate to learn style-independent representations, thus the quality of the transferred text is not guaranteed.
Supervised Learning Methods. Fortunately, Rao and Tetreault (2018) release a high-quality parallel dataset -GYAFC, which consists of two domains for Formality style transfer: Entertainment & Music and Family & Relationships. The quantity of the dataset is sufficient to train an attention-based sequence-to-sequence framework.
Nevertheless, building a parallel corpus that aligns the content meaning of different styles is costly and laborious, and it is impossible to build parallel corpora for all domains.


Problem Formulation

We consider two large nonparallel corpora and one small parallel corpus, whose sizes are denoted as |A|, |B| and |P|, respectively. As the nonparallel data is abundant but the parallel data is limited, |A|, |B| ≫ |P|. Formally, the two nonparallel datasets are denoted as A = {a_1, a_2, ..., a_{|A|}} and B = {b_1, b_2, ..., b_{|B|}}, where a_i and b_i refer to the i-th sentences of style s_a and style s_b, respectively. It should be clear that as A and B are nonparallel datasets, the sentences in the two datasets are not aligned with each other, which means that a_i and b_i are not required to have the same or similar content meaning even though they share the same subscript. The parallel dataset is denoted as P = {(a^p_1, b^p_1), ..., (a^p_{|P|}, b^p_{|P|})}, where (a^p_i, b^p_i) is the i-th pair of sentences that have the same content meaning but are expressed in styles s_a and s_b separately.
Since the limited parallel data is deficient for training neural network based models, our goal is to train a model in a semi-supervised way that leverages large volumes of nonparallel data to improve performance. Given a sentence of the source style as input, the model learns to generate a target style output which preserves the meaning of the input to the greatest extent. In this paper, the style transfer process can be exerted in two directions, from s_a to s_b as well as from s_b to s_a.

Basic Model Architecture
The text style transfer task could be interpreted as transforming a sequence of source style words to a sequence of target style words, which makes the sequence-to-sequence framework a suitable architecture. In this section, we describe the formulation of sequence-to-sequence (S2S) (Sutskever et al., 2014) model, which contains an encoder and a decoder, and our proposed models are built based on such an architecture.
Given an input, the encoder first converts it into an intermediate vector, then the decoder takes the intermediate representation as input to generate the target output. In this paper, we implement the encoder as a bi-directional Long Short-Term Memory (BiLSTM) (Hochreiter and Schmidhuber, 1997) network and the decoder as a one-layer LSTM.
Formally, given an input sentence x = {w^x_1, ..., w^x_T} of length T and a target sentence y = {w^y_1, ..., w^y_{T'}} of length T', where w_i is the embedding of the i-th word, the probability of generating the target sentence y given x is defined as:

p(y|x) = ∏_{t=1}^{T'} p(w^y_t | w^y_1, ..., w^y_{t-1}, c_x)  (1)

In the training process, the encoder first encodes x into an intermediate representation c_x. More specifically, at each time step t, the encoder produces a hidden state vector h_t = [h_t^fw; h_t^bw], where h_t^fw and h_t^bw represent the forward and backward hidden states, respectively, and [;] means concatenation. The intermediate vector c_x is formed by the concatenation of the last hidden states of the forward and backward directions. Then this vector is fed to the decoder to generate the target sentence step by step, as formulated in Equation 1.
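As a sketch of how Equation 1 decomposes into per-step terms, the following toy snippet sums per-token log-probabilities; the function and token names are illustrative, not from the paper:

```python
import math

def sequence_log_prob(step_distributions, target_tokens):
    """log p(y|x) = sum_t log p(w_t | w_<t, c_x).

    step_distributions: one dict per time step mapping token -> probability,
    as the decoder would produce conditioned on c_x and the previous tokens.
    """
    return sum(math.log(dist[tok])
               for dist, tok in zip(step_distributions, target_tokens))

# Toy example: a two-token target sentence.
dists = [{"the": 0.5, "a": 0.5}, {"cat": 0.25, "dog": 0.75}]
print(sequence_log_prob(dists, ["the", "dog"]))  # log(0.5) + log(0.75)
```

Training with the NLL loss simply negates this quantity and averages it over the batch.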
Such neural network based models always have a large number of parameters, which empowers them to fit complex datasets. However, when trained with limited data, the model may fit the training data so closely that it loses the power of generalization, and thus performs poorly on new data.

Cross Projection in Latent Space
In this paper, we propose a semi-supervised style transfer method named cross projection in latent space (CPLS) to leverage the large volumes of nonparallel data as well as the limited parallel data. Based on the S2S architecture, we consider the representations in the latent space and propose to establish projection relations between different latent spaces. To be specific, in the S2S architecture, the encoder's conversion of the input into an intermediate vector is a process of extracting the semantic information of the input, including both the style and the content information. Thus the intermediate vectors can be seen as compressed representations of the inputs in the latent space. Since texts in different styles lie in different latent spaces, we introduce a projection function to project the intermediate vector from one latent space to another. To combine the nonparallel data and parallel data in training, we design two kinds of constraints for the projection functions.
Concretely, we first train an auto-encoder for each style to learn the standard latent spaces of the style. After that, we train the projection functions which are exerted on latent vectors to establish projection relations between different latent spaces. We design a cross projection strategy and a cycle projection strategy to utilize the parallel data and nonparallel data. These two strategies are exerted iteratively in the training process, thus our model is compatible with the two types of training data. The following subsections give the details of the modules and the strategies of model training.

Denoising Auto-Encoder
To learn the latent space representations of styles, we train an auto-encoder, composed of an encoder and a decoder, for each style of text by reconstructing the input. The encoder takes a sentence as input and maps it to a latent vector representation, and the decoder reconstructs the sentence based on the vector. But a common pitfall is that the decoder may simply copy every token from the input, making the encoder lose the ability to extract features.
To alleviate this issue, we adopt the denoising auto-encoder (DAE) following previous work (Lample et al., 2019), which tries to reconstruct a corrupted input. Specifically, we randomly shuffle a small portion of words in the original sentence to form the corrupted input, which is then fed to the denoising auto-encoder. For example, if the original input is "Listen to your heart and your mind.", the corrupted version could be "Listen your to and heart your mind.". The decoder is required to reconstruct the original sentence. The training of a DAE model relies on the corpus of one style only, thus we train each DAE model using the nonparallel corpus of each style. Formally, the encoder and decoder of styles s_a and s_b are referred to as Enc^(a)-Dec^(a) and Enc^(b)-Dec^(b). Given sentences a and b from the two styles, the corresponding encoder takes the sentence as input and converts it to the latent vector c_a or c_b, respectively; each encoder thereby constructs the latent space for its corresponding style.
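The word-shuffling corruption can be sketched as follows; the bounded-displacement scheme and the window size k are assumptions (the text only states that a small portion of words is shuffled, and this jitter-based variant follows Lample et al., 2018):

```python
import random

def shuffle_noise(tokens, k=3, seed=None):
    """Corrupt a sentence for the DAE by locally shuffling words.

    Each token is re-sorted by (index + uniform jitter in [0, k)), so a
    token can move at most k positions from its original place.
    """
    rng = random.Random(seed)
    keys = [i + rng.uniform(0, k) for i in range(len(tokens))]
    return [tok for _, tok in sorted(zip(keys, tokens))]

words = "Listen to your heart and your mind .".split()
print(shuffle_noise(words, k=3, seed=0))
```

With k=0 the jitter vanishes and the sentence is returned unchanged, which gives a convenient sanity check.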

Cross Projection
To perform the style transfer task, we establish a projection relationship between the latent spaces and cross-link the encoder and the decoder of different styles. Take the transfer from style s_a to s_b for example. Given a sentence a that is required to be transferred to b, we cross-link Enc^(a) as the encoder and Dec^(b) as the decoder to form the style transfer model. However, since the DAE models for different styles are trained separately, their latent vector spaces are usually different. The latent vector produced by Enc^(a) is a representation in the latent space of style s_a, while Dec^(b) relies on information and features from the latent space of style s_b. Therefore, we introduce a projection function to project the latent vector from the latent space of s_a to that of s_b.
Concretely, after we get c_a from Enc^(a), a projection function f(·) is employed to project c_a from the latent space of style s_a to the latent space of s_b, denoted as \tilde{c}_b = f(c_a). Then the decoder Dec^(b) takes the projected vector \tilde{c}_b as input and generates a sentence \tilde{b} of style s_b based on the prediction probability, denoted as p_b(\tilde{b}|\tilde{c}_b).
It is worth noting that up to now we have only exploited the nonparallel corpus for the style transfer. Recall that our framework can employ both the parallel corpus and the nonparallel corpus for model training. With the parallel corpus, we design the cross projection strategy. When the input a is accompanied by a ground-truth b, we can get the standard latent representation of b by Enc^(b), denoted as c_b. In order to align the latent vectors from different spaces, we design constraints to train f(·) from two aspects: the distance between \tilde{c}_b and c_b in the latent space should be small, and the generated sentence \tilde{b} should be similar to b. We then define two losses as follows:

l^s_1 = ||\tilde{c}_b - c_b||
l^s_2 = -log p_b(b | \tilde{c}_b)
l^s = α · l^s_1 + β · l^s_2

where l^s_1 measures the Euclidean distance between \tilde{c}_b and c_b, and l^s_2 is the negative log-likelihood (NLL) loss given a as input and b as the ground-truth. α and β are hyper-parameters that control the weights of l^s_1 and l^s_2. Figure 1 shows the training process of the cross projection. Similarly, to transfer the style of text from s_b to s_a, we cross-link Enc^(b) and Dec^(a), and the projection function that projects c_b to the latent space of s_a is denoted as g(·).

Figure 2: The cycle projection process between the latent spaces of A and B when training the denoising auto-encoders. The encoders and decoders are the same as illustrated in Figure 1 and are omitted in this figure.
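A minimal numeric sketch of the combined supervised loss l^s, assuming a linear map for f(·) and a stand-in scalar for the decoder NLL (in the actual model both the projection and the decoder are learned jointly; all values here are toy numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                 # toy latent dimension
W_f = rng.normal(size=(d, d))         # assumed linear projection f: s_a -> s_b

def f(c_a):
    return W_f @ c_a

c_a = rng.normal(size=d)              # Enc_a(a): latent code of the input a
c_b = rng.normal(size=d)              # Enc_b(b): latent code of the reference b

c_b_tilde = f(c_a)                    # projected vector \tilde{c}_b
l_s1 = np.linalg.norm(c_b_tilde - c_b)   # Euclidean distance constraint
l_s2 = 1.7                            # stand-in for the NLL -log p_b(b | \tilde{c}_b)
alpha, beta = 1.0, 1.0                # loss weights (hyper-parameters)
l_s = alpha * l_s1 + beta * l_s2
print(l_s)
```

Gradient descent on l^s pulls the projected code toward the reference code while the NLL term keeps the decoded sentence close to the ground-truth.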

Cycle Projection Enhancement
Inspired by the concept of back-translation in machine translation (Sennrich et al., 2015; Lample et al., 2018; He et al., 2016), we design a cycle projection strategy to train the model on the nonparallel data. Given an input without a ground-truth, we train the projection functions f(·) and g(·) by projecting back and forth to reconstruct the input. The cycle projection process is shown in Figure 2.
Formally, for a sentence a in style s_a, after getting its latent representation c_a in its own latent space by Enc^(a), we first project it to the latent space of s_b by f(·) and get \tilde{c}_b = f(c_a). Then we exert g(·) to project \tilde{c}_b back to the latent space of style s_a, denoted as \tilde{c}_a = g(\tilde{c}_b). Finally, \tilde{c}_a is fed to Dec^(a) to produce an output \tilde{a}. Though the latent vector \tilde{c}_b has no reference to train against, the latent vector \tilde{c}_a and the output \tilde{a} can be trained by treating c_a and a as the references. The loss functions are formulated as:

\tilde{c}_a = g(f(Enc^(a)(a)))  (6)
l^c_1 = ||\tilde{c}_a - c_a||  (7)
l^c_2 = -log p(a | \tilde{c}_a)  (8)
l^c = α · l^c_1 + β · l^c_2  (9)
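The cycle constraint in Equations 6-7 can be illustrated with linear maps; here g is taken to be the exact inverse of f so that l^c_1 vanishes, which is the ideal the training objective pushes toward (in the model, g is learned, not computed by matrix inversion):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
W_f = rng.normal(size=(d, d))        # f: latent space of s_a -> s_b
W_g = np.linalg.inv(W_f)             # g: s_b -> s_a; a perfect inverse here only
                                     # for illustration

c_a = rng.normal(size=d)             # Enc_a(a)
c_b_tilde = W_f @ c_a                # \tilde{c}_b = f(c_a)
c_a_tilde = W_g @ c_b_tilde          # \tilde{c}_a = g(\tilde{c}_b)

l_c1 = np.linalg.norm(c_a_tilde - c_a)   # Equation 7
print(l_c1)                          # ~0, since g exactly inverts f
```

In training, l^c_1 is generally nonzero and its gradient, together with the reconstruction NLL l^c_2, drives f and g toward mutually inverse projections.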

Training Procedure
In the training stage, we first pretrain a DAE model for each style separately on the nonparallel data to obtain the latent spaces of the styles. Then we alternately apply the cross projection strategy on the parallel data with loss l^s and the cycle projection enhancement on the nonparallel data with loss l^c.
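The alternation described above can be sketched as a simple schedule; the strict 1:1 interleaving of cross and cycle steps is an assumption, as the text does not specify the exact ratio:

```python
def semi_supervised_schedule(parallel_batches, nonparallel_batches):
    """Alternate cross-projection (supervised, loss l_s) and cycle-projection
    (unsupervised, loss l_c) steps after DAE pretraining."""
    steps = []
    for p, n in zip(parallel_batches, nonparallel_batches):
        steps.append(("cross", p))   # update f, g with l_s on a parallel pair
        steps.append(("cycle", n))   # update f, g with l_c on nonparallel data
    return steps

print(semi_supervised_schedule(["p1", "p2"], ["n1", "n2"]))
```

In practice the nonparallel stream is much longer than the parallel one, so the parallel batches would typically be cycled repeatedly.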

Straightforward Semi-supervised Methods
We also introduce another two semi-supervised methods as the baselines to provide a more comprehensive comparison with the proposed CPLS model. The two semi-supervised methods are built from different perspectives.

Data Augmentation via Retrieval
One semi-supervised baseline approaches the problem from the perspective of data augmentation. In order to alleviate the over-fitting issue caused by the small-scale parallel data, we propose a simple but effective method, denoted as the DAR model, that augments the parallel dataset by retrieving pseudo-references from the nonparallel datasets.

Pseudo-parallel Corpus Construction.
We employ Lucene to build indexes for the nonparallel corpus of each style. The pseudo-references are then retrieved based on TF-IDF scores. To build the pseudo-parallel corpus of styles s_a and s_b, we first sample 80,000 sentences in style s_a as queries. Specifically, each query retrieves a pseudo-reference from the nonparallel corpus of style s_b according to the TF-IDF based cosine similarity. The sampled query is then coupled with the pseudo-reference to form a training pair. We also conduct the same operation using sampled queries from style s_b to retrieve pseudo-references from sentences of style s_a. After retrieving pseudo-references from both sides, we construct a pseudo-parallel corpus of 150,000 pairs. With the pseudo-parallel corpus, the model is exposed to more information and thus the problem of over-fitting can be mitigated to some extent. Though the relevance of the content between the input sentence and the pseudo-reference is not guaranteed, the encoder can better learn to extract language features and the decoder can also benefit from the weak contextual correlation information.
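The retrieval criterion can be sketched without Lucene as follows; the paper builds Lucene indexes, so this tiny pure-Python TF-IDF stand-in (with invented toy sentences) is only illustrative of the scoring:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Very small TF-IDF: tf(t, d) * log(N / df(t))."""
    df = Counter(t for d in docs for t in set(d.split()))
    n = len(docs)
    return [{t: c * math.log(n / df[t]) for t, c in Counter(d.split()).items()}
            for d in docs]

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy target-style corpus and a query sentence from the other style.
corpus_b = ["the moon over the river", "wind in the pines", "a cup of wine"]
query = "moon river"
vecs = tfidf_vectors(corpus_b + [query])
q = vecs[-1]
best = max(range(len(corpus_b)), key=lambda i: cosine(q, vecs[i]))
print(corpus_b[best])   # the retrieved pseudo-reference
```

Each sampled query is paired with its highest-scoring retrieval to form one pseudo-parallel training pair.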
Training Procedure. In the training stage, we first train the S2S model on the pseudo-parallel dataset and save a checkpoint every 2,000 steps. We then calculate the BLEU score of checkpoints on the validation set and select the one with the highest BLEU score as the final pretrained model. Then based on this pretrained model, we fine-tune the parameters using the true parallel data.

Shared Latent Space Model
The second semi-supervised baseline is similar to CPLS in that it first trains a DAE model on the corpus of each style. Instead of building a bridge between the latent spaces of the two styles through projection functions, this method simply shares the latent representations by cross-linking the encoder and decoder, and is denoted as the SLS model. Given a pair of sentences (a_p, b_p), the encoder Enc^(a) encodes a_p into a context vector c_a, then the decoder Dec^(b) directly takes c_a to produce \tilde{b}.
Training Procedure. For this method, we first pre-train the two DAE models. Then we train the DAE models and the cross-linked S2S models in the two transfer directions alternately. Considering the size imbalance between the nonparallel corpora and the parallel corpus, to avoid the S2S model falling into over-fitting too fast, we alternate the training in the form of 20 steps of DAE training followed by one step of S2S training.
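The 20:1 interleaving can be sketched as a step plan:

```python
def sls_schedule(total_steps, dae_per_s2s=20):
    """Emit 20 DAE steps for every cross-linked S2S step, keeping the S2S
    part from over-fitting the small parallel set too quickly."""
    return ["s2s" if (step + 1) % (dae_per_s2s + 1) == 0 else "dae"
            for step in range(total_steps)]

plan = sls_schedule(42)
print(plan.count("dae"), plan.count("s2s"))  # 40 dae steps, 2 s2s steps
```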

Experiment
We conduct experiments on two bilateral style transfer tasks, each of which has a small-scale parallel corpus and two large-scale nonparallel corpora of the two styles. The DAR model and the SLS model are treated as two baselines for the CPLS model. In addition, we also train an S2S model with attention mechanism on the parallel data as the vanilla baseline. The following subsections elaborate the construction of the datasets and the detailed experimental settings.

Datasets
We construct a Chinese literary style dataset with ancient poetry style and modern Chinese style, and an English formality style dataset with formal style and informal style. The texts of ancient Chinese poems, modern Chinese, formal English and informal English are referred to as Anc.P, M.zh, F.en and Inf.en for short. The overall statistics are shown in Table 1.
Chinese Literary Style Dataset. We consider texts of the ancient Chinese poem style and the modern Chinese style, which differ greatly in expression. As ancient poems from different dynasties vary in style and form, in this paper we focus on poems from the Tang dynasty.
To build the parallel dataset, we crawl the website Gushiwen, which provides ancient Chinese poems, some of which are coupled with modern Chinese interpretations in paragraphs. We split the collected paragraphs into independent sentences using punctuation based rules, and manually align each poem sentence with its interpretation sentence to form a parallel pair.
For the nonparallel corpus of ancient Chinese poem style, we collect all the poems from Quan Tangshi and split them into sentences. To build the nonparallel dataset in modern Chinese style, we collect lyrics from Chinese ballads. The reason we choose the Chinese ballad is that the content domains of the two styles should be close: considering that most of the poems are about natural scenery and sentiments, the most suitable literary form is the lyrics of ballads.

Table 2: Example outputs of the compared models (English glosses in parentheses).

to Anc.p
Source: 问客人为什么来，客人说为了上山砍伐树木来买斧头。 (Ask the guest why he came; the guest said he wanted to buy an axe in order to cut trees on the mountain.)
S2S: 客问谁客中，树。 (The guest asks who is in the guest, the tree.)
SLS: 问人何为人， (Ask people what people are,)
DAR: 客中何为客，山头为木头。 (What is the guest in the guest, the mountain head is the wooden head.)

to F.en
Source: give them a chance to discover you.
S2S: Share them an opportunity to meet you.
SLS: I give them a chance to discover you.
DAR: Give them a chance to discover you.
CPLS: You should give them a chance to discover you.

to Inf.en
Source: I think it is wrong that they cannot go out with her.
S2S: I think it is wrong that is wrong with her.
SLS: I think they can't go out with her.
DAR: I think it is wrong that they cant go out with her.
CPLS: It is wrong that they can't go out with her.

Formality Dataset. The formality dataset used in this paper is built based on the parallel dataset released by Rao and Tetreault (2018), which contains texts of formal and informal style.
From the released data, we randomly sample 5,000 sentence pairs as the parallel corpus with limited data volume. We then use the Yahoo Answers L6 corpus (https://webscope.sandbox.yahoo.com/catalog.php?datatype=1), which is in the same content domain as the parallel data, as the source for constructing the large-scale nonparallel data. To divide the nonparallel dataset into the two styles, we train a CNN-based classifier (Kim, 2014) on the parallel data with its style annotations and use it to classify the nonparallel data.

Experimental Settings
We perform different data preprocessing on the different datasets. The Chinese literary datasets are segmented into characters instead of words to alleviate the issue of unknown words. Our statistics show that the average length of ancient poem sentences is 9 while that of modern Chinese sentences is 17. Therefore, we set the minimum length of the ancient poem sentence to 3 and the maximum length of the modern Chinese sentence to 30. For the formality datasets, we use NLTK (Loper and Bird, 2002) to tokenize the texts and set the minimum length to 5 and the maximum length to 30 for both formal and informal styles.
We adopt GloVe (Pennington et al., 2014) to pretrain the embeddings, and the embedding dimension is set to 300 for all datasets. The hidden state size is set to 500 for both encoders and decoders. We adopt the SGD optimizer with a learning rate of 1 for the DAE models and 0.1 for the S2S models. The dropout rate is 0.4. In the inference stage, the beam size is set to 5.

Baselines
We train an S2S model with attention mechanism on the parallel data as the supervised learning baseline. Since the existing works on text style transfer seldom explore semi-supervised methods, we propose the DAR and SLS models as two semi-supervised baselines.

Automatic Evaluation Metrics
Following previous works (Prabhumoye et al., 2018; Zhang et al., 2018a), we employ the BLEU score (Papineni et al., 2002) and style accuracy as the automatic evaluation metrics to measure the degree of content preservation and the degree of style change. BLEU calculates the N-gram overlap between the generated sentence and the references, thus it can be used to measure the preservation of text content. Considering that text style transfer is a monolingual text generation task, we also use GLEU, a generalized BLEU proposed by Napoles et al. (2015). To evaluate the extent to which sentences are transferred to the target style, we follow Shen et al. (2017) and Hu et al. (2017) in building a CNN-based style classifier and use it to measure the style accuracy.
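The core quantity behind BLEU, clipped n-gram precision against a reference, can be sketched as follows (full BLEU additionally combines several n-gram orders with a brevity penalty; the example sentences are toy inputs):

```python
from collections import Counter

def modified_ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as
    often as it appears in the reference."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

cand = "give them a chance to meet you".split()
ref = "give them a chance to discover you".split()
print(modified_ngram_precision(cand, ref, 1))  # 6 of 7 unigrams match
```

Higher-order precisions drop faster under word substitutions, which is why BLEU penalizes the ancient-poem transfers, whose edit distance to the reference is large, more than the formality transfers.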

Human Evaluation
We also adopt human evaluation to judge the quality of the transferred sentences from three aspects, namely content, style and fluency. These aspects evaluate how well the transferred text preserves the content of the input, the style strength, and the fluency of the transferred text. Take the content relevance for example; the criterion is as follows:
+2: The transferred sentence has the same meaning as the input sentence.
+1: The transferred sentence preserves part of the content meaning of the input sentence.
0: The transferred sentence and the input sentence are irrelevant in content.
The criteria for style strength and fluency are similar to the content relevance criterion.
To get the evaluation results, we first randomly sample 50 test cases from each dataset. As the style transfer is bilateral in this paper, there are 400 test cases in all. We invited four well-educated volunteers to score the results from the supervised baseline and the three semi-supervised models.
Results and Analysis

Evaluation Results. Table 3 presents the automatic evaluation results of the models. It can be seen that the BLEU scores and GLEU scores of the semi-supervised models are better than those of the baseline S2S model on almost all the datasets. This result indicates that the models benefit from the nonparallel data in terms of content preservation. One interesting observation is that the overall BLEU scores on the ancient poem and modern Chinese datasets are lower than on the other datasets. This may be explained by the fact that the edit distance between formal and informal texts is smaller than that between ancient poems and modern Chinese texts. Therefore, it is more challenging for a model to preserve the content meaning when transferring between ancient poems and modern Chinese text. Among the three semi-supervised models, the CPLS model achieves the greatest improvement, verifying the effectiveness of the projection functions. However, the gain of the CPLS model in terms of style accuracy is not as significant. A possible explanation is the bias of the style classifier. Take the transfer task from ancient poems to modern Chinese text for example. We observe that the classifier tends to classify short sentences as ancient poems, as length is an obvious feature. We analyse the sentences generated by the S2S model and by the CPLS model, and the statistics show that the average length of the text generated by the S2S model is shorter, which may lead to the bias of the style classifier. Therefore, we also adopt human evaluation to alleviate this issue. Table 4 compares the human evaluation results of the S2S model and the CPLS model on all the datasets, calculated as the average score of the human annotations. As shown in Table 4, the CPLS model outperforms the S2S model in the aspects of content preservation and style strength, and is on par in terms of fluency.

Case Study. Due to limited space, we present the generated results of the S2S baseline and the three semi-supervised models on two style transfer tasks, as shown in Table 2. Compared with the Chinese literary style datasets, the formality datasets are less challenging, as discussed before. Thus it can be seen from the table that the generated sentences on the formality datasets are more fluent. For the task of transferring from modern Chinese text to ancient poems, the S2S model generates a shorter sentence while the CPLS model generates the sentence that preserves the most content information.

Conclusion
In this paper, we design a differentiable semi-supervised model that introduces a projection function between the latent spaces of different styles. The model is flexible in alternating the training mode between supervised and unsupervised learning. We also introduce another two semi-supervised methods that are simple but effective in using the nonparallel and parallel data. We evaluate our models on two datasets that have small-scale parallel data and large-scale nonparallel data, and verify the effectiveness of the model with both automatic metrics and human evaluation.