Rethinking Text Attribute Transfer: A Lexical Analysis

Text attribute transfer modifies certain linguistic attributes (e.g. sentiment, style, authorship) of a sentence, transforming them from one type to another. In this paper, we aim to analyze and interpret what is changed during the transfer process. We start from the observation that in many existing models and datasets, certain words within a sentence play important roles in determining its attribute class. These words are referred to as pivot words. Based on these pivot words, we propose a lexical analysis framework, the Pivot Analysis, to quantitatively analyze the effects of these words in text attribute classification and transfer. We apply this framework to existing datasets and models and show that: (1) pivot words are strong features for the classification of sentence attributes; (2) to change the attribute of a sentence, many datasets only require changing certain pivot words; (3) consequently, many transfer models only perform lexical-level modification, while leaving higher-level sentence structures unchanged. Our work provides an in-depth understanding of linguistic attribute transfer and further identifies the future requirements and challenges of this task.


Introduction
The task of text attribute transfer (or text style transfer) is to transform certain linguistic attributes (sentiment, style, authorship, rhetorical devices, etc.) from one type to another (Ficler and Goldberg, 2017; Fu et al., 2018; Hu et al., 2017; Li et al., 2018; Shen et al., 2017). The state-of-the-art (SOTA) models have achieved inspiring transfer success rates (Zhao et al., 2018; Zhang et al., 2018; Prabhumoye et al., 2018; Yang et al., 2018). However, it is still unclear in the current literature what is transferred and what remains unchanged during the transfer process. To answer this question, we perform an in-depth investigation of linguistic attribute transfer datasets and models.

Figure 1: Examples of pivot words in sentiment transfer. Certain words are strongly correlated with the sentiment, such that a transfer model only needs to modify these words to accomplish the transfer task while leaving the higher-level sentence structure unchanged.
Our investigation starts from a simple observation: in many transfer datasets and models, certain class-related words play very important roles in attribute transfer (Li et al., 2018; Prabhumoye et al., 2018). Figure 1 gives a sentiment transfer example from the controllable generation (CG) model (Hu et al., 2017) on the Yelp dataset. In this example, rude is strongly related to the negative sentiment and good is strongly related to the positive sentiment; thus simply substituting rude with good transfers the sentence from negative to positive. In this work, we name these words the pivot words of a class. We use the term the pivot effect to refer to the phenomenon that certain strong words may determine the class of a sentence.
Based on the observation of the pivot effect, our research questions are: (1) which words are pivot words, and how do they influence the attribute class of a sentence in different datasets? (2) does a model only need to modify the pivot words to perform the attribute transfer, or does it change higher-level sentence compositionality such as syntax?
To answer question (1), we propose the pivot analysis, a series of simple yet effective text mining algorithms, to quantitatively examine the pivot effects in different datasets. The basics of the datasets we investigate are listed in Table 1. We first give the algorithm to extract pivot words (Sec 3). We statistically show that the stronger the pivot effect is on a dataset, the easier it is for a model to transfer its sentences. To further analyze the fine-grained distributional structure of these pivot words, we propose the precision-recall histogram to show to what extent the datasets may be influenced by their pivot words (Sec 4.2).
To answer question (2) and discover what is changed during the transfer process, we use the pivot words to analyze the transfer results of two SOTA models: the Controllable Generation (CG) model (Hu et al., 2017) and the Cross Alignment (CA) model (Shen et al., 2017). We show that although equipped with sophisticated modeling techniques, on many datasets these models tend to change only a few words, and most of these modified words are pivot words. When we mask out the modified words (to eliminate the lexical changes) and compare the Levenshtein string edit distance (Levenshtein, 1966) of the sentence stems before and after the transfer, we find that many of the sentence stems are identical (the distance of the masked sentences equals 0). This means that in transfer, the model only modifies a few pivot words while leaving the syntactic structure of the sentence unchanged (Sec 5).
To sum up, we show that: (1) in many datasets, words are important features in classification and transfer, but certain hard cases still require a higher-level understanding of the sentence structures; (2) SOTA models tend to perform the transfer at the lexical level, leaving the syntax of a sentence generally unchanged. The understanding and modification of higher-level sentence compositionality (syntax trees and dependency graphs) remains a challenging problem.

Background
Inspired by the image style transfer task (Gatys et al., 2016; Zhu et al., 2017), the goal of text attribute (style) transfer is to transfer the stylistic attributes of a sentence from one class to another while keeping the content of the sentence unchanged (Fu et al., 2018; Ficler and Goldberg, 2017; Hu et al., 2017). Because of the lack of parallel datasets, most models focus on unpaired transfer. Although plenty of sophisticated techniques are used in this task, such as adversarial learning (Zhao et al., 2018; Chen et al., 2018), latent representations (Li and Mandt, 2018; Dai et al., 2019; Liu et al., 2019), and reinforcement learning (Luo et al., 2019; Gong et al., 2019; Xu et al., 2018), there is little discussion about what is changed and what remains unchanged.
Because of this lack of transparency and interpretability, there has been some retrospection on this topic, such as on the definition of text style (Tikhonov and Yamshchikov, 2018) and the evaluation metrics (Li et al., 2018; Mir et al., 2019). Our proposed pivot analysis aligns with these works and provides a new tool to probe the transfer datasets and models. The de facto metric is to use a pretrained classifier to judge whether the transferred sentence is in the target class. So our pivot analysis starts from the classification task and mines the words with strong predictive performance.
While many previous works focus on one-to-one transfer, many recent works extend this task to one-to-many transfer (Logeswaran et al., 2018; Liao et al., 2018; Subramanian et al., 2019). For simplicity, we focus on the one-to-one setting, but it is easy to extend the pivot analysis to one-to-many transfer settings.

Pivot Words Discovery
To study the factors influencing attribute transfer, we start by mining words strongly correlated with the attribute class, i.e. pivot words. Algorithm 1 shows the procedure of mining pivot words. This algorithm is based on a simple intuition: if one single word is strong enough to determine the sentence attributes, then when we use the existence of this word to classify the attribute, we should achieve very high precision. Consider two extreme examples: when a word only exists in one class, it achieves 100% classification precision; when a word exists evenly in two different classes, its precision is 50%. The reason we use precision instead of recall or accuracy is that only precision reveals the influence of a single word: suppose the word "awesome" only exists in 100 positive sentences, and the whole dataset size is 100K. In this case, "awesome" will have low recall and accuracy, but high precision. The algorithm calculates the precision for each word-class pair and chooses pivot words with a predefined threshold p0.
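As an illustration, the precision computation above can be sketched in a few lines of Python. This is our own minimal re-implementation, not the authors' released code; it assumes pre-tokenized sentences and omits the class-balancing step described in Algorithm 1.

```python
from collections import Counter

def mine_pivot_words(sentences, labels, f0=10, p0=0.7):
    """Find words whose presence predicts a class with precision > p0.

    sentences: list of token lists; labels: list of 0/1 class labels.
    Returns {class: {word: precision}} for words occurring in >= f0 sentences.
    Note: the paper balances the dataset first; that step is omitted here.
    """
    # number of sentences containing each word (at most once per sentence)
    freq = Counter(w for s in sentences for w in set(s))
    # per class, how many sentences of that class contain the word
    per_class = {0: Counter(), 1: Counter()}
    for s, y in zip(sentences, labels):
        for w in set(s):
            per_class[y][w] += 1
    pivots = {0: {}, 1: {}}
    for w, n in freq.items():
        if n < f0:
            continue  # the frequency threshold prevents overfitting to rare words
        for y in (0, 1):
            # precision of the rule "w appears in s  =>  s has class y"
            precision = per_class[y][w] / n
            if precision > p0:
                pivots[y][w] = precision
    return pivots
```

With a lower `f0` and `p0` more words vote (better classification); raising both filters for the strongest pivots, matching the tuning advice in the text.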
For simplicity, we only consider binary classification in Algorithm 1, but one could easily extend it to multi-class settings. Also, we only consider unigrams (words), while it is also straightforward to extend it to n-grams. In practice, we find the unigram version performs quite well, as shown in Table 2. As for the parameters in the algorithm, the precision threshold p0 controls the confidence of a word being a pivot, and the occurrence threshold f0 prevents overfitting. We tune these parameters based on the classification performance on the validation set. Specifically, to get better classification performance, f0 and p0 should be lower to allow more votes (e.g. f0 ≤ 10, p0 ∈ [0.5, 0.7]). To get more confident and stronger pivot words, f0 and p0 should be higher (e.g. f0 ≥ 100, p0 ≥ 0.7).

Algorithm 2 The Pivot Classifier
Input: sentence s, the pivot words Ωy for class y ∈ {0, 1}
Output: the class y(s) of sentence s
1: procedure PIVOT CLASSIFICATION
2:   View s as a bag of words
3:   For each y ∈ {0, 1}, calculate sy = ||s ∩ Ωy||
4:   Predict the class of s to be y(s) = argmax_y {sy}; break ties randomly
5:   return y(s)
Figure 2 shows the mined pivot words in different datasets. For sentences that contain pivot words, these words are clearly strong features for classification. Intuitively, to transfer the class of these sentences, one could directly modify these words. But there are also cases that contain no pivot words, e.g. i will be back in the Yelp dataset. To modify the sentiment of these sentences, a model needs to understand a broader context and common sense. In general, the existence of pivot words gives us a way to understand which cases in attribute transfer are easier and which are more difficult.
The intuition that the existence of single words is enough to determine a linguistic attribute does not necessarily hold on all datasets. But empirically, we find that many transfer datasets tend to contain strong pivot words (Figure 5). One could compare our pivot analysis with other methods that mine word importance, such as the weights of a logistic classifier, or more sophisticated Bayesian methods like the log-odds ratio with an informative Dirichlet prior (Monroe et al., 2008). Our method is more straightforward and interpretable. We further develop this method into a simple yet strong classification baseline to indicate the transfer difficulty of different datasets, and use the pivot words as a tool to analyze, interpret, and visualize text attribute transfer models.

Analysing Datasets with Pivot Analysis
In this section, we use the pivot words to analyze the transfer datasets. We first reveal the mechanisms through which pivot words affect classification and transfer by using the pivot words as the classification boundary. Then we use the precision-recall histogram to demonstrate the distributional structure of the pivot words in different portions of the datasets.

The Pivot Classifier
Algorithm 2 gives a simple method to classify a sentence based on the pivot words output by Algorithm 1. This is essentially a voting-based classifier. It holds a strong independence assumption that the label of a sentence is only related to its bag of words, ignoring word order. That is to say, the decision boundary stays at the lexical level and does not reach the syntax level. The classifier counts the pivot words of each class contained in the sentence and predicts the label of the class with the largest pivot-word overlap. Intuitively, this algorithm classifies a sentence only based on the existence of strong attribute-related words.
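The voting step can be sketched as follows. This is a minimal illustration under the paper's bag-of-words assumption; the function and variable names are our own.

```python
import random

def pivot_classify(sentence, pivots):
    """Voting-based classification (a sketch of Algorithm 2): predict the
    class whose pivot-word set overlaps most with the sentence's bag of words.

    sentence: list of tokens; pivots: {class: set of pivot words}.
    """
    scores = {y: len(set(sentence) & set(words)) for y, words in pivots.items()}
    best = max(scores.values())
    # break ties randomly, as the paper specifies
    return random.choice([y for y, s in scores.items() if s == best])
```

For example, with `pivots = {0: {"rude", "bad"}, 1: {"good", "awesome"}}`, the sentence "the food was awesome" overlaps only with class 1's pivots and is classified as positive.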
The pivot classifier is a simple yet strong classification baseline, as shown in Table 2. We use it to study different datasets and compare it with (1) a logistic classifier and (2) a SOTA CNN classifier (Kim, 2014). We have balanced the test sets, so the random baseline is 50%. This voting-based classifier achieves performance comparable with the two models on 4 datasets (Amazon, Gender, Paper, Politics), and only loses by small margins on 2 datasets (Yelp, Caption). Although the independence assumption of our pivot classifier does not necessarily hold for all datasets, empirically it performs very well. This means that the pivot words are a meaningful approximation of the true decision boundary.
If the decision boundary of a linguistic attribute stays at the lexical level, then one could cross the boundary by simply substituting the pivot words of one class with those of another, thus achieving text class transfer. Intuitively, the more pivot words a dataset contains, the stronger the pivot effect, the easier for the pivot classifier to classify, and the easier the attribute is to transfer. This intuition is demonstrated in Figure 3: the pivot effect (shown by pivot classification accuracy) and the transfer difficulty (shown by the transfer success rate reported by previous models) have a strong and statistically significant positive correlation. The mechanism is illustrated in Figure 4. The stronger the pivot effect, the easier the transfer.
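The reported correlation in Figure 3 is a standard Pearson coefficient between per-dataset pivot classification accuracies and transfer success rates. As a sketch, the coefficient can be computed as follows (a generic helper, not the paper's evaluation script; the input numbers would be the per-dataset accuracy/success-rate pairs):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```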

The Precision-Recall Histogram
Now we go one step further to reveal how the pivot effect is distributed over different portions of the datasets. We propose a new tool, the precision-recall histogram, based on the results of Algorithms 1 and 2. As shown in Algorithm 3, this algorithm essentially uses pivot words with different levels of confidence (precision) to classify the dataset and outputs the recall. For better visualization, we set the precision interval gap to 0.1, but it is also possible to use smaller or larger gaps. It is also important to balance the dataset in Algorithm 1 so that the baseline precision is 0.5.
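A minimal sketch of this bucketed-recall computation, assuming the per-class word precisions from Algorithm 1 are available (our own simplified implementation; names and the bucket layout are ours):

```python
def precision_recall_histogram(sentences, labels, precisions, bins=None):
    """For each precision bucket, classify with only the pivot words in that
    bucket and report the fraction of class sentences covered (recall).

    sentences: list of token lists; labels: parallel class labels.
    precisions: {class: {word: precision}} from pivot-word discovery.
    Returns {class: list of (bucket_low, recall)}.
    """
    if bins is None:
        # 0.1-wide buckets as in the paper; upper edge 1.01 so precision 1.0 is included
        bins = [(0.5, 0.6), (0.6, 0.7), (0.7, 0.8), (0.8, 0.9), (0.9, 1.01)]
    hist = {}
    for y in precisions:
        n_class = sum(1 for lab in labels if lab == y)
        hist[y] = []
        for lo, hi in bins:
            bucket = {w for w, p in precisions[y].items() if lo <= p < hi}
            # recall: fraction of class-y sentences containing any bucket word
            hits = sum(1 for s, lab in zip(sentences, labels)
                       if lab == y and bucket & set(s))
            hist[y].append((lo, hits / max(n_class, 1)))
    return hist
```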
The histograms for all datasets give a fine-grained illustration of the pivot effect (Figure 5). We first look at the two baseline cases: a dataset with no pivot words, and a dataset full of pivots. If a dataset is full of pivots, i.e. the vocabularies of the two classes have no overlap, then all words have precision 1.0 and achieve 1.0 recall, so the right-most bars are the highest. If a dataset has no pivot words, i.e. all words are distributed evenly in the two classes, then all words have precision 0.5 and achieve 1.0 recall, so the left-most bars are the highest. The higher the right bars are, the stronger the pivot effect.
The histograms of the datasets fall between these two baseline cases.

Algorithm 3 The Precision-Recall Histogram
Input: the sentences S, the labels Y, the pivot words for each class Ωy, y ∈ Y, the precision matrix p(x, y), x ∈ V, y ∈ Y
Output: the precision-recall histogram
1: procedure THE PRECISION-RECALL HISTOGRAM
2:   for each precision range pair (pi, pi+1) ∈ [(0.5, 0.6), (0.6, 0.7), ..., (0.9, 1.0)] do
3:     For each class y, gather all pivot words with precision in the given range
4:     Use them to form a pivot classifier, classify the dataset S, and calculate the recall ri
5:     Store (pi, ri)
6:   return the list of (pi, ri)

Generally, we see two distinct distribution shapes. In the Yelp, Paper, Politics, Reddit, and Twitter datasets, the right-most bars are the highest, meaning that strong pivot words exist in a large portion of these datasets. These are close to the all-pivot baseline. Notably, in the Reddit and Twitter datasets, the pivot effect only exists in the impolite class, while in the other datasets it exists in both classes. This phenomenon cannot be discovered from the overall classification accuracy alone. After manual inspection, we find that since the attribute of these two datasets is politeness, the pivot words for the impolite class are common English swearwords. These words dominate the impolite sentences.
In the Caption, Gender, and Amazon datasets, we see decreasing bar heights from left to right, indicating a weaker pivot effect. The highest bars are at 0.5 precision, meaning that for each class, most sentences can only be classified with 0.5 precision (= random guessing). This is close to the no-pivot baseline. High-precision words still exist, but they cannot dominate the whole class. In conclusion, the precision-recall histograms give a structural examination of each class. The existence of pivots and their determining power differ from class to class and from dataset to dataset.

Analysing Transfer Models with Pivot Analysis
In this section, we aim to analyze what is changed and what remains unchanged in linguistic attribute transfer systems. We perform our experiments from two perspectives: lexical structures and syntactic structures. For the lexical structures, we show which words are modified by the transfer model. For the syntactic structures, we mask out the modified pivot words and compare the resulting sentence stems.
We use the two most common SOTA models: the Controllable Generation (CG) model from Hu et al. (2017) and the Cross Aligned Autoencoder (CA) model from Shen et al. (2017). The CG model uses a conditional VAE with a style discriminator and is trained with a wake-sleep algorithm. The CA model uses a cross-alignment mechanism to guide the transfer process. These two models are strong on many datasets compared to many other models. We direct the readers to the original papers for more details.
We test the models on three datasets: Yelp, Amazon, and Gender. The Yelp dataset is the most widely used benchmark in the text style transfer task. As shown in the previous sections, it exhibits strong pivot effects; there are many sentiment words in this dataset. The Amazon and Gender datasets show less pivot effect. So our experiments give a minimal cover of the different types of datasets. We use the released implementations for our experiments. All hyper-parameters follow the official instructions. Both models are trained until the simultaneous convergence of the reconstruction loss and the adversarial loss. We refer the readers to the implementation repositories for more details.

Table 5: Masked edit distance percentage distributions. For the CG model, in most cases (> 74%), the masked edit distance is 0, meaning that only a few words are changed while the sentence structures are exactly the same. For the CA model, a large portion of the sentence structures are still unchanged (> 37%).

Table 6: Edit distance after masking out the pivot words. In the CG model, only words are modified, while the higher-level sentence structures remain the same. The CA model tries to modify more sentence structures.

Lexical Structures
We show that the two models tend to modify only a few words in a given sentence, and a large portion of these words are pivot words. The results are shown in Tables 3 and 4. On the Yelp dataset, the CG model and the CA model only modify 1.66 and 1.61 words on average, and the portion of pivot words is 91% and 72% respectively. This means that on this dataset, both models focus on word substitution to change the sentence style. On the Amazon and Gender datasets, the models take different transfer strategies. The CG model concentrates on fewer words to modify (0.56 on Amazon and 0.79 on Gender), while the CA model tends to modify more words (3.54 on Amazon and 5.60 on Gender). Still, both models tend to modify the pivot words for class transfer. In general, a small portion of each sentence is modified (< 30% approximately), and a large portion of the modified words are pivots (> 60% approximately).
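A sketch of how such modified-word statistics could be computed from an (original, transferred) sentence pair. This is our own approximation: the paper does not specify its exact token alignment, so we use a longest-common-subsequence diff via `difflib`.

```python
import difflib

def modified_words(src_tokens, tgt_tokens):
    """Return the source-side tokens that a transfer model changed,
    using a longest-common-subsequence alignment of the token lists."""
    sm = difflib.SequenceMatcher(a=src_tokens, b=tgt_tokens)
    changed = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op != "equal":  # 'replace' or 'delete' regions on the source side
            changed.extend(src_tokens[i1:i2])
    return changed

def pivot_fraction(changed, pivot_words):
    """Fraction of the changed words that are pivot words (Table 4's statistic)."""
    if not changed:
        return 0.0
    return sum(1 for w in changed if w in pivot_words) / len(changed)
```

For "the staff was rude" → "the staff was good", only "rude" is changed, and it is a pivot word, so the pivot fraction is 1.0.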

Syntactic Structures
If we eliminate the lexical differences by masking out the modified words, what is changed in the resulting sentence stems? We use the Levenshtein string edit distance (Levenshtein, 1966) between the masked sentences as an approximation of the distance between syntactic structures. Figure 7 gives an example of masked sentences. One could also consider more sophisticated metrics that measure syntactic distance with parse trees (Shen et al., 2018; Zhang and Shasha, 1989). Here we use the string edit distance for simplicity; in practice, it is informative enough to demonstrate the change of sentence structures.
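The masked edit distance can be sketched as follows (our own illustration, not the paper's code: changed tokens on both sides are replaced with a mask symbol, then a token-level Levenshtein distance is computed over the resulting stems):

```python
import difflib

def levenshtein(a, b):
    """Token-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def masked_edit_distance(src_tokens, tgt_tokens, mask="<m>"):
    """Replace every changed token with a mask symbol on both sides,
    then measure how far apart the remaining sentence stems are.
    A distance of 0 means the syntactic frame is untouched."""
    sm = difflib.SequenceMatcher(a=src_tokens, b=tgt_tokens)
    masked_src, masked_tgt = [], []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            masked_src.extend(src_tokens[i1:i2])
            masked_tgt.extend(tgt_tokens[j1:j2])
        else:
            masked_src.extend([mask] * (i2 - i1))
            masked_tgt.extend([mask] * (j2 - j1))
    return levenshtein(masked_src, masked_tgt)
```

A pure word substitution ("rude" → "good") masks to identical stems and scores 0, while insertions or reorderings leave a positive residual distance.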
Table 6 shows the edit distances after masking the pivot words. We see clear differences between the two models. The CG model barely changes the sentence structures (average distances around 0.1): it takes the strategy of focusing on the substitution of pivot words. The CA model, in contrast, modifies not only the words but also a portion of the sentence structures. We see a moderate percentage of the sentence structure modified on the Yelp and Amazon datasets (about 30%), and a large syntactic modification (58%) on the Gender dataset. Compared with the CG model, the CA model modifies the sentences more radically.
To show a fine-grained distribution of the distances among different cases, we list the distribution statistics in Table 5. We see that for the CG model, in most cases (> 74%) the sentence stems are unchanged. For the CA model, although its average edit distance is larger, in a large portion of cases (> 37%) the distance is still 0. In conclusion, both models tend to retain the sentence structures in a large portion of the datasets.

Qualitative Analysis
Now we examine the transfer cases qualitatively in Figure 6. These are cases from the CG model on the three datasets, with the pivot words highlighted. When the model tries to change the class of a sentence, it first identifies the pivot words, then substitutes them with pivots from the other class. If we mask out the highlighted pivot words, the resulting sentence stems are the same, indicating that the syntactic structures remain unchanged. Although this is not always the case, the models tend to focus on words in a large portion of the datasets.

Discussion
Implications: Our pivot classifier reveals that, to a certain extent, in many transfer datasets the decision boundary stays at the lexical level. Consequently, to cross the boundary and transfer the text class, many instances in a dataset only require modifying certain pivot words. But there are still cases with no pivot words, where the decision boundary is higher than the word level. To transfer these cases, a model needs a deeper understanding of the sentence structures, which may include syntax, semantics, and common sense (Figure 2).

Considerations: In our experiments, we find that the two models are both quite unstable during training. The balance between the reconstruction loss and the adversarial loss significantly influences the convergence point. Our pivot analysis framework requires the model to converge to a meaningful local optimum with reasonable content preservation and transfer strength at the same time (Fu et al., 2018). For our pivot algorithms, it is important to balance the datasets (both training and testing) for a reasonable precision baseline (0.5). Our algorithm is most sensitive to the precision threshold p0, i.e. the confidence of how strongly a word acts as a pivot. We tune this parameter based on the development set performance.
Limitations: All of our pivot algorithms stay at the lexical level. These algorithms hold a strong independence assumption that the class of a sentence is independent of the order of words. So this method may not be able to capture certain linguistic phenomena, such as anastrophe. One could also consider an extreme example where the pivot analysis does not work: suppose we have a corpus of sentences all labeled 0, and we reverse all sentences and label the reversed sentences 1. In this dataset, both classes share the same vocabulary, and the precision of any word is 0.5. This is an example where only word order determines the class. Further, in our work we only consider lexical changes, and do not consider other issues such as a more rigorous definition of linguistic style (Tikhonov and Yamshchikov, 2018), the evaluation metrics (Mir et al., 2019), and causality in text classification (Wood-Doughty et al., 2018). These topics are future directions.

Conclusion
In this work, we present the Pivot Analysis, a lexical analysis framework for the examination and inspection of text style transfer datasets and models. This framework consists of three text mining algorithms: pivot word discovery, the pivot classifier, and the precision-recall histogram. With these algorithms, we reveal which words influence the class of a sentence, how these words are distributed in a dataset, the mechanisms through which these words interact with a transfer model, and how the models perform the transfer. Our method serves as a probe for the transparency and interpretability of the datasets and the transfer models. We show that a large portion of transfer cases stay at the lexical level, with the syntactic structures unchanged.
Since our methods stay at the lexical level, they have limitations in understanding higher-level sentence compositionality. These limitations are also shared by the SOTA transfer models: understanding the syntax and semantics (i.e. the structures of the sentence) and common sense (i.e. the background and implications of the surface words). These limitations are also directions for future work: we need better inductive biases and more powerful models for higher-level sentence compositionality.

Figure 2 :
Figure 2: The pivot words and sentence examples in three example datasets. The vocabulary of pivot words is large, so we only list typical words. Sentences without pivot words are intuitively harder to classify and transfer.

Algorithm 1 Pivot Words Discovery
Input: the vocabulary V, the sentences S and the labels Y, the frequency threshold f0, the precision threshold p0
Output: the pivot words Ωy for each class y ∈ {0, 1}; the word-class precision matrix p(x, y)
1: procedure PIVOT WORDS DISCOVERY
2:   Balance the dataset by down-sampling the majority class
3:   for each sentence s, each class y, and each word x in the vocabulary V with frequency higher than f0 do
4:     Consider the class of s to be y or 1 − y
5:     Use the existence of x to classify:
6:     if x is in s then
7:       Classify s as y
8:     else
9:       Classify s as 1 − y
10:   Calculate the classification precision p(x, y) of word x for label y over all sentences S
11:   if p(x, y) > p0 then
12:     x is a pivot word for class y, i.e. x ∈ Ωy
13: return Ωy, p(x, y)

Figure 3 :
Figure 3: Pivot classification accuracy vs. transfer success rate (correlation = 0.64, p-value = 0.003). The stronger the pivot effect is, the easier the transfer.

Figure 4 :
Figure 4: The mechanism of the pivot effect on classification and transfer.

Figure 5 :
Figure 5: The precision-recall histograms. The high right bars in the Yelp, Paper, Politics, Reddit, and Twitter datasets reveal the existence of strong pivot words. Each bar at location (x, y) should be interpreted as: if we use pivot words with precision x to classify the sentences, the recall will be y. The higher the right bars are, the more sentences can be classified accurately by words, the stronger the pivot effect, and the easier the transfer. The baseline cases where the dataset is full of / has no pivot words are shown on the left.

Figure 6 :
Figure 6: Transfer cases. Many of the transferred words are pivot words. The model tends to transfer only a few words while leaving the higher-level sentence structure unchanged.

Figure 7 :
Figure 7: An example of the masked sentences.Edit distance = 0 after masking.

Table 1 :
The text attribute transfer datasets we investigate.

Table 2 :
Classification accuracy. The voting-based pivot classifier is a strong baseline compared with the state-of-the-art CNN classifier, indicating that in many datasets, words are strong features for class labels.

Table 3 :
Average number of modified words and their percentage of the sentence length. The transfer models tend to modify only a few attribute-related words.

Table 4 :
Percentage of modified words that are pivot words.A large portion of the modified words are pivots.