Unsupervised Text Style Transfer with Padded Masked Language Models

We propose Masker, an unsupervised text-editing method for style transfer. To tackle cases when no parallel source-target pairs are available, we train masked language models (MLMs) for both the source and the target domain. Then we find the text spans where the two models disagree the most in terms of likelihood. This allows us to identify the source tokens to delete to transform the source text to match the style of the target domain. The deleted tokens are replaced with the target MLM, and by using a padded MLM variant, we avoid having to predetermine the number of inserted tokens. Our experiments on sentence fusion and sentiment transfer demonstrate that Masker performs competitively in a fully unsupervised setting. Moreover, in low-resource settings, it improves supervised methods' accuracy by over 10 percentage points when pre-training them on silver training data generated by Masker.


Introduction
Text-editing methods (Dong et al., 2019; Malmi et al., 2019; Awasthi et al., 2019; Mallinson et al., 2020), which target monolingual sequence transduction tasks like sentence fusion, grammar correction, and text simplification, are typically more data-efficient than traditional sequence-to-sequence methods, but they still require substantial amounts of parallel training examples to work well. When parallel source-target training pairs are difficult to obtain, it is often still possible to collect non-parallel examples for the source and the target domain separately. For instance, negative and positive reviews can easily be collected based on the numerical review scores associated with them, which has led to a large body of work on unsupervised text style transfer, e.g., (Yang et al., 2018; Shen et al., 2017; Wu et al., 2019; Li et al., 2018).
The existing unsupervised style transfer methods aim at transforming a source text so that its style matches the target domain while its content stays otherwise unaltered. This is commonly achieved via text editing performed in two steps: using one model to identify the tokens to delete and another model to infill the deleted text slots (Li et al., 2018; Xu et al., 2018; Wu et al., 2019). In contrast, we propose a more unified approach, showing that both of these steps can be completed using a single model, namely a masked language model (MLM) (Devlin et al., 2019). An MLM is a natural choice for infilling the deleted text spans, but we can also use it to identify the tokens to delete, by finding the spans where MLMs trained on the source and the target domain disagree in terms of likelihood. This is inspired by the recent observation that MLMs are effective at estimating (pseudo) likelihoods of texts (Wang and Cho, 2019; Salazar et al., 2020). Moreover, by using a padded variant of MLM (Mallinson et al., 2020), we avoid having to separately model the length of the infilled text span.
To evaluate the proposed approach, MASKER, we apply it to two tasks: sentence fusion, which requires syntactic modifications, and sentiment transfer, which requires semantic modifications. In the former case, MASKER improves the accuracy of state-of-the-art text-editing models in low-resource settings by more than 10 percentage points, by providing silver data for pretraining, while in the latter, it yields competitive performance compared to existing unsupervised style-transfer methods.

Method
Our approach to unsupervised style transfer is to modify source texts to match the style of the target domain. To achieve this, we can typically keep most of the source tokens and only modify a fraction of them. To determine which tokens to edit and how to edit them, we propose the following three-step approach: (1) Train padded MLMs on source-domain data (Θ_source) and on target-domain data (Θ_target) (§2.1). (2) Find the text spans where the models disagree the most to determine the tokens to delete (§2.2). (3) Use Θ_target to replace the deleted spans with text that fits the target domain.
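To make the three steps concrete, here is a deliberately tiny, self-contained sketch. This is our own illustration, not the authors' code: the real Θ_source and Θ_target are padded BERT MLMs, and the real infilling comes from the target MLM's argmax prediction, whereas here both are replaced by trivial stand-ins.

```python
def train_toy_model(corpus):
    """Stand-in for MLM training (step 1): the 'likelihood' of a word is
    simply whether it occurs in the domain's corpus."""
    vocab = {w for text in corpus for w in text.split()}
    return lambda word: 1.0 if word in vocab else 0.1

def transfer(text, theta_source, theta_target, infill="great"):
    """Steps 2-3 in miniature: delete the word on which the two stand-in
    models disagree most, and insert a target-domain infill (fixed here;
    the real method asks Theta_target for the replacement)."""
    words = text.split()
    disagreement = [theta_source(w) - theta_target(w) for w in words]
    i = max(range(len(words)), key=disagreement.__getitem__)
    words[i] = infill
    return " ".join(words)
```

On a two-sentence toy corpus, the word unique to the source domain ("terrible") gets the highest disagreement and is the one edited.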

Padded Masked Language Models
The original MLM objective in BERT (Devlin et al., 2019) does not model the length of infilled token spans, since each [MASK] token corresponds to exactly one wordpiece token that needs to be predicted at a given position. To model the length, it is possible to use an autoregressive decoder or a separate model (Mansimov et al., 2019). Instead, we use the efficient non-autoregressive padded MLM approach of Mallinson et al. (2020), which enables BERT to predict [PAD] symbols when infilling a fixed-length span of n_p [MASK] tokens.
When creating training data for this model, spans of zero to n_p tokens, corresponding to whole word(s), are masked out, after which the mask sequences are padded to always contain n_p [MASK] tokens. For example, if n_p = 4 and we have randomly decided to mask out tokens from i to j = i + 2 (inclusive) from text W, the corresponding input sequence is:

W_{\i:j} = (w_1, ..., w_{i-1}, [MASK], [MASK], [MASK], [MASK], w_{j+1}, ..., w_n).

The targets for the first three [MASK] tokens are the original masked-out tokens, i.e., w_i, w_{i+1}, w_{i+2}, while for the remaining token the model is trained to output a special [PAD] token. Similar to Wang and Cho (2019) and Salazar et al. (2020), we can compute the pseudo-likelihood L of the original tokens W_{i:j} according to:

L(W_{i:j} | W_{\i:j}; Θ) = ∏_{t=i}^{i+n_p-1} P_MLM(* = w_t | W_{\i:j}; Θ),

where P_MLM(* = w_t | W_{\i:j}; Θ) denotes the probability of the random variable corresponding to the t-th token in W_{\i:j} taking the value w_t (the original token for i ≤ t ≤ j, and [PAD] for t > j). Furthermore, we can compute the maximum pseudo-likelihood infilled tokens

Ŵ_{i:j} = argmax_{W'_{i:j}} L(W'_{i:j} | W_{\i:j}; Θ)

by taking the most likely insertion for each [MASK] independently, as done by regular BERT. These maximum-likelihood estimates are used both when deciding which spans to edit (as described in §2.2) and when replacing the edited spans.
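The masking, the pseudo-likelihood product, and the padded decoding can be sketched as follows. This is a toy illustration under our reading of the formulas above: `toy_mlm` is a made-up stand-in for a trained padded MLM (its probabilities are arbitrary), and we work in log space for numerical convenience.

```python
import math

NP = 4  # maximum pad length n_p

def masked_input(tokens, i, j, np_=NP):
    """Replace tokens[i..j] (inclusive, 0-indexed) with np_ [MASK] slots."""
    return tokens[:i] + ["[MASK]"] * np_ + tokens[j + 1:]

def pseudo_log_likelihood(tokens, i, j, mlm, np_=NP):
    """log L(W_{i:j} | W_{\\i:j}; Theta): sum of per-slot log-probs of the
    original tokens, with [PAD] as the target for slots beyond the span."""
    ctx = masked_input(tokens, i, j, np_)
    targets = tokens[i:j + 1] + ["[PAD]"] * (np_ - (j - i + 1))
    return sum(math.log(mlm(ctx, i + k)[t]) for k, t in enumerate(targets))

def decode_infill(slot_predictions):
    """Drop [PAD] argmax predictions, so an infilling has 0..n_p tokens."""
    return [t for t in slot_predictions if t != "[PAD]"]

def toy_mlm(ctx, pos):
    """Stand-in for a trained padded MLM: returns a token distribution for
    the masked slot at `pos` (purely illustrative numbers)."""
    first_slot = pos > 0 and ctx[pos - 1] != "[MASK]"
    if first_slot:
        return {"terrible": 0.1, "great": 0.6, "[PAD]": 0.3}
    return {"terrible": 0.01, "great": 0.04, "[PAD]": 0.95}
```

For `tokens = ["the", "movie", "was", "terrible", "."]` and span (3, 3), the targets are `["terrible", "[PAD]", "[PAD]", "[PAD]"]`, so the pseudo-log-likelihood is log 0.1 + 3 log 0.95, and the argmax infilling under `toy_mlm` decodes to the single token `"great"`.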
In practice, instead of training two separate models for the source and the target domain, we train a single conditional model. Conditioning on a domain is achieved by prepending a special token ([SOURCE] or [TARGET]) to each token sequence fed to the model.[1] At inference time, the padded MLM can decide to insert zero tokens (by predicting [PAD] for each mask) or up to n_p tokens, based on the bidirectional context it observes. In our experiments, we set n_p = 4.[2]

Where to edit?
Our approach to using MLMs to determine where to delete and insert tokens is to find text spans where the source and target model disagree the most. Here we introduce a scoring function to quantify the level of disagreement.
First, we note that any span of source tokens that has a low likelihood in the target domain is a candidate to be replaced or deleted. That is, source tokens from index i to j should be the more likely to be deleted, the lower the likelihood L(W_{i:j} | W_{\i:j}; Θ_target) is. Moreover, if two spans have equally low likelihoods under the target model, but one of them has a higher maximum-likelihood replacement Ŵ_{i:j}^target, then it is safer to replace the latter. For example, if a sentiment transfer model encounters a polarized word of the wrong sentiment and an arbitrary phone number, it might evaluate both of them as unlikely. However, the model will be more confident about how to replace the polarized word, so it should try to replace that rather than the phone number. Thus the first component of our scoring function is:

TargetScore(i, j) = L(Ŵ_{i:j}^target | W_{\i:j}; Θ_target) − L(W_{i:j} | W_{\i:j}; Θ_target).

This function can be used on its own without having access to a source domain corpus, but in some cases this leads to undesired replacements. The target model can be very confident that, e.g., a rarely mentioned entity should be replaced with a more common entity, although this type of edit does not help with transferring the style of the source text toward the target domain.

[1] The motivation for using a joint model instead of two separate models is to share model weights, which gives more consistent likelihood estimates. An alternative way of conditioning the model would be to add a domain embedding to each token embedding, as proposed by Wu et al. (2019).
[2] In early experiments, we also tested n_p = 8, but this resulted in less grammatical predictions, since each token is predicted independently. To improve the predictions, we could use SpanBERT (Joshi et al., 2020), which is designed to infill spans, or an autoregressive model like T5 (Raffel et al., 2019).
To address this issue, we introduce a second scoring component leveraging the source domain MLM:

SourceScore(i, j) = min(0, L(W_{i:j} | W_{\i:j}; Θ_source) − L(Ŵ_{i:j}^target | W_{\i:j}; Θ_source)).

By adding this component to TargetScore(i, j), we can counteract edits that only increase the likelihood of a span under Θ_target but do not push the style closer to the target domain.[3] Our overall scoring function is given by:

Score(i, j) = TargetScore(i, j) + SourceScore(i, j).

To determine the span to edit, we compute argmax_{i,j} Score(i, j), where 1 ≤ i ≤ |W| + 1 and i − 1 ≤ j ≤ i + n_p − 1. The case j = i − 1 denotes an empty source span, meaning that the model does not delete any source tokens but only adds text before the i-th source token. The process for selecting the span to edit is illustrated in Figure 1, where the source text corresponds to two sentences to be fused. The source MLM has been trained on unfused sentences and the target MLM on fused sentences from the DiscoFuse corpus (Geva et al., 2019). In this example, the target model is confident that either the boundary between the two sentences or the grammatical mistake "in the France" should be edited. However, the source model is also confident that the grammatical mistake should be edited, so the model correctly ends up editing the words ". She" at the sentence boundary. The resulting fused sentence is: Marie Curie was born in Poland and died in the France .
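The scoring above can be sketched as follows. This is our own toy illustration, not the released implementation: the four pseudo-likelihoods per span are made-up numbers, and `span_likelihoods` is a hypothetical precomputed table (in the real method each value comes from a padded-MLM forward pass).

```python
def span_score(l_orig_tgt, l_repl_tgt, l_orig_src, l_repl_src):
    """Score(i,j) = TargetScore(i,j) + SourceScore(i,j), with SourceScore
    capped at zero so it can only penalize an edit, never reward it."""
    target_score = l_repl_tgt - l_orig_tgt
    source_score = min(0.0, l_orig_src - l_repl_src)
    return target_score + source_score

def best_span(span_likelihoods):
    """argmax_{i,j} Score(i,j) over the candidate spans."""
    return max(span_likelihoods, key=lambda s: span_score(*span_likelihoods[s]))
```

In the toy table below, span (3, 3) is a wrong-sentiment word: likely under the source model, with a replacement the source model dislikes, so SourceScore is capped at 0 and the high TargetScore survives. Span (7, 8) is a style-neutral rare entity that both models would happily replace, so SourceScore cancels out its TargetScore.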
Efficiency. The above method is computationally expensive, since producing a single edit requires O(|W| × n_p) BERT inference steps, although these can be run in parallel. The model can be distilled into a much more efficient supervised student model without losing, and even gaining, accuracy, as shown in our experiments. This is done by applying MASKER to the unaligned source and target examples to generate aligned silver data for training the student model.

[3] SourceScore(i, j) is capped at zero to prevent it from dominating the overall score. Otherwise, we might obtain low-quality edits in cases where the likelihood of the source span W_{i:j} is high under the source model and low under the target model, but no good replacements exist according to the target model. Given the lack of good replacements, Ŵ_{i:j}^target may end up being ungrammatical, pushing SourceScore(i, j) close to 1 and thus making it a likely edit, although TargetScore(i, j) remains low.
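The inference cost can be made concrete by enumerating the candidate spans from the constraints 1 ≤ i ≤ |W| + 1 and i − 1 ≤ j ≤ i + n_p − 1. This is a sketch under our reading of those constraints (we additionally clip j to the sentence length, which the paper's notation leaves implicit):

```python
def candidate_spans(n, np_=4):
    """Candidate edit spans (i, j), 1-indexed and inclusive; j = i - 1 is an
    empty span, i.e. a pure insertion before token i."""
    for i in range(1, n + 2):
        for j in range(i - 1, min(n, i + np_ - 1) + 1):
            yield i, j

# Scoring one edit needs a padded-MLM forward pass per span and per model
# (source and target): O(|W| * n_p) inferences in total. They are mutually
# independent, hence parallelizable, and distilling MASKER's outputs into a
# supervised student model removes this cost at serving time.
num_calls = 2 * len(list(candidate_spans(20)))
```

For a 20-token input with n_p = 4 this yields 95 candidate spans, i.e. 190 forward passes for a single edit, which motivates the distillation setup described above.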

Experiments
We evaluate MASKER on two different types of tasks: sentence fusion and sentiment transfer. For both experiments, we only apply MASKER once to edit a single span of at most four tokens, since the required edits are often local.

Sentence Fusion
Sentence fusion is the task of fusing two (or more) incoherent input sentences into a single coherent sentence or paragraph, and DiscoFuse (Geva et al., 2019) is a recent parallel dataset for sentence fusion. We study both a fully unsupervised setting as well as a low-resource setting.

Sentiment Transfer
In sentiment transfer, the task is to change a text's sentiment from negative to positive or vice versa. We use a dataset of Yelp reviews, with the goal of the modified reviews being of the target sentiment. We finetune the MLMs on the training set and apply the resulting MASKER model to the test set. Additionally, we apply the MASKER model to the non-parallel training set to create parallel silver data and train a LASERTAGGER model on it. Interestingly, the latter setup outperforms MASKER alone (15.3 vs. 14.5 BLEU score; 49.6 vs. 40.9 sentiment accuracy). We think this happens because LASERTAGGER employs a restricted vocabulary of the 500 most frequently inserted phrases, which prevents the model from reproducing every spurious infilling that the padded MLM may have produced, effectively regularizing MASKER. In Table 3, we report these results along with baseline methods developed specifically for the sentiment transfer task by Li et al. (2018).

Related Work
Section 1 provides a high-level overview of the related work. Closest to this work is the AC-MLM sentiment transfer method by Wu et al. (2019). This method first identifies the tokens to edit based on n-gram frequencies in the source vs. target domain (as proposed by Li et al. (2018)) and based on LSTM attention scores (as proposed by Xu et al. (2018)). Then it replaces the edited tokens using a conditional MLM. In contrast to their work, our approach leverages the same MLM for both identifying the (possibly empty) span of tokens to edit and for infilling the deleted span. Moreover, our padded MLM determines the number of tokens to insert without having to pre-specify it. In that sense, it is similar to the recently proposed Blank Language Model (Shen et al., 2020).
In addition to the two applications studied in this work, it would be interesting to evaluate MASKER on other style transfer tasks.

Conclusions
We have introduced a novel way of using masked language models for text-editing tasks where no parallel data is available. The method is based on training an MLM for the source and the target domain, identifying the tokens to delete by finding the spans where the two models disagree in terms of likelihood, and infilling more appropriate text with the target MLM. This approach yields competitive performance in fully unsupervised settings and substantially improves over previous work in low-resource settings.

A Examples of Model Outputs
To further illustrate how MASKER works, Table 7 shows all the input sequences and the output scores that go into computing Figure 1 in the main paper. Furthermore, Tables 5 and 6 present random samples of correct and incorrect outputs by MASKER for the DiscoFuse and Yelp datasets.

B Hyperparameter Settings
We did not perform any hyperparameter tuning, but used a fixed learning rate of 3e-5 and a batch size roughly proportional to the training set size (see Table 4 for the chosen values). The number of training steps was determined by running the training until convergence and choosing the checkpoint with the highest validation score, shown in Table 4.

C Other Experimental Details
Code. The padded MLM implementation is based on:

Padded MLM pretraining. The padded masked language model used in our experiments uses the uncased BERT-base architecture (Devlin et al., 2019) with 110M parameters. It is pretrained with a maximum pad length of n_p = 4 on the Wikipedia and books corpora that the original BERT was also trained on. When creating MLM finetuning data for the source and the target domain, we always mask out only a single span of zero to four input tokens, so that the masked span corresponds to whole word(s). The accuracy of the MLM at filling the masked span correctly is 44% for sentence fusion and 49% for sentiment transfer, as shown in Table 4.
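The finetuning-data construction just described can be sketched as follows. This is our own word-level illustration (the actual models operate on wordpieces, with the masked span aligned to whole words), and `make_finetuning_example` is a hypothetical helper name:

```python
import random

def make_finetuning_example(words, np_=4, rng=random):
    """Mask one span of 0..np_ whole words: the input always carries np_
    [MASK] slots, and the targets are padded with [PAD] to length np_."""
    length = rng.randint(0, np_)              # zero-length span = pure insertion
    i = rng.randint(0, len(words) - length)   # span start
    inputs = words[:i] + ["[MASK]"] * np_ + words[i + length:]
    targets = words[i:i + length] + ["[PAD]"] * (np_ - length)
    return inputs, targets
```

By construction, substituting the non-[PAD] targets back into the mask slots recovers the original sentence, which is the invariant the padded MLM is trained on.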
Computing infrastructure. The models were trained using Tensor Processing Units (TPUs). Inference was distributed to multiple CPUs using Apache Beam and Google Cloud.
Runtime. Inference time increases with the sequence length. For the example in Figure 1 of the main paper, prediction takes 52 seconds when running BERT inference on CPU. Using GPUs or TPUs can significantly reduce the runtime, but we chose to use CPUs to be able to distribute the computation more effectively.

Table 4: Hyperparameter settings for the proposed method in Tables 1 and 2, along with the Exact scores on the validation set. For the Padded MLM, the validation score refers to the accuracy of predicting all four masked tokens correctly.

Random Sample of Correct MASKER Predictions
Source: the boat was hoisted aboard the carpathia along with other titanic lifeboats . the boat was brought to new york .
Prediction: the boat was hoisted aboard the carpathia along with other titanic lifeboats and brought to new york .

Source: beausoleil was a good -looking and rebellious youth . by 15 , beausoleil was sent to reform school .
Prediction: beausoleil was a good -looking and rebellious youth . by 15 , he was sent to reform school .

Source: it is believed that in terms of antiquity , this temple pre-dates the srirangam temple , . the name aadi vellarai .
Prediction: it is believed that in terms of antiquity , this temple pre-dates the srirangam temple , hence the name aadi vellarai .

Source: john was in charge of the roads north of kapunda . ben had yorke peninsula and the southern routes .
Prediction: john was in charge of the roads north of kapunda , while ben had yorke peninsula and the southern routes .

Source: in early 2018 , the central bank re-released the l -qiaif regime . it could replicate the section 110 spv .
Prediction: in early 2018 , the central bank re-released the l -qiaif regime so that it could replicate the section 110 spv .

Source: he also set up trade schools . girls could earn their living .
Prediction: he also set up trade schools so that girls could earn their living .
Source: the prime minister supplied the reason why : she said it was . we depend for our liberty and freedom upon the united states .
Prediction: the prime minister supplied the reason why : she said it was because we depend for our liberty and freedom upon the united states .

Source: long -distance routes are listed here . those entirely or mostly within one vdot district are at the following pages .
Prediction: long -distance routes are listed here , while those entirely or mostly within one vdot district are at the following pages .

Random Sample of Incorrect MASKER Predictions
Source: she claimed that she did not consult the party . she believed the outcome had been assured .
Prediction: she claimed that she did not consult the party . however , she believed the outcome had been assured .
Target: she claimed that she did not consult the party because she believed the outcome had been assured .
Source: the 1927 german football championship was won by 1 . the 1927 german football championship is the 20th edition of the competition .
Prediction: the 1927 german football championship was won by 1 . the 1927 german football championship was the 20th edition of the competition .
Target: the 1927 german football championship , the 20th edition of the competition , was won by 1 .
Source: many other fourteeners surround elbert in all directions . it is very close to central colorado 's collegiate peaks .
Prediction: many other fourteeners surround elbert because it is very close to central colorado 's collegiate peaks .
Target: many other fourteeners surround elbert in all directions , and it is very close to central colorado 's collegiate peaks .
Source: the others want to return to centralville . davidson orders them back to fight the athsheans .
Prediction: the others want to return to centralville , but davidson orders them back to fight the athsheans .
Target: although the others want to return to centralville , davidson orders them back to fight the athsheans .
Source: in march 2015 , she suffered a brain aneurysm . in march 2015 , she is expected to fully recover after surgery in the summer of 2015 .
Prediction: in march 2015 , she suffered a brain aneurysm . however , she is expected to fully recover after surgery in the summer of 2015 .
Target: in march 2015 , she suffered a brain aneurysm but is expected to fully recover after surgery in the summer of 2015 .
Source: johann adam weishaupt died in gotha in 1830 . johann adam weishaupt is the founder of the order of the illuminati .
Prediction: johann adam weishaupt died in gotha in 1830 . johann weishashaupt is the founder of the order of the illuminati .
Target: johann adam weishaupt , the founder of the order of the illuminati , died in gotha in 1830 .
Source: ohio state senator . ohio state served a six -year term as a justice of the ohio supreme court .
Prediction: ohio state senator . ohio secretary of state served a six -year term as a justice of the ohio supreme court .
Target: ohio state senator and served a six -year term as a justice of the ohio supreme court .
Source: as minister he was responsible for this event . there was considerable controversy surrounding the disaster he did not step down .
Prediction: as minister he was responsible for this event . however , there was considerable controversy surrounding the disaster he did not step down .
Target: as minister he was responsible for this event , although there was considerable controversy surrounding the disaster he did not step down .

Source: the sandwich was not that great .
Prediction: the sandwich was great .

Source: its also not a very clean park .
Prediction: its also a very clean park .

Random Sample of Incorrect MASKER Predictions
Source: also , could they not bring a single pack of cheese or red peppers ?
Prediction: also , could they bring a single pack of cheese or red peppers ?
Target: they had plenty of cheese packets and red pepper.
Source: service was average but could not make up for the poor food and drink .
Prediction: service was good but could not make up for the poor food and drink .
Target: service was above average as well as the food and drink .
Source: the only saving grace was the black beans .
Prediction: the saving grace was the black beans .
Target: one of several saving graces was the black beans

Source: the rest of their food is edible but their employees and service are horrible .
Prediction: the rest of their food is edible and their employees and service are horrible .
Target: the food is great but the employees werent moving fast enough

Table 7: The masked inputs and the scores computed by MASKER for the example shown in Figure 1 of the main paper to find the best span to edit to fuse the two input sentences. The last four columns show the likelihoods L(W_{i:j} | W_{\i:j}; Θ) and L(Ŵ_{i:j}^target | W_{\i:j}; Θ) under the source and the target model.