Keep Calm and Switch On! Preserving Sentiment and Fluency in Semantic Text Exchange

In this paper, we present a novel method for measurably adjusting the semantics of text while preserving its sentiment and fluency, a task we call semantic text exchange. This is useful for text data augmentation and the semantic correction of text generated by chatbots and virtual assistants. We introduce a pipeline called SMERTI that combines entity replacement, similarity masking, and text infilling. We measure our pipeline’s success by its Semantic Text Exchange Score (STES): the ability to preserve the original text’s sentiment and fluency while adjusting semantic content. We propose to use masking (replacement) rate threshold as an adjustable parameter to control the amount of semantic change in the text. Our experiments demonstrate that SMERTI can outperform baseline models on Yelp reviews, Amazon reviews, and news headlines.


Introduction
There has been significant research on style transfer, with the goal of changing the style of text while preserving its semantic content. The alternative where semantics are adjusted while keeping style intact, which we call semantic text exchange (STE), has not been investigated to the best of our knowledge. Consider the following example, where the replacement entity defines the new semantic context:
Original Text: It is sunny outside! Ugh, that means I must wear sunscreen. I hate being sweaty and sticky all over.
Replacement Entity: weather = rainy
Desired Text: It is rainy outside! Ugh, that means I must bring an umbrella. I hate being wet and having to carry it around.
The weather within the original text is sunny, whereas the actual weather may be rainy. Not only is the word sunny replaced with rainy, but the rest of the text's content is changed while preserving its negative sentiment and fluency.
* Authors contributed equally
With the rise of natural language processing (NLP) has come an increased demand for massive amounts of text data. Manually collecting and scraping data requires a significant amount of time and effort, and data augmentation techniques for NLP are limited compared to fields such as computer vision. STE can be used for text data augmentation by producing various modifications of a piece of text that differ in semantic content.
Another use of STE is in building emotionally aligned chatbots and virtual assistants. This is useful for reasons such as marketing, overall enjoyment of interaction, and mental health therapy. However, due to limited data with emotional content in specific semantic contexts, the generated text may contain incorrect semantic content. STE can adjust text semantics (e.g. to align with reality or a specific task) while preserving emotions.
One specific example is the development of virtual assistants with adjustable socio-emotional personalities in the effort to construct assistive technologies for persons with cognitive disabilities. Adjusting the emotional delivery of text in subtle ways can have a strong effect on the adoption of the technologies (Robillard et al., 2018). It is challenging to transfer style this subtly due to the lack of datasets on specific topics with consistent emotions. Instead, large datasets of emotionally consistent interactions not confined to specific topics exist. Hence, it is effective to generate text with a particular emotion and then adjust its semantics.
We propose a pipeline called SMERTI (pronounced 'smarty') for STE. Combining entity replacement (ER), similarity masking (SM), and text infilling (TI), SMERTI can modify the semantic content of text. We define a metric called the Semantic Text Exchange Score (STES) that evaluates the overall ability of a model to perform STE, and an adjustable parameter, the masking (replacement) rate threshold (MRT/RRT), that can be used to control the amount of semantic change.
We evaluate on three datasets: Yelp and Amazon reviews (He and McAuley, 2016), and Kaggle news headlines (Misra, 2018). We implement three baseline models for comparison: Noun WordNet Semantic Text Exchange Model (NWN-STEM), General WordNet Semantic Text Exchange Model (GWN-STEM), and Word2Vec Semantic Text Exchange Model (W2V-STEM).
We illustrate the STE performance of two SMERTI variations on the datasets, demonstrating outperformance of the baselines and pipeline stability. We also run a human evaluation supporting our results. We analyze the results in detail and investigate relationships between the semantic change, fluency, sentiment, and MRT/RRT. Our major contributions can be summarized as:
• We define a new task called semantic text exchange (STE), of increasing importance in NLP applications, that modifies text semantics while preserving other aspects such as sentiment.
• We propose a pipeline SMERTI capable of multi-word entity replacement and text infilling, and demonstrate its outperformance of baselines.
• We define an evaluation metric for overall performance on semantic text exchange called the Semantic Text Exchange Score (STES).

Word and Sentence-level Embeddings
Word2Vec (Mikolov et al., 2013a,b) allows for analogy representation through vector arithmetic. We implement a baseline (W2V-STEM) using this technique. The Universal Sentence Encoder (USE) (Cer et al., 2018) encodes sentences and is trained on a variety of web sources and the Stanford Natural Language Inference corpus (Bowman et al., 2015). Flair embeddings (Akbik et al., 2018) are based on architectures such as BERT (Devlin et al., 2019). We use USE for SMERTI as it is designed for transfer learning and shows higher performance on textual similarity tasks compared to other models (Perone et al., 2018).

Text Infilling
Text infilling is the task of filling in missing parts of sentences called masks. MaskGAN (Fedus et al., 2018) is restricted to a single word per mask token, while SMERTI is capable of variable length infilling for more flexible output. Zhu et al. (2019) use a transformer-based architecture. They fill in random masks, while SMERTI fills in masks guided by semantic similarity, resulting in more natural infilling and fulfillment of the STE task.

Style and Sentiment Transfer
Notable works in style/sentiment transfer include Shen et al. (2017), Fu et al. (2018), Li et al. (2018), and Xu et al. (2018). They attempt to learn latent representations of various text aspects such as its context and attributes, or separate style from content and encode them into hidden representations. They then use an RNN decoder to generate a new sentence given a targeted sentiment attribute.

Review Generation
Hovy (2016) generates fake reviews from scratch using language models. Lipton et al. (2015), Dong et al. (2017), and Juuti et al. (2018) generate reviews from scratch given auxiliary information (e.g. the item category and star rating). Yao et al. (2017) generate reviews using RNNs with two components: generation from scratch and review customization (Algorithm 2 in Yao et al. (2017)). They define review customization as modifying the generated review to fit a new topic or context, such as from a Japanese restaurant to an Italian one. They condition on a keyword identifying the desired context, and replace similar nouns with others using WordNet (Miller, 1995). They require a "reference dataset" (required to be "on topic"; easy enough for restaurant reviews, but less so for arbitrary conversational agents). As noted by Juuti et al. (2018), the method of Yao et al. (2017) may also replace words independently of context. We implement their review customization algorithm (NWN-STEM) and a modified version (GWN-STEM) as baseline models.

Overview
The task is to transform a corpus $C$ of lines of text $S_i$ and associated replacement entities $RE_i$, $C = \{(S_1, RE_1), (S_2, RE_2), \ldots, (S_n, RE_n)\}$, to a modified corpus $\hat{C} = \{\hat{S}_1, \hat{S}_2, \ldots, \hat{S}_n\}$, where each $\hat{S}_i$ is the original text line $S_i$ with $RE_i$ substituted in and the overall semantics adjusted. SMERTI consists of the following modules, shown in Figure 1:
1. Entity Replacement Module (ERM): Identify which word(s) within the original text are best replaced with the RE, which we call the Original Entity (OE). We replace OE in S with RE. We call this modified text S′.
2. Similarity Masking Module (SMM): Identify words/phrases in S′ similar to OE and replace them with a [mask]. Group adjacent [mask]s into a single one so we can fill a variable length of text into each. We call this masked text S″.
3. Text Infilling Module (TIM): Fill in [mask] tokens with text that better suits the RE. This will modify semantics in the rest of the text. This final output text is called Ŝ.

Entity Replacement Module (ERM)
For entity replacement, we use a combination of the Universal Sentence Encoder (Cer et al., 2018) and Stanford Parser (Chen and Manning, 2014).

Stanford Parser
The Stanford Parser is a constituency parser that determines the grammatical structure of sentences, including phrases and part-of-speech (POS) labelling. By feeding our RE through the parser, we are able to determine its parse tree. Iterating through the parse tree and its sub-trees, we can obtain a list of constituent tags for the RE. We then feed our input text S through the parser, and through a similar process, we can obtain a list of leaves (where leaves under a single label are concatenated) that are equal or similar to any of the RE constituent tags. This generates a list of entities that have the same (or similar) grammatical structure as the RE and are likely candidates for the OE. We then feed these entities along with the RE into the Universal Sentence Encoder (USE).

Universal Sentence Encoder (USE)
The USE is a sentence-level embedding model that comes with a deep averaging network (DAN) and transformer model (Cer et al., 2018). We choose the transformer model as these embeddings take context into account, and the exact same word/phrase will have a different embedding depending on its context and surrounding words.
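We compute the semantic similarity between two embeddings u and v, sim(u, v), using the angular (cosine) distance, defined as:
$$\mathrm{sim}(u, v) = 1 - \frac{1}{\pi} \arccos\left(\frac{u \cdot v}{\|u\| \, \|v\|}\right) \quad (1)$$
Results are in [0, 1], with higher values representing greater similarity.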
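The angular similarity described above can be sketched in a few lines. This is a minimal illustration, assuming the arccos-based form commonly used with USE embeddings (one minus the angle between the vectors divided by π); the function name `angular_similarity` and the plain-list vectors are illustrative, not part of the original pipeline.

```python
import math


def angular_similarity(u, v):
    """Angular similarity between two embedding vectors, in [0, 1].

    Assumed form: 1 - arccos(cosine(u, v)) / pi.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    # Clamp to guard against tiny floating-point overshoot outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return 1.0 - math.acos(cosine) / math.pi
```

Identical directions give a similarity of 1, orthogonal vectors give 0.5, and opposite vectors give 0, which matches the stated [0, 1] range.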
Using USE and the above equation, we can identify words/phrases within the input text S which are most similar to RE. To assist with this, we use the Stanford Parser as described above to obtain a list of candidate entities. In the rare case that this list is empty, we feed each word of S into USE, and identify which word is the most similar to RE. We then replace the most similar entity or word (OE) with the RE and generate S′.
An example of this entity replacement process is in Figure 2. As seen in Figure 2(d), we calculate semantic similarities between RE and entities within S which have noun constituency tags. Looking at the row for our RE restaurant, the most similar entity (excluding itself) is hotel. We can then generate: S′ = i love this restaurant ! the beds are comfortable and the service is great !

Similarity Masking Module (SMM)
Next, we mask words similar to OE to generate S″ using USE. We look at semantic similarities between every word in S and OE, along with semantic similarities between OE and the candidate entities determined in the previous ERM step to broaden the range of phrases our module can mask. We ignore RE, OE, and any entities or phrases containing OE (for example, 'this hotel').
After determining words similar to the OE (discussed below), we replace each of them with a [mask]. We set a base similarity threshold (ST) that selects a subset of words to mask. We compare the actual fraction of masked words to the masking rate threshold (MRT), as defined by the user, and increase ST in intervals of 0.05 until the actual masking rate falls below the MRT. Some sample masked outputs (S″) using various MRT-ST combinations for the previous example are shown in Table 1 (more examples in Appendix A).
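The threshold search and mask grouping described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the base threshold of 0.3, the function names, and the similarity scores in the usage example are hypothetical, and the MRT is treated as an inclusive upper bound on the masked fraction.

```python
MASK = "[mask]"


def choose_threshold(similarities, mrt, base_st=0.3, step=0.05):
    """Raise ST in fixed steps until the masked fraction is within the MRT.

    similarities: one similarity-to-OE score per token (use 0.0 for tokens
    that must never be masked, such as the RE itself).
    base_st is a hypothetical starting value, not taken from the paper.
    """
    if not 0.0 <= mrt <= 1.0:
        raise ValueError("mrt must be in [0, 1]")
    n = len(similarities)
    k = 0
    while True:
        st = round(base_st + k * step, 10)  # avoid accumulated float error
        masked = [i for i, s in enumerate(similarities) if s >= st]
        if n == 0 or len(masked) / n <= mrt:
            return st, masked
        k += 1


def apply_masks(tokens, masked_indices):
    """Replace masked tokens with [mask], merging adjacent masks into one."""
    masked = set(masked_indices)
    out = []
    for i, tok in enumerate(tokens):
        if i in masked:
            if not out or out[-1] != MASK:
                out.append(MASK)
        else:
            out.append(tok)
    return out


tokens = "i love this restaurant ! the beds are comfortable and the service is great !".split()
sims = [0.1] * len(tokens)
sims[6], sims[8], sims[11] = 0.6, 0.5, 0.45  # made-up scores: beds, comfortable, service
st, idx = choose_threshold(sims, mrt=0.2)
print(" ".join(apply_masks(tokens, idx)))
```

Lowering `mrt` in this sketch forces a higher ST and therefore fewer masks, which mirrors the role of the MRT as a control on how much of the text is open to change.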
The MRT is similar to the temperature parameter used to control the "novelty" of generated text in works such as Yao et al. (2017). A high MRT means the user wants to generate text very semantically dissimilar to the original, and may be desired in cases such as creating a lively chatbot or correcting text that is heavily incorrect semantically. A low MRT means the user wants to generate text semantically similar to the original, and may be desired in cases such as text recovery, grammar correction, or correcting a minor semantic error in text. By varying the MRT, various pieces of text that differ semantically in subtle ways can be generated, assisting greatly with text data augmentation. The MRT also affects sentiment and fluency, as we show in Section 6.5.

Bidirectional RNN with Attention
We use a bidirectional variant of the GRU (Cho et al., 2014), and hence two RNNs for the encoder: one reads the input sequence in standard sequential order, and the other is fed this sequence in reverse. The outputs are summed at each time step, giving us the ability to encode information from both past and future context.
The decoder generates the output in a sequential token-by-token manner. To combat information loss, we implement the attention mechanism (Bahdanau et al., 2015). We use a Luong attention layer (Luong et al., 2015) which uses global attention, where all the encoder's hidden states are considered, and use the decoder's current time-step hidden state to calculate attention weights. We use the dot score function for attention, where $h_t$ is the current target decoder state and $h_s$ is all encoder states:
$$\mathrm{score}(h_t, h_s) = h_t^\top h_s$$

Transformer
Our second model makes use of the transformer architecture, and our implementation replicates Vaswani et al. (2017). We use an encoder-decoder structure with a multi-head self-attention token decoder to condition on information from both past and future context. It maps a query and set of key-value pairs to an output. The queries and keys are of dimension $d_k$, and values of dimension $d_v$. To compute the attention, we pack a set of queries, keys, and values into matrices $Q$, $K$, and $V$, respectively. The matrix of outputs is computed as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
Multi-head attention allows the model to jointly attend to information from different positions. The decoder can make use of both local and global semantic information while filling in each [mask].
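To make the attention computation concrete, the following is a small pure-Python sketch of scaled dot-product attention as defined by Vaswani et al. (2017). It uses nested lists instead of tensors and is only meant to illustrate the arithmetic; the function names are illustrative, and a real model would use a tensor library and multiple heads.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for row-major lists of vectors."""
    d_k = len(K[0])
    d_v = len(V[0])
    output = []
    for q in Q:
        # Dot product of the query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)])
    return output
```

Dropping the division by the square root of `d_k` and skipping the softmax leaves the raw dot score used by the Luong attention layer in the RNN model.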

Datasets
We train our two TIMs on the three datasets. The Amazon dataset (He and McAuley, 2016) contains over 83 million user reviews on products, with duplicate reviews removed. The Yelp dataset includes over six million user reviews on businesses. The news headlines dataset from Kaggle contains approximately 200,000 news headlines from 2012 to 2018 obtained from HuffPost (Misra, 2018).
We filter the text to obtain reviews and headlines which are English, do not contain hyperlinks and other obvious noise, and are less than 20 words long. We found that many texts longer than twenty words ramble on and are too verbose for our purposes. Rather than filtering by individual sentences, we keep each text in its entirety so SMERTI can learn to generate multiple sentences at once. We preprocess the text by lowercasing and removing rare/duplicate punctuation and space.
For Amazon and Yelp, we treat reviews greater than three stars as containing positive sentiment, equal to three stars as neutral, and less than three stars as negative. For each training and testing set, we include an equal number of randomly selected positive and negative reviews, and half as many neutral reviews. This is because neutral reviews only occupy one out of five stars compared to positive and negative which occupy two each. Our dataset statistics can be found in Appendix B.

Experiment Details
To set up our training and testing data for text infilling, we mask the text. We use a tiered masking approach: for each dataset, we randomly mask 15% of the words in one-third of the lines, 30% of the words in another one-third, and 45% in the remaining one-third. These masked texts serve as the inputs, while the original texts serve as the ground-truth. This allows our TIM models to learn relationships between masked words and relationships between masked and unmasked words.
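The tiered masking scheme can be sketched as below. This is a hedged illustration: the names `mask_line` and `build_infilling_pairs`, the rounding rule, the minimum of one mask per line, and the choice not to merge adjacent masks are all assumptions, since the text only specifies the three masking rates and the one-third split.

```python
import random

MASK = "[mask]"
TIER_RATES = (0.15, 0.30, 0.45)


def mask_line(words, rate, rng):
    """Mask a fraction of the words at random positions."""
    # Assumption: round to the nearest count and always mask at least one word.
    n_mask = max(1, round(rate * len(words)))
    positions = set(rng.sample(range(len(words)), n_mask))
    return [MASK if i in positions else w for i, w in enumerate(words)]


def build_infilling_pairs(lines, seed=0):
    """Return (masked input, ground-truth) pairs with three masking tiers.

    Lines are assumed to be non-empty. After shuffling, the first third is
    masked at 15%, the second at 30%, and the last at 45%.
    """
    rng = random.Random(seed)
    order = list(range(len(lines)))
    rng.shuffle(order)
    pairs = []
    for rank, idx in enumerate(order):
        rate = TIER_RATES[rank * 3 // len(lines)]
        words = lines[idx].split()
        pairs.append((" ".join(mask_line(words, rate, rng)), lines[idx]))
    return pairs
```

Passing a fixed `seed` keeps the sampled mask positions reproducible across runs, which is convenient when comparing infilling models on identical inputs.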
The bidirectional RNN decoder fills in blanks one by one, with the objective of minimizing the cross entropy loss between its output and the ground-truth. We use a hidden size of 500, two layers for the encoder and decoder, teacher-forcing ratio of 1.0, learning rate of 0.0001, dropout of 0.1, batch size of 64, and train for up to 40 epochs.
For the transformer, we use scaled dot-product attention and the same hyperparameters as Vaswani et al. (2017). We use the Adam optimizer (Kingma and Ba, 2014) with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$. As in Vaswani et al. (2017), we increase the learning rate linearly for the first warmup_steps training steps, and then decrease the learning rate proportionally to the inverse square root of the step number. We set factor = 1 and use warmup_steps = 2000. We use a batch size of 4096, and we train for up to 40 epochs.
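The learning-rate schedule described above can be written out as a short function. This sketch follows the formula from Vaswani et al. (2017) with a multiplicative `factor`; the default `d_model=512` is an assumption taken from the base configuration of that paper, since the text only says the same hyperparameters are used.

```python
def transformer_learning_rate(step, d_model=512, factor=1.0, warmup_steps=2000):
    """Linear warmup followed by inverse-square-root decay.

    lrate = factor * d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    d_model=512 is assumed from the base model of Vaswani et al. (2017).
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The two branches of the `min` meet at `step == warmup_steps`, so the rate peaks there and decays afterwards.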

Baseline Models
We implement three models to benchmark against. First is NWN-STEM (Algorithm 2 from Yao et al. (2017)). We use the training sets as the "reference review sets" to extract similar nouns to the RE (using MIN_sim = 0.1). We then replace nouns in the text similar to the RE with nouns extracted from the associated reference review set.
Secondly, we modify NWN-STEM to work for verbs and adjectives, and call this GWN-STEM. From the reference review sets, we extract similar nouns, verbs, and adjectives to the RE (using MIN_sim = 0.1), where the RE is now not restricted to being a noun. We replace nouns, verbs, and adjectives in the text similar to the RE with those extracted from the associated reference review set.
Lastly, we implement W2V-STEM using Gensim (Řehůřek and Sojka, 2010). We train uni-gram Word2Vec models for single word REs, and four-gram models for phrases. Models are trained on the training sets. We use cosine similarity to determine the most similar word/phrase in the input text to RE, which is the replaced OE. For all other words/phrases, we calculate $w'_i = w_i - w_{OE} + w_{RE}$, where $w_i$ is the original word/phrase's embedding vector, $w_{OE}$ is the OE's, $w_{RE}$ is the RE's, and $w'_i$ is the resulting embedding vector. The replacement word/phrase is the nearest neighbour of $w'_i$. We use similarity thresholds to adjust replacement rates (RR) and produce text under various replacement rate thresholds (RRT).
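The vector arithmetic behind W2V-STEM can be illustrated with a toy example. The sketch below does not use Gensim; the three-dimensional vectors are made up for illustration, the function names are hypothetical, and excluding the original word and the OE from the candidate set is an assumed design choice rather than something stated in the text.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def shift_word(word, oe, re, vectors):
    """Return the nearest neighbour of w_word - w_OE + w_RE."""
    target = [wi - wo + wr for wi, wo, wr in zip(vectors[word], vectors[oe], vectors[re])]
    # Assumption: the original word and the OE are not valid replacements.
    candidates = [w for w in vectors if w not in (word, oe)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Made-up toy embeddings, for illustration only.
toy_vectors = {
    "hotel": [1.0, 0.0, 0.0],
    "restaurant": [0.0, 1.0, 0.0],
    "beds": [1.0, 0.0, 1.0],
    "tables": [0.0, 1.0, 1.0],
    "service": [0.5, 0.5, 0.2],
}
print(shift_word("beds", oe="hotel", re="restaurant", vectors=toy_vectors))
```

Because every word is shifted by the same offset regardless of its neighbours, this style of replacement ignores sentence context, which is consistent with the fluency issues reported for W2V-STEM.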

Evaluation Setup
We manually select 10 nouns, 10 verbs, 10 adjectives, and 5 phrases from the top 10% most frequent words/phrases in each test set as our evaluation REs. We filter the verbs and adjectives through a list of sentiment words (Hu and Liu, 2004) to ensure we do not choose REs that would obviously significantly alter the text's sentiment. For each evaluation RE, we choose one hundred lines from the corresponding test set that do not already contain the RE. We choose lines with at least five words, as many with fewer carry little semantic meaning (e.g. 'Great!', 'It is okay'). For Amazon and Yelp, we choose 50 positive and 50 negative lines per RE. We repeat this process three times, resulting in three sets of 1000 lines per dataset per POS (excluding phrases), and three sets of 500 lines per dataset for phrases. Our final results are averaged metrics over these three sets.
For SMERTI-Transformer, SMERTI-RNN, and W2V-STEM, we generate four outputs per text for MRT/RRT of 20%, 40%, 60%, and 80%, which represent upper bounds on the percentage of the input that can be masked and/or replaced. Note that NWN-STEM and GWN-STEM can only evaluate on limited POS and their maximum replacement rates are limited. We select MIN_sim values of 0.075 and 0 for nouns and 0.1 and 0 for verbs, as these result in replacement rates approximately equal to the actual MR/RR of the other models' outputs for 20% and 40% MRT/RRT, respectively.

Key Evaluation Metrics
Fluency (SLOR): We use syntactic log-odds ratio (SLOR) (Kann et al., 2018) for sentence level fluency and modify from their word-level formula to character-level (SLOR_c). We use Flair perplexity values from a language model trained on the One Billion Words corpus (Chelba et al., 2013):
$$\mathrm{SLOR}_c(S) = \frac{1}{|S|} \ln\big(p_M(S)\big) - \frac{\sum_{w \in S} \ln\big(p_M(w)\big)}{\sum_{w \in S} |w|} \quad (2)$$
$$= -\ln(PPL_S) + \frac{\sum_{w \in S} |w| \ln(PPL_w)}{\sum_{w \in S} |w|} \quad (3)$$
where $|S|$ and $|w|$ are the character lengths of the input text $S$ and the word $w$, respectively, $p_M(S)$ and $p_M(w)$ are the probabilities of $S$ and $w$ under the language model $M$, respectively, and $PPL_S$ and $PPL_w$ are the character-level perplexities of $S$ and $w$, respectively. SLOR (from hereon we refer to character-level SLOR as simply SLOR) measures aspects of text fluency such as grammaticality. Higher values represent higher fluency.
We rescale resulting SLOR values to the interval [0,1] by first fitting and normalizing a Gaussian distribution. We then truncate normalized data points outside [-3,3], which shifts approximately 0.69% of total data. Finally, we divide each data point by six and add 0.5 to each result.
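The fluency computation and rescaling can be sketched as follows. This is an illustration under assumptions: it takes character-level perplexities as given inputs (rather than querying a Flair language model), uses the perplexity form of character-level SLOR, and interprets "fitting and normalizing a Gaussian" as standardizing with the sample mean and population standard deviation. The function names are illustrative.

```python
import math
import statistics


def slor_char(ppl_sentence, word_ppls):
    """Character-level SLOR from perplexities.

    word_ppls: list of (word, character-level perplexity) pairs.
    Computes -ln(PPL_S) + sum(|w| * ln(PPL_w)) / sum(|w|).
    """
    total_chars = sum(len(w) for w, _ in word_ppls)
    weighted = sum(len(w) * math.log(ppl) for w, ppl in word_ppls)
    return -math.log(ppl_sentence) + weighted / total_chars


def rescale_slor(values):
    """Map raw SLOR values into [0, 1]: z-score, clip to [-3, 3], z / 6 + 0.5."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0.0:
        return [0.5 for _ in values]
    rescaled = []
    for x in values:
        z = max(-3.0, min(3.0, (x - mean) / std))
        rescaled.append(z / 6.0 + 0.5)
    return rescaled
```

Under this rescaling a text with average fluency maps to 0.5, and anything three or more standard deviations from the mean saturates at 0 or 1.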
Sentiment Preservation Accuracy (SPA) is defined as the percentage of outputs that carry the same sentiment as the input.
We use VADER (Hutto and Gilbert, 2014) to evaluate sentiment as positive, negative, or neutral. It handles typos, emojis, and other aspects of online text.
Content Similarity Score (CSS) ranges from 0 to 1 and indicates the semantic similarity between generated text and the RE. A value closer to 1 indicates stronger semantic exchange, as the output is closer in semantic content to the RE. We also use the USE for this due to its design and strong performance as previously mentioned.

Semantic Text Exchange Score (STES)
We come up with a single score to evaluate overall performance of a model on STE that combines the key evaluation metrics. It uses the harmonic mean, similar to the F1 score (or F-score) (Chinchor, 1992; Rijsbergen, 1979), and we call it the Semantic Text Exchange Score (STES):
$$\mathrm{STES} = \frac{3 \cdot A \cdot B \cdot C}{A \cdot B + A \cdot C + B \cdot C} \quad (4)$$
where A is SPA, B is SLOR, and C is CSS. STES ranges between 0 and 1, with scores closer to 1 representing higher overall performance. Like the F1 score, STES penalizes models which perform very poorly in one or more metrics, and favors balanced models achieving strong results in all three.
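As a quick illustration of how the harmonic mean behaves, the sketch below computes STES from the three component scores. The function name and the convention of returning 0 when any component is 0 are assumptions made for this example.

```python
def stes(spa, slor, css):
    """Harmonic mean of SPA, SLOR, and CSS, each expected in [0, 1]."""
    # Assumption: a zero in any component gives an overall score of zero.
    if min(spa, slor, css) <= 0.0:
        return 0.0
    return 3.0 * spa * slor * css / (spa * slor + spa * css + slor * css)


balanced = stes(0.6, 0.6, 0.6)
lopsided = stes(0.9, 0.8, 0.1)
print(balanced, lopsided)
```

Both inputs have the same arithmetic mean of 0.6, but the lopsided one scores far lower, which is the penalty on a single weak metric that the harmonic mean is meant to provide.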

Automatic Evaluation Results
Table 2 shows overall average results by model.⁸ Table 3 shows outputs for a Yelp example.⁹ As observed from Table 3 (see also Appendix F), SMERTI is able to generate high quality output text similar to the RE while flowing better than other models' outputs. It can replace entire phrases and sentences due to its variable length infilling. Note that for nouns, the outputs from GWN-STEM and NWN-STEM are equivalent.¹⁰

Human Evaluation Setup
We conduct a human evaluation with eight participants (six male and two female) who are affiliated project researchers aged 20-39 at the University of Waterloo; the authors are not part of the human evaluation. We randomly choose one evaluation line for a randomly selected word or phrase for each POS per dataset. The input text and each model's output (for 40% MRT/RRT, chosen as a good middle ground) for each line are presented to participants, resulting in a total of 54 pieces of text, and rated on the following criteria from 1-5:
• RE Match: "How related is the entire text to the concept of [X]", where [X] is a word or phrase (1 - not at all related, 3 - somewhat related, 5 - very related). Note here that [X] is a given RE.
• Fluency: "Does the text make sense and flow well?" (1 - not at all, 3 - somewhat, 5 - very)
• Sentiment: "How do you think the author of the text was feeling?" (1 - very negative, 3 - neutral, 5 - very positive)
Each participant evaluates every piece of text. They are presented with a single piece of text at a time, with the order of models, POS, and datasets completely randomized.
⁸ See Appendix E for tables and graphs of detailed results broken down by POS, dataset, and MRT/RRT.
⁹ See Appendix F for many more example outputs from each model for various POS and datasets.
¹⁰ See Appendix C for explanations.

Human Evaluation Results
Average human evaluation scores are displayed in Table 4. Sentiment Preservation (between 0 and 1) is calculated by comparing the average Sentiment rating for each model's output text to the Sentiment rating of the input text: if both are less than 2.5 (negative), both are between 2.5 and 3.5 inclusive (neutral), or both are greater than 3.5 (positive), this is counted as a valid case of Sentiment Preservation. We repeat this for every evaluation line to calculate the final values per model. Harmonic means of all three metrics (using rescaled 0-1 values of RE Match and Fluency) are also displayed.
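The bucketing rule for human Sentiment Preservation can be expressed directly in code. The sketch below assumes average ratings are already computed per evaluation line; the function names are illustrative.

```python
def sentiment_bucket(rating):
    """Map an average 1-5 Sentiment rating to a coarse label."""
    if rating < 2.5:
        return "negative"
    if rating <= 3.5:
        return "neutral"
    return "positive"


def sentiment_preservation(input_ratings, output_ratings):
    """Fraction of lines whose input and output fall in the same bucket."""
    if len(input_ratings) != len(output_ratings) or not input_ratings:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(
        sentiment_bucket(a) == sentiment_bucket(b)
        for a, b in zip(input_ratings, output_ratings)
    )
    return matches / len(input_ratings)
```

Note that the boundaries 2.5 and 3.5 both count as neutral, following the "inclusive" wording above.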

Performance by Model
As seen in Table 2, both SMERTI variations achieve higher STES and outperform the other models overall, with the WordNet models performing the worst. SMERTI excels especially on fluency and content similarity. The transformer variation achieves slightly higher SLOR, while the RNN variation achieves slightly higher CSS. The WordNet models perform strongest in sentiment preservation (SPA), likely because they modify little of the text and only verbs and nouns. They achieve by far the lowest CSS, likely in part due to this limited text replacement. They also do not account for context, and many words (e.g. proper nouns) do not exist in WordNet. Overall, the WordNet models are not very effective at STE.
Table 3: Example outputs for a Yelp input, by model and MRT/RRT.
Input text: great food , large portions ! my family and i really enjoyed our saturday morning breakfast .
Replacement entity: pizza
SMERTI-Transformer
  20%: great pizza , large slices ! my family and i really enjoyed our saturday morning lunch .
  40%, 60%: great pizza , large slices ! service was terrific and i really enjoyed our saturday morning lunch .
  80%: great pizza , chewy crust ! nice ambiance and i really enjoyed it .
SMERTI-RNN
  20%: great pizza , large delivery ! my family and i really enjoyed our saturday morning place .
  40%, 60%: great pizza , large delivery ! good beer and i really enjoyed our saturday morning place .
  80%: great pizza , amazing pizza ! reasonable and i really enjoyed everyone .
W2V-STEM
  20%: great pizza , large portions ! my family and i really enjoyed our saturday morning breakfast .
  40%: great pizza , large slices ! my family dough i crust enjoyed our saturday morning breakfast .
  60%: awesome pizza , large slices ! my mom dough i crust enjoyed our saturday morning bagel .
  80%: awesome pizza , slices slices ! my mom dough we crust liked our sunday morning bagel .
GWN / NWN-STEM
  20%: great food , large stuff ! my family and i really enjoyed our saturday i breakfast .
  40%: great food , large stuff ! my i and i really enjoyed our saturday i breakfast .

W2V-STEM achieves the lowest SLOR, especially for higher RRT, as supported by the example in Table 3 (see also Appendix F). W2V-STEM and WordNet models output grammatically incorrect text that flows poorly. In many cases, words are repeated multiple times. We analyze the average Type Token Ratio (TTR) values of each model's outputs, which is the ratio of unique words to total words. As shown in Table 5, the SMERTI variations achieve the highest TTR, while W2V-STEM and NWN-STEM achieve the lowest. Note that while W2V-STEM achieves lower CSS than SMERTI, it performs comparably in this aspect. This is likely due to its vector arithmetic operations algorithm, which replaces each word with one more similar to the RE. This is also supported by the lower TTR, as W2V-STEM frequently outputs the same words multiple times.
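The Type Token Ratio is simple to compute. The sketch below assumes whitespace tokenization and counts punctuation marks as tokens, which may differ from the exact tokenization used for Table 5; the function name is illustrative.

```python
def type_token_ratio(text):
    """Ratio of unique tokens to total tokens, using whitespace tokenization."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


w2v_output = ("awesome pizza , slices slices ! my mom dough we crust "
              "liked our sunday morning bagel .")
print(type_token_ratio(w2v_output))
```

A repeated word such as "slices slices" lowers the ratio, so lower TTR values point to more repetition in a model's outputs.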

Performance By Model -Human Results
As seen in Table 4, the SMERTI variations outperform all baseline models overall, particularly in RE Match. SMERTI-Transformer performs the best, with SMERTI-RNN second. The WordNet models achieve high Sentiment Preservation, but much lower on RE Match. W2V-STEM achieves These results correspond well with our automatic evaluation results in Table 2. We look at the Pearson correlation values between RE Match, Fluency, and Sentiment Preservation with CSS, SLOR, and SPA, respectively. These are 0.9952, 0.9327, and 0.8768, respectively, demonstrating that our automatic metrics are highly effective and correspond well with human ratings.
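For reference, the Pearson correlation used to compare human and automatic metrics can be computed as below. This is a generic sketch with a hypothetical function name; the per-model scores themselves are in the paper's tables and are not reproduced here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0.0 or std_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (std_x * std_y)
```

In this setting `xs` would hold one human score per model (for example RE Match) and `ys` the matching automatic score (for example CSS).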

SMERTI's Performance By POS
As seen from Table 6, SMERTI's SPA values are highest for nouns, likely because they typically carry little sentiment, and lowest for adjectives, likely because they typically carry the most.
SLOR is lowest for adjectives and highest for phrases and nouns. Adjectives typically carry less semantic meaning and SMERTI likely has more trouble figuring out how best to infill the text. In contrast, nouns typically carry more, and phrases the most (since they consist of multiple words).
SMERTI's CSS is highest for phrases then nouns, likely due to phrases and nouns carrying Overall, SMERTI appears to be more effective on nouns and phrases than verbs and adjectives.

SMERTI's Performance By Dataset
As seen in Table 7, SMERTI's SPA is lowest for news headlines. Amazon and Yelp reviews naturally carry stronger sentiment, likely making it easier to generate text with similar sentiment.
Both SMERTI's and the input text's SLOR appear to be lower for Yelp reviews. This may be due to many reasons, such as more typos and emojis within the original reviews. SMERTI's CSS values are slightly higher for news headlines. This may be due to them typically being shorter and carrying more semantic meaning as they are designed to be attention grabbers.
Overall, it seems that using datasets which inherently carry more sentiment will lead to better sentiment preservation. Further, the quality of the dataset's original text, unsurprisingly, influences the ability of SMERTI to generate fluent text.

SMERTI's Performance By MRT/RRT
From Table 8, it can be seen that as MRT/RRT increases, SMERTI's SPA and SLOR decrease while CSS increases. These relationships are very strong as supported by the Pearson correlation values of -0.9972, -0.9183, and 0.9078, respectively. When SMERTI can alter more text, it has the opportunity to replace more text related to sentiment while producing more text that is semantically similar to the RE.
Further, SMERTI generates more of the text itself, becoming less similar to the human-written input, resulting in lower fluency. To further demonstrate this, we look at average SMERTI BLEU (Papineni et al., 2002) scores against MRT/RRT, shown in Table 8. BLEU generally indicates how close two pieces of text are in content and structure, with higher values indicating greater similarity. We report our final BLEU scores as the average scores of 1 to 4-grams. As expected, BLEU decreases as MRT/RRT increases, and this relationship is very strong as supported by the Pearson correlation value of -0.9960.
It is clear that MRT/RRT represents a trade-off between CSS against SPA and SLOR.It is thus an adjustable parameter that can be used to control the generated text, and balance semantic exchange against fluency and sentiment preservation.

Conclusion and Future Work
We introduced the task of semantic text exchange (STE), demonstrated that our pipeline SMERTI performs well on STE, and proposed an STES metric for evaluating overall STE performance. SMERTI outperformed other models and was the most balanced overall. We also showed a trade-off between semantic exchange against fluency and sentiment preservation, which can be controlled by the masking (replacement) rate threshold.
Potential directions for future work include adding specific methods to control sentiment, and fine-tuning SMERTI for preservation of persona or personality. Experimenting with other text infilling models (e.g. fine-tuning BERT (Devlin et al., 2019)) is also an area of exploration. Lastly, our human evaluation is limited in size and a larger and more diverse participant pool is needed.
We conclude by addressing potential ethical misuses of STE, including assisting in the generation of spam and fake reviews/news. These risks come with any intelligent chatbot work, but we feel that the benefits, including usage in the detection of misuse such as fake news, greatly outweigh the risks and help progress NLP and AI research.
Appendices for "Keep Calm and Switch On! Preserving Sentiment and Fluency in Semantic Text Exchange" Steven Y. Feng * Aaron W. Li * David R. Cheriton School of Computer Science University of Waterloo Waterloo, Ontario, Canada {sy2feng, w89li, jhoey}@uwaterloo.ca

A Masked Output Examples
Table 1¹ includes various example masked outputs by the ERM and SMM modules. We illustrate a variety of word-to-word, phrase-to-word, word-to-phrase, and phrase-to-phrase entity replacements and similarity masking.

B Dataset Statistics
See Table 2 for our training and testing splits by dataset and sentiment.

C Details of Baseline Implementations

C.1 NWN-STEM
This follows Algorithm 2 in Yao et al. (2017). For each dataset and each RE (note that this model is restricted to noun REs), we go through the dataset's training set (which acts as the "reference review set") and extract a list of text lines that contain the RE (where the RE acts as the "topic keyword"). For each of these lines, with the help of the Stanford Parser, we extract all single-word nouns, and for each of them (which we call noun_i), we check if MIN_sim(noun_i, RE) > 0.1. If so, we add them to the list of nouns similar to the RE, which we call sim_nouns.
For each evaluation line and associated RE, we extract all singular nouns within the text that are similar to the RE. For our evaluation purposes, we choose two MIN_sim values of 0.075 and 0 to produce two outputs per input. These two MIN_sim values result in actual replacement rates similar to the actual masking/replacement rates of other models (SMERTI and W2V-STEM) for MRT/RRT of 20% and 40%, respectively. Each similar noun is replaced with the noun in sim_nouns that is most similar to it to produce the output text.
¹ Tables and figures mentioned in this appendices document refer to the tables and figures here.
When analyzing every word during the above procedure, we take the first default WordNet synset's definition of the word (e.g., "bill" will default to "a statute in draft before it becomes law"). Further, if a word does not exist in WordNet, we use WordNet's Morphy to find if a lexically similar word exists that may just be in a different form (e.g., "photos" does not exist in WordNet, but using Morphy, we can find "photograph") and use the first default synset of this resulting word.
Note that since the number of nouns per input is limited, the actual replacement rates have an upper limit, and this is why we only generate two outputs per input to compare to outputs of other models for 20% and 40% MRT/RRT.
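The two stages described above (building the list of similar nouns, then replacing nouns in an evaluation line) can be sketched as follows. This is a minimal illustration, not the actual baseline code: the noun extractor and the similarity function are passed in as parameters, and the toy noun set and similarity table below are hypothetical stand-ins for the Stanford Parser and WordNet similarity. We also assume a strict ">" comparison in both stages.

```python
def build_sim_nouns(reference_lines, re_word, extract_nouns, similarity, min_sim=0.1):
    """Collect nouns from reference lines containing the RE that are similar to the RE."""
    sim_nouns = []
    for line in reference_lines:
        if re_word not in line.split():
            continue  # only lines that contain the RE are used
        for noun in extract_nouns(line):
            if noun not in sim_nouns and similarity(noun, re_word) > min_sim:
                sim_nouns.append(noun)
    return sim_nouns


def replace_similar_nouns(text, re_word, sim_nouns, extract_nouns, similarity, min_sim):
    """Replace each noun similar to the RE with its most similar entry in sim_nouns."""
    nouns = set(extract_nouns(text))
    output = []
    for token in text.split():
        if sim_nouns and token in nouns and similarity(token, re_word) > min_sim:
            output.append(max(sim_nouns, key=lambda cand: similarity(cand, token)))
        else:
            output.append(token)
    return " ".join(output)


# Hypothetical toy resources (assumptions for illustration only).
TOY_NOUNS = {"hotel", "beds", "restaurant", "food", "menu"}
TOY_SIM = {
    frozenset(("hotel", "restaurant")): 0.5,
    frozenset(("beds", "restaurant")): 0.05,
    frozenset(("food", "restaurant")): 0.6,
    frozenset(("menu", "restaurant")): 0.4,
    frozenset(("beds", "menu")): 0.2,
}


def toy_extract_nouns(line):
    return [tok for tok in line.split() if tok in TOY_NOUNS]


def toy_similarity(a, b):
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


reference = ["the restaurant had great food and a long menu", "the hotel was quiet"]
sim_nouns = build_sim_nouns(reference, "restaurant", toy_extract_nouns, toy_similarity)

text = "i love this hotel the beds are comfortable"
out_strict = replace_similar_nouns(
    text, "restaurant", sim_nouns, toy_extract_nouns, toy_similarity, 0.075)
out_loose = replace_similar_nouns(
    text, "restaurant", sim_nouns, toy_extract_nouns, toy_similarity, 0.0)
```

In this toy setup, lowering the threshold from 0.075 to 0 causes one more noun to be replaced, which mirrors how the two MIN_sim values yield two different replacement rates.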

C.2 GWN-STEM
This is a modification of NWN-STEM to handle verbs and adjectives (note that WordNet only works for single words). In addition to nouns, we also extract verbs and adjectives similar to the RE. This results in three lists: sim_nouns, sim_verbs, and sim_adjs, where we still use a MIN_sim value of 0.1 for determining the three lists. Further, we not only replace nouns in the original text, but also verbs and adjectives. Note that we use the same synset and Morphy procedure as for NWN-STEM. For our evaluation, we choose the same MIN_sim values of 0.075 and 0 for noun REs, but MIN_sim values of 0.1 and 0 for verb REs. These combinations of MIN_sim values result in actual replacement rates similar to the actual masking/replacement rates of other models for MRT/RRT of 20% and 40%, respectively.
We noticed that GWN-STEM only works for noun and verb REs, as for most adjectives, WordNet cannot calculate similarity scores. Hence, it was infeasible to evaluate on adjective REs. Further, most similarity scores only exist between noun-noun pairs and verb-verb pairs, and when we tried to produce sim_verbs and sim_adjs for noun REs, almost all resulted in empty lists. Hence, GWN-STEM actually produced the same outputs as NWN-STEM for noun REs (and was also limited to two replacement rates), which is why the two models have the same outputs and resulting metrics for noun REs. Even for verb REs, we were limited to two sets of outputs (mimicking the two replacement rates above) since similarity calculations between verb-noun pairs and verb-adjective pairs were limited, so few were replaced.

C.3 W2V-STEM
This uses Word2Vec (W2V) models trained using Gensim. We train six W2V models: one unigram model per dataset, and one four-gram model per dataset, where each is trained using the corresponding dataset's training set. To train the four-gram models, we begin by applying a bi-gram phrasing model on top of the original text, and then the bi-gram phrasing model again on top of this resulting text. We call this a four-gram phrasing model. We then use this to generate text that is grouped into phrases up to four-grams long. We then train W2V models on this four-gram text to generate the four-gram W2V models. For the unigram models, we use an embedding vector size of 50, a context window of 3, a minimum token count of 0, and the skip-gram model. For the four-gram models, we use an embedding vector size of 10, a context window of 1, a minimum token count of 0, and the CBOW (continuous bag-of-words) model.
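To see why applying a bi-gram phrasing step twice yields phrases of up to four words, consider the following toy sketch. It does not use Gensim and does not learn phrases statistically: the function `merge_bigrams` and the two fixed phrase sets are hypothetical stand-ins that only illustrate the two-pass merging idea.

```python
def merge_bigrams(tokens, phrases):
    """Greedily join adjacent token pairs found in `phrases` with an underscore."""
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            out.append("_".join(pair))
            i += 2  # both tokens were consumed by the merged phrase
        else:
            out.append(tokens[i])
            i += 1
    return out


# Hypothetical phrase inventories; a real phrasing model would learn these
# from co-occurrence statistics on the training set.
first_pass_phrases = {("sweet", "and"), ("sour", "chicken")}
second_pass_phrases = {("sweet_and", "sour_chicken")}

tokens = "family enjoyed the sweet and sour chicken".split()
pass_one = merge_bigrams(tokens, first_pass_phrases)
pass_two = merge_bigrams(pass_one, second_pass_phrases)
```

After the first pass, each merged token covers at most two words; the second pass can merge two such tokens, so a single token covers at most four words.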
For evaluation lines with noun, verb, and adjective REs, we go through the line of text and, with the help of the Stanford Parser, extract all words that are the same POS as the RE (which become the candidate OEs). Then, using cosine similarity between the W2V embedding vectors (from the unigram W2V models) of the RE and candidate OEs, we find the word with the maximum similarity to the RE, which becomes the replaced OE. Then, we replace other words in the input text similar to the OE using vector addition and subtraction as described in Section 4.3. We start with a similarity threshold value of 0.1 that increases by 0.05 each time to generate text satisfying varying replacement rate thresholds.
For evaluation lines with phrase REs, we first generate text files of the evaluation sets that are grouped into phrases up to four-grams long using the four-gram phrasing model.Then, using the four-gram W2V models, we proceed similarly as above to determine the replaced OE, which can now be either a single word or a phrase.Other words and phrases in the input text are replaced using vector addition and subtraction.We start with a similarity threshold value of 0.3 that increases by 0.01 each time to generate text satisfying varying replacement rate thresholds.
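The selection of the replaced OE by maximum cosine similarity can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `cosine` and `pick_original_entity` are our own, and the three-dimensional vectors are hypothetical toy values rather than trained W2V embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pick_original_entity(re_word, candidates, vectors):
    """Return the candidate OE whose embedding is most similar to the RE's."""
    scored = [
        (cosine(vectors[re_word], vectors[cand]), cand)
        for cand in candidates
        if cand in vectors  # skip out-of-vocabulary candidates
    ]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]


# Hypothetical toy embeddings (assumption for illustration only).
toy_vectors = {
    "restaurant": [0.9, 0.1, 0.2],
    "hotel": [0.8, 0.2, 0.1],
    "beds": [0.1, 0.9, 0.3],
    "service": [0.4, 0.4, 0.6],
}

# Candidate OEs are the words sharing the RE's POS (nouns in this example).
chosen = pick_original_entity("restaurant", ["hotel", "beds", "service"], toy_vectors)
```

The same routine applies to the four-gram models, where the keys of the embedding table may be single words or underscore-joined phrases.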

D Evaluation Keywords/Phrases
We select ten keywords (five keyphrases) to act as our REs per dataset per POS.See Tables 3 to 6 for the chosen REs.To do so, we iterate through our test sets and with the Stanford Parser, we extract a list of nouns and noun phrases, verbs and verb phrases, and adjectives and adjective phrases.We sort these lists by frequency, and limit our selections to the top 10% most frequent.For the verbs and adjectives and their phrases, we further filter them through a list of sentiment words (Hu and Liu, 2004) to ensure the REs we choose do not carry significant sentiment-related meaning, as inserting them into the original text would obviously lead to major changes in sentiment.From these, we manually select ten per dataset per POS (except phrases, where we select five per dataset) that are significant and carry strong meaning.These work well as the REs for evaluation purposes.Note that for phrases, we choose three noun phrases, one verb phrase, and one adjective phrase per dataset.
We choose from the most frequent words and phrases as they are more common and likely hold more significant meaning compared to less frequent ones (e.g., names and typos). Manual selection was required as some of the most frequent words/phrases hold little semantic meaning (e.g., it, they, is, was, and so forth). We only choose half the number of phrases as words as we find that frequent phrases carrying significant semantic meaning with little sentiment are much rarer.
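The automatic part of this selection (frequency ranking, top 10% cut-off, and sentiment filtering) can be sketched as below; the final manual selection is not modelled. The function name `candidate_res` is hypothetical, and we assume that the top 10% is taken over distinct entries. The toy example uses a larger fraction only because its vocabulary is tiny.

```python
from collections import Counter


def candidate_res(entities, sentiment_words, top_fraction=0.10):
    """Rank entities by frequency, keep the top fraction, drop sentiment words."""
    counts = Counter(entities)
    ranked = [entity for entity, _ in counts.most_common()]
    keep = max(1, int(len(ranked) * top_fraction))
    return [entity for entity in ranked[:keep] if entity not in sentiment_words]


# Toy list of extracted verbs (hypothetical data for illustration only).
verbs = (["order"] * 6 + ["love"] * 5 + ["walk"] * 4
         + ["try"] * 3 + ["sit"] * 2 + ["nap"])
# Stand-in for a sentiment lexicon such as that of Hu and Liu (2004).
toy_sentiment_words = {"love", "hate"}

shortlist = candidate_res(verbs, toy_sentiment_words, top_fraction=0.5)
```

For nouns, where no sentiment filtering is applied, an empty set can be passed as `sentiment_words`.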

E Detailed Evaluation Results - Tables and Graphs
See Figure 1 for a graph of overall average results, Table 7 and Figure 2 for average results by POS, Table 8 and Figure 3 for average results by dataset, and Table 9 and Figure 4 for average results by MRT/RRT.Note that the bolded values in the tables show which model performs better on that particular metric, on average, for the category.

F Model Output Examples
See Tables 10 to 21 for example outputs from every model for all datasets and POS.

Figure 1: Overall architecture and example, showing the three modules: Entity Replacement (ERM), Similarity Masking (SMM), and Text Infilling (TIM). Two parse-trees are shown: for RE (a) and S (b) and (c). Figure 2(d) is a semantic similarity heat-map generated from the USE embeddings of the candidate OEs and RE, where values are similarity scores in the range [0, 1].

Figure 2: ERM example with S = "i love this hotel ! the beds are comfortable and the service is great !" and RE = "restaurant", showing (a) Parse tree for RE; (b) and (c) Parse tree for S; (d) Semantic similarity heat map

Table 3: Chosen evaluation noun REs. *Obama does not exist in WordNet, so we instead use the word President for NWN-STEM and GWN-STEM.

Figure 1: Graph of overall average results (referring to the data found in Table 2 of the main body)

Figure 3: Graph of average results by dataset

Table 1: Masked outputs for different masking rate thresholds (MRT) and base similarity thresholds (ST)

Table 2: Overall average results by model (with % changes from the input)

Table 3: Generated output text by model for various masking rates on a Yelp evaluation example

Table 4: Average human evaluation scores by model

Table 5: Average TTR values by model

Table 8: SMERTI's avg. SPA, SLOR, CSS, STES, and BLEU by MRT/RRT

more semantic meaning, making it easier to generate semantically similar text. Both SMERTI's and the input text's CSS are lowest for adjectives, likely because they carry little semantic meaning.

S: family enjoyed the food quite a bit especially the sweet and sour chicken .
RE: bitter
S1: family enjoyed the food quite a bit especially the sweet and bitter chicken .
S2: family [mask] the [mask] a bit especially the sweet and bitter [mask] .

S: My son took his math test yesterday and failed. He cried all day and I hate him now.
RE: medical examination
S1: [mask] took medical examination yesterday and [mask] . he cried all day and i hate [mask] now .
S2: [mask] took medical examination yesterday and [mask] . he cried all day and i hate [mask] now .
S3: [mask] took medical examination yesterday and [mask] . he cried all day and i hate [mask] now .

Table 1: Example masked outputs. S is the original input text; RE is the replacement entity; S1 corresponds to MRT = 0.2, base ST = 0.4; S2 corresponds to MRT = 0.4, base ST = 0.3; S3 corresponds to MRT = 0.6, base ST = 0.2; S4 corresponds to MRT = 0.8, base ST = 0.1

Table 2: Training and testing splits by dataset

Table 4: Chosen evaluation verb REs

Table 5: Chosen evaluation adjective REs

Table 6: Chosen evaluation phrase REs

Table 7: Average results by POS

Figure 2: Graph of average results by POS

Table 8: Average results by dataset

Table 10: Example outputs for an Amazon evaluation line with noun RE

Table 11: Example outputs for an Amazon evaluation line with verb RE

Table 12: Example outputs for an Amazon evaluation line with adjective RE

Table 13: Example outputs for an Amazon evaluation line with phrase RE

Table 14: Example outputs for a Yelp evaluation line with noun RE

Table 15: Example outputs for a Yelp evaluation line with verb RE

Table 16: Example outputs for a Yelp evaluation line with adjective RE

Table 17: Example outputs for a Yelp evaluation line with phrase RE

Table 18: Example outputs for a news headlines evaluation line with noun RE

Table 19: Example outputs for a news headlines evaluation line with verb RE

Table 20: Example outputs for a news headlines evaluation line with adjective RE

Table 21: Example outputs for a news headlines evaluation line with phrase RE