Facilitating the Communication of Politeness through Fine-Grained Paraphrasing

Aided by technology, people are increasingly able to communicate across geographical, cultural, and language barriers. This ability also results in new challenges, as interlocutors need to adapt their communication approaches to increasingly diverse circumstances. In this work, we take the first steps towards automatically assisting people in adjusting their language to a specific communication circumstance. As a case study, we focus on facilitating the accurate transmission of pragmatic intentions and introduce a methodology for suggesting paraphrases that achieve the intended level of politeness under a given communication circumstance. We demonstrate the feasibility of this approach by evaluating our method in two realistic communication scenarios and show that it can reduce the potential for misalignment between the speaker's intentions and the listener's perceptions in both cases.


Introduction
Technological developments have greatly enhanced our communication experience, providing the opportunity to overcome geographic, cultural and language barriers to interact with people from different backgrounds in diverse settings (Herring, 1996). However, this opportunity brings additional challenges for the interlocutors, as they need to adjust their language to increasingly diverse communication circumstances.
As humans, we often make conscious attempts to account for the communication setting. For instance, we may simplify our expressions if we know our listener has relatively limited language proficiency, and we tend to be more polite towards people with higher status. However, managing these stylistic adjustments can be cognitively taxing, especially when we are missing relevant information, e.g., the language proficiency or the status of a conversational partner we meet online. If we do not adjust our language, we risk not properly conveying our pragmatic intentions (Thomas, 1983). In particular, Berlo's Sender-Message-Channel-Receiver model (Berlo, 1960) points to two potential circumstance-specific causes for misalignments between intentions and perceptions (Figure 1). In this work we explore a method for assisting speakers to avoid such misalignments by suggesting for each message a paraphrase that is more likely to convey the original pragmatic intention when communicated in a given circumstance, as determined by the properties of the sender, channel, and receiver.
As a case study, in this work, we focus on one particular pragmatic aspect: politeness. It is important to assist people to accurately transmit their intended politeness, as this interpersonal style (Biber, 1988) plays a key role in social interactions (Burke and Kraut, 2008; Murphy, 2014; Hu et al., 2019; Maaravi et al., 2019). Furthermore, politeness is known to be a circumstance-sensitive phenomenon (Kasper, 1990; Herring, 1994; Forgas, 1998; Mousavi and Samar, 2013), making it a good case for our study. Concretely, we propose the task of generating a paraphrase for a given message that is more likely to deliver the intended level of politeness after transmission (henceforth intention-preserving), considering the properties of the sender, channel, and receiver (Section 3).
Taking the properties of the channel into account is important because communication channels may not always faithfully deliver messages (Figure 1A). For example, in translated communication, politeness signals can often be lost or corrupted (Allison and Hardin, 2020). To demonstrate the potential of our framework in mitigating channel-induced misunderstandings, we apply it to suggest paraphrases that are safer to transmit, i.e., less likely to have their politeness altered, over a commercial machine translation service.
We also need to account for the fact that the sender and receiver can have different interpretations of the same message (Figure 1B). For example, people may perceive politeness cues differently depending on their cultural background (Thomas, 1983; Riley, 1984). In our second application scenario, the interlocutors' perceptions of politeness are misaligned, and we aim to suggest paraphrases that reduce the potential for misinterpretation.
To successfully produce such circumstance-sensitive paraphrases, we need to depart from existing style transfer methodology (see Li et al., 2020 for a survey, and Madaan et al., 2020 for politeness transfer in particular). First, since we must account for arbitrary levels of misalignment, we need fine-grained control over the target stylistic level, as opposed to binary switches (e.g., from impolite to polite). Second, we need to determine the target stylistic level at run time, in an ad hoc fashion, rather than assuming predefined targets.
To overcome these new technical challenges, we start from the intuition that the same level of politeness can be conveyed through different combinations of pragmatic strategies (Lakoff, 1973; Brown and Levinson, 1987), with some being more appropriate to the given circumstance than others. We consider a classic two-step approach (Section 4), separating planning, i.e., choosing a viable combination of strategies that can achieve a desired stylistic level in a particular circumstance, from realization, i.e., incorporating this plan into generation outputs. For a given fine-grained target stylistic level (i.e., the level intended by the sender), we find the optimal strategy plan via Integer Linear Programming (ILP). We then realize this plan using a modification of the 'Delete-Retrieve-Generate' (DRG) paradigm (Li et al., 2018) that allows for strategy-level control in generation.
Our experimental results indicate that in both our application scenarios, our method can suggest paraphrases that narrow the potential gap between the intended and perceived politeness, and thus better preserve the sender's intentions. These results show that automated systems have the potential to help people better convey their intentions in new communication circumstances, and encourage further work exploring the feasibility and implications of such communication assistance applications.
To summarize, in this work, we motivate and formulate the task of circumstance-sensitive intention-preserving paraphrasing (Section 3). Focusing on the case of pragmatic intentions, we introduce a model for paraphrasing with fine-grained politeness control (Section 4). We evaluate our method in two realistic communication scenarios to demonstrate the feasibility of the approach (Section 5).

Further Related Work
Style transfer. There has been a wide range of efforts in using NLP techniques to generate alternative expressions, leading to tasks such as text simplification (see Shardlow, 2014 for a survey), or more generally, paraphrase generation (Meteer and Shaked, 1988; Quirk et al., 2004; Fu et al., 2019, inter alia). When such paraphrasing effort focuses on the stylistic aspect, it is also referred to as text style transfer, which has attracted a lot of attention in recent years (Xu et al., 2012; Ficler and Goldberg, 2017; Fu et al., 2018; Prabhumoye et al., 2018, inter alia). While these tasks focus on satisfying specific predefined linguistic properties at the utterance level, they are not designed for fine-grained adjustments to changing non-textual communication circumstances.

Controllable generation. Style transfer and paraphrasing can both be seen as special cases of the broader task of controllable text generation (Hu et al., 2017; Keskar et al., 2019; Dathathri et al., 2020, inter alia). While not focused on paraphrasing, relevant work in this area aims at controlling the level of politeness for translation (Sennrich et al., 2016) or dialog responses (Niu and Bansal, 2018).

AI-assisted communication or writing. Beyond paraphrasing, AI tools have been used to provide communication or writing assistance in diverse settings: from the mundane task of grammar and spell checking (Napoles et al., 2019; Stevens, 2019), to creative writing (Clark et al., 2018), to negotiations (Zhou et al., 2019); this line of work has also led to discussions of ethical implications (Hancock et al., 2020).

Models of communication. While Berlo's model provides the right level of abstraction for inspiring our application scenarios, many other models exist (Velentzas and Broni, 2014; Barnlund, 2017), most of which are influenced by the Shannon-Weaver model (Shannon and Weaver, 1963).

Task Formulation
Given a message that a sender attempts to communicate to a receiver over a particular communication channel, the task of circumstance-sensitive intention-preserving paraphrasing is to generate a paraphrase that is more likely to convey the intention of the sender to the receiver after transmission, under the given communication circumstance.

Formulation. To make this task more tractable, our formulation considers a single gradable stylistic aspect of the message that can be realized through a collection of pragmatic strategies (denoted as S). While in this work we focus on politeness, other gradable stylistic aspects might include formality, humor and certainty.
We can then formalize the relevant features of the communication circumstance as follows:
1. For the communication channel, we consider whether it can safely transmit each strategy s ∈ S. In particular, f_c(s) = 1 indicates that strategy s is safe to use, whereas f_c(s) = 0 implies that s is at risk of being lost.
2. For the sender and receiver, we quantify the level of the stylistic aspect each of them perceives in a combination of pragmatic strategies via two mappings f_send: P(S) → R and f_rec: P(S) → R, respectively, with P(S) denoting the powerset of S.
With our focus on politeness, our task can then be more precisely stated as follows: given an input message m, we aim to generate a politeness paraphrase for m, under the circumstance specified by (f_send, f_c, f_rec), such that the level of politeness perceived by the receiver is as close as possible to that intended by the sender.
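To make the circumstance specification concrete, the following sketch (not the authors' code) instantiates (f_send, f_c, f_rec) and the quantity the paraphrase should minimize. All strategy names and strength values are hypothetical, and the perception mappings are already instantiated as additive over strategies, anticipating the linearity assumption made later in Section 4.1.

```python
# Hypothetical per-strategy politeness strengths for sender and receiver.
SENDER_STRENGTH = {"Please": 0.5, "Subjunctive": 0.7, "Gratitude": 0.9}
RECEIVER_STRENGTH = {"Please": 0.3, "Subjunctive": 0.7, "Gratitude": 0.9}

def f_send(strategies):
    """Politeness level the sender intends with this strategy combination."""
    return sum(SENDER_STRENGTH.get(s, 0.0) for s in strategies)

def f_rec(strategies):
    """Politeness level the receiver perceives in this strategy combination."""
    return sum(RECEIVER_STRENGTH.get(s, 0.0) for s in strategies)

def f_c(strategy):
    """Channel safety: 1 if the strategy survives transmission, else 0."""
    return 0 if strategy in {"Please"} else 1

def perception_gap(s_in, s_out):
    """|f_send(S_in) - f_rec(S_out)|: the gap a good paraphrase minimizes."""
    return abs(f_send(s_in) - f_rec(s_out))
```

Under this toy circumstance, dropping the at-risk Please strategy without compensation leaves a residual gap, which the planning step described later tries to close.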
We show that our theoretically-grounded formulation can model naturally-occurring challenges in communication, by considering two possible application scenarios, each corresponding to a source of misalignment highlighted in Figure 1.
Application A: translated communication. We first consider the case of conversations mediated by translation services, where channel-induced misunderstandings can occur (Figure 1A): MT models may systematically drop certain politeness cues due to technical limitations or mismatches between the source and target languages.
For instance, despite the difference in intended politeness level (indicated in parentheses) of the following two versions of the same request:

Could you please proofread this article? (POLITE)
Can you proofread this article? (SOMEWHAT POLITE)

Microsoft Bing Translator would translate both versions to the same utterance in Chinese. By dropping the politeness marker 'please', and not making any distinction between 'could you' and 'can you', the message presented to the Chinese receiver is likely to be more imposing than originally desired by the English sender.
To avoid such channel-induced misunderstandings, the sender may consider using only strategies that are known to be safe with the specific MT system they use. However, since the inner mechanics of such systems are often opaque (and in constant flux), the sender would benefit from automatic guidance in constructing such paraphrases.
Application B: misaligned perceptions. We then consider the case when senders and receivers with differing perceptions interact. Human perceptions of pragmatic devices are subjective, and it is not uncommon to observe different interpretations of the same utterance, or of the pragmatic cues within, leading to misunderstandings (Thomas, 1983; Kasper, 1990) (Figure 1B). For instance, a study comparing Japanese speakers' and American native English speakers' perceptions of English requests finds that while the latter group takes the request 'May I borrow a pen?' as strongly polite, their Japanese counterparts regard the expression as almost neutral (Matsuura, 1998). In this case, if a native speaker still wishes to convey their good will, they need to find a paraphrase that would be perceived as strongly polite by Japanese speakers.
When compared to style transfer tasks, our circumstance-sensitive intention-preserving paraphrasing task gives rise to important new technical challenges. First, in order to minimize the gap in perceptions, we need to have fine-grained control over the stylistic aspect, as opposed to switching between two pre-defined binarized targets (e.g., polite vs. impolite). Second, the desired degree of change is only determined at run-time, depending on the speaker's intention and on the communication circumstance. We address these challenges by developing a method that allows for ad hoc and fine-grained paraphrase planning and realization.
Our solution starts from a strategy-centered view: instead of aiming for monolithic style labels, we think of pragmatic strategies as (stylistic) LEGO bricks. These can be stacked together in various combinations to achieve similar stylistic levels. Depending on the circumstance, some bricks might, or might not, be available. Therefore, given a message with an intended stylistic level, our goal is to find the optimal collection of available bricks that can convey the same level (ad hoc fine-grained planning). Given this optimal collection, we need to assemble it with the rest of the message into a valid paraphrase (fine-grained realization).

Politeness strategies. In the case of politeness, we derive the set of pragmatic strategies from prior work (Danescu-Niculescu-Mizil et al., 2013; Voigt et al., 2017; Yeomans et al., 2019). We focus on strategies that are realized through local linguistic markers. For instance, the Subjunctive strategy can be realized through the use of markers like could you or would you. In line with prior work, we further assume that markers realizing the same strategy have comparable strength in exhibiting politeness and are subject to the same constraints. The full list of 18 strategies we consider (along with example usages) can be found in Table 1. Strategy extraction code is available in ConvoKit.

Ad hoc fine-grained planning. Our goal is to find a target strategy combination that is estimated to provide a pragmatic force comparable to the sender's intention, using only strategies appropriate in the current circumstance. To this end, we devise an Integer Linear Programming (ILP) formulation that can efficiently search for the desired strategy combination to use (Section 4.1).

Fine-grained realization. To train a model that learns to merge the strategy plan into the original message in the absence of parallel data, we take inspiration from the DRG paradigm (Li et al., 2018), originally proposed for style transfer tasks.
We adapt this paradigm to allow for direct integration with strategy-level planning, providing finer-grained control over realization (Section 4.2).

Fine-Grained Strategy Planning
Formally, given a message m using a set of strategies S_in, under a circumstance specified by (f_send, f_c, f_rec), the planning goal is to find the set of strategies S_out ⊆ S such that f_c(s) = 1 for all s ∈ S_out (i.e., they can be safely transmitted through the communication channel) and f_send(S_in) ≈ f_rec(S_out) (i.e., the resulting receiver perception is similar to the intention the sender had when crafting the original message).
Throughout, we assume that both perception mappings f_send and f_rec are linear functions:

f_send(S') = Σ_{s ∈ S} a_s · 1_{S'}(s),    f_rec(S') = Σ_{s ∈ S} b_s · 1_{S'}(s),

where the linear coefficients a_s and b_s reflect the strength of a strategy, as perceived by the sender and receiver, respectively.

Naive approach. One greedy type of approach to this problem is to consider each at-risk strategy s ∈ S_in in turn, and replace s with the safe strategy s' that is closest in strength. Mathematically, this can be written as s' = argmin_{ŝ ∈ S, f_c(ŝ)=1} |a_s − b_ŝ|. In our analogy, this amounts to reconstructing a LEGO model by replacing each 'lost' brick with the most similar brick that is available.

Our approach: ILP formulation. The greedy approach, while easy to implement, cannot consider solutions that involve an alternative combination of strategies. In order to search the space of possible solutions for an appropriate strategy plan in a more thorough, flexible, and efficient manner, we translate this problem into an ILP formulation. Our objective is to find a set of safe strategies S_out that will be perceived by the receiver as close as possible to the sender's intention, i.e., one that minimizes |f_send(S_in) − f_rec(S_out)|.
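The naive greedy baseline can be sketched in a few lines (this is an illustrative sketch, not the authors' code; the strategy names, coefficients, and safe set below are hypothetical):

```python
# Hypothetical sender (a_s) and receiver (b_s) coefficients per strategy.
A = {"Please": 0.5, "Subjunctive": 0.7, "Gratitude": 0.9, "Greeting": 0.4}
B = {"Please": 0.5, "Subjunctive": 0.7, "Gratitude": 0.9, "Greeting": 0.4}
SAFE = {"Gratitude", "Greeting"}  # strategies with f_c(s) = 1

def greedy_plan(s_in):
    """Replace each at-risk strategy with the safe strategy closest in
    strength: s' = argmin over safe ŝ of |a_s - b_ŝ|."""
    plan = []
    for s in s_in:
        if s in SAFE:
            plan.append(s)  # safe strategies are kept as-is
        else:
            plan.append(min(sorted(SAFE), key=lambda cand: abs(A[s] - B[cand])))
    return plan
```

Because each at-risk strategy is swapped in isolation, this baseline cannot trade one strategy for a combination of several weaker (or stronger) ones, which is precisely the limitation the ILP formulation addresses.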
To this end, we introduce a binary variable x_s for each strategy s in S, where we take x_s = 1 to mean that strategy s should be selected to be present in the suggested alternative strategy combination S_out. We can identify the optimal value of each x_s (and thus the optimal strategy set S_out) by solving the following ILP problem:

minimize  y
subject to  y ≥ f_send(S_in) − Σ_{s ∈ S} b_s · x_s,
            y ≥ Σ_{s ∈ S} b_s · x_s − f_send(S_in),
            x_s ∈ {0, 1} for all s ∈ S,

which rewrites our objective of minimizing |f_send(S_in) − f_rec(S_out)| in a form that satisfies the linearity requirement of ILP via an auxiliary variable y, and in which our target variables x_s replace the indicator function 1_{S_out}(s) in the linear expression of f_rec.
The channel constraints are encoded by the additional constraints x_s ≤ f_c(s), allowing only safe strategies (i.e., those for which f_c(s) = 1) to be included. Additional strategy-level constraints can be similarly specified through this mechanism to obtain strategy plans that are easier to realize in natural language (Section C in the Appendix).
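The planning objective can be illustrated with a small stand-in solver. The paper solves it as an ILP (using the PuLP package with the GLPK solver); for a handful of safe strategies, an exhaustive search over subsets finds the same optimum, which is enough to show why combination search beats greedy replacement. The strategy names and coefficients are hypothetical:

```python
from itertools import combinations

# Hypothetical receiver coefficients b_s, restricted to safe strategies.
B = {"Gratitude": 0.9, "Greeting": 0.4, "Apologizing": 0.6}

def plan(target):
    """Return the subset of safe strategies S_out minimizing
    |target - f_rec(S_out)|, i.e., |f_send(S_in) - sum of b_s over S_out|."""
    safe = list(B)
    best, best_gap = set(), abs(target)  # empty plan as the starting point
    for r in range(1, len(safe) + 1):
        for subset in combinations(safe, r):
            gap = abs(target - sum(B[s] for s in subset))
            if gap < best_gap:
                best, best_gap = set(subset), gap
    return best
```

For a target of 1.0, no single safe strategy matches, but the combination {Greeting, Apologizing} hits it exactly, which is the kind of solution the greedy baseline cannot reach.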

Fine-Grained Realization
To transform the ILP solutions into natural language paraphrases, we build on the general DRG framework, which has shown strong performance in style transfer without parallel data. (Throughout, we obtain ILP solutions using the PuLP package (Mitchell et al., 2011) with the GLPK solver; a brute-force alternative would inevitably be less scalable.) We modify this framework to allow for the fine-grained control needed to realize strategy plans.
As the name suggests, the vanilla DRG framework consists of three steps. In the delete step, lexical markers (n-grams) that are strongly indicative of style are removed, resulting in a 'style-less' intermediate text. In the retrieve step, target markers are obtained by considering those used in training examples that are similar to the input but exhibit the desired property (e.g., target sentiment valence). Finally, in the generate step, the generation model merges the desired target markers with the style-less intermediate text to create the final output.
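At its core, a delete step amounts to stripping the n-gram markers that realize a strategy. A minimal sketch (not the authors' code; the marker lexicon below is a hypothetical two-strategy fragment of Table 1):

```python
# Hypothetical strategy-to-marker lexicon.
MARKERS = {"Please": ["please"], "Subjunctive": ["could you", "would you"]}

def delete_markers(message, strategies_to_remove):
    """Strip markers realizing the given strategies from the message."""
    text = message
    for s in strategies_to_remove:
        for marker in MARKERS.get(s, []):
            text = text.replace(marker, "")
    return " ".join(text.split())  # normalize whitespace after removal
```

For example, removing Please and Subjunctive from "could you please proofread this article?" leaves the post-deletion context "proofread this article?".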
Importantly, the DRG framework is primarily designed to select to-be-inserted markers based on pre-defined binary style classes. As such, it cannot directly provide the ad hoc fine-grained control needed by our application. We now explain our modifications in detail (following the sketch of our pipeline in Figure 2):

Plan (instead of Retrieve). We first perform a Plan step, which substitutes the Retrieve step in DRG; it is performed first in our pipeline because our version of the Delete step depends on the planning results. For an input message, we identify the politeness strategies it contains and set up the corresponding ILP problem (Section 4.1) to obtain their functional alternatives. By factoring the communication circumstance into the ILP formulation, we obtain an ad hoc strategy plan to achieve the intended level of politeness. This is in contrast with the Retrieve step in DRG, in which target markers from similar-looking texts are used for direct lexical substitution.

Delete. Instead of identifying style-bearing lexical markers to delete with either frequency-based heuristics (Li et al., 2018) or sentence context (Sudhakar et al., 2019), we rely on linguistically informed politeness strategies. To prepare the input message for the new strategy plan, we compare the strategy combination from the ILP solution with the one originally used. We then selectively remove strategies that do not appear in the ILP solution by deleting the corresponding markers found in the input message. As such, in contrast with DRG, our post-deletion context is not necessarily style-less, and it is also possible that no deletion is performed.

Figure 2: Sketch of our pipeline for generating politeness paraphrases. Given an input message, we first identify the politeness strategies (S_in) and the corresponding markers it contains. In the plan step, we use ILP to compute a target strategy combination (S_out) that is appropriate under the circumstance. We then delete markers corresponding to strategies that need to be removed to obtain the post-deletion context. Finally, we sequentially insert the new strategies from the ILP solution into this context to generate the final output.
Generate. Finally, we need to generate fluent utterances that integrate the strategies identified by the Plan step into the post-deletion context. To this end, we adapt G-GST (Sudhakar et al., 2019), whose generation model is fine-tuned to learn to integrate lexical markers into the post-deletion context. To allow smooth integration of the ILP solution, we instead train the generation model to incorporate politeness strategies directly.
Concretely, the training data exemplifies how each target strategy can be integrated into various post-deletion contexts. This data is constructed by finding GROUNDTRUTH utterances containing markers corresponding to a certain STRATEGY, and removing them to obtain the post-deletion CONTEXT. These training instances are represented as (STRATEGY, CONTEXT, GROUNDTRUTH) tuples separated by special tokens (examples in Figure 2). The model is trained to minimize the reconstruction loss; we adapted the implementation from Sudhakar et al. (2019) to incorporate the modification described above, and we use their default training setup.

At test time, we sequentially use the model to integrate each STRATEGY from the plan into the post-deletion CONTEXT. We perform beam search of size 3 for each strategy we attempt to insert and select the output that best matches the intended level of politeness as the paraphrase suggestion. We set an upper bound of at most 3 new strategies to be introduced to keep sequential insertion computationally manageable; this is a reasonable assumption for short utterances.
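The test-time loop of sequential insertion and candidate selection can be sketched as follows. This is not the authors' code: the real system uses a fine-tuned generation model with beam search, so the `generate` function below is a hypothetical stand-in that returns template-based candidates:

```python
def generate(strategy, context):
    """Stand-in for the generation model: returns candidate insertions."""
    markers = {"Gratitude": ["thanks."], "Greeting": ["hi,"]}
    return [f"{m} {context}" if m.endswith(",") else f"{context} {m}"
            for m in markers.get(strategy, [])]

def realize(plan, context, perceived, target):
    """Insert each planned strategy in turn, keeping the candidate whose
    predicted politeness (per perception model `perceived`) is closest
    to the target level."""
    for strategy in plan:
        candidates = generate(strategy, context)
        if candidates:
            context = min(candidates, key=lambda c: abs(perceived(c) - target))
    return context
```

With a real generator, each `generate` call would return the beam of size 3, and `perceived` would be the trained receiver perception model.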

Evaluation
To test the feasibility of our approach, we set up two parallel experiments with different circumstance specifications, so that each illustrates one potential source of misalignment as described in Section 3.

Experiments
Data. We use the annotations from the Wikipedia section of the Stanford Politeness Corpus (henceforth annotations) to train perception models that will serve as approximations of f_send and f_rec. In this corpus, each utterance was rated by 5 annotators on a 25-point scale from very impolite to very polite, which we rescale to the [−3, 3] range.
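Assuming a linear map (the exact rescaling is not spelled out in the text), a 1-to-25 rating can be mapped onto [−3, 3] as follows:

```python
def rescale(rating):
    """Linearly map a 1..25 politeness rating onto [-3, 3] (assumed map)."""
    return (rating - 13) / 4  # 1 -> -3, 13 -> 0, 25 -> 3
```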
To train the generation model, we randomly sample another (unannotated) collection of talk-page messages from WikiConv (Hua et al., 2018). For each strategy, we use 1,500 disjoint instances for training (27,000 in total, of which 2,000 are used for validation) and additionally source 200 instances per strategy as test data. Both the Stanford Politeness Corpus and WikiConv are retrieved from ConvoKit (Chang et al., 2020b).

Experiment A: translated communication. We first consider MT-mediated English-to-Chinese communication using Microsoft Translator, where channel-induced misunderstandings may occur.
For this specific channel, we estimate its f_c by performing back-translation (Tyupa, 2011), i.e., translating the translated text back into the source language, on a sampled set of utterances from the collection of Stack Exchange requests in the Stanford Politeness Corpus. We consider a strategy s to be at-risk under this MT-mediated channel if the majority of messages using s have back-translations that no longer use it. We identify four at-risk strategies, leading to the following channel specification: f_c(s) = 0 if s ∈ {Subjunctive, Please, Filler, Swearing}; f_c(s) = 1 otherwise.
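The majority-based at-risk criterion can be sketched as below (an illustrative sketch with hypothetical sample data; in the paper the probed channel is Microsoft Translator):

```python
def estimate_f_c(samples):
    """samples: {strategy: [(used_in_original, used_in_back_translation)]}.
    A strategy is at-risk (f_c = 0) if the majority of messages using it
    lose it after a round trip through the MT channel."""
    f_c = {}
    for strategy, pairs in samples.items():
        lost = sum(1 for orig, back in pairs if orig and not back)
        total = sum(1 for orig, _ in pairs if orig)
        f_c[strategy] = 0 if total and lost / total > 0.5 else 1
    return f_c
```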
For the sender and the receiver, we make the simplifying assumption that they both perceive politeness similarly to a prototypical 'average person' (an assumption we address in the next experiment), and take the average scores from the annotations to train a linear regression model f_avg to represent the perception model, i.e., f_send = f_rec = f_avg.
We retrieve the test data corresponding to the four at-risk strategy types as test messages (4 × 200 in total). We estimate the default perception gap (i.e., when no intervention takes place) by comparing the intended level of politeness of the original message with the level of politeness of its back-translation, which roughly approximates what the receiver sees after translation, following Tyupa (2011). This way, we avoid having to compare politeness levels across different languages.

Experiment B: misaligned perceptions. We then consider communication between individuals with misaligned politeness perceptions. Under this circumstance, we assume a perfect channel, which allows any strategy to be safely transmitted, i.e., f_c(s) = 1 for all s ∈ S. We then consider the top 5 most prolific annotators as potential senders and receivers. To obtain f_send (and f_rec), we use the respective annotator's annotations to train an individual linear regression model. We take all permutations of (sender, receiver) among the chosen annotators, resulting in 20 different directed pairs. For each pair, we select as test data the top 100 utterances with the greatest (expected) perception gap in the test set. We take the default perception gap within the pair (with no intervention) as the difference between the sender's intended level of politeness (as judged by f_send) and the receiver's perceived level of politeness (as judged by f_rec).
(Details about the choice of annotators and their perception models are described in Section B in the Appendix. While in practice individual perception models may not be available, they could potentially be approximated based on annotations from people with similar cultural backgrounds.)
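For a directed (sender, receiver) pair, the default perception gap reduces to scoring the same unmodified strategy set under both perception models. A sketch with hypothetical per-annotator coefficients (the real models are linear regressions fit on each annotator's ratings):

```python
# Hypothetical per-annotator strategy coefficients.
SENDER = {"Please": 0.6, "Gratitude": 1.0}    # sender's perception model
RECEIVER = {"Please": 0.2, "Gratitude": 1.0}  # receiver's perception model

def default_gap(strategies):
    """Gap when the message is transmitted unchanged: the same strategy
    set is scored by both perception models."""
    intended = sum(SENDER.get(s, 0.0) for s in strategies)
    perceived = sum(RECEIVER.get(s, 0.0) for s in strategies)
    return abs(intended - perceived)
```

Messages whose strategies the two annotators weight differently (here, Please) yield the largest default gaps, and are therefore the ones selected as test data.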
Baselines. Beyond the base case with no intervention, we consider baselines with different degrees of planning. We first consider binary-level planning by directly applying vanilla DRG in our setting: for each message, we retrieve from the generation training data the most similar utterance that has the same politeness polarity as the input message, and take the strategy combination used within it as the new strategy plan. We then consider finer-grained strategy planning based on the naive greedy search, for which we substitute each at-risk strategy with an alternative that is the closest in strength. To make fair comparisons among the different planning approaches, we apply the same set of constraints (either circumstance-induced or generation-related) that we use with ILP.

Evaluation. We compare the paraphrasing outputs using both automatic and human evaluations. First, we consider our main objective: how effective each model is at reducing the potential gap between intended and perceived politeness. We compare the predicted perceived politeness levels of the paraphrases generated by each model with the intended politeness levels of the original inputs in terms of mean absolute error (MAE_gen), with smaller values corresponding to smaller gaps. We additionally evaluate the (pre-generation) quality of the planned strategy set (MAE_plan) to account for cases in which the plan is not perfectly realized.
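The MAE_gen metric is a plain mean absolute error between intended and (predicted) perceived politeness levels; a minimal sketch:

```python
def mae(intended, perceived):
    """Mean absolute error between the intended politeness of each input
    and the predicted perceived politeness of its paraphrase
    (smaller values correspond to smaller perception gaps)."""
    assert len(intended) == len(perceived)
    return sum(abs(i - p) for i, p in zip(intended, perceived)) / len(intended)
```

MAE_plan is computed the same way, but scoring the planned strategy set directly rather than the generated paraphrase.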
To check the extent to which the generated paraphrases could be readily used, we assess how natural they sound to humans. We sample 100 instances from each set of generated outputs and ask non-author native English speakers to judge their naturalness on a scale of 1 to 5 (5 being very natural). The task is split among two annotators, and we obtain one annotation for each utterance. Each annotator was presented with an even distribution of retrieval-based, greedy-based, and ILP-based generation outputs, and was not given any information on how the outputs were obtained.

To validate that the original content is not drastically altered, we report BLEU scores (Papineni et al., 2002) obtained by comparing the generation outputs with the original messages (BLEU-s). Additionally, we provide a rough measure of how 'ambitious' the paraphrasing plan is by counting the number of new strategies that are ADDED.

Table 3: Example generation outputs (the original and newly introduced markers through which the strategies are realized are highlighted in the original). For reference, the (estimated) gap between the sender's intention and the receiver's perception after transmission is shown for each output. More example outputs and error cases are shown in Tables A3 and A4 in the Appendix.
"hi, would you please reply to me at the article talk page? thanks." (gap: 0.97)
"good idea . sorry , would you please reply to me at the article talk page for you ?" (gap: 0.01)
Results. Table 2 shows that our ILP-based method is capable of significantly reducing the potential gap in politeness perceptions between the sender and the receiver in both experiments (t-test, p < 0.001). The comparison with the baselines underlines the virtues of supporting fine-grained planning: the effectiveness of the eventual paraphrase is largely determined by the quality of the strategy plan. This can be seen by comparing values in the MAE_plan column, which shows the misalignments that would result if the plans were perfectly realized. Furthermore, when planning is done too coarsely (e.g., at a binary granularity for vanilla DRG), the resulting misalignment can be even worse than not intervening at all (for translated communication).
At the same time, the paraphrases remain mostly natural, with the average annotator ratings generally falling into the 'mostly natural' category for all generation models. The exact average ratings are 4.5, 4.2, and 4.2 for the retrieval-based, greedy-based, and ILP-based generation, respectively. These generation outputs also largely preserve the content of the original message, as indicated by the relatively high BLEU-s scores. Considering that the ILP-based method (justifiably) implements a more ambitious plan than the baselines (compare #-ADDED), it is expected to depart more from the original input; in spite of this, the difference in naturalness is small.

Examining example outputs and error cases (Tables 3, A3 and A4), we identify a few issues that prevent the model from producing ideal paraphrases, opening avenues for future improvements:

Available strategies. Between the two experimental conditions reported in Table 2, we notice that the performance (MAE_gen) is worse for the case of translated communication. A closer analysis reveals that this is mostly due to a particularly hard-to-replace at-risk strategy, Swearing, which is one of the few available strategies with a strong negative politeness valence. The strategy set we operationalize is by no means exhaustive. Future work can consider a more comprehensive set of strategies, or even individualized collections, to allow more diverse expressions.

Capability of the generation model. From a cursory inspection, we find that the generation model has learned to incorporate the planned strategies, from simple maneuvers such as appending markers at sentence boundaries to more complex actions such as inserting relevant markers in reasonable positions within the messages (both exemplified in Table 3). However, the generation model does not always fully execute the strategy plan, and can make inappropriate insertions, especially in the case of the more ambitious ILP solutions.
We anticipate that more advanced generation models may help further improve the quality and naturalness of the paraphrases. Alternatively, dynamically integrating the limitations of the generation model as explicit planning constraints might lead to solutions that are easier to realize.

Discussion
In this work, we motivate and formulate the task of circumstance-sensitive intention-preserving paraphrasing and develop a methodology that shows promise in helping people more accurately communicate politeness under different communication settings. The results and limitations of our method open up several natural directions for future work.

Modeling politeness perceptions. We use a simple linear regression model to approximate how people internally interpret politeness and restrict our attention to only the set of local politeness strategies. Future work may consider more comprehensive modeling of how people form politeness perceptions or obtain more reliable causal estimates for strategy strength (Wang and Culotta, 2019).

Task formulation. We make several simplifying assumptions in our task formulation. First, we focus exclusively on a gradable stylistic aspect that is mostly decoupled from the content (Kang and Hovy, 2019), reducing the complexity required from both the perception and the generation models. Future work may consider more complex stylistic aspects and strategies that are more tied to the content, such as switching from active to passive voice. Second, we consider binary channel constraints, but in reality, the channel behavior is often less clear-cut. Future work can aim to propose more general formulations that encapsulate more properties of the circumstance.
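The simple linear perception model described above can be sketched as follows: the perceived politeness of a message is an intercept plus the summed coefficients of the local politeness strategies it uses. The strategy names and coefficient values here are invented for illustration, not the fitted values from the paper.

```python
# A minimal sketch of a linear politeness-perception model. Each strategy s
# contributes its coefficient b_s whenever it appears in the message; these
# coefficients are illustrative placeholders, not the paper's estimates.
COEF = {"Gratitude": 0.8, "Please": 0.5, "Subjunctive": 0.4, "Swearing": -1.2}
INTERCEPT = 0.0

def perceived_politeness(strategies, coef=COEF, intercept=INTERCEPT):
    """Predict perceived politeness from the list of strategies a message uses."""
    return intercept + sum(coef.get(s, 0.0) for s in strategies)

print(round(perceived_politeness(["Gratitude", "Subjunctive"]), 2))
```

Fitting a separate coefficient vector per annotator (as in the individualized models of Appendix B) then amounts to refitting the same regression on that annotator's labels alone.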
Forms of assistance. While we have focused on offering paraphrasing options as the form of assistance, it is not the only type of assistance possible. As our generation model may not (yet) match the quality of human rewrites, there is a potential trade-off: an entirely automatic assistance option may put the least cognitive load on the user, but it may not produce the most natural and effective rewrite, which may be possible when humans are more involved. Hence, while we work towards providing fully automated suggestions, we might also want to draw on the language ability humans possess and consider assistance approaches in the form of interpretable (partial) suggestions.
Evaluation. In our experiments, we have relied exclusively on model predictions to estimate the level of misalignment in politeness perceptions. Given the fine-grained and individualized nature of the task, using humans to ascertain the politeness of the outputs would require an extensive and relatively complex annotation setup (e.g., collecting fine-grained labels from annotators with known backgrounds for training and evaluating individualized perception models). Furthermore, to move towards more practical applications, we would also need to conduct communication-based evaluation (Newman et al., 2020) in addition to annotating individual utterances. Future work can consider adapting experiment designs from prior work (Gao et al., 2015; Hohenstein and Jung, 2018) to establish the impact of offering such intention-preserving paraphrases in real conversations, potentially by considering downstream outcomes.
Bridging the gaps in perceptions. While we focus on politeness strategies, they are not the only circumstance-sensitive linguistic signals that may be lost or altered during transmission, nor the only type that is subject to individual or culture-specific perceptions. Other examples commonly observed in communication include, but are not limited to, formality (Rao and Tetreault, 2018) and emotional tones (Chhaya et al., 2018; Raji and de Melo, 2020). As we are provided with more opportunities to interact with people across cultural and language barriers, the risk of misunderstandings in communication also grows (Chang et al., 2020a). Thus, it is all the more important to develop tools to mitigate such risk and help foster mutual understanding.

B Prolific Annotators
For experiment B, we sample the five most prolific annotators from the Wikipedia section of the Stanford Politeness Corpus; the most prolific annotated 2,063 instances, and the least prolific among the five annotated 715.
When training individual perception models, we note that some less frequently used strategies tend to be under-annotated at the individual level, and may thus yield artificially large differences in coefficients. We therefore use the coefficient from the average model for any strategy that is annotated fewer than 15 times by the individual annotator.
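This fallback rule can be sketched as a small helper that merges an annotator's individually fitted coefficients with those of the average model; the threshold of 15 comes from the text, while the function name and the example coefficients are illustrative.

```python
MIN_ANNOTATIONS = 15  # threshold below which we fall back to the average model

def merged_coefficients(indiv_coef, avg_coef, counts, threshold=MIN_ANNOTATIONS):
    """Use the annotator's own coefficient only for strategies they annotated
    at least `threshold` times; otherwise keep the average-model coefficient."""
    return {
        s: indiv_coef[s] if counts.get(s, 0) >= threshold else avg_coef[s]
        for s in avg_coef
    }

# Illustrative numbers: this annotator saw "Swearing" only 3 times, so the
# average-model coefficient is retained for that strategy.
avg = {"Gratitude": 0.8, "Swearing": -1.2}
indiv = {"Gratitude": 1.1, "Swearing": -0.2}
seen = {"Gratitude": 200, "Swearing": 3}
print(merged_coefficients(indiv, avg, seen))
```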

C Additional Details on ILP
We consider a few linguistic constraints to help exclude some counter-intuitive strategy combinations. It should be noted that, with increased quality of the generation model, or by dynamically integrating the limitations of the generation model into the planning step, the process of inserting such additional constraints may be automated.

Negativity constraint. While our simple linear model estimates the level of politeness by the aggregated effects of all strategies used regardless of their polarity, humans are known to have a negativity bias (Baumeister et al., 2001): while the presence of polite markers in an otherwise impolite utterance may soften the tone, the use of a negative marker in an otherwise polite utterance may overshadow it. As a result, when an input is judged to be positive in politeness, we consider the additional constraint to exclude the use of negative strategies, i.e., x_s = 0, ∀s ∈ {s : b_s < 0}.

Subjunctive and Indicative constraint. Admittedly, among the set of markers we consider, some are more decoupled from content than others: while removing just is almost guaranteed to keep the original meaning of the sentence intact, for an utterance that starts with either Subjunctive or Indicative, e.g., could you clarify?, simply removing could you would already make its meaning ambiguous. To account for this, we add the constraint that Subjunctive and Indicative should only be substituted within themselves, i.e., x_Subjunctive + x_Indicative must remain equal to its value in the original message.
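Since the strategy set is small, the effect of the negativity constraint on the planning step can be illustrated with a brute-force stand-in for the ILP solver: enumerate binary strategy plans, discard those that use a negative-valence strategy when the input is judged positive, and keep the plan whose predicted politeness is closest to the target. The coefficients and function names below are illustrative placeholders, not the paper's actual ILP formulation or estimates.

```python
from itertools import product

# Illustrative strategy coefficients (b_s); negative values mark
# negative-valence strategies such as Swearing.
COEF = {"Gratitude": 0.8, "Please": 0.5, "Swearing": -1.2}

def plan(target, input_is_positive, coef=COEF):
    """Brute-force sketch of the planner: choose binary indicators x_s so that
    the predicted politeness shift is closest to `target`, subject to the
    negativity constraint (x_s = 0 for all s with b_s < 0 on positive inputs)."""
    strategies = sorted(coef)
    best, best_gap = None, float("inf")
    for bits in product([0, 1], repeat=len(strategies)):
        x = dict(zip(strategies, bits))
        # Negativity constraint: no negative strategies on a positive input.
        if input_is_positive and any(x[s] for s in x if coef[s] < 0):
            continue
        gap = abs(target - sum(coef[s] * x[s] for s in x))
        if gap < best_gap:
            best, best_gap = x, gap
    return best

print(plan(target=1.3, input_is_positive=True))
```

An actual ILP solver replaces this enumeration, but the feasible set carved out by the constraint is the same; the Subjunctive/Indicative constraint would analogously be added as a linear equality over the corresponding indicator variables.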

D Details on Human Evaluations
To evaluate the naturalness of the generated text, we ask two non-author native speakers to rate naturalness on a scale of 1 (very unnatural) to 5 (very natural). The exact instruction is shown in Table A2.

E Additional Generation Examples
We show additional generation outputs in Table A3, and a categorization of failure cases in Table A4.

Input (Please, Subjunctive, Gratitude): could you then please make some contributions in some of your many areas of expertise? thanks.

Table A3: Additional examples from the generation outputs, together with strategy information (original strategy combination for inputs in italics, realized strategies underlined for outputs) and naturalness scores. We also highlight the original and newly introduced markers through which the strategies are realized. Refer to Table A4 for common types of failure cases.

Error type: Grammatical mistake
Input: the bot seems to be down again. could you give it a nudge?
Output: the bot seems to be down again . maybe can you give it a nudge for me ? (Score: 3)
Input: i see you blocked could you provide your rationale? thanks -()
Output: i see you blocked provide your rationale ? ( please ) (Score: 2)

Error type: Strategy misfit
Input: hello, this image has no license info, could you please add it? thank you.
Output: hello , this image has no license info , sorry . could you add it for you ? thank you . (Score: 3)
Input: can you please review this or block or have it reviewed at ani? thank you
Output: no worries . sorry , can you review this or block or have it reviewed for me at ani ? (Score: 3)

Table A4: Examples demonstrating two representative error types with naturalness scores. Grammatical mistake represents cases where the markers are in inappropriate positions or introduce errors to the sentence structure. Strategy misfit represents cases where the use of the suggested strategies (regardless of the choice of markers to realize them) does not seem appropriate. Problematic portions of the outputs are in bold.