It’s Morphin’ Time! Combating Linguistic Discrimination with Inflectional Perturbations

Training on only perfect Standard English corpora predisposes pre-trained neural networks to discriminate against minorities from non-standard linguistic backgrounds (e.g., African American Vernacular English, Colloquial Singapore English, etc.). We perturb the inflectional morphology of words to craft plausible and semantically similar adversarial examples that expose these biases in popular NLP models, e.g., BERT and Transformer, and show that adversarially fine-tuning them for a single epoch significantly improves robustness without sacrificing performance on clean data.


Introduction
In recent years, Natural Language Processing (NLP) systems have become increasingly good at learning complex patterns in language by pretraining large language models like BERT, GPT-2, and CTRL (Devlin et al., 2019; Radford et al., 2019; Keskar et al., 2019), and fine-tuning them on task-specific data to achieve state-of-the-art results has become the norm. However, deep learning models are only as good as the data they are trained on.
Existing work on societal bias in NLP primarily focuses on attributes like race and gender (Bolukbasi et al., 2016;May et al., 2019). In contrast, we investigate a uniquely NLP attribute that has been largely ignored: linguistic background.
Current NLP models seem to be trained with the implicit assumption that everyone speaks fluent (often U.S.) Standard English, even though two-thirds (>700 million) of the English speakers in the world speak it as a second language (L2) (Eberhard et al., 2019). Even among native speakers, a significant number speak a dialect like African American Vernacular English (AAVE) rather than Standard English (Crystal, 2003). In addition, these World Englishes exhibit variation at multiple levels of linguistic analysis (Kachru et al., 2009). Therefore, putting these models directly into production without addressing this inherent bias puts them at risk of committing linguistic discrimination by performing poorly for many speech communities (e.g., AAVE and L2 speakers). This could take the form of either failing to understand these speakers (Rickford and King, 2016; Tatman, 2017) or misinterpreting them. For example, the recent mistranslation of a minority speaker's social media post resulted in his wrongful arrest (Hern, 2017).
Since L2 (and many L1 dialect) speakers often exhibit variability in their production of inflectional morphology (Lardiere, 1998; Prévost and White, 2000; Haznedar, 2002; White, 2003; Seymour, 2004), we argue that NLP models should be robust to inflectional perturbations in order to minimize their chances of propagating linguistic discrimination. Hence, in this paper, we:
• Propose MORPHEUS, a method for generating plausible and semantically similar adversaries by perturbing the inflections in clean examples (Figure 1). In contrast to recent work on adversarial examples in NLP (Belinkov and Bisk, 2018; Ebrahimi et al., 2018; Ribeiro et al., 2018), we exploit morphology to craft our adversaries.
• Demonstrate its effectiveness on multiple machine comprehension and translation models, including BERT and Transformer (Tables 1 & 2).
• Show that adversarially fine-tuning the model on an adversarial training set generated via weighted random sampling is sufficient for it to acquire significant robustness, while preserving performance on clean examples (Table 5).
To the best of our knowledge, we are the first to investigate the robustness of NLP models to inflectional perturbations and its ethical implications.

Related Work
Fairness in NLP. It is crucial that NLP systems do not amplify and entrench social biases (Hovy and Spruit, 2016). Recent research on fairness has primarily focused on racial and gender biases within distributed word representations (Bolukbasi et al., 2016), coreference resolution (Rudinger et al., 2018), sentence encoders (May et al., 2019), and language models. However, we posit that there exists a significant potential for linguistic bias that has yet to be investigated, which is the motivation for our work.
Adversarial attacks in NLP. First discovered in computer vision by Szegedy et al. (2014), adversarial examples are data points crafted with the intent of causing a model to output a wrong prediction. In NLP, this could take place at the character, morphological, lexical, syntactic, or semantic level. Jia and Liang (2017) showed that question answering models could be misled into choosing a distractor sentence in the passage that was created by replacing key entities in the correct answer sentence. Belinkov and Bisk (2018) followed by demonstrating the brittleness of neural machine translation systems against character-level perturbations like randomly swapping/replacing characters. However, these attacks are not optimized on the target models, unlike Ebrahimi et al. (2018), which makes use of the target model's gradient to find the character change that maximizes the model's error.
Since these attacks tend to disrupt the sentence's semantics, Ribeiro et al. (2018) and Michel et al. (2019) propose searching for adversaries that preserve semantic content. Alzantot et al. (2018) and Jin et al. (2019) explore the use of synonym substitution to create adversarial examples, using word embeddings to find the n nearest words. Eger et al. (2019) take a different approach, arguing that adding visual noise to characters leaves their semantic content undisturbed. Other work creates paraphrase adversaries by conditioning their generation on a syntactic template, while Zhang et al. (2019b) swap key entities in the sentences. Zhang et al. (2019a) provide a comprehensive survey of this topic.
Adversarial training. To ensure our NLP systems are not left vulnerable to powerful attacks, most existing work makes use of adversarial training to improve model robustness (Goodfellow et al., 2015). This involves augmenting the training data, either by adding the adversaries to the training set or by replacing the clean examples with them.
Summary. Existing work in fairness mostly focuses on tackling bias against protected attributes like race and gender, while work in adversarial NLP primarily investigates character- and word-level perturbations and seeks to improve model robustness by retraining from scratch on the adversarial training set. Our work uses perturbations in inflectional morphology to highlight the linguistic bias present in models such as BERT and Transformer, before showing that simply fine-tuning the models for one epoch on the adversarial training set is sufficient to achieve significant robustness while maintaining performance on clean data.

Generating Inflectional Perturbations
Inflectional perturbations inherently preserve the general semantics of a word since the root remains unchanged. In cases where a word's part of speech (POS) is context-dependent (e.g., duck as a verb or a noun), restricting perturbations to the original POS further preserves its original meaning.
Additionally, since second language speakers are prone to inflectional errors (Haznedar, 2002;White, 2003), adversarial examples that perturb the inflectional morphology of a sentence should be less perceivable to people who interact heavily with non-native speakers or are themselves non-native speakers. Hence, we present MORPHEUS, our proposed method for crafting inflectional adversaries.

MORPHEUS: A Greedy Approach
Problem formulation. Given a target model f and an original input example x with ground truth label y, our goal is to generate the adversarial example x_adv that maximizes f's loss. Formally, we aim to solve the following problem:

    x_adv = argmax_{x_c} L(y, f(x_c))    (1)

where x_c is an adversarial example generated by perturbing x, f(x_c) is the model's prediction, and L(·) is the model's loss function. In this setting, f is a neural model for solving a specific NLP task.
Proposed solution. To solve this problem, we propose MORPHEUS (Algorithm 1), an approach that greedily searches for the inflectional form of each noun, verb, or adjective in x that maximally increases f's loss (Eq. 1). For each token in x, MORPHEUS calls MAXINFLECTED to find the inflected form that causes the greatest increase in f's loss.

Table 1 presents some adversarial examples obtained by running MORPHEUS on state-of-the-art machine reading comprehension and translation models: namely, BERT (Devlin et al., 2019), SpanBERT (Joshi et al., 2019), and Transformer-big (Vaswani et al., 2017; Ott et al., 2018).
There are two possible approaches to implementing MAXINFLECTED: one is to modify each token independently from the others in parallel, and the other is to do it sequentially such that the increase in loss is accumulated as we iterate over the tokens. A major advantage of the parallel approach is that it is theoretically possible to speed it up by t times, where t is the number of tokens which are nouns, verbs, or adjectives. However, since current state-of-the-art models rely heavily on contextual representations, the sequential approach is likely to be more effective in finding combinations of inflectional perturbations that cause major increases in loss. We found this to be the case in our preliminary experiments (see Table 6 in Appendix D).
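The sequential variant described above can be sketched as follows. This is a minimal illustration, not the paper's exact Algorithm 1: the `get_inflections` and `loss_fn` callables are hypothetical stand-ins for GETINFLECTIONS and the target model's loss, and the loop accumulates each per-token perturbation so later decisions see earlier ones.

```python
def morpheus_sketch(tokens, eligible, get_inflections, loss_fn):
    """Greedy sequential search over inflectional perturbations.

    tokens:          list of word strings
    eligible:        indices of nouns/verbs/adjectives to perturb
    get_inflections: word -> candidate inflected forms (same POS assumed)
    loss_fn:         token list -> target model's loss (higher = worse for model)
    """
    best = list(tokens)
    best_loss = loss_fn(best)
    for i in eligible:
        kept = best[i]  # best form found so far for position i
        for candidate in get_inflections(kept):
            best[i] = candidate
            loss = loss_fn(best)
            if loss > best_loss:      # keep the loss-maximizing inflection
                best_loss = loss
                kept = candidate      # accumulate the perturbation
        best[i] = kept
    return best, best_loss
```

With a toy loss that simply counts tokens ending in "s", the search picks "ducks" over "duck" because it raises the loss; a real run would instead query the model once per candidate, which is why the number of queries scales with the (small) number of inflections per word.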
Assumptions. MORPHEUS treats the target model as a black box and, at most, requires only access to the model's logits to compute the loss. As mentioned, task-specific metrics may be used instead of the loss as long as the metric surface is not overly "flat", like a step function. Examples of inappropriate metrics are the exact match and F1 scores for extractive question answering, which tend to be 1 for most candidates but drop drastically for specific ones. This may affect MORPHEUS' ability to find an adversary that induces absolute model failure.
While the black box assumption has the advantage of not requiring access to the target model's gradients and parameters, a limitation is that we need to query the model for each candidate inflection's impact on the loss, unlike Ebrahimi et al. (2018)'s gradient-based approach. However, this is not an issue for inflectional perturbations, since each word usually has fewer than five possible inflections.
Candidate generation. We make use of lemminflect (https://github.com/bjascob/LemmInflect) to generate candidate inflectional forms in the GETINFLECTIONS method: the token is first lemmatized and then reinflected. Our implementation of GETINFLECTIONS also allows the user to specify whether the candidates should be constrained to the same universal part of speech.
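A toy version of this lemmatize-then-reinflect process might look like the following. The lemma and inflection tables here are illustrative stand-ins for lemminflect's data, kept deliberately tiny so the sketch is self-contained:

```python
# Minimal stand-in for GETINFLECTIONS: lemmatize, then enumerate inflections.
# These toy tables cover only "duck"; the paper uses lemminflect for real data.
LEMMAS = {"ducks": "duck", "ducked": "duck", "ducking": "duck", "duck": "duck"}
INFLECTIONS = {  # lemma -> universal POS -> surface forms
    "duck": {
        "NOUN": ["duck", "ducks"],
        "VERB": ["duck", "ducks", "ducked", "ducking"],
    },
}

def get_inflections(token, upos=None):
    """Return candidate inflected forms of `token`.

    If `upos` is given, constrain candidates to that universal POS
    (the semantics-preserving mode described in the text).
    """
    lemma = LEMMAS.get(token.lower(), token.lower())
    by_pos = INFLECTIONS.get(lemma, {})
    if upos is not None:
        return list(by_pos.get(upos, []))
    # Unconstrained mode: pool candidates across all parts of speech.
    return sorted({form for forms in by_pos.values() for form in forms})
```

For example, `get_inflections("ducked", upos="NOUN")` yields only the noun forms, whereas dropping the constraint also surfaces verb forms like "ducking".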
Semantic preservation. MORPHEUS constrains its search to inflections belonging to the same universal part of speech. For example, take the word "duck". Depending on the context, it may be either a verb or a noun. In the sentence "There's a jumping duck", "duck" is a noun, and MORPHEUS may only choose alternate inflections associated with nouns.
This has a higher probability of preserving the sentence's semantics compared to most other approaches, like character/word shuffling or synonym swapping, since the root word and its position in the sentence remains unchanged.
Early termination. MORPHEUS selects an inflection if it increases the loss. In order to avoid unnecessary searching, it terminates once it finds an adversarial example that induces model failure. In our case, we define this as a score of 0 on the task's evaluation metric (the higher, the better).
Other implementation details. In order to increase overall inflectional variation in the set of adversarial examples, GETINFLECTIONS shuffles the generated list of inflections before returning it (see Figure 4 in Appendix). Doing this has no effect on MORPHEUS' ability to induce misclassification, but prevents overfitting during adversarial fine-tuning, which we discuss later in Section 6. Additionally, since MORPHEUS greedily perturbs each eligible token in x, it may get stuck in a local maximum for some values of x. To mitigate this, we run it again on the reversed version of x if the early termination criterion was not fulfilled during the forward pass.
Experiments

NLP tasks. To evaluate the effectiveness of MORPHEUS at inducing model failure, we test it on two popular NLP tasks: question answering (QA) and machine translation (MT). QA involves language understanding (classification), while MT also involves language generation. Both are widely used by consumers of diverse linguistic backgrounds and hence have a high chance of propagating discrimination.
Baseline. In the experiments below, we include a random baseline that randomly inflects each eligible word in each original example.
Measures. In addition to the raw scores, we also report the relative decrease for easier comparison across models, since they perform differently on the clean dataset. Relative decrease (d_r) is calculated as

    d_r = (score_clean − score_adversarial) / score_clean

where score_clean and score_adversarial are the model's scores on the clean and adversarial versions of the evaluation set, respectively.
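As a concrete sketch, relative decrease (the fraction of the clean-data score lost under attack) can be computed as:

```python
def relative_decrease(clean_score, adv_score):
    # d_r: (clean - adversarial) / clean, i.e. the fraction of the
    # clean-data score that the attack wipes out
    return (clean_score - adv_score) / clean_score

# e.g., a model scoring 80 F1 on clean data and 60 F1 under attack
# loses a quarter of its clean-data performance
assert abs(relative_decrease(80.0, 60.0) - 0.25) < 1e-9
```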

Extractive Question Answering
Given a question and a passage containing the span corresponding to the correct answer, the model is expected to predict that span. Performance for this task is computed using exact match or average F1 (Rajpurkar et al., 2016). We evaluate the effectiveness of our attack using average F1, which is more forgiving (to the target model); in our experiments, the exact match score is usually 3-9 points lower than the average F1 score.

Datasets. We use SQuAD 1.1 and SQuAD 2.0, both based on Wikipedia articles. SQuAD 1.1 guarantees that the passages contain valid answers to the questions posed (Rajpurkar et al., 2016). SQuAD 2.0 increases the task's difficulty by adding another 50,000 unanswerable questions; models are expected to identify when a passage does not contain an answer to the given question (Rajpurkar et al., 2018). Since the test sets are not public, we generate adversarial examples from, and evaluate the models on, the standard dev sets. In addition, the answerable questions from SQuAD 2.0 are used in place of SQuAD 1.1 to evaluate models trained on SQuAD 1.1. This allows for easy comparison between the SQuAD 1.1-fine-tuned models and the SQuAD 2.0-fine-tuned ones on answerable questions; we found performance on the answerable questions from SQuAD 2.0 to be comparable to SQuAD 1.1.
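The average F1 metric used above is a token-overlap score between the predicted and gold answer spans. A minimal sketch follows; note that the official SQuAD evaluation additionally lowercases, strips punctuation, and removes articles before comparing, which we omit here:

```python
from collections import Counter

def token_f1(prediction, ground_truth):
    """SQuAD-style token-overlap F1 between two answer strings (simplified)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    # Multiset intersection: each shared token counts at most as often
    # as it appears in both strings.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

This partial-credit behavior is exactly why F1 is "more forgiving" than exact match: a prediction like "the quick fox" against gold "the fox" scores 0.8 F1 but 0 exact match.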
Models. We evaluate MORPHEUS on an implementation of BiDAF (Seo et al., 2017), a common baseline model for SQuAD 1.1; its contextual-embedding variant, ELMo-BiDAF; and the transformers implementation (Wolf et al., 2019) of BERT and of SpanBERT, a pre-training method focusing on span prediction that outperforms BERT on multiple extractive QA datasets.

Results and Discussion
From Table 2, we see that models based on contextual embeddings (e.g., ELMo and BERT variants) tend to be more robust than those using fixed word embeddings (GloVe-BiDAF). This difference is likely due to the pre-training process, which gives them greater exposure to a wider variety of contexts in which different inflections occur. Removing the POS constraint degrades the models' performance by another 10% of the original score; however, this further drop is likely due to changes in the semantics and expected output of the examples.
BiDAF vs. BERT. Even after accounting for the performance difference on clean data, the BiDAF variants are significantly less robust to inflectional adversaries compared to the BERT variants. This is likely a result of BERT's greater representational power and masked language modeling pre-training procedure. Randomly masking out words during pre-training could have improved the models' robustness to small, local perturbations (like ours).
BERT vs. SpanBERT. In the context of question answering, SpanBERT appears to be slightly more robust than vanilla BERT when comparing overall performance on the two SQuAD datasets. However, the difference becomes significant if we look only at the SQuAD 2.0-fine-tuned models' performance on answerable questions (7% difference). This indicates that BERT has a stronger bias than SpanBERT towards predicting "no answer" when it encounters inflectional perturbations.

SQuAD 1.1 vs. SQuAD 2.0. The ability to "know what you don't know" (Rajpurkar et al., 2018) appears to have been obtained at a great cost. The SQuAD 2.0-fine-tuned models are not only generally less robust to inflectional errors than their SQuAD 1.1 equivalents (6.5% difference), but also significantly less adept at handling answerable questions (12-18% difference). This discrepancy suggests a stronger bias in SQuAD 2.0 models towards predicting "no answer" upon receiving sentences containing inflectional errors (see Table 1).
As we alluded to earlier, this is particularly troubling: since SQuAD 2.0 presents a more realistic  scenario than SQuAD 1.1, it is fair to conclude that such models will inadvertently discriminate against L2 speakers if put into production as is.
Transferability. Next, we investigate the transferability of adversarial examples found by MORPHEUS across different QA models and present some notable results in Table 3. The adversarial examples found for GloVe-BiDAF transfer to a limited extent to other models trained on SQuAD 1.1; however, they have a much greater impact on BERT SQuAD 2 and SpanBERT SQuAD 2 (3-4x more). We observe a similar pattern for adversarial examples found for SpanBERT SQuAD 1.1. Of the two, BERT is more brittle in general: the SpanBERT SQuAD 1.1 adversaries have a greater effect on BERT SQuAD 2's performance on answerable questions than on SpanBERT SQuAD 2's.
Discussion. One possible explanation for the SQuAD 2.0 models' increased fragility is the difference in the tasks they were trained for: SQuAD 1.1 models expect all questions to be answerable and only need to contend with finding the right span, while SQuAD 2.0 models have the added burden of predicting whether a question is answerable.
Therefore, in SQuAD 1.1 models, the feature space corresponding to a possible answer ends where the space corresponding to another possible answer begins, and there is room to accommodate slight variations in the input (i.e., larger individual spaces). We believe that in SQuAD 2.0 models, the need to accommodate the unanswerable prediction forces the spaces corresponding to the possible answers to shrink, with unanswerable spaces potentially filling the gaps between them. For SQuAD 2.0 models, this increases the probability of an adversarial example "landing" in the space corresponding to the unanswerable prediction. This would explain the effectiveness of adversarial fine-tuning in Section 6, which intuitively creates a "buffer" zone and expands the decision boundaries around each clean example.
The diminished effectiveness of the transferred adversaries at inducing model failure is likely due to each model learning slightly different segmentations of the answer space. As a result, different small, local perturbations have different effects on each model. We leave the in-depth investigation of the above phenomena to future work.

Machine Translation
We now demonstrate MORPHEUS' ability to craft adversaries for NMT models as well, this time without access to the models' logits. The WMT'14 English-French test set (newstest2014), containing 3,003 sentence pairs, is used for both evaluation and generating adversarial examples. We evaluate our attack on the fairseq implementation of both the Convolutional Seq2Seq (Gehring et al., 2017) and Transformer-big models, and report the BLEU score (Papineni et al., 2002) using fairseq's implementation (Ott et al., 2019).
From our experiments (Table 2), ConvS2S and Transformer-big appear to be extremely brittle even to inflectional perturbations constrained to the same part of speech (56-57% decrease). In addition, some adversarial examples caused the models to regenerate the input verbatim instead of a translation: 1.4% of the test set for Transformer-big, 3% for ConvS2S (see Table 9 in the Appendix for some examples). This is likely due to the joint source/target byte-pair encoding (Sennrich et al., 2016) used by both NMT systems to tackle rare word translation.
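Measuring the verbatim-regeneration failure mode mentioned above amounts to a simple equality check between each source sentence and the model's output. A minimal sketch (whitespace-normalized comparison; a stricter check might also normalize casing or tokenization):

```python
def copy_rate(sources, translations):
    """Fraction of outputs that merely regenerate the source verbatim,
    the failure mode observed for some inflectional adversaries."""
    copies = sum(
        src.strip() == hyp.strip()
        for src, hyp in zip(sources, translations)
    )
    return copies / len(sources)
```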
We experimented with both BLEU and chrF (Popović, 2015) as our optimizing criterion and achieved comparable results for both; however, when optimizing for chrF, MORPHEUS found more adversarial examples that caused the model to output random sentences about Nicolas Sarkozy.

Human Evaluation
To test our hypothesis that inflectional perturbations are likely to be natural and semantics-preserving, we randomly sample 130 adversarial examples and ask Amazon Mechanical Turk workers to rate their plausibility and semantic similarity to the original examples. To ensure the quality of our results, only Turkers who completed >10,000 HITs with a ≥99% acceptance rate could access our task. For comparison, we also report ratings by native U.S. English speakers, who were selected via a demographic survey and fluency test adapted from Hartshorne et al. (2018). Workers were paid a rate of at least $12/hr.

Table 4 shows that Turkers from our unrestricted sample judged ~95% of our adversaries to be plausibly written by a human and 92% to be semantically equivalent to the original examples, validating our hypothesis. Qualitative analysis revealed that "is/are"→"am/been" changes accounted for 48% of the implausible adversaries.
Discussion. We believe that non-native speakers may tend to rate sentences as more human-like for the following reasons:
• Their native exposure to another language leads them to accept sentences that mimic errors made by L2 English speakers who share their first language.
• Their exposure to the existence of these abovementioned errors may lead them to be more forgiving of other inflectional errors that are uncommon to them; they may deem such errors plausibly made by an L2 speaker who speaks a different first language.
• They do not presume mastery of English, and hence may choose to give the higher score when deciding between two choices.

[Footnotes: Only adversarial examples that degraded the F1 score by >50 points and the BLEU score by >15 points were considered. We define a beginner as one who has just started learning the language, and an L2 speaker as an experienced speaker. Each task was estimated to take 20-25s to complete comfortably, but tasks were routinely completed in under 20s.]

Adversarial Fine-tuning
In this section, we extend the standard adversarial training paradigm (Goodfellow et al., 2015) to make the models robust to inflectional perturbations. Since directly running MORPHEUS on the entire training dataset to generate adversaries would be far too time-consuming, we use the findings from our experiments on the respective dev/test sets (Section 4) to create representative samples of good adversaries. This significantly improves robustness to inflectional perturbations while maintaining similar performance on the clean data. We first present an analysis of the inflectional distributions before elaborating on our method for generating the adversarial training set.

Figure 2a illustrates the overall distributional differences in inflection occurrence between the original examples and the adversarial examples found by MORPHEUS for SQuAD 2.0. Note that these distributions are computed over Penn Treebank (PTB) POS tags, which are finer-grained than the universal POS (UPOS) tags used to constrain MORPHEUS' search (Section 4). For example, a UPOS VERB may actually be a PTB VBD, VBZ, VBG, etc.

Distributional Analysis
We can see obvious differences between the global inflectional distributions of the original datasets and the adversaries found by MORPHEUS. The differences are particularly significant for the NN, NNS, and VBG categories. NNS and VBG also happen to be uncommon in the original distribution. Therefore, we conjecture that the models failed (Section 4) because MORPHEUS is able to find the contexts in the training data where these inflections are uncommon.
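The PTB-tag distributions compared above can be computed with a single counting pass over a POS-tagged corpus. A minimal sketch; the (token, tag) input format and the set of inflection-bearing tags are assumptions for illustration:

```python
from collections import Counter

def inflection_distribution(
    tagged_corpus,
    tags=("NN", "NNS", "VB", "VBD", "VBZ", "VBG", "VBN", "JJ"),
):
    """Normalized frequency of each PTB inflection tag in a tagged corpus.

    tagged_corpus: iterable of (token, ptb_tag) pairs, e.g. tagger output.
    Tags outside `tags` (DT, IN, ...) are ignored, since only nouns,
    verbs, and adjectives carry inflection.
    """
    counts = Counter(tag for _, tag in tagged_corpus if tag in tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: counts[tag] / total for tag in tags}
```

Comparing the output of this function on the original dev set against the adversarial examples is what surfaces the over-represented categories (e.g., NNS and VBG) discussed above.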

Adversarial Training Set Generation
Since there is an obvious distributional difference between the original and adversarial examples, we hypothesize that bringing the training set's inflectional distribution closer to that of the adversarial examples will improve the models' robustness.
To create the adversarial training set, we first isolate all the adversarial examples (from the dev/test set) that caused any decrease in F 1 /BLEU score and count the number of times each inflection is used in this adversarial dataset, giving us the inflectional distribution in Figure 2a.
Next, we randomly select an inflection for each eligible token in each training example, weighting the selection with this inflectional distribution instead of a uniform one. To avoid introducing unnecessary noise into our training data, only inflections from the same UPOS as the original word are chosen. We do this 4 times per training example, resulting in an adversarial training set with a clean-adversarial ratio of 1 : 4. This can be done in linear time and is highly scalable. Algorithm 2 in Appendix C details our approach and Figure 2b depicts the training set's inflectional distribution before and after this procedure.
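The weighted sampling procedure above can be sketched as follows. This is an illustrative stand-in for Algorithm 2, not its exact implementation: the `inflections_fn` callable and the shape of `adv_dist` are assumptions, and candidates are assumed to already be restricted to the original word's UPOS.

```python
import random

def perturb_example(tokens, eligible, inflections_fn, adv_dist, k=4, seed=0):
    """Generate k adversarially-inflected copies of a training example by
    weighted random sampling from the adversarial inflectional distribution.

    eligible:       indices of tokens whose UPOS allows reinflection
    inflections_fn: token -> list of (surface_form, ptb_tag) same-UPOS candidates
    adv_dist:       ptb_tag -> weight from the adversarial distribution
    """
    rng = random.Random(seed)
    copies = []
    for _ in range(k):
        copy = list(tokens)
        for i in eligible:
            candidates = inflections_fn(tokens[i])
            if not candidates:
                continue
            # Weight each candidate by its tag's adversarial frequency
            # (tiny floor so unseen tags remain selectable).
            weights = [adv_dist.get(tag, 1e-9) for _, tag in candidates]
            form, _ = rng.choices(candidates, weights=weights, k=1)[0]
            copy[i] = form
        copies.append(copy)
    return copies
```

Each training example yields k perturbed copies in a single pass, which is why the overall procedure runs in linear time; with k=4 this produces the 1:4 clean-adversarial ratio described above.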
Fine-tuning vs. retraining. Existing adversarial training approaches have shown that retraining the model on the augmented training set improves robustness (Belinkov and Bisk, 2018;Eger et al., 2019;Jin et al., 2019). However, this requires substantial compute resources. We show that finetuning the pre-trained model for just a single epoch is sufficient to achieve significant robustness to inflectional perturbations yet still maintain good performance on the clean evaluation set (Table 5).

Experiments
SpanBERT. Following Joshi et al. (2019), we fine-tune SpanBERT SQuAD 2 for another 4 epochs on our adversarial training set. Table 5 shows the effectiveness of our approach for SpanBERT SQuAD 2 .
After just a single epoch of fine-tuning, SpanBERT SQuAD 2 becomes robust to most of the initial adversarial examples with a < 2-point drop in performance on the clean dev set. More importantly, running MORPHEUS on the robust model fails to significantly degrade its performance.
After 4 epochs, the performance on the clean SQuAD 2.0 dev set is almost equivalent to the original SpanBERT SQuAD 2 's, however this comes at a slight cost: the performance on the answerable questions is slightly lower than before. In fact, if performance on answerable questions is paramount, our results show that fine-tuning on the adversarial training set for 1 epoch would be a better (and more cost effective) decision. Retraining SpanBERT adversarially did not result in better performance.
We also found that weighting the random sampling with the adversarial distribution helped to improve the robust model's performance on the answerable questions (refer to Table 7 in Appendix).
Transformer-big. Similarly, model robustness improves dramatically (56.25% to 20.20% decrease) after fine-tuning for 1 epoch on the adversarial training set with a ∼3 BLEU point drop in clean data performance (Table 5). Fine-tuning for a further 3 epochs reduced the difference but made the model less robust to new adversarial examples.
We also experimented with using randomly sampled subsets but found that utilizing the entire original training set was necessary for preserving performance on the clean data (see Table 8 in Appendix).

Discussion
Our anonymous reviewers brought up the possibility of using grammatical error correction (GEC) systems as a defense against inflectional adversaries. Although we agree that adding a GEC model before the actual NLU/translation model would likely help, this would not only require maintaining an extra model (often another Transformer (Bryant et al., 2019)) and its training data, but would also double the resource usage of the combined system at inference time.
Consequently, institutions with limited resources may choose to sacrifice the experience of minority users rather than incur the extra maintenance costs. Adversarial fine-tuning only requires the NLU/translation model to be fine-tuned once and consumes no extra resources at inference time.

Limitations and Future Work
Although we have established our methods' effectiveness at both inducing model failure and robustifying said models, we believe they could be further improved by addressing the following limitations:
1. MORPHEUS finds the distribution of examples that are adversarial for the target model, rather than that of real L2 speaker errors, which produced some unrealistic adversarial examples.
2. Our method of adversarial fine-tuning is analogous to curing the symptom rather than addressing the root cause since it would have to be performed for each domain-specific dataset the model is trained on.
In future work, we intend to address these limitations by directly modeling the L2 and dialectal distributions and investigating the possibility of robustifying these models further upstream.

Conclusion
Ensuring that NLP technologies are inclusive, in the sense of working for users with diverse linguistic backgrounds (e.g., speakers of World Englishes such as AAVE, as well as L2 speakers), is especially important since natural language user interfaces are becoming increasingly ubiquitous. We take a step in this direction by revealing the existence of linguistic bias in current English NLP models (e.g., BERT and Transformer) through the use of inflectional adversaries, before using adversarial fine-tuning to significantly reduce it. To find these adversarial examples, we propose MORPHEUS, which crafts plausible and semantically similar adversaries by perturbing an example's inflectional morphology in a constrained fashion, without needing access to the model's gradients. Next, we demonstrate the adversaries' effectiveness using QA and MT, two tasks with direct and wide-ranging applications, before validating their plausibility and semantic content with human raters.
Finally, we show that, instead of retraining the model, fine-tuning it on a representative adversarial training set for a single epoch is sufficient to achieve significant robustness to inflectional adversaries while preserving performance on the clean dataset. We also present a method of generating this adversarial training set in linear time by making use of the adversarial examples' inflectional distribution to perform weighted random sampling.

A Examples of Inflectional Variation in English Dialects
African American Vernacular English (Wolfram, 2004)
• They seen it.
• They run there yesterday.
• The folks was there.
Colloquial Singapore English (Singlish)
• It cover up everything in the floss. It's not nice. It look very cheap.
• I want to shopping only.

B More Details on Human Evaluation

Figure 3 contains a screenshot of the UI we present to crowd workers. We intentionally prime Turkers by asking if the sentence could be written by an L2 speaker, instead of directly asking for acceptability/naturalness ratings, in order to ensure that they consider these possibilities.
We also do not use the Semantic Textual Similarity evaluation scheme (Agirre et al., 2013); during preliminary pilot studies, we discovered that annotators interpreted certain words in the scheme (e.g., "information", "details", and "topics") considerably differently, introducing substantial noise into an already subjective judgement task.
Possible limitations. It is possible that seeing the original sentence could affect the worker's judgment of the perturbed sentence's plausibility. However, we argue that this is not necessarily negative, since seeing the original sentence makes it easier to spot perturbations that are outright wrong (i.e., errors a human would not make regardless of their level of fluency).

Figure 4: Effect of shuffling the inflection list on the adversarial distribution. We observe that shuffling the inflection list induces a more uniform inflectional distribution by reducing the higher frequency inflections and boosting the lower frequency ones.

C Adversarial Training Set Generation
Original Source: According to Detroit News, the queen of Soul will be performing at the Sound Board hall of MotorCity Casino Hotel on 21 December.
Adversarial Source: Accorded to Detroit News, the queen of Soul will be performing at the Sound Board hall of MotorCity Casino Hotel on 21 December.
Original Translation: Selon Detroit News, la reine de Soul se produira au Sound Board Hall de l'hôtel MotorCity Casino le 21 décembre.

Original Source: Intersex children pose ethical dilemma.
Adversarial Source: Intersex child posing ethical dilemma.
Original Translation: Les enfants intersexuels posent un dilemme éthique.

Original Source: The Guangzhou-based New Express made a rare public plea for the release of journalist Chen Yongzhou.
Adversarial Source: The Guangzhou-based New Expresses making a rare public plea for the release of journalist Chen Yongzhou.
Original Translation: Le New Express, basé à Guangzhou, a lancé un rare appel public en faveur de la libération du journaliste Chen Yongzhou.

Original Source: Cue stories about passport controls at Berwick and a barbed wire border along Hadrian's Wall.
Adversarial Source: Cue story about passport controls at Berwick and a barbed wires borders along Hadrian's Walls.
Original Translation: Cue histoires sur le contrôle des passeports à Berwick et une frontière de barbelés le long du mur d'Hadrien.

Table 9: Some of the adversaries that caused Transformer-big to output the source sentence instead of a translation.