Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack

The detection of offensive language in the context of a dialogue has become an increasingly important application of natural language processing. The detection of trolls in public forums (Galán-García et al., 2016), and the deployment of chatbots in the public domain (Wolf et al., 2017) are two examples that show the necessity of guarding against adversarially offensive behavior on the part of humans. In this work, we develop a training scheme for a model to become robust to such human attacks by an iterative build it, break it, fix it scheme with humans and models in the loop. In detailed experiments we show this approach is considerably more robust than previous systems. Further, we show that offensive language used within a conversation critically depends on the dialogue context, and cannot be viewed as a single sentence offensive detection task as in most previous work. Our newly collected tasks and methods are all made open source and publicly available.


Introduction
The detection of offensive language has become an important topic as the online community has grown, as so too have the number of bad actors (Cheng et al., 2017). Such behavior includes, but is not limited to, trolling in public discussion forums (Herring et al., 2002) and via social media (Silva et al., 2016;Davidson et al., 2017), employing hate speech that expresses prejudice against a particular group, or offensive language specifically targeting an individual. Such actions can be motivated to cause harm from which the bad actor derives enjoyment, despite negative consequences to others (Bishop, 2014). As such, some bad actors go to great lengths to both avoid detection and to achieve their goals (Shachaf and Hara, 2010). In that context, any attempt to automatically detect this behavior can be expected to be adversarially attacked by looking for weaknesses in the detection system, which currently can easily be exploited as shown in (Hosseini et al., 2017;Gröndahl et al., 2018). A further example, relevant to the natural langauge processing community, is the exploitation of weaknesses in machine learning models that generate text, to force them to emit offensive language. Adversarial attacks on the Tay chatbot led to the developers shutting down the system (Wolf et al., 2017).
In this work, we study the detection of offensive language in dialogue with models that are robust to adversarial attack. We develop an automatic approach to the "Build it Break it Fix it" strategy originally adopted for writing secure programs (Ruef et al., 2016), and the "Build it Break it" approach consequently adapting it for NLP (Ettinger et al., 2017). In the latter work, two teams of researchers, "builders" and "breakers" were used to first create sentiment and semantic role-labeling systems and then construct examples that find their faults. In this work we instead fully automate such an approach using crowdworkers as the humansin-the-loop, and also apply a fixing stage where models are retrained to improve them. Finally, we repeat the whole build, break, and fix sequence over a number of iterations.
We show that such an approach provides more and more robust systems over the fixing iterations. Analysis of the type of data collected in the iterations of the break it phase shows clear distribution changes, moving away from simple use of profanity and other obvious offensive words to utterances that require understanding of world knowledge, figurative language, and use of negation to detect if they are offensive or not. Further, data collected in the context of a dialogue rather than a sentence without context provides more sophisticated attacks. We show that model architectures that use the dialogue context efficiently perform much bet-ter than systems that do not, where the latter has been the main focus of existing research (Wulczyn et al., 2017;Davidson et al., 2017;Zampieri et al., 2019).
Code for our entire build it, break it, fix it algorithm is made open source, complete with model training code and crowdsourcing interface for humans. Our data and trained models will also be made available for the community via ParlAI 1 .

Related Work
The task of detecting offensive language has been studied across a variety of content classes. Perhaps the most commonly studied class is hate speech, but work has also covered bullying, aggression, and toxic comments (Zampieri et al., 2019).
To this end, various datasets have been created to benchmark progress in the field. In hate speech detection, recently Davidson et al. (2017) compiled and released a dataset of over 24,000 tweets labeled as containing hate speech, offensive language, or neither. In order to benchmark toxic comment detection, The Wikipedia Toxic Comments dataset (which we study in this work) was collected and extracted from Wikipedia Talk pages and featured in a Kaggle competition (Wulczyn et al., 2017;Google, 2018). Each of these benchmarks examine only single-turn utterances, outside of the context in which the language appeared. In this work we recommend that future systems should move beyond classification of singular utterances and use contextual information to help identify offensive language.
Many approaches have been taken to solve these tasks -from linear regression and SVMs to deep learning (Noever, 2018). The best performing systems in each of the competitions mentioned above (for aggression and toxic comment classification) used deep learning approaches such as LSTMs and CNNs (Kumar et al., 2018;Google, 2018). In this work we consider a large-pretrained transformer model which has been shown to perform well on many downstream NLP tasks (Devlin et al., 2018).
The broad class of adversarial training is currently a hot topic in machine learning (Goodfellow et al., 2014). Use cases include training image generators (Brock et al., 2018) as well as image classifiers to be robust to adversarial examples (Liu et al., 2019). These methods find the break-ing examples algorithmically, rather than by using humans breakers as we do. Applying the same approaches to NLP tends to be more challenging because, unlike for images, even small changes to a sentence can cause a large change in the meaning of that sentence, which a human can detect but a lower quality model cannot. Nevertheless algorithmic approaches have been attempted, for example in text classification (Ebrahimi et al., 2018), machine translation (Belinkov and Bisk, 2018), dialogue generation tasks (Li et al., 2017) and reading comprehension (Jia and Liang, 2017). The latter was particularly effective at proposing a more difficult version of the popular SQuAD dataset.
As mentioned in the introduction, our approach takes inspiration from "Build it Break it" approaches which have been successfully tried in other domains (Ruef et al., 2016;Ettinger et al., 2017). Those approaches advocate finding faults in systems by having humans look for insecurities (in software) or prediction failures (in models), but do not advocate an automated approach as we do here. Our work is also closely connected to the "Mechanical Turker Descent" algorithm detailed in (Yang et al., 2018) where language to action pairs were collected from crowdworkers by incentivizing them with a game-with-a-purpose technique: a crowdworker receives a bonus if their contribution results in better models than another crowdworker. We did not gamify our approach in this way, but still our approach has commonalities in the round-based improvement of models through crowdworker interaction.

Baselines: Wikipedia Toxic Comments
In this section we describe the publicly available data that we have used to bootstrap our build it break it fix it approach. We also compare our model choices with existing work and clarify the metrics chosen to report our results.

Wikipedia Toxic Comments The Wikipedia
Toxic Comments dataset (WTC) has been collected in a common effort from the Wikimedia Foundation and Jigsaw (Wulczyn et al., 2017) to identify personal attacks online. The data has been extracted from the Wikipedia Talk pages, discussion pages where editors can discuss improvements to articles or other Wikipedia pages. We considered the version of the dataset that corresponds to the Kaggle competition: "Toxic Comment Classification Challenge" (Google, 2018) which features 7 classes of toxicity: toxic, severe toxic, obscene, threat, insult, identity hate and non-toxic. In the same way as in (Khatri et al., 2018), every label except non-toxic is grouped into a class OFFENSIVE while the non-toxic class is kept as the SAFE class. In order to compare our results to (Khatri et al., 2018), we similarly split this dataset to dedicate 10% as a test set. 80% are dedicated to train set while the remaining 10% is used for validation. Statistics on the dataset are shown in Table 1.
Models We establish baselines using two models. The first one is a binary classifier built on top of a large pre-trained transformer model. We use the same architecture as in BERT (Devlin et al., 2018). We add a linear layer to the output of the first token ([CLS]) to produce a final binary classification. We initialize the model using the weights provided by (Devlin et al., 2018) corresponding to "BERT-base". The transformer is composed of 12 layers with hidden size of 768 and 12 attention heads. We fine-tune the whole network on the classification task. We also compare it the fastText classifier (Joulin et al., 2017) for which a given sentence is encoded as the average of individual word vectors that are pre-trained on a large corpus issued from Wikipedia. A linear layer is then applied on top to yield a binary classification.
Experiments We compare the two aforementioned models with (Khatri et al., 2018) who conducted their experiments with a BiLSTM with GloVe pre-trained word vectors (Pennington et al., 2014). Results are listed in Table 2 and we compare them using the weighted-F1, i.e. the sum of F1 score of each class weighted by their frequency in the dataset. We also report the F1 of the OFFENSIVE-class which is the metric we favor within this work, although we report both. (Note that throughout the paper, the notation F1 is always referring to OFFENSIVE-class F1.) Indeed, in the case of an imbalanced dataset such as Wikipedia Toxic Comments where most samples are SAFE, the weighted-F1 is closer to the F1 score of the SAFE class while we focus on detecting OFFENSIVE content. Our BERT-based model outperforms the method from Khatri et al. (2018); throughout the rest of the paper, we use the BERTbased architecture in our experiments. In particular, we used this baseline trained on WTC to bootstrap our approach, to be described subsequently.    Figure 1: The build it, break it, fix it algorithm we use to iteratively train better models A 0 , . . . , A N . In experiments we perform N = 3 iterations of the break it, fix it loop for the single-turn utterance detection task, and a further iteration for the multi-turn task in a dialogue context setting.
conversation, without imposing our own definitions of what that means. The phrase "with someone you just met online" was meant to mimic the setting of a public forum.
Crowderworker Task We ask crowdworkers to try to "beat the system" by submitting messages that our system marks as SAFE but that the worker considers to be OFFENSIVE. For a given round, workers earn a "game" point each time they are able to "beat the system," or in other words, trick the model by submitting OFFENSIVE messages that the model marks as SAFE. Workers earn up to 5 points each round, and have two tries for each point: we allow multiple attempts per point so that workers can get feedback from the models and better understand their weaknesses. The points serve to indicate success to the crowdworker and motivate to achieve high scores, but have no other meaning (e.g. no monetary value as in (Yang et al., 2018)). More details regarding the user interface and instructions can be found in Appendix B.
Models to Break During round 1, workers try to break the baseline model A 0 , trained on Wikipedia Toxic Comments. For rounds i, i > 1, workers must break both the baseline model and the model from the previous "fix it" round, which we refer to as A i 1 . In that case, the worker must submit messages that both A 0 and A i 1 mark as SAFE but which the worker considers to be OFFENSIVE.

Fix it Details
During the "fix it" round, we update the models with the newly collected adversarial data from the "break it" round. The training data consists of all previous rounds of data, so that model A i is trained on all rounds n for n  i, as well as the Wikipedia Toxic Comments data. We split each round of data into train, validation, and test partitions. The validation set is used for hyperparameter selection.
The test sets are used to measure how robust we are to new adversarial attacks. With increasing round i, A i should become more robust to increasingly complex human adversarial attacks.

Single-Turn Task
We first consider a single-turn set-up, i.e. detection of offensive language in one utterance, with no dialogue context or conversational history.

Data Collection
Adversarial Collection We collected three rounds of data with the build it, break it, fix it algorithm described in the previous section. Each round of data consisted of 1000 examples, leading to 3000 single-turn adversarial examples in total. For the remainder of the paper, we refer to this method of data collection as the adversarial method.
Standard Collection In addition to the adversarial method, we also collected data in a nonadversarial manner in order to directly compare the two set-ups. In this method -which we refer to as the standard method, we simply ask crowdworkers to submit messages that they consider to be OFFENSIVE. There is no model to break. Instructions are otherwise the same. In this set-up, there is no real notion of "rounds", but for the sake of comparison we refer to each subsequent 1000 examples collected in this manner as a "round". We collect 3000 examples -or three rounds of data. We refer to a model trained on rounds n  i of the standard data as S i .

Task Formulation Details
Since all of the collected examples are labeled as OFFENSIVE, to make this task a binary classification problem, we will also add SAFE examples to it. The "safe data" is comprised of utterances from the ConvAI2 chit-chat task (Dinan et al., 2019; which consists of pairs of humans getting to know each other by discussing their interests. Each utterance we used was reviewed by two independent crowdworkers and labeled as SAFE, with the same characterization of SAFE as described before. For each partition (train, validation, test), the final task has a ratio of 9:1 SAFE to OFFENSIVE examples, mimicking the division of the Wikipedia Toxic Comments dataset used for training our baseline models. Dataset statistics for the final task can be found in Table 5. We refer to these tasks -with both SAFE and OFFENSIVE examples -as the adversarial and standard tasks.

Model Training Details
Using the BERT-based model architecture described in Section 3, we trained models on each round of the standard and adversarial tasks, multi-tasking with the Wikipedia Toxic Comments task. We weight the multi-tasking with a mixing parameter which is also tuned on the validation set. Finally, after training weights with the cross entropy loss, we adjust the final bias also using the validation set. We optimize for the sensitive class (i.e. OFFENSIVE-class) F1 metric on the standard and adversarial validation sets respectively. For each task (standard and adversarial), on round i, we train on data from all rounds n for n  i and optimize for performance on the validation sets n  i.

Experimental Results
We conduct experiments comparing the adversarial and standard methods. We break down the results into "break it" results comparing the data col-% with % with avg. # avg. # profanity "not" chars tokens Std. (Rnds 1-3 Table 4: Percent of OFFENSIVE examples in each task containing profanity, the token "not", as well as the average number of characters and tokens in each example. Rows 1-4 are the single-turn task, and the last row is the multi-turn task. Later rounds have less profanity and more use of negation as human breakers have to find more sophisticated language to adversarially attack our models.  Table 5: Dataset statistics for the single-turn rounds of the adversarial task data collection. There are three rounds in total all of identical size, hence the numbers above can be divided for individual statistics. The standard task is an additional dataset of exactly the same size as above.
lected and "fix it" results comparing the models obtained.

Break it Phase
Examples obtained from both the adversarial and standard collection methods were found to be clearly offensive, but we note several differences in the distribution of examples from each task, shown in Table 4. First, examples from the standard task tend to contain more profanity. Using a list of common English obscenities and otherwise bad words 2 , in  Table 6: Test performance of best standard models trained on standard task rounds (models S i for each round i) and best adversarial models trained on adversarial task rounds (models A i ). All models are evaluated using OFFENSIVE-class F1 on each round of both the standard task and adversarial task. A 0 is the baseline model trained on the existing Wiki Toxic Comments (WTC) dataset. Adversarial models prove to be more robust than standard ones against attack (Adversarial Task 1-3), while still performing reasonably on Standard and WTC tasks.
at least seven times as many as each round of the adversarial task. Additionally, in previous works, authors have observed that classifiers struggle with negations (Hosseini et al., 2017). This is borne out by our data: examples from the single-turn adversarial task more often contain the token "not" than examples from the standard task, indicating that users are easily able to fool the classifier with negations.
We also anecdotally see figurative language such as "snakes hiding in the grass" in the adversarial data, which contain no individually offensive words, the offensive nature is captured by reading the entire sentence. Other examples require sophisticated world knowledge such as that many cultures consider eating cats to be offensive. To quantify these differences, we performed a blind human annotation of a sample of the data, 100 examples of standard and 100 examples of adversarial round 1. Results are shown in Table 3. Adversarial data was indeed found to contain less profanity, fewer non-profane but offending words (such as "idiot"), more figurative language, and to require more world knowledge.
We note that, as anticipated, the task becomes more challenging for the crowdworkers with each round, indicated by the decreasing average scores in Table 7. In round 1, workers are able to get past A 0 most of the time -earning an average score of 4.56 out of 5 points per round -showcasing how susceptible this baseline is to adversarial attack despite its relatively strong performance on the Wikipedia Toxic Comments task. By round 3, however, workers struggle to trick the system, earning an average score of only 1.6 out of 5. A finer-grained assessment of the worker scores can Single-Turn Multi Round 1 2 3 ("4") Avg. score (0-5) 4.56 2.56 1.6 2.89 Table 7: Adversarial data collection worker scores.
Workers received a score out of 5 indicating how often (out of 5 rounds) they were able to get past our classifiers within two tries. In later single-turn rounds it is harder to defeat our models, but switching to multi-turn makes this easier again as new attacks can be found by using the dialogue context. be found in Table 11 in the appendix.

Fix it Phase
Results comparing the performance of models trained on the adversarial (A i ) and standard (S i ) tasks are summarized in Table 6, with further results in Table 13 in Appendix A.2. The adversarially trained models A i prove to be more robust to adversarial attack: on each round of adversarial testing they outperform standard models S i . Further, note that the adversarial task becomes harder with each subsequent round. In particular, the performance of the standard models S i rapidly deteriorates between round 1 and round 2 of the adversarial task. This is a clear indication that models need to train on adversariallycollected data to be robust to adversarial behavior.
Standard models (S i ), trained on the standard data, tend to perform similarly to the adversarial models (A i ) as measured on the standard test sets, with the exception of training round 3, in which A 3 fails to improve on this task, likely due to being too optimized for adversarial tasks. The standard models S i , on the other hand, are improving with subsequent rounds as they have more training data of the same distribution as the evaluation set. Similarly, our baseline model performs best on its own test set, but other models are not far behind.
Finally, we remark that all scores of 0 in Table  6 are by design, as for round i of the adversarial task, both A 0 and A i 1 classified each example as SAFE during the 'break it' data collection phase.

Multi-Turn Task
In most real-world applications, we find that adversarial behavior occurs in context -whether it is in the context of a one-on-one conversation, a comment thread, or even an image. In this work we focus on offensive utterances within the context of two-person dialogues. For dialogue safety we posit it is important to move beyond classifying single utterances, as it may be the case that an utterance is entirely innocuous on its own but extremely offensive in the context of the previous dialogue history. For instance, "Yes, you should definitely do it!" is a rather inoffensive message by itself, but most would agree that it is a hurtful response to the question "Should I hurt myself?"

Task Implementation
To this end, we collect data by asking crowdworkers to try to "beat" our best single-turn classifier (using the model that performed best on rounds 1-3 of the adversarial task, i.e., A 3 ), in addition to our baseline classifier A 0 . The workers are shown truncated pieces of a conversation from the ConvAI2 chit-chat task, and asked to continue the conversation with OFFENSIVE responses that our classifier marks as SAFE. The resulting conversations -including the newly provided OFFENSIVE responses -are between 3 and 6 turns long. As before, workers have two attempts per conversation to try to get past the classifier and are shown five conversations per round. They are given a score (out of five) at the end of each round indicating the number of times they successfully fooled the classifier.
We collected 3000 offensive examples in this manner. As in the single-turn set up, we combine this data with SAFE examples with a ratio of 9:1 SAFE to OFFENSIVE for classifier training.
The safe examples are dialogue examples from ConvAI2 for which the responses were reviewed by two independent crowdworkers and labeled as SAFE, as in the s single-turn task set-up. We refer to this overall task as the multi-turn adversarial task. Dataset statistics are given in Table 9.

Models
To measure the impact of the context, we train models on this dataset with and without the given context. We use the fastText and the BERT-based model described in Section 3. In addition, we build a BERT-based model variant that splits the last utterance (to be classified) and the rest of the history into two dialogue segments. Each segment is assigned an embedding and the input provided to the transformer is the sum of word embedding and segment embedding, replicating the setup of the Next Sentence Prediction that is used in the training of BERT (Devlin et al., 2018).

Break it Phase
During data collection, we observed that workers had an easier time bypassing the classifiers than in the single-turn set-up. See Table 7. In the singleturn set-up, the task at hand gets harder with each round -the average score of the crowdworkers decreases from 4.56 in round 1 to 1.6 in round 3. Despite the fact that we are using our best single-turn classifier in the multi-turn set-up (A 3 ), the task becomes easier: the average score per round is 2.89. This is because the workers are often able to use contextual information to suggest something offensive rather than say something offensive outright. See examples of submitted messages in Table 8. Having context also allows one to express something offensive more efficiently: the messages supplied by workers in the multi-turn setting were significantly shorter on average, see Table 4.

Fix it Phase
During training, we multi-tasked the multi-turn adversarial task with the Wikipedia Toxic Comments task as well as the single-turn adversarial and standard tasks. We average the results of our best models from five different training runs. The results of these experiments are given in Table 10.
As we observed during the training of our baselines in Section 3, the fastText model architecture is ill-equipped for this task relative to our BERT-based architectures. The fastText model performs worse given the dialogue context (an average of 23.56 OFFENSIVE-class F1 relative to 37.1) than without, likely because its bag-of-    Table 10: Results of experiments on the multi-turn adversarial task. We denote the average and one standard deviation from the results of five runs. Models that use the context as input ("with context") perform better. Encoding this in the architecture as well (via BERT dialogue segment features) gives us the best results. embeddings representation is too simple to take the context into account. We see the opposite with our BERT-based models, indicating that more complex models are able to effectively use the contextual information to detect whether the response is SAFE or OFFEN-SIVE. With the simple BERT-based architecture (that does not split the context and the utterance into separate segments), we observe an average of a 3.7 point increase in OFFENSIVE-class F1 with the addition of context. When we use segments to separate the context from the utterance we are trying to classify, we observe an average of a 7.4 point increase in OFFENSIVE-class F1. Thus, it appears that the use of contextual information to identify OFFENSIVE language is critical to making these systems robust, and improving the model architecture to take account of this has large impact.

Conclusion
We have presented an approach to build more robust offensive language detection systems in the context of a dialogue. We proposed a build it, break it, fix it, and then repeat strategy, whereby humans attempt to break the models we built, and we use the broken examples to fix the models. We show this results in far more nuanced language than in existing datasets. The adversarial data includes less profanity, which existing classifiers can pick up on, and is instead offensive due to figurative language, negation, and by requiring more world knowledge, which all make current classifiers fail. Similarly, offensive language in the context of a dialogue is also more nuanced than standalone offensive utterances. We show that classifiers that learn from these more complex examples are indeed more robust to attack, and that using the dialogue context gives improved performance if the model architecture takes it into account.
In this work we considered a binary problem (offensive or safe). Future work could consider classes of offensive language separately (Zampieri et al., 2019), or explore other dialogue tasks, e.g. from social media or forums. Another interesting direction is to explore how our build it, break it, fix it strategy would similarly apply to make neural generative models safe (Henderson et al., 2018).