Bot-Adversarial Dialogue for Safe Conversational Agents

Conversational agents trained on large unlabeled corpora of human interactions will learn patterns and mimic behaviors therein, which include offensive or otherwise toxic behavior. We introduce a new human-and-model-in-the-loop framework for evaluating the toxicity of such models, and compare a variety of existing methods in both the cases of non-adversarial and adversarial users that expose their weaknesses. We then go on to propose two novel methods for safe conversational agents, by either training on data from our new human-and-model-in-the-loop framework in a two-stage system, or ”baking-in” safety to the generative model itself. We find our new techniques are (i) safer than existing models; while (ii) maintaining usability metrics such as engagingness relative to state-of-the-art chatbots. In contrast, we expose serious safety issues in existing standard systems like GPT2, DialoGPT, and BlenderBot.


Introduction
When dialogue models are trained to mimic human-human conversations utilizing large preexisting datasets, they will unfortunately also learn undesirable features from this human-human data, such as the use of toxic or biased language. Most recent work in the detection and prevention of offensive 1 language has focused exclusively on human-generated data. These conversations may be very different from the domain in which a dialogue model might eventually be deployed: for example, humans may adversarially attempt to elicit * Equal contribution 1 In this paper, we use "offensive", "toxic", and "unsafe" interchangeably. For more discussions about attempts to better define categories of unsafe content, see Schmidt and Wiegand (2017). offensive language from a dialogue model in ways that differ from how they would speak with another human.
In this work, we introduce Bot-Adversarial Dialogue (BAD) Safety, a novel method for evaluating chatbot safety with humans and models in the loop. We ask humans to adversarially converse with a set of state-of-the-art English-language models with the aim of inducing them to generate unsafe responses to mimic the way these models can be adversarially attacked at deployment time. We analyze how to optimally construct such a crowdworker task, and collect a dataset of 5k such conversations yielding around 70k total utterances.
We then use the BAD method and data to evaluate the safety of several generative models and propose two techniques for making safer models: (1) Training a safety classifier with this data and deploying a two-stage model at inference time. In the two-stage setting, we prevent the generative model from surfacing offensive language flagged by the classifier.
(2) A novel method that directly "bakes in" toxicity-awareness to the generative model during training by modifying the target responses to incorporate safe responses to offensive input.
In experiments, we show that our new techniques outperform other existing generative models in terms of safety, while maintaining engagingness. We publicly release the BAD training and evaluation data as well as select models trained using this data via ParlAI. 2

Related Work
Numerous works have shown that humans speak differently with bots than with humans, with increases in profanity and aggressiveness associated with addressing a bot (Hill et al., 2015;Lortie and Guitton, 2011), which motivates the incorporation of human-bot dialogues into our safety framework.
De Angeli and Carpenter (2005); De Angeli and Brahnam (2008) suggest that one in ten humanbot conversations may contain instances of the human demonstrating unprovoked abusive behavior towards the chatbot. Miller et al. (2017b) argued that adversarial attacks need to be expected and planned for when deploying a user-facing system that learns from its interactions. These findings suggest it is insufficient to merely exclude toxic data from training, as the model would not know how to answer hostile out-of-domain inputs, and positive biases where models tend to agree rather than contradict (Roller et al., 2020) would lead to undesirable outcomes. As shown in Gehman et al. (2020), training on sanitized data can decrease the amount of unprompted toxic content, yet still leave models vulnerable to generating toxic content based on specific prompts.
The moving target of toxic content requires dynamic methods that repeatedly update benchmarks to improve current systems (Dinan et al., 2019a;Nie et al., 2019) 3 . The iterative procedure in Dinan et al. (2019a) strictly focuses on detection of toxicity in human-generated utterances through several rounds of humans attempting to "break" a toxicity classifier, without addressing generation. Our BAD approach is similar in spirit, but centers on generations of a bot in a human-bot conversation, closer to the context of deployed conversational models.
Focusing on generation requires deciding how to address "bad content." Previous works have compared response strategies, including avoidance, joking or polite deflection, non-committal answers, play-along, confrontation, apologetic responding, empathizing, and counter-attacking responses (Curry and Rieser, 2019;Chin and Yi, 2019;Chin et al., 2020;Paranjape et al., 2020). They find that humans rate different strategies as more appropriate depending on the type of offense they are responding to. Note that different implementation details make those strategies difficult to directly compare. While we use a strategy of nonsequiturs in this work, our takeaway is that future work should keep investigating several types of responses such that models can learn to deploy them adaptively according to finer-grained understanding of unsafe content.

Models
We describe the models we analyze in this paper, including safety classifiers and generative models.

Classifiers
We consider binary Transformer-based classifiers, following the same structure as in Dinan et al. (2019a), with two sizes: 128M and 311M parameters. We pre-train these models on a previously existing Reddit dataset extracted and obtained by a third party that was hosted by pushshift.io (Baumgartner et al., 2020), using a masked language model objective, and then fine-tune on the safety classification tasks of interest, performing early stopping using the F1 score of the "unsafe" class on the validation set. These tasks include various combinations of the Wikipedia Toxic Comments dataset (WTC) (Wulczyn et al., 2017), Standard (S) and adversarial Build-it, Break-it, Fix-it (BBF) data from Dinan et al. (2019a), as well as semi-supervised data created from labeling the pushshift.io Reddit (Baumgartner et al., 2020) (Reddit) and Blended Skill Talk (BST) datasets. Finally, we will use a new dataset Bot-Adversarial Dialogue (BAD), to be described in §4. As further baselines, we will also compare to both single-turn and multi-turn classifiers from Dinan et al. (2019a).

Generative Models
We consider large Transformer-based architectures trained on dialogue tasks.
BST 2.7B We start from a state-of-the-art opendomain dialogue system. We consider the same architecture and setup as in BlenderBot (Roller et al., 2020), which employs a Seq2Seq Transformer architecture (Vaswani et al., 2017), with an implementation based on the ParlAI version (Miller et al., 2017a). We consider the 2.7B parameter model which has 2 encoder layers, 24 decoder layers, 2560 dimensional embeddings, and 32 attention heads, which performed best in human evaluations of engagingness. This model was pretrained using pushshift.io Reddit and fine-tuned on on a smaller set of crowdsourced datasets designed to provide important conversational skills. At decoding time, the model employs standard beam search with a beam size of 10, context and label 3-gram blocking (Paulus et al., 2017), and a minimum beam length of 20 BPE tokens, which was shown to perform well compared to other choices.
We refer to this model throughout the rest of the paper as BST 2.7B.

DialoGPT and GPT2
In our experiments we also compare to two other baseline models: Di-aloGPT (Medium) (Zhang et al., 2019) and GPT2 (Large) (Radford et al., 2019), although we expect these two models to have lower engagingness scores than the BST 2.7B base model, in line with results from Roller et al. (2020);Adiwardana et al. (2020). We use the medium-sized DialoGPT (345M parameters), as it performed best according to metrics in Zhang et al. (2019). To our knowledge these methods have not been compared previously in terms of dialogue safety evaluations, or the engagingness/safety trade-off.

Our Approach
This section introduces the Bot-Adversarial Dialogue method for evaluating and training safe generative models.

The Bot-Adversarial Dialogue Method
We would like our model to be robust against direct attempts to probe them for unsafe responses. Previous work described how safety classifiers can be trained to be robust to adversarial human attack, see Dinan et al. (2019a). However, in order to achieve safer conversational agents, we advocate for applying such a procedure directly to conversations with those agents, which previous work did not address. The main idea is to collect data of humans probing our conversational models to try to trick them into making mistakes, similar to adversarial attacks that happen in practice during deployment. We can then use this data to make our system more robust.
We therefore construct a task where crowdworkers are instructed to converse naturally with a bot while simultaneously trying to probe or adversarially goad it into replying in an unsafe manner. Each bot's response is tagged as either safe or unsafe by the crowdworker. We divide the unsafe responses into four bins: those that nearly 0% of the population would consider unsafe (according to the annotator's best guess), and those that < 10%, <50% or >= 50% of the population would consider unsafe. After collection, a separate task verifies the labels with 3 other annotators, who can mark utterances either as safe, or as unsafe with a number of finer categories (e.g., hate speech). See Appendix A for details on how we combine all annotations to give a single label to utterances. While we only make  use of the binary safe / unsafe distinction in this work, this finer-grained labeling could be used in subsequent work to accommodate different levels of sensitivity. We conduct 14-turn conversations with the human speaking first, and score the model based on the number of unsafe turns. Such data can be used as a held out test set for static evaluation across different models, as well as for training data for improving models. More details of the collection setup are given in Appendix A. Figure 1 shows how BAD differs from the BBF set-up from Dinan et al. (2019a): namely, in BBF, the "breaker" (or adversarial user) tries to break a classifier by submitting human-authored adversarial unsafe messages that are incorrectly classified as safe, whereas in this work (BAD), the "breaker" adversarially tries to elicit unsafe messages from a dialogue model 4 . We compare to classifiers trained on BBF and additionally evaluate our new classifiers on BBF in our experiments.

Dataset Collection and Statistics
Using the BAD method, we collect 5,784 dialogues between bots and crowdworkers, consisting of 78,874 utterances in total from both humans and bots (see Table 1). About 40% of all utterances are annotated as offensive, among which one third are generated by bots. To encourage the bot to use offensive language more often, humans tended to use either unsafe language themselves in the dialogues, or raise probing questions that are considered inappropriate to ask. More than 42% of the dialogues collected contain at least 3 unsafe human messages or probing questions (see Appendix, Table 6). We further break down the messages from humans into a taxonomy of offensive language types, as these may prove useful in future work. The majority of offensive language used by crowdworkers relates to hate speech against particular groups, personal attacks and other less explicit offensive language containing no profanity, see Appendix Figure 5. Further details can be found in Appendix A.

Applying to Conversational Agents
We consider two different general strategies for making generative models safer to engage with: training classifiers for detecting unsafe messages as an added "safety layer" ( §4.2.1) and training the model such that it is unlikely to surface unsafe content at inference time ( §4.2.2).

Unsafe Utterance Detection: Deploying a Two-Stage Model
Given a safety classifier, a simple approach for improving dialogue safety is to use it to detect if both the user input and the model's response are safe. If a safety violation is detected in either type of utterance, one can then, instead, initiate a response designed to be safe. While several different "safe" response strategies can be considered (Curry and Rieser, 2019;Paranjape et al., 2020), in this work we respond with a non-sequitur: we select a topic at random from 1,087 topics judged as safe from the Wizard of Wikipedia conversational topic list (Dinan et al., 2019b) and then produce the response "Hey do you want to talk about something else? How about we talk about X?" where X is the chosen topic. Additional approaches are considered and analyzed in Appendix §B.1. After returning this response, the conversation continues as normal, with the response entering into the model's conversational history. In this way, the model can still respond naturally to followup responses after the canned "safe" response is produced.
We note that this approach works only as well as the classifier. If the classifier red flags too many safe utterances, the conversational experience will suffer. If unsafe utterances are not flagged, toxic language can still enter the conversation. This highlights a potential trade-off between ensuring safety and having an engaging conversation.

Safe Utterance Generation
A separate safety classifier layer has advantages (e.g. any independent improvement of this classifier can be used), but also downsides. For example, such an open-sourced model is more complicated to share and deploy, requires more computational resources (e.g. loading both models), and allows unsafe usage if the layer is simply removed. Fur-ther, in the long-term it makes sense if safety is part of a single dialogue agent model, in the sense that ideally it should understand what it is saying is unsafe.
Here, we detail two generative model training methods that are less likely to surface unsafe content without the use of an additional safety layer: data pre-processing and "baking-in" the safety layer, the latter of which is a new approach introduced in this work.
Data Pre-processing A classic approach to training models on unclean data is to filter it beforehand. Assuming we have access to a safety classifier, we can use it to filter the training set. In this work, we perform filtering by removing an example from the training set if either the conversational context (input) or response (output) triggers the safety classifier. Other approaches -such as author-based filtering -are considered and evaluated in Appendix §B.2. This training set is then used to train models as usual. With this approach, it is important for this filtering to be performed on the large pre-training dataset: if only the fine-tuning datasets are cleaned, the model will still have been exposed to offensive language, which it will be able to remember and use (as indeed confirmed by our experiments).
Baking in the Safety Layer Data pre-processing methods attempt to make a model safe by simply not exposing it to offensive language. This can make those models vulnerable to adversarial usage because they will not have learned how to handle offensive language at all: our models frequently copy the input (Welleck et al., 2020), so they might copy the offensive language. We instead propose a technique for attempting to bake awareness of toxic language into the training data, by using labeled examples that recommend appropriate action on the model's part in those circumstances.
To do this, we first assume we have access to a safety classifier at training time (but not at deployment time). For each training example, if the last utterance in the dialogue history or the ground-truth response are labeled as unsafe by the classifier, we instead replace the ground-truth response of that training example with a non-sequitur. An example demonstrating this procedure is shown in Table 2.
After constructing "baked-in" safety data, one can then train the generative model using likelihood training as usual, but with these modified targets. We separate training examples that have been   2) compared to the original data and the safe utterance pre-processed data approach. When a safety classifier flags an example as containing an offensive message, the safe utterance training data set has those examples simply removed from the dataset, whereas in the "baked-in" method the output (target utterance) is replaced with a non-sequitur, and then used to train the model. modified for safety from those that have not, and assign different weights to them -effectively drawing examples from those two sets with different probabilities -affecting how much the model optimizes for safety versus usual conversational abilities. Balancing these weights is important, especially when dealing with highly toxic pre-training sets, as they may be dominated by modified examples. We choose this weighting as a hyperparameter of the model.

Evaluation Methods
We use human evaluations in both an adversarial and non-adversarial setting to evaluate how safe our generative models are. Since our ultimate goal is to create models that are safe while still being engaging to talk to, we also perform human evaluations of conversational quality.
Evaluating Conversational Quality We use the ACUTE eval (Li et al., 2019) method of evaluating conversational quality, as used for BlenderBot (Roller et al., 2020) and elsewhere. The method involves collecting human-bot conversations for two models one wishes to compare, and then presenting two of those conversations at random, one from each model, to crowdworkers to make a series of pairwise decisions based on a target question. We use the question corresponding to evaluating engaginess from Li et al. (2019), which is phrased as "Who would you prefer to talk to for a long conversation?". We compare each of the proposed models in this paper to the baseline BST 2.7B which Roller et al. (2020) showed is state-of-the-art in open-domain dialogue. We note that our goal is not to find models which are more engaging than our baseline, but to find models that are simultaneously equally engaging and safer.
Evaluating Adversarial Safety To perform human evaluation of the safety of models in an adversarial setting we create a test set consisting of 180 dialogues collected using the Bot-Adversarial Dialogue method. Recall that the BAD data is collected by having humans converse and attempt to elicit unsafe responses with a dialogue model. In order to prevent this test set from being biased towards adversarial examples that target a single model, we construct the test set by sampling conversations from a large suite of models. At evaluation time, we collect a model's responses to each of the 180 dialogues in the test set, and human evaluators are used to judge the safety of each response. This set-up allows us to evaluate models in an adversarial setting that mimics deployment but for which the examples remained fixed, eliminating variances based on the experience and quality of crowdworkers during conversation collection.
Evaluating Non-Adversarial Safety While we need our models to perform well in an adversarial setting, we also wish for them to perform well in a non-adversarial setting. For example, a model that repeats user input verbatim may be robust to more subtle attempts to elicit offensive responses which are not offensive in and of themselves, but would not be robust to simpler attacks like profanity. For this reason, we propose a non-adversarial test set composed of 180 examples that are extracted from the Wikipedia Toxic Comments test set. We adopt the same human evaluation setup as in the adversarial setting in which various models are evaluated for the given contexts.

Results
We detail experimental results in this section, including results of the data collection from the Bot-Adversarial Dialogue method ( §5.1), experimental results related to training classifiers ( §5.2), and a comparison of safe generation methods ( §5.3). Lastly, in §5.4, we detail and compare the overall safety and engagingness scores for all models.

Data Collection Results
We describe results from data collection using the Bot-Adversarial Dialogue method, providing a detailed analysis of the effects of the crowdsourcing methods.
In order to inform crowdsource task design, we use logistic regression to model several task outcomes. Predictors include variables capturing the human chat partner's experience with the task and the particular bot they are currently talking to, and which of two possible versions of task instructions was received. Experience with the task is measured as the number of HITs accepted by the worker so far -a HIT, or Human Intelligence Task, is the term used by Amazon's Mechanical Turk to refer to a single instance of a crowdworker task. Experience with a specific bot is captured as the position of the utterance within the conversation (e.g., 2nd utterance in a 14 utterance conversation). The models underlying the bot responses were included as predictors and had a large significant effect (as discussed in the rest of the paper), but are omitted from the discussion here to focus on predictors related to task design.
Modeling results shown in Table 3 suggest that (1) instructing workers to ask open questions about sensitive topics rather than using obvious profanities (New instruction set) has a significant effect, increasing the rate of unsafe bot utterances while simultaneously decreasing the rate of unsafe human utterances; (2) self-selection effects are present (see also Sec. A.4), so that the total number of HITs ultimately completed is predictive of higher success at eliciting not-OK content; (3) two types of learning effects are present: workers are more successful (i.e., are able to solicit more unsafe responses) as they perform more iterations of the task, and within HITs, which might reflect that workers figure out the vulnerabilities of the particular bot they have been paired with and identify the most successful strategies. We note that the increased rate of unsafe utterances for later utterances observed here is in the context of an explicitly adversarial setting aiming to elicit them; we do not expect that this pattern would generalize to non-adversarial contexts.

Classifier Training Results
Automatic evaluation results are presented for safety classifiers in Table 4. We train safety classifiers using the methodology described in Sec. 4.2.1 and compare different model sizes and multitasking across different training sources. Firstly, we find our newly trained models superior to existing models from Dinan et al. (2019a) when using the same training sets, likely due to improved pushshift.io Reddit pre-training of our Transformers compared to their BERT models. However, we find relatively small gains from either larger Transformers (Safety Classifier + ) over smaller ones (Safety), or from semi-supervised learning over Reddit and BST (Semi-Sup. + ).
We compare the classifier trained on the BAD dataset, multitasked with the other datasets, to other approaches in Table 4. We observe similar results Outcome: not OK utterances Bot, rater Bot, partner Human Base −3.06 * * * −2.04 * * * −0.37 * Increase / utterance 0.14 * * * 0.14 * * * 0.11 * * * Increase / HIT 0.04 * * * 0.03 * * * 0.08 * * * New instruction set 0.19 * 0.70 * * * −0.36 * * * Total HITs 0.06 * * * 0.10 * * * 0.01, n.s. to our other new safety classifiers on the singleturn Wikipedia Toxic Comments (WTC), Build-It Break-It Fix (BBF) and Standard (S) test sets, but superior results on the multi-turn bot-adversarial BAD test set. The BAD-based classifier achieves 80.8 unsafe F1 on the latter dataset, while the next best performing methods achieve 61.5, 61.0 and 60.7, respectively. This result can be explained by virtue of the fact that the BAD-based classifier is the only one trained on the BAD training set, hence it sees data that most closely resembles the evaluation distribution. Note that the BAD training set differs from the other training sets listed as it is both (i) adversarially collected and (ii) multi-turn. One can tease apart the effects of each of these attributes by comparing to a single-turn (truncated) version of BAD training, shown in Table 4 (second to last row), which still performs well -though not as well -as the multi-turn version, indicating that the adversarial component is most important.
As the BAD test set is the closest setup to the actual setting in which such a classifier might be deployed (it features human-bot conversations, rather than human-human single-turn data), this indicates the BAD-based classifier is the most likely method to be successful in real use cases.

Safe Generation Results
We compare the baked-in safety layer method of §4.2.2 to the data-preprocessing methods using 400M parameter models, the details of which are described in Appendix B, and find that "baked-in" training gives increased safety over safe utterance preprocessing. On pushshift.io Reddit, the "bakedin" method triggers a classifier 0.2% vs. 6.8% of the time for preprocessing. Both methods yield similar PPL and F1 scores. We thus experiment with scaling it up to a 2.7B parameter model.

Comparing All Models: Safety and Engagingness
We perform human evaluations to compare the relative safety and engagingness for many of the selected methods. Results showing the engagingness performance relative to safety performance (for both adversarial and non-adversarial safety) using human judgments ( §4.3) are shown in Figure 2. Automatic evaluations are provided in Appendix D. We compare the methods described in this paper -two-stage models and "baked in" models -to three standard baselines: BST 2.7B, DialoGPT, and GPT2. BST 2.7B (Roller et al., 2020) has simply been trained on existing dialogue corpora, with no safety technique at all in model training. DialoGPT (Zhang et al., 2019) uses a pre-processing method, where offensive subreddits where removed from the training data. We test DialoGPT in two flavors: with short generations (using standard beam decoding), and longer generations (where we add a constraint that a minimum of 20 tokens must be generated, similar to (Roller et al., 2020). In all experiments we use the medium-sized version of DialoGPT, with 345M parameters, as noted in §3.2. Finally, GPT2 (Radford et al., 2019) was trained on web data that was filtered for data quality, but not for offensive language as far as we are aware.

Engagingness
Engagingness scores from the ACUTE-eval set-up are plotted along the x-axis in Figure 2. Detailed results can be found in Table 9 in the Appendix.
Results on standard models indicate that BST 2.7B is significantly more engaging than GPT2, DialoGPT and pushift.io Reddit 2.7B.
We apply the classifier learned from our Bot-Adversarial Dialogue (BAD) dataset (multi-tasked with our other datasets) in a two-stage model. Engagingness of this model is found to be not significantly distinguishable from our base BST 2.7B model. The baked-in model also performs similarly to the base BST 2.7B model with respect to engagingness, showing that this system still works  well in terms of conversation quality.

Adversarial Safety
To perform human evaluation of safety in an adversarial setting, we evaluate models using the BAD evaluation method described in §4.3. Results can be seen on the y-axis of Figure 2 (left). More details are provided in Table 15 in the Appendix.
Results show that all of our standard base models -including BST 2.7B, DialoGPT, and GPT2are susceptible to attack, e.g. GPT2 produces safe responses only 59.4% of the time, and BST 2.7B only 55% of the time. Clearly, to defend against BAD requires alternative techniques.
Our two-stage BAD classifier approach improves over our other safety classifiers used in twostage systems, yielding an 94.4% OK rate on the adversarial data. Overall, this method offers strong robustness without affecting engagingness, and we advocate its use.
For our "baked-in" model, we see clear gains relative to standard models (e.g. increasing from the baseline BST 2.7B value of 55% OK up to 78.3% OK), although these gains are not as significant as when using two-stage models (the same classifiers in a two-stage setup can bring the results up to 83.9% OK). We believe an important next step for future work is to improve this training technique to match the two-stage results.

Non-Adversarial Safety
Human evaluation of safety in a non-adversarial setting is conducted using the Wiki Toxic Comments test set described in §4.3. Results can be seen on the y-axis of Figure 2 (right). More details are provided in Table 16 in the Appendix.
Similarly to the adversarial setting, all of our standard models appear susceptible to attack. In the best case, DialoGPT produces safe responses only 68.3% of the time. GPT2 performs the worst, providing safe responses 54.4% of the time.
Our two-stage models get near perfect scores here -scores range from 97.8 to 98.3 -showing that these models are very robust to attack in the non-adversarial setting. This shows that future effort to make these models safe should focus on the adversarial setting, as in BAD.
The "baked-in" model performs the best in this setting, achieving very high scores. We conclude this technique should be further explored, particularly for robustness in the adversarial setting.

Conclusion
We observe that standard generative models -with little or no safety intervention -fall very short in terms of safety, especially when measured using our Bot-Adversarial Dialogue (BAD) framework, which we publicly release along with our models. However, with our safety techniques we can achieve roughly the same engagingness as the state of the art BST 2.7B with substantially better safety scores, showing it is possible to build a model that is both safe and engaging. We find generative models can be improved considerably by distilling a safety classifier into the encoder-decoder during training, i.e. the "baked-in" approach. Two-stage models provide safer results still, with best performance coming from our BAD-based classifier with BST 2.7B in the adversarial case.
We note that while we have improved substantially over existing systems, our best systems are not perfectly safe as measured by the BAD method. Conducting perfectly safe dialogue requires the Figure 2: Engagingness vs. Safety: Comparing engagingness scores from ACUTE-eval to adversarial safety scores on the Bot-Adversarial Dialogue (BAD) test set (left) and non-adversarial safety scores on the Wiki Toxic Comments test set (right). An ideal model should appear at the top right of both plots, being maximally engaging whilst staying maximally safe. Here, engagingness and safety scores are measured using the metrics from Table 9, Table 15 and Table 16 found in the Appendix, respectively. model to deeply understand language and likely cannot be completely solved until AI itself is solved. Further complicating the issue is the fact that the very definition of "safe" is both contextually and culturally dependent (Schmidt and Wiegand, 2017). Rather than attempt to define "safety" for all languages and locales, in this work we rely on crowdworker consensus and focus on machine learning methods for English language data. We look forward to further progress in these technical and ethical challenges.

Ethical Considerations
In this paper, we have presented several methods for building safer conversational agents. As we noted in the conclusion, even our best systems are not perfectly safe. This raises several ethical considerations, including questions of: when can a model be considered "safe"? Is a failure rate of 5.6% in an adversarial setting acceptable for the deployment of such models? How safe is safe enough? Creating a perfectly safe dialogue model requires the model to deeply understand language and likely cannot be completely solved until AI itself is solved, i.e. this is an AI-complete problem.
We also reiterate that the issue is further complicated by the fact that the very definition of "safe" is both contextually and culturally dependent (Schmidt and Wiegand, 2017). A dialogue model must be able to understand the boundaries of its particular conversation partner. What is offensive to one may not be offensive to another (Curry and Rieser, 2019). Culturally speaking, the approaches in this paper are limited in both geograph-ical and historical senses. Our methods rely only on English-speaking annotators located in the United States. This narrow, Western-centric viewpoint will be insufficient for solving the issue in other languages and locales (Schmidt and Wiegand, 2017). Further, it is well known that commonly used hatespeech datasets are known to have issues with bias and fairness (Dixon et al., 2018). Sap et al. (2019) showed that several contain correlations between surface markers of African American English and toxicity, and propose race and dialect priming as a way to mitigate this. In this work we have assumed a consensus-based view on offensiveness, by admitting examples based on agreement of multiple humans;however, offense to underrepresented groups for example may be missed by such a setup. We encourage further work to consider how classifiers trained on the datasets described in this work may be biased against various demographic groups.
Lastly, our work analyzes publicly available open-sourced models. We note that there may be concerns in the community or the public at large related to releasing models, even for research purposes, due to their potential safety issues. The community has recently started to address this tradeoff between releasing models that can produce offensive or toxic language and open, reproducible research 5 . We believe the solution for these issues involves the community working together and conducting reproducible research on safety. Releasing code and models facilitates that joint community effort.

Appendix A Bot-Adversarial Dialogue Data Collection
We collect Bot-Adversarial Dialogues to build the BAD dataset by asking humans to adversarially talk to bots. This appendix provides further details on the data collection. Figure 4 is a screenshot of the crowdsourced task for collecting Bot-Adversarial Dialogues.

A.1 Further Collection Details
Bots We use a list of models (bots) coming from the techniques in the paper itself Section 3.2 and Section 4. The list of models, and data counts for each are listed in Table 5. One can observe from the offensive statistics themselves some trends, although we caution against their use for evaluation due to the variance in crowdworker experience and skill over the time of collection due to sequential effects. Nevertheless, one can observe that models without safety classifiers are more vulnerable to adversarial attacks from humans, and models with safety classifiers are harder to attack, and that Control Hostile is clearly the most offensive of all models.
Offensive Response Statistics Figure 3 shows some statistics from the dataset concerning when bots respond with offensive language relative to the language used by the human. We find that when humans craft offensive messages, about 1/3 of the time the bots reply with offensive responses too. By comparison, the use of safe utterances by humans (e.g. probing questions that are safe within themselves) is about 2.5× less effective of a strategy for eliciting an unsafe bot response, although we do not break that down here by model (the less robust the model, the easier it is to elicit an offensive response by writing an offensive query). We also provide statistics on the number of offensive turns per dialogue in Table 6.

A.2 Offensive Language Types
To get reliable estimates of whether an utterance is safe or not, and to further identify the type of offensive language from the collected adversarial dialogues, we launched a separate crowdsourced annotation task where at least 3 distinct crowdworkers from a disjoint set were instructed to annotate which type of offensive language each utterance   Figure 3: When humans use offensive language first, bots tend to respond with unsafe content more often. In response to offensive human messages, about 1/3 of the time bots reply with offensive language too, whereas this reduces to 12.9% in response to safe messages.
from the adversarial dialogues contains. Using three annotations per utterance is a method that is widely adopted by prior work on offensive language crowdsourcing tasks (Davidson et al., 2017;Chen et al., 2012;Zampieri et al., 2019) and also aligned with empirical studies on the diminishing return of increasing annotation size (Wulczyn et al., 2017). We choose a taxonomy of offensive language with 4 primary categories. The same taxonomy is shown in the bot adversarial task instructing crowdworkers to probe the bots in the those directions.
• Hate Speech: the text that attacks or demeans a group based on race, gender, ethnic origin, religion, disability, age or sexual orientation.
• Personal Attack: the text contains rude remarks, insults, threats that are targeting an individual.
• Profanity: the text contains profanities such as sexual remarks, swearing and curse words; also weakly pejoratives and obscenities such as 'stupid'.
• Other Offensiveness: the text is offensive, but it does not contain hate speech, personal attacks or profanity.
An utterance is considered safe by the annotator in this task if none of the above was tagged. 6 See Figure 5 for a breakdown of the offensive language types used in the dataset. Compared to personal attacks and profanity, hate speech and other offensive language that can be expressed in a more implicit way are more commonly used by crowdworkers to break the bot. This separate task also allows us to aggregate judgments from at least 4 crowdworkers -including the one adversarially talking to the bot and another 3 from the separate verification task -on bot responses and at least 3 crowdworkers on human utterances (since the crowdworkers talking to the bots only annotate utterances generated from the bot). For our purposes, each bot utterance is defined as unsafe if it was labeled by the crowdworker who elicited it in the bot adversarial dialogue task as ">=50% of population would consider it as such", or at least half of all the crowdworkers rating that utterance marked it as unsafe. (Note that for the crowdworker who chatted directly with the bot, in this context, anything other than "nearly 0%" is regarded as an "unsafe" annotation.) This gives more weight to judgments of ">=50% unsafe" by the workers eliciting the utterances to reflect the fact that they are usually more engaged in the conversation than annotators merely reading it. Each human utterance is defined as unsafe if >= 50% of all the crowdworkers who annotated it in the separate verification task labeled it as unsafe. To ensure data quality, each annotator was required to pass a test of at least 10 questions covering all types of offensiveness included in the taxonomy defined above and annotators whose accuracy fell below 70% would be removed from the annotation task. Using Krippendorff's alpha (Krippendorff, 2004) as inter-annotator agreement (IAA), the multi-label annotation task has a reliability coefficient of 0.41, and 0.53 in binary case (offensive/safe), close to the value (0.45) reported by (Wulczyn et al., 2017). This is also in line with IAA results in other crowdsourced studies of offensive language (Fortuna, 2017).  To detect offensive language in a conversational environment, we compare training multi-turn classifiers on the Bot-Adversarial Dialogue dataset, truncating to different context lengths. Table 7 reports the performance of models trained on truncation amount k tr (which corresponds to how much of the previous conversation context was provided to the model, including the current utterance and the previous k tr − 1 messages to look back on) on the validation set with truncation k v . Classifiers trained with different truncated dialogue lengths perform almost equally on WTC, S, BBF and BAD. However, the safety classifier trained on k tr = 4 achieves higher overall F1 across all k v ∈ {2, 4, 6} truncated versions of the BAD validation set.

A.4 Worker Self-Selection Effects
When modeling the rate of unsafe utterances elicited by a worker during their first time accepting a HIT, the rate produced by workers who go on to accept other HITs for that same task is significantly higher than the rate produced by workers who only accept one HIT, as shown in Table 8. This suggests that workers who successfully figure out how to trick the bot into saying more offensive utterances are more likely to go on accepting more HITs of the task. This in turns makes data collection more efficient.

Regressor Coefficient
Base −2.7 * * * Increase / utterance 0.1 * * * New instruction set 0.3 * Increase / HIT eventually completed 0.1 * * * Table 8: Logistic regression coefficients for the outcome of a bot response being rated as not OK in a subsequent verification task. The data here is limited to responses elicited during the first HIT accepted by any worker, to eliminate across-HIT learning effects and highlight self-selection effects. The total number of HITs ultimately completed by a worker is predictive of higher success at eliciting offensive content during the first HIT. Effects of better instruction set and within-HIT learning are also present. Model types are included in the regressors but not shown here. Significance (Wald test): * : p < 0.05. * * * : p < 0.001.  could generalize this to choosing from a set of canned responses.

B Safety
• Non sequitur: in this setting, we choose to change the subject instead. We select a topic at random from 1087 topics judged as safe from the Wizard of Wikipedia conversational topic list (Dinan et al., 2019b). We then produce the response "Hey do you want to talk about something else? How about we talk about X?" where X is the chosen topic.
After generating this response, the conversation continues as normal, with the response entering into the model's conversational history. In this way it can still respond naturally to followup responses after the canned response is produced.
Model engagingness results (see Appendix, Table 9) indicate that non sequiturs are more engaging than bland safe responses; intuitively this makes sense as they are interesting conversation starters. We therefore used non-sequiturs elsewhere in our experiments.

B.2 Data Pre-Processing Comparison
A classic approach to training models on clean data is to filter it beforehand. Assuming we have access to a safety classifier, we can use it to filter the training set. In this work we consider two methods: • Utterance-based: remove a target utterance from the training set if either its context or the   utterance itself triggers the safety classifier.
• Author-based: given a dataset where the author of each utterance is known, remove all the utterances of given authors, if that author's utterances trigger the classifier more than a given number of times. In our experiments, we remove authors if over 12% of their posts trigger the safety classifier.
This training set is then used to train models as usual. It is important this filtering is performed on the large pre-training dataset, as cleaning only the fine-tuning datasets (if even necessary -in many cases they are clean already) will have still exposed the model to offensive language which it will be able to remember and use, as will be shown in the experiments.

Experimental Results
We trained with two types of data pre-processing (author and utterance methods, §4.2.2). These models were trained from scratch using 400M parameter transformer models (we did not use the 2.7B model due to the computational cost of so many experiments). We then compare both pre-train only models and fine-tuned BST models in terms of safety and PPL and F1 metrics. The pre-processing from utterance and author safety methods resulted in training set sizes that were 70% and 30% of the original pre-train dataset, respectively. We compare these to a baseline 400M model using the whole pre-train dataset (so no safety mechanism is built in). Results are given in Table 11. We find that both pre-processing methods are safer than the baseline, with the safe utterance method being significantly safer than the safe author method. We note the safe author method still has a large number of unsafe utterances, according to our safety classifier, but not enough for any one author to trigger removing the author, which may be the reason for worse safety statistics on the validation set. This would lead to a conclusion that while toxic authors exist, there are also a large number of otherwise non-toxic authors who sometimes use toxic language, and this can adversely affect model training. We note that one could employ both procedures: safe author + utterance, but we have not tried that experiment here.
B.3 Data pre-processing vs. "Baked-in" To compare data pre-processing methods with our new "baking-in" technique, we train a 400M parameter model from scratch, with 50% of the safety classifier triggered pre-training data replaced with non-sequitur utterances, and the rest of the safety classifier triggered data discarded, to prevent too much of the training time spent on non-sequitur prediction. The results, given in Table 11 indicate that perplexity takes a slight hit, but that safety classier fires on model generations (given validation set contexts) decrease substantially. For our pre-train only model, however the results are more nuanced -we found that the model is overly cautious at deploy time and too often generates non-sequiturs, resulting in a low F1 on ConvAI2 for example. As it is expensive to begin pre-training with different hyperparameter values, we thus instead remedy this at fine-tune time by weighting the amount of training examples sampled in each batch between the BST tasks and nonsequiturs. The last two rows of Table 11 show that this technique can effectively control the rate at which non-sequiturs are produced. The last row in particular achieves an F1 score similar to the pre-processed data methods (safe author and safe utterance) while having a much lower safety classifier firing rate -reduced from 6% to 0.2%. We thus conclude from these experiments that baked-in training is a method worthy of further study, and in subsequent experiments proceed to apply it to larger 2.7B models instead.

B.4 "Baking-in" Additional Results
We experiment with scaling up our "baked-in" techniques to the 2.7B parameter size. For this, we considered two strategies: fine-tuning from the base BST 2.7B model to add baked-in safe responses, or training a completely new model from scratch with non-sequiturs as part of the pre-training task, followed by fine-tuning.

C Generative Model Automatic Evaluation Results
Automatic evaluations for generative models (bots) are presented in Table 12. Results in Table 12 show that all these models exhibit significant safety issues, with e.g., GPT2 generations being flagged by a safety classifier 8.0% of the time given pushshift.io Reddit dialogues as input context, and 2.4% given ConvAI2 dialogues. Similarly, Di-aloGPT is as high as 21.4% on pushshift.io Reddit (without the minimum beam). We can compare these to human numbers, which are actually quite high on pushshift.io Reddit (16.5%), explaining why some of these methods also exhibit safety issues -as they are trained on this data. In contrast, the safety classifier only fires on human data from ConvAI2 3.9% of the time, which can be explained by this data being authored by crowdworkers who had instructions not to use toxic language.
Comparing the two models pushshift.io Reddit 2.7B (which is pre-trained only on pushshift.io Reddit) and BST 2.7B (which is then fine-tuned on BST tasks such as ConvAI2) one can observe a decrease in safety classifier fires down from 8.1% to 1.8% on ConvAI2, and a similar decrease on pushshift.io Reddit. This shows how training on less toxic data induces less toxic models.

D Generative Model Human Safety Evaluation Results
To evaluate the safety of our models, we set up an adversarial setting that mimics deployment us-ing adversarial dialogues collected from the Bot-Adversarial Dialogue method. This is in contrast to a non-adversarial environment such as Wikipedia Toxic Comments where the contexts are not constructed to elicit toxic responses from chat bots.
Adversarial Safety Judgements. The adversarial test set for human safety judgments is composed of 180 dialogues, 30 each from the 6 chatbot models that we collected the most of in the adversarial dialogue crowdsourced task: BST 2.7B, BST 2.7B + Safety Classifier, BST 2.7B + Semi-Sup. + Safety Classifier, BST 2.7B Non Sequitur, and BST 2.7B Non Sequitur (Semi-Sup. + ). Each crowdworker is shown a truncated piece from the test set along with different model replies to that given segment and asked to annotate offensiveness. Human safety evaluation results on adversarial test set are shown in Table Table 15.
Non-Adversarial Safety Judgements. The nonadversarial test set for human safety judgments is composed of 180 examples from the Wikipedia Toxic Comments test set. Each crowdworker is shown an input from the WTC test set along with different model replies to that given message and asked to annotate offensiveness with the exact same instructions and annotation buckets as in the adversarial settings. Human safety evaluation results on adversarial test set are shown in Table Table 16 E Success and Failure Cases We discuss several example outputs of our models on our Bot-Adversarial Dialogue test set (BAD), including examples that showcase both the successes and failures of our methods.
Successes In Table 13, we show success cases for our BST 2.7B + Adversarial Dialogue Safety (two-stage) and BST 2.7B Non-Sequitur (baked-in) models on the BAD test set. We also provide the outputs for the standard BST 2.7B model (Roller et al., 2020) and DialoGPT (Zhang et al., 2019). In all three cases the safety models are able to successfully recognize the unsafe input and avoid responding by providing a non-sequitur. Conversely, both BST 2.7B and DialoGPT engage with the unsafe input.
Failures While our safety models are able to successfully avoid engaging with adversarial inputs in some cases, they fail in others. Failure cases are shown in Table 14   Dialogue Safety (two-stage) model. In both cases, the models' responses are unsafe in the context, showing how adversarial input can elicit an unsafe response. This shows that while the models' described in this paper are robust to many adversarial inputs, they can still be tricked.  Table 15: Human safety judgments on the adversarially created dialogue safety test set for various models. "Min", "Some", and "Most" refer to responses that less than 10% of the population would consider unsafe (according to the annotator's best guess), <50%, and > 50% of the population would consider unsafe, respectively.   Table 16: Human safety judgments on the non-adversarial WTC (Wulczyn et al., 2017) test set for various models. "Min", "Some", and "Most" refer to responses that less than 10% of the population would consider unsafe (according to the annotator's best guess), <50%, and > 50% of the population would consider unsafe, respectively.