Conversations Gone Awry: Detecting Early Signs of Conversational Failure

One of the main challenges online social systems face is the prevalence of antisocial behavior, such as harassment and personal attacks. In this work, we introduce the task of predicting from the very start of a conversation whether it will get out of hand. As opposed to detecting undesirable behavior after the fact, this task aims to enable early, actionable prediction at a time when the conversation might still be salvaged. To this end, we develop a framework for capturing pragmatic devices—such as politeness strategies and rhetorical prompts—used to start a conversation, and analyze their relation to its future trajectory. Applying this framework in a controlled setting, we demonstrate the feasibility of detecting early warning signs of antisocial behavior in online discussions.

Our goal is crucially different: instead of identifying antisocial comments after the fact, we aim to detect warning signs indicating that a civil conversation is at risk of derailing into such undesirable behaviors. Such warning signs could provide potentially actionable knowledge at a time when the conversation is still salvageable.
As a motivating example, consider the pair of conversations in Figure 1. Both exchanges took place in the context of the Wikipedia discussion page for the article on the Dyatlov Pass Incident, and both show (ostensibly) civil disagreement between the participants. However, only one of these conversations will eventually turn awry and devolve into a personal attack ("Wow, you're coming off as a total d**k. [...] What the hell is wrong with you?"), while the other will remain civil.
As humans, we have some intuition about which conversation is more likely to derail. We may note the repeated, direct questioning with which A1 opens the exchange, and that A2 replies with yet another question. In contrast, B1's softer, hedged approach ("it seems", "I don't think") appears to invite an exchange of ideas, and B2 actually addresses the question instead of stonewalling. Could we endow artificial systems with such intuitions about the future trajectory of conversations?
A1: Why there's no mention of it here? Namely, an altercation with a foreign intelligence group? True, by the standards of sources some require it wouln't even come close, not to mention having some really weak points, but it doesn't mean that it doesn't exist.
A2: So what you're saying is we should put a bad source in the article because it exists?
B1: Is the St. Petersberg Times considered a reliable source by wikipedia? It seems that the bulk of this article is coming from that one article, which speculates about missile launches and UFOs. I'm going to go through and try and find corroborating sources and maybe do a rewrite of the article. I don't think this article should rely on one so-so source.
B2: I would assume that it's as reliable as any other mainstream news source.
Figure 1: Two examples of initial exchanges from conversations concerning disagreements between editors working on the Wikipedia article about the Dyatlov Pass Incident. Only one of the conversations will eventually turn awry, with an interlocutor launching into a personal attack.
In this work we aim to computationally capture linguistic cues that predict a conversation's future health. Most existing conversation modeling approaches aim to detect characteristics of an observed discussion or predict the outcome after the discussion concludes, e.g., whether it involves a present dispute (Allen et al., 2014; Wang and Cardie, 2014) or contributes to the eventual solution of a problem. In contrast, for this new task we need to discover interactional signals of the future trajectory of an ongoing conversation.
We make a first approach to this problem by analyzing the role of politeness (or lack thereof) in keeping conversations on track. Prior work has shown that politeness can help shape the course of offline (Clark, 1979;Clark and Schunk, 1980), as well as online interactions (Burke and Kraut, 2008), through mechanisms such as softening the perceived force of a message (Fraser, 1980), acting as a buffer between conflicting interlocutor goals (Brown and Levinson, 1987), and enabling all parties to save face (Goffman, 1955). This suggests the potential of politeness to serve as an indicator of whether a conversation will sustain its initial civility or eventually derail, and motivates its consideration in the present work.
Recent studies have computationally operationalized prior formulations of politeness by extracting linguistic cues that reflect politeness strategies (Aubakirova and Bansal, 2016). Such research has additionally tied politeness to social factors such as individual status (Danescu-Niculescu-Mizil et al., 2012; Krishnan and Eisenstein, 2015), and the success of requests (Althoff et al., 2014) or of collaborative projects (Ortu et al., 2015). However, to the best of our knowledge, this is the first computational investigation of the relation between politeness strategies and the future trajectory of the conversations in which they are deployed. Furthermore, we generalize beyond predefined politeness strategies by using an unsupervised method to discover additional rhetorical prompts used to initiate different types of conversations that may be specific to online collaborative settings, such as coordinating work (Kittur and Kraut, 2008) or conducting factual checks.
We explore the role of such pragmatic and rhetorical devices in foretelling a particularly perplexing type of conversational failure: when participants engaged in previously civil discussion start to attack each other. This type of derailment "from within" is arguably more disruptive than other forms of antisocial behavior, such as vandalism or trolling, which the interlocutors have less control over or can choose to ignore.
We study this phenomenon in a new dataset of Wikipedia talk page discussions, which we compile through a combination of machine learning and crowdsourced filtering. The dataset consists of conversations which begin with ostensibly civil comments, and either remain healthy or derail into personal attacks. Starting from this data, we construct a setting that mitigates effects which may trivialize the task. In particular, some topical contexts (such as politics and religion) are naturally more susceptible to antisocial behavior (Kittur et al., 2009;Cheng et al., 2015). We employ techniques from causal inference (Rosenbaum, 2010) to establish a controlled framework that focuses our study on topic-agnostic linguistic cues.
In this controlled setting, we find that pragmatic cues extracted from the very first exchange in a conversation (i.e., the first comment-reply pair) can indeed provide some signal of whether the conversation will subsequently go awry. For example, conversations prompted by hedged remarks sustain their initial civility more so than those prompted by forceful questions, or by direct language addressing the other interlocutor.
In summary, our main contributions are:
• We articulate the new task of detecting early on whether a conversation will derail into personal attacks;
• We devise a controlled setting and build a labeled dataset to study this phenomenon;
• We investigate how politeness strategies and other rhetorical devices are tied to the future trajectory of a conversation.
More broadly, we show the feasibility of automatically detecting warning signs of future misbehavior in collaborative interactions. By providing a labeled dataset together with basic methodology and several baselines, we open the door to further work on understanding factors which may derail or sustain healthy online conversations. To facilitate such future explorations, we distribute the data and code as part of the Cornell Conversational Analysis Toolkit.[3]
[3] http://convokit.infosci.cornell.edu

Further Related Work
Antisocial behavior. Prior work has studied a wide range of disruptive interactions in various online platforms like Reddit and Wikipedia, examining behaviors like aggression (Kayany, 1998), harassment (Chatzakou et al., 2017; Vitak et al., 2017), and bullying (Akbulut et al., 2010; Kwak et al., 2015; Singh et al., 2017), as well as their impact on aspects of engagement like user retention (Collier and Bear, 2012; Wikimedia Support and Safety Team, 2015) or discussion quality (Arazy et al., 2013). Several studies have sought to develop machine learning techniques to detect signatures of online toxicity, such as personal insults (Yin et al., 2009), harassment (Sood et al., 2012) and abusive language (Nobata et al., 2016; Gambäck and Sikdar, 2017; Pavlopoulos et al., 2017a; Wulczyn et al., 2017). These works focus on detecting toxic behavior after it has already occurred; a notable exception is Cheng et al. (2017), which predicts future community enforcement against users in news-based discussions. Our work similarly aims to understand future antisocial behavior; however, our focus is on studying the trajectory of a conversation rather than the behavior of individuals across disparate discussions.
Discourse analysis. Our present study builds on a large body of prior work in computationally modeling discourse. Both unsupervised (Ritter et al., 2010) and supervised (Zhang et al., 2017a) approaches have been used to categorize behavioral patterns on the basis of the language that ensues in a conversation, in the particular realm of online discussions. Models of conversational behavior have also been used to predict conversation outcomes, such as betrayal in games (Niculae et al., 2015), and success in team problem solving settings (Fu et al., 2017) or in persuading others (Tan et al., 2016; Zhang et al., 2016).
While we are inspired by the techniques employed in these approaches, our work is concerned with predicting the future trajectory of an ongoing conversation as opposed to a post-hoc outcome. In this sense, we build on prior work in modeling conversation trajectory, which has largely considered structural aspects of the conversation (Kumar et al., 2010;Backstrom et al., 2013). We complement these structural models by seeking to extract potential signals of future outcomes from the linguistic discourse within the conversation.

Finding Conversations Gone Awry
We develop our framework for understanding linguistic markers of conversational trajectories in the context of Wikipedia's talk page discussions, public forums in which contributors convene to deliberate on editing matters such as evaluating the quality of an article and reviewing the compliance of contributions with community guidelines. The dynamic of conversational derailment is particularly intriguing and consequential in this setting by virtue of its collaborative, goal-oriented nature. In contrast to unstructured commenting forums, cases where one collaborator turns on another over the course of an initially civil exchange constitute perplexing pathologies. In turn, these toxic attacks are especially disruptive in Wikipedia since they undermine the social fabric of the community as well as the ability of editors to contribute (Henner and Sefidari, 2016).
To approach this domain we reconstruct a complete view of the conversational process in the edit history of English Wikipedia by translating sequences of revisions of each talk page into structured conversations. This yields roughly 50 million conversations across 16 million talk pages.
Roughly one percent of Wikipedia comments are estimated to exhibit antisocial behavior (Wulczyn et al., 2017). This illustrates a challenge for studying conversational failure: one has to sift through many conversations in order to find even a small set of examples. To avoid such a prohibitively exhaustive analysis, we first use a machine learning classifier to identify candidate conversations that are likely to contain a toxic contribution, and then use crowdsourcing to vet the resulting labels and construct our controlled dataset.
Job 1: Ends in personal attack. We show three annotators a conversation and ask them to determine if its last comment is a personal attack toward someone else in the conversation.
Annotators | Conversations | Agreement
367 | 4,022 | 67.8%
Job 2: Civil start. We split conversations into snippets of three consecutive comments. We ask three annotators to determine whether any of the comments in a snippet is toxic.
Annotators | Conversations | Snippets | Agreement
247 | 1,252 | 2,181 | 87.5%
Table 1: The two crowdsourced filtering jobs.
Candidate selection. Our goal is to analyze how the start of a civil conversation is tied to its potential future derailment into personal attacks. Thus, we only consider conversations that start out as ostensibly civil, i.e., where at least the first exchange does not exhibit any toxic behavior, and that continue beyond this first exchange. To focus on the especially perplexing cases when the attacks come from within, we seek examples where the attack is initiated by one of the two participants in the initial exchange.
To select candidate conversations to include in our collection, we use the toxicity classifier provided by the Perspective API, which is trained on Wikipedia talk page comments that have been annotated by crowdworkers (Wulczyn et al., 2016). This provides a toxicity score t for all comments in our dataset, which we use to preselect two sets of conversations: (a) candidate conversations that are civil throughout, i.e., conversations in which all comments (including the initial exchange) are not labeled as toxic (t < 0.4); and (b) candidate conversations that turn toxic after the first (civil) exchange, i.e., conversations in which the N-th comment (N > 2) is labeled toxic (t ≥ 0.6), but all the preceding comments are not (t < 0.4).
Crowdsourced filtering. Starting from these candidate sets, we use crowdsourcing to vet each conversation and select a subset that are perceived by humans to either stay civil throughout ("on-track" conversations), or start civil but end with a personal attack ("awry-turning" conversations). To inform the design of this human-filtering process and to check its effectiveness, we start from a seed set of 232 conversations manually verified by the authors to end in personal attacks (more details about the selection of the seed set and its role in the crowdsourcing process can be found in Appendix A). We take particular care to not over-constrain crowdworker interpretations of what personal attacks may be, and to separate toxicity from civil disagreement, which is recognized as a key aspect of effective collaborations (Coser, 1956; De Dreu and Weingart, 2003).
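The preselection rules (a) and (b) based on the toxicity score t can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `classify_candidate` and the label strings are hypothetical, the scores are assumed to be already computed per comment, and the requirement of at least three comments for both sets is our reading of "continue beyond this first exchange".

```python
from typing import List, Optional

CIVIL_MAX = 0.4  # a comment with t < 0.4 is treated as not toxic
TOXIC_MIN = 0.6  # a comment with t >= 0.6 is treated as toxic


def classify_candidate(scores: List[float]) -> Optional[str]:
    """Map per-comment toxicity scores to 'on_track', 'awry', or None.

    Hypothetical helper: 'on_track' corresponds to rule (a), 'awry' to
    rule (b); None means the conversation is not a candidate.
    """
    if len(scores) < 3:
        # Assumption: the conversation must continue past the first exchange.
        return None
    for position, t in enumerate(scores, start=1):
        if t < CIVIL_MAX:
            continue
        # First comment that is not clearly civil: it must be clearly
        # toxic and must come after the first exchange (N > 2).
        if position > 2 and t >= TOXIC_MIN:
            return "awry"
        return None
    # Every comment scored below the civil threshold.
    return "on_track"
```

Comments after the first toxic one are ignored in this sketch, and scores in the ambiguous band between 0.4 and 0.6 disqualify a conversation from either set.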
We design and deploy two filtering jobs using the CrowdFlower platform, summarized in Table 1 and detailed in Appendix A. Job 1 is designed to select conversations that contain a "rude, insulting, or disrespectful" comment towards another user in the conversation, i.e., a personal attack. In contrast to prior work labeling antisocial comments in isolation (Sood et al., 2012; Wulczyn et al., 2017), annotators are asked to label personal attacks in the context of the conversations in which they occur, since antisocial behavior can often be context-dependent (Cheng et al., 2017). In fact, in order to ensure that the crowdworkers read the entire conversation, we also ask them to indicate who is the target of the attack. We apply this task to the set of candidate awry-turning conversations, selecting the 14% which all three annotators perceived as ending in a personal attack.
Job 2 is designed to filter out conversations that do not actually start out as civil. We run this job to ensure that the awry-turning conversations are civil up to the point of the attack (i.e., they turn awry), discarding 5% of the candidates that passed Job 1. We also use it to verify that the candidate on-track conversations are indeed civil throughout, discarding 1% of the respective candidates. In both cases we filter out conversations in which three annotators could identify at least one comment that is "rude, insulting, or disrespectful".
Controlled setting. Finally, we need to construct a setting that affords for meaningful comparison between conversations that derail and those that stay on track, and that accounts for trivial topical confounds (Kittur et al., 2009; Cheng et al., 2015). We mitigate topical confounds using matching, a technique developed for causal inference in observational studies (Rubin, 2007). Specifically, starting from our human-vetted collection of conversations, we pair each awry-turning conversation with an on-track conversation, such that both took place on the same talk page. If we find multiple such pairs, we only keep the one in which the paired conversations take place closest in time, to tighten the control for topic. Conversations that cannot be paired are discarded.
This procedure yields a total of 1,270 paired awry-turning and on-track conversations (including our initial seed set), spanning 582 distinct talk pages (averaging 1.1 pairs per page, maximum 8) and 1,876 (overlapping) topical categories. The average length of a conversation is 4.6 comments.
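The same-page, closest-in-time matching described above can be sketched as below. The `Conversation` record, its field names, and the greedy strategy (smallest time gap first, each conversation used at most once) are assumptions made for illustration; the text does not specify how ties or reuse of on-track conversations are handled.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Conversation(NamedTuple):
    conv_id: str      # hypothetical unique identifier
    page: str         # talk page the conversation took place on
    timestamp: float  # e.g. time of the first comment, in seconds
    label: str        # "awry" or "on_track"


def pair_conversations(convs: List[Conversation]) -> List[Tuple[str, str]]:
    """Pair awry and on-track conversations from the same talk page."""
    by_page: Dict[str, Dict[str, List[Conversation]]] = defaultdict(
        lambda: {"awry": [], "on_track": []}
    )
    for conv in convs:
        by_page[conv.page][conv.label].append(conv)

    pairs: List[Tuple[str, str]] = []
    for page in sorted(by_page):
        groups = by_page[page]
        # All same-page candidate pairs, smallest time gap first.
        candidates = sorted(
            (abs(a.timestamp - o.timestamp), a.conv_id, o.conv_id)
            for a in groups["awry"]
            for o in groups["on_track"]
        )
        used = set()
        for _, awry_id, ontrack_id in candidates:
            if awry_id in used or ontrack_id in used:
                continue  # assumption: no conversation is paired twice
            used.update((awry_id, ontrack_id))
            pairs.append((awry_id, ontrack_id))
    # Conversations that never appear in `pairs` are discarded.
    return pairs
```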

Capturing Pragmatic Devices
We now describe our framework for capturing linguistic cues that might inform a conversation's future trajectory. Crucially, given our focus on conversations that start seemingly civil, we do not expect overtly hostile language, such as insults (Yin et al., 2009), to be informative. Instead, we seek to identify pragmatic markers within the initial exchange of a conversation that might serve to reveal or exacerbate underlying tensions that eventually come to the fore, or conversely suggest sustainable civility. In particular, in this work we explore how politeness strategies and rhetorical prompts reflect the future health of a conversation.
Politeness strategies. Politeness can reflect a-priori good will and help navigate potentially face-threatening acts (Goffman, 1955; Lakoff, 1973), and also offers hints to the underlying intentions of the interlocutors (Fraser, 1980). Hence, we may naturally expect certain politeness strategies to signal that a conversation is likely to stay on track, while others might signal derailment.
In particular, we consider a set of pragmatic devices signaling politeness drawn from Brown and Levinson (1987). These linguistic features reflect two overarching types of politeness. Positive politeness strategies encourage social connection and rapport, perhaps serving to maintain cohesion throughout a conversation; such strategies include gratitude ("thanks for your help"), greetings ("hey, how is your day so far") and use of "please", both at the start ("Please find sources for your edit...") and in the middle ("Could you please help with...?") of a sentence. Negative politeness strategies serve to dampen an interlocutor's imposition on an addressee, often through conveying indirectness or uncertainty on the part of the commenter. Both commenters in example B (Fig. 1) employ one such strategy, hedging, perhaps seeking to soften an impending disagreement about a source's reliability ("I don't think...", "I would assume..."). We also consider markers of impolite behavior, such as the use of direct questions ("Why's there no mention of it?") and sentence-initial second person pronouns ("Your sources don't matter..."), which may serve as forceful-sounding contrasts to negative politeness markers. Following Danescu-Niculescu-Mizil et al. (2013), we extract such strategies by pattern matching on the dependency parses of comments.
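To make a few of these markers concrete, the sketch below flags them with surface-level regular expressions. This is deliberately not the dependency-parse pattern matching used in the paper: it is a crude stand-in, and the tiny lexicons and the names `HEDGES` and `politeness_markers` are hypothetical choices made only for illustration.

```python
import re
from typing import Dict, List

# Hypothetical, deliberately tiny lexicons (not the ones used in the paper).
HEDGES = {"seems", "think", "assume", "suppose", "perhaps", "maybe"}
GRATITUDE = re.compile(r"\b(thanks|thank you)\b", re.IGNORECASE)
DIRECT_QUESTION = re.compile(r"^(why|what|how|who|where|when)\b", re.IGNORECASE)
SECOND_PERSON_START = re.compile(r"^(you|your|you're)\b", re.IGNORECASE)
PLEASE_START = re.compile(r"^please\b", re.IGNORECASE)


def split_sentences(comment: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", comment)
    return [part.strip() for part in parts if part.strip()]


def politeness_markers(comment: str) -> Dict[str, bool]:
    """Return surface-level approximations of a few politeness markers."""
    sentences = split_sentences(comment)
    tokens = set(re.findall(r"[a-z']+", comment.lower()))
    return {
        "hedge": bool(HEDGES & tokens),
        "gratitude": bool(GRATITUDE.search(comment)),
        "direct_question": any(
            bool(DIRECT_QUESTION.match(s)) and s.endswith("?") for s in sentences
        ),
        "second_person_start": any(
            bool(SECOND_PERSON_START.match(s)) for s in sentences
        ),
        "please_start": any(bool(PLEASE_START.match(s)) for s in sentences),
    }
```

A parse-based extractor would avoid the obvious false positives of this approach (for example, treating every occurrence of "think" as a hedge), which is one reason to prefer the method the paper actually follows.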
Types of conversation prompts. To complement our pre-defined set of politeness strategies, we seek to capture domain-specific rhetorical patterns used to initiate conversations. For instance, in a collaborative setting, we may expect conversations that start with an invitation for working together to signal less tension between the participants than those that start with statements of dispute. We discover types of such conversation prompts in an unsupervised fashion by extending a framework used to infer the rhetorical role of questions in (offline) political debates (Zhang et al., 2017b) to more generally extract the rhetorical functions of comments. The procedure follows the intuition that the rhetorical role of a comment is reflected in the type of replies it is likely to elicit. As such, comments which tend to trigger similar replies constitute a particular type of prompt.
To implement this intuition, we derive two different low-rank representations of the common lexical phrasings contained in comments (agnostic to the particular topical content discussed), automatically extracted as recurring sets of arcs in the dependency parses of comments. First, we derive reply-vectors of phrasings, which reflect their propensities to co-occur. In particular, we perform singular value decomposition on a term-document matrix $R$ of phrasings and replies as $R \approx \hat{R} = U_R S V_R^T$, where rows of $U_R$ are low-rank reply-vectors for each phrasing.
Prompt type | Description | Examples
Factual check | Statements about article content, pertaining to or contending issues like factual accuracy. | "The terms are used interchangeably in the US." / "The census is not talking about families here."
Moderation | Rebukes or disputes concerning moderation decisions such as blocks and reversions. | "If you continue, you may be blocked from editing." / "He's accused me of being a troll."
Coordination | Requests, questions, and statements of intent pertaining to collaboratively editing an article. | "It's a long list so I could do with your help." / "Let me know if you agree with this and I'll go ahead [...]"
Casual remark | Casual, highly conversational aside-remarks. | "What's with this flag image?" / "I'm surprised there wasn't an article before."
Action statement | Requests, statements, and explanations about various editing actions. | "Please consider improving the article to address the issues [...]" / "The page was deleted as self-promotion."
Opinion | Statements seeking or expressing opinions about editing challenges and decisions. | "I think that it should be the other way around." / "This article seems to have a lot of bias."
Table 2: Prompt types, with descriptions and example comments.
Next, we derive prompt-vectors for the phrasings, which reflect similarities in the subsequent replies that a phrasing prompts. We construct a prompt-reply matrix $P = (p_{ij})$ where $p_{ij} = 1$ if phrasing $j$ occurred in a reply to a comment containing phrasing $i$. We project $P$ into the same space as $U_R$ by solving for $\hat{P}$ in $P = \hat{P} S V_R^T$ as $\hat{P} = P V_R S^{-1}$. Each row of $\hat{P}$ is then a prompt-vector of a phrasing, such that the prompt-vector for phrasing $i$ is close to the reply-vector for phrasing $j$ if comments with phrasing $i$ tend to prompt replies with phrasing $j$. Clustering the rows of $\hat{P}$ then yields $k$ conversational prompt types that are unified by their similarity in the space of replies.
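The projection step can be spelled out in a few lines. The derivation below is a sketch under standard assumptions that the text does not state explicitly: $V_R$ has orthonormal columns (as in a truncated SVD) and the retained singular values on the diagonal of $S$ are nonzero.

```latex
\begin{align*}
R &\approx \hat{R} = U_R S V_R^{T}, \qquad V_R^{T} V_R = I \\
P &= \hat{P} S V_R^{T} \\
P V_R &= \hat{P} S V_R^{T} V_R = \hat{P} S \\
\hat{P} &= P V_R S^{-1}
\end{align*}
```

Because the decomposition is low-rank, the second line generally holds only approximately, so $\hat{P}$ is best read as a projection of $P$ onto the space spanned by the reply-vectors rather than as an exact solution.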
To infer the prompt type of a new comment, we represent the comment as an average of the representations of its constituent phrasings (i.e., rows of $\hat{P}$) and assign the resultant vector to a cluster.[7] To determine the prompt types of comments in our dataset, we first apply the above procedure to derive a set of prompt types from a disjoint (unlabeled) corpus of Wikipedia talk page conversations (Danescu-Niculescu-Mizil et al., 2012). After initial examination of the framework's output on this external data, we chose to extract k = 6 prompt types, shown in Table 2 along with our interpretations.[8] These prompts represent signatures of conversation-starters spanning a wide range of topics and contexts which reflect core elements of Wikipedia, such as moderation disputes and coordination (Kittur et al., 2007; Kittur and Kraut, 2008). We assign each comment in our present dataset to one of these types.[9]
[7] We scale rows of $U_R$ and $\hat{P}$ to unit norm. We assign comments whose vector representation has (ℓ2) distance ≥ 1 to all cluster centroids to an extra, infrequently-occurring null type which we ignore in subsequent analyses.
[8] We experimented with more prompt types as well, finding that while the methodology recovered finer-grained types, and obtained qualitatively similar results and prediction accuracies as described in Sections 5 and 6, the assignment of comments to types was relatively sparse due to the small data size, resulting in a loss of statistical power.
[9] While the particular prompt types we discover are specific to Wikipedia, the methodology for inferring them is unsupervised and is applicable in other conversational settings.

Analysis
We are now equipped to computationally explore how the pragmatic devices used to start a conversation can signal its future health. Concretely, to quantify the relative propensity of a linguistic marker to occur at the start of awry-turning versus on-track conversations, we compute the log-odds ratio of the marker occurring in the initial exchange (i.e., in the first or second comments) of awry-turning conversations, compared to initial exchanges in the on-track setting. These quantities are depicted in Figure 2A.[10] Focusing on the first comment (represented as ♦s), we find a rough correspondence between linguistic directness and the likelihood of future personal attacks. In particular, comments which contain direct questions, or exhibit sentence-initial you (i.e., "2nd person start"), tend to start awry-turning conversations significantly more often than ones that stay on track (both p < 0.001).[11] This effect coheres with our intuition that directness signals some latent hostility from the conversation's initiator, and perhaps reinforces the forcefulness of contentious impositions (Brown and Levinson, 1987). This interpretation is also suggested by the relative propensity of the factual check prompt, which tends to cue disputes regarding an article's factual content (p < 0.05).
[10] To reduce clutter we only depict features which occur a minimum of 50 times and have absolute log-odds ≥ 0.2 in at least one of the data subsets. The markers indicated as statistically significant for Figure 2A remain so after a Bonferroni correction, with the exception of factual checks, hedges (lexicon, ♦), gratitude (♦), and opinion.
[11] All p values in this section are computed as two-tailed binomial tests, comparing the proportion of awry-turning conversations exhibiting a particular device to the proportion of on-track conversations.
Figure 2: Log-odds ratios of politeness strategies and prompt types exhibited in the first and second comments of conversations that turn awry, versus those that stay on-track. All: Purple and green markers denote log-odds ratios in the first and second comments, respectively; points are solid if they reflect significant (p < 0.05) log-odds ratios with an effect size of at least 0.2. A: ♦s (first comment) and the green markers (second comment) denote log-odds ratios; * denotes statistically significant differences at the p < 0.05 (*), p < 0.01 (**) and p < 0.001 (***) levels for the first comment (two-tailed binomial test); + denotes corresponding statistical significance for the second comment. B and C: ▽s and ⃝s correspond to effect sizes in the comments authored by the attacker and non-attacker, respectively, in attacker-initiated (B) and non-attacker-initiated (C) conversations.
In contrast, comments which initiate on-track conversations tend to contain gratitude (p < 0.05) and greetings (p < 0.001), both positive politeness strategies. Such conversations are also more likely to begin with coordination prompts (p < 0.05), signaling active efforts to foster constructive teamwork. Negative politeness strategies are salient in on-track conversations as well, reflected by the use of hedges (p < 0.01) and opinion prompts (p < 0.05), which may serve to soften impositions or factual contentions (Hübler, 1983).
These effects are echoed in the second comment, i.e., the first reply (the green markers in Figure 2A). Interestingly, in this case we note that the difference in pronoun use is especially marked. First replies in conversations that eventually derail tend to contain more second person pronouns (p < 0.001), perhaps signifying a replier pushing back to contest the initiator; in contrast, on-track conversations have more sentence-initial I/We (i.e., "1st person start", p < 0.001), potentially indicating the replier's willingness to step into the conversation and work with, rather than argue against, the initiator (Tausczik and Pennebaker, 2010).
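The log-odds comparison and significance test used in this analysis can be sketched as follows. Several details are assumptions on our part: the additive `smoothing` term (which only guards against zero counts), the exact form of the two-tailed binomial test (outcomes no more likely than the observed one are summed, with the on-track proportion as the null probability), and all function names.

```python
import math


def log_odds_ratio(awry_with: int, awry_total: int,
                   ontrack_with: int, ontrack_total: int,
                   smoothing: float = 0.5) -> float:
    """Log-odds ratio of a marker in awry-turning vs. on-track conversations.

    `smoothing` is an assumed additive correction, not taken from the paper.
    """
    a = awry_with + smoothing
    b = awry_total - awry_with + smoothing
    c = ontrack_with + smoothing
    d = ontrack_total - ontrack_with + smoothing
    return math.log((a / b) / (c / d))


def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability mass, computed in log space; requires 0 < p < 1."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)


def two_tailed_binomial_p(k: int, n: int, p0: float) -> float:
    """Two-tailed exact binomial p-value for k successes in n trials.

    Sums the probabilities of all outcomes that are no more likely than
    the observed one under the null success probability p0.
    """
    observed = binom_pmf(k, n, p0)
    total = 0.0
    for i in range(n + 1):
        pmf = binom_pmf(i, n, p0)
        if pmf <= observed * (1 + 1e-7):  # tolerance for float rounding
            total += pmf
    return min(1.0, total)
```

In this reading, `k` would be the number of awry-turning conversations exhibiting a device, `n` the number of awry-turning conversations, and `p0` the proportion of on-track conversations exhibiting the same device.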

Distinguishing interlocutor behaviors.
Are the linguistic signals we observe solely driven by the eventual attacker, or do they reflect the behavior of both actors? To disentangle the attacker's and non-attacker's roles in the initial exchange, we examine their language use in these two possible cases: when the future attacker initiates the conversation, or is the first to reply. In attacker-initiated conversations (Figure 2B, 608 conversations), we see that both actors exhibit a propensity for the linguistically direct markers (e.g., direct questions) that tend to signal future attacks. Some of these markers are used particularly often by the non-attacking replier in awry-turning conversations (e.g., second person pronouns, p < 0.001, ⃝s), further suggesting the dynamic of the replier pushing back at, and perhaps even escalating, the attacker's initial hint of aggression. Among conversations initiated instead by the non-attacker (Figure 2C, 662 conversations), the non-attacker's linguistic behavior in the first comment (⃝s) is less distinctive from that of initiators in the on-track setting (i.e., log-odds ratios closer to 0); markers of future derailment are (unsurprisingly) more pronounced once the eventual attacker (▽s) joins the conversation in the second comment.[12] More broadly, these results reveal how different politeness strategies and rhetorical prompts deployed in the initial stages of a conversation are tied to its future trajectory.

Predicting Future Attacks
We now show that it is indeed feasible to predict whether a conversation will turn awry based on linguistic properties of its very first exchange, providing several baselines for this new task. In doing so, we demonstrate that the pragmatic devices examined above encode signals about the future trajectory of conversations, capturing some of the intuition humans are shown to have.
[12] As an interesting avenue for future work, we note that some markers used by non-attacking initiators potentially still anticipate later attacks, suggested by, e.g., the relative prevalence of sentence-initial you (p < 0.05, ⃝s).
We consider the following balanced prediction task: given a pair of conversations, which one will eventually lead to a personal attack? We extract all features from the very first exchange in a conversation, i.e., a comment-reply pair, like those illustrated in our introductory example (Figure 1). We use logistic regression and report accuracies on a leave-one-page-out cross validation, such that in each fold, all conversation pairs from a given talk page are held out as test data and pairs from all other pages are used as training data (thus preventing the use of page-specific information). Prediction results are summarized in Table 3.
Language baselines. As baselines, we consider several straightforward features: word count (which performs at chance level), sentiment lexicon (Liu et al., 2005) and bag of words.
Pragmatic features. Next, we test the predictive power of the prompt type and politeness strategy features introduced in Section 4. The 12 prompt type features (6 features for each comment in the initial exchange) achieve 59.2% accuracy, and the 38 politeness strategy features (19 per comment) achieve 60.5% accuracy. The pragmatic features combine to reach 61.6% accuracy.
Reference points. To better contextualize the performance of our features, we compare their predictive accuracy to the following reference points:
Interlocutor features: Certain kinds of interlocutors are potentially more likely to be involved in awry-turning conversations. For example, perhaps newcomers or anonymous participants are more likely to derail interactions than more experienced editors. We consider a set of features representing participants' experience on Wikipedia (i.e., number of edits) and whether the comment authors are anonymous. In our task, these features perform at the level of random chance.
Trained toxicity: We also compare with the toxicity score of the exchange from the Perspective API classifier. This is a perhaps unfair reference point, since this supervised system was trained on additional human-labeled training examples from the same domain and since it was used to create the very data on which we evaluate. The toxicity score yields an accuracy of 60.5%; combining trained toxicity with our pragmatic features achieves 64.9%.
Humans: A sample of 100 pairs was labeled by (non-author) volunteer human annotators. They were asked to guess, from the initial exchange, which conversation in a pair will lead to a personal attack. Majority vote across three annotators was used to determine the human labels, resulting in an accuracy of 72%. This confirms that humans have some intuition about whether a conversation might be heading in a bad direction, which our features can partially capture. In fact, the classifier using pragmatic features is accurate on 80% of the examples that humans also got right.
Attacks on the horizon. Finally, we seek to understand whether cues extracted from the first exchange can predict future discussion trajectory beyond the immediate next couple of comments. We thus repeat the prediction experiments on the subset of conversations in which the first personal attack happens after the fourth comment (282 pairs), and find that the pragmatic devices used in the first exchange maintain their predictive power (67.4% accuracy), while the sentiment and bag of words baselines drop to the level of random chance.
Overall, these initial results show the feasibility of reconstructing some of the human intuition about the future trajectory of an ostensibly civil conversation in order to predict whether it will eventually turn awry.
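As an illustration of the evaluation protocol used in this section, the following is a minimal sketch of leave-one-page-out cross validation on the paired task. It is not the implementation behind the reported results: representing a pair as the difference between the two conversations' feature vectors, the toy feature values, the page names, and the small gradient-descent logistic regression are all assumptions made for the example.

```python
import math
from collections import defaultdict

# Hypothetical toy data: (talk page id, feature difference, label).
# ASSUMPTION: each pair is encoded as features(conv_1) - features(conv_2),
# and label 1 means conv_1 is the conversation that later turns awry.
toy_pairs = [
    ("page_a", [2.0, -1.0], 1),
    ("page_a", [-1.0, 1.0], 0),
    ("page_b", [1.0, -2.0], 1),
    ("page_b", [-2.0, 1.0], 0),
    ("page_c", [1.0, 0.0], 1),
    ("page_c", [-1.0, 2.0], 0),
]


def page_folds(pairs):
    """Yield (page, train indices, test indices), holding out one page per fold."""
    by_page = defaultdict(list)
    for i, (page, _, _) in enumerate(pairs):
        by_page[page].append(i)
    for page, test_idx in by_page.items():
        held_out = set(test_idx)
        train_idx = [i for i in range(len(pairs)) if i not in held_out]
        yield page, train_idx, test_idx


def train_logreg(X, y, lr=0.5, epochs=200):
    """Fit logistic regression weights (no intercept) by batch gradient descent."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, t in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(x):
                grad[j] += (p - t) * xj
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad)]
    return w


def predict(w, x):
    """Predict 1 if the first conversation of the pair is scored as more at risk."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0


def leave_one_page_out_accuracy(pairs):
    """Train on all other pages, test on the held-out page, and pool accuracy."""
    correct = 0
    for _, train_idx, test_idx in page_folds(pairs):
        w = train_logreg([pairs[i][1] for i in train_idx],
                         [pairs[i][2] for i in train_idx])
        correct += sum(predict(w, pairs[i][1]) == pairs[i][2] for i in test_idx)
    return correct / len(pairs)


print("leave-one-page-out accuracy:", leave_one_page_out_accuracy(toy_pairs))
```

Grouping folds by talk page, rather than splitting pairs at random, is what keeps page-specific information out of the training data for every test pair.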

Conclusions and Future Work
In this work, we started to examine the intriguing phenomenon of conversational derailment, studying how the use of pragmatic and rhetorical devices relates to future conversational failure. Our investigation centers on the particularly perplexing scenario in which one participant of a civil discussion later attacks another, and explores the new task of predicting whether an initially healthy conversation will derail into such an attack. To this end, we develop a computational framework for analyzing how general politeness strategies and domain-specific rhetorical prompts deployed in the initial stages of a conversation are tied to its future trajectory.
Making use of machine learning and crowdsourcing tools, we formulate a tightly controlled setting that enables us to meaningfully compare conversations that stay on track with those that go awry. The human accuracy on predicting future attacks in this setting (72%) suggests that the task is feasible at least at the level of human intuition. We show that our computational framework can recover some of that intuition, hinting at the potential of automated methods to identify signals of the future trajectories of online conversations.
Our approach has several limitations which open avenues for future work. Our correlational analyses do not provide any insights into causal mechanisms of derailment, which randomized experiments could address. Additionally, since our procedure for collecting and vetting data focused on precision rather than recall, it might miss more subtle attacks that are overlooked by the toxicity classifier. Supplementing our investigation with other indicators of antisocial behavior, such as editors blocking one another, could enrich the range of attacks we study. Noting that our framework is not specifically tied to Wikipedia, it would also be valuable to explore the varied ways in which this phenomenon arises in other (possibly noncollaborative) public discussion venues, such as Reddit and Facebook Pages.
While our analysis focused on the very first exchange in a conversation for the sake of generality, more complex modeling could extend its scope to account for conversational features that more comprehensively span the interaction. Beyond the present binary classification task, one could explore a sequential formulation predicting whether the next turn is likely to be an attack as a discussion unfolds, capturing conversational dynamics such as sustained escalation.
Finally, our study of derailment offers only one glimpse into the space of possible conversational trajectories. Indeed, a manual investigation of conversations whose eventual trajectories were misclassified by our models (as well as by the human annotators) suggests that interactions which initially seem prone to attacks can nonetheless maintain civility, by way of level-headed interlocutors, as well as explicit acts of reparation. A promising line of future work could consider the complementary problem of identifying pragmatic strategies that can help bring uncivil conversations back on track.