TalkDown: A Corpus for Condescension Detection in Context

Condescending language use is caustic; it can bring dialogues to an end and bifurcate communities. Thus, systems for condescension detection could have a large positive impact. A challenge here is that condescension is often impossible to detect from isolated utterances, as it depends on the discourse and social context. To address this, we present TalkDown, a new labeled dataset of condescending linguistic acts in context. We show that extending a language-only model with representations of the discourse improves performance, and we motivate techniques for dealing with the low rates of condescension overall. We also use our model to estimate condescension rates in various online communities and relate these differences to differing community norms.


Introduction
Condescending language use can derail conversations and, over time, disrupt healthy communities. The caustic nature of this language traces in part to the ways that it keys into differing social roles and levels of power (Fournier et al., 2002). It is common for people to be condescending without realizing it (Wong et al., 2014), but a lack of intent only partly mitigates the damage it can cause. Thus, condescension detection is a potentially high-impact NLP task that could open the door to many applications and future research directions, including, for example, supporting productive interventions in online communities (Spertus, 1997), educating people who use condescending language in writing, helping linguists to understand the implicit linguistic acts associated with condescension, and helping social scientists to study the relationship between condescension and other variables like gender or socioeconomic status.
Progress on this task is currently limited by a lack of high-quality labeled data. A deeper challenge is that condescension is often impossible to detect from isolated utterances. First, a characteristic of condescending language is that it is not overtly negative or critical; it might even include (insincere) praise (Huckin, 2002). Second, condescension tends to rest on a pair of conflicting pragmatic presuppositions: a speaker presumption that the speaker has higher social status than the listener, and a listener presumption that this is incorrect. For example, an utterance that is entirely friendly if said by one friend to another might be perceived as highly condescending if said by a customer to a store clerk. In such cases, the social roles of the participants shape the language in particular ways to yield two very different outcomes.

Comment: You might just admit that you misunderstood. Are you struggling with this whole English language thing?
Reply: >Are you struggling with this whole English language thing?
Stop being so condescending and engage in a real discussion.
Figure 1: In this example, the REPLY quotes from part of the COMMENT and says that this QUOTED text is condescending. (Figure labels: Context, Quoted, Accusation of Condescension.)
In this paper, we seek to facilitate the development of models for condescension detection by introducing TALKDOWN, a new labeled dataset of condescending acts in context. The dataset is derived from Reddit, a thriving set of online communities that is diverse in content and tone. We focus on COMMENT and REPLY pairs of the sort given in Figure 1, in which the REPLY targets a specific quoted span (QUOTED) in the COMMENT as being condescending. The examples were multiply labeled by crowdsourced workers, which ensures high-quality labels and allows us to include nuanced examples that require human judgment.
Our central hypothesis is that context is decisive for condescension detection. To test this, we evaluate models that seek to make this classification based only on the QUOTED span in the COMMENT as well as extensions of those models that include summary representations of the preceding linguistic context, which we treat as an approximation of the discourse context in which the condescension accusation was made. Models with contextual representations are far superior, bolstering the original hypothesis. In addition, we show that these models are robust to highly imbalanced testing scenarios that approximate the true low rate of condescension in the wider world. Such robustness to imbalanced data is an important prerequisite for deploying models like this. Finally, we apply our model to a wide range of subreddits, arguing that our estimated rates of condescension are related to different community norms.

The TALKDOWN Corpus
We chose Reddit as the basis for our corpus for a few key reasons. First, it is a large, publicly available dataset from an active set of more than one million user-created online communities (subreddits). Second, it varies in both content and tone. Third, users can develop strong identities on the site, which could facilitate user-level modeling, but these identities are generally pseudonymous, which is useful when studying charged social phenomena (Wang and Jurgens, 2018). Fourth, the subreddit structure of the site creates opportunities to study the impact of condescension on community structure and norms (Buntain and Golbeck, 2014; Lin et al., 2017; Chandrasekharan et al., 2018).
The basis for our work is the Reddit data dump 2006-2018. We first extracted COMMENT/REPLY pairs in which the REPLY contains a condescension-related word. After further filtering out self-replies and moderator posts, and normalizing links and references, we obtain 2.62M COMMENT/REPLY pairs. Not all of these examples truly involve the REPLY saying that the QUOTED span is condescending, because our simple pattern-based extraction method is not sufficiently sensitive. To address this, we conducted an annotation project on Amazon Mechanical Turk. Our own initial assessment of 200 examples using a five-point Likert scale revealed two things that informed this project. First, we saw a clear split in the positive instances of condescension. In some, specific linguistic acts are labeled as condescending ("This is really condescending"), whereas others involve general user-level accusations that are not tied to specific acts ("You're so condescending"). We chose to focus on the specific linguistic acts. They provide a more objective basis for annotation, and they can presumably be aggregated to provide an empirically grounded picture of user-level behavior (or others' reactions to such behavior). Thus, for positive instances of condescension, we further limited our attention to COMMENT/REPLY pairs in which the REPLY contains a direct quotation from the COMMENT, identified by fuzzy matching based on Levenshtein distance (Navarro, 2001), as illustrated in Figure 1. We extracted 66K such examples. Some statistics on these examples are given in Table 1.
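The quotation-matching step can be illustrated with a minimal sketch. The paper states only that fuzzy matching based on Levenshtein distance was used; the sliding word window, the lowercasing, and the `max_ratio` threshold below are assumptions made for illustration, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def find_quoted_span(comment: str, quote: str, max_ratio: float = 0.2):
    """Return (start_word, end_word, distance) for the window of the comment
    closest to the quote, or None if no window is close enough.

    Assumption: the window has as many words as the quote, and a match is
    accepted when distance <= max_ratio * len(quote).
    """
    words = comment.split()
    n = len(quote.split())
    if n == 0 or len(words) < n:
        return None
    best = None
    for start in range(len(words) - n + 1):
        window = " ".join(words[start:start + n])
        distance = levenshtein(window.lower(), quote.lower())
        if best is None or distance < best[2]:
            best = (start, start + n, distance)
    return best if best[2] <= max_ratio * len(quote) else None


if __name__ == "__main__":
    comment = ("You might just admit that you misunderstood. "
               "Are you struggling with this whole English language thing?")
    # The reply's quotation contains a typo and a case change.
    quote = "Are you strugling with this whole english language thing?"
    print(find_quoted_span(comment, quote))
```

A tolerance proportional to the quote length lets short typos and casing differences through while rejecting unrelated text; the right threshold for real Reddit data would need to be tuned.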
Second, with the above ambiguity addressed, the signal of condescending or not is mostly clear. Thus, we designed the annotation project around a three-way multiple choice question: condescending, not condescending, and cannot decide. Each task began with instructions and two training questions, followed by 10 different COMMENT/REPLY pairs to be labeled. Appendix A.2 provides screenshots of the annotation interface.
To process the annotations, we filtered out the work of annotators who did not correctly answer the training questions. The remaining annotators have moderate to substantial agreement (Fleiss' κ = 0.593; Fleiss, 1971; Landis and Koch, 1977). We then used Expectation-Maximization (Dempster et al., 1977) to assign labels. In our hand inspection, this yields slightly better quality than majority-vote labels, presumably because it factors individual worker reliability into the decision making. In the end, we obtained 4,992 valid labeled instances: 65.2% labeled as condescending (henceforth, positive), and 34.8% as non-condescending (henceforth, negative). 3 To fully balance the dataset, we pulled out one random month's data for each year in 2011 to 2017.
We extracted instances using the same methods as described above, but we filtered out COMMENT/REPLY pairs in which a condescension-related word appeared. Our final dataset thus consists of annotated positive and negative instances, with supplemental randomly sampled negative instances. For our experiments, we partitioned the data into 80% train, 10% development, and 10% test splits. In addition, to simulate real-world situations, we built a dataset with a 1:20 ratio of positive to negative instances. 4 The basic statistics of the dataset are shown in Table 2.
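The partitioning and the 1:20 construction described above might be sketched as follows. The paper does not say how the split was randomized or how instances were selected for the imbalanced set, so the shuffling, the seeds, and both function names are assumptions.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle and partition items into train/dev/test (80%/10%/10%)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


def make_imbalanced(positives, negatives, neg_per_pos=20, seed=0):
    """Sample a labeled set with `neg_per_pos` negatives for each positive.

    Assumption: positives are subsampled if there are too few negatives to
    reach the target ratio. Labels are 1 (positive) and 0 (negative).
    """
    rng = random.Random(seed)
    n_pos = min(len(positives), len(negatives) // neg_per_pos)
    chosen_pos = rng.sample(list(positives), n_pos)
    chosen_neg = rng.sample(list(negatives), n_pos * neg_per_pos)
    data = [(x, 1) for x in chosen_pos] + [(x, 0) for x in chosen_neg]
    rng.shuffle(data)
    return data
```

Fixing the seed keeps the splits reproducible across runs, which matters when several models are compared on the same development and test sets.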

Experiments
We now establish some baselines for the TALKDOWN Corpus and begin to test the hypothesis that contextual representations are valuable for this task. To do this, we use the BERT model of Devlin et al. (2019), which uses a Transformer-based encoder architecture (Vaswani et al., 2017) to learn word representations by training against a masked language modeling task and a next-sentence prediction task. Our models are initialized with the pretrained representations released by the BERT team and a fully connected layer on top (Figure 4 in Devlin et al. 2019), which is then fine-tuned to our dataset (Peters et al., 2019). We explore both BERT Base (BERT_B) and BERT Large (BERT_L), 5 to determine whether the added expense of using BERT_L is justified. Appendix B provides details on our process of hyperparameter tuning and optimization.

3 There was just one case where cannot decide was the chosen label; it was in Spanish, so we excluded it and added a language classification step to our preprocessing pipeline.
4 To the best of our knowledge, there is no prior work on what percentage of conversations on Reddit (or, more broadly, in daily conversations) are condescending. Thus, we chose the ratio based on informal observations on Reddit.
5 The whole-word masking model was used as it performs better than the original one in multiple benchmarks.

Table 4: The impact of different train-set positive:negative ratios. All the models are BERT_L. The first row is based on the balanced dataset, and the rest are based on the imbalanced dataset with different oversampling ratios. Model selection again used the procedure in Appendix B.

Table 3 summarizes the results of our core experiments. Input 1 and Input 2 describe the basis for the feature representations. Thus, for example, QUOTED∧CONTEXT is a model that uses both the quoted span and the preceding linguistic context.
We report two testing scenarios: Balanced and Imbalanced, in which there are 20 negative examples for each positive example. The results clearly support our hypothesis that context matters; using the QUOTED part and CONTEXT together gives us a 3-4% boost in macro-F1 using the same model architecture. In addition, we see that increasing the capacity of the model also helps, though more modestly. It is noteworthy that the performance of using the QUOTED part alone is better than that of using CONTEXT alone, though the QUOTED part is roughly three times shorter. Thus, there is a strong signal in the QUOTED part (the replier chose this span for a reason), but the context contains a signal as well.
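Macro-F1 is a natural metric for the imbalanced scenario because it weights both classes equally. The following minimal sketch uses the standard definition rather than the authors' evaluation code; the function names are illustrative.

```python
def f1_for_class(gold, pred, cls):
    """F1 score for one class, treating `cls` as the positive label."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all classes seen in gold or pred."""
    classes = sorted(set(gold) | set(pred))
    return sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)


if __name__ == "__main__":
    # One positive per 20 negatives, and a classifier that always says "negative".
    gold = [1] + [0] * 20
    pred = [0] * 21
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    print(f"accuracy={accuracy:.3f} macro_f1={macro_f1(gold, pred):.3f}")
```

In the 1:20 setting, a classifier that always predicts the negative class reaches about 95% accuracy but a macro-F1 below 0.5, so macro-F1 does not reward ignoring the rare class.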

Imbalanced Testing Scenarios
Imbalanced testing scenarios are more challenging, but they also better reflect usage rates of condescending language in public forums like Reddit. To further understand how best to get traction on this problem, we explored a range of different methods for creating training data. Our results are summarized in Table 4. As expected, the balanced problem is best addressed with a balanced dataset. For the imbalanced problem, we found that an oversampling ratio of 2 to 4 yielded the best performance. Our full QUOTED ∧ CONTEXT model is again clearly superior in these scenarios.
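Oversampling of the positive class can be sketched as below. This assumes that an oversampling ratio of r means each positive training instance appears r times, which matches the statement in Appendix B that a ratio of 2 to 4 corresponds to positives being 10%-20% of the negatives under the 1:20 setting. The function name and the (text, label) tuple format are illustrative.

```python
import random


def oversample_positives(examples, ratio, seed=0):
    """Replicate positive examples so each one appears `ratio` times.

    `examples` is a list of (text, label) pairs with label 1 for positive
    (condescending) and 0 for negative. Only training data should be
    oversampled; development and test sets keep their natural ratio.
    """
    if ratio < 1:
        raise ValueError("ratio must be a positive integer")
    positives = [ex for ex in examples if ex[1] == 1]
    negatives = [ex for ex in examples if ex[1] == 0]
    resampled = negatives + positives * ratio
    random.Random(seed).shuffle(resampled)
    return resampled


if __name__ == "__main__":
    train = [(f"pos{i}", 1) for i in range(5)] + [(f"neg{i}", 0) for i in range(100)]
    for r in (1, 2, 4):
        data = oversample_positives(train, r)
        n_pos = sum(label for _, label in data)
        print(f"ratio={r}: {n_pos} positives / {len(data) - n_pos} negatives")
```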

Condescension Rates Across Subreddits
Our hope for TALKDOWN is that it will play a role in developing systems that can help identify condescending acts on social media. This will depend on models trained on TALKDOWN being able to get an accurate read on condescension at scale. As a first step towards assessing this capability, we ran our models on 14 subreddits, over the time period of July 2016 to December 2016, which covers the 2016 U.S. Presidential Election, an event that we expect to influence condescension rates in various ways across Reddit. Appendix C lists these subreddits along with their post counts and estimated average rates of condescension. Figure 2 highlights a selection of them. As a baseline, we include a 10% random sample from the top 100 most active subreddits. Consistently above this baseline are 'politics' and 'funny'. It makes sense that an overtly political subreddit would show a high rate of condescension (as do 'news' and 'worldnews'; Appendix C): politics is a contentious topic, and this was a contentious time period; see also the rising rate for 'The Donald' in the post-election period. It is more surprising that 'funny' shows the highest rates. We do not have a deep understanding of why this is, but it could trace to our model confusing irony and sarcasm with condescension.
Below the baseline are 'AskWomen' and 'pokemon'. We expect 'pokemon' to have low rates of condescension, as it strikes us as a supportive community. However, one might be surprised to see 'AskWomen' so low, especially as compared with 'AskMen', which has high rates in general.
There is wide support for the idea that women experience more condescension than men do (Hall and Braunwald, 1981; Harris, 1993; McKechnie et al., 1998; Cortina et al., 2002; Trix and Psenka, 2003), as reflected in the recent lexical innovation mansplaining, which can be roughly paraphrased as 'a man condescending to a woman'. However, community norms on 'AskWomen' and 'AskMen' are likely shaping these outcomes. Whereas the description for 'AskWomen' says it is "curated to promote respectful and on-topic discussions, and not serve as a debate subreddit", the description for 'AskMen' ends with "And don't be an asshole. Also, go away."

Conclusion
We introduced TALKDOWN, a new annotated Reddit corpus of condescending linguistic acts in context. Using BERT, we established baseline models that suggest this is a challenging task, and one that benefits from rich contextual representations. Finally, in qualitative analyses on diverse subreddits, we offered initial evidence that models trained on TALKDOWN generalize to new data, a prerequisite for using them to help improve online communities via condescension detection. The full dataset with the pretrained BERT model is available at http://github.com/zijwang/talkdown.

A Data
A.1 In-house Annotation Analysis
Figure 3 shows the five-point Likert scale annotations from two in-house annotators. It can be seen that the signal of condescending or not is clear, and the agreement level between the two annotators is substantial: the Fleiss' κ is 0.613 for the five-point scale and 0.732 when normalized to the three-point scale used in the paper (Fleiss, 1971; Landis and Koch, 1977).
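For reference, the agreement statistic can be computed as in the following sketch, which implements the standard Fleiss' κ formula. It assumes a table of per-item category counts with the same number of raters for every item; the toy tables in the usage example are invented and are not the paper's annotations.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every row must sum to the same
    number of raters n (n >= 2)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Proportion of all assignments that fall into each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_categories)]
    # Observed agreement per item, then averaged over items.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_observed = sum(p_item) / n_items
    # Agreement expected by chance.
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_observed - p_expected) / (1 - p_expected)


if __name__ == "__main__":
    # Toy data: 3 raters, 2 categories (condescending, not condescending).
    print(fleiss_kappa([[3, 0], [0, 3]]))  # full agreement
    print(fleiss_kappa([[2, 1], [1, 2]]))  # agreement below chance level
```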

A.2 Annotation Interface
In this section, we show examples of the annotation interface we used on Amazon Mechanical Turk: Figure 4 and Figure 5.
Annotators were presented with the task name, the instructions, and two simple training questions, followed by a warning in red saying they needed to pass the training questions to proceed (Figure 4). They had unlimited trials for the training questions, and explanations (for both correct and incorrect answers) were presented directly after each trial. This helped the annotators learn how to approach the task.
After they passed the training questions, they were told that they could start the test questions (Figure 5). The interface of the test questions was similar to that of the training questions, but without explanations after selections. We explicitly checked that the annotators had made a selection on each test question before submission, whereas this was not enforced for the training questions. This was to filter out possibly low-quality annotations, where the annotators did not pay attention to the instructions.

B Model Hyperparameters
Our BERT models were trained using a set of hyperparameters based on the recommendations in Devlin et al. 2019. Specifically, we set:
• Model Architecture:
  - BERT_B: Bert Base, Cased
  - BERT_L: Bert Large, Cased, with whole-word masking
• Learning rate: {0.5, 0.8, 1, 2, 3, 5} · 10^-5
• Epoch: 2, 3
• Batch size: 32
• Max sequence length: 512
When optimizing these models, we set the batch size to 32 in order to ensure there was at least one positive instance per mini-batch. Grid search was performed with different learning rates and oversampling ratios, and best models were selected based on the best performance on the development set under the imbalanced setting. We found that oversampling 2 to 4 times the positive class (i.e., 10%-20% of the number of instances in the negative class) generally yielded good performance in all the experiments we ran. For all experiments, we used the HuggingFace PyTorch implementation of BERT. 8

8 https://github.com/huggingface/pytorch-transformers/

C Subreddit Condescension Rates
Table 5: Subreddit experiment statistics. The raw data are from the Reddit dump from July 2016 to December 2016. 'Pairs' are COMMENT/REPLY pairs as defined in the paper. 'Mean rate' is the mean rate of condescension as estimated by our best model, and 'Std. err' gives the associated standard error. 'random' is a 10% random sample from the top 100 active subreddits over the same time period.
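The 'Mean rate' and 'Std. err' columns could be computed along the following lines. The paper does not specify whether rates are averaged over binary predictions or over predicted probabilities, nor how the standard error was derived. This sketch assumes one model score per COMMENT/REPLY pair and uses the usual standard error of the mean; the function name is hypothetical.

```python
import math
import statistics


def mean_rate_and_stderr(scores):
    """Mean and standard error of the mean for per-pair model outputs.

    `scores` may be binary predictions (0/1) or probabilities in [0, 1].
    The standard error is the sample standard deviation divided by sqrt(n).
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate a standard error")
    mean = statistics.fmean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, stderr


if __name__ == "__main__":
    # Invented predictions for one subreddit: 1 = predicted condescending.
    predictions = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
    rate, err = mean_rate_and_stderr(predictions)
    print(f"mean rate={rate:.3f}, std. err={err:.3f}")
```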