#WhyIStayed, #WhyILeft: Microblogging to Make Sense of Domestic Abuse

In September 2014, Twitter users unequivocally reacted to the Ray Rice assault scandal by unleashing personal stories of domestic abuse via the hashtags #WhyIStayed or #WhyILeft. We explore at a macro-level firsthand accounts of domestic abuse from a substantial, balanced corpus of tweeted instances designated with these tags. To seek insights into the reasons victims give for staying in vs. leaving abusive relationships, we analyze the corpus using linguistically motivated methods. We also report on an annotation study for corpus assessment. We perform classification, contributing a classifier that discriminates between the two hashtags exceptionally well at 82% accuracy with a substantial error reduction over its baseline.


Introduction
Domestic abuse is a problem of pandemic proportions; nearly 25% of females and 7.6% of males have been raped or physically assaulted by an intimate partner (Tjaden and Thoennes, 2000). These numbers only include physical violence; psychological abuse and other forms of domestic abuse may be even more prevalent. There is thus an urgent need to better understand and characterize domestic abuse, in order to provide resources for victims and efficiently implement preventative measures.
Survey methods exploring domestic abuse involve considerable time and investment, and may suffer from under-reporting, due to the taboo and stressful nature of abuse. Additionally, many victims may not have the option of directly seeking clinical help. Social media may provide a less intimidating and more accessible channel for reporting, collectively processing, and making sense of traumatic and stigmatizing experiences (Homan et al., 2014; Walther, 1996). Such data has been used for analyzing and predicting distinct societal and health issues, aimed at improving the understanding of wide-reaching societal concerns. For instance, Choudhury et al. (2013) predicted the onset of depression from user tweets, while other studies have modeled distress (Homan et al., 2014; Lehrman et al., 2012). Xu et al. (2013) used Twitter data to identify bullying language, then analyzed the characteristics of these tweets, and forecast whether a tweet would be deleted out of regret.
In September 2014, in the wake of the Ray Rice assault scandal and the negative public reaction to the victim's decision to stay and support her abuser, Twitter users unequivocally reacted in a viral discussion of domestic abuse, defending the victim using the hashtag #WhyIStayed and contrasting those with #WhyILeft. Such narrative sharing may have a cathartic and therapeutic effect, extending the viral reach of the trend.
Analysis of the linguistic structures embedded in these tweet instances provides insight into the critical reasons that victims of domestic abuse report for choosing to stay or leave. Trained classifiers agree with these linguistic structures, adding evidence that these social media texts provide valuable insights into domestic abuse.
Figure 1: Tweet count per hour with #WhyIStayed (dotted) or #WhyILeft (solid) from 9/8 to 9/12. Times in EST, vertical lines mark 12 hour periods, with label corresponding to its left line. Spam removed, includes meta tweets.

Data
We collected a new corpus of tweets using the Twitter and Topsy application programming interfaces. The corpus spans the beginning of September (the start of the trend) to the beginning of October, 2014. We fully rehydrated the tweets (to update the retweet count, etc.) at the end of the collection period. Figure 1 displays the behavior from the initial days of this trend. Due to its viral nature, the majority of tweets are from the first week of the trend's creation.

Preprocessing
We removed spam tweets based on the usernames of the most prevalent spammers, as well as key spam hashtags. We also removed tweets related to a key controversy, in which the Twitter account for DiGiorno Pizza (ignorant of the trend's meaning) tweeted "#WhyIStayed You had pizza." This resulted in over 57,000 unique tweets in the corpus.
Many tweets in the dataset were reflections on the trend itself or contained messages of support to the users sharing their stories, for example, "Not usually a fan of hashtag trends, but #WhyIStayed is incredibly powerful. #NFL #RayRice." These tweets, here denoted meta-tweets, were often retweeted, but they rarely contained reasons for staying or leaving (our interest), so we filtered them out by keyword. In section 2.3 we empirically explore the remaining instances.

Extracting Gold Standard Labels
Typically, users provided reasons for staying and leaving, with the reasons prefixed by or appended with the hashtags #WhyIStayed or #WhyILeft, as in this example: "#WhyIStayed because he told me no one else would love me. #WhyILeft because I gained the courage to love myself." Regular expressions matched these structures and, for tweets marked by both tags, split them into multiple instances, each labeled with its respective tag. If the tweet contained only one of the target hashtags, the instance was labeled with that hashtag. If the tweet contained both hashtags but did not match any of the regular expressions, it was excluded to ensure data quality.
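The labeling rules above can be sketched in a few lines of Python. This is only an illustration: the pattern `SEGMENT` and the function `extract_instances` are hypothetical names, the actual regular expressions used for the corpus are not given in the text, and this simplified version handles only the case where each hashtag prefixes its reason.

```python
import re

TAGS = ("#whyistayed", "#whyileft")

# Hypothetical pattern: a target hashtag followed by the text that runs
# up to the next target hashtag (or the end of the tweet).
SEGMENT = re.compile(
    r"(#whyistayed|#whyileft)\b(.*?)(?=#whyistayed\b|#whyileft\b|$)",
    re.IGNORECASE | re.DOTALL,
)

def extract_instances(tweet):
    """Return a list of (text, label) pairs, or [] if the tweet is excluded."""
    lowered = tweet.lower()
    present = [tag for tag in TAGS if tag in lowered]
    if not present:
        return []
    if len(present) == 1:
        # Only one target hashtag: label the whole tweet with it.
        return [(tweet.strip(), present[0])]
    # Both hashtags: split only if every tag is followed by non-empty text.
    segments = [(m.group(2).strip(), m.group(1).lower())
                for m in SEGMENT.finditer(tweet)]
    if len(segments) >= 2 and all(text for text, _ in segments):
        return segments
    # Both tags present but no structural match: exclude the tweet.
    return []
```

Applied to the example tweet above, this sketch yields two instances, one per hashtag, while a tweet such as "he hit me #WhyILeft #WhyIStayed" is excluded because neither tag introduces a reason.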
The resulting corpus comprised 24,861 #WhyIStayed and 8,767 #WhyILeft labeled datapoints. The class imbalance may be a result of the origins of the trend rather than an indicator that more victims stay than leave. The tweet that started the trend contained only the hashtag #WhyIStayed, and media reporting on the trend tended to refer to it as the "#WhyIStayed phenomenon." As Figure 1 shows, the first #WhyILeft tweet occurred hours after the #WhyIStayed trend had taken off, and never gained as much use. By this reasoning, we concluded that an even set of data would be appropriate, and enable us to use the ratio metric in experiments discussed in this paper, as well as compare themes in the two sets. By random sampling of #WhyIStayed, a balanced set of 8,767 examples per class was obtained, resulting in a binary 50% baseline. From this set, 15% were held out as a final testset, to be considered after a tuning procedure with the remaining 85% devset.
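A minimal sketch of the balancing and hold-out step follows. The function name `balance_and_split`, the fixed seed, and the unstratified shuffle are assumptions made for illustration; the text states only that #WhyIStayed was randomly sampled down to the #WhyILeft count and that 15% of the balanced set was held out.

```python
import random

def balance_and_split(stayed, left, held_out=0.15, seed=0):
    """Downsample to equal class sizes, then split into (devset, testset)."""
    rng = random.Random(seed)  # assumed seed, for reproducibility only
    n = min(len(stayed), len(left))
    # Randomly sample each class down to the size of the smaller class.
    data = [(text, "#WhyIStayed") for text in rng.sample(stayed, n)]
    data += [(text, "#WhyILeft") for text in rng.sample(left, n)]
    rng.shuffle(data)
    cut = round(len(data) * held_out)
    # The first `cut` shuffled items form the testset, the rest the devset.
    return data[cut:], data[:cut]
```

With equal class sizes, always predicting one class gives the 50% baseline mentioned above.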

Annotation Study
Four people (co-authors) annotated a random sample of 1000 instances from the devset, to further characterize the filtered corpus and to assess the automated extraction of gold standard labels. This random subset is composed of 47% #WhyIStayed and 53% #WhyILeft gold standard samples. Overall agreement overlap was 77% and Randolph's free-marginal multirater kappa (Warrens, 2010) score was 0.72. According to the annotations in this random sample, on average 36% of the instances are reasons for staying (S), 44% are reasons for leaving (L), 12% are meta comments (M), 2% are jokes (J), 2% are ads (A), and 4% do not match prior categories (O). Table 1 shows that most instances related directly to S or L, with annotators identifying L more clearly.
Of interest are examples in which annotators did not agree, as these are indicative of problems in the data, and are samples that a classifier will likely label incorrectly. The tweet "because i was slowly dying anyway" was marked by two annotators as S and two annotators as L. Did the victim have no hope left and decide to stay? Or did the victim decide that since they were "slowly dying anyway" they could attempt to leave despite the possibility of being killed in the attempt? The ground truth label is #WhyILeft. Another example with two annotators labeling as S and two as L is "two years of bliss, followed by uncertainty and fear". This tweet's label is #WhyIStayed. The limited context in these samples makes them difficult to interpret fully, and causes human annotators to fail; however, most cases contain clear enough reasoning to interpret correctly.
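For reference, Randolph's free-marginal multirater kappa fixes chance agreement at 1/k for k categories: kappa = (P_o - 1/k) / (1 - 1/k), where P_o is the proportion of agreeing rater pairs averaged over items. The sketch below is a plain-Python illustration; the function name and input layout are assumptions, not the authors' code.

```python
from collections import Counter

def free_marginal_kappa(ratings, num_categories):
    """ratings: one list of labels per item, all with the same number of raters."""
    n_raters = len(ratings[0])
    agreeing_pairs = 0
    for item in ratings:
        counts = Counter(item)
        # Ordered pairs of raters that chose the same category for this item.
        agreeing_pairs += sum(c * (c - 1) for c in counts.values())
    p_observed = agreeing_pairs / (len(ratings) * n_raters * (n_raters - 1))
    p_expected = 1 / num_categories  # free-marginal chance agreement
    return (p_observed - p_expected) / (1 - p_expected)
```

If the reported 77% overall agreement is taken as P_o and the six categories (S, L, M, J, A, O) as k, the formula gives (0.77 - 1/6) / (1 - 1/6), roughly 0.72, which is in line with the reported score.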

Cleaning and Classifier Tuning
All experiments used the same cleaned data: removing hashtags, replacing URLs with the token url and user mentions with @mention, and replacing common emoticons with a sentiment indicator: emotsent{p|n|neut} for positive/negative/neutral. Informal register was expanded to standard English forms using a slang dictionary. Classifier tuning involved 5-fold cross-validation and selecting the best parameters based on the mean accuracy. For heldout data testing the full devset was used for training.
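The cleaning steps can be illustrated with the sketch below. The emoticon table, the slang entries, and the exact spelling of the sentiment tokens (`emotsentp`, `emotsentn`, `emotsentneut`) are assumptions standing in for resources that are not listed in the text.

```python
import re

# Hypothetical emoticon inventory; the real list is not given in the text.
EMOTICONS = {
    ":)": "emotsentp", ":-)": "emotsentp", ":D": "emotsentp",
    ":(": "emotsentn", ":-(": "emotsentn", ":'(": "emotsentn",
    ":|": "emotsentneut", ":-|": "emotsentneut",
}
# Hypothetical entries standing in for the slang dictionary.
SLANG = {"u": "you", "bc": "because", "b4": "before"}

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")

def clean(tweet):
    """Apply the cleaning steps in order and return the normalized text."""
    text = URL.sub("url", tweet)          # URLs -> url (before hashtag removal)
    text = MENTION.sub("@mention", text)  # user mentions -> @mention
    text = HASHTAG.sub("", text)          # drop hashtags
    tokens = []
    for tok in text.split():
        if tok in EMOTICONS:
            tokens.append(EMOTICONS[tok])
        else:
            # Expand informal forms; unknown tokens pass through unchanged.
            tokens.append(SLANG.get(tok.lower(), tok))
    return " ".join(tokens)
```

Removing the hashtags matters for the classification experiments, since the two target hashtags are also the class labels.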

Analysis of Vocabulary
We examined the vocabulary in use in the data of the two hashtag sets by creating a frequency distribution of all unigrams after stoplisting and lowercasing. The wordcloud unigrams in Figure 2 are weighted by their relative frequency. These wordclouds hint at the reasons; however, decontextualized unigrams lead to confusion. For example, why does left appear in both? Other experiments were done to provide context and expand analysis.
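A unigram frequency distribution of this kind can be sketched as follows. The stoplist and the tokenization rule are placeholders chosen for illustration; the actual stoplist is not specified in the text.

```python
import re
from collections import Counter

# Hypothetical stoplist; a real one would be much longer.
STOPWORDS = {"i", "me", "my", "he", "she", "the", "a", "to",
             "and", "was", "because", "of"}

def unigram_relative_freq(tweets):
    """Lowercase, drop stopwords, and return each unigram's relative frequency."""
    counts = Counter()
    for tweet in tweets:
        for tok in re.findall(r"[a-z']+", tweet.lower()):
            if tok not in STOPWORDS:
                counts[tok] += 1
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()} if total else {}
```

Running this separately on each hashtag set gives the per-class weights that a wordcloud would display.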

Analysis of Subject-Verb-Object Structures
Data inspection suggested that many users explained their reasons using a Subject-Verb-Object (SVO) structure, in which the abuser is doing something to the victim, or the victim is explaining something about the abuser or oneself. We used the open-source tools Tweeboparser (Kong et al., 2014) and TurboParser (Martins et al., 2013) to heuristically extract syntactic dependencies, constrained by pronominal usage. Both parsers performed similarly, most likely due to the well-formed English in the corpus. While tweets are known for non-standard forms, the seriousness of the discourse domain may have encouraged more standard writing conventions. Using TurboParser, we conducted an analysis for both male and female genders acting as the abuser in the subject position. Starting at the lemmatized predicate verb in each dependency parse, if the predicate verb followed an abuser subject word per the dependency links, and preceded a victim object word, it was added to a conditional frequency distribution, with the two classes as conditions. These structures are here denoted abuser onto victim. We used similar methods to extract structures in which the victim is the subject. Instances with female abusers were rare, and statistical gender differences could not be pursued. Accordingly, both genders' frequency counts were combined. Discriminative predicates from these conditional frequency distributions were determined by equation (1). In Table 2 we report on those where the ratio is greater than 0.75 and the total count exceeds a threshold to avoid bias towards lower frequency verbs.
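The selection of discriminative predicates can be illustrated with a small conditional frequency distribution. Equation (1) is not reproduced in this text, so the sketch assumes that the ratio is the share of a predicate's occurrences that fall in its more frequent class; the function name and the `min_count` default are likewise hypothetical.

```python
from collections import Counter, defaultdict

def discriminative_predicates(pairs, min_ratio=0.75, min_count=5):
    """pairs: iterable of (class_label, lemmatized_verb) from SVO structures."""
    cfd = defaultdict(Counter)  # class label -> verb counts
    for label, verb in pairs:
        cfd[label][verb] += 1
    verbs = set().union(*(c.keys() for c in cfd.values())) if cfd else set()
    selected = {}
    for verb in verbs:
        per_class = {label: cfd[label][verb] for label in cfd}
        total = sum(per_class.values())
        top_label = max(per_class, key=per_class.get)
        # Assumed form of the ratio: dominant-class count over total count.
        ratio = per_class[top_label] / total
        if ratio > min_ratio and total > min_count:
            selected[verb] = (top_label, ratio)
    return selected
```

The count threshold plays the role described above: a verb seen once is trivially "100% one class", so low-frequency verbs are filtered out before the ratio is trusted.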

Classification Experiments
We examined the usefulness of the SVO structures, using subsets of the devset and testset having SVO structures (10% of the instances in total). While 10% is not a large proportion overall, given the massive number of possible dependency structures, it is a pattern worth examining, not only for corpus analytics but also for classification, particularly as these SVO structures provide insight into the abuser-victim relationship. A linear SVM using boolean SVO features performed best (C=1), obtaining 70% ± 2% accuracy on the devset and 73% accuracy on the testset. The weights assigned to features by a linear SVM are indicative of their importance (Guyon et al., 2002). Here, the top features presented as (S,V,O) for #WhyIStayed were: (he, introduce, me), (i, think, my), (he, convince, me), (i, believe, his), and (he, beat, my). For #WhyILeft they were (he, choke, me), (i, beg, me), (he, want, my), (i, realize, my), and (i, listen, my).
The SVO structures capture meaning related to staying and leaving, but are limited in their data coverage. Another experiment explored an extended feature set including uni-, bi-, and trigrams in sublinear tf × idf vectors, tweet instance character length, its retweet count, and SVO structures. We compared Naïve Bayes, Linear SVM, and RBF SVM classifiers from the Scikit-learn package (Pedregosa et al., 2011). The RBF SVM performed slightly better than the others, achieving a maximum accuracy of 81% ± 0.3% on the devset and 82% on the testset. Feature ablation, following the procedure in Fraser et al. (2014), was utilized to determine the most important features for the classifier, the results of which can be seen in Table 3.
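To make the n-gram weighting concrete, the sketch below computes sublinear tf × idf weights for uni-, bi-, and trigrams in plain Python. It uses the textbook form (1 + ln tf) × ln(N / df) and is not the Scikit-learn implementation used in the experiments; library implementations may smooth or normalize differently, and the function names are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=3):
    """All word n-grams of length 1..n_max, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def sublinear_tfidf(docs, n_max=3):
    """docs: list of token lists. Returns one {ngram: weight} dict per document."""
    grams = [Counter(ngrams(doc, n_max)) for doc in docs]
    # Document frequency: number of documents containing each n-gram.
    df = Counter(g for doc in grams for g in doc)
    n_docs = len(docs)
    return [
        {g: (1 + math.log(tf)) * math.log(n_docs / df[g])
         for g, tf in doc.items()}
        for doc in grams
    ]
```

The sublinear term dampens repeated n-grams within one tweet, while the idf term gives zero weight to n-grams that occur in every document.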

Interestingly, the SVO features combined with ngrams worsened performance slightly, perhaps due to trigrams capturing the majority of SVO cases. The highest accuracy, 82.21% on the testset, could be achieved with a combination of ngrams, informal register replacement, and retweet count. However, the vast majority of cases can be classified accurately with ngrams alone. Emoticons may not have contributed to performance since they were rare in the corpus. Standardizing non-standard forms presumably helped the SVM slightly by boosting the frequency counts of ngrams while removing nonstandard ngrams. Tweet length reduced accuracy slightly, while the number of retweets helped.

Discussion
From the analyses of SVO structures, wordclouds, and Linear SVM weights, interesting micronarratives of staying and leaving emerge. Victims report staying in abusive relationships due to cognitive manipulation, as indicated by a predominance of verbs including manipulate, isolate, convince, think, believe, and felt, while they report leaving when experiencing or fearing physical violence, via predicates such as kill and kick. They also report staying when in dire financial straits (money), when attempting to keep the nuclear family united (family, marriage), or when experiencing shame about their situation (ashamed, shame). They report leaving when threats are made towards loved ones (son, daughter), or when they gain agency (choose, decide), realize their situation or self-worth (realize, learn, worth, deserve, finally, better), or gain support from friends or family (courage, support, help). Importantly, such reasons for staying are validated in the clinical literature (Buel, 1999).

Conclusion
We discuss and analyze a filtered, balanced corpus having the hashtags #WhyIStayed or #WhyILeft. Our analysis reveals micro-narratives in tweeted reasons for staying vs. leaving. Our findings are consistent across various methods, correspond to observations in the clinical literature, and affirm the relevance of NLP for exploring issues of social importance in social media. Future work will focus on improving SVO extraction, especially adding consideration for negations of predicate verbs. In addition, we will analyze other hashtags in use in the trend and perform further analysis of the trend itself, implement advanced text normalization rather than relying on a dictionary, and determine the roles that features from linked webpages and FrameNet or other semantic resources play in making sense of domestic abuse.

Acknowledgement
This work was supported in part by a Golisano College of Computing and Information Sciences Kodak Endowed Chair Fund Health Information Technology Strategic Initiative Grant and NSF Award #SES-1111016.