Understanding the Semantics of Narratives of Interpersonal Violence through Reader Annotations and Physiological Reactions

Interpersonal violence (IPV) is a prominent sociological problem that affects people of all demographic backgrounds. By analyzing how readers interpret, perceive, and react to experiences narrated in social media posts, we explore an understudied source for discourse about abuse. We asked readers to annotate Reddit posts about relationships with vs. without IPV for stakeholder roles and emotion, while measuring their galvanic skin response (GSR), pulse, and facial expression. We map annotations to coreference resolution output to obtain a labeled coreference chain for stakeholders in texts, and apply automated semantic role labeling for analyzing IPV discourse. Findings provide insights into how readers process roles and emotion in narratives. For example, abusers tend to be linked with violent actions and certain affect states. We train classifiers to predict stakeholder categories of coreference chains. We also find that subjects’ GSR noticeably changed for IPV texts, suggesting that co-collected measurement-based data about annotators can be used to support text annotation.


Introduction
More than one in three women and one in four men in the United States have experienced rape, physical violence, and/or stalking by an intimate partner (Black et al., 2011). One in nine girls and one in 53 boys under the age of eighteen are sexually abused by an adult (Finkelhor et al., 2014). Additionally, approximately one in ten elders in the USA have faced intimidation, isolation, neglect, and threats of violence. 1 Such interpersonal violence (IPV) 2 can lead to injury, depression, post-traumatic stress disorder, substance abuse, sexually transmitted diseases, as well as hospitalization, disability, or death (Black et al., 2011). Most of the science on IPV is based on survey and interview data. However, the nature of IPV relationships can make people feel uncomfortable or unsafe when participating in such studies, leading to inaccurate results. Also, surveys can be costly and time-consuming to carry out (Schrading et al., 2015b).
Social media is an understudied source of IPV data. Over 79% of adults that frequent the internet utilize social media (Greenwood et al., 2016). Online, individuals can anonymously share their experiences without fear of embarrassment or repercussions. Such narratives can also provide more details than surveys, and may lead to a deeper understanding of IPV. Nonetheless, it is extremely difficult to establish reference annotations useful for predictive modeling for discourse topics as emotionally charged as IPV.
We meet these challenges with a combination of annotator labeling, analyzing annotations, applying semantic processing techniques (coreference resolution, semantic role labeling, sentiment analysis), developing classifiers, and studying physiological sensor measurements collected in real-time from annotators as they read and annotate texts. Our contributions include: 1. Studying characteristics of the key players and their actions in IPV narratives.
2. Using coreference chains as units that map human to automated annotations for analyzing semantic roles, predicates, and characteristics such as pronoun usage to affective tone.
3. Applying distinct semantic features for classifying stakeholders, using coreference chains as classification units.
4. Analyzing how annotators interpret emotional tone of texts vs. their own reactions to them, and discussing the link to annotators' measurement-based sensor data gathered as they labeled texts about abuse.

Background and Related Work
The World Health Organization (WHO) includes in its definition of IPV acts committed by family members and intimate partners, as well as those who are unrelated to or unfamiliar with the victim (Krug et al., 2002). It divides violence into physical, sexual, psychological, and deprivational/neglect categories. The Duluth model provides another established categorization of types of violence, but was originally developed for therapy treating men who abuse women, rather than for understanding IPV scientifically (Rizza, 2009). Our study takes as its theoretical basis categories from the Department of Justice: physical, sexual, emotional, economic and psychological. 3 This Schrading et al. (2015a) developed classifiers to determine whether a Reddit post described abuse. In the study, the subreddit to which the post belonged was used to map to binary gold-standard labels: if a post came from a subreddit such as /r/survivorsofabuse, it was categorized as a post about abuse. We also draw upon such social media text data as a basis for our study, as Reddit allows us to consider narrative texts. However, we consider human perception and text annotation in conjunction with biophysical data sensed from reader-annotators.
Our study makes use of coreference resolution and semantic role labeling (SRL); the former to identify mentions linked to the same referent which are semantically co-indexed, while the latter identifies the relationships of predicates and their arguments in sentences. For example, SRL maps entity causing damage and agent as semantic descriptions of he in he hurt me. For IPV texts, current automated SRL does not directly correspond to IPV researchers' characterization frameworks, but automatically processing IPV-related texts in meaningful ways could enable IPV researchers to take advantage of such tools.
We use Linguistic Inquiry and Word Count (LIWC), a resource that cross-references word tokens with dictionaries containing categories of words such as positive/negative emotion and first/second/third person (Tausczik and Pennebaker, 2010). Normalizing the frequencies with which words occur in each category dictionary by the length of the input text allows for observations of lexical trends.
This study also considers physiological responses of reader-annotators when collecting annotations. Pulse changes have been associated with emotional reactions, as has galvanic skin response (GSR) with, for instance, arousal and stress. Prior work reports on observed changes in GSR when drivers navigated through various routes, noting spikes in skin conductance at particularly stressful traffic points (Taylor, 1964). More recently, researchers showed that GSR readings changed when individuals completed tasks with increased cognitive load (Shi et al., 2007). We incorporate forms of affect assessment and selfreporting in order to examine both reported and sensor data.

Annotation Experiment and
Pre-Processing We selected 80 narrative posts from relevant subreddits, such as /r/relationship advice and /r/survivorsofabuse. 40 texts were about relationships 4 with IPV and 40 control texts were about relationships without IPV mentions. Texts presented to subjects, often from anonymous 'throwaway' accounts, contained no personal details. Twenty subjects read and annotated texts using eHOST 5 , while sensors recorded pulse, GSR, and facial reactions (see Figures 1 and 2). Subjects were college-aged adults (10 women, 9 men, 1 non-disclose). They received $20 for participating.
Subjects completed two trials, each lasting 25 minutes; Trial 1 without IPV and Trial 2 with IPV. The ordering of the trials was consistent across participants, while texts within each trial were presented in random order. Time was extended an extra five minutes for one participant.
For each text, the tasks were: 1. Indicate the first occurrence of each stakeholder in the text. Labels for Trial 1: Partner, Secondary, and Other. Labels for Trial 2: Victim, Abuser, Victim-Supporter, Abuse-Enabler, and Other. Labels could apply to multiple stakeholders in a text.

2.
What is the dominant emotion conveyed in the text? Subjects selected 1 of 8 possible choices from the Plutchik wheel of emotions: Anger, Fear, Anticipation, Trust, Surprise, Sadness, Joy, or Disgust (Plutchik, 2001).

How do you feel reading this text?
Subjects indicated their own emotional response to each text.

Which types of abuse does this account fall under?
For texts with IPV, subjects indicated the types of violence mentioned in each text: Physical, Sexual, Emotional, Psychological, and Economic.
In the two trials (reading and annotating texts with vs. without IPV), we recorded subjects' physiological responses. Specifically, a Shimmer 3 GSR+ sensor recorded pulse and GSR on their non-dominant hands, while Camtasia 6 recorded subjects' faces and upper bodies and their screens (see Figure 1).

Linguistic Data Processing
In order to cover more texts and minimize boredom and fatigue, we asked subjects to label only the first mention of each stakeholder in each text. Coreference resolution identifies multiple references to the same individual in a given text; for example, my father, he, and dad might refer to the same individual that together form a disambiguated coreference chain. Automatic coreference resolvers such as CoreNLP 7 are reasonably accurate (Manning et al., 2014). We used CoreNLP to collect the remaining mentions of these stakeholders. Then, we manually inspected and corrected coreference linkages; one issue addressed was falsely non-linked chains.
Stakeholder labels assigned by subjects were associated with their coreference chain by use of an algorithm that took into account the similarity between these two labeled sets of text. This algorithm minimized the Levenshtein distance between words contained in the subject-labeled text and the coreference text, placing a higher weight on matching noun/pronoun headwords. Then, each chain was assigned an aggregate stakeholder 6 https://www.techsmith.com/camtasia.html 7 http://stanfordnlp.github.io/CoreNLP/ label based on the label most frequently assigned to it.
Next, each text was run through the Illinois Semantic Role Labeler, part of the Illinois Curator package of NLP tools (Punyakanok et al., 2008). We considered the assigned text labels of (verb) predicates and of the following arguments linked to them: A0, typically the subject, and A1, typically the (direct) object. As an example, given the sentence He protected his brother, for the predicate protect, the A0 he may be labeled protector and A1 brother labeled protected. The same algorithm that was used to find the coreference chain associated with a given reader's annotation text was used to automatically associate the semantic role nodes with their coreference chain. Figure 3 shows the entire mapping framework. Once complete, each coreference chain contained: (1) all human-assigned stakeholder labels, (2) the aggregate human-assigned stakeholder label (the most frequently assigned stakeholder), and (3) all semantic labels assigned to it that were generated by the SRL tool. We did not manually correct the SRL-generated labels.
This allows for examination of trends between the human-assigned stakeholder labels and the SRL-generated text labels. Matching also enabled the use of SRL features for stakeholder classification.

Physiological Sensor Data Processing
Multimodal results were synchronized by the system clock, also used as a reference to know when subjects encountered each text in the trials. The Consensys software of the Shimmer 3 GSR+ sensor was used to process and export the GSR and pulse data with timestamps that were subsequently synchronized with the Camtasia timestamps. We used Affectiva 8 to infer the emotional expression from subjects' video-recorded faces.
Facial expression data and two forms of emotion annotation pertain to 20 subjects, while the pulse and GSR sensor data comprises 18 subjects. For two subjects, the Shimmer 3 GSR+ sensor was not configured properly and thus discarded.
Occasional missing values or spikes in sensor readings, caused by brief hand movements which disrupted the sensor, necessitated filtering the data. Erroneous readings were detected by high frequency deviation from neighbors and replaced with neighborhood values. A Gaussian filter smoothed the GSR and pulse data. From there, we calculated the average GSR and pulse per text per participant. GSR, when measured in KOhms, decreases during periods of stress or arousal as skin conductivity increases. In order to compare results across subjects, all GSR data was normalized using feature standardization. Figure 4: The proportion of agreement between annotators about stakeholder labels regarding the same coreference chain. Agreement was high, and especially for Victim, Abuser, and Partner.

Results of Linguistic Analysis
Annotations. On average, the texts about relationships without IPV contained two Partners, while texts with IPV contained one Victim and one Abuser. Subjects demonstrated a high degree of agreement for assigning most stakeholder labels, as shown in Figure 4. Since every subject did not annotate all possible coreference chains for a given text, two measures of agreement are given: possible annotator agreement refers to the proportion of agreement between all participants who annotated the text, while active annotator agreement ignores participants who did not mark any stakeholder category in the coreference chain in question.
LIWC Results. Many Victim and Victim-Supporter coreference chains associated strongly with first person, while Abuser was one of few stakeholders with more third person association; see Figure 5.
Emotion lexicon appeared scarce within stakeholder coreference chains, with the notable exception of the Anger category in Abuser coreference chains. Anxiety was absent, and Sadness was present only in Partner coreference chains. However, sentiment dimensions, with broader positive and negative emotion categories, registered substantial levels of positive lexical affinity for many stakeholders, especially for Victim-Supporter, as demonstrated in Figure 6, but also for Abuser-Enabler. Again, Abusers are one of few stakeholders with observable negative diction; Partners rate second. We note that not all IPV-free texts were necessarily positive, but rather did not contain violence. As another note, only the text within coreference chains was considered in the LIWC analysis, necessitating careful interpretation of these results.
SRL Results. Tables 1 and 2 demonstrate the top labels assigned by the SRL system to coreference chains marked as Abusers and Victims. Abuser stakeholders occur more frequently in the A0 category, while Victim stakeholders occur more frequently in the A1 category. To produce these tables, the labels appearing frequently in Victim or Abuser coreferences that also appear frequently in Partner coreference chains have not been reported, so as to remove labels that do not pertain specifically to IPV. Our motivation is to highlight the differences between SRL text labels generated for Victim and Abuser categories, so labels appearing also for Partner, such as topic, or thing done, can here be discarded for sake of comparison.
The SRL-generated text labels make intuitive sense when compared with their human-annotated stakeholder coreference chains. For example, Abuser stakeholders involve controller, entity making a threat, and even abuser, suggesting that a mapping to an Abuser coarse-grained label seems possible. Similarly, predicate text labels such as hit, threaten, and control that appear when the Abuser is the doer clearly point to violent behaviors, while the the predicate texts associated with Victim as A1 are indicative of violence being inflicted on the individual.
The set of labels given to these stakeholders is not disjoint from one another (exemplified by the need to stoplist Partner labels from Abuser/Victim labels as discussed above). The SRL occasionally assigned abuser to a stakeholder marked by the human annotators as Victim. Classifiers may still need more features than semantic role labels alone in order to reach high precision.
Stakeholder Classification. Simple features extracted from coreference texts were passed into several classification engines to see if accurate stakeholder label predictions could be made. Stakeholder labels from Trial 1 and Trial 2, with the Other label from both trials grouped together, formed seven classes. Features extracted based on the coreference chains included their A0 and A1 text labels, their A0 and A1 predicate text labels, their text's unigrams, and their LIWC frequency counts.
The unigram feature was stoplisted and lemmatized, and text features were limited to the top 50 most common words/labels. The best performing model was an ensemble of 10 random forest bagged trees. Unigrams alone yield 33.5% k-fold classification accuracy, and adding the SRL and LIWC features improves to 38.5%. An ablation analysis showed text unigrams, followed by A1 text labels, then LIWC counts, as most valuable, and text labels from A1 predicates, then A0 predicates as least valuable.

Results of Physiological and Other Analysis
Reading Time. To avoid fatigue, the time limit was the same for each trial. Because the trial with IPV texts required an extra task (determining types of abuse in the text), subjects covered fewer texts in that trial. On average, participants covered 21.5 texts in the trial without IPV, and 16 texts in the trial with IPV. To explore whether texts about IPV took longer to read, while accounting for the additional task, we adjusted the reading instance duration of the second trial by 25%. Adjusted reading times between the two trials showed no difference in how long it took to read the texts. Reported Emotions. Subjects reported their subjective opinion on the dominant emotions conveyed in each text, and the emotion they felt for each text. Figure 7 demonstrates several noteworthy differences in the proportions of emotions across texts. When reading texts involving IPV, the proportion reported for texts conveying fear and sadness clearly increased. For the selfreported reader emotions there are also differences between the two trials, as shown in Figure 7. Negative emotions such as sadness, fear, and especially anger increased.
Overall, from the trial without IPV to the trial with IPV, the proportion of reported emotions like joy and anticipation decreased. In terms of anticipation, texts about relationships without IPV often sought advice about an ongoing dilemma, whereas many texts about relationships with IPV narrated about events in the past. Trust marginally increased for both text-conveyed and self-reported emotions.
Fear was generally more often reported as conveyed by the text than felt by the reader. In contrast, disgust and sadness were more often reported as reader emotions. The findings suggest that for affect-related annotation, it can be useful to collect both text-focused and readerexperienced emotion.
Facial Expressions. Affectiva, an emotion recognition software, analyzes the facial expressions of videos and assesses relative joy, fear, disgust, sadness, anger, surprise, and contempt. For each subject, the Affectiva results were split according to text timestamps, and then the highest ranked emotion was calculated for that text. Affectiva's output displayed surprise, contempt, or disgust for most subjects; the latter two may relate to false positives for unexpressive, stoic faces (such as from concentrating on reading and annotation), while for the former when participants yawned or opened the mouth widely, Affectiva reported surprise. In general, faces tended to be unexpressive. Figure 8: Left: Subjects' average pulse per text across trials. IPV trial had slightly lower mean beats per minute and wider variability across subjects. Right: Normalized GSR across subjects between trials in KOhms. Most subjects expressed noticeably lower (more prominent) GSR for the IPV trial. Figure 9: One subjects's average GSR in KOhms, per text. At the first text with IPV (ID: trial2-26), GSR drops. The lower GSR overall for Trial 2 suggests the subject had a stronger physiological reaction to reading texts about IPV. One text occurs twice due to subject looking back at this text during reading-annotation; our analysis included look-back data.
Pulse. Comparing average beats per minute across the trials without and with IPV displayed little change between trials; see Figure 8 (left panel). Across subjects, the mean pulse for reading and annotating texts without IPV was 79.4 beats per minute, and the mean for texts with IPV was 78.0 beats per minute. A histogram of the average pulse per text for each subject was generated in order to examine if certain texts stood out. While single-text spikes in pulse were observed for different subjects, upon review of videos, these rather showed movement (putting on a jacket, coughing and covering mouth) during these texts.
Galvanic Skin Response. GSR data showed noticeably lower KOhms during the trial with IPV for 14 out of 18 participants. KOhms measure the resistance of the skin, so a decrease in resistance indicates higher sweat levels. After normalizing the GSR data, we were able to compare results across subjects. On average, scores decreased when subjects began reading about relationships with violence, and remained low, as shown in Figure 8 (right panel).
One might wonder whether wearing a sensor for a long period of time would cause sweat to accumulate on participants irrespective of the text content. However, for 4 out of 18 participants, the KOhm levels remained approximately the same or increased during the trial with IPV. This suggests that the act of wearing a sensor does not automatically create a sweat response. In addition, the drop in KOhms from the trial without IPV to the trial with IPV was sudden, rather than a gradual decline; see Figure 9.
Physiological Reaction and Annotations. Besides shedding new light on IPV, this study provides an unusual exploration of the correspondence between reader-estimated dominant text/reader emotions and reader physiological reactions. It is interesting that subjects' GSR noticeably changed when reading texts with IPV. As affect annotation usually is a highly subjective task, the result has intriguing implications. It provides novel insight into how people interpret and conceptualize discourse about abuse, while it also innovatively links text-based annotation to measurement-based physiological annotator data. From this perspective, the study results suggest that co-collecting measurement-based annotator data with text-based annotations may help support annotations on emotional semantic topics.

Conclusion
Social media texts are an information-rich source for research in IPV. We report on a new data collection approach that integrates physiological sensors with human annotation of stakeholders and emotions conveyed in the text vs. felt by the reader. We also integrated human and computer semantic interpretation, and showed how coreference resolution and SRL can be effectively introduced to aid analysis of players in texts narrat-ing about IPV. The subjects generally agreed on stakeholder labels, and analysis of extracted stakeholder coreference chains provide insights about IPV not readily available from surveys. Stakeholder classification showed modest improvement when using semantic role features over unigrams from coreference chains; future work is needed to improve the classifier using a larger dataset.
Also, GSR differences between trials-with stronger response for IPV texts-provided sensorbased indicators that supported differences found across trials for human emotion annotation and in automated linguistic analysis. Broadly, the results ask the question, left for future work, if measurement-based sensors are a path to counter validity concerns in subjective text annotation tasks.