Adapting Coreference Resolution for Processing Violent Death Narratives

Coreference resolution is an important component in analyzing narrative text from administrative data (e.g., clinical or police sources). However, existing coreference models trained on general language corpora suffer from poor transferability due to domain gaps, especially when they are applied to gender-inclusive data with lesbian, gay, bisexual, and transgender (LGBT) individuals. In this paper, we analyzed the challenges of coreference resolution in an exemplary form of administrative text written in English: violent death narratives from the USA’s Centers for Disease Control’s (CDC) National Violent Death Reporting System. We developed a set of data augmentation rules to improve model performance using a probabilistic data programming framework. Experiments on narratives from an administrative database, as well as existing gender-inclusive coreference datasets, demonstrate the effectiveness of data augmentation in training coreference models that can better handle text data about LGBT individuals.


Introduction
Coreference resolution (Soon et al., 2001; Ng and Cardie, 2002) is the task of identifying denotative phrases in text that refer to the same entity. It is an essential component in Natural Language Processing (NLP). In real-world applications of NLP, coreference resolution is crucial for analysts to extract structured information from text data. Like all components of NLP, coreference resolution must be robust and accurate, as applications of NLP may inform policy-making and other decisions. This is especially true when coreference systems are applied to administrative data.
In this paper, we describe an approach to adapting a coreference model to process narrative text from an important administrative database written in English: the National Violent Death Reporting System (NVDRS), maintained by the Centers for Disease Control (CDC) in the USA. Violent death narratives document murders, suicides, murder-suicides, and other violent deaths. These narratives are complex, containing information on one or more persons; some individuals are victims, others are partners (heterosexual or same-sex), family members, witnesses, and law enforcement. Specifically, we apply the End-to-End Coreference Resolution (E2E-Coref) system (Lee et al., 2017, 2018), which has achieved high performance on the OntoNotes 5.0 (Hovy et al., 2006) corpus. We observe that when a model trained on OntoNotes is applied to violent death narratives, the performance drops significantly for the following reasons.
First, although OntoNotes contains multiple genres, it does not include administrative data. Administrative text data is terse and contains an abundance of domain-specific jargon. Because of the gap between training and administrative data, models trained on OntoNotes are poorly equipped to handle administrative data that are heavily skewed in vocabulary, structure, and style, such as violent death narratives.
Second, approximately 5% of the victims in the NVDRS are lesbian, gay, bisexual, or transgender (LGBT). This is a vulnerable population; for example, existing data show LGB youth are 5 times more likely to attempt suicide than heterosexual youth (Clark et al., 2020) and are more likely to be bullied prior to suicide. It is essential that data-analytic models work well with these hard-to-identify but highly vulnerable populations; indeed, correctly processing text data is an important step in revealing the true level of elevated risk for LGBT populations. This remains challenging because of limitations of existing coreference systems. Close relationship partners provide a marker of sexual orientation and can be used (Lira et al., 2019; Ream, 2020) by social scientists to identify relevant information in LGBT deaths. However, OntoNotes is heavily skewed towards male entities (Zhao et al., 2018) and E2E-Coref relies heavily on gender when deciphering context (Cao and Daumé III, 2020). Consequently, E2E-Coref has trouble dealing with narratives involving LGBT individuals where gender referents do not follow the modal pattern. Figure 1 illustrates a scenario where coreference systems struggle. The model mislabels the pronoun "he", and this error will propagate to downstream analysis. Specifically, the model takes the context and resolves the coreference based on gender; it makes a mistake partially due to an incorrect presumption of the sexual orientation of the 50 year old male victim.
primary_victim is a 50 year old male. ... primary_victim's partner states that he and primary_victim had been living together for three years. ...
Figure 1: A snippet of a violent death narrative. Highlighted is what the e2e-coref model clusters, and the colored text shows what the e2e-coref model misses.
To study coreference resolution on violent death narratives (VDN), we created a new corpus that draws on a subset of cases from NVDRS where CDC has reported the sex of both victims and their partners. We assigned ground truth labels using experienced annotators trained by social scientists in public health. To bridge the domain gap, we further adapted E2E-Coref by using a weakly supervised data creation method empowered by the Snorkel toolkit (Ratner et al., 2017). This toolkit is often used to apply a set of rules to augment data by probabilistic programming. Inspired by Snorkel, we designed a set of rules to 1) bridge the vocabulary difference between the source and target domains and 2) mitigate data bias by augmenting data with samples from a more diverse population. Because labeling public health data requires arduous human labor, data augmentation provides a promising method to enlarge datasets while covering a broader range of scenarios.
We verified our adaptation approach on both the in-house VDN dataset and two publicly available English datasets, GICoref (Cao and Daumé III, 2020) and MAP (Cao and Daumé III, 2020). We then measured the performance of our approach on documents heavily skewed toward LGBT individuals and on documents in which gendered terms were swapped with non-gendered ones (pronouns, names, etc.). On all datasets, we achieved an improvement. For LGBT-specific datasets, we see much larger improvements, highlighting how poorly the OntoNotes-trained model had performed on these underrepresented populations. Models trained on the new data prove more applicable in that domain. Our experiments underscore the need for a modifiable tool to train specialized coreference resolution models across a variety of specific domains and use cases.

Related Work
Researchers have shown that coreference systems exhibit gender bias and resolve pronouns by relying heavily on gender information (Cao and Daumé III, 2020; Zhao et al., 2018; Rudinger et al., 2018; Webster et al., 2018; Zhao et al., 2019). In particular, Cao and Daumé III (2020) collected a gender-inclusive coreference dataset and evaluated how state-of-the-art coreference models performed on it.
As NLP systems are deployed in social science, policy making, government, and industry, it is imperative to keep inclusivity in mind when working with models that perform downstream tasks with text data. For example, Named Entity Recognition (NER) was used in processing Portuguese police reports to extract people, location, organization, and time from the set of documents (Carnaz et al., 2019). These authors noted the need for a better training corpus with more NER entities. Other NLP models face challenges in domain-adaptation like the one demonstrated in this paper. One example from the biomedical field is BioBERT (Lee et al., 2019), in which the authors achieved better results on biomedical text mining tasks by pretraining BERT on a set of biomedical documents. Likewise, even when evaluating a model on a general set, Babaeianjelodar et al. (2020) showed that many general-domain datasets include as much bias as datasets designed to be toxic and biased. All these cases required re-evaluation of the corpus used to train the model. This underscores the need for methodology that can evaluate, debias, and increase the amount of data used.

Annotating Violent Death Narratives
We first applied for and were given access to the CDC's National Violent Death Reporting System (NVDRS) Restricted Access Database. From this, we sampled a total of 115 violent death cases (footnote 4), each over 200 words in length. In these 115 cases, we had a total of 6,134 coreference links and 44,074 tokens, with a vocabulary size of 3,653. Each case had information about the victim, the victim's partner, and the type of death. We randomly sampled 30 cases from each of three strata: 1) the victim is male and the partner is female, 2) the victim is female and the partner is male, and 3) the case is an LGBT case. We also included 25 cases that were particularly challenging for the general E2E model. The cases used were spell-checked and cleaned thoroughly.
To obtain gold-standard labels, we tasked a team of three annotators (footnote 5) to label the coreference ground truth, under the guidance of senior experts in suicide and public health. Annotators were told that every expression referring to a specific person or group was to be placed into that person's or group's cluster. From there, we resolved the three label sets into one by majority voting: if two out of three annotators put a phrase in a cluster, we assigned it to that cluster. Two of the annotators had previous experience with coding the NVDRS narratives for other tasks, while one was inexperienced. Agreement was typically unanimous.
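The majority-voting step can be sketched as follows. This is a minimal illustration, not the annotation tooling actually used: it assumes each annotator's labels are available as a mapping from a mention (a hypothetical `(start, end)` token span) to a cluster label shared across annotators, and the names `majority_vote`, `a1`, `a2`, and `a3` are invented for the example.

```python
from collections import Counter


def majority_vote(annotations, min_votes=2):
    """Merge per-annotator cluster assignments by majority vote.

    annotations: one dict per annotator, mapping a mention (a hypothetical
    (start, end) token span) to a cluster label shared by all annotators.
    A mention is kept in a cluster if at least `min_votes` annotators
    placed it there.
    """
    votes = Counter()
    for annotation in annotations:
        for mention, cluster in annotation.items():
            votes[(mention, cluster)] += 1

    merged = {}
    for (mention, cluster), count in votes.items():
        if count >= min_votes:
            merged.setdefault(cluster, set()).add(mention)
    return merged


# Toy example: the annotators disagree only on the mention at span (9, 10).
a1 = {(0, 1): "victim", (5, 6): "partner", (9, 10): "partner"}
a2 = {(0, 1): "victim", (5, 6): "partner", (9, 10): "victim"}
a3 = {(0, 1): "victim", (5, 6): "partner", (9, 10): "partner"}
gold = majority_vote([a1, a2, a3])
```

In this toy case the disputed mention ends up in the "partner" cluster, since two of the three annotators placed it there.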
Reproducibility. To get access to the NVDRS, users must apply for access and follow a data management agreement executed directly with the CDC. We cannot release VDN or the annotations, but we will provide the augmentation code and instructions on how to reproduce the experiments. To allow reproduction of our approach on data without access restrictions, we perform evaluations on MAP and GICoref, which are readily available.

Weakly-Supervised Data Augmentation for Domain Adaptation
Our next step was to build a pipeline for adapting E2E-Coref to resolve coreference on VDN. The key component of this pipeline is the Snorkel toolkit and its capacity to design rules that programmatically label, augment, and slice data. We sought to adapt E2E-Coref to process domain-specific data by creating a set of augmentation rules that, applied to the training data, would improve model performance. Our rules generate augmented data with diverse genders, on which the model is then challenged to predict the coreference clusters.
Footnote 4: Homicides and suicides.
Footnote 5: All annotators signed the release form for accessing the NVDRS data.
Data Augmentation by Rules. With Snorkel, we assessed the weaknesses of current coreference systems. These experiments helped us to develop effective augmentation rules that create training data mimicking the challenging cases. Specifically, we split the data into groups and evaluated the model on each group. In the case of VDN, we split a larger set of data into two groups (LGBT and non-LGBT) and gauged model performance on both groups. We then isolated the specific groups of data that posed a problem and came up with sets of augmentation rules that can be used to generate difficult training data from easier training cases. For example, in our case, we sought to augment documents that contained more precisely defined gender into cases with the vaguer language regarding gender often seen in gender-inclusive documents and LGBT violent death narratives. Accordingly, each rule strips gender from key phrases, leaving the text more ambiguous to the model. For example, the model struggled when terms like 'partner' were used to describe relationships. To address this, we introduced a rule where gendered relationship terms like 'girlfriend' in one cluster were replaced by non-gendered terms like 'partner'. In this manner, the model was forced to train against these examples. Often, model performance improved when training against these augmented examples.
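A rule of this kind can be sketched in plain Python as below. This is an illustrative sketch only: the term mapping `NEUTRAL_RELATIONSHIP_TERMS`, the function name, and the representation of a cluster as a list of `(start, end)` token spans (end exclusive) are assumptions made for the example; the exact vocabulary and the Snorkel wiring used in our pipeline are not shown here.

```python
# Hypothetical mapping from gendered relationship terms to neutral ones;
# the actual rule vocabulary is an assumption for this sketch.
NEUTRAL_RELATIONSHIP_TERMS = {
    "girlfriend": "partner",
    "boyfriend": "partner",
    "wife": "spouse",
    "husband": "spouse",
}


def neutralize_relationship_terms(tokens, cluster_spans):
    """Replace gendered relationship terms inside one coreference cluster.

    tokens: list of token strings for a document.
    cluster_spans: (start, end) token spans of the cluster, end exclusive.
    Returns a new token list of the same length, so that all existing
    span annotations remain valid for the augmented document.
    """
    augmented = list(tokens)
    for start, end in cluster_spans:
        for i in range(start, end):
            replacement = NEUTRAL_RELATIONSHIP_TERMS.get(augmented[i].lower())
            if replacement is not None:
                augmented[i] = replacement
    return augmented


tokens = "The victim 's girlfriend said she found him .".split()
# Cluster for the girlfriend: "The victim 's girlfriend" and "she".
girlfriend_cluster = [(0, 4), (5, 6)]
augmented = neutralize_relationship_terms(tokens, girlfriend_cluster)
```

Replacing one token with one token keeps every span offset unchanged, which means the gold clusters of the source document can be reused as labels for the augmented copy.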

Experiments and Results
We conducted experiments to analyze E2E-Coref on VDN and verified the effectiveness of the data augmentation method. We used the following corpora.
• OntoNotes We used the English portion of version 5.0. It contains roughly 1.6M words.
• VDN The annotated violent death narratives described in Sec. 3. The corpus is annotated by domain experts and used as the test set for measuring model performance. We split VDN into train/dev/test with a 20/5/90 document split. We are interested in the setting where only a small set of training data is available, to emulate use cases in which annotating a large amount of data is impractical. We reserve more articles in the test set to ensure the evaluation is reliable.
• GICoref (Cao and Daumé III, 2020) consists of 95 documents from sources that include articles about non-binary people, fan-fiction from Archive of Our Own, and LGBTQ periodicals with a plethora of neopronouns (e.g., zie).
• MAP (Cao and Daumé III, 2020) consists of snippets of Wikipedia articles with two or more people and at least one pronoun.
John Smith → J. Smith went to the store. He → Zie wanted to buy apples, bananas, and strawberries. His → Zir girlfriend came with him → him, and she wanted to buy peaches and oranges.
Figure 2: The proposed rules for GI data applied to a sample paragraph.
Following Cao and Daumé III (2020), we used LEA (Moosavi and Strube, 2016) as the evaluation metric for coreference clusters.
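For readers unfamiliar with LEA, the sketch below computes it for a toy example. LEA weights each entity by its size and scores it by the fraction of its coreference links that are recovered, where an entity with n mentions has n(n-1)/2 links. The code is a simplified illustration rather than the scorer used in our experiments: entities are plain Python sets of mention identifiers, and singleton entities are skipped, which is an assumption of this sketch.

```python
def link(n):
    """Number of coreference links among n mentions."""
    return n * (n - 1) // 2


def lea_score(reference, system):
    """Size-weighted fraction of links in `reference` entities found in `system`.

    With (key, response) this is LEA recall; with (response, key) it is
    LEA precision. Singleton entities are skipped (sketch simplification).
    """
    weighted, total_size = 0.0, 0
    for entity in reference:
        if len(entity) < 2:
            continue
        resolved = sum(link(len(entity & other)) for other in system)
        weighted += len(entity) * resolved / link(len(entity))
        total_size += len(entity)
    return weighted / total_size if total_size else 0.0


def lea(key, response):
    """Return (precision, recall, F1) under LEA."""
    recall = lea_score(key, response)
    precision = lea_score(response, key)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: two gold entities, three predicted entities.
key = [{"a", "b", "c"}, {"d", "e", "f", "g"}]
response = [{"a", "b"}, {"c", "d"}, {"f", "g", "h", "i"}]
precision, recall, f1 = lea(key, response)
```

Here the first gold entity has one of its three links recovered and the second has one of six, so recall is (3 · 1/3 + 4 · 1/6) / 7; precision is computed the same way with the roles of key and response swapped.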

Results on Violent Death Narratives
We created 3 rules based on the approach described in Sec. 4: (R1) Replace gendered terms with another gender. (R2) Replace gendered relationship terms with non-gendered terms. (R3) Replace terms describing gender with non-gendered terms. Examples of the generated data are in Fig. 2.
• E2E-Aug Trained on OntoNotes first and then fine-tuned on the augmented target training documents.
Results are shown in Table 1. By fine-tuning with a modest amount of in-domain data, E2E-FT significantly improved over E2E in LEA F1. E2E-Aug further improved over E2E-FT by 5% in LEA F1 on the 30 LGBT narratives in VDN's test set (footnote 9). Our results meaningfully improved the classification of LGBT-related data and show the need for a more careful approach with data from underrepresented groups. Further, this improvement extended beyond our domain-specific data: E2E-Aug improved the E2E score by 1.4% in LEA F1 on the overall set. Overall, we saw a significant improvement when training coreference models with our augmented data, on both the overall set and the gender-neutral LGBT set.

Results on GICoref and MAP
We then evaluated the data augmentation approach on two publicly available datasets: GICoref and MAP. We experimented with the following 3 rules. (R4) Randomly pick a person-cluster in the document and replace all pronouns in the cluster with a gender-neutral pronoun (e.g., his → zir). (R5) Truncate the first name of each person. (R6) Same as R4, but replace only one pronoun in the cluster with the corresponding gender-neutral pronoun.
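The three rules can be sketched as follows. This is an illustrative sketch under stated assumptions: the pronoun mapping `NEOPRONOUNS` (a zie/zir paradigm), the function names, and the representation of a cluster as a list of token indices are all invented for the example, and the sketch samples only among clusters that contain at least one pronoun.

```python
import random

# Assumed binary-to-neopronoun mapping (zie/zir paradigm) for this sketch.
NEOPRONOUNS = {
    "he": "zie", "she": "zie",
    "him": "zir", "her": "zir", "his": "zir",
    "hers": "zirs", "himself": "zirself", "herself": "zirself",
}


def swap_pronoun(token):
    """Map a binary pronoun to a neopronoun, preserving capitalization."""
    new = NEOPRONOUNS[token.lower()]
    return new.capitalize() if token[0].isupper() else new


def replace_pronouns(tokens, clusters, rng, only_one=False):
    """R4 (only_one=False) or R6 (only_one=True) on one random cluster.

    clusters: list of clusters, each a list of token indices of mentions.
    """
    augmented = list(tokens)
    pronoun_sets = [[i for i in c if tokens[i].lower() in NEOPRONOUNS]
                    for c in clusters]
    pronoun_sets = [p for p in pronoun_sets if p]
    if not pronoun_sets:
        return augmented
    chosen = rng.choice(pronoun_sets)
    if only_one:
        chosen = [rng.choice(chosen)]
    for i in chosen:
        augmented[i] = swap_pronoun(tokens[i])
    return augmented


def truncate_first_name(name):
    """R5: 'John Smith' -> 'J. Smith'; single-token names are unchanged."""
    first, *rest = name.split()
    return f"{first[0]}. {' '.join(rest)}" if rest else name


tokens = "John went out . He bought apples because his sister asked him .".split()
clusters = [[0, 4, 8, 11]]  # John, He, his, him
rng = random.Random(0)
r4 = replace_pronouns(tokens, clusters, rng)
r6 = replace_pronouns(tokens, clusters, rng, only_one=True)
short_name = truncate_first_name("John Smith")
```

As with the relationship-term rule, each replacement is token-for-token, so the cluster indices of the source document carry over to the augmented one.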
We followed Zhao et al. (2018) and used GICoref and MAP only as the test data. We compared E2E with its variant E2E-Aug. The latter was trained on the union of the original dataset and variants of OntoNotes augmented using the above rules. We also compared our results with those from an E2E-Coref model trained on the union of the original and augmented data with the gender swapping rules described in Zhao et al. (2018).
Footnote 9: Not found in tables.
Results on GICoref. Results on GICoref are shown in Table 2. Few documents (0.3%) in OntoNotes contained neopronouns. Therefore, E2E struggled with resolving pronouns referring to LGBT individuals. Zhao et al. (2018) proposed applying gender-swapping and entity anonymization to mitigate bias towards binary genders. However, their approach does not handle neopronouns and performs poorly compared to our models. In contrast, E2E-Aug improved over E2E by 4% to 6% in F1, depending on the data augmentation rules used. When all the rules were applied, the performance was not superior to using only R4. We further investigated the performance improvement of E2E-Aug-R4 on clusters containing binary pronouns and neopronouns. As shown in Table 3, E2E-Aug-R4 yielded a 4% increase in recall among binary-gender pronouns and 12% among neopronouns as compared with E2E. This reduced the performance gap between binary-gender pronouns and neopronouns from 12% to 3%. Our results show that R4 is highly effective, despite its simplicity.
Results on MAP. The core of MAP is constructed through different ablation mechanisms (Cao and Daumé III, 2020). Each ablation is a method for hiding various aspects of gender and then investigating the performance change of a model. Performance was evaluated based on the accuracy of pronoun resolution over the four label classes: person A, B, both, or neither. We considered four ablation mechanisms as described in the Appendix. With these four possible ablations, each document was ablated a total of nine times, once for each possible combination of ablations, each time producing a separate document.
We compared E2E with E2E-R4 and showed the results in Figure 3. E2E-R4 was better than or competitive with E2E in all the ablation scenarios. E2E-R4 especially outperformed E2E on the original set and the +Pro. set, where the performance was improved by 30%.

Conclusion
With policy decisions increasingly informed by computational analysis, it is imperative that methods used in these analyses be robust and accurate, especially for marginalized groups. Our contributions improved coreference resolution for LGBT individuals, a historically underrepresented and marginalized population at high risk for suicide; they may improve the identification of LGBT individuals in NVDRS and hence inform better policy aimed at reducing LGBT deaths. More generally, we show how to use augmentation rules to adapt NLP models to real-world application domains where it is not feasible to obtain annotated data from crowdworkers. Finally, we introduced a novel dataset, VDN, which provides a challenging and consequential corpus for coreference resolution models. Our studies demonstrate the challenges of applying NLP techniques to real-world data involving diverse individuals (including LGBT individuals and their families) and suggest ways to make these methods more accurate and robust, thus contributing to algorithmic equity.

Discussion of Ethics
Our research was exempted from human subjects review by the UCLA IRB. We applied for and were given access to the CDC's National Violent Death Reporting System's Restricted Access Database.
As the data contain private information, we strictly follow their guidelines in our use of the dataset.
Despite our goal to improve gender inclusion in the coreference resolution system, we admit that our augmentation rules and data analyses may not fully address the diversities of sexual orientation in the population. Although our approach improves the performance of coreference systems, the final system is still not perfect and may exhibit some bias in its predictions.