Identifying civilians killed by police with distantly supervised entity-event extraction

We propose a new, socially-impactful task for natural language processing: from a news corpus, extract names of persons who have been killed by police. We present a newly collected police fatality corpus, which we release publicly, and present a model to solve this problem that uses EM-based distant supervision with logistic regression and convolutional neural network classifiers. Our model outperforms two off-the-shelf event extractor systems, and it can suggest candidate victim names in some cases faster than one of the major manually-collected police fatality databases.


Introduction
The United States government does not keep systematic records of when police kill civilians, despite a clear need for this information to serve the public interest and support social scientific analysis. Federal records rely on incomplete cooperation from local police departments, and human rights statisticians assess that they fail to document thousands of fatalities (Lum and Ball, 2015).
News articles have emerged as a valuable alternative data source. Organizations including The Guardian, The Washington Post, Mapping Police Violence, and Fatal Encounters have started to build such databases of U.S. police killings by manually reading millions of news articles [1] and extracting victim names and event details. This approach was recently validated by a Bureau of Justice Statistics study (Banks et al., Dec. 2016) which augmented traditional police-maintained records with media reports, finding twice as many deaths compared to past government analyses. This suggests textual news data has enormous, real value, though manual news analysis remains extremely laborious.

[1] Fatal Encounters director D. Brian Burghart estimates he and colleagues have read 2 million news headlines and ledes to assemble its fatality records that date back to January, 2000 (pers. comm.); we find FE to be the most comprehensive publicly available database.

Text | Person killed by police?
Alton Sterling was killed by police. | True
Officers shot and killed Philando Castile. | True
Officer Andrew Hanson was shot. | False
Police report Megan Short was fatally shot in apparent murder-suicide. | False
Table 1: Toy examples (with entities in bold) illustrating the problem of extracting from text names of persons who have been killed by police.

We propose to help automate this process by extracting the names of persons killed by police from event descriptions in news articles (Table 1). This can be formulated as either of two cross-document entity-event extraction tasks:
1. Populating an entity-event database: From a corpus of news articles $D^{(\mathrm{test})}$ over timespan $T$, extract the names of persons killed by police during that same timespan ($E^{(\mathrm{pred})}$).
2. Updating an entity-event database: In addition to $D^{(\mathrm{test})}$, assume access to both a historical database of killings $E^{(\mathrm{train})}$ and a historical news corpus $D^{(\mathrm{train})}$ for events that occurred before $T$. This setting often occurs in practice, and is the focus of this paper; it allows for the use of distantly supervised learning methods.
The task itself has important social value, but the NLP research community may be interested in a scientific justification as well. We propose that police fatalities are a useful test case for event extraction research. Fatalities are a well defined type of event with clear semantics for coreference, avoiding some of the more complex issues in this area (Hovy et al., 2013). The task also builds on a considerable information extraction literature on knowledge base population (e.g. Craven et al. (1998)). Finally, we posit that the field of natural language processing should, when possible, advance applications of important public interest. Previous work established the value of textual news for this problem, but computational methods could alleviate the scale of manual labor needed to use it.
To introduce this problem, we:
• Define the task of identifying persons killed by police, which is an instance of cross-document entity-event extraction (§3.1).
• Present a new dataset of web news articles collected throughout 2016 that describe possible fatal encounters with police officers (§3.2).
• Introduce, for the database update setting, a distant supervision model (§4) that incorporates feature-based logistic regression and convolutional neural network classifiers under a latent disjunction model.
• Demonstrate the approach's potential usefulness for practitioners: it outperforms two off-the-shelf event extractors (§5) and finds 39 persons not included in the Guardian's "The Counted" database of police fatalities as of January 1, 2017 (§6). This constitutes a promising first step, though performance needs to be improved for real-world usage.

Related Work
This task combines elements of information extraction, including: event extraction (a.k.a. semantic parsing), identifying descriptions of events and their arguments from text, and cross-document relation extraction, predicting semantic relations over entities. A fatality event indicates the killing of a particular person; we wish to specifically identify the names of fatality victims mentioned in text. Thus our task could be viewed as unary relation extraction: for a given person mentioned in a corpus, were they killed by a police officer? Prior work in NLP has produced a number of event extraction systems, trained on text data hand-labeled with a pre-specified ontology, including ones that identify instances of killings (Li and Ji, 2014; Das et al., 2014). Unfortunately, they perform poorly on our task (§5), so we develop a new method.
Since we do not have access to text specifically annotated for police killing events, we instead turn to distant supervision, inducing labels by aligning relation-entity entries from a gold standard database to their mentions in a corpus (Craven and Kumlien, 1999; Mintz et al., 2009; Bunescu and Mooney, 2007). Similar to this work, Reschke et al. (2014) apply distant supervision to multi-slot, template-based event extraction for airplane crashes; we focus on a simpler unary extraction setting with joint learning of a probabilistic model. Other related work in the cross-document setting has examined joint inference for relations, entities, and events (Lee et al., 2012; Yang et al., 2015).
Finally, other natural language processing efforts have sought to extract social behavioral event databases from news, such as instances of protests (Hanna, 2017), gun violence (Pavlick et al., 2016), and international relations (Schrodt and Gerner, 1994; Schrodt, 2012; Boschee et al., 2013; O'Connor et al., 2013; Gerrish, 2013). They can also be viewed as event database population tasks, with differing levels of semantic specificity in the definition of "event."

Task and Data

Cross-document entity-event extraction for police fatalities
From a corpus of documents $D$, the task is to extract a list of candidate person names, $E$, and for each $e \in E$ find

$$P(y_e = 1 \mid x_{M(e)}). \tag{1}$$

Here $y \in \{0, 1\}$ is the entity-level label where $y_e = 1$ means a person (

News documents
We download a collection of web news articles by continually querying Google News throughout 2016 with lists of police keywords (e.g. police, officer, cop) and fatality-related keywords (e.g. kill, shot, murder). The keyword lists were constructed semi-automatically from cosine similarity lookups from the word2vec pretrained word embeddings in order to select a high-recall, broad set of keywords. The search is restricted to what Google News defines as a "regional edition" of "United States (English)", which seems to roughly restrict to U.S. news, though we anecdotally observed instances of news about events in the U.K. and other countries. We apply a pipeline of text extraction, cleaning, and sentence de-duplication described in the appendix.
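The keyword expansion step can be pictured as a nearest-neighbor lookup under cosine similarity. The following is a minimal sketch under stated assumptions, not the authors' procedure: the small `TOY_VECTORS` table stands in for the pretrained word2vec embeddings, and the function name `expand_keywords` and the `top_k` cutoff are hypothetical names introduced here for illustration.

```python
import math

# Toy stand-in for pretrained word embeddings (values are made up).
TOY_VECTORS = {
    "police":  [0.90, 0.10, 0.00],
    "officer": [0.80, 0.20, 0.10],
    "cop":     [0.85, 0.15, 0.05],
    "kill":    [0.10, 0.90, 0.10],
    "shot":    [0.20, 0.80, 0.20],
    "banana":  [0.00, 0.10, 0.90],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_keywords(seeds, vectors, top_k=2):
    """Return the seed words plus the top_k nearest neighbors of each seed."""
    expanded = set(seeds)
    for seed in seeds:
        neighbors = sorted(
            (w for w in vectors if w != seed),
            key=lambda w: cosine(vectors[seed], vectors[w]),
            reverse=True,
        )
        expanded.update(neighbors[:top_k])
    return expanded
```

With these toy vectors, `expand_keywords(["police"], TOY_VECTORS)` returns `{"police", "cop", "officer"}`. Because the paper describes the construction as semi-automatic, a candidate list like this would presumably still be reviewed by hand before use.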

Entity and mention extraction
We process all documents with the open source spaCy NLP package to segment sentences and extract entity mentions. Mentions are token spans that (1) were identified as "persons" by spaCy's named entity recognizer, and (2) have a (firstname, lastname) pair as analyzed by the HAPNIS rule-based name parser, which extracts, for example, (John, Doe) from the string Mr. John A. Doe Jr.
To prepare sentence text for modeling, our preprocessor collapses the candidate mention span to a special TARGET symbol. To prevent overfitting, other person names are mapped to a different PERSON symbol; e.g. "TARGET was killed in an encounter with police officer PERSON."
There were initially 18,966,757 and 6,061,717 extracted mentions for the train and test periods respectively. To improve precision and computational efficiency, we filtered to sentences that contained at least one police keyword and one fatality keyword. This filter reduced positive entity recall a moderate amount (from 0.68 to 0.57), but removed 99% of the mentions, resulting in the $|M|$ counts in Table 2. Other preprocessing steps included heuristics for extraction and name cleanups and are detailed in the appendix.

Table 3: Training and testing settings for mention sentences x, mention labels z, and entity labels y.
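The masking and keyword-filtering steps described above can be sketched as follows. This is an illustration only: plain string replacement stands in for the spaCy mention spans used in the paper, the two keyword sets are small illustrative subsets, and the function names `mask_mentions` and `passes_keyword_filter` are hypothetical.

```python
import re

# Illustrative subsets; the actual keyword lists are larger.
POLICE_KEYWORDS = {"police", "officer", "officers", "cop"}
FATALITY_KEYWORDS = {"kill", "killed", "shot", "murder"}


def mask_mentions(sentence, target, other_persons):
    """Replace the candidate name with TARGET and other person names with PERSON."""
    masked = sentence.replace(target, "TARGET")
    for name in other_persons:
        masked = masked.replace(name, "PERSON")
    return masked


def passes_keyword_filter(sentence):
    """Keep a sentence only if it has a police keyword and a fatality keyword."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    return bool(tokens & POLICE_KEYWORDS) and bool(tokens & FATALITY_KEYWORDS)
```

For example, masking "John Doe was killed in an encounter with police officer Jane Roe." with target "John Doe" and other person "Jane Roe" yields the sentence form quoted in the text, and that sentence passes the filter.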

Models
Our goal is to classify entities as to whether they have been killed by police (§4.1). Since we do not have gold-standard labels to train our model, we turn to distant supervision (Craven and Kumlien, 1999; Mintz et al., 2009), which heuristically aligns facts in a knowledge base to text in a corpus to impute positive mention-level labels for supervised learning. Previous work typically examines distant supervision in the context of binary relation extraction (Bunescu and Mooney, 2007; Hoffmann et al., 2011), but we are concerned with the unary predicate "person was killed by police." As our gold standard knowledge base ($G$), we use Fatal Encounters' (FE) publicly available dataset: around 18,000 entries of victim's name, age, gender and race as well as location, cause and date of death. (We use a version of the FE database downloaded Feb. 27, 2017.) We compare two different distant supervision training paradigms (Table 3): "hard" label training (§4.2) and "soft" EM-based training (§4.3). This section also details mention-level models (§4.4, §4.5) and evaluation (§4.6).

Approach: Latent disjunction model
Our discriminative model is built on mention-level probabilistic classifiers. Recall a single entity will have one or more mentions (i.e. the same name occurs in multiple sentences in our corpus). For a given mention $i$ in sentence $x_i$, our model predicts whether the person is described as having been killed by police, $z_i = 1$, with a binary logistic model,

$$P(z_i = 1 \mid x_i) = \sigma\left(\beta^{\top} f_{\gamma}(x_i)\right). \tag{2}$$

We experiment with both logistic regression (§4.4) and convolutional neural networks (§4.5) for this component, which use logistic regression weights $\beta$ and feature extractor parameters $\gamma$. Then we must somehow aggregate mention-level decisions to determine entity labels $y_e$. [9] If a human reader were to observe at least one sentence that states a person was killed by police, they would infer that person was killed by police. Therefore we aggregate an entity's mention-level labels with a deterministic disjunction:

$$P(y_e = 1 \mid z_{M(e)}) = \mathbb{1}\left\{ \bigvee_{i \in M(e)} z_i \right\}. \tag{3}$$

At test time, $z_i$ is latent. Therefore the correct inference for an entity is to marginalize out the model's uncertainty over $z_i$:

\begin{align}
P(y_e = 1 \mid x_{M(e)}) &= \sum_{z_{M(e)}} P(y_e = 1, z_{M(e)} \mid x_{M(e)}) \tag{4} \\
&= \sum_{z_{M(e)}} P(y_e = 1 \mid z_{M(e)}) \, P(z_{M(e)} \mid x_{M(e)}) \tag{5} \\
&= 1 - \prod_{i \in M(e)} \left(1 - P(z_i = 1 \mid x_i)\right). \tag{6}
\end{align}

Eq. 6 is the noisy-or formula (Pearl, 1988; Craven and Kumlien, 1999). Procedurally, it counts strong probabilistic predictions as evidence, but can also incorporate a large number of weaker signals as positive evidence as well.
In order to train these classifiers, we need mention-level labels ($z_i$) which we impute via two different distant supervision labeling methods: "hard" and "soft."

[9] An alternative approach is to aggregate features across mentions into an entity-level feature vector (Mintz et al., 2009); but here we opt to directly model at the mention level, which can use contextual information.
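The noisy-or aggregation can be illustrated with a few lines of Python. This is a minimal sketch: the function name `noisy_or` is hypothetical, and the input is assumed to be the list of mention-level probabilities $P(z_i = 1 \mid x_i)$ for one entity.

```python
def noisy_or(mention_probs):
    """Entity-level probability that at least one mention is positive.

    Computes 1 - prod_i (1 - p_i), treating mentions as independent
    given their sentences.
    """
    prob_all_negative = 1.0
    for p in mention_probs:
        prob_all_negative *= (1.0 - p)
    return 1.0 - prob_all_negative
```

A single confident mention dominates (`noisy_or([0.9])` is 0.9), while many weak mentions accumulate: twenty mentions at probability 0.1 each give roughly 0.88, which is the "weaker signals" behavior noted above.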

"Hard" distant label training
In "hard" distant labeling, labels for mentions in the training data are heuristically imputed and directly used for training. We use two labeling rules. The first, name-only, labels a mention as positive if its name matches a gold-positive entity. This is the direct unary predicate analogue of Mintz et al. (2009)'s distant supervision assumption, which assumes every mention of a gold-positive entity exhibits a description of a police killing. This assumption is not correct. We manually analyze a sample of positive mentions and find 36 out of 100 name-only sentences did not express a police fatality event; for example, sentences contain commentary, or describe killings not by police. This is similar to the precision for distant supervision of binary relations reported in prior work, in which 10-38% of sentences did not express the relation in question.
Our higher precision rule, name-and-location, leverages the fact that the location of the fatality is also in the Fatal Encounters database and requires both the name and the location to be present. We use this rule for training since precision is slightly better, although there is still a considerable level of noise.
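The two labeling rules can be sketched as simple lookups. The code below is an assumption-laden illustration rather than the paper's implementation: `gold_locations` is a hypothetical dictionary from victim name to a location string, and matching the location as a case-insensitive substring of the mention's sentence is a simplification chosen here.

```python
def name_only_label(mention_name, gold_names):
    """Positive if the mention's name matches any gold-positive entity."""
    return int(mention_name in gold_names)


def name_and_location_label(mention_name, sentence, gold_locations):
    """Positive only if the name is gold-positive AND its recorded
    location string also appears in the sentence (substring match is
    an assumption made for this sketch)."""
    location = gold_locations.get(mention_name)
    return int(location is not None and location.lower() in sentence.lower())
```

A commentary sentence that names a victim without the location is labeled positive by the first rule but negative by the second, which is the kind of noise the stricter rule is meant to reduce.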

"Soft" (EM) joint training
At training time, the distant supervision assumption used in "hard" label training is flawed: many positively-labeled mentions are in sentences that do not assert the person was killed by a police officer. Alternatively, at training time we can treat $z_i$ as a latent variable and assume, as our model states, that at least one of the mentions asserts the fatality event, but leave uncertainty over which mention (or multiple mentions) conveys this information. This corresponds to multiple instance learning (MIL; Dietterich et al. (1997)) which has been applied to distantly supervised relation extraction by enforcing the at least one constraint at training time (Bunescu and Mooney, 2007; Hoffmann et al., 2011; Ritter et al., 2013). Our approach differs by using exact marginal posterior inference for the E-step.
With $z_i$ as latent, the model can be trained with the EM algorithm (Dempster et al., 1977). We initialize the model by training on the "hard" distant labels (§4.2), and then learn improved parameters by alternating E- and M-steps.
The E-step requires calculating the marginal posterior probability for each $z_i$,

\begin{align}
q(z_i = 1) &:= P(z_i = 1 \mid x_{M(e_i)}, y_{e_i} = 1) \tag{9} \\
&= \frac{P(z_i = 1, y_{e_i} = 1 \mid x_{M(e_i)})}{P(y_{e_i} = 1 \mid x_{M(e_i)})}. \tag{10}
\end{align}

This corresponds to calculating the posterior probability of a disjunct, given knowledge of the output of the disjunction, and prior probabilities of all disjuncts (given by the mention-level classifier).
The numerator simplifies to the mention prediction $P(z_i = 1 \mid x_i)$ and the denominator is the entity-level noisy-or probability (Eq. 6). This has the effect of taking the classifier's predicted probability and increasing it slightly (since Eq. 10's denominator is no greater than 1); thus the disjunction constraint implies a soft positive labeling. In the case of a negative entity with $y_e = 0$, the disjunction constraint implies all $z_{M(e)}$ stay clamped to 0 as in the "hard" label training method. The $q(z_i)$ posterior weights are then used for the M-step's expected log-likelihood objective:

$$\max_{\beta, \gamma} \sum_{i} \sum_{k \in \{0, 1\}} q(z_i = k) \log P(z_i = k \mid x_i; \beta, \gamma). \tag{11}$$

This objective (plus regularization) is maximized with gradient ascent as before.
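The E-step update for a single entity can be sketched in a few lines. This is an illustrative sketch, assuming the mention-level probabilities are already computed; the function name `e_step` is hypothetical, and the uniform fallback for the degenerate case where every mention probability is exactly zero is an assumption added here, not something the paper specifies.

```python
def e_step(mention_probs, entity_label):
    """Posterior q(z_i = 1) for every mention of one entity.

    mention_probs: classifier outputs P(z_i = 1 | x_i) for the entity.
    entity_label:  1 if the entity is gold-positive, else 0.
    """
    if not mention_probs:
        return []
    if entity_label == 0:
        # Negative entity: the disjunction forces every z_i to 0.
        return [0.0] * len(mention_probs)
    prob_all_negative = 1.0
    for p in mention_probs:
        prob_all_negative *= (1.0 - p)
    entity_prob = 1.0 - prob_all_negative  # noisy-or denominator
    if entity_prob == 0.0:
        # Degenerate case (assumption): spread the mass uniformly.
        return [1.0 / len(mention_probs)] * len(mention_probs)
    return [p / entity_prob for p in mention_probs]
```

For two mentions with probability 0.5 each, the noisy-or denominator is 0.75, so each posterior rises to about 0.667; with a single mention the posterior is 1, since that mention must be the one asserting the event.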
This approach can be applied to any mention-level probabilistic model; we explore two in the next sections.

Features
D1 | length 3 dependency paths that include TARGET: word, POS, dep. label
D2 | length 3 dependency paths that include TARGET: word and dep. label
D3 | length 3 dependency paths that include TARGET: word and POS
D4 | all length 2 dependency paths with word, POS, dep. labels
N1 | n-grams length 1, 2, 3
N2 | n-grams length 1, 2, 3 plus POS tags
N3 | n-grams length 1, 2, 3 plus directionality and position from TARGET
N4 | concatenated POS tags of 5-word window centered on TARGET
N5 | word and POS tags for 5-word window centered on TARGET
Table 4: Feature templates for logistic regression grouped into syntactic dependencies (D) and n-gram (N) features.

Feature-based logistic regression
We construct hand-crafted features for regularized logistic regression (LR) (Table 4), designed to be broadly similar to the n-gram and syntactic dependency features used in previous work on feature-based semantic parsing (e.g. Das et al. (2014); Thomson et al. (2014)). We use randomized feature hashing (Weinberger et al., 2009) to efficiently represent features in 450,000 dimensions, which achieved similar performance as an explicit feature representation. The logistic regression weights ($\beta$ in Eq. 2) are learned with scikit-learn (Pedregosa et al., 2011). For EM (soft-LR) training, the test set's area under the precision-recall curve converges after 96 iterations (Fig. 1).
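The hashing trick mentioned above maps arbitrary string features into a fixed number of dimensions without storing a vocabulary. The sketch below is a simplified illustration, not the paper's code: `zlib.crc32` is used as a stand-in hash function, the feature strings are made up, and the function name `hash_features` is hypothetical. The signed update follows the general idea of Weinberger et al. (2009).

```python
import zlib


def hash_features(feature_names, n_dims=450_000):
    """Map string features to a sparse {index: value} vector.

    Each feature is hashed to an index in [0, n_dims); a second hash
    picks a sign so that colliding features tend to cancel rather than
    systematically add up.
    """
    vector = {}
    for name in feature_names:
        raw = name.encode("utf-8")
        index = zlib.crc32(raw) % n_dims
        sign = 1.0 if zlib.crc32(b"sign:" + raw) & 1 else -1.0
        vector[index] = vector.get(index, 0.0) + sign
    return vector
```

The resulting sparse vectors can be fed to any linear classifier. scikit-learn also ships a `FeatureHasher` class that implements the same idea; whether the authors used it is not stated in the text.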

Convolutional neural network
We also train a convolutional neural network (CNN) classifier, which uses word embeddings and their nonlinear compositions to potentially generalize better than sparse lexical and n-gram features. CNNs have been shown useful for sentence-level classification tasks (Kim, 2014; Zhang and Wallace, 2015), relation classification (Zeng et al., 2014) and, similar to this setting, event detection (Nguyen and Grishman, 2015). We use Kim (2014)'s open-source CNN implementation, [12] where a logistic function makes the final mention prediction based on max-pooled values from convolutional layers of three different filter sizes, whose parameters are learned ($\gamma$ in Eq. 2). We use pretrained word embeddings for initialization, [13] and update them during training. We also add two special vectors for the TARGET and PERSON symbols, initialized randomly. [14]
For training, we perform stochastic gradient descent for the negative expected log-likelihood (Eq. 11) by sampling with replacement fifty mention-label pairs for each minibatch, choosing each $(i, k) \in M \times \{0, 1\}$ with probability proportional to $q(z_i = k)$. This strategy attains the same expected gradient as the overall objective. We use "epoch" to refer to training on 265,700 examples (approx. twice the number of mentions). Unlike EM for logistic regression, we do not run gradient descent to convergence, instead applying an E-step every two epochs to update $q$; this approach is related to incremental and online variants of EM (Neal and Hinton, 1998; Liang and Klein, 2009), and is justified since both SGD and E-steps improve the evidence lower bound (ELBO). It is also similar to Salakhutdinov et al. (2003)'s expectation gradient method; their analysis implies the gradient calculated immediately after an E-step is in fact the gradient for the marginal log-likelihood. We are not aware of recent work that uses EM to train latent-variable neural network models, though this combination has been explored (e.g. Jordan and Jacobs (1994)).
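The minibatch sampling scheme described above can be sketched with the standard library. This is an illustration under assumptions: `q` is taken to be a list holding $q(z_i = 1)$ for each mention, and the function name `sample_minibatch` and its `rng` parameter are hypothetical.

```python
import random


def sample_minibatch(q, batch_size=50, rng=random):
    """Sample (mention index, label) pairs with replacement.

    Each pair (i, k) is drawn with probability proportional to
    q(z_i = k), where q(z_i = 0) = 1 - q(z_i = 1).
    """
    pairs, weights = [], []
    for i, q_i in enumerate(q):
        pairs.extend([(i, 1), (i, 0)])
        weights.extend([q_i, 1.0 - q_i])
    return rng.choices(pairs, weights=weights, k=batch_size)
```

A mention with `q = 1.0` is only ever sampled with label 1, a mention clamped to `q = 0.0` only with label 0, and an uncertain mention contributes both labels in proportion to its posterior. Passing a seeded `random.Random` instance as `rng` makes the draw reproducible.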

Evaluation
On documents from the test period (Sept-Dec 2016), our models predict entity-level labels $P(y_e = 1 \mid x_{M(e)})$ (Eq. 6), and we wish to evaluate whether retrieved entities are listed in Fatal Encounters as being killed during Sept-Dec 2016. We rank entities by predicted probabilities to construct a precision-recall curve (Fig. 4, Table 5). Area under the precision-recall curve (AUPRC) is calculated with a trapezoidal rule; F1 scores are shown for convenient comparison to non-ranking approaches (§5).
Excluding historical fatalities: Our model gives strong positive predictions for many people who were killed by police before the test period (i.e. before Sept 2016), when news articles contain discussion of historical police killings. We exclude these entities from evaluation, since we want to simulate an update to a fatality database (Fig. 2). Our test dataset contains 1,148 such historical entities.

[12] https://github.com/yoonkim/CNN_sentence
[13] From the same word2vec embeddings used in §3.
[14] Training proceeds with ADADELTA (Zeiler, 2012). We tested several different settings of dropout and L2 regularization hyperparameters on a development set, but found mixed results, so used their default values.
Data upper bound: Of the 452 gold entities in the FE database at test time, our news corpus only contained 258 (Table 2), hence the data upper bound of 0.57 recall, which also gives an upper bound of 0.57 on AUPRC. This is mostly a limitation of our news corpus; though we collect hundreds of thousands of news articles, it turns out Google News only accesses a subset of relevant web news, as opposed to more comprehensive data sources manually reviewed by Fatal Encounters' human experts. We still believe our dataset is large enough to be realistic for developing better methods, and expect the same approaches could be applied to a more comprehensive news corpus.

Table 5: Area under precision-recall curve (AUPRC) and F1 (its maximum value from the PR curve) for entity prediction on the test set.
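The ranking evaluation can be made concrete with a short sketch of a trapezoidal AUPRC. This is one possible reading, not the authors' evaluation script: the function names are hypothetical, ties in scores are broken arbitrarily by the sort, and starting the curve at recall 0 with the precision of the top-ranked entity is a convention assumed here.

```python
def precision_recall_points(scores, labels, n_gold):
    """(recall, precision) after each rank, ranking by descending score.

    n_gold is the total number of gold entities, including any that
    never appear in the corpus, so recall is capped below 1 when the
    corpus misses some of them.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    points, true_pos = [], 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        points.append((true_pos / n_gold, true_pos / rank))
    return points


def auprc_trapezoid(points):
    """Area under the precision-recall curve by the trapezoidal rule."""
    if not points:
        return 0.0
    area = 0.0
    prev_recall, prev_precision = 0.0, points[0][1]
    for recall, precision in points:
        area += (recall - prev_recall) * (precision + prev_precision) / 2.0
        prev_recall, prev_precision = recall, precision
    return area
```

Doubling `n_gold` while keeping the same retrieved entities halves every recall value and therefore halves the area, which mirrors how missing gold entities in the corpus cap the attainable AUPRC.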

Off-the-shelf event extraction baselines
From a practitioner's perspective, a natural first approach to this task would be to run the corpus of police fatality documents through pre-trained, "off-the-shelf" event extractor systems that could identify killing events. In modern NLP research, a major paradigm for event extraction is to formulate a hand-crafted ontology of event classes, annotate a small corpus, and craft supervised learning systems to predict event parses of documents. We evaluate two freely available, off-the-shelf event extractors that were developed under this paradigm: SEMAFOR (Das et al., 2014), and the RPI Joint Information Extraction System (RPI-JIE) (Li and Ji, 2014), which output semantic structures following the FrameNet (Fillmore et al., 2003) and ACE (Doddington et al., 2004) event ontologies, respectively. [15] Pavlick et al. (2016) use RPI-JIE to identify instances of gun violence.

Table 6: Precision, recall, and F1 scores for test data using event extractors SEMAFOR and RPI-JIE and rules R1-R3 described below.

For each mention $i \in M$ we use SEMAFOR and RPI-JIE to extract event tuples of the form $t_i$ = (event type, agent, patient) from the sentence $x_i$. We want the system to detect (1) killing events, where (2) the killed person is the target mention $i$, and (3) the person who killed them is a police officer. We implement a small progression of these neo-Davidsonian (Parsons, 1990) conjuncts with rules to classify $z_i = 1$ if: [16]
• (R1) the event type is 'kill.'
• (R2) R1 holds and the patient token span contains $e_i$.
• (R3) R2 holds and the agent token span contains a police keyword.

[15] Many other annotated datasets encode similar event structures in text, but with lighter ontologies where event classes directly correspond with lexical items, including PropBank, Prague Treebank, DELPH-IN MRS, and Abstract Meaning Representation (Kingsbury and Palmer, 2002; Hajic et al., 2012; Oepen et al., 2014; Banarescu et al., 2013). We assume such systems are too narrow for our purposes, since we need an extraction system to handle different trigger constructions like "killed" versus "shot dead."
[16] For SEMAFOR, we use the FrameNet 'Killing' frame with frame elements 'Victim' and 'Killer'. For RPI-JIE, we use the ACE 'life/die' event type/subtype with roles 'victim' and 'agent'. SEMAFOR defines a token span for every argument; RPI-JIE/ACE defines two spans, both a head word and entity extent; we use the entity extent. SEMAFOR only predicts spans as event arguments, while RPI-JIE also predicts entities as event arguments, where each entity has a within-text coreference chain over one or more mentions; since we only use single sentences, these chains tend to be small, though they do sometimes resolve pronouns. For determining R2 and R3, we allow a match on any of an entity's extents from any of its mentions.
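The rule progression R1-R3 can be sketched over a simplified event tuple. The code below is illustrative only: the dictionary layout with keys `type`, `agent` and `patient` is a hypothetical normalization of extractor output (neither SEMAFOR nor RPI-JIE emits this format directly), and the keyword set is a small illustrative subset.

```python
POLICE_KEYWORDS = {"police", "officer", "officers", "cop"}  # illustrative subset


def classify_mention(event, target_name, rule="R3"):
    """Apply rule R1, R2 or R3 to one extracted event tuple.

    event: dict with keys 'type', 'agent', 'patient'; argument values
    are strings (token spans) or None when the extractor found none.
    """
    r1 = event.get("type") == "kill"
    if rule == "R1":
        return int(r1)
    r2 = r1 and target_name in (event.get("patient") or "")
    if rule == "R2":
        return int(r2)
    agent_tokens = set((event.get("agent") or "").lower().split())
    return int(r2 and bool(agent_tokens & POLICE_KEYWORDS))


def entity_prediction(mention_labels):
    """Deterministic OR over an entity's mention-level predictions."""
    return int(any(mention_labels))
```

Each rule is strictly more demanding than the previous one, so a mention accepted under R3 is also accepted under R1 and R2; an event with a matching patient but no extracted agent passes R2 and fails R3.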
As in §4.1 (Eq. 3), we aggregate mention-level $z_i$ predictions to obtain entity-level predictions with a deterministic OR of $z_{M(e)}$. RPI-JIE under the full R3 system performs best, though all results are relatively poor (Table 6). Part of this is due to inherent difficulty of the task, though our task-specific model still outperforms (Table 5). We suspect a major issue is that these systems heavily rely on their annotated training sets and may have significant performance loss on new domains, or messy text extracted from web news, suggesting domain transfer for future work.

Results and discussion
Significance testing: We would like to test robustness of performance results to the finite datasets with bootstrap testing (Berg-Kirkpatrick et al., 2012), which can accommodate performance metrics like AUPRC. It is not clear what the appropriate unit of resampling should be; for example, parsing and machine translation research in NLP often resamples sentences, which is inappropriate for our setting. We elect to resample documents in the test set, simulating variability in the generation and retrieval of news articles. Standard errors for one model's AUPRC and F1 are in the range 0.004-0.008 and 0.008-0.010 respectively; we also note pairwise significance test results. See appendix for details.
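Document-level bootstrap resampling can be sketched as follows. This is a generic illustration rather than the procedure in the appendix: `metric` is a hypothetical caller-supplied function that maps a list of documents to a single number (in the paper's setting it would rebuild entity predictions from the resampled documents and compute AUPRC or F1), and the defaults for `n_resamples` and `seed` are arbitrary.

```python
import random
import statistics


def bootstrap_standard_error(documents, metric, n_resamples=1000, seed=0):
    """Estimate the standard error of a metric by resampling documents.

    Each replicate draws len(documents) documents with replacement and
    recomputes the metric; the standard deviation across replicates is
    the bootstrap standard error.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        resample = rng.choices(documents, k=len(documents))
        estimates.append(metric(resample))
    return statistics.stdev(estimates)
```

Resampling whole documents keeps all sentences from the same article together, which is the reason given above for preferring documents over sentences as the resampling unit.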
Overall performance: Our results indicate that our model is better than existing computational methods at extracting names of people killed by police, based on a comparison with the F1 scores of off-the-shelf extractors (Table 5 vs. Table 6; differences are statistically significant).
We also compare entities extracted from our test dataset to the Guardian's "The Counted" database of U.S. police killings during the span of the test period (Sept.-Dec. 2016), and found 39 persons they did not include in the database, but who were in fact killed by police. This implies our approach could augment journalistic collection efforts. Additionally, our model could help practitioners by presenting them with sentence-level information in the form of Table 7; we hope this could decrease the amount of time and emotional toll required to maintain real-time updates of police fatality databases.
CNN: Model predictions were relatively unstable during the training process. Despite the fact that EM's evidence lower bound objective ($H(Q) + \mathbb{E}_Q[\log P(Z, Y \mid X)]$) converged fairly well on the training set, test set AUPRC substantially fluctuated as much as 2% between epochs, and also between three different random initializations for training (Fig. 3). We conducted these multiple runs initially to check for variability, then used them to construct a basic ensemble: we averaged the three models' mention-level predictions before applying noisy-or aggregation. This outperformed the individual models, especially for EM training, and showed less fluctuation in AUPRC, which made it easier to detect convergence. Reported performance numbers in Table 5 are with the average of all three runs from the final epoch of training.
LR vs. CNN: After feature ablation we found that hard-CNN and hard-LR with n-gram features (N1-N5) had comparable AUPRC values (Table 5). But adding dependency features (D1-D4) caused the logistic regression models to outperform the neural networks (albeit with bare significance: p = 0.046). We hypothesize these dependency features capture longer-distance semantic relationships between the entity, fatality trigger word, and police officer, which short n-grams cannot. Moving to sequence or graph LSTMs may better capture such dependencies.
Soft (EM) training: Using the EM algorithm gives substantially better performance: for the CNN, AUPRC improves from 0.130 to 0.164, and for LR, from 0.142 to 0.193. (Both improvements are statistically significant.) Logistic regression with EM training is the most accurate model. Examining the precision-recall curves (Fig. 4), many of the gains are in the higher confidence predictions (left side of figure). In fact, the soft EM model makes fewer strongly positive predictions: for example, hard-LR predicts $y_e = 1$ with more than 99% confidence for 170 out of 24,550 test set entities, but soft-LR does so for only 24. This makes sense given that the hard-LR model at training time assumes that many more positive entity mentions are evidence of a killing than they are in reality (§4.2).
Manual analysis: Manual analysis of false positives indicates misspellings or mismatches of names, police fatalities outside of the U.S., people who were shot by police but not killed, and names of police officers who were killed are common.
Future work: While we have made progress on this application, more work is necessary for accuracy to be high enough to be useful for practitioners. Our model allows for the use of mention-level semantic parsing models; systems with explicit trigger/agent/patient representations, more like traditional event extraction systems, may be useful, as would more sophisticated neural network models, or attention models as an alternative to disjunction aggregation (Lin et al., 2016).
One goal is to use our model as part of a semi-automatic system, where people manually review a ranked list of entity suggestions. In this case, it is more important to focus on improving recall, specifically improving precision at high-recall points on the precision-recall curve. Our best models, by contrast, tend to improve precision at lower-recall points on the curve. Higher recall may be possible through cost-sensitive training (e.g. Gimpel and Smith (2010)) and using features from beyond single sentences within the document.
Furthermore, our dataset could be used to contribute to communication studies, by exploring research questions about the dynamics of media attention (for example, the effect of race and geography on coverage of police killings), and discussions of historical killings in news; for example, many articles in 2016 discussed Michael Brown's 2014 death in Ferguson, Missouri. Improving NLP analysis of historical events would also be useful for the event extraction task itself, by delineating between recent events that require a database update, versus historical events that appear as "noise" from the perspective of the database update task. Finally, it may also be possible to adapt our model to extract other types of social behavior events.

[18] We attempted to correct non-U.S. false positive errors by using CLAVIN, an open-source country identifier, but this significantly hurt recall.