Multi-Sentence Argument Linking

We present a novel document-level model for finding argument spans that fill an event’s roles, connecting related ideas in sentence-level semantic role labeling and coreference resolution. Because existing datasets for cross-sentence linking are small, development of our neural model is supported through the creation of a new resource, Roles Across Multiple Sentences (RAMS), which contains 9,124 annotated events across 139 types. We demonstrate strong performance of our model on RAMS and other event-related datasets.


Introduction
Textual event descriptions may span multiple sentences, yet large-scale datasets predominantly annotate events and their arguments at the sentence level. Data (un-)availability has driven researchers to focus on sentence-level tasks such as traditional semantic role labeling, even though perfect performance at such tasks would still yield an incomplete understanding of an event at the document level.
In this work we approach event argument understanding as a form of linking, more akin to coreference resolution than traditional semantic role labeling. Roles are taken to be evoked by an event trigger, with these implicit arguments then potentially linked to explicit mention spans somewhere in the document.
Figure 1: A passage with arguments underlined and event triggers bolded. Arguments and triggers belonging to the same event structure are co-indexed. Arguments may participate in multiple events.
Intuitively we recognize the possible existence of fillers for arguments, implied by the roles in the context of the passage, e.g., the LOCATION of the particular explosion event or the RECIPIENT of the particular providing event. These implicit arguments are linked to the explicit arguments in the document (i.e., text spans). This framework allows for arguments to participate in multiple events: "The victims" are involved in the explosion, the providing, and the airlifting events. Additionally, this framework does not restrict each role to be filled by only one argument, nor does it restrict each explicit argument to take at most one role.
Prior work on annotating cross-sentence argument links has produced small datasets due to the cost of annotating with document-level context. Existing datasets either focus on a small number of predicate types (Gerber and Chai, 2010, 2012; Feizabadi and Padó, 2014) or on a small number of documents (Ruppenhofer et al., 2010).
We introduce Roles Across Multiple Sentences (RAMS), a dataset of 9,075 annotated events from news text based on an ontology of 139 event types and 65 distinct roles. In a 5-sentence window around the trigger for each event, we annotated the closest argument span for each role (if present). We find that 18% of these arguments are not in the same sentence as the event trigger.
We adapt recent span-based models (Lee et al., 2018; He et al., 2018; Ouchi et al., 2018) to the multi-sentence argument linking task, both on our annotated dataset, RAMS, and on an existing slot filling dataset, GVDB (Pavlick et al., 2016). On RAMS with gold argument span boundaries, our best model achieves 68.1 $F_1$ without gold event types and 73.2 $F_1$ with event types. We also establish much stronger baselines with broader coverage over GVDB.
Our experiments highlight our model's adaptability to a limited set of event-based tasks, which we plan to expand. We also plan to extend our experiments on RAMS to include argument detection rather than relying on gold spans. We hope our document-level model and RAMS are valuable contributions to further automatic understanding of events at the document level.

Background

Argument Linking
In this paper, we adopt the following formalism for Argument Linking. Given a document $D$, we assume there is a set of typed events $E$, each designated by a trigger (a text span in $D$). Each event type evokes a set of roles that the event's arguments can take, denoted $R_e$. For each $e \in E$, the task is to link roles with arguments (text spans in $D$) if they exist. Specifically, the task is to find all $(r, a)$ pairs given $e$ such that $r \in R_e$ and $a \in D$.
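The formalism above can be made concrete with a small data-structure sketch. The names `Event`, `is_valid_link`, the toy ontology, and the toy document are all hypothetical and serve only to illustrate the constraint that a role must belong to $R_e$ and an argument must be a span of $D$.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token offsets into the document


@dataclass(frozen=True)
class Event:
    trigger: Span      # text span in D that designates the event
    event_type: str    # determines the permitted role set R_e


def is_valid_link(event: Event, role: str, arg: Span, doc_len: int,
                  ontology: Dict[str, Set[str]]) -> bool:
    """Check that r is in R_e and that a is a well-formed span inside D."""
    start, end = arg
    in_document = 0 <= start <= end < doc_len
    return in_document and role in ontology.get(event.event_type, set())


# Hypothetical toy ontology and document, for illustration only.
ONTOLOGY = {"conflict.attack": {"attacker", "target", "place"}}
DOC = "A bomb exploded downtown . The victims were airlifted .".split()
event = Event(trigger=(2, 2), event_type="conflict.attack")
links: List[Tuple[str, Span]] = [("place", (3, 3)), ("target", (5, 6))]
valid = [is_valid_link(event, r, a, len(DOC), ONTOLOGY) for r, a in links]
```

Note that nothing in this representation requires the argument span to lie in the same sentence as the trigger.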
Compared to Semantic Role Labeling (SRL), which operates on predicates at the sentence level (as specified by the OntoNotes 5.0 dataset (Weischedel et al., 2013; Pradhan et al., 2013)), Argument Linking operates at the document level. In other words, an event trigger and argument are not constrained to be in the same sentence. Furthermore, arguments for every event (predicate) in SRL assume roles from the same (small) universe of abstracted roles. However, this is not the case in Argument Linking, in which the event type restricts the set of possible roles.
Argument Linking also draws similarities to slot filling, exemplified by the various MUC datasets (Sundheim, 1992) and the Gun Violence Database (GVDB) (Pavlick et al., 2016). In the slot filling setting, documents evoke (perhaps multiple) templates (events), each with its own set of slots (roles). However, slot filling tasks do not necessarily mark event triggers. In Argument Linking, arguments assume roles relative to a particular lexicalized event (i.e., tokens in the text).

Non-Local Arguments
We are not the first to consider non-local event arguments; our contributions are a new large-scale dataset and a novel model for the task. Here we review prior considerations of non-local arguments and refer to O'Gorman (2019) for further reading.
Recently, under the DARPA AIDA program, the Linguistic Data Consortium (LDC) has annotated document-level event arguments, employing a new three-level hierarchical event ontology (see Figure 2) influenced by prior LDC-supported ontologies such as ERE and ACE. The AIDA corpus consists of a mix of Russian, Ukrainian, and English texts, along with other modalities, with a focus on full-document annotation. For a given document, if an AIDA-salient event is discovered, it is annotated, but with no guarantee that any given type in the ontology has even a single example in the training corpus. Of the 139 distinct event sub-subtypes in the ontology, only 88 have at least one annotation in the text portion of the collection, with 1,559 text-based event triggers in total. We have calculated that 38.1% of these annotated events have an entity argument outside the sentence containing the trigger. Our dataset, RAMS, employs the same annotation ontology. It is restricted to English, but is significantly larger and covers 137 of the 139 types in the ontology. Figure 6 compares the two datasets.
Much of the effort on non-local arguments, sometimes called implicit SRL, has focused on two datasets, one based on stories that were produced for SemEval-2010 Task 10 (Ruppenhofer et al., 2010) and the other an expansion of NomBank (Meyers et al., 2004) compiled by Gerber and Chai (2010, 2012). Although non-local arguments are a common phenomenon (Gerber and Chai (2012) found that their annotation of non-local arguments added 71% (relative) role coverage to NomBank annotations), these datasets are significantly smaller than RAMS: the SemEval shared task training set contains 1,370 frame instantiations over 438 sentences, and the data from Gerber and Chai (2012)
It is not surprising that, across multiple datasets, a significant number of event arguments are observed to be non-local, given the analysis of zero anaphora and definite null complements by Fillmore (1986) and the distinction between "core" and "non-core" frame elements or roles in FrameNet (Baker et al., 1998) and PropBank (Palmer et al., 2005).
As previous datasets have been small, various approaches have been taken to ameliorate scarcity. To obtain more training data, Roth and Frank (2013) automatically induce implicit arguments from pairs of comparable texts, but recover only a small set of additional arguments. Feizabadi and Padó (2015) combine existing corpora to increase and diversify sources of model supervision. In contrast to prior datasets, RAMS contains over 9,000 annotated examples covering a wide range of nominal and verbal triggers.

Architecture
Our model architecture is similar to recent models for SRL (He et al., 2018; Ouchi et al., 2018). Contextualized embeddings of the text are used to compute span representations for all spans up to a certain length. These spans are then pruned and scored alongside the trigger span and learned role embeddings to determine the best argument span (possibly none) out of a set of candidate arguments, $A$, for each (event, role) pair, i.e., $\arg\max_{a \in A} P(a \mid e, r)$ for each event $e \in E$ and role $r \in R_e$.
Representations To represent spans from within the text, we adopt the convention from Lee et al. (2017) to compute vector representations of each span by independently encoding each sentence with a bidirectional LSTM starting from contextualized encodings (Peters et al., 2018; Devlin et al., 2018), GloVe embeddings (Pennington et al., 2014), and character-level convolutions. The hidden states at the start and end of the span are concatenated along with a feature vector for the size of the span and a soft head word vector produced by a learned attention mask over the word vectors within the span. While originally motivated by coreference resolution, this method for representing spans has also been used for a broad suite of core NLP tasks (Swayamdipta et al., 2018; He et al., 2018; Tenney et al., 2019b).
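As an illustration of the span representation just described, here is a minimal pure-Python sketch. It assumes the per-token hidden states, word vectors, and attention scores have already been computed (in the model they come from the BiLSTM and a learned attention scorer); `span_representation` and `width_emb` are hypothetical names, not the authors' code.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def span_representation(hidden: List[Vector], words: List[Vector],
                        attn_scores: List[float], start: int, end: int,
                        width_emb: List[Vector]) -> Vector:
    """Concatenate [h_start; h_end; soft head word; width feature] for an inclusive span."""
    # Attention is normalized only over the tokens inside the span.
    weights = softmax(attn_scores[start:end + 1])
    dim = len(words[0])
    head = [sum(w * words[start + i][d] for i, w in enumerate(weights))
            for d in range(dim)]
    width = width_emb[end - start]  # index 0 corresponds to a single-token span
    return hidden[start] + hidden[end] + head + width


# Toy inputs: three tokens, 1-d hidden states, 2-d word vectors.
hidden = [[1.0], [2.0], [3.0]]
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rep = span_representation(hidden, words, [0.0, 0.0, 0.0], 0, 1,
                          width_emb=[[0.1], [0.2], [0.3]])
```

With uniform attention scores, the soft head word vector reduces to the average of the word vectors in the span.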
To represent a role $r$, we consider the universe of possible roles $r \in R$ according to the ontology and learn a separate embedding $r$ for each role. Similarly, we represent all candidate arguments with a span embedding $a$, consistent with Lee et al. (2017). Since our objective is to link candidate arguments to (event, role) pairs, we construct an event-role representation by applying a feed-forward neural network ($F_{\tilde{a}}$) to the concatenation of an event trigger representation $e$ and a role embedding $r$:
$$\tilde{a}_{e,r} = F_{\tilde{a}}([e; r]) \quad (1)$$
Our formulation for constructing the representations is more flexible than using the function $r \cdot F_{E,A}([e; a])$, which is analogous to the role-specific scorers used by He et al. (2018) and Tenney et al. (2019b), because $r$ can interact arbitrarily with $e$. Equation 1 is similar to the input representations used in the graph LSTM model by Song et al. (2018) for cross-sentence relation extraction, in which word representations and edge label representations are concatenated and linearly projected to form edge representations.
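The event-role representation can be sketched as a feed-forward layer over the concatenation of the trigger representation and the role embedding. The single ReLU layer below is an assumed stand-in for the feed-forward network (its depth and activation are not specified here), and all names and weights are illustrative.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def linear(W: Matrix, b: Vector, x: Vector) -> Vector:
    """Affine map W x + b using plain lists."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def event_role_representation(e: Vector, r: Vector, W: Matrix, b: Vector) -> Vector:
    """One-layer stand-in for the feed-forward network applied to [e; r]."""
    return relu(linear(W, b, e + r))  # list addition is concatenation


# Toy example: 2-d trigger representation, 1-d role embedding.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0],
     [-1.0, 0.0, 0.0]]
b = [0.0, 0.0, 0.0]
a_tilde = event_role_representation([1.0, 2.0], [3.0], W, b)
```

Because the role embedding enters before the non-linearity, it can interact with every dimension of the trigger representation rather than only through a final dot product.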
Pruning Given a text of length $n$, there are $O(n^2)$ spans of text we could consider as candidate arguments. We consider spans up to a fixed width that do not cross sentence boundaries, which reduces the number of spans to $O(n)$. Following He et al. (2018), we score each span $a$ using a learned unary function of its representation, $s_A(a) = w_A^\top F_A(a)$, and keep only the top $\lambda_A n$ spans, where $\lambda_A \in [0, 1]$ is a hyperparameter. We refer to this set of top-scoring candidate argument spans as $A$.
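The span enumeration and unary pruning step can be sketched as follows. The scoring function is passed in as a callable because the learned scorer is outside the scope of this sketch; `enumerate_spans` and `prune_spans` are hypothetical helper names.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token offsets


def enumerate_spans(sentence_lengths: List[int], max_width: int) -> List[Span]:
    """All spans of at most max_width tokens that stay inside a single sentence."""
    spans: List[Span] = []
    offset = 0
    for length in sentence_lengths:
        sentence_end = offset + length  # exclusive
        for start in range(offset, sentence_end):
            last = min(start + max_width, sentence_end)
            spans.extend((start, end) for end in range(start, last))
        offset = sentence_end
    return spans


def prune_spans(spans: List[Span], score: Callable[[Span], float],
                lam: float, n: int) -> List[Span]:
    """Keep the top lam * n spans by unary score (n is the document length)."""
    keep = int(lam * n)
    return sorted(spans, key=score, reverse=True)[:keep]


# Toy document: two sentences of 3 and 2 tokens, spans up to width 2.
spans = enumerate_spans([3, 2], max_width=2)
top = prune_spans(spans, score=lambda s: 10 * s[0] + s[1], lam=0.4, n=5)
```

With a fixed maximum width, each token starts only a constant number of spans, which is where the linear bound comes from.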
Unlike He et al. (2018), $n$ in our case is not the length of a sentence, but the length of a document. Thus, our set of candidate arguments $A$ grows linearly with document length. Further, we create $\sum_e |R_e|$ event-role representations, and so we need to evaluate $O(n \sum_e |R_e|)$ combinations of events, roles, and arguments, which is prohibitively large when there are numerous events and roles. Assuming the number of events is linear in document length, the number of combinations would be quadratic in document length (rather than quadratic in sentence length). Lee et al. (2018) address this issue in coreference resolution, a different document-level task, by implementing a coarse pruner to limit the number of candidate spans that are subsequently scored. In the case that the event type is not explicitly known during inference, any role can potentially be filled. Thus, we do not wish to prematurely prune $(e, r)$ pairs, and so we must prune the number of possible candidate arguments, $A$, to a more manageable size. In addition, rather than using a coarse scorer between $a \in A$ and each event-role pair $(e, r)$, we assign a score between $a$ and the event $e$. This relaxation reduces the amount of computation in the coarse pruning step and reflects a loose notion of how likely an argument span is to co-occur with an event, which can be determined irrespective of a role:
$$s_c(e, a) = e^\top W_c\, a + \phi_c(e, a) \quad (2)$$
where $W_c$ is learned and the features used in $\phi_c(e, a)$ can depend on the task; in this paper, it is the score of a bucketed distance embedding. The asymmetry between $e$ and $a$ in the above equation arises because the triggers are given and do not have to be found through a scoring process. By taking the top-$k$ scoring candidate argument spans in relation to $e$, we reduce our set of candidate arguments to a constant-sized set $A_e \subseteq A$.
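A minimal sketch of the coarse pruning step follows. It assumes a bilinear form for the event-argument score, in the style of the coarse scorer of Lee et al. (2018), plus an additive distance feature; the exact functional form, the helper names, and the toy values are assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Span = Tuple[int, int]
Vector = List[float]


def bilinear(e: Vector, W: List[Vector], a: Vector) -> float:
    """Compute e^T W a with plain lists."""
    return sum(e[i] * W[i][j] * a[j]
               for i in range(len(e)) for j in range(len(a)))


def coarse_prune(e: Vector,
                 candidates: Sequence[Tuple[Span, Vector]],
                 W_c: List[Vector],
                 distance_score: Callable[[Span], float],
                 k: int) -> List[Span]:
    """Keep the k candidate spans with the highest coarse score against event e."""
    scored = [(bilinear(e, W_c, vec) + distance_score(span), span)
              for span, vec in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [span for _, span in scored[:k]]


# Toy example with an identity weight matrix and no distance feature.
identity = [[1.0, 0.0], [0.0, 1.0]]
cands = [((0, 0), [0.5, 0.0]), ((1, 1), [2.0, 0.0]), ((2, 2), [1.0, 0.0])]
kept = coarse_prune([1.0, 0.0], cands, identity, lambda span: 0.0, k=2)
```

Since the score ignores the role, it is computed once per (event, argument) pair rather than once per (event, role, argument) triple.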
Scoring We introduce a link scoring function, $l(a, \tilde{a}_{e,r})$, which scores candidate argument spans $a \in A_e$ and event-role pairs $\tilde{a}_{e,r} = (e, r) \in E \times R$. We adopt the scoring function from Lee et al. (2017) and He et al. (2018), which decomposes the function into unary and binary scores, i.e.,
$$l(a, \tilde{a}_{e,r}) = s_E(e) + s_{E,R}(e, r) + s_{A,R}(a, r) + s_l(a, \tilde{a}_{e,r}) + s_c(e, a) \quad (3)$$
where $\phi_l(a, \tilde{a}_{e,r})$, used by $s_l$, is an additional feature vector containing information such as document genre and the (bucketed) token distance between $e$ and $a$, and the $F_x$ underlying each score are feed-forward neural networks. The direct scoring of candidate arguments against event-role pairs, $s_l(a, \tilde{a}_{e,r})$, bears similarities to the approach taken by Schenk and Chiarcos (2016), which finds the candidate argument whose representation is most similar to the prototypical filler of a frame element (role). Each component of the overall link score may be ablated.
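Because the link score is a sum of components, ablating a component amounts to leaving it out of the sum. The sketch below illustrates this with hypothetical component names and made-up values; it does not reproduce how each component is computed.

```python
from typing import Dict, Iterable


def link_score(components: Dict[str, float], enabled: Iterable[str]) -> float:
    """Sum only the enabled score components; ablated ones contribute nothing."""
    return sum(components[name] for name in enabled)


# Hypothetical component scores for one (argument, event, role) candidate.
scores = {"s_ER": 0.5, "s_AR": 1.0, "s_l": 2.0, "s_c": -0.25}

full = link_score(scores, scores.keys())
# Keep only the argument-role score and the direct link score.
reduced = link_score(scores, ["s_AR", "s_l"])
```

A candidate is preferred over the empty argument only when the resulting sum is positive, so removing a negative-valued component can change a decision as easily as removing a positive one.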
Learning Following prior work, an empty argument $\epsilon$ represents "no argument" and has link score $l(\epsilon, \tilde{a}_{e,r}) = 0$, which acts as a threshold for the link function. For every event-argument-role triple $(e, a, r)$, we directly maximize $P(a \mid e, r)$.
Decoding We experiment with three decoding settings: argmax, greedy, and type-constrained.
Assuming each role is satisfied by exactly one argument (potentially $\epsilon$), we can perform argmax decoding independently for each role: $\hat{a} = \arg\max_{a \in A_e \cup \{\epsilon\}} P(a \mid e, r)$. It is possible to instead predict multiple non-overlapping arguments for a single role with a greedy threshold-based decoding strategy (Ouchi et al., 2018). While an instance of the task may provide a mapping of events $e$ to their permitted roles $R_e$, our model does not necessarily make use of this mapping. This is desirable in general for the case when the model does not know gold event types. However, if gold event types are provided, they can be used during training by masking out illegal event-role pairs. We adopt a simpler approach of type-constrained decoding, which removes illegal event-role pairs from the model's predictions when given gold event types. Furthermore, if an event type allows $m_r$ arguments of a given role $r$, constrained decoding keeps at most $m_r$ arguments based on their link scores and discards the rest.
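The three decoding settings can be sketched over a table of link scores for a single (event, role) pair. This is an illustrative reading of the description above, not the authors' implementation: the empty argument is represented by `None` with an implicit score of 0, and the greedy variant is a simple threshold-plus-overlap check.

```python
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token offsets


def argmax_decode(scores: Dict[Span, float]) -> Optional[Span]:
    """Return the best span, or None (the empty argument) if nothing beats score 0."""
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] > 0.0 else None


def greedy_decode(scores: Dict[Span, float]) -> List[Span]:
    """Take spans in descending score order while positive and non-overlapping."""
    chosen: List[Span] = []
    for span in sorted(scores, key=scores.get, reverse=True):
        if scores[span] <= 0.0:
            break  # everything after this is below the empty-argument threshold
        if all(span[1] < s[0] or s[1] < span[0] for s in chosen):
            chosen.append(span)
    return chosen


def constrain(predictions: Dict[str, List[Tuple[Span, float]]],
              allowed: Dict[str, int]) -> Dict[str, List[Tuple[Span, float]]]:
    """Drop roles the event type does not permit; keep at most m_r spans per role."""
    return {role: sorted(items, key=lambda x: x[1], reverse=True)[:allowed[role]]
            for role, items in predictions.items() if role in allowed}


scores = {(0, 1): 2.0, (1, 2): 1.5, (4, 4): 0.5, (6, 7): -1.0}
best = argmax_decode(scores)
multi = greedy_decode(scores)
preds = {"place": [((4, 4), 0.5), ((0, 1), 2.0)], "jailer": [((6, 7), 0.1)]}
typed = constrain(preds, allowed={"place": 1})
```

Here the span (1, 2) is skipped by greedy decoding because it overlaps the higher-scoring (0, 1), and the role `jailer` is removed because the hypothetical event type does not permit it.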

Related Models
Our model is inspired by several recent span selection models (He et al., 2018; Lee et al., 2018; Ouchi et al., 2018). In particular, O'Gorman (2019) speculates about a joint coreference and SRL model in which implicit discourse referents are generated for each event predicate and subsequently clustered with the discovered referent spans using a model for coreference. Furthermore, the author claims that these span selection models would be difficult to scale to the document level, which is the regime we are most interested in. In our model, we focus only on the implicit discourse referents (i.e., the event-role representations) and link them to a single existing mention, rather than cluster them with all coreferent mentions.
CoNLL 2012 SRL As our model bears similarities to the SRL model proposed by He et al. (2018), we evaluate our model on the sentence-level CoNLL 2012 dataset as a sanity check. Based on a small hyperparameter sweep, our model achieves 81.4 $F_1$ when given gold predicates and 81.2 $F_1$ when not given gold predicates. Preliminary analysis shows that our model's recall is harmed by the pruning necessary to operate on entire documents. Although our model is designed to accommodate cross-sentence predictions, it maintains competitive performance on sentence-level SRL.

RAMS
In this section, we introduce our dataset with annotated Roles Across Multiple Sentences (RAMS). We constructed a crowd-sourced dataset with 9,075 event and argument annotations following the AIDA ontology referenced earlier. Each data point consists of a typed trigger span and 0 or more argument spans in an English document. A trigger span is a word or phrase that evokes a certain event type in the context of a document (e.g., "pledge" may evoke the CONTACT-COMMITMENTPROMISEEXPRESSINTENT-BROADCAST event sub-subtype in certain contexts), while argument spans denote participants (with a certain role) in the event (e.g., the COMMUNICATOR or the RECIPIENT). Both trigger and argument spans are token-level (start, end) offsets into a pretokenized text document.
Typically, event and relation datasets annotate only the argument spans that are in the same sentence as the trigger span, but we present annotators with a multi-sentence context window: the sentence containing the trigger span, some number of sentences before, and some number of sentences after. The annotator is then able to select argument spans anywhere inside of the context window.

Dataset Description
Data Source We used Reddit, a popular internet forum, to identify a collection of texts likely to contain AIDA-relevant event mentions. On Reddit, users make submissions containing links to news articles, images, videos, or other kinds of documents, and other users may then vote or comment on the submitted content. We collected news articles matching the following criteria: 1) posted to the r/politics sub-forum between January and October 2016; 2) resulted in threads with at least 25 comments; and 3) contained at least one mention of the string "Russia". The resulting subset of articles tended to describe geopolitical events and relations like the ones in the AIDA ontology. The 25-comment threshold serves to filter out low-quality, fake, or disreputable news articles, since the number of comments is a signal of information content. This approach of gathering user-submitted and subsequently curated content through Reddit is similar to those used for creating large datasets for language model pretraining (Radford et al., 2019). After applying these criteria we identified approximately 12,000 news articles with an average length of approximately 40 sentences.
Annotation We manually constructed a mapping from each AIDA event (sub-(sub))type to a small list of lexical units (LUs) likely to evoke that type. This mapping was intended to have high precision and low recall, in that for a given (Event, Wordlist) pair, the LUs in the Wordlist were all likely to evoke the Event, but the Wordlist could be missing many LUs that also evoked the Event. Each event type had 3.89 manually curated LUs on average. Using the type-to-LU mapping we performed a soft match between every LU in the mapping and every word in our collection of texts in order to select candidate sentences with respect to each event type. In the soft matching procedure, we stemmed and lower-cased the words in order to get a high-recall set of candidate sentences. This matching procedure returned approximately 94,000 candidates, which we then balanced at the LU level, i.e., we sampled the same number of candidate sentences for each LU.
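The soft matching and LU-level balancing described above can be sketched as follows. The stemmer used for the dataset is not specified here, so `crude_stem` is a deliberately simple, hypothetical stand-in; the event type names, LUs, and sentences are toy examples.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (event type, lexical unit)


def crude_stem(word: str) -> str:
    """Toy suffix stripper standing in for a real stemmer (an assumption)."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    if word.endswith("e") and len(word) > 3:
        word = word[:-1]
    return word


def find_candidates(sentences: List[str],
                    type_to_lus: Dict[str, List[str]]) -> Dict[Key, List[int]]:
    """Map each (event type, LU) to indices of sentences containing a soft match."""
    stem_to_keys: Dict[str, List[Key]] = defaultdict(list)
    for event_type, lus in type_to_lus.items():
        for lu in lus:
            stem_to_keys[crude_stem(lu)].append((event_type, lu))
    candidates: Dict[Key, List[int]] = defaultdict(list)
    for idx, sentence in enumerate(sentences):
        for token in sentence.split():
            stem = crude_stem(token.strip(".,;:!?\"'"))
            for key in stem_to_keys.get(stem, []):
                if idx not in candidates[key]:
                    candidates[key].append(idx)
    return dict(candidates)


def balance(candidates: Dict[Key, List[int]], per_lu: int,
            seed: int = 0) -> Dict[Key, List[int]]:
    """Sample the same number of candidate sentences for every LU (or all, if fewer)."""
    rng = random.Random(seed)
    return {key: rng.sample(ids, min(per_lu, len(ids)))
            for key, ids in candidates.items()}


sentences = ["He pledged support.", "They attacked the base.",
             "Attacks continued.", "Nothing happened."]
type_to_lus = {"contact.commitment": ["pledge"], "conflict.attack": ["attack"]}
cands = find_candidates(sentences, type_to_lus)
balanced = balance(cands, per_lu=1)
```

Matching on stems is what makes the procedure high-recall: "pledged" and "Attacks" both match their LU even though the surface forms differ.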
Candidate sentences were then manually vetted by crowd-sourcing to ensure that they evoked their associated event type. Each vetting task contained an event definition and several candidate sentences, each with a highlighted LU. Annotators were asked to judge how well each highlighted LU, in the context of its sentence, matched the provided event definition. They were also asked to assess the factuality of the sentence, i.e., whether the event is stated to have actually happened. We collected judgments on approximately 17,500 candidate sentences.
Of the 17,500 candidate sentences, 52% were determined to match their provided event type definition and have positive factuality, yielding 9,075 sentences, each with a highlighted LU trigger known to evoke a given event type. Using these sentences we then collected multi-sentence annotations, presenting annotators with a 5-sentence window containing two sentences of context before the sentence with the trigger and two sentences after. Each argument selection task contained five tokenized sentences, a contiguous set of tokens marking the trigger, a definition of the event, and a list of roles and their associated definitions. For each role, annotators were asked whether a corresponding argument was present in the 5-sentence window, and if so, to highlight the argument span that was closest to the event trigger, as there could be multiple. Annotators were allowed to highlight any set of contiguous tokens within the 5-sentence window aside from the trigger tokens.
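Extracting the 5-sentence annotation window is straightforward; a small sketch is given below. How triggers near the start or end of a document were handled is not stated above, so clipping the window at document boundaries is an assumption, and `context_window` is a hypothetical helper name.

```python
from typing import List


def context_window(sentences: List[str], trigger_sentence: int,
                   before: int = 2, after: int = 2) -> List[str]:
    """Return the trigger sentence with up to `before`/`after` neighbors on each side."""
    lo = max(0, trigger_sentence - before)                  # clip at document start
    hi = min(len(sentences), trigger_sentence + after + 1)  # clip at document end
    return sentences[lo:hi]


doc = [f"s{i}" for i in range(8)]
middle = context_window(doc, trigger_sentence=4)
edge = context_window(doc, trigger_sentence=0)
```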
A window size of five sentences was chosen because 90% of event arguments in the AIDA Phase 1 training dataset are recoverable in this window size. Similarly, Gerber and Chai (2010) found that, in their data, almost 90% of implicit arguments could be resolved in the two sentences preceding the trigger. Figure 4 shows that annotated arguments fall close to the trigger in RAMS as well, with over 80% of the annotated arguments occurring in the same sentence as the trigger. On average, we collected 66 full annotations (trigger and arguments) per event type. Statistics on dataset size and coverage are given in Table 1 and in Figure 3.
Inter-Annotator Agreement We randomly selected 93 tasks for redundant annotation in order to measure inter-annotator agreement, collecting five responses per task from distinct users.We report two types of agreement: first, the extent to which annotators chose, for a given role, the same (start, end) span boundaries; and second, the frequency with which annotators agreed a given role was or was not present in the context window.
For span boundary agreement, we present several pairwise statistics. For each annotated (event, role) combination, we consider up to $\binom{5}{2} = 10$ pairs of judgments; the actual number of pairs considered is occasionally smaller since we discard pairs where the annotators disagree on whether a role is present. We present a separate analysis to address agreement on role presence.
Table 2 reports the pairwise span boundary agreement for several span difference thresholds, where span difference is calculated by using the absolute difference of the (start, end) token indices from each pair. In conjunctive agreement, both $|start_1 - start_2|$ and $|end_1 - end_2|$ must be at most the given threshold; therefore, conjunctive agreement at threshold 0 is the percent of pairs that exactly agree. Disjunctive agreement is less strict, requiring only that either the absolute difference of start offsets or that of end offsets be at most the threshold. Start and end agreement is determined by considering whether the absolute difference of the pair's start or end offsets (respectively) is within the given threshold.
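The pairwise agreement statistics can be sketched as below. The comparison uses "at most the threshold" so that threshold 0 corresponds to exact agreement; the function names and the five toy judgments are illustrative only.

```python
from itertools import combinations
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets chosen by one annotator


def conjunctive(a: Span, b: Span, t: int) -> bool:
    """Both the start and the end offsets differ by at most t."""
    return abs(a[0] - b[0]) <= t and abs(a[1] - b[1]) <= t


def disjunctive(a: Span, b: Span, t: int) -> bool:
    """Either the start or the end offsets differ by at most t."""
    return abs(a[0] - b[0]) <= t or abs(a[1] - b[1]) <= t


def pairwise_agreement(judgments: List[Span], t: int,
                       agree: Callable[[Span, Span, int], bool]) -> float:
    """Fraction of annotator pairs that agree under the given criterion."""
    pairs = list(combinations(judgments, 2))
    return sum(agree(a, b, t) for a, b in pairs) / len(pairs)


# Five hypothetical judgments for one (event, role) combination: 10 pairs.
judgments = [(3, 5), (3, 5), (3, 6), (2, 5), (7, 9)]
exact = pairwise_agreement(judgments, 0, conjunctive)
loose = pairwise_agreement(judgments, 0, disjunctive)
near = pairwise_agreement(judgments, 1, conjunctive)
```

In this toy example only one of the ten pairs agrees exactly, while half of them share at least one boundary.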
To measure the frequency with which annotators agree whether a given role is present, we treat the majority annotation as the gold standard and calculate the precision, recall, and $F_1$ of the annotations. Across the set of redundantly annotated tasks, there were 83 false negatives, 60 false positives, and 892 true positives, giving a precision of 93.7, recall of 91.5, and an $F_1$ of 92.6.
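The reported figures follow directly from the counts above, as this short check shows (the helper name `prf` is ours):

```python
from typing import Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = prf(tp=892, fp=60, fn=83)
print(round(100 * p, 1), round(100 * r, 1), round(100 * f1, 1))  # 93.7 91.5 92.6
```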
The approach of treating majority annotation as gold would be problematic if judgments were frequently evenly split.However, the distribution of presence-absence responses shown in Figure 5 demonstrates that this is not the case.Most of the time, the annotators all vote the same way, all marking the role as either absent (Abs-Dis0) or present (Pres-Dis0).Abs-Dis2 and Pres-Dis2, indicating stronger disagreement, are far less frequent, as are Abs-Dis1 and Pres-Dis1, which indicate minor disagreement.

Related Datasets Comparisons between RAMS, the AIDA Phase 1 data, and Beyond NomBank (Gerber and Chai, 2010, 2012) are given in Figure 6 and Figure 7. For both event and role annotations, RAMS provides larger and broader coverage than the AIDA Phase 1 data and Beyond NomBank. By design, Beyond NomBank focuses on only a few predicate types, but we include its statistics for reference.
Related Protocols Feizabadi and Padó (2014) also consider the case of crowdsourcing annotations for cross-sentence arguments.Like us, they provide annotators with a context window rather than the whole document.Annotators in that work were shown the sentence containing the predicate and the 3 previous sentences.In our data collection, annotators were shown the sentence containing the trigger and two sentences on either side.
Rather than instructing annotators to highlight spans in the text ("marking"), Feizabadi and Padó (2014) directed annotators to fill in blanks in templatic sentences ("gap filling").Not only does this approach not scale easily to large sets of frames, there is no guarantee that annotators will produce text spans verbatim from the document.Argument Linking requires extractable text spans, so the gap filling approach would not produce suitable data for our task.
Furthermore, the collection by Feizabadi and Padó (2014) was limited.They consider only two frames each with four roles over 384 predicates.We annotated over 9,000 triggers across almost 140 event types.
Our two-step annotation protocol of event type verification followed by argument finding is similar to the protocol supported by interfaces such as SALTO (Burchardt et al., 2006) and that of Fillmore et al. (2002).

Experiments and Results
We experiment with both feature-based BERT-base (Devlin et al., 2018) and ELMo (Peters et al., 2018). For BERT, we split the documents into segments of 512 subtokens and encode each segment separately. In practice, only 0.2% of the training documents are split across more than one segment. With ELMo, we encode each sentence separately.
We perform preliminary sweeps across hyperparameter values and then hold them fixed during a more exhaustive sweep across scoring features. We also compare argmax decoding with greedy decoding during training. The best model is selected based on $F_1$ on the development set, and ablations are reported in Table 5. Our final model uses greedy decoding, $s_{A,R}$, and $s_l$ and omits $s_{E,R}$ and $s_c$ (see Equation 3). More details can be found in Appendix A.
We also compare the decoding method used during training with type-constrained decoding, to observe whether the model is able to effectively use gold event types if they are given at test time.
The results for both BERT and ELMo with greedy and type-constrained decoding are reported in Table 3. Providing the event type at test time reduces the number of predictions made by the model, boosting precision. Recall is still affected because it is possible for the model to be more confident in the wrong argument for a given role, thus filtering out the less confident, correct one. Nevertheless, using gold types at test time leads to gains in performance.

Analysis
Argument-Trigger Distance One of the differentiating components of RAMS (compared to SRL datasets) is its non-local annotation of arguments. At the same time, the data is natural and arguments are still heavily distributed within the same sentence as the trigger (see Figure 4). This setting allows us to ask whether our model is still able to accurately label arguments outside of the sentence containing the event trigger.
We compute $F_1$ based on distance on the development set (where ELMo outperforms BERT) in Table 4 and find that the trend persists: the better model generally does better across the board. Each model performs best (often by a small margin) when the event trigger and argument span are in the same sentence, which may be attributable to the fact that over 80% of the data consists of arguments and event triggers that are in the same sentence.
Role Embeddings and Confusion Both the BERT and ELMo models for this task learn explicit 50-dimensional representations for each role. We present the cosine similarity between the learned role embeddings in the BERT model (which performs best on the training objective) and also visualize the errors made by the BERT model under argmax decoding. Figure 8 visualizes the cosine similarity between the learned role embeddings. It shows that some roles are more correlated than others. For example, origin and destination are the closest pair of role embeddings, possibly because they co-occur frequently and have the same entity type. Similarly, beneficiary is closest to recipient and giver. While beneficiary is virtually a synonym for recipient, all three roles would be expected to co-occur in similar events. Conversely, the negatively correlated embeddings have different entity types or occur in different events, such as communicator compared to destination and passenger.
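The similarity analysis above amounts to computing pairwise cosine similarities between role embeddings and inspecting the closest pairs. A minimal sketch follows, using made-up 2-dimensional vectors in place of the learned 50-dimensional embeddings; the helper names are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def similarity_table(embeddings: Dict[str, Vector]) -> Dict[Tuple[str, str], float]:
    """Cosine similarity for every unordered pair of roles (names sorted alphabetically)."""
    return {(a, b): cosine(embeddings[a], embeddings[b])
            for a, b in combinations(sorted(embeddings), 2)}


# Made-up embeddings, for illustration only.
toy = {"origin": [1.0, 0.0], "destination": [0.9, 0.1], "communicator": [-1.0, 0.2]}
table = similarity_table(toy)
closest = max(table, key=table.get)
```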
Since Argument Linking is not strictly a role labeling problem, we perform a modified procedure for visualizing a confusion matrix. For each argument span, we first align the correct prediction(s) and subsequently compute confusion for the remaining gold and predicted label(s). If there are no remaining gold or predicted labels for a set, we omit that prediction entirely from the table.
Table 5: $F_1$ on RAMS development data when pairwise scores are separately included/excluded from the link score (Equation 3) in the best performing model. All models use threshold-based greedy decoding (except where noted), optionally followed by constrained decoding (CD).
Many of the errors in the confusion matrix shown in Figure 9 correlate with the cosine distances between roles. For example, the model predicts the more frequent place instead of origin, which share the same entity type, and also confuses similar roles from different events (e.g., target and victim). Incorrect predictions tend to err on the side of the more frequent role. For example, all 5 of the 13 instances of jailer that were linked were incorrectly predicted as preventer (the other 8 were not linked). More examples are in Appendix B.
Ablations Ablation studies on development data for components of the link score as well as the decoding strategy are shown in Table 5. Constrained decoding based on knowledge of gold event types improves $F_1$ in all cases because it removes predictions that are invalid with respect to the ontology. Threshold-based greedy decoding outperforms argmax decoding when constrained decoding is not applied, as is the case when gold event types are unavailable. However, using argmax decoding to predict the one-best argument for each role followed by constrained decoding gives the best overall performance (76.5 $F_1$), and the performance gain from applying constrained decoding after argmax decoding (+7.4) is larger than the performance gain seen by applying constrained decoding after greedy decoding (+4.5).
The most important component is the score between a combined event-role and an argument, which measures the compatibility of the argument filling the role for the event.This follows intuitions that s l is the primary component of the link score since it directly compares the representations in the way specified by the Argument Linking task (finding an argument to assume a given role for a given event).
The distance score helps boost link scores for arguments that are not in the same sentence as the trigger, i.e., in cases where there is no intrasentential syntactic signal present in the link. When the distance score is ablated, performance on in-sentence links drops marginally to 75.5 $F_1$, whereas performance on cross-sentence links drops at every distance level. The largest performance difference occurs when the argument is two sentences after the trigger, where performance drops from 70.8 to 68.0 $F_1$. The compatibility score between an argument and a role (independent of the event), $s_{A,R}$, is also an important component of the link score. Because the roles in RAMS are semantically grounded and concrete (compared to abstracted roles such as ARG0 in PropBank (Palmer et al., 2005)), determining whether an argument is likely to fill a given role (regardless of the particular event) may provide some signal of implicit entity typing. Figure 8 and Figure 9 demonstrate that similar roles tend to take arguments of the same or similar types.

Gun Violence Database (GVDB)
Dataset. We also explore the feasibility of performing other information extraction tasks with our model. The Gun Violence Database (Pavlick et al., 2016) is a collection of news articles related to gun violence in which each article is annotated through crowdsourcing for various attributes of a gun violence event. Unlike the broad event type coverage in RAMS, all documents in GVDB are centered around this single event type. The annotated fields are VICTIM (name, age, race), SHOOTER (name, age, race), LOCATION (specific location[14] or city), TIME (time of day or clock time), and WEAPON (weapon type, number of shots fired). Each field value exists in the document as a span of characters. While GVDB's schema allows for multiple shooters or victims, here we restrict the task to predicting the first mention of each type. Pavlick et al. (2016) perform evaluation in two settings: strict and approximate. A field is marked as correct under the strict setting if any of the predictions for that field matches the string of the correct answer exactly, while an approximate match is awarded if either a prediction contains the correct answer or the correct answer contains the predicted string. The approximate setting is necessary due to inconsistent annotations (e.g., omitting first or last names).
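The two evaluation settings can be written down directly from the description above. The following is a minimal sketch that assumes predictions and gold answers are plain, case-sensitive strings; the function names are hypothetical, and skipping empty predictions is an added safeguard rather than part of the original evaluation.

```python
def strict_match(predictions, gold):
    """Correct if any prediction equals the gold string exactly."""
    return any(pred == gold for pred in predictions)


def approximate_match(predictions, gold):
    """Correct if any prediction contains the gold string or is contained in it."""
    return any(
        pred and (gold in pred or pred in gold)  # skip empty predictions
        for pred in predictions
    )


# A partial name fails the strict setting but passes the approximate one.
partial_is_strict = strict_match(["Smith"], "John Smith")
partial_is_approx = approximate_match(["Smith"], "John Smith")
```

The substring test in both directions is what makes the approximate setting tolerant of annotations that omit a first or last name.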
Each article may contain an additional DATE metadata field marking the publication date, which is not necessarily in the text body of the article. In total, there are 7,366 news articles in the database, spanning primarily from the early 2000s to 2016. For this work, we split the corpus chronologically into a training set of 5,056 articles, a development set of 375, a buffer of 100 (thrown out), and a test set of 472.[15] The remaining articles do not have a reliable date field in the metadata or do not have any annotated fields. We additionally filter out annotated fields that cross sentence boundaries.[16]
Experiments. We assume a document contains a single gun violence event, triggered by the full document. The goal is then to predict the fields for the event, which are the same as roles under our Argument Linking formalization. While the baseline experiments of Pavlick et al. (2016) made sentence-level predictions focusing on five attributes,[17] we make document-level predictions and consider the larger set of attributes.
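A chronological split with a discarded buffer, as described for the GVDB corpus above, might look like the following sketch. It assumes each article is a dictionary with a hypothetical "date" key holding a reliable publication date, and that the buffer sits between the development and test portions, as the ordering in the text suggests; the exact procedure used for the reported split may differ.

```python
from datetime import date


def chronological_split(articles, n_train, n_dev, n_buffer, n_test):
    """Sort by publication date, then slice train / dev / (buffer) / test."""
    ordered = sorted(articles, key=lambda article: article["date"])
    train = ordered[:n_train]
    dev = ordered[n_train:n_train + n_dev]
    # The buffer articles are skipped entirely (thrown out).
    test_start = n_train + n_dev + n_buffer
    test = ordered[test_start:test_start + n_test]
    return train, dev, test


# Toy corpus of 10 dated articles; id 9 is the oldest, id 0 the newest.
toy_articles = [{"id": i, "date": date(2015, 1, 10 - i)} for i in range(10)]
train, dev, test = chronological_split(toy_articles, 6, 1, 1, 2)
```

With the sizes reported above, the call would be `chronological_split(articles, 5056, 375, 100, 472)`, so that the test articles are strictly later than anything seen in training.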
We experiment with the feature-based version of BERT-base and with ELMo as our contextualized encoder. As each slot is filled by only one value, we use argmax decoding. See Appendix C for additional hyperparameters and results.

Results
The results are reported in Table 6. The numerical values are not directly comparable with those of Pavlick et al. (2016) because they make predictions on the full dataset and predict a different set of fields. Although a direct comparison is not possible, we nonetheless present a stronger and more comprehensive baseline for future work with GVDB. Our results show that the Argument Linking formalization is suitable for information extraction tasks like slot filling.

Beyond NomBank
The Beyond NomBank dataset ("BNB") collected by Gerber and Chai (2010) and refined by Gerber and Chai (2012) contains nominal predicates (event triggers) and multi-sentence arguments, both of which are properties shared with RAMS. BNB, however, is much smaller. Our model and dataset assume that arguments are contiguous spans, while many arguments in BNB are made up of split spans. Future work will attempt to resolve this discrepancy, perhaps by accommodating split arguments in the model or adopting continuation tags such as those used in OntoNotes 5.0 (Pradhan et al., 2013; Bonial et al., 2015).

Conclusion
To address the scarcity of cross-sentence argument linking data, we create and release the RAMS dataset, which contains annotations for over 9,000 events covering 139 event types and 65 roles. In addition to achieving 68.1 F1 on our dataset (73.2 F1 when given gold event types), our model also provides stronger and more comprehensive baselines on the Gun Violence Database slot filling dataset. We hope that RAMS will stimulate further work on multi-sentence argument linking.
[16] Since the original data is not tokenized, we use spaCy 2.1.4 for finding sentence boundaries and tokenization.
[17] They predict VICTIM.NAME, SHOOTER.NAME, LOCATION.(CITY|LOCATION), TIME.(TIME|CLOCK), and WEAPON.WEAPON; see Table 6.

Figure 2: Subset of the AIDA ontology, demonstrating the three-level Type.Subtype.Subsubtype event hierarchy. Dotted grey edges point to roles for two of the event nodes, which have one role in common (Place).

Figure 3: Number of event types for which a given percentage of roles are filled in the RAMS train set. Role coverage per event type is calculated as the average number of filled roles per instance of the event type divided by the number of roles specified by the ontology.

Figure 4: Distances between triggers and arguments in RAMS and proportion of arguments at that distance (counts are shown above each bar). Negative distances indicate that the argument appears before the trigger.

Figure 5: Distribution of role presence and absence responses. With five responses for each (event, role) combination, the five annotators can either completely agree that the role is absent or present (Abs-Dis0 and Pres-Dis0, respectively), or can disagree, e.g., Abs-Dis1 indicates that the majority voted on absence, but with one dissenting response.

Figure 6: Comparison of frequency (top) and amount of dataset covered (bottom) of event types in various datasets, sorted by decreasing frequency in that dataset. RAMS has a heavier tail than the AIDA Phase 1 data and Beyond NomBank, and broader coverage of events.

Figure 7: Comparison of frequency (top) and amount of dataset covered (bottom) of roles, sorted by decreasing frequency. RAMS has more annotations for a more diverse set of role types than the AIDA Phase 1 data.

Figure 8: Cosine similarity between role embeddings for the 20 most frequent roles with the BERT model. The full heatmap is presented in Appendix B.

Figure 9: Row-normalized confusion matrix for the 20 most frequent roles made by the BERT model on development data. The full matrix is presented in Appendix B.

Table 1: Sizes and coverage of RAMS splits and the source AIDA event ontology.

Table 2: Pairwise span boundary inter-annotator agreement statistics for varying span difference thresholds.

Table 4: Performance breakdown by distance between argument and event trigger for both BERT and ELMo (best model) using constrained decoding over the development data.

Table 6: Results (including F1) on event-based slot filling (GVDB) using BERT as the document encoder (ELMo in Table 9 in Appendix C). Due to the different data splits and evaluation conditions, the results are not directly comparable to the baseline (Pavlick et al., 2016), which is provided only for reference. Fields that were aggregated in the baseline are predicted separately in our model. '-' indicates that the result is not reported in the baseline.