On the Creation of a Security-Related Event Corpus

This paper reports on an effort to create a corpus of structured information on security-related events automatically extracted from online news, part of which has been manually curated. The main motivation behind this effort is to provide material to the NLP community working on event extraction that can be used for both training and evaluation purposes.


Introduction
Due to the rapid proliferation of textual information in digital form, various security-related organisations have recently acknowledged the benefits of deploying techniques to automate the extraction of structured information on events from free text (Appelt, 1999; Ashish et al., 2006; Ji et al., 2009; Piskorski and Yangarber, 2013). Examples of the current capabilities of such techniques for the extraction of disease outbreaks, crisis situations, cross-border crimes and computer security events from online sources are given in (Grishman et al., 2002; King and Lowe, 2003; Yangarber et al., 2008; Gao et al., 2013; Danilova and Popova, 2014; Ritter et al., 2015). This paper reports on the creation of a corpus of structured information on security-related events automatically extracted from online news over a period of 8 years, part of which has been manually curated. The main drive behind this endeavour is to provide material to the NLP community working on event extraction, which could be used in various ways, e.g., for: (a) carrying out evaluations of detection and extraction of security-related events from online news (human-curated data), (b) training event type classifiers, (c) learning domain-specific terminology, and (d) creating full-fledged inline or stand-off annotations with event-centric information based on the automatically extracted event templates.
Other efforts on the creation of corpora with event-related annotations of various kinds include: GDELT, FactBank (Saurí and Pustejovsky, 2009), ICEWS (Ward et al., 2013), EventCorefBank (Cybulska and Vossen, 2014), ASTRE (Nguyen et al., 2016) and (Hong et al., 2016). Contrary to most other initiatives, our corpus contains aggregated information on events extracted at news cluster level, without links to the concrete phrases in news articles from which the information was inferred.
Section 2 briefly presents our news event extraction system. Section 3 reports on an evaluation thereof to provide insights on the quality of extraction. Section 4 provides some corpus statistics.

Event Extraction System Description
Our event extraction system has been primarily designed to help analysts from international institutions automate the process of gathering intelligence on security-related events from online news. It is capable of extracting information on different types of crises, such as political violence, social turmoil, and natural and man-made disasters. We briefly describe the core elements of the event extraction system; a more detailed description can be found in (Tanev et al., 2009). The event extraction system runs on top of Europe Media Monitor (EMM), a large-scale news aggregation engine that gathers articles from ca. 7000 sources in 60 languages on a 24/7 basis (Atkinson and Van Der Goot, 2009). The system takes as input a set of news articles on the same topic, called a news cluster; such clusters are produced every 10 minutes by the news aggregation engine. The output of the event extraction is an event description template corresponding to the main event reported in the cluster. It includes two main slots, Event type and Geolocation, and other event-type-specific descriptive and numerical slots, e.g., Perpetrators, Dead victims, Number injured, Displaced, Targeted infrastructure.
In the first step, each article in the cluster is linguistically pre-processed to produce a more abstract representation of the text, including, i.a., tokenization, sentence splitting, NER, and labeling of key terms like action words (e.g. kill, shoot).
Our event extraction system is applied only to the title and lead sentences of each article, assuming that articles are written using the inverted pyramid style, the dominant paradigm in modern journalism (Pöttker, 2003). Although a relevant event might be reported in the final paragraphs of an article, our system has not been designed to capture such reports.
Next, lexico-semantic patterns for the extraction of one or two slots in the event template are applied to parse more complex phrases, which express different actions and situations whose results are death, injury or other effects on people, e.g., five people were injured, the boss of Cosa Nostra was found dead. These patterns (several hundred) were semi-automatically acquired using a bootstrapping approach (Riloff, 1996; Yangarber et al., 2000), described in more detail in (Tanev et al., 2009).
Since information about events is scattered over different articles, in the next step cross-document information validation and fusion heuristics are deployed, e.g., majority voting-like heuristics. To give a more precise example, in the context of extracting descriptive slots, among the phrases that appear as a filler of a given slot in the event templates extracted from all articles in the cluster, the most frequent one is selected.
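The frequency-based selection of a slot filler described above can be illustrated with a minimal sketch. The function name `fuse_slot`, the input layout (one candidate filler per article, `None` when nothing was extracted) and the tie-breaking rule are assumptions made for illustration; they are not taken from the system itself.

```python
from collections import Counter


def fuse_slot(fillers):
    """Pick the most frequent non-empty filler among per-article candidates.

    `fillers` holds one candidate per article (hypothetical layout);
    None or "" means the article produced no filler for this slot.
    """
    counts = Counter(f for f in fillers if f)
    if not counts:
        return None
    # most_common() sorts by count; equal counts keep first-seen order,
    # so ties are resolved in favour of the earliest article (an assumption).
    return counts.most_common(1)[0][0]


# Five articles in one cluster, each proposing a Perpetrators filler.
per_article_perpetrators = ["gunmen", "militants", "gunmen", None, "gunmen"]
print(fuse_slot(per_article_perpetrators))
```

A real deployment would probably normalize the phrases before counting, so that surface variants of the same description vote together.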
Event classification is done using: (i) detection of keyword combinations, e.g., if a word from the list soldiers, troops, tanks, marines, etc. occurs in the vicinity of a word from the list attacked, destroyed, raided, etc., then the Armed Conflict type is inferred, (ii) type-preference heuristics, e.g., if the text talks about violence, but arrested people were simultaneously detected using some pattern, then Arrest is preferred to Violence, and (iii) an SVM-based word n-gram text classifier, which is applied when the rule-based classification yields no result.
Our event types, e.g., Armed Conflict, Terrorist Attack, Protest, Earthquake, etc., were chosen among those that have the strongest impact on the security of society.
Finally, a keyword-based filter, semi-automatically created using bootstrapping lexical learning (Tanev and Zavarella, 2014), is deployed to eliminate events that are only vaguely related to some past security-related events, e.g., commemorations related to past natural disasters, political meetings with the purpose of resolving violence-related issues, fake threats of terrorist attacks.
Our event extraction system relies on lightweight linguistic processing, in contrast to state-of-the-art systems that use more linguistic sophistication (Kilicoglu and Bergler, 2009; Chen et al., 2015), due to: (a) the specifics of the environment our system is used in, where the key feature is scalability, i.e., one has to be able to quickly extend the system to detect new event types and process news in many languages, and (b) the paramount importance of providing the analysts with some sort of event-centric navigation structure to guide further reading and analysis, in which context the high-quality extraction of certain slots and the extraction of very fine-grained information (e.g., guessing the most specific event location versus guessing the administrative region in which an event happened) is of lower importance.
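The three-stage classification cascade can be sketched as follows. This is only an illustration under stated assumptions: the token window `WINDOW`, the `detected` argument (types already suggested by extraction patterns) and the `fallback` callable standing in for the statistical classifier are all hypothetical, and the word lists contain only the examples quoted in the text.

```python
ACTORS = {"soldiers", "troops", "tanks", "marines"}
ACTIONS = {"attacked", "destroyed", "raided"}
WINDOW = 5  # hypothetical size of the "vicinity", in tokens
PREFERENCES = [("Arrest", "Violence")]  # (preferred type, dominated type)


def keyword_rule(tokens):
    """Infer Armed Conflict when an actor word occurs near an action word."""
    actor_pos = [i for i, t in enumerate(tokens) if t in ACTORS]
    action_pos = [i for i, t in enumerate(tokens) if t in ACTIONS]
    if any(abs(a - b) <= WINDOW for a in actor_pos for b in action_pos):
        return "Armed Conflict"
    return None


def classify(text, detected=(), fallback=None):
    """Rule-based classification with an optional statistical fallback."""
    tokens = [t.strip(".,;:!?") for t in text.lower().split()]
    candidates = list(detected)
    rule_type = keyword_rule(tokens)
    if rule_type:
        candidates.append(rule_type)
    # Type-preference heuristics: drop a dominated type if its preferred
    # counterpart is also among the candidates.
    for preferred, dominated in PREFERENCES:
        if preferred in candidates:
            candidates = [c for c in candidates if c != dominated]
    if candidates:
        return candidates[0]
    # The rules yielded nothing: defer to the fallback classifier, if any.
    return fallback(text) if fallback else None


print(classify("Troops raided a village near the border"))
print(classify("Police detained ten after clashes", detected=["Violence", "Arrest"]))
```

The ordering of the stages mirrors the description: the fallback is consulted only when neither keyword rules nor pattern-derived types produce a candidate.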

Test Data Set
To evaluate the quality of the automatically extracted information, we first selected 15 event types from the full list of 62 types that the system is designed to detect, reported in the Annex (see Sec. 4.2). The chosen types are representative of 5 broader event categories: (a) Natural disasters: Wildfire, Flood, Earthquake, (b) Man-made disasters: Maritime Accident, Explosion, Ordinary Man-Made Disaster, (c) Violence: Kidnapping/Hostage Taking, Shooting, Terrorist Attack, (d) Military-related: Heavy Weapons Fire, Armed Conflict, Air/Missile Attack, and (e) Socio-political: Riot/Turmoil, Boycott/Strike, Public Demonstration. Then, we randomly collected 16 news clusters that the system had tagged with each target event type, from system data between 1/05/2016 and 31/12/2016, for a total of 240 clusters.
For each event news cluster, the annotators were given the title and first two sentences of each of the 15 (max) latest articles, including duplicates. The rationale of this setting is that we intended to 'simulate' the limited amount of data an analyst is usually able to process in order to pick up the main facts of the event reported in an article set.
For each news cluster the annotators were then tasked to provide: (a) a ranked list of up to three event types, where higher rank is given to the more specific applicable event types (e.g., riot vs. disorder) and to the main event reported in the cluster vis-a-vis background events mentioned in the cluster, (b) a non-ranked list of locations, each represented by an ID, where in the case of two or more locations being in an 'administrative' inclusion relation only the most specific one is retained, (c) for each event role descriptor slot, a non-ranked list of all names and descriptions found in the text, and (d) for each event role amount slot, a single integer or a span of integers reflecting the minimum and maximum values reported.
The gold standard was annotated by 4 annotators. We analyzed inter-annotator agreement (IAA) for the event type classification task on a sample of 120 clusters, considering only the first type in the ranked lists, and obtained a Fleiss' Kappa score of 0.7 (Fleiss, 1971).
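The rule in (b), which keeps only the most specific location when several are related by administrative inclusion, can be sketched as below. The text states that locations are represented by IDs; representing each one as a root-to-leaf administrative path (a tuple of names) is an assumption made here purely to keep the example self-contained.

```python
def keep_most_specific(paths):
    """Drop every location that is an administrative ancestor of another one.

    Each location is a tuple such as ("Italy", "Sicily", "Palermo")
    (hypothetical representation); a location is an ancestor of another
    when its path is a proper prefix of the other path.
    """
    unique = set(paths)
    return sorted(
        p for p in unique
        if not any(q != p and q[:len(p)] == p for q in unique)
    )


annotated = [
    ("Italy",),
    ("Italy", "Sicily"),
    ("Italy", "Sicily", "Palermo"),
    ("France", "Paris"),
]
print(keep_most_specific(annotated))
```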
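For readers unfamiliar with the agreement statistic, the following is a minimal sketch of Fleiss' Kappa for a fixed number of raters per item. The toy ratings are invented for illustration and are unrelated to the 120-cluster sample; the function name and input layout are likewise assumptions.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' Kappa for items each labelled by the same number of raters.

    `ratings` is a list of items; each item is the list of category labels
    assigned by the raters (hypothetical layout).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    counts = [Counter(item) for item in ratings]
    categories = set().union(*counts)
    # Observed agreement: mean over items of the share of agreeing rater pairs.
    p_items = [
        (sum(n * n for n in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    p_cat = [sum(cnt[c] for cnt in counts) / (n_items * n_raters) for c in categories]
    p_e = sum(p * p for p in p_cat)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)


# Toy example with 4 raters: three items with full agreement, one split item.
toy = [["Riot"] * 4, ["Flood"] * 4, ["Riot"] * 4, ["Riot", "Riot", "Flood", "Flood"]]
print(round(fleiss_kappa(toy), 3))
```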

Evaluation metrics
For the purpose of evaluating the performance of event extraction methods, the research community has predominantly been using mention-based metrics and standards such as ACE (Doddington et al., 2004), where, e.g., the scores for extracted slot fillers are summed up over their mentions in text. However, motivated by the specific environment in which our event extraction system is used, we propose partly novel evaluation metrics that try to quantify, from a user perspective, the most relevant semantic dimensions of event information aggregated from multi-document sets. As an example, evaluating geo-coding as the task of locating events both in a geographical reference system and in an administrative unit hierarchy (rather than as a standard entity recognition task (Mandl et al., 2009)) allows us to estimate its usefulness for spatial analysis of aggregated event data. For an analyst responsible for studying events that happened in a particular administrative region (e.g., country, state), an incorrect extraction of the place that still falls within the boundaries of the region assigned to them does provide some value, which should be rewarded with a non-zero score.
We first introduce the metrics for the evaluation of event type and location extraction. Let $C = \{c_1, \ldots, c_n\}$ denote the set of input clusters of articles. Let also $t_c$ ($l_c$) denote the event type (location) for cluster $c$ returned by the system. Further, let $T^G_c$ denote an ordered list of event types in the gold truth for cluster $c$, and let $L^G_c$ denote an unordered list of event locations $l^G_c$ for $c$ in the gold truth.
For the evaluation of event type classification we use an adapted version of the Mean Reciprocal Rank (MRR) (Craswell, 2009), defined as follows:
$$MRR = \frac{1}{|C|} \sum_{c \in C} score(t_c)$$
where $score(t_c) = 1/rank(t_c)$, with $rank(t_c)$ denoting the rank of $t_c$ in $T^G_c$, or $score(t_c) = 0$ if $t_c \notin T^G_c$. In our context, the reciprocal rank of a system response for cluster $c$ is the multiplicative inverse of its rank in the gold truth.
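The adapted MRR can be illustrated with a short sketch in which the single system type per cluster is looked up in the ranked gold list. The dictionary layout and the cluster identifiers are hypothetical; the event type names are taken from the evaluation set.

```python
def reciprocal_rank(system_type, gold_ranked):
    """1/rank of the system type in the ranked gold list, 0 if it is absent."""
    try:
        return 1.0 / (gold_ranked.index(system_type) + 1)
    except ValueError:
        return 0.0


def mean_reciprocal_rank(system, gold):
    """Average the per-cluster scores over all clusters in the gold truth."""
    return sum(reciprocal_rank(system[c], gold[c]) for c in gold) / len(gold)


system = {"c1": "Riot/Turmoil", "c2": "Shooting", "c3": "Flood"}
gold = {
    "c1": ["Riot/Turmoil", "Public Demonstration"],  # rank 1 -> score 1
    "c2": ["Terrorist Attack", "Shooting"],          # rank 2 -> score 0.5
    "c3": ["Earthquake"],                            # not in gold -> score 0
}
print(mean_reciprocal_rank(system, gold))  # 0.5
```

Note the inversion with respect to the usual retrieval setting: the ranking lives in the gold truth, and the system contributes a single answer per cluster.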
For the evaluation of the event location extraction we define two basic metrics, Geographical Closeness (GC) and Administrative Closeness (AC), both of which are maximized over the gold truth locations. The GC metric is defined in terms of $dist(a, b)$, the physical distance (in km) between $a$ and $b$, which is computed using the GeoNames gazetteer. The AC metric is a modification of WUP, the semantic metric presented in (Wu and Palmer, 1994); its main aim is to reflect how close the system location response is to the corresponding gold truth location in the administrative hierarchy of geographical references. Let $T_{GEO}$ denote the administrative hierarchy in the GeoNames gazetteer and let $LCS(x, y)$ denote the lowest common subsumer of nodes $x$ and $y$ in $T_{GEO}$. AC is then defined in terms of a weighted depth of a node $v$ in $T_{GEO}$, built from terms of the form $\delta/2^i$, with $\delta$ empirically set to 10. The main intuition behind AC is to apply a higher penalty to system errors that are: (a) closer to the root of $T_{GEO}$ (e.g., guessing the wrong country is worse than guessing the wrong city within a province), and (b) the result of providing over-specific, false information, as opposed to system responses that are less specific but still encompass the gold truth location (e.g., guessing only the region of a gold truth town).
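As background for AC, the sketch below computes the original, unweighted Wu-Palmer measure on an administrative hierarchy and maximizes it over the gold locations. It is not the AC metric itself: the weighted depth is deliberately left out, node depth is simply the path length, and the path representation with a shared "Earth" root is an assumption made for illustration.

```python
def common_prefix_depth(a, b):
    """Depth of the lowest common subsumer of two root-to-node paths."""
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return depth


def wup(a, b):
    """Plain Wu-Palmer similarity: 2 * depth(LCS) / (depth(a) + depth(b)).

    Paths must include the shared root, so they are never empty.
    """
    return 2.0 * common_prefix_depth(a, b) / (len(a) + len(b))


def best_wup(system_path, gold_paths):
    """Maximize the similarity over all gold truth locations."""
    return max(wup(system_path, g) for g in gold_paths)


gold = [("Earth", "Italy", "Sicily", "Palermo")]
region_only = ("Earth", "Italy", "Sicily")                 # less specific, still encompassing
wrong_country = ("Earth", "France", "Ile-de-France", "Paris")
print(best_wup(region_only, gold), best_wup(wrong_country, gold))
```

Even in this unweighted form, a response naming only the enclosing region scores far higher than one naming a town in the wrong country, which is the behaviour the weighted variant is meant to sharpen.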
We also compute Location Accuracy (LC) as a weighted harmonic mean of GC and AC with weight $\beta$, maximized over the gold truth locations; $\beta$ was set to 1 in the evaluation.
For event slot descriptors we first distinguish two cases: definite description phrases are normalized and possibly merged to the morphological base forms of their noun/adjective components (e.g., 'three Iraqi militants' and 'Iraqi militants' are merged into 'Iraqi militant'), while all upper-case phrases (supposedly person names) are kept as such. In the former case, if $descr_c$ is a system output descriptor for a certain event role in cluster $c$ and $descr^G_c$ is a gold standard descriptor for the same role, the match between $descr_c$ and $descr^G_c$ is computed from $descr^N_c$ and $descr^{GN}_c$, the sets of all N-grams of $descr_c$ and $descr^G_c$, respectively, using $WUP(m, n)$, a WordNet-based semantic relatedness measure (Wu and Palmer, 1994). In the latter case, matches are computed using $StringSim(m, n)$, a modification of the Jaro metric that boosts agreeing initial characters (Winkler, 1999).
In both cases, in order to penalize cases of role filler inversion, we score a match of a system output role descriptor as 0 if it is lower than the maximum similarity with any of the other event role fillers in the gold standard. Given the scores above, standard Precision, Recall and F1 measures are computed.
Finally, we record the Root Mean Squared Error (RMSE) of system output victim count values against the gold standard, over all applicable roles.
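Returning to the LC metric introduced above, the weighted harmonic mean can be sketched as follows. The F-measure-style parameterisation, and in particular which of the two scores the weight applies to, is an assumption; with $\beta = 1$, the value used in the evaluation, the expression is symmetric and reduces to 2·GC·AC/(GC+AC). The input values are invented for illustration.

```python
def weighted_harmonic_mean(gc, ac, beta=1.0):
    """F-measure-style weighted harmonic mean of two scores in [0, 1].

    Assumed form: (1 + beta^2) * gc * ac / (beta^2 * gc + ac).
    """
    b2 = beta * beta
    denom = b2 * gc + ac
    if denom == 0.0:
        return 0.0  # both scores are zero (or the weighted one vanishes)
    return (1.0 + b2) * gc * ac / denom


def location_accuracy(scores_per_gold_location, beta=1.0):
    """Maximize the combined score over the gold truth locations.

    `scores_per_gold_location` is a list of (GC, AC) pairs, one pair per
    gold location of the cluster (hypothetical layout).
    """
    return max(weighted_harmonic_mean(gc, ac, beta) for gc, ac in scores_per_gold_location)


# Two gold locations for one cluster; the second one matches better.
print(location_accuracy([(0.1, 0.3), (0.5, 1.0)]))
```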
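The error measure for victim counts can be sketched as a root of the mean squared difference between system and gold values. Treating every gold value as a single integer is an assumption: the text does not say how gold spans (minimum and maximum) enter the computation, so they are not handled here. The counts are invented for illustration.

```python
import math


def rmse(system_counts, gold_counts):
    """Root of the mean squared difference between paired count values."""
    if not system_counts or len(system_counts) != len(gold_counts):
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(s - g) ** 2 for s, g in zip(system_counts, gold_counts)]
    return math.sqrt(sum(squared) / len(squared))


# Toy victim counts for three clusters (system output vs. gold standard).
print(rmse([3, 10, 0], [3, 6, 3]))
```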

Results
The evaluation results for the extraction of the event type and location are provided in Table 1. The overall results are good vis-a-vis the state-of-the-art results reported elsewhere. A rudimentary error analysis of event type extraction revealed that the somewhat worse results for the Violence, Socio-political and Military categories were caused by the semantic 'proximity' of the event types contained in each of these categories. In particular, owing to the low extraction performance for Explosion events, they were not included in the event corpus. The overall 0.4 score for GC corresponds to an average geographical error of around 9.2 km from the gold standard location point, while the 0.49 for AC translates to a level of granularity between country and region levels.
The evaluation results for the extraction of the 'descriptor' and numerical slots are provided in Table 2; the mF and MF columns for each role description task represent the micro- and macro-averaged F1-measure, respectively. Extraction of numerical slots is quite accurate, except for the Dead role, as death counts are more likely to occur as cumulative figures in highly deadly events such as military conflicts; the system often fails to separate them from real-time victim count updates.
The current version of the corpus contains two sets: (a) moderated events (MOD) resulting from manual curation of the automated extractions in 6 languages by one trained human expert responsible for providing 'cleaned' data to the end-users, and (b) automatically extracted events (AUTO) from English news. The quantitative data on the MOD set, which contains 17536 event templates, is given in Table 3. The MOD set was created during the period from 1/02/2009 to 18/08/2015. The breakdown of the events with respect to the languages covered is as follows: English (45.3%), Spanish (16.3%), Italian (12.0%), French (11.2%), Portuguese (7.7%) and Russian (7.5%). The AUTO set contains ca. 600K events extracted from online news in English for the period from 1/1/2009 to 1/4/2017. We have selected ca. 330K of the most 'reliable' event templates therefrom, i.e., those of event types whose extraction appears to be more accurate than that of other event types. The preliminary quantitative data on the corpus is given in Table 4.
Table 4: Quantitative data on the AUTO set (numbers of events are provided in thousands).

Format and Access
The current version of the corpus, accompanied by additional information (including, i.a., the list of event types and corresponding slots, instructions on how to access the underlying news stories from which the events were extracted, etc.), can be accessed at: http://labs.emm4u.eu/events.html
The corpus is available in two formats: (a) comma-separated values (CSV) and (b) JSON. The former contains only the following (reduced) data: unique event id; type of the event; event type category; the date when it was detected; the title of the centroid article in the cluster; and the identified place name (including latitude/longitude and the computed administrative path, where the first element provides the most fine-grained information). The unique event id can be used to publicly access the articles in the cluster from which the event was extracted. The JSON format contains the full template structure, including the descriptive slots: who was killed/injured; the perpetrators; the weapons used; any other descriptors that were identified for that particular event type.
It is envisaged to further extend the corpus through the provision of: (a) annotated data for new languages, (b) a new attribute reflecting extraction reliability, (c) cross-language event links (Ji, 2010), and (d) additional access methods (e.g., KML).