Timeline extraction using distant supervision and joint inference

In timeline extraction, the goal is to order in a timeline all the events in which a target entity is involved. Due to the lack of explicitly annotated data, previous work is primarily rule-based and uses pre-trained temporal linking systems. In this work, we propose a distantly supervised approach that heuristically aligns timelines with documents. The noisy training data thus created allows us to learn models that anchor events to temporal expressions and entities; during testing, the predictions of these models are combined to produce the timeline. Furthermore, we show how to improve performance using joint inference. In experiments on the SemEval-2015 TimeLine task, we show that our distantly supervised approach matches state-of-the-art performance, while joint inference further improves on it by 3.2 F-score points.


Introduction
Temporal information extraction focuses on extracting relations and events along with the time when they held or happened. In this work we focus on timeline extraction, following the recent SemEval TimeLine shared task (Minard et al., 2015). The aim of the task is to extract, from multiple documents, timelines consisting of events in which a given target entity is the main participant. An example timeline for the entity Steve Jobs, extracted from four documents, is given in Fig. 1.
The development data provided by the TimeLine shared task does not contain annotations for the various intermediate processing stages needed, only a set of documents with annotated event mentions (input) and the timelines extracted for a few target entities (output). No training data was provided; thus, participating systems combined rules with temporal linking systems trained on related tasks in order to anchor events to temporal expressions and entities and so construct the timelines.
We propose a new approach to timeline extraction that uses the provided development data as distant supervision to generate noisy training data (Craven and Kumlien, 1999; Mintz et al., 2009). More specifically, we heuristically align the target entity and the timestamps from the timelines with automatically recognized entities and temporal expressions in the documents. This noisily labeled data set allows us to learn models for the subtasks of anchoring events to temporal expressions and to entities, without requiring models trained on additional data. We further improve performance using joint inference over both anchoring subtasks. In our experiments, we show that our distantly supervised approach matches state-of-the-art performance, while joint inference further improves on it by 3.2 F-score points. Our code is publicly available at http://github.com/savac/timeline.

Timeline extraction
The task of timeline extraction given a target entity and a set of documents can be decomposed as follows. The initial stages are event mention extraction, target entity recognition, and temporal expression identification and resolution. The next stages are anchoring event mentions to target entities and temporal expressions. The final stages are event coreference resolution and ordering of the events in a timeline, which rely largely on their anchoring to temporal expressions. The TimeLine shared task had two tracks, A and B, the only difference being that in Track B the event mentions are provided in the input. We consider this track in this paper and focus on learning the anchoring of events to temporal expressions and entities.

Figure 1: Example documents (DocId 16844, DCT 2010-06-08, and DocId 17036, DCT 2010-07-17) with target entity mentions (e.g. Jobs, resolved to Steve Jobs) and temporal expressions (e.g. yesterday, resolved to 2010-07-16) annotated, identified by their document id-sentence index. The annotations for the target entity and temporal expression mentions need to be produced by the system.
The development data provided in the context of the shared task consisted of documents related to Apple and gold timelines for six target entities. Evaluation was performed by extracting timelines from three document sets, related to Airbus, GM, and the stock market, respectively. We used the official evaluation, based on the metric introduced by UzZaman and Allen (2011), which assesses a predicted timeline against the gold standard one using precision, recall and F-score over binary temporal relations between the events.
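To make the evaluation concrete, the following sketch scores a predicted timeline against a gold one over binary temporal relations. It is a simplification of the UzZaman and Allen (2011) metric, not the official scorer: each timeline (assumed sorted by time) is converted into a set of (event, event, relation) triples with only BEFORE and SIMULTANEOUS relations, and precision/recall/F-score are computed over these sets.

```python
from itertools import combinations

def to_relations(timeline):
    """timeline: list of (event, timestamp) pairs sorted by timestamp.
    Returns the set of binary temporal relations it induces."""
    rels = set()
    for (e1, t1), (e2, t2) in combinations(timeline, 2):
        rels.add((e1, e2, "SIMULTANEOUS" if t1 == t2 else "BEFORE"))
    return rels

def relation_f1(pred, gold):
    """Precision/recall/F-score over the induced relation sets."""
    p_rels, g_rels = to_relations(pred), to_relations(gold)
    tp = len(p_rels & g_rels)
    prec = tp / len(p_rels) if p_rels else 0.0
    rec = tp / len(g_rels) if g_rels else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

Scoring over relation pairs rather than absolute positions means that getting the relative order of a single event wrong is penalized once per affected pair, which rewards mostly-correct orderings.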

Distant supervision
In order to generate training data for anchoring event mentions to target entities and temporal expressions via distant supervision, we first need to identify them. For entity recognition we use approximate string matching combined with the Stanford Coreference Resolution System (Lee et al., 2013). For temporal expression identification and resolution to absolute timestamps we use the UWTime temporal parser (Lee et al., 2014).
Next we generate labeled instances as follows. For anchoring events to entities, we consider for each event mention the correct entity mention to be the nearest mention of the target entity in the same sentence, and all others to be incorrect. Similarly, for anchoring events to timestamps, we consider for each event mention the correct temporal expression to be the nearest temporal expression that exactly matches the timestamp according to the timeline (but not necessarily in the same sentence), and all others to be incorrect. The datasets generated will be noisy since correct anchors may be entity mentions and temporal expressions that are not the nearest ones. Further noise is expected due to errors in the entity recognition and temporal expression identification and resolution stages.

Event anchoring
Having generated training data for anchoring event mentions to target entities and to temporal expressions with distant supervision, we now develop linear models for each of these tasks.

Classification
Using distant supervision we obtained examples of correct and incorrect anchorings of event mentions to entities and temporal expressions. Thus, for each of the two tasks, we learn a binary linear classifier of the form: score(x, y, w) = w · φ(x, y) (1) where x is an event mention, y is the anchor (either the target entity or the temporal expression), and w are the parameters to be learned. The features extracted by φ represent various distance measures and syntactic dependencies between the event mention and the anchor, obtained using Stanford CoreNLP (Manning et al., 2014). The temporal expression anchoring model also uses a few feature templates that depend on the timestamp of the temporal expression. The features extracted by φ are denoted as local in Tables 1 and 2.
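A minimal sketch of this local linear scorer follows. The feature templates are illustrative stand-ins for the distance and dependency features listed in Tables 1 and 2, and mentions are again represented as dictionaries with sentence and offset fields.

```python
def phi(event, anchor):
    """Local feature vector for an (event mention, candidate anchor) pair."""
    token_dist = abs(event["offset"] - anchor["offset"])
    return [
        1.0,                                             # bias
        float(event["sentence"] == anchor["sentence"]),  # same-sentence flag
        1.0 / (1.0 + token_dist),                        # inverse token distance
    ]

def score(event, anchor, w):
    """Eq. 1: dot product of the weights with the feature vector."""
    return sum(wi * fi for wi, fi in zip(w, phi(event, anchor)))

def predict_anchor(event, candidates, w):
    """At test time, the binary model keeps the highest-scoring candidate."""
    return max(candidates, key=lambda y: score(event, y, w))
```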

Alignment
The classification approach described is limited to anchoring each event mention to an entity or a temporal expression in isolation. However, it would be preferable to infer the decisions for each task jointly at the document level and take into account the dependencies between the anchorings of different events, e.g. that consecutive events in text are likely to be anchored to the same entity, as shown in Figure 2, or to the same temporal expression. Capturing such dependencies can be crucial when the correct anchor is not explicitly signalled in the text but can be inferred by considering other relations and/or their ordering in text (Derczynski, 2013). To define our joint model formally, let x be a vector containing all event mentions in a document, in the order in which they appear, and let y be the vector of all anchors (target entity mentions or temporal expressions) in the same document. Let z be a vector of the same length as x that defines the alignment between x and y by containing pointers to elements of y, thus allowing multiple events to share the same anchor. The scoring function is defined as score(x, y, z, w) = w · Φ(x, y, z) (2) where the global feature function Φ, in addition to the features returned by the local scoring function (Eq. 1), also returns features that take into account anchoring predictions across the document. Apart from features encoding subsequences of anchoring predictions, Φ can also make these features dependent on the events, e.g. a binary indicator encoding whether two consecutive events with the same stem share the same anchor or not. The full list of local and global features extracted by Φ is presented in Tables 1 and 2. Predicting with the scoring function in Eq. 2 amounts to finding the alignment vector z that maximizes it. To be able to perform exact inference efficiently, we impose a first-order Markov assumption and use the Viterbi algorithm (Viterbi, 1967). Similar approaches have been successful in word alignment for machine translation (Blunsom and Cohn, 2006).

Figure 2: Example document (DocId 17036, DCT 2010-07-17, sentences 9-11) in which consecutive events (said, purchased, receive, said, returned, acknowledged, dismissed) are all anchored to the same target entity mention (Jobs, resolved to Steve Jobs).
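The Viterbi decoding under the first-order Markov assumption can be sketched as follows. Here local(i, j) and pairwise(k, j) are placeholders for the model scores of anchoring event i to anchor j and of the transition between consecutive anchor choices; in the full model both come from w · Φ.

```python
def viterbi_align(n_events, n_anchors, local, pairwise):
    """Return z maximizing sum_i local(i, z[i]) + sum_i pairwise(z[i-1], z[i])."""
    # delta[j]: best score of any prefix whose last event is anchored to j
    delta = [local(0, j) for j in range(n_anchors)]
    back = []
    for i in range(1, n_events):
        new_delta, ptr = [], []
        for j in range(n_anchors):
            best_k = max(range(n_anchors),
                         key=lambda k: delta[k] + pairwise(k, j))
            new_delta.append(delta[best_k] + pairwise(best_k, j) + local(i, j))
            ptr.append(best_k)
        delta = new_delta
        back.append(ptr)
    # backtrace from the best final anchor
    z = [max(range(n_anchors), key=lambda j: delta[j])]
    for ptr in reversed(back):
        z.append(ptr[z[-1]])
    return list(reversed(z))
```

The first-order assumption keeps inference exact in O(n_events × n_anchors²) time; a transition bonus for repeating the previous anchor is what lets a strongly signalled anchor pull in neighbouring events whose local evidence is weak.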

Post-processing
During testing, we need to construct the timeline for each target entity using the events that were predicted to be anchored to it and the timestamps of the temporal expressions each event was anchored to. Thus, we need to perform two additional tasks, event coreference and ordering. For the former we define a simple heuristic: if two mentions have the same stem and timestamp, then they refer to the same event. The only exception is communication events (said, announced, etc.), whose mentions are resolved to different events when they appear in the same document. Finally, we order the events according to their timestamps.
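A sketch of this post-processing step follows. The stem list and field names are illustrative assumptions, not taken from the released code; same-document communication mentions are kept distinct by a per-document occurrence index, so that, e.g., a first "said" in one document can still corefer with a first "said" in another.

```python
COMM_STEMS = {"say", "announce", "report"}  # assumed communication stems

def build_timeline(mentions):
    """Group coreferent mentions (same stem and timestamp) into events and
    order the events by timestamp. Communication mentions in the same
    document are kept as distinct events."""
    events, seen = {}, {}
    for m in mentions:
        key = (m["stem"], m["timestamp"])
        if m["stem"] in COMM_STEMS:
            idx = seen.get((key, m["doc_id"]), 0)
            seen[(key, m["doc_id"])] = idx + 1
            key = key + (idx,)  # nth same-stem mention within its document
        events.setdefault(key, []).append(m)
    return sorted(events.values(), key=lambda ms: ms[0]["timestamp"])
```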

Results
We evaluate our system using the setup provided by the TimeLine task, ensuring that training and validation are performed only on the development data, i.e. the Apple collection. All linear models were trained with the perceptron update rule (Pedregosa et al., 2011). We tuned the number of perceptron iterations by cross-validation on the development data, holding out the timeline for one target entity and training on the timelines for the remaining ones.
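The leave-one-timeline-out tuning loop can be sketched as follows; train_fn and eval_fn are placeholder hooks for training the perceptron for a given number of epochs and scoring the resulting timeline for the held-out entity, and max_epochs is an assumed search bound.

```python
def tune_epochs(entities, train_fn, eval_fn, max_epochs=20):
    """Pick the number of perceptron epochs maximizing the mean held-out
    score over leave-one-entity-out folds of the development data."""
    best_epochs, best_score = 1, float("-inf")
    for n in range(1, max_epochs + 1):
        scores = []
        for held_out in entities:
            train_set = [e for e in entities if e != held_out]
            model = train_fn(train_set, epochs=n)
            scores.append(eval_fn(model, held_out))
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_epochs, best_score = n, mean
    return best_epochs
```

With only six development timelines, leave-one-out folds are the natural choice: each fold still trains on five timelines while every entity gets used for validation exactly once.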
In Table 3 we compare the binary classification model (Our System Binary) against the alignment model (Our System Alignment) and show that the latter outperforms the former by a margin of 3.2 points in F1-score, achieving a micro F1-score of 28.58 across the three test corpora, thus confirming the benefits of joint inference. The only corpus on which joint inference did not help was Stock, which has on average shorter event chains per document (Minard et al., 2015) and thus renders joint anchoring less likely to be useful.
We now compare our approach to the two participants in the TimeLine shared task, with two runs each. The best-performing GPLSIUA team (Navarro and Saquete, 2015) used the TIPSem tool developed by Llorens et al. (2010) for temporal relation processing, which extracts events and temporal expressions and uses a Conditional Random Field model to anchor them to each other. However, TIPSem only considers anchoring events to temporal expressions in the same sentence. GPLSIUA also used the semantic role labeler from SENNA (Collobert et al., 2011) and OpeNER, and anchored entities to events using a rule-based approach. The HeidelToul team (Moulahi et al., 2015) used HeidelTime (Strötgen et al., 2013) to identify and resolve temporal expressions and developed a target entity mention identification tool similar to ours using Stanford CoreNLP (Manning et al., 2014). However, they rely on a rule-based approach for event anchoring. Our binary model matches the performance of the best system, and our alignment model exceeds it by 3.2 F1-score points, even though we do not use any off-the-shelf components developed for temporal relation extraction. Instead we rely on training data generated with distant supervision, and on UWTime for temporal expression identification and resolution, for which the participants also used similar components.

Related work
In recent work, Laparra et al. (2015) also considered anchoring at the document level in the context of Track A of the TimeLine shared task; however, they developed a rule-based approach. The structure features used in our joint inference approach encode similar intuitions, but we learn model weights using distant supervision so that we can combine them more flexibly. And even though the noise in the training data generated with distant supervision is a concern, manual annotation of temporal relations is known to have low inter-annotator agreement rates and is thus also likely to be noisy. Prior to the TimeLine shared task, TempEval (Verhagen et al., 2007) was the original task that focused on categorising the relations between events, temporal expressions and the Document Creation Time using the TimeML annotation language. The task classified only the relations between mentions in the same or consecutive sentences. The two following tasks, TempEval-2 (Verhagen et al., 2010) and TempEval-3 (UzZaman et al., 2013), added tasks for event and temporal expression identification, as well as an end-to-end temporal relation processing task performed on raw text.
Beyond TempEval, McClosky and Manning (2012) used distant supervision in order to learn how to extract the temporal bounds for events in the context of the TAC temporal knowledge base population task (Ji et al., 2011). However they focus on learning real-world event ordering constraints (e.g. people go to school before university) instead of how events are reported in text.

Conclusions
In this paper we proposed a timeline extraction approach in which we generate noisy training data for anchoring events to entities and temporal expressions using distant supervision. By learning a binary classifier we match the state-of-the-art F1-score on Track B of the TimeLine shared task. We further improve this result by 3.2 F1-score points using joint inference.