TweeTime : A Minimally Supervised Method for Recognizing and Normalizing Time Expressions in Twitter

We describe TweeTIME, a temporal tagger for recognizing and normalizing time expressions in Twitter. Most previous work in social media analysis relies on temporal resolvers that are designed for well-edited text and therefore suffers reduced performance due to domain mismatch. We present a minimally supervised method that learns from large quantities of unlabeled data and requires no hand-engineered rules or hand-annotated training corpora. TweeTIME achieves a 0.68 F1 score on the end-to-end task of resolving date expressions, outperforming a broad range of state-of-the-art systems.


Introduction
Temporal expressions are words or phrases that refer to dates, times or durations. Resolving time expressions is an important task in information extraction (IE) that enables downstream applications such as calendars or timelines of events (Derczynski and Gaizauskas, 2013; Do et al., 2012; Ritter et al., 2012; Ling and Weld, 2010), knowledge base population (Ji et al., 2011), information retrieval (Alonso et al., 2007), automatically scheduling meetings from email, and more. Previous work in this area has applied rule-based systems (Mani and Wilson, 2000; Bethard, 2013b; Chambers, 2013) or supervised machine learning on small collections of hand-annotated news documents (Angeli et al., 2012; Lee et al., 2014). Our code and data are publicly available at https://github.com/jeniyat/TweeTime. Social media in particular contains time-sensitive information and requires accurate temporal analysis, for example for detecting real-time cybersecurity events (Ritter et al., 2015; Chang et al., 2016) and disease outbreaks (Kanhabua et al., 2012), and for extracting personal information (Schwartz et al., 2015). However, most work on social media simply uses generic temporal resolvers and therefore suffers from suboptimal performance. Recent work on temporal resolution focuses primarily on news articles and clinical texts (UzZaman et al., 2013; Bethard and Savova, 2016).
Resolving time expressions in social media is a non-trivial problem. Besides exhibiting many spelling variations, time expressions in tweets are more likely to refer to future dates than those in newswire. For the example in Figure 1, we need to recognize that Monday refers to the upcoming Monday, not the previous one, to resolve it to its correct normalized date (5/9/2016). We also need to identify that the word Sun does not refer to a Sunday in this context.
In this paper, we present a new minimally supervised approach to temporal resolution that requires no in-domain annotation or hand-crafted rules, instead learning from large quantities of unlabeled text in conjunction with a database of known events. Our approach is capable of learning robust time expression models adapted to the informal style of text found on social media.
For popular events, some related tweets (e.g., Figure 2) may contain explicit or other simple time mentions that can be captured by a generic temporal tagger. An open-domain information extraction system (Ritter et al., 2012) can then identify events (e.g., [Mercury, 5/9/2016]) by aggregating those tweets. To automatically generate temporally annotated data for training, we make the following novel distant supervision assumption: tweets posted near the time of a known event that mention its central entities are likely to contain time expressions that refer to the date of the event.
Based on this assumption, tweets that contain the same named entity (e.g., the tweet in Figure 1) are heuristically labeled as training data. Each tweet is associated with multiple overlapping labels that indicate the day of the week, day of the month, whether the event is in the past or future, and other time properties of the event date in relation to the tweet's creation date. In order to learn a tagger that recognizes temporal expressions at the word level, we present a multiple-instance learning approach that models sentence-level and word-level tags jointly and handles overlapping labels. Using the heuristically labeled data and the temporal tags predicted by the multiple-instance learning model as input, we then train a log-linear model that normalizes time expressions to calendar dates.
Building on top of the multiple-instance learning model, we further improve performance using a missing data model that addresses errors introduced during the heuristic labeling process. Our best model achieves a 0.68 F1 score when resolving date mentions in Twitter. This is a 17% increase over SUTime (Chang and Manning, 2012), and our model outperforms the other state-of-the-art time expression resolvers HeidelTime (Strötgen and Gertz, 2013), TempEx (Mani and Wilson, 2000) and UWTime (Lee et al., 2014) as well. Our approach also produces a confidence score that allows us to trade recall for precision. To the best of our knowledge, TweeTIME is the first time resolver designed specifically for social media data. This is also the first time that distant supervision has been successfully applied to end-to-end temporal recognition and normalization. Previous distant supervision approaches (Angeli et al., 2012; Angeli and Uszkoreit, 2013) only address the normalization problem, assuming gold time mentions are available at test time.

System Overview
Our TweeTIME system consists of two major components, as shown in Figure 3. To train these two models without corpora manually annotated with time expressions, we leverage a large database of known events as distant supervision. The event database is extracted automatically from Twitter using the open-domain IE system proposed by Ritter et al. (2012). Each event consists of one or more named entities, in addition to the date on which the event takes place, for example [Mercury, 5/9/2016]. Tweets are first processed by a Twitter named entity recognizer (Ritter et al., 2011) and a generic date resolver (Mani and Wilson, 2000). Events are then extracted based on the strength of association between each named entity and calendar date, as measured by a G² test on their co-occurrence counts. More details of the Event Extractor can be found in Section 5.1.
The following two sections describe the details of our Temporal Recognizer and Temporal Normalizer separately.

Distant Supervision for Recognizing Time Expressions
The goal of the recognizer is to predict the temporal tag of each word, given a sentence (or a tweet) w = w_1, ..., w_n. We propose a multiple-instance learning model and a missing data model that are capable of learning word-level taggers given only sentence-level labels.
Our recognizer module is built using a database of known events as distant supervision. We assume tweets published around the time of a known event that mention a central entity are also likely to contain time expressions referring to the event's date. For each event, such as [Mercury, 5/9/2016], we gather all tweets that contain the central entity Mercury and are posted within 7 days of 5/9/2016. We then label each tweet based on the event date in combination with the tweet's creation date. The sentence-level temporal tags for the tweet in Figure 1 are: TL=future, DOW=Mon, DOM=9, MOY=May.
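The labeling heuristic above can be sketched as a small function mapping a (tweet date, event date) pair to sentence-level tags. This is an illustrative sketch, not the authors' code; the tag-string format and the function name are assumptions.

```python
from datetime import date

DOW_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
MOY_NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
             "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def sentence_level_tags(tweet_date, event_date):
    """Derive sentence-level temporal tags for a tweet, following the
    distant-supervision heuristic described above."""
    tags = set()
    # TL: is the event date in the tweet's past or future?
    if event_date > tweet_date:
        tags.add("TL=future")
    elif event_date < tweet_date:
        tags.add("TL=past")
    else:
        tags.add("TL=present")
    # Day of week, day of month and month of year of the *event* date.
    tags.add("DOW=" + DOW_NAMES[event_date.weekday()])  # e.g. DOW=Mon
    tags.add("DOM=" + str(event_date.day))              # e.g. DOM=9
    tags.add("MOY=" + MOY_NAMES[event_date.month - 1])  # e.g. MOY=May

    return tags

# The tweet in Figure 1: posted Friday 5/6/2016 about an event on 5/9/2016.
print(sorted(sentence_level_tags(date(2016, 5, 6), date(2016, 5, 9))))
```

For the Figure 1 example this yields exactly the tags listed above: TL=future, DOW=Mon, DOM=9 and MOY=May.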

Multiple-Instance Learning Temporal Tagging Model (MultiT)

Unlike supervised learning, where labeled instances are provided to the learner, in multiple-instance learning scenarios (Dietterich et al., 1997) the learner is only provided with bags of instances, each labeled either positive (at least one instance in the bag is positive) or negative (all instances are negative). This is a close match to our problem setting, in which sentences are labeled with tags that should be assigned to one or more words.
We represent sentences and their labels using a graphical model that is divided into word-level and sentence-level variables (as shown in Figure 4). Unlike the standard supervised tagging problem, we never directly observe the words' tags z = z_1, ..., z_n during learning. Instead, they are latent, and we only observe the date of an event mentioned in the text, from which we derive sentence-level binary variables t = t_1, ..., t_k corresponding to temporal tags for the sentence. Following previous work on multiple-instance learning (Hoffmann et al., 2011a; Xu et al., 2014), we model the connection between sentence-level labels and word-level tags using a set of deterministic-OR factors φ_sent. The overall conditional probability of our model is defined as:

p(t, z | w; θ_r) ∝ ∏_{j=1..n} exp(θ_r · f(z_j, w_j)) × ∏_{i=1..k} φ_sent(t_i, z)   (1)

where f(z_j, w_j) is a feature vector and

φ_sent(t_i, z) = 1 if t_i = 1 ⟺ ∃j : z_j = i, and 0 otherwise.   (2)

We include a standard set of tagging features that includes word shape and identity in addition to prefixes and suffixes. To learn the parameters θ_r of the Temporal Tagger, we maximize the likelihood of the sentence-level heuristic labels conditioned on the observed words over all tweets in the training corpus. Given a training instance w with label t, the gradient of the conditional log-likelihood with respect to the parameters is:

∂/∂θ_r log p(t | w; θ_r) = E_{p(z | t, w)}[Σ_j f(z_j, w_j)] − E_{p(z | w)}[Σ_j f(z_j, w_j)]

This gradient is the difference of two conditional expectations over the feature vector f: a "clamped" expectation that is conditioned on the observed words and tags (w, t), and a "free" expectation that is conditioned only on the words w and ignores the sentence-level labels. To make inference tractable, we use a Viterbi approximation that replaces the expectations with maximizations. Because each sentence corresponds to more than one temporal tag, the "clamped" maximization is somewhat challenging to compute.
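One Viterbi-approximated update can be sketched as follows. This is a simplified illustration, not the authors' implementation: the "clamped" decode is a greedy approximation that forces each observed sentence-level tag onto its highest-scoring word, the update is perceptron-style (clamped minus free feature counts), and the feature templates are invented for the example.

```python
from collections import defaultdict

def word_features(word):
    # Simplified feature templates: word identity, suffix, word shape.
    return ["word=" + word, "suffix=" + word[-3:],
            "shape=" + ("cap" if word[0].isupper() else "low")]

def score(theta, word, tag):
    return sum(theta[(feat, tag)] for feat in word_features(word))

def free_decode(theta, words, tags):
    # "Free" maximization: per-word argmax, ignoring sentence labels.
    return [max(tags, key=lambda t: score(theta, w, t)) for w in words]

def clamped_decode(theta, words, sent_tags, tags):
    # "Clamped" maximization (greedy): start from the free solution, then
    # force each observed sentence-level tag onto its best-scoring word so
    # the deterministic-OR factors are satisfied.
    z = free_decode(theta, words, tags)
    for t in sent_tags:
        if t not in z:
            j = max(range(len(words)), key=lambda i: score(theta, words[i], t))
            z[j] = t
    return z

def update(theta, words, sent_tags, tags, lr=1.0):
    # Approximate gradient step: clamped feature counts minus free counts.
    clamped = clamped_decode(theta, words, sent_tags, tags)
    free = free_decode(theta, words, tags)
    for w, zc, zf in zip(words, clamped, free):
        for feat in word_features(w):
            theta[(feat, zc)] += lr
            theta[(feat, zf)] -= lr

theta = defaultdict(float)
update(theta, ["See", "you", "Monday"], ["DOW=Mon"], ["O", "DOW=Mon", "TL=future"])
```

With all weights initialized to zero, the greedy clamped decode forces DOW=Mon onto the first word, and the update rewards its features for that tag while penalizing them for the free decode's choice.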

Missing Data Temporal Tagging Model (MiDaT)
While the multiple-instance learning assumption works well much of the time, it can easily be violated: there are many tweets that mention entities involved in an event but never explicitly mention its date.
The missing data modeling approach to weakly supervised learning proposed by Ritter et al. (2013) addresses this problem by relaxing the hard constraints of deterministic-OR factors, such as those described above, into soft constraints. Our missing-data model for weakly supervised tagging splits the sentence-level variables t into two parts: m, which represents whether a temporal tag is mentioned by at least one word of the tweet, and t, which represents whether a temporal tag can be derived from the event date. A set of pairwise potentials ψ(m_j, t_j) is introduced that encourages (but does not strictly require) agreement between m_j and t_j:

ψ(m_j, t_j) = α_r if m_j = 1 and t_j = 1; α_p if m_j ≠ t_j; 0 otherwise.

Here, α_p (Penalty) and α_r (Reward) are parameters of the MiDaT model. α_p is the penalty for extracting a temporal tag that is not related to the event date, and α_r is the reward for extracting a tag that matches the date.
During learning, if the local classifier is very confident, it is possible for a word to be labeled with a tag that is not derived from the event date, and also for a sentence-level tag to be ignored, although either case is penalized by the agreement potentials ψ(m_j, t_j) in the global objective. We use a local-search approach to inference, which Ritter et al. (2013) empirically demonstrated to nearly always yield exact solutions.
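The agreement potentials can be sketched as a small log-space scoring function. The exact parameterization is an assumption; the default penalty and reward values are the ones found by grid search on the dev set in the experiments.

```python
def agreement_potential(m_j, t_j, alpha_p=-25.0, alpha_r=500.0):
    """Log-space pairwise potential psi(m_j, t_j) encouraging agreement
    between a tag extracted from the text (m_j) and a tag derived from
    the event date (t_j).  Sketch only; not the authors' code."""
    if m_j and t_j:
        return alpha_r   # reward: extracted a tag that matches the event date
    if m_j != t_j:
        return alpha_p   # penalty: extraction and event date disagree
    return 0.0           # tag neither extracted nor derived: no cost
```

Because the constraints are soft, inference may pay α_p to drop a sentence-level tag (or keep an extra extracted one) when the local word-level scores are confident enough to outweigh the penalty.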

A Log-Linear Model for Normalizing Time Expressions
The Temporal Normalizer is built using a log-linear model that takes the tags t produced by the Temporal Recognizer as input and outputs one or more dates mentioned in a tweet. We formulate date normalization as a binary classification problem: given a tweet w published on date d_pub, we consider 22 candidate target dates (w, d_cand_l): the dates d_cand_l = d_pub + l for l = -10, ..., -1, 0, +1, ..., +10, which limits the possible date references to within 10 days before or after the tweet's creation date, plus d_cand = null (the tweet does not mention a date). While our basic approach has the limitation that it can only predict dates within ±10 days of the creation date, we found that in practice the majority of date references on social media fall within this window. Our approach is also able to score dates outside this range that are generated by traditional approaches to resolving time expressions, as described in Section 5.3.3. (Although the temporal recognizer is trained with tweets from ±7 days around the event date, we found that extending the candidate date range to ±10 days for the temporal normalizer increased the performance of TweeTIME on the dev set.)
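Candidate generation as described above can be sketched as follows; the function name is an assumption.

```python
from datetime import date, timedelta

def candidate_dates(pub_date, window=10):
    """Enumerate the normalizer's candidate targets: every calendar date
    within +/-window days of the tweet's publication date, plus None for
    'no date mentioned' (22 candidates in total for window=10)."""
    cands = [pub_date + timedelta(days=l) for l in range(-window, window + 1)]
    return cands + [None]

cands = candidate_dates(date(2016, 5, 6))
print(len(cands))  # 21 dates plus the null candidate
```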
The normalizer is similarly trained using the event database as distant supervision. The probability that a tweet mentions a candidate date is estimated using a log-linear model:

p(y = 1 | w, d_cand_l; θ_n) = exp(θ_n · g(w, d_cand_l)) / (1 + exp(θ_n · g(w, d_cand_l)))

where θ_n and g are the parameter vector and feature vector, respectively, of the Temporal Normalizer. For every tweet and candidate date pair (w, d_cand_l), we extract the following set of features. Temporal Tag Features indicate whether the candidate date agrees with the temporal tags extracted by the Temporal Recognizer. Three cases can occur here: the recognizer can extract a tag that cannot be derived from the candidate date; the recognizer can miss a tag derived from the candidate date; or the recognizer can extract a tag that is derived from the candidate date.
Lexical Features include two types of binary features extracted from the tweet: 1) Word Tag features, which consist of conjunctions of words in the tweet and tags associated with the candidate date (we remove URLs, stop words and punctuation); 2) Word POS features, which are the same as above but include conjunctions of POS tags, words and temporal tags derived from the candidate date. Time Difference Features are numerical features that indicate the distance between the creation date and the candidate date. They include the difference in days, ranging from -10 to 10, and the difference in weeks, ranging from -2 to 2.
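The Time Difference Features can be sketched as below. Encoding the offsets as indicator features and taking Monday as the week boundary are assumptions for illustration.

```python
from datetime import date, timedelta

def week_start(d):
    # Monday of the week containing d.
    return d - timedelta(days=d.weekday())

def time_difference_features(pub_date, cand_date):
    """Sketch of the Time Difference Features: offsets in days and in
    calendar weeks between the creation date and the candidate date."""
    if cand_date is None:
        return {"cand=null": 1.0}
    day_diff = (cand_date - pub_date).days                                # in [-10, 10]
    week_diff = (week_start(cand_date) - week_start(pub_date)).days // 7  # in [-2, 2]
    return {"day_diff=%d" % day_diff: 1.0, "week_diff=%d" % week_diff: 1.0}
```

For the Figure 1 example (a tweet posted Friday 5/6/2016 with candidate 5/9/2016), the day difference is 3 and the week difference is 1.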

Experiments
In the following sub-sections we present experimental results on learning to resolve time expressions in Twitter using minimal supervision.We start by describing our dataset, and proceed to present our results, including a large-scale evaluation on heuristically-labeled data and an evaluation comparing against human judgements.

Data Collection
We collected around 120 million tweets posted in a roughly one-year window from April 2011 to May 2012. These tweets were automatically annotated with named entities, POS tags and TempEx dates (Ritter et al., 2011).
From this automatically annotated corpus, we extract the top 10,000 events and their corresponding dates using the G² test, which measures the strength of association between an entity and a date using the log-likelihood ratio between a model in which the entity is conditioned on the date and a model of independence (Ritter et al., 2012). Events extracted using this approach then simply consist of the highest-scoring entity-date pairs, for example [Mercury, 5/9/2016].
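The G² statistic over a 2x2 contingency table of entity/date co-occurrence counts can be computed as follows. This is a standard textbook formulation; the authors' exact implementation may differ.

```python
from math import log

def g2(k11, k12, k21, k22):
    """G^2 log-likelihood-ratio statistic for a 2x2 contingency table:
    k11 = tweets mentioning both the entity and the date,
    k12 = the entity without the date, k21 = the date without the
    entity, k22 = neither.  G^2 = 2 * sum O * ln(O / E)."""
    n = k11 + k12 + k21 + k22
    total = 0.0
    for obs, row, col in [(k11, k11 + k12, k11 + k21),
                          (k12, k11 + k12, k12 + k22),
                          (k21, k21 + k22, k11 + k21),
                          (k22, k21 + k22, k12 + k22)]:
        expected = row * col / n
        if obs > 0:  # zero cells contribute nothing to the sum
            total += obs * log(obs / expected)
    return 2.0 * total

# An entity strongly associated with a date scores far higher than one
# whose co-occurrence with the date looks like chance.
print(g2(50, 10, 5, 1000), g2(5, 55, 50, 955))
```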
After automatically extracting the database of events, we gather all tweets that mention an entity from the list and are written within ±7 days of the corresponding event. These tweets, together with the dates of the known events, serve as labeled examples that are likely to mention a known date.
We also include a set of pseudo-negative examples that are unlikely to refer to any event, by gathering a random sample of tweets that do not mention any of the top 10,000 events and from which TempEx does not extract any date.

Large-Scale Heuristic Evaluation
We first evaluate our tagging model by testing how well it can predict the heuristically generated labels. As noted in previous work on distant supervision (Mintz et al., 2009a), this type of evaluation usually underestimates precision; however, it provides a useful intrinsic measure of performance.
In order to provide even coverage of months in the training and test sets, we divide the Twitter corpus into three subsets based on the mod-5 week of each tweet's creation date. To train the system, we use tweets created in the 1st, 2nd or 3rd weeks; to tune the parameters of the MiDaT model, we use tweets from the 5th weeks; and to evaluate the performance of the trained model, we use tweets from the 4th weeks. The performance of the MiDaT model varies with the penalty and reward parameters. To find a (near) optimal setting, we performed a grid search on the dev set and found that a penalty of −25 and a reward of 500 work best. A comparison of MultiT's and MiDaT's performance at predicting heuristically generated labels is shown in Table 2.
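The mod-5 week split can be sketched as below. The epoch (start of the collection window) is an assumption; any fixed reference date would do, as the split only needs to be consistent.

```python
from datetime import date

def mod5_week(d, epoch=date(2011, 4, 1)):
    """Assign a date to a mod-5 week bucket (1..5).  Buckets 1-3 are
    used for training, bucket 4 for the test set, and bucket 5 for the
    dev set, matching the split described above."""
    week = (d - epoch).days // 7
    return week % 5 + 1

SPLIT = {1: "train", 2: "train", 3: "train", 4: "test", 5: "dev"}
print(SPLIT[mod5_week(date(2011, 6, 15))])
```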
The word-level tags predicted by the temporal recognizer are used as the input to the temporal normalizer, which predicts the referenced date of each tweet. The overall system's performance at predicting event dates on the automatically generated test set, compared against SUTime, is shown in Table 3.

Evaluation Against Human Judgements
In addition to automatically evaluating our tagger on a large corpus of heuristically labeled tweets, we also evaluate the performance of our tagging and date-resolution models on a random sample of tweets taken from a much later time period, which were manually annotated by the authors.

Word-Level Tags
To evaluate the performance of the MiDaT tagger, we randomly selected 50 tweets and labeled each word with its corresponding tag. Against this hand-annotated test set, MiDaT achieves a precision of 0.54. A few examples of word-level tags predicted by MiDaT are shown in Table 4. We found that because the tags are learned as latent variables inferred by our model, they sometimes do not line up exactly with our intuitions but still provide useful predictions.

End-to-end Date Resolution
To evaluate the final performance of our system and compare against existing state-of-the-art time resolvers, we randomly sampled 250 tweets from 2014-2016 and manually annotated them with normalized dates; note that this is a separate date range from our weakly labeled training data, which is taken from 2011-2012. We use 50 tweets as a development set and the remaining 200 as a final test set. We experimented with different feature sets on the development data; feature ablation experiments are presented in Table 5.
The final performance of our system, compared against a range of state-of-the-art time resolvers, is presented in Table 6. We see that TweeTIME outperforms SUTime, TempEx, HeidelTime (using its COLLOQUIAL mode, which is designed for SMS text) and UWTime. Brief descriptions of each system can be found in Section 6.

System Combination with SUTime
As our basic TweeTIME system is designed to predict dates within ±10 days of the creation date, it fails when a tweet refers to a date outside this range. To overcome this limitation, we append the date predicted by SUTime to the list of candidate dates. We then re-rank SUTime's predictions using our log-linear model, and include its output as a predicted date if the confidence of our normalizer is sufficiently high.
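The combination scheme can be sketched as follows. Here score_fn stands in for the trained normalizer's confidence, and the threshold value and function names are assumptions.

```python
def combine_with_sutime(candidates, sutime_date, score_fn, threshold=0.5):
    """Sketch of the TweeTIME+SU combination: add SUTime's prediction to
    the candidate pool, score everything with the log-linear normalizer
    (score_fn), and emit a date only if the normalizer's confidence
    clears the threshold."""
    pool = list(candidates)
    if sutime_date is not None and sutime_date not in pool:
        pool.append(sutime_date)
    best = max(pool, key=score_fn)
    return best if score_fn(best) >= threshold else None

# Toy confidences standing in for the trained normalizer.
scores = {None: 0.2, "2016-05-09": 0.9, "2016-04-01": 0.95}
score_fn = lambda d: scores.get(d, 0.0)
print(combine_with_sutime([None, "2016-05-09"], "2016-04-01", score_fn))
```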

Error Analysis
We manually examined the system outputs and found 7 typical categories of errors (examples are shown below).

Over-Prediction: Unlike rule-based systems, TweeTIME has a tendency to over-predict when there is no explicit time expression in a tweet, possibly because of the presence of present-tense verbs. Similar mistakes can also occur with past-tense verbs.
Because TweeTIME resolves time expressions using a very different approach compared to traditional methods, its distribution of errors is quite distinct, as illustrated in Figure 6.

Related Work
Temporal resolvers primarily utilize either rule-based or probabilistic approaches. Notable rule-based systems such as TempEx (Mani and Wilson, 2000), SUTime (Chang and Manning, 2012) and HeidelTime (Strötgen and Gertz, 2013) remain particularly competitive with state-of-the-art machine learning methods. Probabilistic approaches use supervised classifiers trained on in-domain annotated data (Kolomiyets and Moens, 2010; Bethard, 2013a; Filannino et al., 2013), or hybridize them with hand-engineered rules (UzZaman and Allen, 2010; Lee et al., 2014). UWTime (Lee et al., 2014) is one of the most recent and competitive systems and uses Combinatory Categorial Grammar (CCG).
Although the recent TempEval research challenges (UzZaman et al., 2013; Bethard and Savova, 2016) offer an evaluation in the clinical domain besides newswire, most participants used the provided annotated corpus to train supervised models in addition to employing hand-coded rules. Previous work on adapting temporal taggers has primarily focused on scaling up to more languages. HeidelTime was extended to multiple languages (Strötgen and Gertz, 2015) and to colloquial text (Strötgen and Gertz, 2012) using dictionaries and additional in-domain annotated data. One existing line of work used distant supervision (Angeli et al., 2012; Angeli and Uszkoreit, 2013), but for normalization only, assuming gold time mentions as input; it used an EM-style bootstrapping approach and a CKY parser. Distant supervision has recently become popular in natural language processing. Much of the work has focused on the task of relation extraction (Craven and Kumlien, 1999; Bunescu and Mooney, 2007; Mintz et al., 2009b; Riedel et al., 2010; Hoffmann et al., 2011b; Nguyen and Moschitti, 2011; Surdeanu et al., 2012; Xu et al., 2013; Ritter et al., 2013; Angeli et al., 2014). Recent work also shows exciting results on extracting named entities (Ritter et al., 2011; Plank et al., 2014), emotions (Purver and Battersby, 2012) and sentiment (Marchetti-Bowick and Chambers, 2012), as well as finding evidence in medical publications (Wallace et al., 2016). Our work is closely related to the joint word-sentence model that exploits multiple-instance learning for paraphrase identification in Twitter (Xu et al., 2014).

Conclusions
In this paper, we showed how to learn time resolvers from large amounts of unlabeled text, using a database of known events as distant supervision. We presented a method for learning word-level temporal tagging models from tweets that are heuristically labeled with only sentence-level labels. This approach was further extended to account for missing tags, i.e., temporal properties that are not explicitly mentioned in the text of a tweet. These temporal tags were then combined with a variety of other features in a novel date resolver that predicts the normalized dates referenced in a tweet. By learning from large quantities of in-domain data, we were able to achieve a 0.68 F1 score on the end-to-end time normalization task for social media data, significantly outperforming SUTime, TempEx, HeidelTime and UWTime on this challenging dataset for time normalization.

Figure 1 :
Figure 1: A tweet published on Friday 5/6/2016 that contains the temporal expression Monday referring to the date of the event (5/9/2016), which a generic temporal tagger failed to resolve correctly.

Figure 2 :
Figure 2: A tweet that contains a simple explicit time mention and an event (Mercury, 5/9/2016) that can be identified by an open-domain information extraction system.

Figure 4 :
Figure 4: Multiple-Instance Learning Temporal Tagging Model: our approach to learning a word-level tagging model given only sentence-level labels. In this example, the sentence-level variable t_a = 1 indicates that the temporal tag DOW=Mon must be present, and t_b = 1 indicates that the target date is in the future (TL=future). The multiple-instance learning assumption implies that at least one word must be tagged with each of these present temporal tags; ideally, after training, the model will learn to assign tag a to z_8 and tag b to z_1.

Figure 5 :
Figure 5: Precision and recall at resolving time expressions compared against human judgements. TweeTIME achieves higher precision at comparable recall than other state-of-the-art systems.

Table 1 :
Our Temporal Recognizer can extract five different

Table 2 :
Performance comparison of MultiT and MiDaT at predicting heuristically generated tags on the dev set.

Table 3 :
Performance comparison of TweeTIME and SUTime at predicting heuristically labeled normalized dates.

Table 5 :
Feature ablation of the Temporal Resolver by removing each individual feature group from the full set.

Table 4 :
Example MiDaT tagging output on the test set; for example, Christmas is labeled with the tag MOY=dec.

Table 6 :
Performance comparison of TweeTIME against state-of-the-art temporal taggers. TweeTIME+SU uses our proposed approach to system combination, re-scoring output from SUTime using extracted features and learned parameters from TweeTIME.

In many cases, temporal words like Friday in the tweet "Is it Friday yet?" may not refer to any specific event or date, but are often predicted incorrectly. Also included in this category are cases where the future and the past are confused, for example predicting the past Friday when it is actually the coming Friday.
Over-Prediction: RT @tinatbh: January 2015: this will be my year. December 2015: maybe not.