To Schedule or not to Schedule: Extracting Task Specific Temporal Entities and Associated Negation Constraints

State-of-the-art research on date-time entity extraction from text is task agnostic. Consequently, while the methods proposed in the literature perform well for generic date-time extraction, they do not fare as well on task-specific date-time entity extraction, where only a subset of the date-time entities present in the text is pertinent to solving the task. Furthermore, some tasks require identifying negation constraints associated with the date-time entities in order to reason correctly over time. We showcase a novel model for extracting task-specific date-time entities along with their negation constraints. We show the efficacy of our method on the task of date-time understanding in the context of scheduling meetings for an email-based digital AI scheduling assistant. Our method achieves an absolute gain of 19% F-score points over baseline methods in detecting the date-time entities relevant to scheduling meetings, and a 4% improvement over baseline methods in detecting negation constraints over date-time entities.


Introduction
Temporal entity extraction and normalization is an important aspect of Natural Language Processing (Alonso et al., 2011; Campos et al., 2014). There has been a substantial body of work on the task, and there exist numerous well-performing, publicly available models for identifying and normalizing temporal entities (Strötgen and Gertz, 2010; Chang and Manning, 2012; Zhong and Cambria, 2018).
There exists, however, a growing number of NLP applications that require extracting only the relevant subset of time entities useful for solving specific problems within a larger body of text. Examples of such tasks include understanding search queries ("Find me all emails sent by April between May 11th and May 21st") and goal-oriented dialogue systems ("Deliver George Orwell's 1984 by next week.", "Send the "FY 2020 Budget" to Watson Monday morning."). Using generic temporal entity extraction models for these tasks is insufficient, since they fail to disambiguate between general date-time entities and the entities necessary to solve the task.
In this paper, we address the task of recognizing the date-time entities required by an AI scheduling assistant for correctly scheduling meetings. Cortana from Microsoft Scheduler, Clara from Clara Labs and Amy from X.ai are examples of such email-based digital assistants for scheduling meetings. For such systems, a user organizing the meeting adds the digital assistant as a recipient in an email with other attendees and delegates the task of scheduling to the digital assistant in natural language. For the assistant to correctly schedule the meeting, it must correctly extract the date-time entities the user expresses in the email to indicate both the times they want the meeting scheduled and the times that do not work for them. The verbose nature of emails often exacerbates the difficulty of identifying relevant date-time entities, since the number of distractors (i.e., valid date-time entities not pertinent to the task) tends to increase (e.g., in Fig. 1, "today" serves as a distractor entity).
To this end, we present SHERLOCK: ScHeduling Entity Recovery by LOoking at Contextual Knowledge, a novel model for detecting the date-time entities relevant in the context of scheduling, as well as identifying the entities associated with a negation constraint.

Figure 1: The 3 modules of SHERLOCK: First, a high-recall rule-based extractor generates the potential entities. The neural module then takes the email and the entities and generates scores for each entity. Only the relevant entities are passed to the final negation module to detect times to schedule and times to avoid.

SHERLOCK comprises three modules for identifying the relevant entities as well as the negation constraints associated with them:
• Date-Time Extractor: A high-recall date-time entity extractor that identifies all date-time entities in an email.
• Entity Relevance Scorer: A neural model that classifies each of the extracted entities as being relevant to scheduling or not by considering the context presented in the email.
• Negation Detector: A negation module that identifies whether there exists a negation constraint associated with each of the extracted relevant entities.

Fig. 1 illustrates each module: the entity extractor extracts "today", "next week", "Wednesday" and "May". Each of these entities is scored by the neural module, and only "next week" and "Wednesday" are identified as being relevant to scheduling. Finally, the negation module identifies that "Wednesday" has a negation constraint. While SHERLOCK focuses on the task of scheduling, we believe that a similar approach can be used to tackle the problem of extracting relevant date-time entities from documents for other tasks.
The contributions of this paper are as follows:
• Task-specific date-time extractor: A novel method for combining a conventional high-recall rule-based model with a novel neural model that incorporates contextual information to identify the date-time entities relevant to the task at hand.
• Identifying negation constraints for temporal entities: A heuristic negation module that helps identify negation constraints associated with time entities in the context of scheduling meetings. To the best of our knowledge, negation constraints associated with time-entity extraction have not been studied prior to this work.

We first present our proposed method for extracting time entities relevant to the task of scheduling a meeting (§2). Next, we describe our approach for identifying negation constraints associated with extracted entities (§3). In (§4), we describe our experimental setup and baselines. We discuss the results in (§5), showing that SHERLOCK improves performance both on the task of identifying relevant entities and on identifying negation constraints. We then present related work in (§6) and finally conclude in (§7).

Contextual Date-Time Extraction
In order to correctly extract the temporal entities relevant in the context of scheduling meetings from an email, we first extract potential entities using an off-the-shelf date-time entity extractor. Both the email and the extracted entities are then encoded using neural modules. For each extracted entity, we generate a context embedding using the encoded email and the encoded entity. Both the contextual embedding and the encoded entity embedding are then used to predict whether an entity is relevant. We describe each component in detail below.

Entity Extraction and Encoding
Given an email X = {w_1 · · · w_n}, we first use a rule-based tagger for extracting potential date-time entities from an email. Specifically, we use LUIS (Williams et al., 2015) for extracting the entities. The model is recall heavy and identifies potential time utterances (e.g., in Figure 1, LUIS detects "today", "next week", "wednesday", "may"). We denote the extracted entities as E = {e_1 · · · e_m}, where e_i = {e_{i,1} · · · e_{i,l_i}} represents the i-th entity and l_i denotes the length of e_i.
For each entity e_i, we generate an embedding u_{e_i} ∈ R^{d_e} (where d_e denotes the entity embedding dimension) as follows:

u_{e_i} = Seq2SeqEncoder([t_{i,1}; r_{i,1}] · · · [t_{i,l_i}; r_{i,l_i}])   (1)

In Equation (1), t_{i,j} denotes the word-level embedding of the j-th word of the i-th entity (e_{i,j}). As is standard practice, OOV words all share a common word embedding, while other entities encountered during training are represented by a learnt vector. We also augment this with an embedding from a character-level encoder: r_{i,j} denotes the word-level embedding obtained by passing e_{i,j} through a character-level encoder, which allows the model to represent OOV entities. The two embeddings are concatenated and then passed through another Seq2SeqEncoder (any sequence-to-sequence encoder (Sutskever et al., 2014)) to get the final entity encoding u_{e_i}.
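A minimal numpy sketch of the entity encoding step. This is illustrative only: the character encoder, dimensions and vocabulary below are our own stand-ins (the paper's implementation uses AllenNLP modules), and mean pooling substitutes for the Seq2SeqEncoder.

```python
import numpy as np

rng = np.random.default_rng(0)
D_WORD, D_CHAR = 8, 4

# Toy word-embedding table; OOV words share one common vector, as in the paper.
vocab = {"next": rng.normal(size=D_WORD), "week": rng.normal(size=D_WORD)}
oov_vec = rng.normal(size=D_WORD)

def char_encode(word):
    """Stand-in for the character-level encoder: a fixed-size vector
    derived from character codes (hypothetical, for shape only)."""
    codes = np.array([ord(c) for c in word], dtype=float)
    return np.resize(codes / 128.0, D_CHAR)

def encode_entity(entity_words):
    """Concatenate word- and char-level embeddings per token,
    then pool (mean pooling stands in for the Seq2SeqEncoder)."""
    token_vecs = []
    for w in entity_words:
        t = vocab.get(w, oov_vec)        # word-level embedding t_{i,j}
        r = char_encode(w)               # char-level embedding r_{i,j}
        token_vecs.append(np.concatenate([t, r]))
    return np.stack(token_vecs).mean(axis=0)   # entity encoding u_{e_i}

u = encode_entity(["next", "week"])
```

The final entity encoding lives in the concatenated space R^{D_WORD + D_CHAR} here; the real model's dimensions are hyperparameters.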

Contextual Entity Embeddings
From Figure 1, it is clear from context that "May" is not a time entity and that "today" is not an entity relevant to scheduling. We want to capture this contextual information for each entity.
To do so, we first encode the email as follows:

(v_{w_1} · · · v_{w_n}) = Seq2SeqEncoder(w_1 · · · w_n)   (2)

In Equation (2), v_{w_i} ∈ R^{d_w} denotes the embedding for the i-th word of the email X, where d_w denotes the embedding size for the email embedding.
Once we have the email embeddings, we compute the contextual embedding for each entity using an attention mechanism (Bahdanau et al., 2014). For entity e_i, given the entity embedding u_{e_i} and the email embeddings (v_{w_1} · · · v_{w_n}), v_{w_i} ∈ R^{d_w}, the contextual embedding c_{e_i} ∈ R^{d_w} is obtained as follows:

a_{i,j} = softmax_j(score(u_{e_i}, v_{w_j}))   (3)
c_{e_i} = Σ_j a_{i,j} v_{w_j}   (4)

where the parameters of the attention scoring function are learned. The final entity embedding f_{e_i} is the concatenation of the entity embedding and the contextual embedding. Finally, for each entity, we generate a probability score indicating whether the entity is relevant:

s_{e_i} = σ(f_{e_i}^T M + g)   (5)

where M ∈ R^{(d_e+d_w)×1}, g ∈ R are learned parameters, and σ denotes the sigmoid function.
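A numpy sketch of the attention and scoring step. The exact attention parameterization is not reproduced here; we assume a Bahdanau-style additive scorer, and all parameter shapes (`W`, `v`, `M`, `g`) are our illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
d_e, d_w, n = 6, 5, 4                      # entity dim, word dim, email length

u_e = rng.normal(size=d_e)                 # encoded entity u_{e_i}
V = rng.normal(size=(n, d_w))              # email word embeddings v_{w_1..n}

# Assumed additive-attention parameters (illustrative).
W = rng.normal(size=(d_e + d_w, d_w))
v = rng.normal(size=d_w)

# Attention weights over email words, then the contextual embedding.
scores = np.array([v @ np.tanh(np.concatenate([u_e, V[j]]) @ W)
                   for j in range(n)])
a = np.exp(scores - scores.max())
a /= a.sum()                               # softmax attention weights
c_e = a @ V                                # contextual embedding c_{e_i}

# Concatenate and score relevance with a sigmoid.
f_e = np.concatenate([u_e, c_e])           # final entity embedding
M = rng.normal(size=d_e + d_w)
g = 0.0
s_e = 1.0 / (1.0 + np.exp(-(f_e @ M + g))) # relevance probability
```

The attention weights sum to one over the email tokens, so `c_e` is a convex combination of word embeddings weighted by their relevance to the entity.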

Learning
Given the entities that are relevant to scheduling, Y ⊆ E (e.g., "next week" and "Wednesday" in Figure 1), we train the model with a scoring loss as follows:

L_s = -Σ_{e_i ∈ E} [y_{e_i} log s_{e_i} + (1 - y_{e_i}) log(1 - s_{e_i})]   (6)

where y_{e_i} = 1 if e_i ∈ Y and 0 otherwise. Similar to Ruder (2017); Gehrmann et al. (2018); Li et al. (2018), we find that augmenting the learning with a related auxiliary task helps improve performance. In this case, a simple related auxiliary task is sequence tagging. Specifically, given the email X and the relevant entities Y, we tag the location of each relevant entity with an I-Time tag and every other token with an O tag. Let the generated tags be z = (z_1 · · · z_n) and let C denote the set of possible tagging labels (in our case 2: {I-Time, O}). We then train a standard CRF for tagging as follows:

L_t = -log p(z | X), with p(z | X) ∝ exp(Σ_{i=1}^{n} (P^T v_{w_i} + q)_{z_i} + T_{z_{i-1}, z_i})   (7)

where P ∈ R^{d_w×|C|}, q ∈ R^{|C|} are trainable parameters, T ∈ R^{|C|×|C|} is the transition matrix, and the normalization sums over Z, the set of all possible label sequences.
The final loss that we optimize for is L = L_s + γ L_t, where γ balances between the two loss functions.
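The combined objective can be sketched as follows. The scoring loss is binary cross-entropy over the relevance scores (an assumption consistent with the sigmoid output); the CRF tagging loss is stubbed here as a precomputed scalar, since a full CRF is beyond a short sketch.

```python
import numpy as np

def scoring_loss(s, y):
    """Binary cross-entropy over entity relevance scores (the scoring
    loss L_s; exact form assumed)."""
    s = np.clip(s, 1e-7, 1 - 1e-7)   # numerical stability
    return float(-np.mean(y * np.log(s) + (1 - y) * np.log(1 - s)))

def total_loss(l_s, l_t, gamma=0.5):
    """L = L_s + gamma * L_t, where l_t would come from a CRF tagger
    and gamma is a tunable balance hyperparameter."""
    return l_s + gamma * l_t

# Two entities: one relevant (scored 0.9), one not (scored 0.2).
l_s = scoring_loss(np.array([0.9, 0.2]), np.array([1.0, 0.0]))
```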

Choosing the Prediction Threshold
In order to find the threshold for classifying the positive class (i.e., the threshold t such that an entity e_i is predicted relevant if s_{e_i} > t), we compute the F1 score on the validation set over a grid of thresholds and choose the threshold that maximizes the F1 score.
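The grid search above can be sketched directly; the grid spacing is our illustrative choice.

```python
import numpy as np

def f1_at_threshold(scores, labels, t):
    """F1 of the binary decision scores > t against boolean labels."""
    pred = scores > t
    tp = np.sum(pred & labels)
    fp = np.sum(pred & ~labels)
    fn = np.sum(~pred & labels)
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

def best_threshold(scores, labels, grid=np.linspace(0.05, 0.95, 19)):
    """Pick the grid threshold maximizing validation F1."""
    return max(grid, key=lambda t: f1_at_threshold(scores, labels, t))

scores = np.array([0.9, 0.8, 0.3, 0.1])
labels = np.array([True, True, False, False])
t_star = best_threshold(scores, labels)
```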

Identifying Negation Constraints
For a scheduling assistant to correctly schedule meetings, understanding negations is crucial; otherwise, scheduling can lead to an unsatisfactory user experience (e.g., in Figure 1, scheduling the meeting on Wednesday would be a frustrating experience for the organizer Sherlock). Only about 10% of the scheduling requests in our dataset have negation constraints, and in our preliminary experiments, building a model directly for this task did not show promising results. We hypothesize this was due to the small volume of data as well as the lack of good-quality supervised data. Consequently, to find negated time entities, we adopt the approach of first finding the negation scope: if an entity occurs inside the negation scope, we mark it as negated.
In order to find the negation scope, we build on the approach proposed by Rosenberg (2013). We first find the negation cue ("except" in Figure 1). To do so, we tokenize the email into sentences and, for each sentence, check whether a cue from a set of negation cues (Appendix A) occurs in it.
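Cue detection reduces to a lexicon lookup over sentence tokens. A minimal sketch, noting that the cue list below is a small hypothetical subset (the paper's full list is in its Appendix A, which is not reproduced here):

```python
# Hypothetical subset of negation cues (illustrative only).
NEGATION_CUES = {"not", "except", "can't", "cannot", "won't", "besides"}

def find_cues(sentence):
    """Return the negation cues found in a sentence via simple
    whitespace tokenization."""
    tokens = sentence.lower().replace(",", " ").split()
    return [t for t in tokens if t in NEGATION_CUES]

cues = find_cues("Any day next week works, except Wednesday")
```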
After finding the negation cue, we identify its POS tag (Prep. for "except"). Given the POS tag and the negation cue, we trigger a set of heuristics to identify the negation scope. Most heuristics work by locating the negation cue in the dependency parse of the sentence along with the governor of the negating word. Generating the narrow scope of negation (i.e., the scope not containing the subject) then involves identifying the constituent from the constituency parse that contains both the negation cue and the governor word ("any day except Wednesday", see Figure 2). This constituent is considered the candidate narrow scope, and usually the part following the cue is taken as the narrow scope.
For some cases, the narrow negation scope is not enough to identify the time entity being negated. Consider the second example in Figure 2: the narrow scope is not enough to identify the entity being negated ("next week"). To find the wide scope, heuristics that leverage the dependency path starting from the governor word are used. The main idea is to find the subject associated with the governor node and extract it as the wide scope ("Next week"). Following the guidelines set by Morante and Daelemans (2012), we also include the aux dependency node in the wide scope ("does").
We also expand the heuristic set presented in Rosenberg (2013), adding the following rules:
• If a Noun Phrase (NP) acting as an adverbial modifier serves as the subject of the governor, we include it in the wide scope (Figure 2).
• If an NP exists as the subject of a passive clause, we include it in the wide scope, along with the passive auxiliary associated with it.
• A Prepositional Phrase (PP) acting as the subject of the governor is included in the wide scope.
• For the narrow scope, we prune out the subtree that exists as an object of an adverbial clause relation (advcl) headed by the governor node.
Due to space constraints, we include examples for the above in Appendix B. After obtaining the narrow and wide scopes, we check whether any entities occur in the narrow scope. If so, those entities are marked as negated. If no entities are found in the narrow scope, we then check the wide scope for negated entities.
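The narrow-then-wide fallback can be sketched with scopes represented as plain strings (in the real system they are parse constituents):

```python
def negated_entities(entities, narrow_scope, wide_scope):
    """Mark entities found in the narrow scope as negated; fall back
    to the wide scope only if the narrow scope contains none."""
    in_narrow = [e for e in entities if e in narrow_scope]
    if in_narrow:
        return in_narrow
    return [e for e in entities if e in wide_scope]

# Narrow scope suffices here: "Wednesday" is negated, "next week" is not.
hits = negated_entities(["next week", "Wednesday"],
                        narrow_scope="except Wednesday",
                        wide_scope="any day next week except Wednesday")
```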
Finally, for some cases, we also use domain-specific cues that imply non-availability (for example, in "Dr. John out of office on Monday.", "out of office" implies an unavailability to meet). When such implied negation cues are encountered, we default to a custom heuristic that marks any entity occurring within the sentence containing the cue as negated.
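The implied-cue fallback amounts to a sentence-level blanket rule. A sketch, with a hypothetical cue list (the paper's actual domain cues are not listed here):

```python
IMPLIED_CUES = {"out of office", "ooo", "on vacation"}   # hypothetical subset

def implied_negations(sentence, entities):
    """If a domain-specific implied negation cue appears, mark every
    entity occurring in the sentence as negated."""
    s = sentence.lower()
    if any(cue in s for cue in IMPLIED_CUES):
        return [e for e in entities if e.lower() in s]
    return []

hits = implied_negations("Dr. John out of office on Monday.", ["Monday"])
```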

Experimental Setup
We first show the effectiveness of our proposed entity scoring method, which incorporates context to improve temporal entity extraction, on the TempEval-3 dataset (UzZaman et al., 2013) (§4.1). We then show the efficacy of SHERLOCK on the task of extracting the temporal entities relevant in the context of scheduling, a task for which context becomes substantially more important (§4.2). Finally, we show that SHERLOCK's negation module outperforms baseline methods on the task of identifying the entities with negation constraints (§4.3). All our models are implemented using the AllenNLP framework (Gardner et al., 2017). The hyperparameters for all the experiments can be found in Appendices C and D.

Dataset
We use the TimeBank dataset (Pustejovsky et al., 2003), which serves as the benchmark dataset for the TempEval series. The dataset consists of 256 documents (95,391 tokens and 1,822 TimeEx entities) for training and validation, and 20 documents (6,375 tokens, 138 TimeEx entities) as the test set.

Baseline Models
We show the performance of augmenting 3 rule-based models with our proposed model. Specifically, we consider SUTime (Chang and Manning, 2012), HeidelTime (Strötgen and Gertz, 2010) and SynTime (Zhong et al., 2017) as the rule-based extractors. We also compare against UWTime (Lee et al., 2014), a learning-based model.

Evaluation
We use the official TempEval-3 scoring script and report the standard metrics: detection precision, recall and F1 under both the relaxed and strict settings. A gold mention counts as detected under the relaxed metric if any predicted candidate overlaps with it; under the strict metric, an exact string match is required.
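The two matching criteria can be sketched over character-offset spans (representing mentions as half-open `(start, end)` offsets is our assumption; the official scorer operates on annotated extents):

```python
def strict_match(gold_span, pred_span):
    """Strict: the predicted span must equal the gold span exactly."""
    return gold_span == pred_span

def relaxed_match(gold_span, pred_span):
    """Relaxed: spans overlap if neither ends before the other starts."""
    (gs, ge), (ps, pe) = gold_span, pred_span
    return gs < pe and ps < ge

# A prediction overlapping but not equal to the gold mention.
ok_relaxed = relaxed_match((10, 14), (12, 18))
ok_strict = strict_match((10, 14), (12, 18))
```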

Date-Time extraction for Scheduling
This task aims at extracting the date-time entities necessary for the scheduling agent to correctly schedule the meeting. The task necessarily requires the model to incorporate context to make the correct prediction (e.g., in Figure 1, "today" is a valid date-time entity but not relevant for scheduling, while "May" refers to a person).

Dataset
We use an internal scheduling dataset for training and evaluating the models. The dataset consists of emails and annotated times to schedule. The training and validation set consists of 44,214 emails (4,589,631 tokens and 48,083 entities), while the test set consists of 4,914 emails (519,021 tokens, 5,233 entities).

Baseline Models
We compare the performance of our model against SUTime, HeidelTime and LUIS. We use LUIS as our base date-time extractor since it provides much larger coverage of date-time entities.

Evaluation
We use the Strict F1 measure to compare the performance of the different models proposed.

Negation Detection
Finally, we compare the performance of our proposed model on the task of negation extraction.

Dataset
We use an internal dataset for comparing different models on the task of negation extraction. The dataset consists of 1,253 emails in which the time entities relevant to scheduling are selected, and those that are part of a negation constraint are marked as negated entities. There are 3,231 time entities, of which 1,589 are negated.

Baselines
We compare our proposed method against a naive heuristic method as well as a neural model trained on a publicly available negation scope detection dataset.
Heuristic: A naive heuristic model. If a negation cue is identified in a sentence, the model predicts that all entities in that sentence are negated.
NegNN: We use a NegNN model (Fancellu et al., 2016), modified to use BERT contextual embeddings and trained on the *SEM2012 Shared task (Morante and Blanco, 2012). The training, development and test sets are a collection of stories from Conan Doyle's Sherlock Holmes, with the cue and scope annotated. An entity is considered negated if it is a part of a negated scope, as predicted by the model. The performance of the modified NegNN model on the *SEM2012 Task can be found in Appendix E.

Evaluation
We measure the performance of the different models by comparing the predicted set of negated entities against the gold labels for the entities. If the model predicts an entity to be negated when it is not, that counts as a false positive; likewise, any negated entities missed by the model contribute to the false negatives. We thus report the precision, recall and F1 score.

Temporal Entity Extraction

Table 1 shows the performance of SHERLOCK's entity scoring module on the TempEval-2013 dataset. Note that SHERLOCK is limited by the recall of the base rule-based extractor. We observe that augmenting the rule-based model with SHERLOCK improves the precision in all three cases without a substantial drop in recall. Furthermore, the precision obtained for all the augmented models compares favorably with UWTime.

Date-Time extraction for Scheduling

Table 2 shows the performance of SHERLOCK on the scheduling-related date-time extraction task. As can be seen, being able to incorporate context yields a substantial improvement over the baseline methods.

We also observed that incorporating the tagging loss L_t helped improve performance (SHERLOCK vs. SHERLOCK - L_t). On investigating further, we observed that the attention weights associated with an entity concentrated much better around the position of the entity in the email for a model trained with L_t than for the model without it. To see why that is advantageous, consider the following example: "Let's schedule for tomorrow. Next month, I plan on taking up Mr Baskerville's case." Here, the attention weights of the model trained without L_t are much more spread out, and the model places high weight on the embeddings associated with "tomorrow". Consequently, it also uses those embeddings when predicting the label of "next month", and hence predicts it to be relevant to scheduling when it is not. Due to space constraints, we include our localization experiments in Appendix F.

Negation Detection

Table 3 shows the performance of SHERLOCK compared to the baseline methods. We hypothesize that SHERLOCK and the simple heuristic model outperform the neural baseline for two reasons: the neural negation model was trained on a dataset of Sherlock Holmes stories and consequently does not adapt well to negation extraction from emails; and the neural model has no notion of implied negations.

To test this hypothesis, we split the negations into two categories: explicit negations (where the cue is one of the explicit negation cues) and implied negations (any case that was not explicit). 50% of the emails in the negation dataset contained only explicit negations, 48% contained only implied negations and 2% contained both. Table 4 shows the performance of SHERLOCK and the baselines for both the explicit and the implied negation cases. Unsurprisingly, both the baselines and SHERLOCK perform better on explicit negations than on implied ones. However, both the heuristic model and SHERLOCK substantially outperform NegNN, with SHERLOCK also clearly outperforming the heuristic. Examples 1 and 2 in Table 5 give qualitative examples of cases where SHERLOCK outperforms the heuristic.
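The negation evaluation described in §4 (false positives for wrongly negated entities, false negatives for missed ones) can be sketched over entity sets:

```python
def negation_prf(gold_negated, pred_negated):
    """Precision/recall/F1 over sets of negated entities: a predicted
    negation that is not gold is a false positive; a missed gold
    negation is a false negative."""
    gold, pred = set(gold_negated), set(pred_negated)
    tp = len(gold & pred)
    fp = len(pred - gold)
    fn = len(gold - pred)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# One correct negation, one spurious, one missed.
p, r, f1 = negation_prf({"Wednesday", "next week"}, {"Wednesday", "today"})
```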
The primary source of errors for detecting implied negations is from failing to identify the correct cue. Since heuristics for implied negations are more heavily focused on precision, the absence of negation cues results in the model not detecting the implied negation, which in turn negatively impacts the recall. Examples 3, 4 and 5 in Table 5 show cases where the cue is not present in the heuristic set of implied cues.
For explicit negations, one source of errors is due to entity co-referencing. Consider Example 6: the negated time instance Tuesday is referenced as "then" and hence the negation scope "then" is insufficient to identify the correct negated entity. A few errors also stem from inherent ambiguity: in Example 7, the request can either be interpreted as being for anytime next week except Thursday 10am, or for 10 am on all days except Thursday. Finally, we also observe errors due to double negations (Example 8) and due to incorrect constituency and dependency parses.

Related Work
Existing approaches for time expression extraction can be categorized into rule-based methods and learning-based methods.
Rule-based Methods Rule-based methods such as HeidelTime and SUTime mainly handcraft deterministic rules to identify time expressions. TempEx and GUTime use both hand-crafted and machine-learnt rules to resolve time expressions (Mani and Wilson, 2000; Verhagen et al., 2005; Blamey et al., 2013). HeidelTime manually designs rules with time resources to recognize time expressions (Strötgen and Gertz, 2010). SUTime designs deterministic rules at three levels (i.e., individual word level, chunk level, and time expression level) for time expression recognition (Chang and Manning, 2012). A recent type-based time tagger, SynTime, designs general heuristic rules with a token type system to recognize time expressions (Zhong et al., 2017). TOMN (Zhong and Cambria, 2018) uses token regular expressions, similar to SUTime (Chang and Manning, 2012) and SynTime (Zhong et al., 2017), and further groups them into three token types, similar to SynTime. TOMN also leverages statistical information from the entire corpus to improve precision and soften the deterministic role of its deterministic and heuristic rules.
Learning-based Methods Learning-based methods in the TempEval series mainly extract features from text (e.g., character, word, syntactic, and semantic features) and apply statistical models (e.g., CRFs) over these features to model time expressions (Bethard, 2013; Filannino et al., 2013; Llorens et al., 2010; UzZaman and Allen, 2010). Besides these standard methods, Angeli et al. (2012); Angeli and Uszkoreit (2013) exploit an EM-style approach with a compositional grammar to learn latent time parsers. Lee et al. (2014) leverage a learnt CCG (Steedman, 1996) parser and define a lexicon with linguistic context to model time expressions, using loose structural information by grouping the constituent words of a time expression under three token types.
Negation Scope Detection: Most negation detection research has focused on the biomedical domain (Mehrabi et al., 2015; Agarwal and Yu, 2010). Negation detection for non-biomedical text usually involves learning supervised classifiers over hand-crafted features that leverage syntactic structure (constituency and dependency parses) (Lapponi et al., 2012; Chowdhury and Mahbub, 2012; White, 2012; Abu-Jbara and Radev, 2012). The current state-of-the-art learned method uses a neural BiLSTM-CRF model (Fancellu et al., 2016). However, the corpus available for negation detection consists of Sherlock Holmes stories (the *SEM2012 Shared Task (Morante and Daelemans, 2012)) and, consequently, as shown in this work, models trained on it do not adapt well to the language used in other document styles (such as emails). In this work, we build on the work of Rosenberg (2013), who develops linguistic rules over constituency and dependency parses to identify negation scopes. The primary advantage of leveraging this work is that it is not strongly tied to the *SEM2012 dataset, and we found it to generalize better.
Finally, there has been some work on directly training a model to extract entities and associated negation constraints (Bhatia et al., 2019). However, such work usually assumes the availability of good-quality annotations of negated entities. Given enough annotated data, exploring this direction would be an interesting line of future work.

Conclusion
In this paper, we presented a novel model that combines conventional high-recall rule-based models with neural models that utilize contextual information to identify task-relevant temporal entities. Our proposed model, when used in conjunction with 3 different rule-based models, achieves substantial precision gains for all of them without a large drop in recall. Further, the model substantially outperforms baseline methods on the task of identifying the date-time entities relevant to scheduling a meeting.
We also presented a novel approach for identifying the negation constraints of date-time entities. Identifying the negation constraints associated with date-time entities correctly is necessary for the task of scheduling. We showed that the existing neural approaches for detecting negation scopes do not transfer well, and that our proposed model based on heuristics defined over constituency and dependency parses achieves strong performance gains, especially for the case of explicit negations.

B Negation Heuristics
We describe the modifications made to the model presented in Rosenberg (2013).

B.1 Additional Rules: Wide scope
Instead of considering just the words connected by an nsubj relation (both in the general case and when the governor is linked by a conj link, in which case the subject is identified by the term in the first coordinate clause), we include words connected by the nsubjpass as well as the npadvmod link. Here, the prepositional phrase ("Before Wednesday") acts as a subject to the governor node and consequently needs to be included in the wide scope.

B.2 Additional Rules: Narrow Scope Pruning
While computing the narrow scope, if the scope contains a word that forms an adverbial clause relation with the governor, we remove the dependency subtree associated with the advcl word. We give an example of such a case below:
Example: "This won't work as Mary is free next week."

D Hyperparameters: Negation Model
We reimplement the heuristics outlined in Rosenberg (2013) as well as the additional heuristics mentioned in the paper. We use the following components for our reimplementation:
POS Tagger: The spaCy POS tagger with the en_core_web_sm model.

Constituency Parser: The Berkeley Neural Parser model (Kitaev and Klein, 2018).
Dependency Parsers: We use an ensemble of two parsers: the AllenNLP implementation of the model presented in Dozat and Manning (2016), and the spaCy en_core_web_sm model.

E Neural Model Training on Sherlock Dataset
We experiment with two models on the *SEM2012 dataset, which yield comparable performance to the model presented by Fancellu et al. (2016). We experiment with tuning a BERT model (Devlin et al., 2018); specifically, we compare the performance of the bert-base-cased and bert-base-multilingual-cased models, using the HuggingFace Transformers implementation (Wolf et al., 2019). We try fine-tuning BERT as well as using BERT as a feature extractor with an LSTM over the BERT model. We observed that using an LSTM performed a little better (Table 6), and consequently use that model for the experiments presented in the paper.

F Localization of Attention Weights

As mentioned in the paper, we also observed that the model augmented with L_tagging has improved localization of entities by the attention module. Specifically, given an entity and the context, the attention weights induced by the entity (Equation 3) are localized more around the entity text for the augmented model (for example, given "Let us meet next week.", a good localization of the attention weights for the entity "next week" would have the weights high around "next week" in the text). To validate this hypothesis, we cluster the attention weights generated by an entity into two clusters using KMeans++ (Arthur and Vassilvitskii, 2006) with k = 2. We then extract the tokens associated with the smaller cluster and consider them to be the localized context (since these weights are the highest, the embeddings associated with these tokens have the maximum impact on predicting whether the entity is relevant). For the localized context, we measure the degree of overlap with the original entity as a measure of the quality of localization: given a localized context {c_1 · · · c_k} ⊂ X for entity e_i, we compute |{c_1 · · · c_k} ∩ e_i| / k. Figure 3 shows a histogram of the coverage for models trained with and without L_tagging.

As can be seen, the model with the tagging loss is much better at concentrating the attention weights around the entity in consideration. To see why this is advantageous, consider the following example: "Let's schedule for tomorrow. Next month, I plan on taking up Mr Baskerville's case."
Entity in consideration: next month
With tagging: next month
Without tagging: tomorrow . next month
Here, the model without tagging also uses the embeddings associated with "tomorrow" when predicting the label of "next month" and, consequently, predicts it to be relevant to scheduling when it is not.
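The coverage computation can be sketched as follows. For simplicity this uses a 1-D two-means threshold in place of a full KMeans++ implementation, and the token/weight values are illustrative.

```python
import numpy as np

def localized_coverage(attn, tokens, entity_tokens):
    """Split attention weights into two clusters (a simple 1-D
    two-means stand-in for KMeans++ with k=2), take the higher-weight
    cluster as the localized context, and measure its overlap with
    the entity: |context ∩ entity| / |context|."""
    w = np.asarray(attn, dtype=float)
    c_lo, c_hi = w.min(), w.max()
    for _ in range(20):                      # iterate cluster means
        mid = (c_lo + c_hi) / 2
        lo, hi = w[w <= mid], w[w > mid]
        if len(lo) == 0 or len(hi) == 0:
            break
        c_lo, c_hi = lo.mean(), hi.mean()
    context = [t for t, wi in zip(tokens, w) if wi > (c_lo + c_hi) / 2]
    if not context:
        return 0.0
    overlap = sum(1 for t in context if t in entity_tokens)
    return overlap / len(context)

tokens = ["let", "us", "meet", "next", "week"]
attn = [0.05, 0.05, 0.10, 0.40, 0.40]        # well-localized weights
cov = localized_coverage(attn, tokens, {"next", "week"})
```

A coverage of 1.0 indicates the high-attention cluster consists entirely of entity tokens, i.e., perfect localization.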