Smart To-Do: Automatic Generation of To-Do Items from Emails

Intelligent features in email service applications aim to increase productivity by helping people organize their folders, compose their emails and respond to pending tasks. In this work, we explore a new application, Smart To-Do, that helps users with task management over emails. We introduce a new task and dataset for automatically generating To-Do items from emails where the sender has promised to perform an action. We design a two-stage process leveraging recent advances in neural text generation and sequence-to-sequence learning, obtaining BLEU and ROUGE scores of 0.23 and 0.63 for this task. To the best of our knowledge, this is the first work to address the problem of composing To-Do items from emails.


Introduction
Email is one of the most used forms of communication, especially in enterprise and work settings (Radicati and Levenstein, 2015). With the growing number of users on email platforms, service providers are constantly seeking to improve user experience for a myriad of applications such as online retail, instant messaging and event management (Feddern-Bekcan, 2008). Smart Reply (Kannan et al., 2016) and Smart Compose (Chen et al., 2019) are two recent features that provide contextual assistance to users, aiming to reduce typing effort. Another line of work in this direction is automated task management and scheduling. For example, the recent Nudge feature¹ in Gmail and Insights in Outlook² are designed to remind users to follow up on an email or pay attention to pending tasks.
Smart To-Do takes a step further in task assistance and seeks to boost user productivity by automatically generating To-Do items from their email context. Text generation from emails, like creating To-Do items, is replete with complexities due to the diversity of conversations in email threads, the heterogeneous structure of emails and the various meta-data involved. As opposed to prior work in text generation, such as news headlines, email subject lines and email conversation summarization, To-Do items are action-focused, requiring the identification of a specific task to be performed.

* Work done as an intern at Microsoft Research.
¹ Gmail Nudge
² Outlook Insights
In this work, we introduce the task of automatically generating To-Do items from email context and meta-data to assist users with following up on their promised actions (also referred to as commitments in this work). Refer to Figure 1 for an illustration. Given an email, its temporal context (i.e., thread), and associated meta-data like the name of the sender and recipient, we want to generate a short and succinct To-Do item for the task mentioned in the email. This requires identifying the task sentence (also referred to as a query), relevant sentences in the email that provide contextual information about the query along with the entities (e.g., people) associated with the task. We utilize existing work to identify the task sentence via a commitment classifier that detects action intents in the emails. Thereafter, we use an unsupervised technique to extract key sentences in the email that are helpful in providing contextual information about the query. These pieces of information are further combined to generate the To-Do item using a sequence-to-sequence architecture with deep neural networks. Figure 2 shows a schematic diagram of the process. Since there is no existing work or dataset on this problem, our first step is to collect annotated data for this task.

Figure 2: Smart To-Do flowchart: The email content is first scanned to detect any possible commitment sentence. If present, a To-Do item is generated using a two-stage Smart To-Do framework.

Overall, our contributions can be summarized as follows:
• We create a new dataset for To-Do item generation from emails containing action items based on the publicly available email corpus Avocado (Oard et al., 2015).³
• We develop a two-stage algorithm, based on unsupervised task-focused content selection and subsequent text generation combining contextual information and email meta-data.
• We conduct experiments on this new dataset and show that our model performs on par with human judgments on multiple performance metrics.

Related Works
Summarization of email threads has been the focus of multiple research works in the past (Rambow et al., 2004; Carenini et al., 2007; Dredze et al., 2008). There has also been considerable research on identifying speech acts or tasks in emails (Carvalho and Cohen, 2005; Lampert et al., 2010; Scerri et al., 2010) and how it can be robustly adapted across diverse email corpora (Azarbonyad et al., 2019). Recently, novel neural architectures have been explored for modeling action items in emails (Lin et al., 2018) and identifying intents in email conversations. However, there has been less focus on task-specific email summarization (Corston-Oliver et al., 2004). The closest to our work is email subject line generation (Zhang and Tetreault, 2019), but it focuses on a common email theme and uses a supervised approach for sentence selection, whereas our method relies on identifying the task-related context.

³ We will release the code and data (in accordance with LDC and Avocado policy) at https://aka.ms/SmartToDo. Email examples in this paper are similar to those in our dataset but are not reproducing text from the Avocado dataset.

Dataset Preparation
We build upon the Avocado dataset (Oard et al., 2015), which contains an anonymized version of the Outlook mailboxes of 279 employees, with various meta-data and 938,035 emails overall.

Identifying Action Items in Emails
Emails contain various user intents including planning and scheduling meetings, requests for information, exchange of information, casual conversations, etc. For the purpose of this work, we first need to extract emails containing at least one sentence where the sender has promised to perform an action. It could be performing a task, providing some information, keeping others informed about a topic and so on. We use the term commitment to refer to such intent in an email and the term commitment sentence to refer to each sentence with that intent.
Commitment classifier: A commitment classifier $C : S \rightarrow [0, 1]$ takes as input an email sentence $S$ and returns the probability that the sentence is a commitment. The classifier is built using labels from an annotation task with 3 judges. The Cohen's kappa value is 0.694, depicting substantial agreement. The final label is obtained from the majority vote, generating a total of 9076 instances (with 2586 positive/commitment labels and 6490 negative labels). The classifier is an RNN-based model with word embeddings and self-attention geared for binary classification, with the input being the entire email context. The classifier has a precision of 86% and a recall of 84% on sentences in the Avocado corpus.

To-Do Item Annotation
Candidate emails: We extracted 500k raw sentences from Avocado emails and passed them through the commitment classifier. We threshold the commitment classifier confidence to 0.9 and obtained 29k potential candidates for To-Do items. Of these, a random subset of 12k instances was selected for annotation.
Annotation guideline: For each candidate email $e_c$ and the previous email in the thread $e_p$ (if present), we obtained meta-data like 'From', 'Sent-To', 'Subject' and 'Body'. The commitment sentence in $e_c$ was highlighted and annotators were asked to write a To-Do item using all of the information in $e_c$ and $e_p$. We prepared a comprehensive guideline to help human annotators write To-Do items, containing the definition and structure of To-Do items and commitment sentences, along with illustrative examples. Annotators were instructed to use words and phrases from the email context as closely as possible and introduce new vocabulary only when required. Each instance was annotated by 2 judges.
Analysis of human annotations: We obtained a total of 9349 email instances with To-Do items, each of which was annotated by two annotators. To-Do items have a median token length of 9 and a mean length of 9.71. For 60.42% of the candidate emails, both annotators agreed that the subject line was helpful in writing the To-Do item.
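The candidate selection step can be illustrated with a minimal sketch. It assumes that classifier scores are already available as (sentence, probability) pairs; the function name `select_candidates`, the inclusive comparison at the threshold and the fixed random seed are illustrative assumptions, not details given in this paper.

```python
import random

def select_candidates(scored_sentences, threshold=0.9, sample_size=None, seed=0):
    """Keep sentences whose commitment probability reaches the threshold,
    then optionally draw a random subset for annotation."""
    # Assumption: the threshold is inclusive (>=); the text only says "threshold to 0.9".
    candidates = [sentence for sentence, prob in scored_sentences if prob >= threshold]
    if sample_size is not None and sample_size < len(candidates):
        rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
        candidates = rng.sample(candidates, sample_size)
    return candidates

# Hypothetical classifier outputs, for illustration only.
scored = [
    ("I will send you the report tomorrow.", 0.97),
    ("Thanks for the update.", 0.05),
    ("I'll get back to you on this.", 0.93),
    ("Let me check and follow up.", 0.88),
]
selected = select_candidates(scored, threshold=0.9)
```

In this toy input, only the two sentences scored above 0.9 survive the filter; passing `sample_size` mimics drawing the random subset that is sent to annotators.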
To further analyze the annotation quality, we randomly sampled 100 annotated To-Do items and asked a judge to rate them on (a) fluency (grammatical and spelling correctness), and (b) completeness (capturing all the action items in the email) on a 4-point scale (1: Poor, 2: Fair, 3: Good, 4: Excellent). Overall, we obtained mean ratings of 3.1 and 2.9 for fluency and completeness, respectively. Table 1 shows a snapshot of the analysis.

Smart To-Do : Two Stage Generation
In this section, we describe our two-stage approach to generate To-Do items. In the first stage, we select sentences that are helpful in writing the To-Do item. Emails contain generic sentences such as salutations, thanks and casual conversations not relevant to the commitment task. The objective of the first stage is to select sentences containing informative concepts necessary to write the To-Do.

Identifying Helpful Sentences for Commitment Task
In the absence of reliable labels to extract helpful sentences in a supervised fashion, we resort to an unsupervised matching-based approach. Let the commitment sentence in the email be denoted as $H$, and the rest of the sentences from the current email $e_c$ and previous email $e_p$ be denoted as $\{s_1, s_2, \ldots, s_d\}$. The unsupervised approach seeks to obtain a relevance score $\Omega(s_i)$ for each sentence.
The top $K$ sentences with the highest scores will be selected as the extractive summary for the commitment sentence (also referred to as the query).
Enriched query context: We first extract the top $\tau$ maximum-frequency tokens from all the sentences in the given email, the commitment and the subject (i.e., $\{s_1, s_2, \ldots, s_d\} \cup H \cup \text{Subject}$). Tokens are lemmatized and stop-words are removed. We set $\tau = 10$ in our experiments. An enriched context for the query, $E$, is formed by concatenating the commitment sentence $H$, the subject and the top $\tau$ tokens.
Relevance score computation: The task-specific relevance score $\Omega$ for a sentence $s_i$ is obtained by the inner product in the embedding space with the enriched context. Let $h(\cdot)$ be the function denoting the embedding of a sentence, with $\Omega(s_i) = h(s_i)^{\top} h(E)$. Our objective is to find helpful sentences for the commitment given by semantic similarity between concepts in the enriched context and a target sentence. In case of a short or less informative query, the subject and topic of the email provide useful information via the enriched context. We experiment with three different embedding functions.
(1) Term-frequency (Tf): The binarized term frequency vector is used to represent the sentence.
(2) FastText Word Embeddings: We trained FastText embeddings (Bojanowski et al., 2017) of dimension 300 on all sentences in the Avocado corpus. The embedding function $h(s_j)$ is given by taking the max (or mean) across the word-embedding dimension of all tokens in the sentence $s_j$.
(3) Contextualized Word Embeddings: We utilize recent advances in contextualized representations from pre-trained language models like BERT (Devlin et al., 2019). We use the second last layer of pre-trained BERT for sentence embeddings.
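To make the relevance scoring concrete, the following is a minimal sketch of the binarized term-frequency variant: it builds an enriched context from the commitment, the subject and the top-$\tau$ tokens, and ranks sentences by inner product with that context. The stop-word list, the whitespace tokenizer and all function names are assumptions made for illustration, and lemmatization is omitted for brevity; this is not the implementation used in our experiments.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption; no specific list is given in the text).
STOP_WORDS = {"a", "an", "the", "i", "you", "me", "it", "to", "of", "and", "will", "could"}

def content_tokens(text):
    """Lowercase, strip surrounding punctuation, drop stop-words (no lemmatization here)."""
    tokens = (raw.strip(".,?!:;'\"()") for raw in text.lower().split())
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]

def enriched_context(commitment, subject, sentences, tau=10):
    """Tokens of the commitment and subject, plus the top-tau most frequent tokens overall."""
    counts = Counter()
    for text in [commitment, subject, *sentences]:
        counts.update(content_tokens(text))
    top_tokens = [tok for tok, _ in counts.most_common(tau)]
    return content_tokens(commitment) + content_tokens(subject) + top_tokens

def binarized_tf(tokens, vocab):
    """1 if the vocabulary word occurs in the token list, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]

def rank_sentences(commitment, subject, sentences, k=1, tau=10):
    """Return the k sentences with the highest inner product against the enriched context."""
    context = enriched_context(commitment, subject, sentences, tau)
    sentence_tokens = [content_tokens(s) for s in sentences]
    vocab = sorted(set(context).union(*sentence_tokens))
    h_context = binarized_tf(context, vocab)
    scores = []
    for tokens in sentence_tokens:
        h_sentence = binarized_tf(tokens, vocab)
        scores.append(sum(a * b for a, b in zip(h_sentence, h_context)))
    # sorted() is stable, so ties keep their original email order.
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in order[:k]]

top = rank_sentences(
    commitment="I will send it to you",
    subject="hello ?",
    sentences=["Hi Alice,", "Could you send me the sales report ?", "Thanks, Bob"],
    k=1,
)
```

For this toy thread, the request sentence shares the token "send" with the commitment and therefore ranks first. Swapping `binarized_tf` for a dense embedding function would give the FastText or BERT variants under the same scoring rule.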
We also fine-tuned BERT on the labeled dataset for the commitment classifier. The dataset is first balanced (2586 positive and 2586 negative instances). Uncased BERT is trained for 5 epochs for commitment classification, with the input being word-piece tokenized email sentences. This model is denoted as BERT (fine-tuned) in Table 2.
Evaluation of unsupervised approaches: Retrieving at least one helpful sentence is crucial to obtain contextual information for the To-Do item. Therefore, we evaluate our approaches based on the proportion of emails where at least one helpful sentence is present in the top $K$ retrieved sentences.
We manually annotated 100 email instances and labeled every sentence as helpful or not based on (a) whether the sentence contains concepts appearing in the target To-Do item, and (b) whether the sentence helps to understand the task context. Inter-annotator agreement between 2 judgments for this task has a Cohen's kappa score of 0.69. This annotation task also demonstrates the importance of the previous email in a thread. Out of 100 annotated instances, 44 have a replied-to email, of which 31 contain a helpful sentence in the replied-to email body (70.4%). Table 2 shows the performance of the various unsupervised extractive algorithms. FastText with max-pooling of embeddings performed the best and is used in the subsequent generation stage.
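The evaluation criterion described above (the proportion of emails with at least one helpful sentence among the top $K$ retrieved) can be written as a short function. This is a sketch under the assumption that each email is represented by a list of helpful/not-helpful flags already ordered by descending relevance score; the name `hit_rate_at_k` and the toy labels are illustrative only.

```python
def hit_rate_at_k(ranked_labels, k):
    """Fraction of emails whose top-k ranked sentences include at least one helpful one.

    ranked_labels: one list per email; each list holds booleans (helpful or not),
    ordered by descending relevance score.
    """
    if not ranked_labels:
        return 0.0
    hits = sum(1 for labels in ranked_labels if any(labels[:k]))
    return hits / len(ranked_labels)

# Hypothetical annotations for four emails, three ranked sentences each.
emails = [
    [True, False, False],
    [False, True, False],
    [False, False, False],
    [False, False, True],
]
rate_at_1 = hit_rate_at_k(emails, 1)
rate_at_2 = hit_rate_at_k(emails, 2)
rate_at_3 = hit_rate_at_k(emails, 3)
```

On the toy labels, the rate rises from 0.25 at $K = 1$ to 0.75 at $K = 3$, since the third email has no helpful sentence at any rank.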

To-Do Item Generation
The generation phase of our approach can be formulated as sequence-to-sequence (Seq2Seq) learning with attention (Sutskever et al., 2014; Bahdanau et al., 2014). It consists of two neural networks, an encoder and a decoder. The input to the encoder consists of concatenated tokens from different meta-data fields of the email like 'sent-to', 'subject', the commitment sentence $H$ and the extracted sentences $I$, separated by special markers. For instance, the input to the encoder for the example in Figure 1 is given as:
<to> alice <sub> hello ? <query> i will send it to you <sent> could you send me the sales report ? <eos>
We experiment with multiple versions of the generation model as follows:
Vanilla Seq2Seq: Input tokens $\{x_1, x_2, \ldots, x_T\}$ are passed through a word-embedding layer and a single-layer LSTM to obtain encoded representations $h_t = f(x_t, h_{t-1}) \; \forall t$ for the input. The decoder is another LSTM that makes use of the encoder state $h_t$ and prior decoder state $s_{t-1}$ to generate the target words at every timestep $t$. We consider Seq2Seq with an attention mechanism, where the decoder LSTM uses an attention distribution $a_t$ over timesteps $t$ to focus on important hidden states to generate the context vector $h_t$. This is the first baseline in our work.
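The construction of the encoder input from the email fields can be sketched as follows. The marker tokens are taken from the example above; the function name `build_encoder_input`, the lowercasing step and the use of one `<sent>` marker per extracted sentence are assumptions made for illustration.

```python
def build_encoder_input(sent_to, subject, commitment, helpful_sentences):
    """Concatenate email fields into one marker-separated token string."""
    parts = ["<to>", sent_to, "<sub>", subject, "<query>", commitment]
    for sentence in helpful_sentences:
        # Assumption: each extracted sentence gets its own <sent> marker.
        parts += ["<sent>", sentence]
    parts.append("<eos>")
    # Lowercase and normalize whitespace, mirroring the lowercased example in the text.
    return " ".join(" ".join(parts).lower().split())

encoder_input = build_encoder_input(
    sent_to="Alice",
    subject="Hello ?",
    commitment="I will send it to you",
    helpful_sentences=["Could you send me the sales report ?"],
)
```

With these field values, the function reproduces the marker-separated string shown for the Figure 1 example.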
Seq2Seq with copy mechanism: As the second model, we consider Seq2Seq with copy mechanism (See et al., 2017) to copy tokens from important email fields. Copying is pivotal for To-Do item generation since every task involves named entities in terms of the persons involved, specific times and dates when the task has to be accomplished and other task-specific details present in the email context. To understand the copy mechanism, consider the decoder input at each decoding step as $y_t$ and the context vector as $h_t$. The decoder at each timestep $t$ has the choice of generating the output word from the vocabulary $V$ with probability $p_{gen} = \phi(h_t, s_t, y_t)$, or with probability $1 - p_{gen}$ it can copy the word from the input context. To allow that, the vocabulary is extended as $V = V \cup \{x_1, x_2, \ldots, x_T\}$. The model is trained end-to-end to maximize the log-likelihood of target words (To-Do items) given the email context.
Seq2Seq BiFocal: As a third model, we experimented with query-focused attention having two encoders: one containing only tokens of the query and the other containing the rest of the input context. We use a bifocal copy mechanism that can copy tokens from either of the encoders. We refer the reader to the Appendix for more details about training and hyper-parameters used in our models.

From: Raymond Jiang
To: support@company.com
Subject: Bug 62
Hi, there is a periodic bug 62 appearing in my cellphone browser, whenever I choose to open the request. It might be a JavaScript issue on our side, but it would be nice if you take a look. Thanks, Ray.

From: Criag Johnson
To: Raymond Jiang
Subject: Bug 62
Good Morning Ray, I shall take a look at it and get back to you.

GOLD: Take a look at Bug 62 and get back to Raymond.
PRED: Take a look at periodic and get back to Raymond.
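The choice between generating and copying described in this section can be illustrated numerically. The sketch below follows the pointer-generator formulation of See et al. (2017), in which the final probability of a word is $p_{gen}$ times its vocabulary probability plus $(1 - p_{gen})$ times the attention mass on its occurrences in the input. The function name and the toy numbers are assumptions, and the models in our experiments were built with OpenNMT, whose internals may differ from this sketch.

```python
def final_distribution(p_gen, vocab_probs, attention, source_tokens):
    """Mix generation and copy distributions over an extended vocabulary.

    vocab_probs: dict word -> probability from the decoder softmax (sums to 1).
    attention: attention weights over the source tokens (sums to 1).
    source_tokens: input tokens x_1..x_T aligned with the attention weights.
    """
    # Generation part: scale every vocabulary probability by p_gen.
    final = {word: p_gen * prob for word, prob in vocab_probs.items()}
    # Copy part: add (1 - p_gen) * attention to each source token,
    # extending the vocabulary with tokens that were out of vocabulary.
    for token, weight in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final

# Toy numbers for illustration only.
dist = final_distribution(
    p_gen=0.5,
    vocab_probs={"take": 0.5, "look": 0.25, "<unk>": 0.25},
    attention=[0.5, 0.25, 0.25],
    source_tokens=["bug", "62", "look"],
)
```

In this toy case, "bug" and "62" are absent from the decoder vocabulary yet receive non-zero probability through copying, which is the behaviour needed to reproduce named entities and identifiers from the email.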

Experimental Results
We trained the above neural networks for To-Do item generation on our annotated dataset. Of the 9349 email instances with To-Do items, we used 7349 for training and 1000 each for validation and testing. For each instance, we chose the annotation with fewer tokens as ground-truth reference.
The median token length of the encoder input is 43 (including the helpful sentence). Table 4 shows the performance comparison of various models. We report BLEU-4 (Papineni et al., 2002) and the F1-scores for ROUGE-1, ROUGE-2 and ROUGE-L (Lin, 2004). We also report the human performance for this task in terms of the above metrics computed between annotations from the two judges.
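As a reminder of what the unigram-overlap F1 measures, the following is a simplified sketch of ROUGE-1 F1 applied to a GOLD/PRED pair from the Bug 62 example in this paper. It uses a plain lowercase alphanumeric tokenizer and no stemming, both of which are assumptions; scores from a standard ROUGE package can differ slightly.

```python
import re
from collections import Counter

def rouge1_f1(reference, candidate):
    """Simplified ROUGE-1 F1: harmonic mean of clipped unigram precision and recall."""
    ref_counts = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    cand_counts = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

gold = "Take a look at Bug 62 and get back to Raymond."
pred = "Take a look at periodic and get back to Raymond."
score = rouge1_f1(gold, pred)
```

Here the reference has 11 tokens, the prediction has 10, and 9 overlap, so the F1 is 18/21, roughly 0.857, even though the prediction misses the key entity "Bug 62". This illustrates why overlap metrics alone can overstate the usefulness of a generated To-Do item.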
A trivial baseline -which concatenates tokens from the 'sent-to' and 'subject' fields and the commitment sentence -is included for comparison.
The best performance is obtained with Seq2Seq using the copy mechanism. We observe that our model performs on par with human performance for writing To-Do items. Table 3 shows some examples of To-Do item generation from our best model.

Conclusions
In this work, we study the problem of automatic To-Do item generation from email context and meta-data to provide smart contextual assistance in email applications. To this end, we introduce a new task and dataset for action-focused text intelligence. We design a two-stage framework with deep neural networks for task-focused text generation.
There are several directions for future work including better architecture design for utilizing structured meta-data and replacing the two-stage framework with a multi-task generation model that can jointly identify helpful context for the task and perform corresponding text generation.

A.1 Hyper-parameters
We now provide the hyper-parameters and training details for ease of reproducibility of our results. The encoder-decoder architecture consists of LSTM units. The word embedding look-up matrix is initialized using GloVe embeddings and then trained jointly to adapt to the structure of the problem. We found this step crucial for improved performance. Using random initialization or static GloVe embeddings degraded performance.
We also experimented with using either a shared or a separate vocabulary for the encoder and decoder. A token was included in the vocabulary if it occurred at least 2 times in the training input/target. A separate vocabulary for source and target gave better performance. Typically, the source vocabulary had a higher number of tokens than the target. A shared dictionary led to an increased number of parameters in the decoder and to subsequent over-fitting. The validation data was used for early stopping. The patience was decreased whenever either the validation token accuracy or perplexity failed to improve. We used the OpenNMT framework in PyTorch for all our Seq2Seq experiments.
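The vocabulary rule described above (keep a token if it occurs at least 2 times, with separate source and target vocabularies) can be sketched in a few lines. The function name, the special tokens `<pad>` and `<unk>`, and the toy data are assumptions for illustration; OpenNMT provides its own vocabulary-building tooling, which this sketch does not reproduce.

```python
from collections import Counter

def build_vocab(tokenized_texts, min_freq=2, specials=("<pad>", "<unk>")):
    """Map tokens occurring at least min_freq times to integer ids."""
    counts = Counter(tok for tokens in tokenized_texts for tok in tokens)
    kept = sorted(
        tok for tok, count in counts.items()
        if count >= min_freq and tok not in specials
    )
    # Special tokens (an assumption of this sketch) take the first ids.
    return {tok: idx for idx, tok in enumerate(list(specials) + kept)}

# Toy training pairs, for illustration only.
sources = [
    ["<to>", "alice", "<query>", "send", "report"],
    ["<to>", "bob", "<query>", "send", "slides"],
]
targets = [
    ["send", "report", "to", "alice"],
    ["send", "slides", "to", "bob"],
]
src_vocab = build_vocab(sources)  # separate source vocabulary
tgt_vocab = build_vocab(targets)  # separate target vocabulary
```

Building the two vocabularies independently keeps the decoder's output layer limited to tokens that actually recur in To-Do items, which is in line with the observation above that a shared dictionary increases decoder parameters.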

A.2 Illustrative Examples
In this section, we provide further examples of email threads along with the highlighted commitment sentence. Note that some of the emails include the previous email in the thread, while others do not. For each of these examples, we also provide the To-Do item written by the human judge (denoted as GOLD) and that predicted by our best model (denoted as PRED). As in the main text, the sentences have been paraphrased and names changed due to the data sensitivity of Avocado.

From: Beverly Evans
To: Carlos Simmons
Subject: Amazon.com update
Carlos, I came to know today from John Carter that we received a PHP script that is not decoding the correct database. Can you check with them why they sent us the eCommerce PHP code when the loss of functionality was not our fault? I have registered the error log in the eCommerce section because the staff scientist from Amazon mentioned it in his email. He also said they have not been able to resolve the issue and surprisingly did not mention who we should contact next. (This email exchange was about a week ago when I had handed them the cloud expenditures.) Also, we need to generate a PHP example to replicate the error. Could you update me if the team is working on it? Thanks, Beverly

From: Carlos Simmons
To: Beverly Evans
Subject: Amazon.com update
The PHP they shared with us is an example. eCommerce is not what they want us to resolve. I feel we should wait until their engineers test all possibilities. Joseph informed us that they need to test the database more carefully and figure out which PHP code to send to us and whether they want our feedback on the database. I am not sure why they sent me a 'relevant PHP example' - I thought there was the only file they sent us yesterday. I will forward that to you and Renata.

GOLD: Forward PHP example to Beverly and Renata.
PRED: Forward eCommerce PHP to Beverly.

From: Kirstin Barnes
To: Nannie Jacobs
Subject: Ready for Product Launch
Nannie, I am ready for the product launch. I need to include some of the enhancements in the presentation. I'll submit what is already completed and then do the remaining after the meeting. Kirstin Barnes Product Engineer AvocadoIT, Inc.

GOLD: Submit presentation with product enhancements.
PRED: Submit the enhancements for product launch.

To: R&D
Subject: Software not ready yet for deployment
Hello, Unlike our plan last month, the software is still not ready for deployment. The team put together some errors last week. We must plan to make it available latest by next week. I will keep you posted.

When synchronizing is done, we want to run a bash script to delete old records on the machine and remove all activity logs. How can I do this ? What is the way to perform this operation ? Also, in the bash script, is there a way to sort the dates so that we can identify older activities ? Thanks, Rebecca.

From: Julia Roberts
To: Rebecca Anderson
Subject: Run a bash script while synchronize
Rebecca, We had exactly the same feature to delete activities which you mentioned in our previous release. But we no longer have that in the new version due to resource constraints. I will talk to John to review this again. Thanks, Julia.

GOLD: Talk to John to review bash script again.
PRED: Talk to John to review the activities.

To: Gopal Majumdar
Subject: Updates List for 3/11
Here's the update for this week.
1. The R&D team is working on a presentation for the knowledge transfer for v5. It should be ready within next two weeks.
2. I have received their email, but need to review the ppt.
3. Did you want to know more about the new cloud feature for automatic version management ? Or was it a different feature ?
4. I am constantly working on this.
5. Didn't we discuss this point in our last email ?
6. We are making similar tests in the desktop for v5 before migrating to the cloud. We first have to make sure things work well for the desktop. I will send you more details soon.
Did you get a chance to update your blog with information about these new features ? Thanks, Ramesh.

GOLD: Send Gopal more details about tests in the desktop for v5.
PRED: Send Gopal more details on presentation for the knowledge transfer.

This needs to be done through a formal training session, but as of now let me point out some crucial points about room reservations.
1. In case you allocate a room for general meetings and administrative work, then make sure you book it for that month, but not for long periods of time. (Karen, can you check with Renata whether this is fulfilled for our meetings next week?)
2. In case of clients who do not need the entire month, make sure to reserve only for the particular month. If it exceeds that time, the system will automatically resolve it and reserve it for next month.
3. For room reservation, either enter the number of hours required or the % of month, but not both. I would prefer precise hours.
I will inform you when we can provide training, perhaps we can next week. Thanks, Lori.

GOLD: Let Karen know about the training provide for room reservations.
PRED: Let Karen know about room reservations.

As discussed before, we have finally come to a concrete plan. I have attached the draft for your review. Please go over it and let me know asap your suggestions so that I can send them to the organizers. Please check the agenda and the names of trainees. I'll put together the Training plan and the overall 5-day agenda as soon as I can. Matthew.

GOLD: Put together the training plan and the overall day agenda of software training.
PRED: Put together the draft agenda for software training.