This Email Could Save Your Life: Introducing the Task of Email Subject Line Generation

Given the overwhelming number of emails, an effective subject line becomes essential to better inform the recipient of the email's content. In this paper, we propose and study the task of email subject line generation: automatically generating an email subject line from the email body. We create the first dataset for this task and find that email subject line generation favors extremely abstractive summaries, which differentiates it from news headline generation or news single document summarization. We then develop a novel deep learning method and compare it to several baselines as well as recent state-of-the-art text summarization systems. We also investigate the efficacy of several automatic metrics based on correlations with human judgments and propose a new automatic evaluation metric. Our system outperforms competitive baselines in both automatic and human evaluations. To our knowledge, this is the first work to tackle the problem of effective email subject line generation.


Introduction
Email is a ubiquitous form of online communication. An email message consists of two basic elements: an email subject line and an email body. The subject line, which is displayed to the recipient in the list of inbox messages, should tell what the email body is about and what the sender wants to convey. An effective email subject line becomes essential since it can help people manage a large number of emails. Table 1 shows an email body with three possible subject lines.
Table 1: An email with three possible subject lines.
Email Body: Hi All, I would be grateful if you could get to me today via email a job description for your current role. I would like to get this to the immigration attorneys so that they can finalise the paperwork in preparation for INS filing once the UBS deal is signed. Kind regards,
Subject 1: Current Job Description Needed (COMMENT: This is good because it is both informative and succinct.)
Subject 2: Job Description (COMMENT: This is okay but not informative enough.)
Subject 3: Request (COMMENT: This is bad because it does not contain any specific information about the request.)

There have been several research tracks around email usage. While much effort has been focused on email summarization (Muresan et al., 2001; Nenkova and Bagga, 2003; Rambow et al., 2004), email keyword extraction and action detection (Turney, 2000; Lahiri et al., 2017; Lin et al., 2018), and email classification (Prabhakaran and Rambow, 2014; Alkhereyf and Rambow, 2017), to our knowledge there is no previous work on generating email subjects. In this paper, we propose the task of Subject Line Generation (SLG): automatically producing email subjects given the email body. While this is similar to email summarization, the two tasks serve different purposes in the process of email composition and consumption. A subject line is required when the sender writes the email, while a summary is more useful for long emails to benefit the recipient. An automatically generated email subject can also be used for downstream applications such as email triaging to help people manage emails more efficiently. Furthermore, while being similar to news headline generation or news single document summarization, email subjects are generally much shorter, which means a system must have the ability to summarize with a high compression ratio (Table 2). Therefore, we believe this task can also benefit other highly abstractive summarization tasks, such as generating section titles for long documents to improve reading comprehension speed and accuracy.

* Work done during the internship at Grammarly.
To introduce the task, we build the first dataset, the Annotated Enron Subject Line Corpus (AESLC), by leveraging the Enron Corpus (Klimt and Yang, 2004) and crowdsourcing.

Annotated Enron Subject Line Corpus
To prepare our email subject line dataset, we use the Enron dataset (Klimt and Yang, 2004), which is a collection of email messages of employees in the Enron Corporation. We use Enron because it can be released to the public and it contains business and personal type emails for which the subject line is already well-defined and useful. As shown in Table 2, email subjects are typically much shorter than summaries in previous news datasets. While being similar to news headline generation (Rush et al., 2015), email subject generation is also more challenging in the sense that it deals with different types of email subjects, while the first sentence of a news article is often already a good headline and summary.
1 Dataset available at https://github.com/ryanzhumich/AESLC

Data Preprocessing
The original Enron dataset contains 517,401 email messages from 150 user mailboxes. To extract body and subject pairs from the dataset, we take all messages from the inbox and sent folders of all mailboxes. We then perform email body cleaning, email filtering, and email de-duplication.
We first remove any content from the email body that has not been written by the author of the email. This includes automatically appended boilerplate material such as advertisements, attachments, and legal disclaimers. Since we are interested in emails with enough information to generate meaningful subjects, we only keep emails with at least 3 sentences and 25 words in the email body. Furthermore, to ensure that the email subject truly corresponds to the content in the email body, we only take the first email of a thread and exclude replies and forwarded emails. Specifically, we filter out follow-up messages which contain an "Original Message" section in the email body or have subject lines starting with "RE:" (reply-to messages) or "FW:" (forwarded messages). Finally, we observe that the same message can be sent to multiple recipients, so we remove duplicate emails to make sure there is no overlap between the train and test sets. We only keep the subject and body; other information such as the sender/recipient identity can be incorporated in future work.
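The filtering and de-duplication rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the naive regex sentence splitter, the case-insensitive prefix match, and the whitespace-normalized duplicate key are all hypothetical choices, not the preprocessing code used to build AESLC.

```python
import re

MIN_SENTENCES = 3   # minimum number of sentences in the body (from the text)
MIN_WORDS = 25      # minimum number of words in the body (from the text)
# Assumption: match "RE:" / "FW:" prefixes case-insensitively.
REPLY_OR_FORWARD = re.compile(r"^\s*(re|fw)\s*:", re.IGNORECASE)


def split_sentences(text):
    """Naive sentence splitter (assumption; a real tokenizer would be more robust)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def keep_email(subject, body):
    """Return True if a (subject, body) pair passes the filters described above."""
    if REPLY_OR_FORWARD.match(subject):
        return False  # reply or forwarded message
    if "original message" in body.lower():
        return False  # follow-up message quoting an earlier email
    if len(split_sentences(body)) < MIN_SENTENCES:
        return False
    if len(body.split()) < MIN_WORDS:
        return False
    return True


def deduplicate(pairs):
    """Drop duplicate (subject, body) pairs after case and whitespace normalization."""
    seen, unique = set(), []
    for subject, body in pairs:
        key = (" ".join(subject.lower().split()), " ".join(body.lower().split()))
        if key not in seen:
            seen.add(key)
            unique.append((subject, body))
    return unique
```

Running `deduplicate` before splitting into train, dev, and test sets is what prevents the same message from appearing on both sides of the split.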

Subject Annotation
We noted that using only the original subject lines as references may be problematic for automatic evaluation purposes. First, there can be many different valid, effective subject lines for the same email, yet the original email subject is only one of them. This is similar to why automatic machine translation evaluation often relies on multiple references. Second, the email subject may be too general or too vague when the sender does not put that much effort into writing. Third, the sender may assume some shared knowledge with the recipient so that the email subject contains information that cannot be found in the email body.
To address the issues above, we ask workers on Amazon Mechanical Turk to read Enron emails in our dev and test sets and write an appropriate subject line. Each email is annotated with 3 subject lines from 3 different annotators. For quality control, we manually review and reject improper email subjects such as empty subject lines, subject lines with typos, and subject lines that are too general or too vague, e.g., "Update", "Schedule", "Attention to Detail", because they contain no body-specific information and can be applied generically to many emails. We found that while the three annotations are different, they often contain common keywords. To further quantify the variation among human annotations, we compute ROUGE-L F1 scores for each pair of annotations: 34.04, 33.38, 34.26.
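To make the pairwise comparison concrete, the following is a minimal sketch of a sentence-level ROUGE-L F1 score based on the longest common subsequence (LCS). It assumes lowercased whitespace tokenization and no stemming, so its numbers will not exactly match an official ROUGE toolkit; the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbor.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Harmonic mean of LCS-based precision and recall between two subject lines."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

For the three annotations of one email, the score would be computed for each of the three annotator pairs and then averaged over the dataset per pair.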

Our Model
Our model is illustrated in Figure 1. Based on recent progress in news summarization (Chen and Bansal, 2018), our model generates email subjects in two stages: (1) The extractor selects multiple sentences containing salient information for writing a subject (§3.1). (2) The abstractor rewrites multiple selected sentences into a succinct subject line while preserving key information (§3.2).
We employ a multi-stage training strategy (§3.4) including a Reinforcement Learning (RL) phase because of its usefulness for text generation tasks (Ranzato et al., 2016; Bahdanau et al., 2017) to optimize non-differentiable metrics such as ROUGE and METEOR. However, unlike ROUGE for summarization or METEOR for machine translation, there is no available automatic metric designed for email subject generation. Motivated by recent work on regression-based metrics for machine translation (Shimanaka et al., 2018) and dialog response generation (Lowe et al., 2017), we build a neural network (ESQE) to estimate the quality of an email subject given the email body (§3.3). The estimator is pretrained and fixed during the RL training phase to provide rewards for the extractor agent.
While our model is based on Chen and Bansal (2018), they assume that there is a one-to-one relationship between the summary sentence and the document sentence: every summary sentence can be rewritten from exactly one sentence in the document. They also use ROUGE to make extraction labels and to provide rewards in their RL training phase. In contrast, our model extracts multiple sentences and rewrites them together into a single subject line. We also use word overlap to make extraction labels and use our novel ESQE as a reward function.

Multi-sentence Extractor
For the first stage, we need to select multiple sentences from the email body which contain the necessary information for writing a subject. This task can be formulated as a sequence-to-sequence learning problem where the output sequence corresponds to the position of "positive" sentences in the input email body. Therefore, we use a pointer network (Vinyals et al., 2015) to first build hierarchical sentence representations during encoding and then extract "positive" sentences during decoding.
Suppose our input is an email body D which consists of |D| sentences. We first use a temporal CNN (Kim, 2014) to build individual sentence representations. For each sentence, we feed the sequence of its word vectors into 1-D convolutional filters with various window sizes. We then apply ReLU activation followed by max-over-time pooling. The sentence representation is a concatenation of activations from all filters. Then we use a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) to capture document-level inter-sentence information over the CNN outputs, giving a representation $d_j$ for each sentence.
For sentence extraction, another LSTM as decoder outputs one "positive" sentence at each time step t. Denoting the decoder hidden state as $h_t$, we choose a "positive" sentence from a 2-hop attention process. First, we build a context vector $e_t$ by attending to all $d_j$. Then, we get an extraction probability distribution $o_t$ over input sentences, where {v, W, U} are trainable parameters. We also add a trainable "stop" vector with the same dimension as the sentence representation. The decoder can choose to stop by pointing to this "stop" sentence.

Figure 1: Our model architecture. In this example, the input email body consists of four sentences from which the extractor selects the second and the third. The abstractor generates an email subject from the selected sentences. The quality estimator provides rewards by scoring the subject against the email body.
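The 2-hop attention can be written out as follows. This is a sketch that assumes the glimpse-style pointer network formulation of Chen and Bansal (2018), on which the model is based; it is not a verbatim restoration of the paper's numbered equations. The second-hop parameters $\{v, W, U\}$ are named in the text, while the first-hop parameters $\{v_e, W_e, U_e\}$ and the intermediate scores $a^t_j$, $u^t_j$ are names introduced here for illustration.

```latex
% First hop: attend over all sentence representations d_j using the decoder state h_t
a^t_j = v_e^\top \tanh\left(W_e d_j + U_e h_t\right), \qquad
\alpha^t = \mathrm{softmax}\left(a^t\right), \qquad
e_t = \sum_j \alpha^t_j \, W_e d_j

% Second hop: score each sentence against the context vector e_t
u^t_j = v^\top \tanh\left(W d_j + U e_t\right), \qquad
P\left(o_t \mid o_1, \dots, o_{t-1}\right) = \mathrm{softmax}\left(u^t\right)
```

Under this reading, the "stop" vector is simply one more candidate $d_j$ in both softmaxes, so stopping is an ordinary pointing decision.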

Multi-sentence Abstractor
In the second stage, the abstractor takes the selected sentences from the extractor and rewrites them into an email subject. We implement the abstractor as a sequence-to-sequence encoder-decoder model with bilinear multiplicative attention (Luong et al., 2015) and a copy mechanism (See et al., 2017). The copy mechanism enables the decoder to copy words directly from the input document, which helps it reproduce accurate information verbatim, even for out-of-vocabulary words.

Email Subject Quality Estimator
Since there is no established automatic metric for SLG, we build our own Email Subject Quality Estimator (ESQE). Given an email body D and a potential subject s, our quality estimator outputs a real-valued Subject Quality score SQ(D, s). The email subject and the email body are each fed to a temporal CNN.
We concatenate the outputs of the CNNs as the representation of the email body and subject pair. Then, a single-layer feed-forward neural net predicts the quality score from this representation.
To train the estimator, we collect human evaluations on 3,490 email subjects. In order to expose the estimator to both good and bad examples, 2,278 of the 3,490 are the original subjects and the remaining 1,212 subjects are generated by an existing summarization system. Each subject has 3 human evaluation scores (the same human evaluation as explained in §4.1) and we train our estimator to regress the average.
The inter-annotator agreement is 0.64 by Pearson's r correlation. Even though there is no value range restriction for the estimator output, we found the scores returned by our ESQE after training are bounded from 0.0 to 4.0.

Multi-Stage Training
Supervised Pretraining. We pretrain the extractor and the abstractor separately using supervised learning. To this end, we first create "proxy" sentence labels by checking word overlap between the subject and the body sentence. For each sentence in the body, we label it as "positive" if there is some token overlap of non-stopwords with the subject, and "negative" otherwise. The multi-sentence extractor is trained to predict "positive" sentences by minimizing the cross-entropy loss. For the multi-sentence abstractor, we create training examples by pairing the "positive" sentences and the original subject in the training set. Then the abstractor is trained to generate the subject by maximizing the log-likelihood.
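The proxy labeling step can be illustrated with a short sketch. The stopword list, the regex tokenizer, and the function names below are hypothetical stand-ins chosen for a self-contained example; the paper does not specify them here.

```python
import re

# Assumption: a tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "i", "you", "to", "for", "of", "in", "on",
             "is", "be", "if", "so", "that", "this", "they", "your", "me"}


def tokenize(text):
    """Lowercase and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def proxy_labels(subject, body_sentences):
    """Label a sentence 1 ("positive") if it shares a non-stopword token with the subject."""
    subject_tokens = set(tokenize(subject)) - STOPWORDS
    return [int(bool(subject_tokens & set(tokenize(s)))) for s in body_sentences]


def abstractor_example(subject, body_sentences):
    """Pair the concatenated "positive" sentences with the original subject."""
    labels = proxy_labels(subject, body_sentences)
    source = " ".join(s for s, y in zip(body_sentences, labels) if y)
    return source, subject
```

Applied to the email in Table 1 with the subject "Current Job Description Needed", only the first sentence is labeled positive, since it is the only one mentioning "job", "description", or "current".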
RL Training for Extractor. To formulate the RL task at this stage, we treat the extractor as an agent, while the abstractor is pretrained and fixed. The ESQE provides the reward by judging the output subject. At each time step t, the agent observes a state $s_t = (D, d_{o_{t-1}})$ and samples an action $a_t$, picking a sentence from the distribution in Equation 4 under the policy network $\pi_\theta$ described in Section 3.1, which has a set of trainable parameters $\theta$.
The episode finishes after T actions, when the extractor picks the "end-of-extraction" signal. Then, the abstractor generates a subject from the extracted sentences and the quality estimator calculates the score. The score from the quality estimator is the reward received by the extractor. For training, we maximize the expected reward, with the gradient given by the REINFORCE algorithm (Williams, 1992), where $b_t$ is the baseline reward introduced to reduce the high variance of gradients. The baseline network has the same architecture as the decoder of the extractor, but it has another set of trainable parameters $\theta_b$ and predicts the reward by minimizing the mean squared error between its prediction and the received reward.
We first use automatic metrics from text summarization and machine translation: (1) ROUGE (Lin, 2004), including F1 scores of ROUGE-1, ROUGE-2, and ROUGE-L. (2) METEOR (Denkowski and Lavie, 2014). They all rely on one or more references and measure the similarity between the output and the reference. In addition, we include ESQE, which is a reference-less metric.
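Returning to the RL phase for the extractor, the objective and its gradient can be written out explicitly. The block below is a sketch of the standard REINFORCE-with-baseline formulation under the definitions given in the text (reward from SQ, policy $\pi_\theta$, baseline $b_t$ with parameters $\theta_b$); the symbols $r$, $J$, and $L_b$ are names introduced here, and the exact form of the paper's own equations may differ.

```latex
% Reward: the quality estimate of the generated subject s for email body D
r = \mathrm{SQ}(D, s)

% Objective: expected reward under the extraction policy
J(\theta) = \mathbb{E}_{a_{1:T} \sim \pi_\theta}\left[\, r \,\right]

% REINFORCE gradient with a learned baseline b_t
\nabla_\theta J(\theta) =
  \mathbb{E}_{a_{1:T} \sim \pi_\theta}\left[
    \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\left(r - b_t\right)
  \right]

% Baseline network: regress the observed reward
L_b(\theta_b) = \frac{1}{T} \sum_{t=1}^{T} \left(b_t - r\right)^2
```

Because the same terminal reward $r$ is credited to every extraction step, subtracting $b_t$ only reduces the variance of the gradient estimate and leaves its expectation unchanged.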
Human Evaluation. While those automatic scores are quick and inexpensive to calculate, only our quality estimator is designed for evaluation of subject line generation. Therefore, we also conduct an extensive human evaluation on the overall score and two aspects of email subject quality: informativeness and fluency. An email subject is informative if it contains accurate details that are consistent with the body, and it is fluent if it is free of grammatical errors. We show the email body along with different system outputs as potential subjects (the models are anonymous). For each subject and each aspect, the human judge chooses a rating from 1 for Poor, 2 for Fair, 3 for Good, and 4 for Great. We randomly select 500 samples and have each rated by 3 human judges.

Baselines
To benchmark our method, we use several methods from the summarization field, including some recent state-of-the-art systems, because the email subject line can be viewed as a short summary of the email content. They can be clustered into two groups.
(1) Unsupervised extractive and/or abstractive summarization. LEAD-2 directly uses the first two sentences as the subject line. We choose LEAD-2 to include both the greeting and the first sentence of the main content. TextRank (Mihalcea and Tarau, 2004) and LexRank (Erkan and Radev, 2004) are two graph-based ranking models that extract the most salient sentence as the subject line. Shang et al. (2018) use a graph-based framework to extract topics and then generate a single abstractive sentence for each topic under a budget constraint.
(2) Neural summarization. See et al. (2017) augment the standard encoder-decoder network by adding the ability to copy words from the source text and using the coverage loss to avoid repetitive generation. Paulus et al. (2018) propose neural intra-attention models with a mixed objective of supervised training and policy learning. We also include the systems of Hsu et al. (2018) and Narayan et al. (2018a). It is unclear how these models perform on email subject line generation, which requires extremely abstractive summarization. We train these models on our dataset.

Implementation Details
Our Model.
We pretrain 128-dimensional word2vec (Mikolov et al., 2013) on our corpus as initialization and update word embeddings during training. We use single-layer bidirectional LSTMs with 256 hidden units in all models. The convolutional sentence encoders have filters with window sizes (3, 4, 5) and there are 100 filters for each size. The batch size is 16 for all training phases. We use the Adam optimizer (Kingma and Ba, 2015) with learning rates of 0.001 for supervised pretraining and 0.0001 for RL. We apply gradient clipping (Pascanu et al., 2013) with an L2-norm of 2.0. Training is stopped early if the validation performance does not improve for 3 consecutive epochs. All experiments are performed on a Tesla K80 GPU. All submodels converge within 1-2 hours and 10 epochs, so the whole training takes about 4 hours.
Baselines. For TextRank and LexRank, we use the sumy implementation, which uses the Snowball stemmer and the sentence and word tokenizers from NLTK. For Shang et al. (2018), we use their extension of the Multi-Sentence Compression Graph (MSCG) of Filippova (2010) and a budget of 10 words in the submodular maximization. We choose the number of communities from [1, 2, 3, 4, 5] based on the dev set and we find that 1 works best. For the Pointer-Generator Network from See et al. (2017), we follow their implementation and use a batch size of 16. For Paulus et al. (2018), we use an implementation from Keneshloo et al. (2018). We did not include the intra-temporal attention and the intra-decoder attention because they hurt the performance. For Hsu et al. (2018), we follow their code with a batch size of 16. All training is stopped early based on the dev set performance.
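For reference, the hyperparameters listed for our model can be collected in a single configuration object, together with the early-stopping rule. The class name, field names, and the `should_stop` helper are hypothetical and only illustrate one way to organize these settings; they are not taken from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters reported in the text (field names are assumptions)."""
    embedding_dim: int = 128            # word2vec dimension
    lstm_hidden: int = 256              # hidden units per single-layer bi-LSTM
    conv_window_sizes: tuple = (3, 4, 5)
    filters_per_size: int = 100
    batch_size: int = 16
    lr_supervised: float = 1e-3         # Adam, supervised pretraining
    lr_rl: float = 1e-4                 # Adam, RL phase
    grad_clip_norm: float = 2.0         # L2-norm gradient clipping
    patience: int = 3                   # early-stopping patience in epochs

    @property
    def sentence_dim(self):
        """Size of the concatenated CNN sentence representation."""
        return self.filters_per_size * len(self.conv_window_sizes)


def should_stop(val_scores, patience):
    """True once validation has not improved for `patience` consecutive epochs."""
    if not val_scores:
        return False
    # Index of the first occurrence of the best score (ties count as no improvement).
    best_epoch = max(range(len(val_scores)), key=val_scores.__getitem__)
    return len(val_scores) - 1 - best_epoch >= patience
```

With three window sizes and 100 filters each, the concatenated sentence representation has 300 dimensions.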

Automatic Metric Evaluation
We report the automatic metric scores against the original subject and the subjects generated by Turkers (human annotations) as references in Tables 3a and 3b respectively. Table 4 also shows the ESQE scores. Overall, our method outperforms the other baselines in all metrics except METEOR. Other systems can achieve higher METEOR scores because METEOR emphasizes recall (recall is weighted 9 times more than precision) and extractive systems such as LexRank can generate longer sentences as subject lines.
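The effect of a recall-heavy weighting can be seen with a small worked example. The sketch below assumes the weighted harmonic mean form $F = PR / (\alpha P + (1 - \alpha) R)$ with $\alpha = 0.9$, which corresponds to weighting recall 9 times more than precision; it deliberately ignores METEOR's alignment and fragmentation penalty, and the precision and recall values are made up for illustration.

```python
def weighted_f_mean(precision, recall, alpha=0.9):
    """Weighted harmonic mean of precision and recall; alpha=0.9 favors recall 9:1."""
    if precision == 0 or recall == 0:
        return 0.0
    return precision * recall / (alpha * precision + (1 - alpha) * recall)


# Hypothetical outputs: a short, precise subject vs. a long extracted sentence.
short_subject = {"precision": 1.0, "recall": 0.5}
long_sentence = {"precision": 0.3, "recall": 1.0}

recall_heavy_short = weighted_f_mean(**short_subject)            # about 0.53
recall_heavy_long = weighted_f_mean(**long_sentence)             # about 0.81
balanced_short = weighted_f_mean(**short_subject, alpha=0.5)     # plain F1, about 0.67
balanced_long = weighted_f_mean(**long_sentence, alpha=0.5)      # plain F1, about 0.46
```

Under the recall-heavy weighting the long sentence wins, while under a balanced F1 the short subject wins, which is consistent with long extractive outputs scoring well on METEOR.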
In Table 3a, where the original subject is the single reference, our system scores close to, and in some cases higher than, the human annotation on both sets. This is because our system is trained on the original subject and is likely a better domain fit. In Table 3b, all systems use two human annotations as the reference to have a fair comparison to the human-to-human agreement in the last row. Our system output is actually rated a bit higher than the original subject. This is because the original subject can differ from the human annotation when the sender and the recipient share some background knowledge hidden from the email content. Furthermore, in the last row, the human-to-human agreement is much higher than all the system outputs and the original subject. This indicates that different annotators write

Table 4: ESQE scores.
System                               Dev    Test
LEAD-2                               1.56   1.55
TextRank (Mihalcea and Tarau, 2004)  1.59   1.59
LexRank (Erkan and Radev, 2004)      1.57   1.56
Shang et al. (2018)                  2.10   2.09
See et al. (2017)                    2.22   2.19
Paulus et al. (2018)                 2.30   2.30
Hsu et al. (2018)                    1.44   1.46
Narayan et al. (2018a)               1

Human Evaluation
Table 5 shows that our system is rated higher than the baselines on the overall, informativeness, and fluency aspects. For overall scores, the baselines are all between 1.5 and 2.0, indicating the subjects are usually considered poor or fair (recall that the scale is 1-4, with 4 being the highest). Our system is at 2.28, while the original subject and human annotation are between 2.5 and 3.0. This means more than half of our system outputs are at least fair, and the original subject and human annotation are often good or great. We also find that in 89 out of 500 emails, our system outputs have ratings higher than or equal to the original and human annotated subjects. Furthermore, the raters prefer the human annotated subject to the original subject.

Metric Correlation Analysis
It is important to check if the automatic metric scores can truly reflect the generation quality and serve as valid metrics for subject line generation. Therefore, in Table 6, we investigate their correlations with the human evaluation. To this end, we take the average of three human ratings and then calculate Pearson's r and Spearman's ρ between different automatic scores and the average human rating. We also report the inter-rater agreement in the last row by checking the correlation between the third human rating and the average of the other two. We find that the inter-rater agreement is moderate, with 0.64 for Pearson's r and 0.58 for Spearman's ρ. We would recommend ESQE because it has the highest correlations while being reference-less.
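The correlation analysis described above can be sketched in pure Python. The function names are hypothetical, ties in Spearman's ρ are handled with average ranks, and the inputs are assumed to be non-constant (otherwise the denominator is zero); this is an illustration of the procedure, not the analysis script behind Table 6.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson_r(average_ranks(x), average_ranks(y))


def metric_vs_human(metric_scores, ratings):
    """Correlate a metric with the mean of the three human ratings per example."""
    human = [sum(r) / len(r) for r in ratings]
    return pearson_r(metric_scores, human), spearman_rho(metric_scores, human)


def inter_rater_agreement(ratings):
    """Correlate the third rating with the average of the other two."""
    third = [r[2] for r in ratings]
    others = [(r[0] + r[1]) / 2 for r in ratings]
    return pearson_r(third, others), spearman_rho(third, others)
```

Here `ratings` is a list of `(r1, r2, r3)` tuples, one per evaluated subject, on the 1-4 scale.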

Case Study
Table 7 shows examples of our model outputs. Our model works well by first picking multiple sentences containing information such as named entities and dates and then rewriting them into a succinct subject line preserving the key information.
In Example 7a, our model extracts sentences with the name of the company and position "KWI President of the Americas". It also captures the importance of the opportunity for this position. Similarly, in Example 7b, our model identifies "Western Power Trading" for "filings". In Example 7c, our model identifies the date of degree "December 2011" and action item "application". However, we also found our model can fail on emails about novel topics, as in Example 7d where the topic is scheduling farewell drinks. Our model only captures the name of the restaurant but not the purpose and topic since it has not seen this kind of email in training.
Another related line of research is natural language generation. Our task is most similar to single document summarization because the email subject line can be viewed as a short summary of the email content. Therefore, we use different summarization models as baselines, with techniques such as graph-based extraction and compression and sequence-to-sequence neural abstractive summarization with hierarchical attention, copy, and coverage mechanisms. In addition, RL has become increasingly popular for text generation to optimize non-differentiable metrics and to reduce the exposure bias in traditional "teacher forcing" supervised training (Ranzato et al., 2016; Bahdanau et al., 2017; Zhang and Lapata, 2017; Sakaguchi et al., 2017). For example, Narayan et al. (2018b) use RL for ranking sentences in pure extractive summarization.

Conclusions and Future Work
In this paper, we introduce the task of email subject line generation. We build a benchmark dataset (AESLC) with crowdsourced human annotations on the Enron corpus and evaluate automatic metrics for this task. We propose our model of subject generation by Multi-Sentence Selection and Rewriting with Email Subject Quality Estimation Reward. Our model outperforms several competitive baselines and approaches human-level performance.
In the future, we would like to generalize it to multiple domains and datasets. We are also interested in generating more effective and appropriate subjects by incorporating prior email conversations, social context, the goal and style of emails, and personality, among others.

A Amazon Mechanical Turk
Figure 2 shows the Amazon Mechanical Turk interface for workers to write the email subject from the body. Figure 3 shows the interface for the email subject evaluation. For quality control, we include a random subject. Annotators who consistently give high ratings for Random subjects or low ratings for Human Annotation subjects are excluded from our analysis. This filtering resulted in a total of 389 examples with 3 valid ratings each, of which we take the average.

Figure 2: Amazon Mechanical Turk job interface for the email subject annotation.

Figure 3: Amazon Mechanical Turk job interface for the email subject evaluation. All the other system outputs are in the same job but are not shown here for brevity.

Table 2: Annotated Enron Subject Line Corpus compared with other datasets.
* indicates there is no statistically significant difference from our system with p < 0.01 under a paired t-test.

Table 5: Human evaluation. * indicates the difference from our system is statistically significant with p < 0.01 under a paired t-test.

Table 7: Case study. The sentences extracted by our model are underlined. (a)(b)(c): Our model can generate effective subjects by extracting and rewriting multiple sentences containing salient information. (d): Our model fails to generate reasonable subjects for the novel topic of "farewell", which is not seen during training.