Work Hard, Play Hard: Email Classification on the Avocado and Enron Corpora

In this paper, we present an empirical study of email classification into two main categories, "Business" and "Personal". We train on the Enron email corpus, and test on the Enron and Avocado email corpora. We show that information from the email exchange networks improves classification performance. We represent the email exchange networks as social networks with graph structures. For this classification task, we extract social network features from the graphs in addition to lexical features from email content, and we compare the performance of SVM and Extra-Trees classifiers using these features. Combining graph features with lexical features improves the performance of both classifiers. We also provide manually annotated sets of the Avocado and Enron email corpora as a supplementary contribution.


Introduction
Email has quickly become a crucial communication medium for both individuals and organizations. Kiritchenko and Matwin (2011) show that a typical user receives 40-50 emails daily. Because of its popularity, different research problems related to email classification tasks have arisen. These tasks include spam filtering, assigning priority to messages, and foldering messages according to a user-specified strategy (Klimt and Yang, 2004). In spite of the popularity of email, many classification tasks have been hampered by the lack of task-related data, due to the privacy issues surrounding email. However, two large data sets are available. First, a large dataset of real emails, the Enron corpus, was made publicly available by the Federal Energy Regulatory Commission (FERC) during the legal investigation of the company's collapse. Second, in February 2015, the Linguistic Data Consortium distributed a data set of emails from an anonymous defunct information technology company referred to as Avocado (Oard et al., 2015).
In this paper, we present an empirical study on email classification into two categories: Business and Personal. We train only on the Enron corpus, but test on both the Enron and Avocado corpora for this classification task in order to investigate how dependent on the training corpus the learned models are. In addition, we provide new annotated datasets based on the two corpora 1 .
We manually annotated datasets based on the Enron and Avocado corpora for this classification task. We use lexical features as well as social network features extracted from the email exchange networks of both Enron and Avocado. The experiments show that combining the social network features with lexical features outperforms the lexical features alone. We first present some related work on both the Enron and Avocado corpora (Section 2). Then, in Section 3, we describe the datasets and the annotation scheme used in this paper. We discuss lexical features in Section 4, and show how to extract social network features from the email exchange network in Section 5. Finally, we present experiments with different settings (Section 6). The experiments show that adding features extracted from graphs of the email exchange network to the lexical features improves classification performance.

Related Work
Since the Enron corpus was made publicly available, many researchers have worked on it for different tasks. To our knowledge, the previous effort most closely related to this paper is that of Jabbari et al. (2006). They released a large set of manually annotated emails, in which they categorize a subset of more than 12,000 Enron emails into two main categories, "Business" and "Personal", and then into the subcategories "Core Business" and "Close Personal", which correspond to the two main categories respectively. The "Core Business" category has more than 4,500 emails while "Close Personal" has more than 1,800. We compare our data to their data in detail in Section 3. Agarwal et al. (2012) released a gold standard of the Enron power hierarchy and predict the dominance relations between two employees using the degree centrality of the email exchange network. They released this gold standard of the Enron corpus with thread structure as a MongoDB database. Hardin et al. (2014) study the relation between six social network centrality measures and the hierarchical ranking of Enron employees. Mitra and Gilbert (2013) study gossip in the Enron corpus. They use the data set of Jabbari et al. (2006) to study the proportion of gossip in business and personal emails and find that gossip appears in both personal and business emails and at all levels of the organizational hierarchy. They use an NER classifier to label person names in emails, then classify as gossip any email mentioning a person who is neither the sender nor in the recipient list.
A related task is to predict the recipient of an email. Graus et al. (2014) propose a generative model to predict the recipient of an email using the email communication graph and the email content. The model is trained on Enron and tested on Avocado. The full enterprise email exchange network is used to build the communication graph as a directed graph, as we do in Section 5. They report that the optimal performance is achieved by combining the communication graph and email content.

Datasets and Annotation Scheme
As a part of the work in this paper, we have used the Amazon Mechanical Turk (AMTurk) crowdsourcing platform to annotate a subset of the Enron corpus. In addition, due to license constraints, we have in-house annotated a subset of the Avocado corpus. We use these two sets as well as the dataset distributed by Jabbari et al. (2006) (which we refer to as the "Sheffield set") for the classification task in this paper.

Labeling
Unlike Jabbari et al. (2006), we are interested in maintaining the thread structure of emails (for future work). Annotators were given email threads of various lengths and asked to annotate each email in the thread and to annotate the thread as a whole. However, classifying email content into business and personal can be a subjective task. For example, if an email talks about an invitation to a picnic for the employees' families, one annotator might label it as a business email, on the grounds that it concerns a business-related event. On the other hand, another annotator might take the view that this is a personal event even though it is organized by the company.
We provided instructions for the annotators to annotate each email with one of the following labels, using these criteria:
1. Business: The content of the message is clearly professional (even if the language used is very friendly) and it does not contain any personal content; it should be related to the company's work.
2. Somehow Business: The main purpose of the message is professional but it has some personal parts.
3. Mixed: The content of the message belongs to two or more of the categories (typically because the sender combines different content in one email).

4. Somehow Personal: The main purpose of the message is personal but it has some business-related content.
5. Personal: The content of the message is clearly personal (even if the language used is very formal) and it does not contain any professional part.
6. Cannot Determine: There is not enough content to determine the category.
We added some detailed instructions to deal with certain cases:
• If a message is about a social event inside the company, such as celebrating an employee's new baby or a career promotion, it belongs to the second category ("somehow business").
• If a message is about a social event outside the company but still related to the company, such as a picnic (usually family members are invited), it belongs to the fourth category ("somehow personal").
• If a message is about a social event which is not related to the company such as a charity but company employees are encouraged to participate, it belongs to the fourth category ("somehow personal").
• If a message is too short to determine its category (or even empty), it should have the same category as the message it is responding to, or the message it is forwarding.
• If a message is ambiguous, try to read other messages in the thread to clarify.
• If a message is spam or in the rare case that the first message of a thread is very short or empty, say "cannot determine".

Annotators
In the AMTurk task (i.e. Enron), each email thread was annotated by three different Turkers. The group of Turkers differs from one thread to another. We first ran several batches on AMTurk in which we assigned five annotators to each HIT; by studying the resulting data sets, we found that three annotators are sufficient and less costly, so most of the data was annotated by three Turkers.
To determine the consensus label, we give each of the categories in the above list a numerical label between 1 and 6, with 6 being "cannot determine" and otherwise a larger number indicating that the email is more personal. First, we discard any "cannot determine" label: if there are one or more labels other than "cannot determine", we limit voting to those labels; if all labels are "cannot determine", the result of voting is "cannot determine" too. Then, we compute the majority vote of all labels from the three Turkers; in case of ties, we take the floor of the mean of the tied labels. For instance, if the labels are {1, 2, 6}, the majority vote reduces to the tie {1, 2}. The mean is 1.5 and the floor is 1. Thus, the final label is 1. There are 5,372 (50.8%) emails on which all annotators gave the same label.

The average standard deviation of the ordinal values (i.e. 1: Business, 2: Somehow Business, etc.) over Enron emails is 0.37. For computing this average, we exclude any "Cannot Determine" label before computing the standard deviation per email, and we exclude any email with fewer than two labels other than "Cannot Determine". We do so because "Cannot Determine" has no actual ordinal value.

For the annotation of the Avocado corpus, we hired two in-house undergraduate students to annotate two overlapping subsets of the Avocado corpus, using the same instructions as we gave the Turkers. The licensing conditions for this corpus appear to prohibit using AMTurk. In case of disagreement in Avocado ∪ (described in 3.4), we arbitrarily choose the first annotator's label for consistency, unless the first is "cannot determine", in which case we choose the second. The average standard deviation of the ordinal values over Avocado emails is 0.08.
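The consensus-labeling rule described above (discard "cannot determine", take the majority vote, break ties with the floor of the mean) can be sketched as follows; this is our reading of the procedure, not code from the paper:

```python
import math
from collections import Counter

CANNOT_DETERMINE = 6  # labels 1-5 are ordinal (1: Business ... 5: Personal)

def consensus_label(labels):
    """Majority vote over annotator labels, ignoring 'Cannot Determine';
    ties are resolved by taking the floor of the mean of the tied labels."""
    votes = [l for l in labels if l != CANNOT_DETERMINE]
    if not votes:                       # all annotators said "Cannot Determine"
        return CANNOT_DETERMINE
    counts = Counter(votes)
    top = max(counts.values())
    tied = [label for label, c in counts.items() if c == top]
    if len(tied) == 1:
        return tied[0]
    # e.g. {1, 2, 6}: discard 6, tie {1, 2}, mean 1.5, floor -> 1
    return math.floor(sum(tied) / len(tied))
```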
Since we have only two annotators, we exclude any email labeled "Cannot Determine" by either annotator. The inter-annotator agreement on Avocado emails is κ = 0.58 (Cohen's kappa). 2 The complex labeling scheme described here will be useful for different tasks in the future. However, for the goal of this paper, we group these labels into two binary classes, business and personal. We normalize the labels as follows: we group "Business" and "Somehow Business" into one category, "Business", and "Personal", "Somehow Personal" and "Mixed" into one category, "Personal". "Cannot Determine" remains the same.
Finally, we exclude emails with labels other than "Business" or "Personal" (i.e. emails labeled "Cannot Determine"). These emails are discarded in both training and evaluation. This label is very rare; it occurs in only 0.26% of the Enron data and 0.38% of the Avocado data.
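The label grouping can be sketched as a simple mapping (a hypothetical helper; the numeric encodings are our own):

```python
# Collapse the six fine-grained labels to binary Business/Personal;
# "Cannot Determine" emails are discarded entirely.
BUSINESS, SOMEHOW_BUSINESS, MIXED, SOMEHOW_PERSONAL, PERSONAL, CANNOT_DETERMINE = range(1, 7)

BINARY = {
    BUSINESS: "Business",
    SOMEHOW_BUSINESS: "Business",
    MIXED: "Personal",
    SOMEHOW_PERSONAL: "Personal",
    PERSONAL: "Personal",
}

def normalize(labeled_emails):
    """Map (email, fine_label) pairs to binary classes, dropping 'Cannot Determine'."""
    return [(e, BINARY[l]) for e, l in labeled_emails if l != CANNOT_DETERMINE]
```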

Enron Datasets
The emails annotated by Turkers are a subset of the Enron corpus released by Agarwal et al. (2012), which has more than 36,000 threads and 270,000 emails. We chose this version of Enron because it maintains the thread structure of emails. From this collection, we randomly sampled a total of 3,941 threads with different numbers of emails per thread (2, 3, 4, and 5). The total number of emails is 10,573. We exclude 198 threads (5%) and 27 additional emails (0.26%) labeled "Cannot Determine". The sample has 3,222 emails overlapping with the Sheffield set of Jabbari et al. (2006) (after excluding "Cannot Determine" emails). We also exclude all emails in the Sheffield set that we could not match with an email in (Agarwal et al., 2012). After obtaining the final labels as described in 3.2, we obtained 3,743 threads and 10,546 emails labeled as either "Business" or "Personal" from the Enron corpus. Table 1 summarizes the Enron datasets with the following notation:
• Enron T : The threads and emails obtained from AMTurk as in 3.2.
• Sheffield all : All the Sheffield set except those that we could not match in (Agarwal et al., 2012).
2 We treat classes as completely different categories when computing Cohen's kappa.
• Sheffield sub : A subsample of the Sheffield set ("Core Business" and "Close Personal").
• Enron ∩A : The intersection between Enron T and Sheffield all in which both agree in labels.
• Enron ∩D : The intersection between Enron T and Sheffield all in which they disagree in labels.
• Enron ∩ : The intersection between Enron T and Sheffield all .
In case of disagreement, we use Sheffield all labels.

Avocado Datasets
The Avocado Email Collection has 62,278 threads and 937,958 emails. We randomly sampled a total of 2,000 threads and 5,339 emails from the Avocado corpus with different numbers of emails per thread, as in Enron.
As described in Section 3.2, each annotator labeled 1,200 threads, with 400 threads in common. The first annotator labeled 3,197 emails, the second labeled 3,207, and 1,065 emails are in common. After obtaining the final labels as described in Section 3.2, we obtained a total of 1,976 threads and 5,280 emails labeled as either "Business" or "Personal" from the Avocado corpus. Table 2 summarizes the Avocado datasets with the following notation:
• Avocado 1 : The threads and emails labeled by the first annotator as in 3.2.
• Avocado 2 : The threads and emails labeled by the second annotator as in 3.2.
• Avocado ∩A : The intersection between Avocado 1 and Avocado 2 in which both agree in labels.
• Avocado ∩D : The intersection between Avocado 1 and Avocado 2 in which they disagree in labels.

Train, Development and Test Sets
For the binary classification task in this paper, only emails are used as data points; we defer the classification of threads to future work. We use three datasets for the experiments, namely Enron ∪ , Enron ∩A , and Avocado ∪ (described in Section 3.3 and Section 3.4). Enron ∪ and Enron ∩A are divided into train, development, and test sets with 50%, 25%, and 25% of the emails, respectively. Avocado ∪ is divided equally into development and test sets (since we will not train on Avocado). For the rest of this paper, we refer to the train, development, and test sets by the subscripts tr, dev, and ts, respectively.
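A minimal sketch of the 50/25/25 email-level split; the paper does not specify a random seed or stratification, so this helper is illustrative only:

```python
import random

def split_50_25_25(items, seed=0):
    """Shuffle the emails and split into 50% train, 25% dev, 25% test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    a = n // 2                     # end of the training portion
    b = a + (n - a) // 2           # end of the development portion
    return items[:a], items[a:b], items[b:]
```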

Lexical and Local Features
For the classification task, we use pre-trained GloVe embedding vectors as lexical features (Pennington et al., 2014). There are various word vector sets available online, each trained on a different corpus and embedded into various dimension sizes. We use GloVe pre-trained word vector sets such that each email is represented by a vector with a fixed number of dimensions equal to the dimensionality of the GloVe word vector set. We average all word vectors in the email using the pre-trained word vectors as follows:

e_j = ( Σ_i f_{e_j, v_i} · v_i ) / ( Σ_i f_{e_j, v_i} )

Here, f_{e_j, v_i} is the frequency in email e_j of the word corresponding to vector v_i, and v_i is the word embedding vector in the GloVe set. Both the body and subject are included in the email content. In addition to these lexical features, we use the number of recipients and the length of the email (in words) as meta-information that can be extracted from the email locally, without looking at the email exchange network.
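The averaging above can be sketched as follows, assuming `glove` is a dictionary from words to NumPy vectors (e.g. loaded from a pre-trained GloVe file). Averaging the token list with repeats is exactly the frequency-weighted mean in the formula:

```python
import numpy as np

def email_vector(tokens, glove, dim=300):
    """Average the GloVe vectors of all in-vocabulary tokens in the email
    (subject + body). Repeated tokens contribute repeatedly, which equals
    the frequency-weighted mean; out-of-vocabulary tokens are skipped."""
    vecs = [glove[t] for t in tokens if t in glove]
    if not vecs:                       # no known words: fall back to zeros
        return np.zeros(dim)
    return np.mean(vecs, axis=0)
```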

Social Network Features
The email exchange network can be represented as a social network with different structures. One possible structure is a bipartite graph with two disjoint sets of nodes, emails and employees (i.e. email addresses), such that an edge between an email and an employee exists if and only if that employee's email address appears as either the sender or a recipient of that email; we refer to this structure as the email-centered network. Another structure is a graph (not necessarily bipartite) whose nodes represent employees (i.e. email addresses) and whose edges represent email communication, such that an edge exists if at least one email has been exchanged between the two end nodes; we refer to this structure as the address-centered network. Figure 1 illustrates these two types of graphs. In both graphs we normalize multiple email addresses belonging to the same person into one email address (node).
For each corpus (i.e. Enron and Avocado), we construct directed and undirected graphs from these two networks (i.e. email-centered and address-centered). In directed graphs, each edge has a source and a destination node, which explicitly reflects the directionality of the email (i.e. sender and recipients), while in undirected graphs the directionality of communication is not reflected in the edges. In the case of the address-centered graph, the edge weight reflects the number of emails exchanged between the two endpoints in that direction; in the case of the email-centered network, the weights are always 1. Different features can be extracted from these types of graphs.
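The two directed constructions can be sketched with plain dictionaries (an illustrative helper; the paper does not describe its implementation). Here `emails` is assumed to be an iterable of (email_id, sender, recipients) tuples with addresses already normalized:

```python
from collections import Counter

def build_graphs(emails):
    """Build both directed networks as edge collections."""
    # Email-centered: bipartite edges sender -> email and email -> recipient,
    # always with weight 1 (hence a set).
    email_centered = set()
    # Address-centered: directed edge (sender, recipient) weighted by the
    # number of emails exchanged in that direction.
    address_centered = Counter()
    for email_id, sender, recipients in emails:
        email_centered.add((sender, email_id))
        for rcpt in recipients:
            email_centered.add((email_id, rcpt))
            address_centered[(sender, rcpt)] += 1
    return email_centered, address_centered
```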
We use the whole exchange network, including all labeled and unlabeled emails, to build these graphs. We include features from both the sender and the recipients (either in the "to" or "cc" list). If an email has multiple recipients, we average the value of each feature over the recipients. Table 3 summarizes the social network features, which include:
• Betweenness centrality, based on σ(s, t), the number of shortest paths between s and t, and σ(s, t|v), the number of these paths that pass through v.
• Eigenvector centrality: for a node v, the component x_v of the eigenvector x corresponding to the largest eigenvalue of the adjacency matrix A, so that Ax = λx.
• Closeness centrality, based on d(v, u), the shortest-path distance between v and u.
• Authority and hub scores, computed using the HITS algorithm (Kleinberg, 1999).
As marked in Table 3, each feature is extracted from the directed and/or undirected graphs, for senders and recipients in the address-centered network and for emails in the email-centered network, either using edge weights or treating all edge weights as equal to 1.
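A sketch of extracting such features with the networkx library (our choice; the paper does not name an implementation). For simplicity, eigenvector centrality is computed here on the undirected view of the graph:

```python
import networkx as nx

def node_features(G):
    """Per-node network features for one directed graph G."""
    feats = {
        "betweenness": nx.betweenness_centrality(G),
        "eigenvector": nx.eigenvector_centrality(G.to_undirected(), max_iter=1000),
        "closeness": nx.closeness_centrality(G),
    }
    # HITS yields the hub and authority scores (Kleinberg, 1999).
    hubs, auths = nx.hits(G, max_iter=1000)
    feats["hub"], feats["auth"] = hubs, auths
    return feats

def email_feature(feats, name, sender, recipients):
    """One feature for one email: the sender's value, plus the value
    averaged over all recipients (as described above)."""
    vals = [feats[name].get(r, 0.0) for r in recipients]
    avg = sum(vals) / len(vals) if vals else 0.0
    return feats[name].get(sender, 0.0), avg
```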

Experiments
In this section, we present empirical results on the email classification task by conducting different experiments on lexical and social network feature sets. We use three metrics to measure performance, namely accuracy, Business F-1, and Personal F-1. We are mainly interested in improving the Personal F-1 score since it is the minority class. We compare the performance of SVM classifiers and extremely randomized trees (commonly known as Extra-Trees) (Geurts et al., 2006) as implemented in the scikit-learn Python library (Pedregosa et al., 2011). We tune the hyper-parameters using grid-search with 3-fold cross-validation on the training set. Table 4 shows the grid-search parameter space for the two classifiers (B: Business, P: Personal; balanced: class weights are adjusted inversely proportionally to class frequencies in the training set). As a preprocessing step, we apply a logarithmic transformation to the network and meta-information feature values so that their distributions are approximately normal. Then, all feature values (i.e. lexical, network and meta-info) are standardized to zero mean and unit variance.
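The preprocessing and tuning setup can be sketched as follows; the parameter grid here is illustrative, not the actual search space of Table 4:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def tune(X_network, X_lexical, y):
    """Log-transform network/meta-info features toward normality, standardize
    everything, then grid-search an SVM with 3-fold cross-validation."""
    X = np.hstack([np.log1p(X_network), X_lexical])
    pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])
    grid = {"clf__C": [0.1, 1, 10],                     # illustrative values
            "clf__class_weight": [None, "balanced"]}
    search = GridSearchCV(pipe, grid, cv=3, scoring="f1")
    return search.fit(X, y)
```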

Obtaining Best GloVe Vector Set
First, in order to obtain the GloVe vector set that maximizes performance, we experiment with different GloVe pre-trained vectors as lexical features (meta-information features are not included). Table 5 shows the classification results for the different GloVe pre-trained vector sets when training on Enron ∩A tr and testing on Enron ∩A dev . In addition, a bag-of-words (BOW) model is shown as a baseline. In this model, we represent each email as a vector of frequencies (term counts), then select the top 500 words using the χ2 feature selection method. In all models (i.e. GloVe vectors and BOW), we use SVM classifiers and tune parameters using grid-search.
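The BOW baseline can be sketched as a scikit-learn pipeline (our reading of the setup; tokenization details may differ):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def bow_baseline(k=500):
    """BOW pipeline: term counts -> top-k words by chi-squared -> linear SVM."""
    return Pipeline([
        ("counts", CountVectorizer()),        # term-frequency vectors
        ("chi2", SelectKBest(chi2, k=k)),     # keep the k most predictive words
        ("svm", LinearSVC()),
    ])
```

The `k` parameter is exposed so the same sketch works on toy data; the paper uses k = 500.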
The results show that, in general, more training data is better, and more dimensions are better. However, the best set is the 300-dimensional 42B.300d, which is trained on a large 42-billion-token corpus, rather than the larger embeddings trained on 840 billion tokens. We use these embeddings in all further experiments.

Experiments with Different Features and Sets
In this subsection, we perform experiments with different models tested on Enron ∪ dev and Avocado ∪ dev . We assume that the ultimate application of our work is a setting in which we train models on one company (i.e. Enron) and apply them to another company (i.e. Avocado). First, we tune the hyper-parameters using grid-search with 3-fold cross-validation on Enron ∪ tr and Enron ∩A tr three times: first using network and meta-information features only, second using lexical (embedding) features only, and third using all features.
Then, we select the best SVM and Extra-Trees models with the lexical features only and the models with all features. We apply a paired t-test on the personal F-1 scores of these models (i.e. SVM and Extra-Trees models with lexical features only and with all features) using 10-fold cross-validation.
The results of the paired t-test show that the improvement obtained from adding the network features is statistically significant on Enron ∪ tr (p < 0.05), but not on Enron ∩A tr (p > 0.05) using both SVM and Extra-trees classifiers.
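The significance test can be sketched with SciPy; the fold scores in the test below are placeholders, not the paper's numbers:

```python
from scipy.stats import ttest_rel

def significantly_better(scores_with_net, scores_lexical_only, alpha=0.05):
    """Paired t-test over matched cross-validation folds.
    Returns (is_significant, p_value) for the difference in personal F-1."""
    t, p = ttest_rel(scores_with_net, scores_lexical_only)
    return p < alpha, p
```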
To evaluate how well the models perform in an intra-corpus setting, we test on Enron ∪ dev using models trained on Enron ∪ tr with different classifiers and feature sets. The results show that adding network features helps retrieve more personal emails (increasing personal recall) with both classifiers. In addition, it is clear from the results that the network features are more effective with Extra-Trees, since adding them improves all the scores.
To evaluate the cross-corpora performance, we test on Avocado ∪ dev using different models trained on Enron ∪ tr and Enron ∩A tr . Table 7 summarizes the cross-corpora results. We use Enron ∩A tr in this experiment to test how well a model performs on another corpus when training on a dataset with few but high-confidence labels, in comparison with training on a larger dataset with labels of lesser confidence. The results show that a model trained on a large dataset with lesser-confidence labels (i.e. Enron ∪ tr ) using lexical features alone can retrieve many personal emails, but with poor precision. Unlike in the intra-corpus setting, adding network features always increases personal precision but decreases personal recall. However, the best performance as measured by F-1 is achieved by combining the network and lexical features and using SVMs, which is the same best configuration as in the intra-corpus evaluation setting. For the cross-corpora evaluation, the best result is achieved using the smaller training corpus with higher-quality labels.
In both settings (i.e. intra-corpus and cross-corpora), Extra-Trees classifiers struggle to retrieve personal emails, causing a decrease in the personal F-1 score in comparison with SVM classifiers.

Performance on the test set
Finally, we select the models with the highest F-1 score for each task (intra-corpus and cross-corpora), and then we test these models on Enron ∪ ts and Avocado ∪ ts . Table 8 shows the performance of the best models on the test sets. The results show that in an intra-corpus setting we can achieve a high personal F-1 score. It is also possible to obtain good performance on one corpus (i.e. Avocado) when training on another (Enron).

Conclusion and Future Work
In this paper, we have shown that emails can be classified into business and personal with good performance using conventional classifiers trained with pre-trained word embeddings that are available online. We performed different experiments on two corpora, Enron and Avocado. The cross-corpora results show that it is possible to classify the emails of one company, with good performance, using models trained on another company. In addition, we have shown that including features obtained from graphs representing the email exchange network improves classification performance.
We observe that the percentage of personal email decreases from 20% (in Enron) to less than 10% (in Avocado). It is not clear whether this is due to the nature of the two companies or to the spread of free email services such as Hotmail and Gmail.
In the future, we plan to experiment with additional features that capture more global properties of the network, using approaches such as graph spectral analysis and graph kernels.