CEREC: A Corpus for Entity Resolution in Email Conversations

We present the first large-scale corpus for entity resolution in email conversations (CEREC). The corpus consists of 6001 email threads from the Enron Email Corpus containing 36,448 email messages and 38,996 entity coreference chains. The annotation is carried out as a two-step process with minimal manual effort. Experiments evaluate different features and the performance of four baselines on the created corpus. For the task of mention identification and coreference resolution, a best performance of 54.1 F1 is reported, highlighting the room for improvement. An in-depth qualitative and quantitative error analysis is presented to understand the limitations of the baselines considered.


Introduction
Entity resolution is defined by the CoNLL 2012 (Pradhan et al., 2012) and MUC (Grishman and Sundheim, 1996) shared tasks as linking referring spans of text that point to the same discourse entity. The corpora used for this task primarily consist of text from news (Pradhan et al., 2012; Cybulska and Vossen, 2014; Recasens et al., 2010; Grishman and Sundheim, 1996), web-logs, and transcribed dialogs.
This research focusses on the entity resolution task for email conversations. Example 1 shows a sample email message and the corresponding entities. The boldfaced tokens represent entities and the numbers beside them represent coreference chain identifiers. An Entity is defined as an object or a group of objects in the real world, and a span of text referring to an entity is called a Mention. When all mentions in a text which refer to the same real-world entity are linked together, they form a coreference chain. Dakle et al. (2020) first studied entity resolution in email conversations using a small annotated corpus. Following the same task definition, this paper builds on their work and makes the following key contributions:
1. A large corpus for entity resolution in email conversations (CEREC), weakly annotated for mentions and coreference chains, is presented. Detailed corpus statistics are also discussed. The corpus will be released along with the paper (https://github.com/paragdakle/emailcoref).
2. Experiments with several baseline models are carried out and their results are reported. A qualitative and quantitative error analysis of the results is presented.
This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.
The paper is organized as follows: Section 2 reviews related work on email processing and on corpora for emails and entity resolution. Section 3 describes the corpus creation process and reports statistics on the created corpus. It also explores additional features and the experiments that evaluate them. Section 4 presents the baseline models, the experiments carried out, and the results obtained. Error analysis of the results is covered in Section 5, followed by a discussion on the problem of missing context in Section 6. Section 7 concludes the paper.

Related Work
One of the earliest works highlighting the challenges of the thread-like nature of email conversations was carried out by Lewis and Knowles (1997). Murakoshi et al. (2000) proposed the creation of extended contribution trees to better understand the conversation structure of email threads. The impact of coreference resolution on conversations in a threaded format was first studied by Hendrickx and Hoste (2009) using a corpus of blogs and commented news for opinion mining. Although the impact was negative, it was attributed to the poor performance of the coreference system. Coreference resolution for email conversations remains largely unexplored; to the best of our knowledge, the only prior work is by Dakle et al. (2020).
Numerous corpora have been used for email processing over time. University emails (Cohen and others, 1996; Cohen et al., 2004), email user surveys (Whittaker and Sidner, 1996; Brutlag and Meek, 2000), private emails (Manco et al., 2002; Corston-Oliver et al., 2004), simulated emails (Lam, 2002), and email archives (Nenkova and Bagga, 2004) are a few of the early sources for email corpora. The Enron Email Corpus (Klimt and Yang, 2004) was the first large public corpus containing emails of 150 employees of the Enron Corporation. Similarly, the Avocado Research Email Collection (Douglas Oard, 2015) consists of emails from 282 accounts of a now-defunct IT company.
The task of coreference resolution, specifically entity resolution, has received attention in the natural language research community since the 1960s, with noun-phrase and pronominal resolution being the early forms of the task. Although multiple corpora released over the years contain a small fraction of telephonic speech text, only a few corpora have focused on the study of the task in a purely conversational setting. The Character Identification Corpus (Chen and Choi, 2016) was the first corpus to focus on the entity-linking task in this setting. It was constructed using TV show transcripts with annotations for speakers in a multi-party conversation. Aktaş et al. (2018) used a Twitter corpus to study the performance of the Stanford statistical coreference system (Clark and Manning, 2015). They evaluated a corpus with 185 threads containing 278 coreference chains and reported mediocre performance by the model.

Seed Corpus
Dakle et al. (2020), in their study on entity resolution, released a small manually annotated corpus containing 46 email threads from the Enron Email Corpus (Klimt and Yang, 2004). The Enron Email Corpus is a multi-lingual corpus with the majority of email threads in English. The corpus consists of email threads organized in a directory structure for each user. The annotated corpus consists of 245 email messages with 866 coreference chains containing 5,834 mentions. Each mention refers to an entity of the type PERSON, ORGANIZATION, LOCATION, or DIGITAL. We use the seed corpus (SC) as the starting point for this work and the underlying Enron Email Corpus as the base corpus to create CEREC. Additionally, the annotation guidelines elaborated by Dakle et al. (2020) are followed in this research.

Extraction and Filtering
The first step in creating the larger corpus is to shortlist email threads from the Enron Email Corpus. An email thread is considered to be a valid conversation if it contains 4 or more email messages. To increase the size of the shortlisted pool of email threads, we do not restrict the scope only to email threads in the inbox directory. For each user, email threads in all directories except all documents, discussion threads, drafts, deleted items, sent items, sent, sent mail, and sent are considered. Email threads in these excluded directories are either auto-generated, discarded, or part of other email threads, so they are omitted. A total of 9,724 email threads with a minimum of 4 email messages in each thread are obtained after including additional directories.
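The shortlisting rule above can be sketched as a simple predicate over threads. This is only an illustration under stated assumptions: the `Thread` container is hypothetical, and the directory names are copied from the text as written, so their exact on-disk spellings in the Enron Email Corpus may differ.

```python
from dataclasses import dataclass, field

# Directory names as they appear in the text; on-disk spellings are an assumption.
EXCLUDED_DIRS = {
    "all documents", "discussion threads", "drafts",
    "deleted items", "sent items", "sent", "sent mail",
}
MIN_MESSAGES = 4  # a valid conversation has 4 or more email messages


@dataclass
class Thread:
    """Hypothetical container for one email thread."""
    user: str
    directory: str
    messages: list = field(default_factory=list)


def is_candidate(thread: Thread) -> bool:
    """Keep threads outside the excluded directories with enough messages."""
    in_allowed_dir = thread.directory.lower() not in EXCLUDED_DIRS
    return in_allowed_dir and len(thread.messages) >= MIN_MESSAGES


def shortlist(threads):
    """Return the candidate threads in their original order."""
    return [t for t in threads if is_candidate(t)]
```

A thread in an excluded directory is dropped regardless of its length, and a short thread is dropped regardless of its directory.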
On obtaining the initial set of candidate email threads, the following types of email threads are manually filtered from the resulting set:
1. Duplicates: An email thread that is part of a larger email thread or is a duplicate belongs to this category. The multi-recipient nature of email conversations results in one email thread possibly being present in directories of multiple users.
2. No content: Threads in which more than half of the email messages contain no body fall in this category.
3. Invalid attachments: The Enron Email Corpus consists of email threads with inline document attachments. Some email threads contain attachments as long hexadecimal strings and hence are labeled as invalid content.
4. Non-English content: Email threads in the base Enron corpus consist of messages or text in English, Spanish, Russian, German, and French. The scope of this work being restricted to English, email threads containing text in any other language are discarded.
In addition to the above types, we also discard any email threads which overlap partially or fully with those in SC, because SC is eventually used as the test set for all experiments. After filtering from the initial set, 6144 email threads are obtained. Table 1 gives a distribution of the initial email threads in each of the filtering categories. Furthermore, the unlabelled corpus contains a total of 37,315 email messages with an average thread length of 6 email messages.
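The filtering in this work is manual. As an illustration only, two of the categories above ("No content" and "Invalid attachments") could be pre-screened automatically before manual review. The function names and the 64-character threshold for a "long" hexadecimal run are assumptions made for this sketch, not part of the described procedure.

```python
import re

# A run of 64 or more hexadecimal characters; the threshold is an assumed value.
HEX_RUN = re.compile(r"\b[0-9A-Fa-f]{64,}\b")


def has_no_content(bodies):
    """True when more than half of the message bodies are empty."""
    empty = sum(1 for body in bodies if not body.strip())
    return empty > len(bodies) / 2


def has_invalid_attachment(bodies):
    """True when any message body contains a long hexadecimal string."""
    return any(HEX_RUN.search(body) for body in bodies)
```

Threads flagged by either check would still need a manual look, since a long hexadecimal string can occasionally be legitimate content.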

Annotation
The annotation procedure is divided into two parts: mention annotation and coreference annotation. For both parts, the pre-trained SpanBERT (Joshi et al., 2019a) variant of the model proposed by Joshi et al. (2019b) is used. Henceforth, we will refer to this model as VanillaSpanBERT (for additional description of the model, see Section 4.1).

Mention Annotation
Given an email thread, correctly identifying spans of text which refer to an entity is the task of mention identification. Here, the mention identification task is framed as identifying a single coreference chain which consists of all spans of text referring to a valid entity. A valid entity is an entity of the type PERSON, ORGANIZATION, LOCATION, or DIGITAL. In Example 1, the single coreference chain will be ["g..barkowsky@enron.com", "theresa.staab@enron.com", "Barkowsky, Gloria G.", "Staab, Theresa", "I", "you", "Crestone and Lost Creek"]. Framing the task in this manner speeds up the annotation process, as it eliminates the need to make architectural changes and carry out experiments to test each change.
First, a VanillaSpanBERT model is trained on SC for the mention identification task. Next, this trained model is used to obtain predictions on the unlabelled corpus. From these predictions, approximately 2% (143 email threads) are manually corrected, and a training set of 94 email threads and a validation set of 49 email threads are created. Table 2 shows the count of the type of changes done during the manual correction of these 143 email threads and the corresponding precision, recall, and F1-score of the trained model. In addition to correcting the predictions, we also correct sentence boundaries for these email threads. The remaining 6,001 email threads will be referred to as the mention annotated corpus (MAC).
The motivation for creating a training and validation set is to compare the performance of models trained on gold annotated (94 email threads) and weakly annotated (MAC) training sets, respectively. These models will be referred to as M-VanillaSpanBERT 94 and M-VanillaSpanBERT 6001, respectively. Table 3 reports the results of these two models on SC. From the results, two inferences can be drawn:
1. The model M-VanillaSpanBERT 6001 performs as well as its counterpart trained on a gold annotated corpus. Weak annotations by definition are either incomplete or contain incorrect annotations. However, based on the correction evaluation statistics (Table 2) and the experiment results, the weak mention annotations can be assumed to be gold mention annotations for the purpose of obtaining weak coreference annotations.
2. The performance of the model M-VanillaSpanBERT 6001 illustrates the robustness of the model to the noise in the weakly annotated corpus.
Finally, both SC and the training set containing 94 email threads are used to train a VanillaSpanBERT model to obtain mention annotations on the 6001 email threads, thereby further improving the quality of mention annotations.
Table 3: Results of two models trained on 94 gold annotated and 6,001 weakly annotated documents respectively.
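The single-chain framing of mention identification can be illustrated with a short sketch. It assumes that mentions are represented as (start, end) token spans and that coreference chains are lists of such spans; the helper names are hypothetical. The second function shows one straightforward way to compute span-level precision, recall, and F1 of predicted mentions against corrected ones, which may differ in detail from how the scores in Table 2 were obtained.

```python
def to_single_chain(clusters):
    """Flatten all coreference chains into one chain of unique mention spans."""
    spans = {span for cluster in clusters for span in cluster}
    return [sorted(spans)] if spans else []


def mention_prf(gold_spans, pred_spans):
    """Exact-match precision, recall, and F1 over mention spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

With this representation, a coreference model can be trained for mention identification without modification, because every document simply contains one chain.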

Coreference Annotation
After completing mention annotation on the unlabelled corpus, the next step is to perform entity coreference annotation. For this task, an approach similar to the one used for obtaining mention annotations is followed. First, a gold validation set is created to assist in understanding the training performance. A set of 34 email threads is selected from the validation set used for mention annotation. Two annotators performed annotation only on the previously gold-annotated mentions. Second, a VanillaSpanBERT model is trained on the coreference annotations of SC to obtain annotations on the MAC. Mention annotations from MAC are provided as input during the coreference annotation process. The final annotated corpus will be referred to as CEREC. Table 4 provides different corpus statistics. Although the corpus contains a large number of mention annotations, 29,600 of them have been added by the model during the coreference annotation process. In addition to this, 100,385 mentions added during the mention annotation process have not been annotated by the model in this step.

Environment and Hyperparameters
All mention annotation experiments are carried out using the spanbert base model with a maximum segment length of 256 on an NVIDIA GeForce GTX 1080 Ti GPU with 8 12gb cores. The base variant of the SpanBERT model trains 2x faster than the large variant for a loss of only 0.1 F1 points. For coreference annotations, on the other hand, spanbert large with a maximum segment length of 512 outperforms the previous configuration by 7 F1 points. However, this large variant is trained for 10 epochs on the CPU due to memory constraints. The genre feature is also removed from all models. All remaining hyperparameters in both settings are left unchanged.

Feature Addition
Training using additional features like speaker information and genre indicators on top of coreference annotations has proved to be helpful in the past. Along the same lines, we evaluate three features specific to conversational texts which have a thread-like structure.
1. Message identifier (MI): For an email thread T containing N email messages, the message identifier for a token x belonging to message i (i ∈ {0, 1, ..., N-1}) is i.
2. Section information (SI): An email message is divided into three sections: header, body, and footer. The feature assigns one of the header, body, and footer classes to each token in an email message.
3. Reversing an email (REV): Reversing email messages in a thread refers to ordering the messages as per the time in the email header. This is expected to enhance the model's understanding of the conversation flow in the thread.
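The three features can be sketched at the token level as follows. This is a minimal illustration under assumptions: messages are given as token lists, each message's sections are already separated into a dictionary, and each message carries a comparable `time` value taken from its header. The function names and the ascending sort order used for REV are choices made for this sketch.

```python
SECTIONS = ("header", "body", "footer")


def message_identifiers(thread):
    """MI: every token in message i receives the identifier i."""
    return [i for i, message in enumerate(thread) for _ in message]


def section_information(message_sections):
    """SI: label each token of one message with its section class."""
    return [
        section
        for section in SECTIONS
        for _ in message_sections.get(section, [])
    ]


def reverse_thread(messages):
    """REV: order messages by the time in the email header (ascending, assumed)."""
    return sorted(messages, key=lambda message: message["time"])
```

MI and SI yield one label per token, so they can be fed to a model alongside the token embeddings in the same way as speaker information.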
For the evaluation, VanillaSpanBERT is used, with SC (43 email threads) as the training set. The validation set used during the mention annotation process is used with a 14-20 email thread split to create a validation and testing set. A single annotator performed feature annotation on all 77 email threads. Table 5 shows that the addition of SI improves the performance of the model in all scenarios. SI provides information which is useful for identifying the mentions used in pronoun resolution. All mentions in To or Cc, or the mention in From, are used to resolve pronouns like I, you, me, us, etc.
Reversing the email thread (REV) into temporal order reduces the average F1, which contradicts the hypothesis stated earlier. However, it is important to note that the test set for these experiments consisted of only 20 email threads. Finally, the addition of MI does not help the model. MI provides the model with message boundary information, which could be used to merge clusters across email messages, but it fails to have a positive impact in the current setting.

Baselines
Header baseline1 (Hb1): A simple baseline of resolving pronouns based on the participants in the email header is constructed. All first-person singular pronouns ("I", "me", "my", "mine", "myself") are chained to the sender, and second-person pronouns ("you", "your", "yours", "yourself", "yourselves") to the recipients. First-person plural pronouns ("we", "us", "our", "ours", "ourselves") are linked to both the sender and the recipients of the email message. In addition to this, all non-pronominal mentions which are the same or have overlapping words are chained together. This baseline is rule-based and does not consider the surrounding context.
Header baseline2 (Hb2): This is similar to Header baseline1 except for how first-person plural pronouns are resolved. In this baseline, all first-person plural pronouns in an email message are chained together into one coreference chain and not to the sender or recipients of that message. Furthermore, each first-person plural pronoun chain in an email message is merged with the corresponding chains in every other message of that email thread.
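The Hb1 rules can be written down as a small lookup. The sketch below is illustrative rather than the implementation used in the experiments: the function names are hypothetical, the sender and recipients are assumed to be already extracted from the header, and word overlap is checked on lowercased whitespace tokens.

```python
FIRST_SINGULAR = {"i", "me", "my", "mine", "myself"}
SECOND_PERSON = {"you", "your", "yours", "yourself", "yourselves"}
FIRST_PLURAL = {"we", "us", "our", "ours", "ourselves"}


def hb1_link(mention, sender, recipients):
    """Return the header participants that a pronoun is chained to under Hb1."""
    word = mention.lower()
    if word in FIRST_SINGULAR:
        return [sender]
    if word in SECOND_PERSON:
        return list(recipients)
    if word in FIRST_PLURAL:
        return [sender, *recipients]
    return []  # not a pronoun handled by the header rules


def overlaps(mention_a, mention_b):
    """Non-pronominal mentions are chained when they share at least one word."""
    words_a = set(mention_a.lower().split())
    words_b = set(mention_b.lower().split())
    return bool(words_a & words_b)
```

Under Hb2, only the `FIRST_PLURAL` branch would change: those pronouns would be grouped into their own chain instead of being linked to the header participants.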
c2f-coref (C2F): The c2f-coref model is used for this baseline. This was the first end-to-end neural coreference resolution model. It uses highway LSTMs to generate embeddings for each span and then, with a span-ranking model, decides which of the previous spans is a suitable antecedent (if any). The inputs to the LSTMs are embedding representations from a language model (Peters et al., 2018).
VanillaSpanBERT (SBERT): Joshi et al. (2019b) proposed a BERT (Devlin et al., 2018) version of the C2F model, in which BERT is used to obtain all input embedding representations. For this baseline, the SpanBERT (Joshi et al., 2019a) variant of the model is used owing to its performance gains.

Experimental Setup
The training set for these experiments is CEREC, containing 6001 email threads, and the validation set contains the 34 email threads used for coreference annotation. The SC containing 43 email threads is used as the test set. Mention detection and coreference resolution are the two tasks evaluated in these experiments. The following experiments are carried out:
• Exp1: Use the Hb1 and Hb2 baselines for evaluating coreference resolution given mention annotations as input. Additionally, these baselines also use section information (SI) to identify mentions present in an email header.
• Exp2: Use the C2F and SBERT baselines to evaluate both mention detection and coreference resolution tasks. Compared to the SBERT baseline, the C2F baseline does not enforce a maximum sentence length restriction and has a higher hyperparameter value for maximum training sentences.
The genre feature is removed for both C2F and SBERT baselines since it does not apply to this corpus. For the C2F baseline, the hyperparameters max span width, max training sentences, and epochs are set to 20, 30, and 10, respectively. This is done to make training tractable in the available environment. For the SBERT baseline, the spanbert base model is used with a maximum segment length of 256, and training is carried out on an NVIDIA GeForce GTX 1080 Ti GPU with 8 12gb cores.

Evaluation Metrics
This work follows the standard experimental setup used in the CoNLL 2012 Shared task. Primary evaluation is done using the unweighted average of the MUC, B³, and CEAFE metrics (Pradhan et al., 2012). In addition to this, scores using the LEA metric (Moosavi and Strube, 2016) are also reported.
Table 6 shows the results of Exp1 and Exp2 for all baselines. First, the way first-person plural pronouns are resolved in the header baselines does not have a significant impact on the average F1 score. Second, the average F1 score of SBERT is just 0.23 F1 points higher than that of the C2F baseline. This shows that increasing the maximum sentence length and maximum training sentences does not help C2F outperform SBERT; the two models perform comparably. Compared to the results reported by Dakle et al. (2020), the SBERT baseline performs slightly better. Finally, the large difference in F1 scores between the Exp1 and Exp2 baselines arises because the Exp1 baselines use mention annotations and the SI feature.
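To make the primary metric concrete, the sketch below implements B³, one of its three components, together with the unweighted average. It assumes chains are lists of hashable mention identifiers, with each mention in at most one gold chain and at most one predicted chain. It is a simplified illustration and not a replacement for the reference CoNLL scorer, which also provides MUC and CEAFE; the function names are hypothetical.

```python
def _b_cubed_side(key, response):
    """Average, over mentions in `key`, of the overlap with their `response` chain."""
    chain_of = {m: frozenset(chain) for chain in response for m in chain}
    total = 0
    credit = 0.0
    for chain in key:
        chain = frozenset(chain)
        for mention in chain:
            total += 1
            other = chain_of.get(mention)
            if other is not None:  # mentions missing from `response` earn no credit
                credit += len(chain & other) / len(chain)
    return credit / total if total else 0.0


def b_cubed(gold, pred):
    """Return B-cubed precision, recall, and F1 for predicted chains."""
    recall = _b_cubed_side(gold, pred)
    precision = _b_cubed_side(pred, gold)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def conll_average(muc_f1, b3_f1, ceafe_f1):
    """Unweighted average of the three F1 scores used as the primary metric."""
    return (muc_f1 + b3_f1 + ceafe_f1) / 3
```

Because recall is averaged over gold mentions, every gold mention that a system fails to detect lowers the score, which is one reason detected-mention systems score below systems that are given mention annotations.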

Error Analysis
This section presents error analysis performed on the predictions obtained by the baselines on a subset of 15 email threads selected randomly from SC. The selected 15 email threads contain a total of 282 coreference chains with 1261 mentions. To gain an in-depth understanding of the errors, human evaluation is performed. Errors are broadly divided into four categories. These are similar to the categories used by Aktaş et al. (2018) and Dakle et al. (2020) in their work. Table 7 shows the distribution of errors into these categories for each of the baselines.
Table 6: Evaluation results on SC. Avg. F1 score is computed using MUC, B³, and CEAFE metrics.

Missing references in the chain
A reference that is present in a gold coreference chain but absent in the predicted chains is termed a missing reference. The Hb1 and Hb2 baselines use mention annotations as input to perform coreference chaining. For this reason, only the deep learning baselines are considered for this error category. Missing references are further divided into three types to understand the limitations of the baselines.
1. Missing pronoun references: This error type contributes to 5-6% of all missing references.
2. Missing references in email header: A missing email address or name of a participant in the email message present in the email header is considered in this type. This error type contributes to 23-24% of all missing references.
3. Other missing references: All missing non-pronominal references present in the email body are considered in this error type. For C2F and SBERT, the distribution range of these missing references with respect to entity types is: PER: 23-30%, ORG: 19-23%, LOC: 16-19%, and DIG: 31-38%.

Missing chains
In this error category, coreference chains that are present in the gold annotations but absent in the predictions are considered. Since Hb1 and Hb2 use mention annotations as input, counts for this error category are not reported for these baselines. In the original works (including Joshi et al., 2019b), the C2F and SBERT models were trained on the CoNLL 2012 shared task corpus, which did not contain any singletons. Both C2F and SBERT baselines report similar numbers for this error category. About 82-85% of chains in this error category are of lengths 1 or 2.
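The error analysis in this work is carried out by human evaluation. Purely as an illustration, the first two error categories could be approximated automatically as below. The sketch makes two assumptions that the text does not state: a gold chain counts as missing when none of its mentions appears in any predicted chain, and missing references are counted only within gold chains that are at least partly predicted. The function name is hypothetical.

```python
def missing_analysis(gold, pred):
    """Return missing references, missing chains, and the share of short missing chains."""
    predicted = {mention for chain in pred for mention in chain}

    # Gold chains with no predicted mention at all (assumed definition).
    missing_chains = [chain for chain in gold if not predicted & set(chain)]

    # Unpredicted mentions inside gold chains that are partly predicted.
    missing_refs = [
        mention
        for chain in gold
        if predicted & set(chain)
        for mention in chain
        if mention not in predicted
    ]

    short = sum(1 for chain in missing_chains if len(chain) <= 2)
    share_short = short / len(missing_chains) if missing_chains else 0.0
    return missing_refs, missing_chains, share_short
```

The last value corresponds to the kind of statistic reported above for chains of length 1 or 2.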

Incorrectly chained references
All mentions in a coreference chain are considered to refer to the same entity. A mention or reference in a predicted coreference chain which does not refer to the same entity is considered to be incorrectly chained. These references are further broken down into pronoun references and other references. All baselines report similar counts for pronoun references, with C2F reporting the worst one. SpanBERT, owing to its higher context capturing capabilities, does a better job at resolving pronominal references than C2F. For other references, the C2F and SBERT baselines report approximately 4 times the counts reported by Hb1 and Hb2. This highlights the effectiveness of rule-based approaches and the possible benefits of having a hybrid approach.

Decomposed chains
A gold coreference chain which is present in the predicted chains in the form of two or more chains is called a decomposed chain. An email thread consists of multiple email messages. A model may perform well when the scope is restricted to a single email message but may fail to link entity chains belonging to different email messages. In addition to this, paraphrasing of a mention can also result in multiple chains being created. Counts are reported for both the number of original chains and the number of chains that are created. It is evident from the high number of decomposed chains for the Hb1 and Hb2 baselines that deep learning models do a better job of linking chains across email messages and handling paraphrasing. However, this also increases incorrectly chained references.
Moosavi and Strube (2016), in their work on coreference metrics, highlight the limitations of the CEAFE metric. Identifying entity mentions correctly but splitting a single chain into multiple parts can lower the CEAFE metric score. Compared to the rule-based systems, deep learning models identify fewer mentions with fewer chain splits (see Table 7). We recognize this as the reason for Exp2 models obtaining a higher CEAFE precision score than Exp1 systems. However, the large number of missing references significantly reduces the CEAFE recall for Exp2 models, resulting in Exp1 models having a higher F1 score.

Missing Context
Aktaş et al. (2018) and Dakle et al. (2020) highlight the challenges encountered for the entity resolution task in a conversational thread-like setting. This section points out an additional challenge that corroborates the difficulty of the task. Conversations using any media generally follow a tree-like structure, where multiple topics may branch off the initial topic but still follow a topic flow. In this flow, every message provides a piece of the whole context which helps in understanding the thread. The deletion of an intermediate message can create ambiguity in the resolution of entities. It results not only in the loss of the email text but also in the loss of information about the inclusion or exclusion of participants or a change of email subject. Carenini et al. (2005) emphasized this issue in their work on the discovery of hidden emails.

Conclusion
This paper presents CEREC, the first large annotated corpus for the task of entity resolution in email conversations. The corpus consists of 6001 email threads with 38,996 coreference chains. The two steps in the construction of the corpus, along with the results of the experiments involved and statistics of the resulting corpus, are explained. The construction process is carried out with minimal human intervention. We also evaluate the addition of features specific to text in a conversational thread-like setting. Two rule-based and two deep learning baselines are used for evaluation of the corpus. Qualitative and quantitative error analysis is presented on the predictions obtained using all baselines, highlighting the avenues for improvement. Future work will consist of evaluating probable solutions for the entity resolution task. We also plan to conduct additional experiments to understand the effect of features presented in this paper using a larger corpus.