Crawling and Preprocessing Mailing Lists At Scale for Dialog Analysis

This paper introduces the Webis Gmane Email Corpus 2019, the largest publicly available and fully preprocessed email corpus to date. We crawled more than 153 million emails from 14,699 mailing lists and segmented them into semantically consistent components using a new neural segmentation model. With 96% accuracy on 15 classes of email segments, our model achieves state-of-the-art performance while being more efficient to train than previous ones. All data, code, and trained models are made freely available alongside the paper.


Introduction
Email is perhaps the most reliable and ubiquitous means of digital communication. Notwithstanding the mainstream adoption of social media for private communication as of about 2010, email prevails unrivaled for workplace communication and beyond. Compared to social media, however, emails have attracted much less research attention in the fields of computational linguistics, natural language processing, and information retrieval. Key reasons for the neglect can be found in the presumed difficulty of obtaining emails at scale, the lack of open technologies to parse them, and the fact that, despite their importance, they are hardly considered en vogue.
Although mailing lists as a rich and accessible source for emails have been tapped before, this has never been done at scale. Our contributions in this respect are (1) the Webis Gmane Email Crawl 2019, a crawl of more than 153 million emails from a wide range of mailing lists, (2) the Chipmunk email segmenter, a newly developed end-to-end neural model, and (3) the complete preprocessing of the crawled emails using our model to construct the largest corpus of "ready-to-use" emails to date. Our corpus encompasses more than 20 years' worth of discussions on a diverse set of topics, including important political and societal issues. We believe that providing the research community with access to clean and preprocessed communication data from emails will foster open research in several areas, such as the analysis of dialogs and discourse, stylometry, language evolution, argument mining, as well as information retrieval, and the synthesis of conversations and argumentation.

1 https://webis.de/publications.html?q=ACL+2020

Related Work
For research purposes, the three primary sources of email data are public mailing lists and newsgroups, volunteered or leaked private email datasets, and email databases at companies and service providers. The WestburyLab USENET corpus (Westbury, 2009, 2013) was crawled between 2005 and 2011. More widely employed has been the "20 newsgroups" corpus (Lang, 1995). The W3C corpus compiles the public W3C mailing lists (Wu, 2005); Jiang et al. (2013) examined 8 years of patch submissions to the Linux Kernel Mailing List; and Niedermayer et al. (2017) inspected the process of standardization across IETF bodies via its mailing lists. The CSpace corpus consists of 15,000 student dialogs volunteered for research during a management course at CMU (Kraut et al., 2004).
All of the above have been extensively analyzed (Minkov et al., 2005, 2006; Lawson et al., 2010), yet the most widely studied corpus remains the leaked Enron corpus (Klimt and Yang, 2004), built as part of the U.S. FERC's investigation into the Enron Corporation. It has been subject to studies on speech act and dialog analysis (Goldstein et al., 2006), named entities (Lawson et al., 2010), and word usage patterns (Keila and Skillicorn, 2005), among many others. Another recently leaked dataset comprises the Clinton emails that surfaced during the 2016 U.S. presidential election (De Felice and Garretson, 2018). Regarding email data at companies and service providers, not many researchers are able to disclose their datasets (Avigdor-Elgrabli et al., 2018).
Regardless of their source, emails are usually unstructured and difficult to process even for human readers (Sobotta, 2016). Thus, many approaches have been proposed for cleansing newsgroup and email data. As one of the earliest, de Carvalho and Cohen (2004) developed a specialized method for detecting and removing signatures based on typical text indicators. Tang et al. (2005) developed a high-accuracy model for detecting blocks of noncontent in emails using a mixture of SVM models and hard-coded rules. An unsupervised approach was employed by Contractor et al. (2010), who applied a noisy channel model for filtering out noncontent. Similarly, Bettenburg et al. (2011) used spell checking techniques for uncovering technical artifacts like source code, disentangling them from the main content. A more general approach, befittingly named Zebra, was published by Lampert et al. (2009), who split messages into a series of structural and semantic "zones", such as author text and signature. Finally, Repke and Krestel (2018) developed Quagga, the first neural end-to-end model inspired by Lampert et al.'s Zebra, which showed very substantial performance improvements. Most machine learning-based approaches rely on classifying lines of text, either by detecting the start and the end of structural blocks with specialized models, or by assessing each line individually via its surrounding context.

The Webis Gmane Email Corpus 2019
Our dataset was crawled from Gmane (https://news.gmane.io, or rather nntp://news.gmane.io), a popular email-to-newsgroup gateway, which allows users to subscribe to mailing lists via the NNTP newsgroup protocol that formed the basis for the Usenet. While Gmane's web portal has been offline for years and was recently replaced by a minimal website under a new domain name, the newsgroup portal is still alive and messages from active mailing lists arrive every day. Unlike a mailing list server, a newsgroup server keeps an archive of messages, allowing a user to download the history of a newsgroup even if they did not participate in it from the beginning. Traditional newsgroup servers often have a limited retention period, though fortunately, Gmane has archived all messages since its launch in 2002. About a million messages date back even further to the year 2000 and a small number even to the early 1990s. The latest message in our corpus is from mid-May 2019, which is when we stopped crawling. Considering this enormous time span and the uncertain future of Gmane, we see archiving these messages as both a great research opportunity and an attempt at preserving our digital heritage.
Following the style of the Usenet, Gmane groups are ordered in a hierarchy of subjects under the common gmane root. This hierarchy makes it easy to categorize mailing lists into topical domains, giving a rough overview of what is being talked about. The majority of groups are of a generally technical nature (e.g., in gmane.comp or gmane.linux), but a large number of other categories exist, most notably culture, politics, science, education, music, games, and recreation. Below these main categories, a plethora of individual subjects are found. A cursory topic modeling study reveals not only software development discussions, but also debates about environmental issues, climate change, gender equality, mobility, health, business, international conflicts, general political concerns, philosophy, religious beliefs, and many more.

Acquisition
We crawled all 14,699 groups, of which 64 turned out to be empty. Gmane provides another 18,450 groups under the gwene hierarchy for headlines and snippets from RSS feeds. We crawled those as well, but have neither analyzed them nor added them to the dataset. The crawling process ran slowly over a period of months, producing 604 GiB of compressed WARC files. The total number of messages across all groups sums up to 153,310,330 usable mails. The largest individual group is the Linux Kernel Mailing List with 2.4 million messages, followed by the KDE bug tracking list with 2 million. Excluding any obvious bug tracking or software patch submission lists, 113 million messages remain. Further excluding the largest hierarchies comp, linux, and os, 24 million messages are left, which boil down to 7.8 million when restricted to the seven exemplary hierarchies mentioned above. Of these, 6.4 million are in English; the rest are mostly in German, French, and Spanish. The 153 million messages were posted by 6.4 million unique sender addresses and the influx volume amounts to over 710,000 messages per month. This number is a bit lower at 610,000 when only considering the past five years. The top 10 groups account for an average of 1.2 million messages each and the top 10,000 groups for 15,250, while the bottom 5,000 groups have on average 100 messages.

Preprocessing
Emails are a noisy data source in need of heavy preprocessing. The Usenet and early-day mailing lists developed (n)etiquettes for how to write proper messages. These included quoting as little as possible, replying inline, separating signatures by two hyphens, and restricting their length to four lines. Email, the more recent in particular, obeys none of those. For the most part, messages consist of large blocks of nested quotations (often mutilated by the 78-character limit), various formats for introducing quotations, exuberant unstructured personal signatures, and automated signatures added by the author's user agent or the mailing list server. Moreover, technical emails often contain fragments of source code, log data, or diffs. Automated emails also contain semi-structured templates like ASCII-formatted tables. Extracting the content of such unstructured messages proves difficult and long threads pose a challenge even to human readers.
We started the preprocessing by parsing the MIME contents into pure plaintext. To preserve the privacy of users, the name part of each email address was replaced with a 16-byte base64 prefix of the address's SHA-256 hash, with @example.com appended as the authority part. Headers were reduced to the set necessary for retaining date-time, subject, thread, sender, and recipient information. Finally, the contents of each email were segmented and annotated using our model described in Section 4, allowing for easy extraction of not only the main content, but also other structured information. The final corpus is packaged as compressed line-based JSON files that can be easily indexed into Elasticsearch using its bulk API.
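The address anonymization described above can be illustrated with a minimal sketch. The function name `anonymize_address`, the lowercasing of the address before hashing, the URL-safe base64 alphabet, and the reading of "16-byte prefix" as the first 16 base64 characters are all assumptions made for illustration; the exact procedure used for the corpus may differ.

```python
import base64
import hashlib


def anonymize_address(address: str, prefix_len: int = 16) -> str:
    """Replace an email address with a hashed pseudonym at example.com.

    Assumptions (not confirmed by the paper): the whole address is
    lowercased and hashed, and the URL-safe base64 alphabet is used.
    """
    digest = hashlib.sha256(address.strip().lower().encode("utf-8")).digest()
    # Base64 of a 32-byte digest has 44 characters; keep only a prefix.
    token = base64.urlsafe_b64encode(digest).decode("ascii")[:prefix_len]
    return f"{token}@example.com"


if __name__ == "__main__":
    print(anonymize_address("Jane.Doe@lists.example.org"))
```

Because the mapping is deterministic, the same sender receives the same pseudonym across messages, so thread and sender statistics remain intact while the original address is no longer trivially harvestable.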

The Chipmunk Email Segmenter
Cleansing email plaintexts is laborious and first requires splitting them into different functional and semantic segments (also sometimes called zones). Our first attempt at this was a re-implementation of the classic approach by Tang et al. Despite our best efforts, its handcrafted feature set and the need to train two individual SVMs for each type of content block caused generalizability and scalability issues on our much larger and more diverse dataset. Also, a context window of three lines was not nearly enough to reliably identify all types of content blocks, and making the window larger did not yield satisfying results due to the simplicity and the lack of shared weights among the individual models. We also needed a much more fine-grained segmentation, which not even the more recent neural approach by Repke and Krestel could deliver without substantial changes, so we decided to develop a new email segmenter.
We identified 15 common segments recurring in emails: (1) paragraphs (main content), (2) salutations, (3) closings, (4) quotations, (5) quotation markers (quotation author and date), (6) inline email headers, (7) personal signatures, (8) automated MUA signatures (i.e., mail user agent, but also mailing list details or advertising), (9) source code, (10) source code diffs, (11) log data, (12) technical noise (e.g., inline attachments or PGP signatures), (13) semi-structured tabular data, (14) ornaments (e.g., separator lines), and (15) structural section headings (e.g., in a call for papers). We annotated segments in a stratified sample of 3,033 emails from a range of different groups, totaling 170,309 line annotations. Annotated segments are mostly unambiguous so that a single annotator can produce consistent and high-quality annotations in multiple correction passes. Although the sample is technically multilingual, most emails are in English. Of the 3,033 emails, we set aside 300 for model validation. We also extracted another sample of 1.5 million emails and concatenated them into a single file of 80 million lines (2.8 GiB). In this file, we replaced all email addresses with the token @EMAIL@, all URLs with @URL@, mapped numbers to the digit 0, replaced all hexadecimal values with @HASH@ and runs of four or more indenting spaces with @INDENT@, split words on special characters (mainly for tokenizing quotations and source code), and normalized Unicode characters to NFKC. We used this processed dump to train a fastText embedding (Grave et al., 2017) with a default vector dimension of 100.
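The normalization steps applied before training the embedding can be sketched as follows. All regular expressions, the order of the substitutions, the choice to map every digit to 0, and the threshold of eight hex digits for @HASH@ are assumptions made for illustration; the paper does not specify them.

```python
import re
import unicodedata

# Hypothetical patterns; the original preprocessing rules are not published here.
URL_RE = re.compile(r"(?:https?|ftp)://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
HEX_RE = re.compile(r"\b(?:0x)?[0-9a-fA-F]{8,}\b")
INDENT_RE = re.compile(r"^ {4,}")
DIGIT_RE = re.compile(r"\d")
# Keep placeholder tokens whole, otherwise split into words and single symbols.
TOKEN_RE = re.compile(r"@[A-Z]+@|\w+|[^\w\s]")


def normalize_line(line: str) -> list[str]:
    """Normalize one line of email text and split it into tokens."""
    line = unicodedata.normalize("NFKC", line)
    line = URL_RE.sub("@URL@", line)
    line = EMAIL_RE.sub("@EMAIL@", line)
    line = HEX_RE.sub("@HASH@", line)        # before digits are mapped to 0
    line = INDENT_RE.sub("@INDENT@ ", line)  # leading runs of >= 4 spaces
    line = DIGIT_RE.sub("0", line)
    return TOKEN_RE.findall(line)


if __name__ == "__main__":
    print(normalize_line("> Hi bob@example.org, see https://x.org/a?b=1"))
    print(normalize_line("    rev 3f2a9c1e7b at 12:45"))
```

Splitting on special characters keeps quotation markers such as ">" as separate tokens, which is what lets the embedding capture structural cues at the start of a line.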

Model Architecture
Figure 1: Architecture of the Chipmunk email segmenter. Embeddings for the current and previous lines (max length n = 12 words) and a 2D line context window (c = 4) are fed into separate inputs. We use batch normalization after the RNN and the first CNN layer and a dropout chance of 0.25 before the final softmax layer.

The segmentation model has a hybrid RNN-CNN architecture as depicted in Figure 1. For each line, we define a context window of c = 4 lines before and after the current line and build an embedding matrix of dimensions (2c + 1, n, 100), n being the maximum word token count per line. Longer lines are truncated by discarding tokens between the first 75% and the last 25% of the line, preserving both line beginnings and endings with preference to beginnings, where more structural markers are found under left-to-right writing. Shorter lines and the top or the bottom of the context matrix are padded if required. We feed the line embeddings into separate 128-unit Bi-GRU encoders and the context matrix into a 2D CNN. The idea is that, unlike normal text, plaintext emails have a spatial layout where the horizontal and the vertical axis both convey structural information (most importantly the first column). The CNN performs 128 convolutions with a filter size of 4 × 4, then another 128 convolutions with a filter size of 3 × 3, and finally a max pooling of 2 × 2. After each of the Bi-GRUs and the first convolution, we add a batch normalization. The CNN output is fed into a 128-dimensional dense layer, concatenated with the other outputs, and then regularized with a dropout of 0.25 before being passed to the softmax layer with outputs for the 15 segment labels and <empty> for blank lines. All layers have ReLU as their activation function. We train the model using a mini-batch size of 128 and the Adam optimizer with hinge loss. Choosing hinge loss over cross-entropy is a decent trade-off between accuracy and generalizability. While cross-entropy tends to find a closer fit, giving higher accuracy on very similar data, this comes at the expense of uncertain decisions and early overfitting. Hinge loss prefers larger margins, generalizing better to new and entirely unseen data in a line-wise classification scenario with strict block boundaries.
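The input preparation described above (truncating long lines and building the context window) can be sketched in a few lines. The helper names `truncate_tokens` and `context_window` are hypothetical, and the sketch assumes that the 75%/25% split refers to shares of the maximum length n; the actual implementation may round or pad differently.

```python
def truncate_tokens(tokens, n=12, head_ratio=0.75):
    """Keep at most n tokens: the first 75% of the budget from the line
    start and the remainder from the line end (assumed interpretation)."""
    if len(tokens) <= n:
        return list(tokens)
    head = int(n * head_ratio)
    tail = n - head
    return list(tokens[:head]) + (list(tokens[-tail:]) if tail else [])


def context_window(lines, i, c=4, pad=None):
    """Return the 2c + 1 lines centered on line i, padding past the
    top and bottom of the message with a placeholder value."""
    return [lines[j] if 0 <= j < len(lines) else pad
            for j in range(i - c, i + c + 1)]


if __name__ == "__main__":
    print(truncate_tokens(list(range(20))))
    print(context_window(["a", "b", "c"], 0, c=1))
```

With n = 12, a 20-token line keeps its first 9 and last 3 tokens, so markers at the line start survive truncation while the line ending is still represented.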

Evaluation
To evaluate our model, we compare it with two others from the literature in two different settings. Table 1 compiles an overview of the evaluation results. A confusion matrix for our model is found in Table 2 in the appendix. Our model achieves 96% accuracy over all classes. Mapped to binary decisions between paragraphs and non-paragraphs, the accuracy goes up to 98%. The recall on the paragraph class is 93% (see Table 2). Quotations form the majority class with 33%, followed by patches with 16%. Paragraphs come in at 11%. Note that the patch class is overrepresented not because we sampled primarily patch emails, but because patches tend to be longer than normal emails. Still, we achieve an overall high accuracy on all classes. A typical segmentation is provided as an example in Figure 3.
To test the model's ability to generalize to unseen data, we annotated 300 emails from the Enron corpus, whose class distribution differs significantly from that of mailing lists: the emails are much shorter and most lines belong to paragraphs (36%) or are empty (26%). Quotations account for 8% and code or patches are non-existent. Though significantly lower than on Gmane, the accuracy of our model is still acceptable at about 88%. The excessive use of inline headers containing multiple lines of forwarding addresses appears to be the main challenge for our model, which is expected considering that forwarding emails to dozens of recipients is rare on mailing lists. Furthermore, the proprietary Enron mail user agent had an unusual forwarding and quotation style quite unlike the more common Thunderbird, GMail, or Outlook notations.
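For readers who want to reproduce the kinds of figures reported above, the following sketch shows how line-wise accuracy, the binary paragraph/non-paragraph mapping, and per-class recall relate to each other. The label strings and the toy data are made up for illustration and are not taken from the corpus or the evaluation.

```python
def accuracy(gold, pred):
    """Share of lines whose predicted label matches the annotation."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def to_binary(labels, positive="paragraph"):
    """Collapse the multi-class labels into paragraph vs. non-paragraph."""
    return [label == positive for label in labels]


def recall(gold, pred, label):
    """Share of lines annotated as `label` that were also predicted so."""
    relevant = [p for g, p in zip(gold, pred) if g == label]
    return sum(p == label for p in relevant) / len(relevant)


if __name__ == "__main__":
    # Toy example with hypothetical label names.
    gold = ["paragraph", "quotation", "closing", "paragraph"]
    pred = ["paragraph", "quotation", "personal_signature", "quotation"]
    print(accuracy(gold, pred))
    print(accuracy(to_binary(gold), to_binary(pred)))
    print(recall(gold, pred, "paragraph"))
```

The binary accuracy can exceed the multi-class accuracy because confusions among non-paragraph classes (here, a closing mistaken for a signature) no longer count as errors.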
Finally, we compared our model against Quagga, the state-of-the-art neural segmentation model by Repke and Krestel, and a re-implementation of Tang et al.'s SVM email cleaning approach. Unfortunately, a training routine was missing from Quagga's source code, so we re-implemented this part as closely to the original as possible, with one notable exception: we changed the way the model handles quotations. The original model did not have a quotation class and was instead trained to ignore quotation indicators so as to also predict normal content segments within quotations. This is very different from how our model handles quotations and it renders the reconstruction of a conversation from the segments alone impossible. We prefer our approach of classifying quotations as a separate segment, which retains the structure of emails: one can simply strip the quotation indicators and then apply the model recursively. We trained our own Quagga on all 16 classes for 20 epochs (the model started overfitting after more epochs). Although the original model was trained and tested on only five classes, the extended and retrained model performs only slightly worse than ours with 94% accuracy overall and very similar scores for most of the frequent classes. The degradation on the Enron corpus appears to be worse than in our model (with the exception of the log data class). In conclusion, we can say that both models perform comparably well, though our model achieves overall better generalization. In terms of training speed, we found our approach to be faster and more efficient, since it relies on a 2D context window instead of a vertical RNN for sequences of lines.
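The recursive treatment of quotations mentioned above could look roughly like the following sketch. The `classify` argument is a hypothetical stand-in for a trained segmenter that returns one label per line, the `>` prefix pattern covers only one common quotation style, and the `max_depth` guard is an added safety measure; none of this is taken from the released code.

```python
import re

# Assumed quotation indicator: optional whitespace, ">", optional space.
QUOTE_PREFIX = re.compile(r"^\s*>\s?")


def strip_quote_level(lines):
    """Remove one level of quotation indicators from each line."""
    return [QUOTE_PREFIX.sub("", line, count=1) for line in lines]


def segment_recursively(lines, classify, max_depth=10):
    """Label lines; re-segment each quotation block after unquoting it."""
    labels = classify(lines)
    result, i = [], 0
    while i < len(lines):
        if labels[i] == "quotation" and max_depth > 0:
            j = i
            while j < len(lines) and labels[j] == "quotation":
                j += 1
            inner = strip_quote_level(lines[i:j])
            result.append(
                ("quotation", segment_recursively(inner, classify, max_depth - 1)))
            i = j
        else:
            result.append((labels[i], lines[i]))
            i += 1
    return result


def toy_classify(lines):
    """Toy stand-in for the model: ">"-prefixed lines are quotations."""
    return ["quotation" if l.lstrip().startswith(">") else "paragraph"
            for l in lines]


if __name__ == "__main__":
    print(segment_recursively(["Hi", "> a", "> > b"], toy_classify))
```

The nested result mirrors the reply structure of the thread, which is what makes reconstructing a conversation from the segments possible.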
The model by Tang et al. required a great deal of feature engineering and the training of many separate models. For simplicity, and in accordance with the original paper, we mapped all labels to the reduced set of content, quotation, header, signature, code (patch), and <empty>. Despite the smaller number of classes, the model's accuracy lags behind the neural models with 80% on Gmane and only 72% on the Enron corpus.

Ethical Considerations
The distribution of email data raises ethical concerns, such as possible violations of privacy and legal requirements, which we addressed to the best of our ability. All emails in our corpus are from public mailing lists and by policy, Gmane only accepts such lists whose users are comfortable with their emails being publicly readable. At the time of writing, the original messages in our corpus are openly available to anyone through the NNTP interface and other mailing list archives. Nevertheless, we took measures to avoid abuse of the readily parsed and compiled form of the data, one being the aforementioned anonymization of email addresses to inhibit trivial mass harvesting. Furthermore, we enforce a strict release policy in compliance with the GDPR academic exemptions. Access to the data is granted solely to researchers and academic institutions and we prohibit further distribution for non-academic purposes.

Summary
This paper contributes the largest email corpus to date. The corpus is targeted mainly at discussion and dialog-based research in NLP. We gave an overview of the topics discussed in the corpus, demonstrating that it is a valuable source for several NLP tasks, such as argument mining. Despite the prevalence of technical conversations, various important and controversial societal issues are covered in the corpus as well. To minimize user overhead, we developed a new neural model for segmenting emails with high precision and recall, which achieves state-of-the-art performance, allowing for fine-grained extraction of structural elements from emails. All the resources developed in this paper are freely available.

A slight yet notable confusion between MUA signatures, personal signatures, and closings can be observed, which are sometimes hard to discern even for humans. The heading class is the least prevalent of all and thus lacks training data. Empty line misclassification is corrected afterwards.

Corpus Statistics
Languages