A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal

Multi-document summarization (MDS) aims to compress the content in large document collections into short summaries and has important applications in story clustering for newsfeeds, presentation of search results, and timeline generation. However, there is a lack of datasets that realistically address such use cases at a scale large enough for training supervised models for this task. This work presents a new dataset for MDS that is large both in the total number of document clusters and in the size of individual clusters. We build this dataset by leveraging the Wikipedia Current Events Portal (WCEP), which provides concise and neutral human-written summaries of news events, with links to external source articles. We also automatically extend these source articles by looking for related articles in the Common Crawl archive. We provide a quantitative analysis of the dataset and empirical results for several state-of-the-art MDS techniques.


Introduction
Text summarization has recently received increased attention with the rise of deep learning-based end-to-end models, both for extractive and abstractive variants. However, so far, only single-document summarization has profited from this trend. Multi-document summarization (MDS) still suffers from a lack of established large-scale datasets. This impedes the use of large deep learning models, which have greatly improved the state-of-the-art for various supervised NLP problems (Vaswani et al., 2017; Paulus et al., 2018; Devlin et al., 2019), and makes a robust evaluation difficult. Recently, several larger MDS datasets have been created: Zopf (2018); Liu et al. (2018); Fabbri et al. (2019). However, these datasets do not realistically resemble use cases with large automatically aggregated collections of news articles, focused on particular news events. This includes news event detection, news article search, and timeline generation. Given the prevalence of such applications, there is a pressing need for better datasets for these MDS use cases.
Table 1: Example event summary, with headlines of the original source articles and of a sample of the additional articles.
Human-written summary: Emperor Akihito abdicates the Chrysanthemum Throne in favor of his elder son, Crown Prince Naruhito. He is the first Emperor to abdicate in over two hundred years, since Emperor Kōkaku in 1817.
Headlines of source articles (WCEP):
• Defining the Heisei Era: Just how peaceful were the past 30 years?
• As a New Emperor Is Enthroned in Japan, His Wife Won't Be Allowed to Watch
Sample headlines from Common Crawl:
• Japanese Emperor Akihito to abdicate after three decades on throne
• Japan's Emperor Akihito says he is abdicating as of Tuesday at a ceremony, in his final official address to his people
• Akihito begins abdication rituals as Japan marks end of era
In this paper, we present the Wikipedia Current Events Portal (WCEP) dataset, which is designed to address real-world MDS use cases. The dataset consists of 10,200 clusters with one human-written summary and 235 articles per cluster on average. We extract this dataset starting from the Wikipedia Current Events Portal (WCEP) 1 . Editors on WCEP write short summaries about news events and provide a small number of links to relevant source articles. We extract the summaries and source articles from WCEP and increase the number of source articles per summary by searching for similar articles in the Common Crawl News dataset 2 . As a result, we obtain large clusters of highly redundant news articles, resembling the output of news clustering applications. Table 1 shows an example of an event summary, with headlines from both the original article and from a sample of the associated additional sources. In our experiments, we test a range of unsupervised and supervised MDS methods to establish baseline results. We show that the additional articles lead to much higher upper bounds of performance for standard extractive summarization, and help to increase the performance of baseline MDS methods.
We summarize our contributions as follows:
• We present a new large-scale dataset for MDS that is better aligned with several real-world industrial use cases.
• We provide an extensive analysis of the properties of this dataset.
• We provide empirical results for several baselines and state-of-the-art MDS methods aiming to facilitate future work on this dataset.

Datasets for MDS
Datasets for MDS consist of clusters of source documents and at least one ground-truth summary assigned to each cluster. Commonly used traditional datasets include DUC 2004 (Paul and James, 2004) and TAC 2011 (Owczarzak and Dang, 2011), which consist of only 50 and 100 document clusters, respectively, with 10 news articles per cluster on average. The MultiNews dataset (Fabbri et al., 2019) is a recent large-scale MDS dataset, containing 56,000 clusters, but each cluster contains only 2.3 source documents on average. The sources were hand-picked by editors and do not reflect use cases with large automatically aggregated document collections. MultiNews also has much more verbose summaries than WCEP. Zopf (2018) created the auto-hMDS dataset by using the lead section of Wikipedia articles as summaries and automatically searching for related documents on the web, resulting in 7,300 clusters. The WikiSum dataset (Liu et al., 2018) uses a similar approach and additionally uses cited sources on Wikipedia. The dataset contains 2.3 million clusters. These Wikipedia-based datasets also have long summaries about various topics, whereas our dataset focuses on short summaries about news events.

Dataset Construction
Wikipedia Current Events Portal: WCEP lists current news events on a daily basis. Each news event is presented as a summary with at least one link to external news articles. According to the editing guidelines 3 , the summaries must be short, up to 30-40 words, and written in complete sentences in the present tense, avoiding opinions and sensationalism. Each event must be of international interest. Summaries are written in English, and news sources are preferably English.
Obtaining Articles Linked on WCEP: We parse the WCEP monthly pages to obtain a list of individual events, each with a list of URLs to external source articles. To prevent the source articles of the dataset from becoming unavailable over time, we use the 'Save Page Now' feature of the Internet Archive 4 . We request snapshots of all source articles that are not captured in the Internet Archive yet. We download and extract all articles from the Internet Archive Wayback Machine 5 using the newspaper3k 6 library.
Additional Source Articles: Each event from WCEP contains only 1.2 sources on average, meaning that most editors provide only one source article when they add a new event. In order to extend the set of input articles for each of the ground-truth summaries, we search for similar articles in the Common Crawl News dataset 7 .
We train a logistic regression classifier to decide whether to assign an article to a summary, using the original WCEP summaries and source articles as training data. For each event, we label the article-summary pair for each source article of the event as positive. We create negative examples by pairing each event with source articles from other events of the same date, resulting in a positive-negative ratio of 7:100. The features used by the classifier are listed in Table 2.
Table 2: Features used by the article-summary classifier.
• tf-idf similarity between title and summary
• tf-idf similarity between body and summary
• No. entities from summary appearing in title
• No. linked entities from summary appearing in body
We use unigram bag-of-words vectors with TF-IDF weighting and cosine similarity for the first two features. The entities are phrases in the WCEP summaries that the editors annotated with hyperlinks to other Wikipedia articles. We search for these entities in article titles and bodies by exact string matching. The classifier achieves 90% Precision and 74% Recall of positive examples on a hold-out set.
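The four features described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the implementation used to build the dataset: the tokenizer, the smoothed IDF formula, and the helper names (`tokenize`, `idf_table`, `pair_features`) are all hypothetical choices made for the example. The resulting feature vector would then be passed to a logistic regression classifier, which is not shown here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric unigrams (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_table(documents):
    """Smoothed inverse document frequency from a list of token lists (assumed formula)."""
    n_docs = len(documents)
    df = Counter(tok for doc in documents for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + count)) + 1.0 for tok, count in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector; tokens missing from the IDF table get weight 0."""
    tf = Counter(tokens)
    return {tok: count * idf.get(tok, 0.0) for tok, count in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pair_features(title, body, summary, entities, idf):
    """Return the four article-summary features in the order listed in the text."""
    s_vec = tfidf_vector(tokenize(summary), idf)
    return [
        cosine(tfidf_vector(tokenize(title), idf), s_vec),
        cosine(tfidf_vector(tokenize(body), idf), s_vec),
        sum(1 for e in entities if e in title),  # exact string matching
        sum(1 for e in entities if e in body),
    ]
```

Entity matching is case-sensitive here because the text specifies exact string matching; a real pipeline might normalize whitespace or casing first.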
For each event in the original dataset, we apply the classifier to articles published within ±1 day of the event date and add those articles whose classification probability passes a threshold of 0.9. If an article is assigned to multiple events, we only add it to the event with the highest probability. This procedure increases the number of source articles per summary considerably (Table 4).
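The assignment procedure above can be sketched as follows. The data layout (dicts with `id` and `date` keys) and the `score` callable, which stands in for the trained classifier, are hypothetical. Treating the threshold as inclusive (probability of at least 0.9) is also an assumption, since the text does not specify it.

```python
from datetime import date, timedelta


def assign_articles(events, articles, score, threshold=0.9, window_days=1):
    """Assign each article to at most one event.

    events, articles: lists of dicts with 'id' and 'date' (datetime.date) keys.
    score(event, article): returns a probability in [0, 1]; a hypothetical
    wrapper around the trained classifier.
    Returns a dict mapping event id to a list of assigned article ids.
    """
    window = timedelta(days=window_days)
    clusters = {event["id"]: [] for event in events}
    for article in articles:
        best_id, best_prob = None, 0.0
        for event in events:
            # Only consider events within the date window of the article.
            if abs(article["date"] - event["date"]) > window:
                continue
            prob = score(event, article)
            # Keep only the highest-probability event that passes the threshold.
            if prob >= threshold and prob > best_prob:
                best_id, best_prob = event["id"], prob
        if best_id is not None:
            clusters[best_id].append(article["id"])
    return clusters
```

This sketch compares every article with every event for clarity; at the scale of Common Crawl News, one would presumably index events by date first to avoid the quadratic loop.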
Final Dataset: Each example in the dataset consists of a ground-truth summary and a cluster of original source articles from WCEP, combined with additional articles from Common Crawl. The dataset has 10,200 clusters, which we split roughly into 80% training, 10% validation and 10% test (Table 3). The split is done chronologically, such that no event dates overlap between the splits. We also create a truncated version of the dataset with a maximum of 100 articles per cluster, by retaining all original articles and randomly sampling from the additional articles.
Table 3 shows the number of clusters and of articles from all clusters combined, for each dataset partition. Table 4 shows statistics for individual clusters. We show statistics for the entire dataset (WCEP-total), and for the truncated version (WCEP-100) used in our experiments. The high mean cluster size is mostly due to articles from Common Crawl.
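A minimal sketch of the two steps described above, the chronological split and the truncation to 100 articles, is given below. The function names, the integer-percentage parameters, and the rule that each date is assigned to a partition as a whole are assumptions made for illustration; the text only states that the split is chronological and that no event dates overlap between partitions.

```python
import random


def truncate_cluster(original, additional, max_size=100, seed=0):
    """Keep all original WCEP articles and fill up with randomly sampled additional ones."""
    rng = random.Random(seed)
    n_extra = max(0, max_size - len(original))
    sampled = rng.sample(additional, min(n_extra, len(additional)))
    return list(original) + sampled


def chronological_split(clusters, train_pct=80, val_pct=10):
    """Split clusters by date so that no date appears in two partitions.

    clusters: list of dicts with a sortable 'date' value (e.g. ISO date strings).
    Percentages are integers to avoid floating-point boundary effects.
    """
    ordered = sorted(clusters, key=lambda c: c["date"])
    total = len(ordered)
    splits = {"train": [], "val": [], "test": []}
    current_date, current_name = None, "train"
    for seen, cluster in enumerate(ordered):
        # Decide the partition only when a new date starts, so that all
        # clusters sharing a date end up in the same partition.
        if cluster["date"] != current_date:
            current_date = cluster["date"]
            if seen * 100 < total * train_pct:
                current_name = "train"
            elif seen * 100 < total * (train_pct + val_pct):
                current_name = "val"
            else:
                current_name = "test"
        splits[current_name].append(cluster)
    return splits
```

Because whole dates are kept together, the resulting proportions are only approximately 80/10/10, which matches the "roughly" in the description.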

Quality of Additional Articles
To investigate how related the additional articles obtained from Common Crawl are to the summary they are assigned to, we randomly select 350 of them for manual annotation. We compare the article title and the first three sentences to the assigned summary, and pick one of the following three options: 1) "on-topic" if the article focuses on the event described in the summary, 2) "related" if the article mentions the event, but focuses on something else, e.g., a follow-up, and 3) "unrelated" if there is no mention of the event. This results in 52% on-topic, 30% related, and 18% unrelated articles. We think that this amount of noise is acceptable, as it resembles noise present in applications with automatic content aggregation. Furthermore, summarization performance benefits from the additional articles in our experiments (see Section 5).

Extractive Strategies
Human-written summaries can vary in the degree of how extractive or abstractive they are, i.e., how much they copy or rephrase information in source documents. To quantify extractiveness in our dataset, we use the measures coverage and density defined by Grusky et al. (2018): Given an article $A$ consisting of tokens $a_1, a_2, \ldots, a_n$ and its summary $S = s_1, s_2, \ldots, s_m$, $F(A, S)$ is the set of token sequences (fragments) shared between $A$ and $S$, identified in a greedy manner. Coverage measures the proportion of words from the summary appearing in these fragments. Density is related to the average length of shared fragments and measures how well a summary can be described as a series of extractions. In our case, $A$ is the concatenation of all articles in a cluster.
Figure 1 shows the distribution of coverage and density in different summarization datasets. WCEP-10 refers to a truncated version of our dataset with a maximum cluster size of 10. The WCEP dataset shows increased coverage if more articles from Common Crawl are added, i.e., all words of a summary tend to be present in larger clusters. High coverage suggests that retrieval and copy mechanisms within a cluster can be useful to generate summaries. Likely due to the short summary style and editor guidelines, high density, i.e., copying of long sequences, is not as common in WCEP as in the MultiNews dataset.
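To make the two measures concrete, the sketch below computes them with a simple greedy fragment search. It follows our reading of Grusky et al. (2018), where coverage is the summed fragment length divided by the summary length and density is the summed squared fragment length divided by the summary length. The function names are hypothetical, and the quadratic search is written for clarity rather than speed.

```python
def shared_fragments(article_tokens, summary_tokens):
    """Greedily collect the longest shared token sequences, scanning the summary left to right."""
    fragments = []
    i = 0
    while i < len(summary_tokens):
        best = 0
        for j in range(len(article_tokens)):
            k = 0
            while (i + k < len(summary_tokens)
                   and j + k < len(article_tokens)
                   and summary_tokens[i + k] == article_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary_tokens[i:i + best])
            i += best  # skip past the matched fragment
        else:
            i += 1  # token does not occur in the article at all
    return fragments


def coverage(article_tokens, summary_tokens):
    """Proportion of summary tokens that lie inside shared fragments."""
    if not summary_tokens:
        return 0.0
    frags = shared_fragments(article_tokens, summary_tokens)
    return sum(len(f) for f in frags) / len(summary_tokens)


def density(article_tokens, summary_tokens):
    """Sum of squared fragment lengths, normalized by summary length."""
    if not summary_tokens:
        return 0.0
    frags = shared_fragments(article_tokens, summary_tokens)
    return sum(len(f) ** 2 for f in frags) / len(summary_tokens)
```

For the multi-document setting described above, `article_tokens` would be the concatenated tokens of all articles in a cluster.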

Setup
Due to scalability issues of some of the tested methods, we use the truncated version of the dataset with a maximum of 100 articles per cluster (WCEP-100). The performance of the methods that we consider starts to plateau after 100 articles (see Figure 2). We set a maximum summary length of 40 tokens, which is in accordance with the editor guidelines in WCEP. This limit also corresponds to the optimal length of an extractive oracle optimizing ROUGE F1-scores 8 . For models with a dynamic (potentially longer) output length, we recommend evaluating with F1-scores and optionally providing Recall results with truncated summaries. Extractive methods should only return lists of full untruncated sentences up to that limit. We evaluate lowercased versions of summaries and do not modify ground-truth or system summaries otherwise. We compare and evaluate systems using F1-score and Recall of ROUGE-1, ROUGE-2, and ROUGE-L (Lin, 2004). In the following, we abbreviate ROUGE-1 F1-score and Recall with R1-F and R1-R, etc.

Methods
We evaluate the following oracles and baselines to put evaluation scores into perspective:
• ORACLE (MULTI): Greedy oracle, adds sentences from a cluster that optimize R1-F of the constructed summary until R1-F decreases.
• ORACLE (SINGLE): Best of oracle summaries extracted from individual articles in a cluster.
• LEAD ORACLE: The lead (first sentences up to 40 words) of an individual article with the best R1-F score within a cluster.
• RANDOM LEAD: The lead of a randomly selected article, which is our alternative to the lead baseline used in single-document summarization.
• BERTREG: Similar framework to TSR but with sentence embeddings computed by a pretrained BERT model (Devlin et al., 2019). Refer to Appendix A.1 for more details.
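The greedy ORACLE (MULTI) from the list above can be sketched as follows. The `rouge1_f` helper is a simplified stand-in for ROUGE-1 F1 (clipped unigram overlap, no stemming or stop-word handling) and will not reproduce the scores of an official ROUGE implementation. Stopping as soon as the score no longer improves is an assumption about how "until R1-F decreases" is handled when scores tie.

```python
from collections import Counter


def rouge1_f(candidate_tokens, reference_tokens):
    """Simplified ROUGE-1 F1: clipped unigram overlap between two token lists."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def greedy_oracle(sentences, reference_tokens):
    """Greedily add the sentence that maximizes R1-F until no sentence improves it.

    sentences: list of token lists pooled from all articles in a cluster.
    Returns the selected sentence indices (in selection order) and the final score.
    """
    selected, selected_tokens, best_score = [], [], 0.0
    remaining = list(range(len(sentences)))
    while remaining:
        scored = [(rouge1_f(selected_tokens + sentences[i], reference_tokens), i)
                  for i in remaining]
        score, idx = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break  # adding any further sentence would not improve R1-F
        selected.append(idx)
        selected_tokens = selected_tokens + sentences[idx]
        best_score = score
        remaining.remove(idx)
    return selected, best_score
```

Applying the same function to the sentences of one article at a time and keeping the best result would give a sketch of ORACLE (SINGLE).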
We tune hyperparameters of the methods described above on the validation set of WCEP-100 (Appendix A.2). We also test a simple abstractive baseline, SUBMODULAR + ABS: We first create an extractive multi-document summary with a maximum of 100 words using SUBMODULAR. We pass this summary as a pseudo-article to the abstractive bottom-up attention model (Gehrmann et al., 2018) to generate the final summary. We use an implementation from OpenNMT 9 with a model pretrained on the CNN/Daily Mail dataset. All tested methods apart from ORACLE (MULTI & SINGLE) observe the length limit of 40 tokens.
Table 5 presents the results on the WCEP test set. The supervised methods TSR and BERTREG show advantages over unsupervised methods, but not by a large margin, which poses an interesting challenge for future work. The high extractive bounds defined by ORACLE (SINGLE) suggest that identifying important documents before summarization can be useful in this dataset. The dataset does not favor lead summaries: RANDOM LEAD is of low quality, and LEAD ORACLE has relatively low F-scores (although very high Recall). The SUBMODULAR + ABS heuristic for applying a pre-trained abstractive model does not perform well.
Figure 2 shows how the performance of several methods on the test set increases with different amounts of additional articles from Common Crawl. Using 10 additional articles causes a steep improvement compared to only using the original source articles from WCEP. However, using more than 100 articles only leads to minimal gains.

Conclusion
We present a new large-scale MDS dataset for the news domain, consisting of large clusters of news articles, associated with short summaries about news events. We hope this dataset will facilitate the creation of real-world MDS systems for use cases such as summarizing news clusters or search results. We conducted extensive experiments to establish baseline results, and we hope that future work on MDS will use this dataset as a benchmark. Important challenges for future work include how to scale deep learning methods to such large amounts of source documents and how to close the gap to the oracle methods.

A Appendices
A.1 BERTREG
This method uses a regression model to score and rank sentences. For a particular sentence, we obtain a contextualized embedding from a pre-trained BERT model 10 . We concatenate the embedding with several statistical and surface-form sentence features shown in Table 6.
Table 6: Sentence features used in BERTREG.
• length (in tokens)
• position
• stop word ratio
• mean tf
• mean tf-idf
• mean tf-icf
• mean cluster-df
The corpus-level document and cluster frequencies (cf) in tf-idf and tf-icf are obtained from the training set. cluster-df refers to the document frequency within a particular cluster. We feed this concatenated sentence vector to a feedforward network with one hidden layer of size 256. The model is trained to predict the R1 F-score between a sentence and the summary of a cluster, using the mean squared error loss. We found the F-score to work better than Precision or Recall. We use the SGD optimizer, a learning rate of 0.02, and train for 8 epochs with batch size 8. To construct a summary, we predict scores using this model, rank sentences, and greedily pick sentences from the ranked list under a redundancy constraint, as used in TSR.

A.2 Implementation Details for Extractive Methods
We implement the methods TEXTRANK, CENTROID, TSR and BERTREG in a commonly used framework that greedily selects sentences from a ranked list while avoiding redundancy. We measure redundancy as the proportion of bigrams in a new sentence that appear in an already selected sentence. For each method, we tune threshold values for redundancy from 0 to 1 in steps of 0.1. For SUBMODULAR, we tune a parameter called diversity with values 1 to 10 in steps of 1, which has a similar role as the redundancy threshold. We use 100 randomly selected clusters from the validation set in WCEP-100 for parameter tuning. We set a minimum sentence length of 7 tokens, which prevents summaries slightly shorter than the 40-token limit from being padded with very short or broken sentences.
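The greedy selection framework described above can be sketched as follows. Two details are assumptions made for this example: redundancy is computed against the pooled bigrams of all previously selected sentences, and sentences that would exceed the token limit are skipped rather than truncated. The default threshold of 0.5 is a placeholder, since the tuned values are not given here.

```python
def bigrams(tokens):
    """Set of distinct adjacent token pairs in a sentence."""
    return set(zip(tokens, tokens[1:]))


def select_sentences(ranked, max_redundancy=0.5, max_tokens=40, min_tokens=7):
    """Greedily pick sentences from a ranked list under length and redundancy constraints.

    ranked: list of token lists, sorted from highest to lowest score.
    max_redundancy: tunable threshold in [0, 1] (placeholder default).
    """
    summary, seen_bigrams, length = [], set(), 0
    for tokens in ranked:
        # Skip very short sentences and sentences that do not fit into the budget.
        if len(tokens) < min_tokens or length + len(tokens) > max_tokens:
            continue
        candidate = bigrams(tokens)
        overlap = len(candidate & seen_bigrams) / len(candidate) if candidate else 0.0
        if overlap > max_redundancy:
            continue  # too similar to what has already been selected
        summary.append(tokens)
        seen_bigrams |= candidate
        length += len(tokens)
    return summary
```

Tuning then amounts to running this selection with thresholds 0, 0.1, ..., 1 on the tuning clusters and keeping the value with the best validation score.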