MLSUM: The Multilingual Summarization Corpus

We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five languages: French, German, Spanish, Russian, and Turkish. Together with English newspapers from the popular CNN/Daily Mail dataset, the collected data form a large-scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multilingual dataset.


Introduction
The document summarization task requires several complex language abilities: understanding a long document, discriminating what is relevant, and writing a short synthesis. Over the last few years, advances in deep learning applied to NLP have contributed to the rising popularity of this task among the research community (See et al., 2017; Kryściński et al., 2018; Scialom et al., 2019). As with other NLP tasks, the great majority of available datasets for summarization are in English, and thus most research efforts focus on the English language. The lack of multilingual data is partially countered by the application of transfer learning techniques enabled by the availability of pre-trained multilingual language models. This approach has recently established itself as the de-facto paradigm in NLP (Guzmán et al., 2019).
Under this paradigm, for encoder/decoder tasks, a language model can first be pre-trained on a large corpus of texts in multiple languages. Then, the model is fine-tuned in one or more pivot languages for which task-specific data are available. At inference, it can still be applied to any of the languages seen during pre-training. Because of the dominance of English in large-scale corpora, English naturally established itself as the pivot for other languages. The availability of multilingual pre-trained models, such as multilingual BERT (M-BERT), makes it possible to build models for target languages different from the training data. However, previous works reported a significant performance gap between English and the target language, e.g. for classification (Conneau et al., 2018) and Question Answering (Lewis et al., 2019) tasks. A similar approach has recently been proposed for summarization (Chi et al., 2019), obtaining, again, lower performance than for English.
For specific NLP tasks, recent research efforts have produced evaluation datasets in several target languages, making it possible to evaluate the progress of the field in zero-shot scenarios. Nonetheless, those approaches are still bound to using training data in a pivot language for which a large amount of annotated data is available, usually English. This prevents investigating, for instance, whether a given model is as well suited to one specific language as to another. Answers to such research questions represent valuable information for improving model performance on low-resource languages.
In this work, we aim to fill this gap for the automatic summarization task by proposing a large-scale MultiLingual SUMmarization (MLSUM) dataset. The dataset is built from online news outlets, and contains over 1.5M article-summary pairs in 5 languages: French, German, Spanish, Russian, and Turkish, complementing an already established summarization dataset in English.
The contributions of this paper can be summarized as follows: 1. We release the first large-scale multilingual summarization dataset; 2. We provide strong baselines from multilingual abstractive text generation models; 3. We report a comparative cross-lingual analysis of the results obtained by different approaches.
2 Related Work

Multilingual Text Summarization
Over the last two decades, several research works have focused on multilingual text summarization. Radev et al. (2002) developed MEAD, a multi-document summarizer that works for both English and Chinese. Litvak et al. (2010) proposed to improve multilingual summarization using a genetic algorithm. A community-driven initiative, MultiLing (Giannakopoulos et al., 2015), benchmarked summarization systems on multilingual data. While the MultiLing benchmark covers 40 languages, it provides relatively few examples (10k in the 2019 release). Most approaches proposed so far have been extractive, given the lack of a multilingual corpus to train abstractive models (Duan et al., 2019). More recently, with the rapid progress in automatic translation and text generation, abstractive methods for multilingual summarization have been developed. Ouyang et al. (2019) proposed to learn summarization models for three low-resource languages (Somali, Swahili, and Tagalog), by using an automated translation of the New York Times dataset. Although this showed slight improvements over a baseline which considers translated outputs of an English summarizer, results remain far from human performance. Summarization models from translated data usually under-perform, as translation biases add to the difficulty of summarization.
Following the recent trend of using multilingual pre-trained models for NLP tasks, such as Multilingual BERT (M-BERT) (Pires et al., 2019; https://github.com/google-research/bert/blob/master/multilingual.md) or XLM (Lample and Conneau, 2019), Chi et al. (2019) proposed to fine-tune such models for summarization on English training data. The assumption is that the summarization skills learned from English data can transfer to other languages on which the model has been pre-trained. However, a significant performance gap between English and the target language is observed following this process. This emphasizes the crucial need for multilingual training data for summarization.

Existing Multilingual Datasets
The research community has produced several multilingual datasets for tasks other than summarization. We report recent efforts below; notably, XNLI and MLQA both i) rely on human translations, and ii) only provide evaluation data.
The Cross-Lingual NLI Corpus The SNLI corpus (Bowman et al., 2015) is a large-scale dataset for natural language inference (NLI). It is composed of a collection of 570k human-written English sentence pairs, each associated with one of three labels: entailment, contradiction, or neutral. The Multi-Genre Natural Language Inference (MultiNLI) corpus is an extension of SNLI, comparable in size, but including a more diverse range of text. Conneau et al. (2018) introduced the Cross-Lingual NLI Corpus (XNLI) to evaluate transfer learning from English to other languages: based on MultiNLI, a collection of 5,000 test and 2,500 dev pairs was translated by humans into 15 languages.
MLQA Given a paragraph and a question, the Question Answering (QA) task consists in providing the correct answer. Large-scale datasets such as (Rajpurkar et al., 2016; Choi et al., 2018; Trischler et al., 2016) have driven fast progress. However, these datasets are only in English. To assess how well models perform on other languages, Lewis et al. (2019) recently proposed MLQA, an evaluation dataset for cross-lingual extractive QA composed of 5K QA instances in 7 languages.

XTREME The Cross-lingual TRansfer Evaluation of Multilingual Encoders benchmark covers 40 languages over 9 tasks. The summarization task is not included in the benchmark.
XGLUE In order to train and evaluate models across a diverse set of cross-lingual tasks, Liang et al. (2020) recently released XGLUE, covering both Natural Language Understanding and Generation scenarios. While no summarization task is included, it comprises a News Title Generation task: the data is crawled from a commercial news website and provided in the form of article-title pairs for 5 languages (German, English, French, Spanish and Russian).

Existing Summarization datasets
We describe here the main available corpora for text summarization.

Document Understanding Conference Several small and high-quality summarization datasets in English (Harman and Over, 2004; Dang, 2006) have been produced in the context of the Document Understanding Conference (DUC; http://duc.nist.gov/). They are built by associating newswire articles with corresponding human summaries. A distinctive feature of the DUC datasets is the availability of multiple reference summaries: this is a valuable characteristic since, as found by Rankel et al. (2013), the correlation between qualitative and automatic metrics, such as ROUGE (Lin, 2004), decreases significantly when only a single reference is given. However, due to the small amount of training data available, DUC datasets are often used in a domain adaptation setup for models first trained on larger datasets such as Gigaword or CNN/DM (Nallapati et al., 2016; See et al., 2017), or with unsupervised methods (Dorr et al., 2003; Mihalcea and Tarau, 2004; Barrios et al., 2016a).
Gigaword Again using newswire as source data, the English Gigaword corpus (Napoles et al., 2012; Rush et al., 2015; Chopra et al., 2016) is characterized by its large size and high diversity of sources. Since the samples are not associated with human summaries, prior works on summarization have trained models to generate the headline of an article given its incipit (its opening text), which induces various biases in the learned models.
New York Times Corpus This large corpus for summarization consists of hundreds of thousands of articles from The New York Times (Sandhaus, 2008), spanning over 20 years. The articles are paired with summaries written by library scientists. Although Grusky et al. (2018) found indications of bias towards extractive approaches, several research efforts have used this dataset for summarization (Hong and Nenkova, 2014; Durrett et al., 2016; Paulus et al., 2017).
CNN / Daily Mail One of the most commonly used datasets for summarization (Nallapati et al., 2016; See et al., 2017; Paulus et al., 2017; Dong et al., 2019), although originally built for Question Answering tasks (Hermann et al., 2015a). It consists of English articles from CNN and The Daily Mail, associated with bullet-point highlights from the article. When used for summarization, the bullet points are typically concatenated into a single summary.
NEWSROOM Composed of 1.3M articles (Grusky et al., 2018), and featuring high diversity in terms of publishers, the summaries associated with English news articles were extracted from the Web page metadata: they were originally written to be used in search engines and social media.

LCSTS The Large Scale Chinese Short Text Summarization dataset (Hu et al., 2015) is built from 2 million short texts from the Sina Weibo microblogging platform. They are paired with summaries given by the author of each text. The dataset includes 10k summaries which were manually scored by humans for their relevance.

With the increasing interest in cross-lingual models, the NLP community has recently released multilingual evaluation datasets targeting classification (XNLI) and QA (Lewis et al., 2019) tasks, as described in Section 2.2. However, no large-scale dataset is available for document summarization.

MLSUM
To fill this gap we introduce MLSUM, the first large-scale multilingual summarization corpus. Our corpus provides more than 1.5 million article/summary pairs in French (FR), German (DE), Spanish (ES), Turkish (TR), and Russian (RU). Since it is built from news articles, like the previously mentioned CNN/Daily Mail, and provides a comparable amount of training samples per language (except for Russian), it can effectively serve as a multilingual extension of the CNN/Daily Mail dataset.
In the following, we first describe the methodology used to build the corpus. We then report the corpus statistics and finally interpret the performances of baselines and state-of-the-art models.

Collecting the Corpus
The CNN/Daily Mail (CNN/DM) dataset (see Section 2.3) is arguably the most used large-scale dataset for summarization. Following the same methodology, we consider news articles as the text input, and their paired highlights/description as the summary. For each language, we selected an online newspaper which met the following requirements: 1. Being a generalist newspaper: ensuring that a broad range of topics is represented for each language minimizes the risk of training topic-specific models, which would hinder comparative cross-lingual analyses.
2. Having a large number of articles in their public online archive.
3. Providing human written highlights/summaries for the articles that can be extracted from the HTML code of the web page.

• Le Monde (www.lemonde.fr, French)
• Süddeutsche Zeitung (www.sueddeutsche.de, German)
• El País (www.elpais.com, Spanish)
• Moskovskij Komsomolets (www.mk.ru, Russian)
• Internet Haber (www.internethaber.com, Turkish)

For each outlet, we crawled archived articles from 2010 to 2019. We applied one simple filter: all articles shorter than 50 words, or with summaries shorter than 10 words, are discarded, so as to avoid articles containing mostly audiovisual content. Each article was archived on the Wayback Machine (web.archive.org, using https://github.com/agude/wayback-machine-archiver), allowing interested researchers to rebuild or extend MLSUM. We distribute the dataset as a list of immutable snapshot URLs of the articles, along with the accompanying corpus-construction code, allowing replication of the parsing and preprocessing procedures we employed. This is due to legal reasons: the content of the articles is copyrighted, and redistribution might be seen as an infringement of publishing rights. Nonetheless, we make available, upon request, an exact copy of the dataset used in this work. A similar approach has been adopted for several dataset releases in the recent past, such as the Question Answering corpus of Hermann et al. (2015b) or XSUM (Narayan et al., 2018a).
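The length filter described above can be sketched as follows; `article` and `summary` are assumed to be plain, whitespace-tokenizable strings, and word counts are taken as whitespace-split tokens (the paper does not specify its tokenizer):

```python
def keep_pair(article: str, summary: str) -> bool:
    """Discard pairs whose article has fewer than 50 words or
    whose summary has fewer than 10 words (MLSUM's length filter)."""
    return len(article.split()) >= 50 and len(summary.split()) >= 10
```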
Further, we provide recommended train/validation/test splits following a chronological ordering based on the articles' publication dates. In our experiments below, we train/evaluate the models on the training/test splits obtained in this manner. Specifically, we use data from 2010 to 2018, included, for training, and data from 2019 (~10% of the dataset) for validation (up to May 2019) and test (May-December 2019). While this choice is arguably more challenging, due to the possible emergence of new topics over time, we consider it the realistic scenario that a successful summarization system should be able to deal with. Incidentally, this also brings the advantage of excluding most cases of leakage across languages: it prevents a model, for instance, from seeing a training sample describing an important event in one language, and then being given at inference a similar article in another language, published around the same time and dealing with the same event.
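The chronological split can be sketched as below. Note that the boundary month is ambiguous in the description ("up to May" for validation, "May-December" for test); here May is assigned to test, which is an assumption:

```python
import datetime

def assign_split(pub_date: datetime.date) -> str:
    """Chronological split: 2010-2018 for training, early 2019 for
    validation, and the rest of 2019 for test. Assigning May to the
    test portion is an assumption; the paper's boundary is ambiguous."""
    if pub_date.year <= 2018:
        return "train"
    if pub_date.year == 2019 and pub_date.month <= 4:
        return "val"
    return "test"
```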

Dataset Statistics
We report statistics for each language in MLSUM in Table 1, including those computed on the CNN/Daily Mail dataset (English) for quick comparison. MLSUM provides a comparable amount of data for all languages, with the exception of Russian, which has ten times fewer training samples. Important characteristics for summarization datasets are the length of articles and summaries, the vocabulary size, and a proxy for abstractiveness, namely the percentage of novel n-grams between the article and its human summary. Table 1 shows that Russian stands out on these dimensions, with the shortest and most abstractive summaries. Coupled with the significantly lower amount of articles available from its online source, the task can be seen as more challenging for Russian than for the other languages in MLSUM. Conversely, similar characteristics are shared among the other languages, for instance French and German.

Topic Shift
With the exception of Turkish, the article URLs in MLSUM allow us to identify a category for each article. In Figure 1 we show how categories shift over time. In particular, we plot the 6 most frequent categories per language.

Models
We experimented on MLSUM with the established models and baselines described below. These include supervised and unsupervised methods, as well as extractive and abstractive models. For all the experiments, we train models on a per-language basis. We used the recommended hyperparameters for all languages, in order to facilitate assessing the robustness of the models. We also tried training one model with all the languages mixed together, but did not observe any significant difference in performance.

Extractive summarization models
Oracle Extracts the sentences, within the input text, that maximise a given metric (in our experiments, ROUGE-L) given the reference summary. It is an indication of the maximum one could achieve with extractive summarization. In this work, we rely on the implementation of Narayan et al. (2018b).
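A minimal greedy sketch of such an extractive oracle is shown below; the paper relies on Narayan et al.'s implementation, which may differ in details (this version uses a simple LCS-based ROUGE-L F1 and greedily adds sentences while the score improves):

```python
from typing import List

def lcs_len(a: List[str], b: List[str]) -> int:
    # Classic O(|a|*|b|) longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f(candidate: List[str], reference: List[str]) -> float:
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(candidate), lcs / len(reference)
    return 2 * p * r / (p + r)

def greedy_oracle(sentences: List[str], reference: str, max_sents: int = 3) -> List[str]:
    """Greedily add the sentence that most improves ROUGE-L F1
    against the reference summary; stop when no sentence helps."""
    ref = reference.lower().split()
    chosen, best = [], 0.0
    while len(chosen) < max_sents:
        gains = []
        for s in sentences:
            if s in chosen:
                continue
            cand = " ".join(chosen + [s]).lower().split()
            gains.append((rouge_l_f(cand, ref), s))
        if not gains:
            break
        score, sent = max(gains)
        if score <= best:
            break
        best, chosen = score, chosen + [sent]
    return chosen
```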
Random In order to compare the performances of the different models across languages, it is useful to include an unbiased model as a point of reference. To that purpose, we define a simple random extractive model that randomly extracts N words from the source document, with N fixed as the average length of the summaries.
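A sketch of this baseline follows, assuming the N words are sampled uniformly without replacement (the text does not specify whether the extracted words are contiguous):

```python
import random

def random_extract(document: str, avg_summary_len: int, seed: int = 0) -> str:
    """Unbiased reference point: sample N words uniformly (without
    replacement) from the source document, N being the average
    summary length for the language. Uniform sampling is an assumption."""
    words = document.split()
    rng = random.Random(seed)
    n = min(avg_summary_len, len(words))
    return " ".join(rng.sample(words, n))
```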
Lead-3 Simply selects the first three sentences from the input text. Sharma et al. (2019), among others, showed that this is a robust baseline for several summarization datasets such as CNN/DM, NYT and BIGPATENT.
TextRank An unsupervised algorithm proposed by Mihalcea and Tarau (2004). It computes the pairwise similarities between all sentences in the input text; the sentences most central to the document are then extracted and considered as the summary. We used the implementation provided by Barrios et al. (2016b).
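The sentence-ranking core of TextRank can be sketched as below, using the original Mihalcea and Tarau overlap similarity and a plain power-iteration PageRank; note that the Barrios et al. implementation used in the paper modifies the similarity function (e.g. BM25 variants), so this is only an illustrative sketch:

```python
from typing import List
import math

def similarity(s1: List[str], s2: List[str]) -> float:
    # Mihalcea & Tarau (2004): word overlap normalized by sentence lengths.
    overlap = len(set(s1) & set(s2))
    if overlap == 0 or len(s1) < 2 or len(s2) < 2:
        return 0.0
    return overlap / (math.log(len(s1)) + math.log(len(s2)))

def textrank(sentences: List[str], n_keep: int = 2, damping: float = 0.85,
             iters: int = 50) -> List[str]:
    """Rank sentences by PageRank over the pairwise-similarity graph
    and return the top n_keep in document order."""
    toks = [s.lower().split() for s in sentences]
    n = len(sentences)
    w = [[similarity(toks[i], toks[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    scores = [1.0] * n
    for _ in range(iters):  # power iteration
        new = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                out_j = sum(w[j])
                if w[j][i] > 0 and out_j > 0:
                    rank += scores[j] * w[j][i] / out_j
            new.append((1 - damping) + damping * rank)
        scores = new
    top = sorted(range(n), key=lambda i: -scores[i])[:n_keep]
    return [sentences[i] for i in sorted(top)]
```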

Abstractive summarization models
Most of the models for abstractive summarization are neural sequence-to-sequence models (Sutskever et al., 2014), composed of an encoder that encodes the input text and a decoder that generates the summary. See et al. (2017) proposed the addition of the copy mechanism (Vinyals et al., 2015) on top of a sequence-to-sequence LSTM model. This mechanism allows out-of-vocabulary tokens to be copied efficiently, leveraging attention (Bahdanau et al., 2014) over the input. We used the publicly available OpenNMT implementation.

METEOR The Metric for Evaluation of Translation with Explicit ORdering (Banerjee and Lavie, 2005) was designed for the evaluation of machine translation output. It is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. METEOR is often reported in summarization papers (See et al., 2017; Dong et al., 2019) in addition to ROUGE.
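For reference, the original METEOR score combines unigram precision P and recall R as F = 10PR/(R + 9P), scaled by a fragmentation penalty. A minimal exact-match sketch is given below; it deliberately ignores METEOR's stemming and synonymy modules, so it is not a drop-in replacement for the official scorer:

```python
def meteor_exact(candidate: str, reference: str) -> float:
    """Unigram, exact-match METEOR sketch: harmonic mean weighting
    recall 9x over precision, times (1 - penalty) for fragmented
    matchings. Chunks are maximal runs of adjacent matched tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    used = [False] * len(ref)
    align = []  # (candidate index, reference index) of exact matches
    for i, tok in enumerate(cand):
        for j, rtok in enumerate(ref):
            if not used[j] and tok == rtok:
                used[j] = True
                align.append((i, j))
                break
    m = len(align)
    if m == 0:
        return 0.0
    p, r = m / len(cand), m / len(ref)
    f_mean = 10 * p * r / (r + 9 * p)
    chunks = 1
    for (i1, j1), (i2, j2) in zip(align, align[1:]):
        if i2 != i1 + 1 or j2 != j1 + 1:
            chunks += 1
    penalty = 0.5 * (chunks / m) ** 3
    return f_mean * (1 - penalty)
```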

Novelty Because of their use of copy mechanisms, some abstractive models, including the Pointer-Generator, have been reported to rely too much on extraction (See et al., 2017; Kryściński et al., 2018). Hence, it became common practice to report the percentage of novel n-grams within the generated summaries. Recent work (2019) learned a metric from human annotation. All these metrics were only trained on English datasets, preventing us from reporting them in this paper. The availability of MLSUM will enable future work to build such metrics in a multilingual fashion.
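The novelty statistic can be sketched as follows; it is computed here over n-gram types for simplicity, since the text does not specify type- versus token-level counting:

```python
def novelty(summary: str, article: str, n: int = 1) -> float:
    """Percentage of summary n-grams absent from the source article
    (computed over n-gram types, which is an assumption)."""
    def ngrams(text: str) -> set:
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    summ = ngrams(summary)
    if not summ:
        return 0.0
    return 100 * len(summ - ngrams(article)) / len(summ)
```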

Results and Discussion
The results presented below allow us to compare the models across languages, and to investigate or hypothesize where their performance variations may come from. We can distinguish the following factors to explain differences in the results: 1. differences in the data, independent of the language, such as the structure of the articles, the abstractiveness of the summaries, or the quantity of data; 2. differences due to the language itself, either due to metric biases (e.g. due to a different morphological type) or to biases inherent to the model.
While the first set of differences has more to do with domain adaptation, the second further motivates the development of multilingual datasets, since they are the only means to study such phenomena.
Turning to the observed results, we report in Table 2 the ROUGE-L and METEOR scores obtained by each model for all languages. We note that the overall order of systems (for each language) is preserved when using either metric (modulo some swaps between Lead 3 and Pointer Generator, but with relatively close scores).

Russian, the low-resource language in MLSUM
For all experimental setups, the performance on Russian is comparatively low. This can be explained by at least two factors. First, the corpus is the most abstractive (see Table 1), limiting the performance figures obtained by the extractive models (Random, Lead-3, and Oracle). Second, one order of magnitude less training data is available for Russian than for the other MLSUM languages, a fact which can explain the impressive improvement (+66% in terms of ROUGE-L, see Table 2) between a non-pretrained model (Pointer-Generator) and a pretrained model (M-BERT).

How abstractive are the models?
We report the novelty (i.e. the percentage of novel words in the summary) in Figure 2. As previous works reported (See et al., 2017), pointer-generator networks are poorly abstractive, relying too much on their copy mechanism. This is particularly true for Russian: the lack of data probably makes it easier to learn to copy than to learn natural language generation. As expected, pretrained language models such as M-BERT are consistently more abstractive, and by a large margin, since they are exposed to other texts during pretraining.

Model Biases toward Languages
Consistency among ROUGE scores The Random model obtains comparable ROUGE-L scores across all the languages, except for Russian. This can be explained by the aforementioned Russian corpus characteristics: highest novelty, shortest summaries, and longest input documents (see Table 1). Thus, in the following, for pairwise language-based comparisons we focus only on scores obtained, by the different models, on French, German, Spanish, and Turkish, since we cannot draw meaningful comparisons between Russian and the other languages.
Abstractiveness of the datasets The Oracle performance can be considered the upper limit for an extractive model, since it extracts the sentences yielding the best ROUGE-L. We can observe that, while similar for English and German, and to some extent Turkish, the Oracle performance is lower for French and Spanish.
However, as described in Table 1, the percentages of novel words are similar for German (14.96), French (15.21) and Spanish (15.34). This may indicate that the relevant information to extract from the article is more spread across sentences for Spanish and French than for German. This is confirmed by the results of Lead-3: German and English have a much higher ROUGE-L (35.20 and 33.09) than French or Spanish (19.69 and 13.70). TextRank, on the other hand, performs poorly regardless of the Oracle scores. It is particularly surprising to see the low performance on German whereas, for this language, Lead-3 has a comparatively higher performance. On the other hand, the performance on English is remarkably high: the ROUGE-L is 33% higher than for Turkish, 126% higher than for French, and 200% higher than for Spanish. We suspect that the TextRank parameters might actually overfit English.
In Table 3, we report the performance ratio between TextRank and Pointer-Generator on our corpus, as well as on CNN/DM and two other English corpora (DUC and NewsRoom). TextRank performs close to the Pointer-Generator on English corpora (ratios between 0.85 and 1.21) but not in the other languages (ratios between 0.37 and 0.65). This suggests that this model, despite its generic and unsupervised nature, might be highly biased towards English.
The benefits of pretraining We hypothesize that the closer an unsupervised model's performance is to its maximum limit, the less improvement is to be expected from pretraining. In Figure 3, we plot the improvement rate from TextRank to Oracle against that from Pointer-Generator to M-BERT.
Looking at the correlation emerging from the plot, the hypothesis appears to hold for all languages, including Russian (not plotted for scaling reasons: x = 808, y = 40), with the exception of English. This exception is probably due to the aforementioned bias of TextRank towards the English language.
Pointer-Generator and M-BERT Finally, we observe in our results that M-BERT always outperforms the Pointer-Generator. However, the ratio is not homogeneous across languages, as reported in Table 3. In particular, the improvement for German is much larger than the one for French. Interestingly, this observation is in line with results reported for Machine Translation: the Transformer (Vaswani et al., 2017) significantly outperforms ConvS2S (Gehring et al., 2017) for English-to-German but obtains comparable results for English-to-French (see Table 2 in Vaswani et al. (2017)).
Neither model is pretrained or LSTM-based (Hochreiter and Schmidhuber, 1997), and both use BPE tokenization (Shibata et al., 1999). Therefore, the main difference between them is the self-attention mechanism introduced in the Transformer, while ConvS2S uses only source-to-target attention.
We thus hypothesise that self-attention plays an important role for German but has a limited impact for French. This could find an explanation in the morphology of the two languages: in statistical parsing, Tsarfaty et al. (2010) considered German to be very sensitive to word order, due to its rich morphology, as opposed to French; among other reasons, the flexibility of its syntactic ordering is mentioned. This corroborates the hypothesis that self-attention might help preserve information for languages with a higher degree of word-order freedom.

Possible derivative usages of MLSUM
Multilingual Question Answering Originally, CNN/DM was a Question Answering dataset (Hermann et al., 2015a). The underlying hypothesis is that the information in the summary is also contained in the paired article. Hence, questions can be generated from the summary sentences by masking the Named Entities contained therein.
The masked entities represent the answers, and thus a masked question should be answerable given the source article. So far, no multilingual training dataset has been proposed for Question Answering.
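The masking step can be sketched as below; the entity spans are assumed to come from an upstream NER step (not shown), and the `[MASK]` placeholder is an illustrative choice, not taken from the paper:

```python
from typing import List, Tuple

def mask_entities(summary_sentence: str,
                  entities: List[str]) -> List[Tuple[str, str]]:
    """Turn a summary sentence into cloze-style (question, answer)
    pairs by masking one named entity at a time. Entity strings are
    assumed to be produced by an upstream NER system."""
    pairs = []
    for ent in entities:
        if ent in summary_sentence:
            pairs.append((summary_sentence.replace(ent, "[MASK]"), ent))
    return pairs
```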
This methodology could thus be applied to MLSUM as a first step toward a large-scale multilingual Question Answering corpus. Incidentally, this would also allow progress towards multilingual Question Generation, a crucial component for the neural summarization metrics mentioned in Section 5.
News Title Generation While the release of MLSUM described here covers only article-summary pairs, the archived news articles also include the corresponding titles. The accompanying code for parsing the articles makes it easy to retrieve the titles and thus use them for News Title Generation.
Topic detection A topic/category can be associated with each article/summary pair by simply parsing the corresponding URL. A natural application of this data to summarization would be template-based summarization (Perez-Beltrachini et al., 2019), using topics as additional features. However, it can also be a useful multilingual resource for topic detection.
Conclusion

We presented MLSUM, the first large-scale MultiLingual SUMmarization dataset, comprising over 1.5M article/summary pairs in French, German, Russian, Spanish, and Turkish. We detailed its construction, and its complementary nature to the CNN/DM summarization dataset for English. We reported extensive preliminary experiments, highlighting biases observed in existing summarization models, and analyzed the relative performances of state-of-the-art approaches across languages. In future work, we plan to add other languages, including Arabic and Hindi, and to investigate the adaptation of neural metrics to multilingual summarization.