MM-AVS: A Full-Scale Dataset for Multi-modal Summarization

Multimodal summarization has become increasingly significant as it underpins question answering, Web search, and many other downstream tasks. However, its learning materials have lacked a holistic organization that integrates resources from various modalities, lagging behind the research progress of this field. In this study, we release a full-scale multimodal dataset that comprehensively gathers documents, summaries, images, captions, videos, audios, transcripts, and titles in English from CNN and Daily Mail. To the best of our knowledge, this is the first collection that spans all modalities and comprises nearly all types of materials available in this community. In addition, we devise a baseline model on the new dataset that employs a newly proposed Jump-Attention mechanism based on transcripts. The experimental results validate the important assistive role of external information for multimodal summarization.


Introduction
Multimodal summarization refines salient information from one or more modalities, including text, image, audio, and video (Evangelopoulos et al., 2013; Li et al., 2017). Given the rapid dissemination of multimedia data over the Internet, multimodal summarization has been widely explored in recent years. Meanwhile, several multimodal datasets (Li et al., 2017; Sanabria et al., 2018; Li et al., 2020a) have been introduced to advance the development of this research field. However, most of them are restricted in scale or narrowly scoped, for example containing fewer than one hundred examples or only Chinese texts. Moreover, the materials from different modalities are rarely collected across the board, especially videos and their accompanying materials, which carry abundant external information for multimodal comprehension and fusion.
In this work, we introduce the full-scale Multimodal Article and Video Summarization (MM-AVS) dataset, with documents, summaries, images, captions, videos, audios, transcripts, and titles in English. Its significance for the multimodal summarization community includes, but is not limited to: 1) MM-AVS is large-scale compared with existing video-containing datasets, and its generation code has been released, so it can be readily extended for existing and future multimodal summarization approaches; 2) MM-AVS is collected from CNN and Daily Mail, which makes it accessible to more researchers because it is English-based and comparable with the popular text-based CNN/Daily Mail corpus; and 3) MM-AVS is the first to collect nearly all types of materials from all modalities, including videos, audios, transcripts, images, captions, and titles, which are rarely assembled together.
In addition, we implement a general multimodal summarization baseline on MM-AVS. This method employs a Jump-Attention mechanism based on transcripts to align features between text and video. Further, we use multi-task learning to simultaneously optimize document and video summarization. Evaluations on MM-AVS illustrate the benefits of external information such as videos and transcripts for multimodal summarization without manual alignment.

Related Work
Multi-modal summarization generates a condensed multimedia summary from multi-modal materials, such as texts, images, and videos. For instance, UzZaman et al. (2011) introduced the idea of illustrating complex sentences as multimodal summaries by combining pictures, structures, and simplified compressed texts. Libovickỳ et al. (2018) and Palaskar et al. (2019) studied abstractive text summarization for open-domain videos. Li et al. (2017) constructed the MMS dataset and developed an extractive multi-modal summarization method that automatically generates a textual summary from a topic-related set of documents, images, audios, and videos. Zhu et al. (2018, 2020) combined image selection and output to alleviate the modality bias based on the MSMO dataset. Chen and Zhuge (2018) extended Daily Mail with images and captions into the E-DailyMail dataset and employed a hierarchical encoder-decoder model to align sentences and images. Recently, an aspect-aware model and a large-scale Chinese e-commerce product summarization dataset, EC-product, were introduced to incorporate visual information for e-commerce product summaries (Li et al., 2020a).

[Table 1: Comparisons of multimodal corpora (MSMO, MMSS, E-DailyMail (Chen and Zhuge, 2018), EC-product (Li et al., 2020a), MMS (Li et al., 2017), How2 (Sanabria et al., 2018), and MM-AVS) across the columns Doc, Summary (Abs.Sum, Ext.Label), Image (Image, Caption), Video (Video, Audio*, Transcript*), and Title. * indicates that the audio or transcript is separated from the video.]
The above-mentioned datasets are rarely constructed comprehensively and ignore the abundant visual information contained in videos. The only video-containing work with documents is restricted in scale, which hampers its use with deep-learning-based methods. In this study, we build a full-scale multimodal dataset to address these issues.

MM-AVS Dataset
To facilitate a straightforward comparison between multimodal summarization approaches and text-based ones, MM-AVS extends the CNN/DM collections to multiple modalities. Each example of MM-AVS contains a document accompanied by a multi-sentence summary, title, images, captions, videos, and the corresponding audios and transcripts. As shown in Table 1, MM-AVS extends visual information that most existing benchmarks ignore (such as MSMO, MMSS, E-DailyMail (Chen and Zhuge, 2018), and EC-product (Li et al., 2020a)). MMS (Li et al., 2017) and How2 (Sanabria et al., 2018) also take videos into account; however, MMS contains only 50 examples, too few for deep learning, and How2 excludes documents, which are the most critical materials for summarization. MM-AVS also keeps image captions for deeper descriptions of images, as well as document titles for topic extraction. Further, MM-AVS contains extractive labels for training convenience. By providing such abundant multimodal information, MM-AVS is applicable to existing and future multimodal research across different learning tasks.

Dataset construction
The concrete statistics of MM-AVS are shown in Table 2, covering textual and visual modules.

Textual module. Following Nallapati et al. (2016), we crawl all the summary bullets of each story in their original order to obtain a multi-sentence reference, where each bullet is treated as a sentence. Given that the reference is an abstractive summary written by humans, we construct the label of each sentence as Nallapati et al. (2017) do: sentences in the document are greedily selected to maximize the ROUGE (Lin, 2004) score with respect to the gold summary. As for the document and title, we keep the original formats shown on the websites.

Visual module. To enrich visual information for multimodal summarization, we collect images and videos for each example. Image captions are preserved to assist further explorations such as feature extraction and alignment to documents. Given long videos, we separate the audios and extract the transcripts to alleviate the preprocessing burden for large-scale or online learning.
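The greedy label construction above can be sketched as follows. This is an illustrative approximation, not the authors' code: it uses a unigram-overlap F1 as a stand-in for the full ROUGE score.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between two token lists (a proxy for full ROUGE-1)."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    p = overlap / sum(cand.values())
    r = overlap / sum(ref.values())
    return 2 * p * r / (p + r)

def greedy_extractive_labels(doc_sents, summary_tokens):
    """Greedily pick document sentences that maximize overlap with the
    abstractive reference; return a 0/1 label per sentence."""
    labels = [0] * len(doc_sents)
    selected, best = [], 0.0
    while True:
        gains = [(rouge1_f1(selected + sent, summary_tokens), i)
                 for i, sent in enumerate(doc_sents) if not labels[i]]
        if not gains:
            break
        score, i = max(gains)
        if score <= best:          # stop once no sentence improves the score
            break
        best = score
        labels[i] = 1
        selected += doc_sents[i]
    return labels
```

The greedy stopping criterion (no single sentence improves the score) follows the standard recipe of Nallapati et al. (2017).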

Feature Extraction
We utilize the hierarchical bi-directional long short term memory (BiLSTM) (Nallapati et al., 2017) based on word and sentence levels to read tokens and induce a representation for each sentence denoted as s i . Each sentence in a transcript is denoted as t j . In terms of videos, we employ ResNet (He et al., 2016) for feature extraction and BiLSTM to model the sequential pattern in video frames. Each image is represented as m k .
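The hierarchical encoder can be sketched as below. This is a simplified stand-in, not the authors' implementation: a plain bidirectional tanh-RNN replaces the BiLSTM, random matrices stand in for learned parameters, and the ResNet frame features are assumed to arrive as precomputed vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # embedding / hidden size (illustrative)

def bi_rnn(seq, Wf, Wb):
    """Simple bidirectional tanh-RNN; returns the concatenated final states
    of the forward and backward passes (a stand-in for a BiLSTM)."""
    hf = np.zeros(D)
    for x in seq:                      # forward pass
        hf = np.tanh(Wf @ np.concatenate([x, hf]))
    hb = np.zeros(D)
    for x in reversed(seq):            # backward pass
        hb = np.tanh(Wb @ np.concatenate([x, hb]))
    return np.concatenate([hf, hb])    # shape (2*D,)

# Word-level encoder: token embeddings -> sentence vector s_i
Wf_w, Wb_w = rng.normal(size=(2, D, 2 * D))
def encode_sentence(token_embs):
    return bi_rnn(token_embs, Wf_w, Wb_w)

# Sentence-level encoder: sentence vectors -> document representation
Wf_s, Wb_s = rng.normal(size=(2, D, 3 * D))
def encode_document(sentences):
    sent_vecs = [encode_sentence(s) for s in sentences]   # each (2*D,)
    return bi_rnn(sent_vecs, Wf_s, Wb_s)
```

Video frames would be encoded the same way, with ResNet feature vectors m_k fed to `bi_rnn` in place of token embeddings.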

Feature Alignment-Jump Attention
Given that the transcript extracted from a video shares the same modality as a document and accurately aligns with the video, we take it as a bridge to deepen the relationship between the two modalities. We apply jump attention based on transcripts to assist modality alignment: attention first jumps from transcripts to video frames, and then from document sentences to the resulting transcript attention context. (Transcripts are extracted with the IBM Watson Speech to Text service, https://www.ibm.com/watson/services/speech-to-text/.) The video-aware context c^{d2v}_i of the i-th document sentence is denoted as

c^{d2v}_i = Σ_{j=1}^{N_T} b_i^j Σ_{k=1}^{N_M} d_j^k m_k,

where N_T and N_M are the lengths of the transcript and the image-frame sequence, and b_i^j and d_j^k are attention weights calculated as follows (taking d_j^k for illustration):

d_j^k = exp(V^T tanh(q_j + r_k)) / Σ_{k'=1}^{N_M} exp(V^T tanh(q_j + r_{k'})),

where V is a training parameter, and q_j and r_k are the feature mappings of each modality, calculated as q_j = tanh(W_t t_j + b_t) and r_k = tanh(W_m m_k + b_m). The jump attention can be reversed to obtain an article context vector for video summarization.
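A minimal numpy sketch of the two attention jumps is given below. It assumes precomputed sentence, transcript, and frame features of equal dimension and uses random matrices in place of learned parameters; the additive scoring function follows the V^T tanh(q + r) form described above.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8  # feature dimension (illustrative)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys, values, Wq, Wk, v):
    """Additive attention: score each key against the query,
    then return the weighted sum of the values."""
    q = np.tanh(Wq @ query)
    scores = np.array([v @ np.tanh(q + np.tanh(Wk @ k)) for k in keys])
    return softmax(scores) @ np.stack(values)

Wt, Wm, Ws, Wt2 = rng.normal(size=(4, D, D))   # stand-ins for learned weights
v1, v2 = rng.normal(size=(2, D))

def jump_attention(sents, transcripts, frames):
    """Document sentences -> transcripts -> video frames.
    Returns one video-aware context c^{d2v}_i per document sentence."""
    # Jump 1: each transcript sentence t_j attends over frames m_k
    h = [attend(t, frames, frames, Wt, Wm, v1) for t in transcripts]
    # Jump 2: each document sentence s_i attends over transcripts,
    # aggregating the frame-level contexts h_j instead of t_j itself
    return [attend(s, transcripts, h, Ws, Wt2, v2) for s in sents]
```

Reversing the two jumps (frames attending through transcripts to sentences) yields the article context used for video summarization.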

Feature Fusion
Given that modalities may not be accurately aligned, we employ late fusion over unimodal decisions. Inspired by prior work, we induce noise filters to eliminate noise as F(W_s f(s_i), W_c g(c^{d2v}_i)), where the filters W_s and W_c depend on a smoothing coefficient β that controls the penalty intensity, and f(·), g(·), and F(·) are feedforward networks.
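The filtered fusion can be sketched as below. Since the exact parameterization of the filters W_s and W_c is not given here, this sketch assumes a simple sigmoid gate per branch scaled by β; f(·) and g(·) are one-layer feedforward networks with random stand-in weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
D = 8
Uf, Ug = rng.normal(size=(2, D, D))   # stand-ins for feedforward nets f(.), g(.)
uf, ug = rng.normal(size=(2, D))      # gate parameters (assumed form)

def late_fusion(s_i, c_i, beta=1.0):
    """Late fusion of the text and video branches with noise filters.
    The scalar-gate form of W_s and W_c is an assumption; the paper only
    states that the filters depend on a smoothing coefficient beta."""
    fs = np.tanh(Uf @ s_i)            # f(s_i): text-branch decision
    gc = np.tanh(Ug @ c_i)            # g(c^{d2v}_i): video-branch decision
    Ws = sigmoid(beta * (uf @ fs))    # filter for the text branch
    Wc = sigmoid(beta * (ug @ gc))    # filter for the video branch
    return Ws * fs + Wc * gc          # F(., .): filtered combination
```

Because fusion happens after each branch has produced its own decision, misaligned modalities only down-weight (rather than corrupt) each other.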

Multi-task Training
We employ multi-task training to enhance summarization. The loss function is a weighted mix of each task loss:

L = α_ts L_ts + α_vs L_vs,

where L_ts is the training loss for extractive summarization, computed as the cross-entropy L_ts = −Σ_n (y_n log ŷ_n + (1 − y_n) log(1 − ŷ_n)) between the true labels y_n and the predicted labels ŷ_n, and α_ts and α_vs are balance parameters. Following prior work, we use unsupervised reinforcement learning for video summarization, whose reward can be separated into the diversity reward R_div (measuring the dissimilarity among selected frames) and the representativeness reward R_rep (measuring the similarity between the summary and the video):

R_div = (1 / (|M|(|M| − 1))) Σ_{t ∈ M} Σ_{t' ∈ M, t' ≠ t} d(m_t, m_{t'}),

R_rep = exp(−(1/N) Σ_{t=1}^{N} min_{t' ∈ M} ||m_t − m_{t'}||),

where M is the set of selected video frames and d(·) is the dissimilarity function.
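The loss mix and the two rewards can be sketched directly from their definitions. This is an illustrative transcription, not the training code; frames are assumed to be feature vectors.

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross-entropy over extractive sentence labels (L_ts)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def diversity_reward(M, d):
    """R_div: mean pairwise dissimilarity among selected frames M."""
    pairs = [(x, y) for i, x in enumerate(M) for j, y in enumerate(M) if i != j]
    return sum(d(x, y) for x, y in pairs) / max(len(pairs), 1)

def representativeness_reward(frames, M):
    """R_rep: how closely the selected frames M cover all video frames."""
    dists = [min(np.linalg.norm(f - m) for m in M) for f in frames]
    return float(np.exp(-np.mean(dists)))

def multitask_loss(L_ts, L_vs, a_ts=0.5, a_vs=0.5):
    """Weighted mix of the text and video summarization losses."""
    return a_ts * L_ts + a_vs * L_vs
```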

Experiments
We conduct experiments on the MM-AVS dataset and evaluate performance with ROUGE (Lin, 2004). R-1, R-2, and R-L denote the ROUGE-1, ROUGE-2, and ROUGE-L F1-scores, which are widely used to measure the n-gram overlap between decoded summaries and references.

[Table 5: Manual summary quality evaluation.]
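For reference, the n-gram-overlap F1 behind R-1 and R-2 can be written compactly. Real evaluations use the official ROUGE toolkit; this is only an illustration of the metric.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(candidate, reference, n=1):
    """ROUGE-N F1: n-gram overlap between a decoded summary and a reference."""
    c, r = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    p = overlap / sum(c.values())       # precision over candidate n-grams
    rec = overlap / sum(r.values())     # recall over reference n-grams
    return 2 * p * rec / (p + rec)
```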

Assistance of External Information
Videos, audios, and transcripts have received less attention than documents and images, as Table 1 reveals: a multimodal corpus assembling all of them was absent until MM-AVS was built in this study. To verify the importance of these materials for multimodal summarization, we test a text-only baseline and two extensions of it. The baseline is a hierarchical framework operating at the word and sentence levels with a feedforward classification layer; the two extensions additionally consider videos and transcripts, respectively. As shown in Table 3, both videos and transcripts improve multimodal summarization when fused with documents. This validates that external information complementary to the text can help capture the core ideas of documents and induce high-quality summaries.

Analysis of Transcript
To further investigate the nature of transcripts, we compare them with documents and references. As shown in Table 4, the video transcripts in MM-AVS overlap little with the documents, indicating that they do not merely repeat the documents but provide useful auxiliary information. Table 4 also shows that the transcripts are weakly correlated with the references, suggesting that transcripts can assist summary generation but are insufficient on their own to produce excellent summaries.

Manual Evaluation
The document, video, and document-with-video summarization results on 200 groups of MM-AVS examples are scored by five computer science graduates in terms of informativeness (Inform) and satisfaction (Satis). Each summary is scored on these two criteria, and the results (Table 5) again confirm the benefit of external information such as videos for producing excellent summaries.

Conclusions
In this work, we contribute a full-scale dataset for multimodal summarization that extensively assembles documents, summaries, images, captions, videos, audios, transcripts, and titles. Based on this dataset, we propose a novel multimodal summarization framework to serve as a baseline for future research in this community.