VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles

A popular multimedia news format nowadays provides users with a lively video and a corresponding news article, a combination employed by influential news media such as CNN and BBC and by social media such as Twitter and Weibo. In such a setting, automatically choosing a proper cover frame for the video and generating an appropriate textual summary of the article can help editors save time and readers make decisions more effectively. Hence, in this paper, we propose the task of Video-based Multimodal Summarization with Multimodal Output (VMSMO) to tackle this problem. The main challenge in this task is to jointly model the temporal dependency of the video and the semantic meaning of the article. To this end, we propose a Dual-Interaction-based Multimodal Summarizer (DIMS), consisting of a dual interaction module and a multimodal generator. In the dual interaction module, we propose a conditional self-attention mechanism that captures local semantic information within the video and a global-attention mechanism that handles the semantic relationship between news text and video at a high level. Extensive experiments conducted on a large-scale real-world VMSMO dataset show that DIMS achieves state-of-the-art performance in terms of both automatic metrics and human evaluations.


Introduction
Existing studies have shown that multimodal news can significantly improve users' sense of satisfaction with informativeness. As one of these multimedia forms, introducing news events with a video and a textual description is becoming increasingly popular, and has been employed as the main form of news reporting by media including BBC, Weibo, CNN, and Daily Mail. An illustration is shown in Figure 1, where the news contains a video with a cover picture and a full news article with a short textual summary. In such a case, automatically generating multimodal summaries, i.e., choosing a proper cover frame for the video and generating an appropriate textual summary of the article, can help editors save time and readers make decisions more effectively.
There are several works focusing on multimodal summarization. The work most related to ours is , where the authors propose the task of generating a textual summary and picking the most representative picture from six input candidates. However, in real-world applications, the input is usually a video consisting of hundreds of frames. Consequently, the temporal dependency in a video cannot simply be modeled by static encoding methods. Hence, in this work, we propose a novel task, Video-based Multimodal Summarization with Multimodal Output (VMSMO), which selects a cover frame from the news video and generates a textual summary of the news article at the same time.
The cover image should capture the salient point of the whole video, while the textual summary should extract the important information from the source article. Since the video and the article report on the same event with the same content, the two modalities complement each other in the summarization process. However, fully exploring the relationship between the temporal dependency of frames in the video and the semantic meaning of the article remains a problem, since the video and the article come from two different spaces.
Hence, in this paper, we propose a model named Dual-Interaction-based Multimodal Summarizer (DIMS), which learns to summarize the article and the video simultaneously by conducting a dual interaction strategy. Specifically, we first employ Recurrent Neural Networks (RNN) to encode the text and the video; this encoding RNN captures the spatial and temporal dependencies between images in the video. Next, we design a dual interaction module to let the video and text fully interact with each other: a conditional self-attention mechanism that learns local video representations under the guidance of the article, and a global-attention mechanism that learns high-level representations of the video-aware article and the article-aware video. Finally, the multimodal generator generates the textual summary and extracts the cover image based on the fused representation from the last step. To evaluate the performance of our model, we collect the first large-scale news article-summary dataset associated with video-cover pairs from social media websites. Extensive experiments on this dataset show that DIMS significantly outperforms state-of-the-art baseline methods on commonly-used metrics by a large margin.
To summarize, our contributions are threefold:
• We propose a novel Video-based Multimodal Summarization with Multimodal Output (VMSMO) task, which chooses a proper cover frame for the video and generates an appropriate textual summary of the article.
• We propose a Dual-Interaction-based Multimodal Summarizer (DIMS) model, which jointly models the temporal dependency of video with semantic meaning of article, and generates textual summary with video cover simultaneously.
• We construct a large-scale dataset for VMSMO, and experimental results demonstrate that our model outperforms other baselines in terms of both automatic and human evaluations.

Related Work
Our research builds on previous work in three fields: text summarization, multimodal summarization, and visual question answering.

Text Summarization. Our proposed task builds on text summarization, whose methods can be divided into extractive and abstractive ones (Gao et al., 2020b). Extractive models (Narayan et al., 2018; Luo et al., 2019; Xiao and Carenini, 2019) directly pick sentences from the article and regard their aggregate as the summary. In contrast, abstractive models (Sutskever et al., 2014; See et al., 2017; Wenbo et al., 2019; Gui et al., 2019) generate a summary from scratch, and abstractive summaries are typically less redundant.

Multimodal Summarization. A series of works (Palaskar et al., 2019; Chan et al., 2019; Gao et al., 2020a) focused on generating better textual summaries with the help of multimodal input. Multimodal summarization with multimodal output is relatively less explored.  proposed to jointly generate a textual summary and select the most relevant image from six candidates. Following their work, Zhu et al. (2020) added a multimodal objective function that combines the losses of textual summary generation and image selection. However, in real-world applications, we usually need to choose the cover frame for a continuous video consisting of hundreds of frames; consequently, the temporal dependency between frames in a video cannot simply be modeled by static encoding methods.

Visual Question Answering. The Visual Question Answering (VQA) task is similar to ours in taking images and a corresponding text as input. Most works treat VQA as a classification problem, where the understanding of image subregions or image recognition becomes particularly important (Goyal et al., 2017; Malinowski et al., 2015; Wu et al., 2016; Xiong et al., 2016).
As for interaction models, one state-of-the-art VQA model proposed positional self-attention with a co-attention mechanism, which is faster than a recurrent neural network (RNN). Guo et al. (2019) proposed a question-answer synergistic network, in which candidate answers are first coarsely scored according to their relevance to the image-question pair, and answers with a high probability of being correct are then re-ranked by synergizing with the image and question.

Problem Formulation
Before presenting our approach to VMSMO, we first introduce notations and key concepts. For an input news article X = {x_1, x_2, ..., x_{T_d}} with T_d words, we assume there is a ground-truth textual summary Y = {y_1, y_2, ..., y_{T_y}} with T_y words. Meanwhile, there is a news video V corresponding to the article, and we assume there is a ground-truth cover picture C corresponding to the most important frame of the video content. Given an article X and the corresponding video V, our model emphasizes the salient parts of both inputs by conducting deep interaction. The goal is to generate a textual summary Y that grasps the main points of the article and to choose a cover frame C that covers the gist of the video.

Overview
In this section, we propose our Dual-Interaction-based Multimodal Summarizer (DIMS), which can be divided into three parts, as shown in Figure 2:
• Feature Encoder is composed of a text encoder and a video encoder, which encode the input article and video separately.
• Dual Interaction Module conducts deep interaction, including conditional self-attention and global-attention mechanism between video segment and article to learn different levels of representation of the two inputs.
• Multi-Generator generates the textual summary and chooses the video cover by incorporating the fused information.

Text encoder
To model the semantic meaning of the input news text X = {x_1, x_2, ..., x_{T_d}}, we first use a word embedding matrix e to map the one-hot representation of each word x_i into a high-dimensional vector space. Then, in order to encode contextual information from these embeddings, we use a bi-directional recurrent neural network (Bi-RNN) to model the temporal interactions between words: h^x_t = Bi-RNN(e(x_t), h^x_{t-1}), where h^x_t denotes the hidden state of the t-th step of the Bi-RNN over X. Following See et al. (2017) and Ma et al. (2018), we choose the long short-term memory (LSTM) cell (Hochreiter and Schmidhuber, 1997) as the Bi-RNN cell.

Video Encoder
A news video usually lasts several minutes and consists of hundreds of frames. Intuitively, a video can be divided into several segments, each of which corresponds to different content. Hence, we encode the video hierarchically. More specifically, we equally divide the frames of the video into several segments and employ a low-level frame encoder and a high-level segment encoder to learn a hierarchical representation.

Frame encoder. We utilize the ResNet-v1 model (He et al., 2016) to encode frames, which alleviates gradient vanishing and reduces computational cost: M_{ij} = F_v(ResNet(m_{ij})), where m_{ij} is the j-th frame in the i-th segment and F_v(·) is a linear transformation function.
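The segmentation step above can be sketched as follows. This is a minimal illustration of equally dividing a sequence of frame features into fixed-size segments; the function name and the choice to drop trailing frames that do not fill a whole segment are assumptions, since the paper does not specify how remainders are handled.

```python
import numpy as np

def chunk_into_segments(frames, frames_per_segment):
    """Equally divide a sequence of frame features into segments.

    frames: array of shape (num_frames, feat_dim). Trailing frames that do
    not fill a whole segment are dropped for simplicity (an assumption).
    """
    num_segments = len(frames) // frames_per_segment
    usable = frames[: num_segments * frames_per_segment]
    return usable.reshape(num_segments, frames_per_segment, -1)

# 25 frame features of dimension 4, grouped into segments of 5 frames each
frames = np.random.rand(25, 4)
segments = chunk_into_segments(frames, frames_per_segment=5)
print(segments.shape)  # (5, 5, 4)
```

Each segment is then encoded independently by the frame encoder before the segment-level Bi-RNN runs over it.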
Segment encoder. As mentioned before, it is important to model the continuity of images in the video, which cannot be captured by a static encoding strategy. We employ an RNN as the segment encoder due to its superiority in exploiting the temporal dependency among frames (Zhao et al., 2017): S_{ij} = Bi-RNN(M_{ij}, S_{i(j-1)}), where S_{ij} denotes the hidden state of the j-th step of the Bi-RNN over segment s_i, and the final hidden state S_{iT_f} serves as the overall representation of segment s_i, where T_f is the number of frames in a segment.

Dual Interaction Module
The cover image should contain the key point of the whole video, while the textual summary should extract the important information from the source article; hence, the two modalities complement each other in the summarization process. In this section, we conduct a deep interaction between the video and the article to jointly model the temporal dependency of the video and the semantic meaning of the text. The module consists of a conditional self-attention mechanism that captures local semantic information within video segments and a global-attention mechanism that handles the semantic relationship between news text and video at a high level.
Conditional self-attention mechanism. Traditional self-attention can obtain contextual video representations thanks to its flexibility in relating two elements in a distance-agnostic manner. However, as illustrated in Xie et al. (2020), semantic understanding often relies on dependencies more complicated than pairwise ones, especially conditional dependency upon a given premise. Hence, in the VMSMO task, we capture the local semantic information of the video conditioned on the input text information.
Our conditional self-attention module, shown in Figure 3, is composed of a stack of N identical layers and a conditional layer. The identical layers learn to encode local video segments, while the conditional layer learns to assign high weights to video segments conditioned on their relationship to the article. We first use a fully-connected layer to project each segment representation S_{iT_f} into a query Q_i, key K_i, and value V_i. The scaled dot-product self-attention is then Ŝ = softmax(QK^T / √d)V, where d is the hidden dimension and T_s is the number of segments in a video. Ŝ_i is then fed into a feed-forward sub-layer with a residual connection (He et al., 2016) and layer normalization (Ba et al., 2016). Next, we highlight the salient parts of the video under the guidance of the article: taking the article representation h^x_{T_d} as the condition, an attention score β_i is computed on each original segment representation S_{iT_f}. The final conditional segment representation is S^c_i = β_i Ŝ_i.
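A minimal numpy sketch of this mechanism is given below. The scaled dot-product part follows the standard formulation; the scoring function of the conditional layer (here a linear score over the concatenation of each segment state and the article state, via an assumed parameter vector `w_cond`) is illustrative, since the paper's exact parameterization is not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conditional_self_attention(S, h_article, Wq, Wk, Wv, w_cond):
    """Sketch of conditional self-attention over video segments.

    S: (T_s, d) segment representations; h_article: (d,) final article state.
    Wq/Wk/Wv: (d, d) projections; w_cond: (2d,) conditional scoring vector.
    All parameter names are illustrative assumptions.
    """
    d = S.shape[1]
    Q, K, V = S @ Wq, S @ Wk, S @ Wv
    # scaled dot-product self-attention over segments
    S_hat = softmax(Q @ K.T / np.sqrt(d)) @ V
    # conditional layer: weight each segment by its relevance to the article
    scores = np.concatenate([S, np.tile(h_article, (S.shape[0], 1))], axis=1) @ w_cond
    beta = softmax(scores)
    return beta[:, None] * S_hat, beta  # conditional segment representations

T_s, d = 4, 8
rng = np.random.default_rng(0)
S = rng.standard_normal((T_s, d))
h = rng.standard_normal(d)
S_cond, beta = conditional_self_attention(
    S, h, *(rng.standard_normal((d, d)) for _ in range(3)), rng.standard_normal(2 * d)
)
print(S_cond.shape)  # (4, 8); beta sums to 1 over segments
```

The residual connection and layer normalization of the feed-forward sub-layer are omitted for brevity.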
Global-attention mechanism. The global-attention module grounds the article representation on the video segments and fuses article information into the video, yielding an article-aware video representation and a video-aware article representation. Formally, we utilize a two-way attention mechanism to obtain the co-attention between the encoded text representation h^x_t and the encoded segment representation S_{iT_f}, where E_{ti} denotes the attention weight on the t-th word by the i-th video segment. To learn the alignments between text and segment information, the global representations of the video-aware article ĥ^x_t and the article-aware video Ŝ^c_i are computed as attention-weighted combinations of the two modalities.
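The two-way attention can be sketched as below. The exact scoring function for E is not reproduced here, so a plain dot product between word and segment states is assumed; normalizing E along each axis gives the two attention distributions used to build the video-aware article and the article-aware video.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_attention(H, S):
    """Two-way (co-)attention sketch between word states H (T_d, d)
    and segment states S (T_s, d). Dot-product scoring is an assumption."""
    E = S @ H.T                      # (T_s, T_d): weight of word t for segment i
    A_over_segments = softmax(E, axis=0)  # for each word, attend over segments
    A_over_words = softmax(E, axis=1)     # for each segment, attend over words
    H_video_aware = A_over_segments.T @ S  # (T_d, d) video-aware article
    S_article_aware = A_over_words @ H     # (T_s, d) article-aware video
    return H_video_aware, S_article_aware

rng = np.random.default_rng(1)
H = rng.standard_normal((6, 8))   # 6 words, hidden size 8
S = rng.standard_normal((4, 8))   # 4 video segments
H_va, S_aa = global_attention(H, S)
print(H_va.shape, S_aa.shape)  # (6, 8) (4, 8)
```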

Multi-Generator
In the VMSMO task, the multi-generator module not only generates the textual summary but also chooses the video cover.

Textual summary generation. For the first task, we use the final state of the input text representation h^x_{T_d} as the initial state d_0 of the RNN decoder. At the t-th decoding step, the decoder updates its hidden state d_t from the previous state, the previously generated word, and the context vector h^c_{t-1}, which is calculated by the standard attention mechanism (Bahdanau et al., 2014) introduced below.
To take advantage of both the article representation h^x_t and the video-aware article representation ĥ^x_t, we apply an "editing gate" γ_e to decide how much information from each side should be focused on; the context vector h^c_{t-1} is then calculated by attending over the gated combination of the two representations.
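The editing gate can be sketched as a sigmoid gate interpolating between the two article representations for each word position. The gate parameterization (a linear score over the concatenated states via an assumed vector `w_gate`) is illustrative; the paper's exact form is not reproduced here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def editing_gate(h_t, h_t_video_aware, w_gate):
    """Editing-gate fusion (sketch): gamma_e in (0, 1) decides how much
    weight the plain article state gets versus the video-aware one.
    w_gate is an assumed learned parameter vector."""
    gamma_e = sigmoid(np.concatenate([h_t, h_t_video_aware]) @ w_gate)
    return gamma_e * h_t + (1.0 - gamma_e) * h_t_video_aware

rng = np.random.default_rng(2)
h = rng.standard_normal(8)        # article state for one word
h_va = rng.standard_normal(8)     # video-aware article state for the same word
fused = editing_gate(h, h_va, rng.standard_normal(16))
print(fused.shape)  # (8,)
```

The standard attention mechanism then computes h^c_{t-1} over these fused states.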
Finally, the context vector h^c_t is concatenated with the decoder state d_t and fed into a linear layer to obtain the generated word distribution P_v. Following See et al. (2017), we also equip our model with a pointer network to handle the out-of-vocabulary problem. The loss of textual summary generation is the negative log-likelihood of the target word y_t.

Cover frame selector. The cover frame is chosen based on the hierarchical video representations, i.e., the original frame representation M_{ij}, the conditional segment representation S^c_i, and the article-aware segment representation Ŝ^c_i, which are fused to produce a matching score y^c_{ij} for each candidate frame. The fusion gates γ^1_f and γ^2_f used here are determined by the last text encoder hidden state h^x_{T_d}. We use a pairwise hinge loss to measure selection accuracy, where y^c_negative and y^c_positive denote the matching scores of a negative sample and the ground-truth frame, respectively, and the margin in L_pic is the rescale margin of the hinge loss.
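The pairwise hinge loss described above can be written as a short function. The specific margin value is an assumption for illustration; the paper only says it is the rescale margin of the hinge loss.

```python
def pairwise_hinge_loss(score_positive, scores_negative, margin=1.0):
    """Pairwise hinge loss over cover-frame candidates (sketch): penalize
    any negative candidate scored within `margin` of the positive one."""
    return sum(max(0.0, margin - score_positive + s_neg)
               for s_neg in scores_negative)

# positive frame scored 0.9; two negatives at 0.2 and 0.8
loss = pairwise_hinge_loss(0.9, [0.2, 0.8], margin=1.0)
print(loss)  # 0.3 + 0.9 ≈ 1.2
```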
The overall loss for the model is the combination of the textual summary generation loss and the cover frame selection loss.

Experimental Setup

Dataset
To the best of our knowledge, there is no existing large-scale dataset for the VMSMO task. Hence, we collect the first large-scale dataset for VMSMO from Weibo, the largest social network website in China.
Most of China's mainstream media have Weibo accounts, and they publish the latest news in their accounts with lively videos and articles. Correspondingly, each sample of our data contains an article with a textual summary and a video with a cover picture. The average video duration is one minute and the frame rate of video is 25 fps. For the text part, the average length of article is 96.84 words and the average length of textual summary is 11.19 words. Overall, there are 184,920 samples in the dataset, which is split into a training set of 180,000 samples, a validation set of 2,460 samples, and a test set of 2,460 samples.

Comparisons
We compare our proposed method against summarization baselines and VQA baselines. Traditional textual summarization baselines: Lead: selects the first sentence of the article as the textual summary (Nallapati et al., 2017). TextRank: a graph-based extractive summarizer that adds sentences as nodes and uses edges to weight similarity (Mihalcea and Tarau, 2004). PG: a sequence-to-sequence framework combined with an attention mechanism and a pointer network (See et al., 2017). Unified: a model that combines the strengths of extractive and abstractive summarization (Hsu et al., 2018). GPG: generates the textual summary by "editing" pointed tokens instead of hard copying (Shen et al., 2019).

Multimodal baselines:
How2: a model proposed to generate a textual summary with video information (Palaskar et al., 2019). Synergistic: an image-question-answer synergistic network that values the role of the answer for precise visual dialog (Guo et al., 2019). PSAC: a model adding positional self-attention with co-attention for the VQA task.
MSMO: the first model on the multimodal-output task, which attends to text and images while generating the textual summary and uses coverage to help select the picture. MOF: a model based on MSMO that adds image selection accuracy as an additional loss (Zhu et al., 2020).

Evaluation Metrics
The quality of the generated textual summary is evaluated by standard full-length Rouge F1 (Lin, 2004), following previous work (See et al., 2017). R-1, R-2, and R-L refer to unigram, bigram, and longest-common-subsequence overlap, respectively. The quality of the chosen cover frame is evaluated by mean average precision (MAP) and recall at position k (R_n@k). R_n@k measures whether the positive sample is ranked in the top k positions of n candidates.

[Table 1 excerpt (R-1/R-2/R-L): Unified (Hsu et al., 2018) 23.0/6.0/20.9; GPG (Shen et al., 2019) 20.1/4.5/17.3; DIMS (ours) 25.1/9.6/23.2]
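The R_n@k metric described above can be computed with a short helper. This is a minimal sketch of the metric's definition, not the paper's evaluation script.

```python
def recall_at_k(scores, positive_index, k):
    """R_n@k: 1 if the positive candidate is ranked in the top k of the
    n scored candidates, else 0."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return 1 if positive_index in ranked[:k] else 0

scores = [0.1, 0.9, 0.4, 0.3]                      # 4 candidate frames
print(recall_at_k(scores, positive_index=1, k=1))  # 1: top-scored frame is the positive
print(recall_at_k(scores, positive_index=0, k=2))  # 0: positive ranked last
```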

Implementation Details
We implement our experiments in Tensorflow (Abadi et al., 2016) on an NVIDIA GTX 1080 Ti GPU. The code for our model is available online. For all models, we set the word embedding dimension and the hidden dimension to 128. The encoding step is set to 100, while the minimum decoding step is 10 and the maximum is 30. For video preprocessing, we extract one of every 120 frames to obtain 10 frames as cover candidates. All candidates are resized to 128x64. We regard the frame that has the maximum cosine similarity with the ground-truth cover as the positive sample, and the others as negative samples. Note that the average cosine similarity of positive samples is 0.90, a high score, demonstrating the high quality of the constructed candidates. In the conditional self-attention mechanism, the number of stacked layers is set to 2. For hierarchical encoding, each segment contains 5 frames. Experiments are performed with a batch size of 16. All parameters in our model are initialized from a Gaussian distribution. During training, we use the Adagrad optimizer and apply gradient clipping with a range of [-2, 2]. The vocabulary size is limited to 50k. For testing, we use beam search with beam size 4 and decode until an end-of-sequence token is reached. We select the 5 best checkpoints based on performance on the validation set and report averaged results on the test set.

We first examine whether our DIMS outperforms the baselines listed in Table 1 and Table 2. First, abstractive models outperform all extractive methods, demonstrating that our proposed dataset is suitable for abstractive summarization. Second, the video-enhanced models outperform traditional textual summarization models, indicating that video information helps summary generation.
Finally, our model outperforms MOF by 17.8%, 68.4%, and 29.6% in terms of Rouge-1, Rouge-2, and Rouge-L, and by 6.3% and 15.2% in MAP and R@1, respectively, which proves the superiority of our model. All our Rouge scores have a 95% confidence interval of at most ±0.55 as reported by the official Rouge script. In addition to automatic evaluation, system performance was also evaluated by human judgments of the generated textual summaries on 70 randomly selected cases, similar to Liu and Lapata (2019). Our first evaluation study quantified the degree to which summarization models retain key information from the articles, following a question-answering (QA) paradigm (Narayan et al., 2018).
A set of questions was created based on the gold summary. We then examined whether participants were able to answer these questions by reading the system summaries alone. We created 183 questions in total, with two to three questions per gold summary. Correct answers were marked with 1 and incorrect ones with 0. The system score is the average of all question scores.
Our second evaluation estimated the overall quality of the textual summaries by asking participants to rank them according to Informativeness (does the summary convey important content about the topic in question?), Coherence (is the summary fluent and grammatical?), and Succinctness (does the summary avoid repetition?). Participants were presented with the gold summary and the summaries generated by the systems that performed best on automatic metrics, and were asked to decide which was the best and which was the worst. The rating of each system was calculated as the percentage of times it was chosen as best minus the percentage of times it was selected as worst, ranging from -1 (worst) to 1 (best). Both evaluations were conducted by three native-speaker annotators. Participants evaluated summaries produced by Unified, How2, MOF, and our DIMS, all of which achieved high performance in automatic evaluations. As shown in Table 3, on both evaluations, participants overwhelmingly prefer our model. All pairwise comparisons among systems are statistically significant using the paired Student's t-test at α = 0.01.

Figure 4: Visualizations of the global-attention matrix between the news article and two frames in the same video.
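The best-worst scaling rating described above reduces to a one-line computation. The counts below are made-up illustrative numbers, not results from the paper.

```python
def system_rating(best_count, worst_count, total):
    """Best-worst scaling score: fraction of times a system was chosen
    best minus the fraction it was chosen worst, in [-1, 1]."""
    return (best_count - worst_count) / total

# hypothetical: chosen best 30 times and worst 10 times out of 70 comparisons
print(system_rating(30, 10, 70))  # ≈ 0.286
```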

Ablation Study
Next, we conduct ablation tests to assess the importance of the conditional self-attention mechanism (-S) and the global-attention mechanism (-G) in Table 2. All ablated models perform worse than DIMS on all metrics, which demonstrates the preeminence of the full model. Specifically, the global-attention module contributes most to textual summary generation, while the conditional self-attention module is more important for choosing the cover frame.

Analysis of Multi-task learning
Our model aims to generate the textual summary and choose the cover frame at the same time, which can be regarded as multi-task learning. Hence, in this section, we examine whether the two tasks complement each other. We separate our model into two single-task architectures, named DIMS-textual-summary and DIMS-cover-frame, which generate the textual summary and choose the video cover frame, respectively. The results are shown in Table 2. The multi-task DIMS outperforms the single-task DIMS-textual-summary and DIMS-cover-frame, improving summarization performance by 20.8% in terms of Rouge-L and increasing cover selection accuracy by 7.0% in MAP.

Visualization of dual interaction module
To study the multimodal interaction module, we visualize the global-attention matrix E_{ti} of Equation 8 on one randomly sampled case, as shown in Figure 4. In this case, we show the attention on article words for two representative images in the video; the darker the color, the higher the attention weight. For the left figure, the phrase "hand in hand" has a higher weight than "picture", while for the right figure, the phrase "Book Fair" has the highest weight. This corresponds to the fact that the left frame mainly shows two old men, while the right frame is about reading books.
Article: On August 26, in Ankang, Shaanxi, a 12-year-old junior school girl, Yu Taoxin, goose-stepped like a soldier on parade during the military training of the new semester, winning thousands of praises. Yu Taoxin said that her father is a veteran, and she worked hard in military training because of his influence. Her father told her that military training should be as strict as in the army.

We show a case study in Table 4, which includes the input article and the summaries generated by different models. We also show the question-answering pair used in the human evaluation and the chosen cover. The results show that the summary generated by our model is both fluent and accurate, and that the chosen cover frame is similar to the ground-truth frame.
Conclusion

In this paper, we propose the task of Video-based Multimodal Summarization with Multimodal Output (VMSMO), which chooses a proper video cover and generates an appropriate textual summary for a video-attached article. We propose a model named Dual-Interaction-based Multimodal Summarizer (DIMS), which includes a local conditional self-attention mechanism and a global-attention mechanism, to jointly model and summarize multimodal input. Our model achieves state-of-the-art results in terms of both automatic metrics and human evaluations. In the near future, we aim to incorporate video script information into the multimodal summarization process.