Multimodal Sentence Summarization via Multimodal Selective Encoding

This paper studies the problem of generating a summary for a given sentence-image pair. Existing multimodal sequence-to-sequence approaches mainly focus on enhancing the decoder with visual signals, while ignoring that the image can improve the ability of the encoder to identify highlights of a news event or a document. Thus, we propose a multimodal selective gate network that considers reciprocal relationships between textual and multi-level visual features, including the global image descriptor, activation grids, and object proposals, to select highlights of the event when encoding the source sentence. In addition, we introduce a modality regularization to encourage the summary to capture the highlights embedded in the image more accurately. To verify the generalization of our model, we apply the multimodal selective gate to both a text-based decoder and a multimodal-based decoder. Experimental results on a public multimodal sentence summarization dataset demonstrate the advantage of our models over baselines. Further analysis suggests that our proposed multimodal selective gate network can effectively select important information in the input sentence.


Introduction
Text summarization is a task that condenses a long sentence into a short version. Existing studies (Rush et al., 2015; Chopra et al., 2016; Zeng et al., 2016; Tan et al., 2017; Zhou et al., 2017; Li et al., 2020b) produce summaries only from the text. However, it has been shown that humans understand text by relying on multimodal information (Waltz, 1980; Srihari, 1994; He and Deng, 2017), such as linguistic and visual signals. In this paper, we focus on the multimodal summarization task (Li et al., 2018a), which generates a summary by drawing knowledge simultaneously from coupled text and image, and which can facilitate other applications such as image captioning (Vinyals et al., 2015), multimodal news summarization (Narayan et al., 2017; Chen and Zhuge, 2018), and electronic commerce (e-commerce) product description generation (Chen et al., 2019; Elad et al., 2019; Li et al., 2020a).
Intuitively, it is easier for a reader to grasp the highlight of a news event by viewing an image than by reading a long text. Hence we believe that the image will benefit text summarization systems. Figure 1 illustrates this phenomenon. For a given source sentence, a paired image visualizes a set of event highlight words, which highly correlates with the reference summary.
Multimodal sequence-to-sequence (seq2seq) learning has been widely explored in machine translation (MT) (Caglayan et al., 2017; Helcl et al., 2018; Grönroos et al., 2018) in recent years, and these models outperform text-only models (Barrault et al., 2018). The main difference between multimodal MT and multimodal text summarization is that an MT model is required to convey the same semantics from the input sentence and the paired image to the output, whereas a summarization model is expected to select the important information from the input. Li et al. (2018a) propose a hierarchical attention model for the multimodal sentence summarization task, but the image is not involved in the process of text encoding. It should be easier for the decoder to generate an accurate summary if the encoder can filter out trivial information when encoding the input sentence. Based on this idea, we propose a multimodal selective mechanism that aims to select the highlights from the input text using visual signals, after which the decoder generates the summary using the filtered encoding information. Concretely, an encoder reads the input text and generates the hidden representations. Then, multimodal selective gates measure the relevance between the input words and the image to construct the selected hidden representation. Finally, a decoder generates the summary using the selected hidden representation. We explore the capacity of the image to select salient content from the input text at different levels, including the global image descriptor, activation grids, and object proposals in the image. Accordingly, as shown in Figure 2, we design three visual selective gates: global-level, grid-level, and object-level gates.
[Figure 1: Source sentence: british prime minister tony blair has rescued a danish swimmer from the shark infested sea during his holiday in seychelles in africa , a government spokesman said friday . Reference summary: uk pm rescues swimmer from sea .]
Furthermore, some abstract concepts, such as "guilty" and "freedom", can hardly be represented well by the image, and thus we combine textual and visual selective gates to construct multimodal selective gates. In addition, we argue that a good summarizer should adequately capture the highlight words. Thus, we impose a modality regularization on our model that encourages the semantic similarity between (image, generated summary) to be higher than that between (image, source text).
Our main contributions are as follows:
• We propose a novel multimodal selective mechanism that can use both textual and visual signals to select the important information from the source text.
• We propose a visual-guided modality regularization module to encourage the model to focus on the key information in the source.
• The experimental results on a multimodal sentence summarization dataset demonstrate that our proposed system can take advantage of multimodal information and outperform baseline methods.

Related Work

Abstractive Sentence Summarization
Rush et al. (2015) first propose a seq2seq model to generate the summary for a sentence. Chopra et al. (2016) and ...

Figure 2: The framework of our model. We design visual selective gates including global-level, grid-level and object-level visual gates to select salient encoding information. We also integrate the textual and visual selective gates to construct multimodal selective gates. In this figure, the summary is generated with the text-based decoder, and we also apply multimodal selective gates to the multimodal-based decoder (Li et al., 2018a) in our work.

Multimodal Seq2seq Models
Libovický and Helcl (2017) propose multi-source seq2seq learning with hierarchical attention. Other work uses images as source words to improve translation quality. Delbrouck and Dupont (2017) adjust various attention mechanisms for the visual modality. Another line of work proposes attention mechanisms for textual and visual modalities and combines them to decode target words. Narayan et al. (2017) develop extractive summarization with side information including images and captions. Chen and Zhuge (2018) and Zhu et al. (2020) propose to generate multimodal summaries for multimodal news documents. Li et al. (2018a) first introduce the multimodal sentence summarization task, and they propose a hierarchical attention model, which can pay different attention to image patches, words, and different modalities when decoding target words. Li et al. (2020a) propose an aspect-aware multimodal summarization model for e-commerce products. We use visual signals to enhance the encoder for the multimodal sentence summarization task, aiming to discriminate the important source information when encoding the input sentence.

Overview
The input of the multimodal text summarization task is a pair of text and image, and the output is a textual summary. We propose visual selective gates to encourage important source information to be encoded into the second-level hidden sequence. We argue that the text can be pertinent to visual information at the level of the whole image, the image parts, and the object proposals in the image. Thus, as shown in Figure 2, we design three visual selective gates: global-level, grid-level, and object-level gates. Next, the summary decoder produces the summary based on the second-level hidden states. We apply two types of decoders: the text-based decoder and the multimodal-based decoder.

Text Encoder
Given a source text x = (x_1, ..., x_n), an encoder builds the first-level hidden state sequence h = (h_1, ..., h_n). We apply a bidirectional GRU to encode the source sentence in the forward and backward directions into two sequences of hidden states.
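As a sketch of the encoder described above, a bidirectional GRU is commonly written as follows. The concatenation in the last term is the usual choice and is an assumption here, not a formula taken from this paper.

```latex
\overrightarrow{h}_i = \mathrm{GRU}\left(x_i, \overrightarrow{h}_{i-1}\right), \qquad
\overleftarrow{h}_i = \mathrm{GRU}\left(x_i, \overleftarrow{h}_{i+1}\right), \qquad
h_i = \left[\overrightarrow{h}_i ; \overleftarrow{h}_i\right]
```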

Image Encoder
Given an image, we apply a pre-trained VGG (VGG19) and a Faster R-CNN (Ren et al., 2017) model to extract visual features at multiple levels, including the whole image, the image parts, and the object proposals in the image. We extract the 4096-dimensional fully-connected layer of VGG19 (fc7 layer) as the image global feature v^g. We extract a 7 × 7 × 512 feature map of the last pooling layer (pool5 layer) of VGG19 as the image grid-level feature v^r. For the object proposal features v^o of an image, we use Faster R-CNN (Ren et al., 2017) initialized with ResNet-101 (He et al., 2016) pretrained for classification on ImageNet (Deng et al., 2009), and then we retrain it on the Visual Genome dataset (Krishna et al., 2017). v^o_j ∈ R^2048 is obtained from the ROI pooling layer in the Region Proposal Network. We choose the top 36 object proposals after non-maximum suppression (Neubeck and Gool, 2006).

Textual Selective Gates
In this section, we briefly describe the selective mechanism (Zhou et al., 2017). The selective encoding (Zhou et al., 2017) extends the seq2seq model by constructing a second-level hidden state h′_i to control the information flow from the encoder to the decoder. The gate for each word is computed from h_i and r, where σ denotes the sigmoid function, W and U are parameter matrices, and r is the overall representation for the text x.
Then, h′_i is computed from h_i and its gate. Finally, the decoder generates the summary depending on the second-level hidden sequence h′ as a standard seq2seq model does.
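The textual selective gate can be illustrated with a small sketch. It assumes the form used by selective encoding, gate = σ(W h_i + U r) followed by an element-wise product h′_i = h_i ⊙ gate; the function names and the toy values are hypothetical, and a bias term is omitted because the text mentions only W and U.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def textual_selective_gate(h_i, r, W, U):
    """Assumed form: gate = sigmoid(W h_i + U r); h'_i = h_i * gate (element-wise)."""
    pre_activation = [a + b for a, b in zip(matvec(W, h_i), matvec(U, r))]
    gate = [sigmoid(z) for z in pre_activation]
    h_prime = [g * h for g, h in zip(gate, h_i)]
    return gate, h_prime


# Toy example with 2-dimensional states and hand-picked weights (hypothetical values).
h_i = [1.0, -2.0]
r = [0.5, 0.5]
W = [[0.0, 0.0], [0.0, 0.0]]
U = [[0.0, 0.0], [0.0, 0.0]]
gate, h_prime = textual_selective_gate(h_i, r, W, U)
print(gate, h_prime)  # all-zero weights give a gate of 0.5 in every dimension
```

Because every gate value lies in (0, 1), each dimension of h_i is scaled down rather than replaced, which is what lets the gate suppress unimportant words without discarding them.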

Visual Selective Gates
We design three kinds of visual selective gates to select salient encoding information: global-level, grid-level, and object-level gates. We further combine textual and visual selective gates to construct multimodal selective gates.

Global-Level Visual Selective Gate
To construct a global-level visual selective gate sGate^g, we explore the association between the text hidden state h_i and the global visual feature v^g.
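A plausible form of this gate, written by analogy with the textual selective gate and not taken from the original equation, replaces the sentence representation with the global image feature. The parameter names W_g and U_g are hypothetical.

```latex
\mathrm{sGate}^{g}_{i} = \sigma\left(W_g h_i + U_g v^{g}\right), \qquad
h'_i = h_i \odot \mathrm{sGate}^{g}_{i}
```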

Grid-Level Visual Selective Gate
Grid-level image features retain the spatial information of an image, which can be used to bridge the gap between text and image patch by patch. Here, v^r_j ∈ R^512 (j ∈ [1, 49]) is a grid-level image feature. Specifically, for each h_i, we compute a grid-level visual selective gate with each grid-level image patch v^r_j, and then we apply max-pooling over the gates to take the maximum value of each dimension as the final selective gate for h_i. The idea behind this is that different dimensions of h_i represent different semantics, which correspond to different image patches, and the max operation can capture the most related image regions for each h_i.
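The per-patch gating and max-pooling described above can be sketched as follows. The per-patch gate is assumed to have the same sigmoid form as the textual gate; W, U, and the toy dimensions (3 patches of size 2 instead of 49 patches of size 512) are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def grid_selective_gate(h_i, patches, W, U):
    """Compute one gate per patch, then keep the per-dimension maximum."""
    wh = matvec(W, h_i)
    per_patch_gates = []
    for v_j in patches:
        uv = matvec(U, v_j)
        per_patch_gates.append([sigmoid(a + b) for a, b in zip(wh, uv)])
    # zip(*...) groups the gate values of each dimension across all patches.
    gate = [max(values) for values in zip(*per_patch_gates)]
    h_prime = [g * h for g, h in zip(gate, h_i)]
    return gate, h_prime


# Toy example: 3 patches instead of 49, 2 dimensions instead of 512.
h_i = [1.0, 1.0]
patches = [[0.0, -1.0], [1.0, 0.0], [-1.0, 2.0]]
W = [[0.0, 0.0], [0.0, 0.0]]
U = [[1.0, 0.0], [0.0, 1.0]]
gate, h_prime = grid_selective_gate(h_i, patches, W, U)
```

In this toy run the first dimension takes its gate value from the second patch and the second dimension from the third patch, which mirrors the stated intuition that different dimensions of h_i can be matched to different image regions.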

Object-Level Visual Selective Gate
Grid-level image features correspond to a uniform grid of equally sized and shaped neural receptive fields for the images. In fact, salient image regions, such as object proposals, are a much more natural basis for human attention (Egly et al., 1994). Thus, we compute the selective gate based on object-level image features, where v^o_j ∈ R^2048 (j ∈ [1, 36]) is an object-level image feature.

Multimodal Selective Gates
To avoid missing necessary information which can hardly be well represented by the image, such as abstract concepts, we combine textual and visual selective gates to construct multimodal selective gates.
To jointly select important encoding information using linguistic signals and different levels of visual signals, we propose three kinds of multimodal selective gates, sGate^tg, sGate^tr, and sGate^to, in which r is the overall representation for the source text. Then, the second-level encoding state h′_i is computed with one of the gates sGate ∈ {sGate^g, sGate^r, sGate^o, sGate^tg, sGate^tr, sGate^to}. Next, the decoder generates summaries based on h′.
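One way such a combination could be written, shown for the textual + global-level case, is to add the textual and visual terms inside a single sigmoid. This is only an illustrative assumption: the parameter names are hypothetical, and the way the paper actually fuses the two signals may differ. The grid-level and object-level variants would apply the same idea per patch or per proposal before max-pooling.

```latex
\mathrm{sGate}^{tg}_{i} = \sigma\left(W_{tg} h_i + U_{tg} r + V_{tg} v^{g}\right), \qquad
h'_i = h_i \odot \mathrm{sGate}_{i}
```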

Summary Decoder
To verify the generalization of our model, we apply multimodal selective gates to both the text-based decoder and multimodal-based decoder.

Text-Based Summary Decoder
The decoder of a standard text-only seq2seq model calculates the decoding state s_t at each step. The context vector c_t is computed as a weighted sum of the source annotations. The probability of the next target word y_t is computed using the decoder state s_t and the previously emitted word y_{t−1}, and a loss L_t is defined for each time step t.

Multimodal-Based Summary Decoder
Li et al. (2018a) propose a multimodal-based summarization method. In this work, we adopt their summary decoder, which is initialized with weighted image local features, together with the hierarchical attention mechanism, which can pay different attention to the input words and image patches, producing textual and visual context vectors. The textual context vector c^txt is calculated with Equation 11, and an image context vector c^img_t is computed over the image patches. A second-level modality attention is then applied, where β^txt_t is the attention weight for the textual context and β^img_t is the attention weight for the visual context. The probability of the next target word y_t is computed based on Equations 10 and 13.
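The two attention levels described above can be sketched as follows. The sketch takes attention scores as given (how they are scored against the decoder state is left out), assumes both context vectors share one dimensionality, and normalizes the modality weights with a softmax; that last choice is an assumption, and all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, vectors):
    """Context vector: sum_j weights[j] * vectors[j]."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def hierarchical_context(txt_scores, txt_states, img_scores, img_feats, modality_scores):
    """First-level attention per modality, then second-level modality attention."""
    c_txt = weighted_sum(softmax(txt_scores), txt_states)
    c_img = weighted_sum(softmax(img_scores), img_feats)
    beta_txt, beta_img = softmax(modality_scores)
    c_t = [beta_txt * a + beta_img * b for a, b in zip(c_txt, c_img)]
    return c_txt, c_img, c_t


# Toy example: two source states, one image patch, equal scores everywhere.
c_txt, c_img, c_t = hierarchical_context(
    txt_scores=[0.0, 0.0],
    txt_states=[[1.0, 0.0], [0.0, 1.0]],
    img_scores=[0.0],
    img_feats=[[2.0, 2.0]],
    modality_scores=[0.0, 0.0],
)
```

With equal scores the textual context is the average of the two source states, and the final context is the average of the textual and visual contexts.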

Modality Regularization
Given a source text, we argue that the paired image, especially its foreground, contains more centralized information that should be presented in the summary. To ensure that the summarizer better captures the highlights expressed by the image, we constrain the image to be more related to the generated summary than to the source. To achieve this goal, we design a Modality Regularization (MR) module that scores (image, target) higher than (image, source).
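The ranking objective of this module, L_r = max{0, δ − cos(e^tgt, e^v) + cos(e^src, e^v)} with δ = 2, can be sketched as below. The embeddings are passed in as plain vectors; the projection into the shared space is not modeled, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def modality_regularization_loss(e_src, e_tgt, e_v, delta=2.0):
    """Pairwise ranking loss: push cos(e_tgt, e_v) above cos(e_src, e_v)."""
    return max(0.0, delta - cosine(e_tgt, e_v) + cosine(e_src, e_v))


e_v = [1.0, 0.0]
# Summary aligned with the image, source opposite to it: the loss vanishes.
best_case = modality_regularization_loss(e_src=[-1.0, 0.0], e_tgt=[1.0, 0.0], e_v=e_v)
# Source aligned with the image, summary orthogonal to it: the loss is large.
bad_case = modality_regularization_loss(e_src=[1.0, 0.0], e_tgt=[0.0, 1.0], e_v=e_v)
```

Since cosine similarity lies in [−1, 1], a margin of δ = 2 means the loss reaches zero only in the extreme case shown first, so the regularizer keeps contributing a gradient for essentially every training pair.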
We first learn representations for the source and the summary with a cross-modal attention that emphasizes the words related to the image, yielding c^src and c^tgt. Then we project the source, the target, and the image into a shared space: e^src = tanh(W_h c^src), e^tgt = tanh(W_s c^tgt), e^v = tanh(U_v v^g). A pairwise ranking loss encourages the similarity score of (image, summary) to be higher than that of (image, source): L_r = max{0, δ − cos(e^tgt, e^v) + cos(e^src, e^v)}, where δ = 2 and cos is the cosine similarity function. The final loss function combines this ranking loss with the summarization loss. In addition, inspired by Zhou et al. (2018a), we calculate s_0 with e^src.

Experiments

Dataset
We evaluated our methods on the public multimodal sentence summarization dataset (Li et al., 2018a). Each sample in this dataset is a triplet (sentence, image, summary). The dataset consists of 62,000 training samples, 2,000 test samples and 2,000 validation samples.

Experimental Settings
We set the word embedding size to 300 and the GRU hidden state size to 512. We use the full source and target vocabularies collected from the training data, which have 36,916 source words and 26,168 target words, respectively. The mini-batch size is 64, and the beam search size is 10. The Adam optimizer is applied with a learning rate of 0.0005, momentum parameters β_1 = 0.9 and β_2 = 0.999, and ε = 10^−8. We use dropout (Srivastava et al., 2014) with a probability of 0.2 and gradient clipping (Pascanu et al., 2013) with range [−1, 1]. During training, we test the ROUGE-2 (Lin, 2004) F1-score on the validation set every 5,000 batches, and we halve the learning rate if the score drops for 5 consecutive evaluations.

Table 1: Experimental results of our models with the text-based decoder. "T-Selective", "V-Selective" and "MR" denote Textual Selective, Visual Selective, and Modality Regularization, respectively. Our "T + Grid V-Selective + MR" model performs significantly better than the baselines by the 95% confidence interval in the official ROUGE script. The best result among the different "V-Selective" models is underlined.
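The validation-driven learning-rate schedule from the experimental settings might look like the sketch below. It reads "drops for 5 consecutive" evaluations as each of the last five ROUGE-2 scores being lower than the one before it; that reading, and the function name, are assumptions.

```python
def update_learning_rate(lr, scores, patience=5):
    """Halve lr when each of the last `patience` scores is below its predecessor."""
    if len(scores) <= patience:
        return lr  # not enough evaluations yet to observe `patience` drops
    recent = scores[-(patience + 1):]
    dropped = all(later < earlier for earlier, later in zip(recent, recent[1:]))
    return lr / 2 if dropped else lr


# Validation ROUGE-2 scores (hypothetical values), one entry per 5,000 batches.
falling = [10.0, 9.0, 8.0, 7.0, 6.0, 5.0]
recovering = [10.0, 9.0, 8.0, 7.0, 8.0, 7.0]
lr_after_falling = update_learning_rate(0.0005, falling)
lr_after_recovering = update_learning_rate(0.0005, recovering)
```

A common alternative is to compare each score against the best score seen so far instead of the previous one; the text does not say which variant is used.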

Comparative Methods
Lead. A baseline that uses the first eight words as the summary, according to the reference length.
Compress. Clarke (2008) proposes to compress sentences based on syntactic structure.
ABS. Rush et al. (2015) use an attentive CNN encoder and a neural network language model decoder to summarize the sentence.
SEASS. Zhou et al. (2017) propose a summarization model with textual selective encoding.
Multi-Source. Libovický and Helcl (2017) propose a hierarchical attention model.
Doubly-Attentive. A doubly-attentive mechanism proposed for multimodal machine translation.
Seq2seq. Bahdanau et al. (2015) propose a standard seq2seq model for machine translation.
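The Lead baseline is simple enough to state in a few lines. The sketch below assumes whitespace tokenization, which matches the pre-tokenized example sentence shown for Figure 1; the function name is hypothetical.

```python
def lead_baseline(sentence, k=8):
    """Return the first k whitespace-separated tokens as the summary."""
    return " ".join(sentence.split()[:k])


source = ("british prime minister tony blair has rescued a danish swimmer from the "
          "shark infested sea during his holiday in seychelles in africa , a "
          "government spokesman said friday .")
summary = lead_baseline(source)
print(summary)  # british prime minister tony blair has rescued a
```

On this example the baseline keeps the subject and verb but cuts off before "swimmer" and "sea", the two content words shared with the reference summary.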

Experimental Results
We report ROUGE-1 (RG-1), ROUGE-2 (RG-2), and ROUGE-L (RG-L) F1-scores. We run the experiments eight times with different initializations and report the mean and standard deviation. Table 1 shows the results for text-based decoders with different selective encoding mechanisms and modality regularization (MR). Generally, our models with visual selective gates perform better than the model with the textual selective gate; the global-level feature constructs the most effective visual selective gate, while the object-level feature performs worst. We argue that some background regions, such as "sea" in Figure 1 and "highway" in Figure 3, are also necessary for our task, while object proposals may ignore this background. Recognition errors caused by the Region Proposal Network may also influence the summarization performance. Multimodal selective gates lead to further improvements, and the model with "Textual + Grid-level Visual Selective" achieves the highest ROUGE score. We hypothesize that this is because the textual gate, like the global-level visual selective gate, is a global-level selective gate. Thus, compared with the global-level visual selective gate, the local grid-level visual gate complements the textual gate more. In addition, we test combining all three levels of visual information with textual information to compute the selective gate, but the result is slightly worse than that of "Textual + Grid-level Visual Selective". Table 2 shows comparisons with other work. We can conclude that the Multimodal Selective (Textual + Grid-Level Visual Selective, TGS) and Modality Regularization (MR) strategies exhibit good generalization ability for both the text-based decoder (Seq2seq) and the multimodal-based decoder (MAtt), and the model "MAtt + PGNet + TGSMR" outperforms other baselines.

Table 2: Comparisons with other work. "TGSMR" denotes "Textual + Grid-Level Visual Selective + Modality Regularization". The model "MAtt + PGNet + TGSMR" performs significantly better than other baselines by the 95% confidence interval in the official ROUGE script. "*" marks the results from Li et al. (2018a).

How Reliable Are Our Selective Gates?
In this section, we explore whether the selective gates can correctly discriminate the important words from the input. To this end, we evaluate how many of the words activated by the selective gates are ground-truth keywords (we take the overlapping words, except for stop-words, between the input sentence and the reference summary as the ground-truth keywords). Following Zhou et al. (2017), we calculate the contribution of the selective gates to the final summary. Considering that the average word count of the summaries in the validation set is eight, we take the words with the top-8 contribution values as the keywords activated by the selective mechanisms. Table 3 shows the results for keyword extraction. Our models with selective encoding perform better than the unsupervised TextRank algorithm (Mihalcea and Tarau, 2004) (for which we also take the top-8 scored words as the keywords), and Multimodal Selective shows advantages over Textual Selective. We further train a BiLSTM-CRF model (Huang et al., 2015) using sentence-keyword samples of the multimodal sentence summarization dataset, and the keyword extraction result for BiLSTM-CRF is better than that of our model, indicating a promising prospect for further development of the selective mechanism. In the future, we will explore whether the selective gate can benefit from the supervision signals of a dedicated keyword extractor.
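The keyword evaluation described above can be sketched as follows. The stop-word list, the contribution values, and the choice to report precision, recall, and F1 are assumptions made for illustration; only the definition of ground-truth keywords and the top-k selection come from the text.

```python
def ground_truth_keywords(source, reference, stop_words):
    """Words shared by the source and the reference, excluding stop-words."""
    return (set(source.split()) & set(reference.split())) - set(stop_words)


def top_k_words(words, contributions, k=8):
    """Words with the k largest contribution values."""
    ranked = sorted(zip(words, contributions), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:k]]


def keyword_scores(activated, gold):
    """Precision, recall and F1 of the activated words against the gold keywords."""
    activated = set(activated)
    hits = len(activated & gold)
    if hits == 0:
        return 0.0, 0.0, 0.0
    precision = hits / len(activated)
    recall = hits / len(gold)
    return precision, recall, 2 * precision * recall / (precision + recall)


source = ("british prime minister tony blair has rescued a danish swimmer from the "
          "shark infested sea during his holiday in seychelles in africa , a "
          "government spokesman said friday .")
reference = "uk pm rescues swimmer from sea ."
stop_words = {"from", "the", "a", "in", "his", "has", ",", "."}  # hypothetical list
gold = ground_truth_keywords(source, reference, stop_words)

# Hypothetical contribution values; top-3 is used here only to keep the example short.
words = ["blair", "swimmer", "shark", "sea", "holiday"]
contributions = [0.2, 0.9, 0.4, 0.8, 0.1]
activated = top_k_words(words, contributions, k=3)
precision, recall, f1 = keyword_scores(activated, gold)
```

For this sentence the ground-truth keywords are "swimmer" and "sea", so a gate that ranks those two words highest recovers both and is penalized only for the extra word it activates.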

Human Evaluation
We perform human evaluation with "Seq2seq + TGSMR" to avoid the influence of other modules such as the multimodal decoder (MAtt) and copying (PGNet). We randomly choose 100 samples from the test set, and three postgraduate students annotate whether "Seq2seq + TGSMR" outperforms the "Seq2seq" model with respect to informativeness and readability. The results are shown in Table 4, where, for example, "win" denotes that the result of "Seq2seq + TGSMR" is better than that of the "Seq2seq" model. We can see that the percentage of "win" is larger than that of "lose" (45.33% vs. 13.33% for informativeness, and 45% vs. 13.67% for readability), demonstrating that the improvements of our multimodal summarizer are significant (p-value < 0.01, paired t-test). Kappa (Fleiss, 1971) is used to verify the consistency among annotators.

Table 4: Human evaluation results (%). "Win" denotes that the generated summaries of "Seq2seq + TGSMR" are better than those of the "Seq2seq" model.

Case Study
We show a sample from the test set, with comparisons of the reference summary and the summaries generated by the Seq2seq model, the text-based decoder with "Textual Selective", and "Textual + Grid-level Visual Selective". In Figure 3, the image shows the "highway" where the event occurs, and the "Textual + Grid-level Visual Selective" model successfully spells out the place while the other models fail.
We visualize the gate values in a salience heat map for each source word. We observe that our multimodal selective gates accurately determine the salience of the source words, while the textual selective gates fail, especially for the word "highway". We also show the visual selective gate values of the "Textual + Grid-level Visual Selective" model in Figure 3 (b) for the target word "highway", which demonstrates that our visually grounded selective mechanism successfully captures the relationship between the image and the word that has corresponding visual semantics. Note that we simplify the input sentence to show the heat map.

Conclusion
This paper addresses the multimodal text summarization task, namely, transforming a pair of text and image into a summary. We propose multimodal selective encoding gates that can effectively control the information flow from the encoder to the decoder. We apply our multimodal selective encoding strategies to both text-based and multimodal-based decoders. The experimental results on the public multimodal sentence summarization corpus show that our proposed model significantly outperforms the other comparative methods. Note that our method is not specifically designed for RNN-based models and can be easily applied to Transformer-based models, which we leave for future work.

Acknowledgments
This work is partially supported by National Key R&D Program of China (2020AAA0105200), National Key R&D Program of China (2018YFB2100802), and Beijing Academy of Artificial Intelligence (BAAI).