Multistage Fusion with Forget Gate for Multimodal Summarization in Open-Domain Videos

Multimodal summarization for open-domain videos is an emerging task that aims to generate a summary from multisource information (video, audio, transcript). Despite the success of recent multiencoder-decoder frameworks on this task, existing methods lack fine-grained multimodal interactions between multisource inputs. Moreover, unlike other multimodal tasks, this task has longer multimodal sequences with more redundancy and noise. To address these two issues, we propose a multistage fusion network with a fusion forget gate module, which models fine-grained interactions between the multisource modalities through a multistep fusion schema and controls the flow of redundant information between long multimodal sequences via a forgetting module. Experimental results on the How2 dataset show that our proposed model achieves new state-of-the-art performance. Comprehensive analysis empirically verifies the effectiveness of our fusion schema and forgetting module on multiple encoder-decoder architectures. Notably, when using high-noise ASR transcripts (WER > 30%), our model still achieves performance close to that of the ground-truth transcript model, which reduces manual annotation cost.


Introduction
With the popularity of video platforms, personal videos abound on the Internet. Multimodal summarization for open-domain videos, first organized as a track of the How2 Challenge at the ICML 2019 workshop, aims to integrate the multisource information of videos (video, audio, transcript) into a fluent textual summary; an example can be seen in Figure 1. This study, which uses a compressed text description to reflect the salient parts of videos, is of considerable significance for helping users better retrieve and recommend videos.
Existing approaches have obtained promising results. For example, Libovickỳ et al. (2018) and Palaskar et al. (2019) utilize multiple encoders to encode videos and audio transcripts and a joint decoder to decode the multisource encodings, achieving better performance than single-modality structures. Despite the effectiveness of these approaches, they only perform multimodal fusion during the decoding stage to generate a target sequence, lacking fine-grained interactions between multisource inputs to complete the missing information of each modality. For example, as shown in Figure 1, text context representations containing "birds" should be associated with visual semantic information containing parrots to build thorough multimodal representations.
Besides, unlike other multimodal tasks such as visual question answering (Antol et al., 2015; Gao et al., 2015) and multimodal machine translation (Elliott et al., 2015; Specia et al., 2016), a major challenge is that this task has longer input sequences with more noise and redundancy. The flow of noisy information during multimodal fusion, such as redundant frames in video and noisy words in transcription, interferes with the interaction and complementarity of the effective information between modalities, which has a significant negative effect on the model. Moreover, when an automatic speech recognition (ASR) system is used to transform audio into transcription instead of using ground-truth transcription, the high-noise ASR-output transcripts further reduce model performance.
To address these two issues, we propose a multistage fusion network with a fusion forget gate module for multimodal summarization in videos. The model involves multiple information fusion processes to capture the correlations between multisource modalities spontaneously, and a fusion forget gate is proposed to effectively suppress the flow of unnecessary multimodal noise. As illustrated in Figure 2, our proposed multistage fusion model mainly consists of four modules: 1) multisource encoders that build representations for video and audio (ground-truth or ASR-output transcript); 2) a cross fusion block in which a cross fusion generator (CFG) and a feature-level fusion layer are designed to generate and fuse latent adaptive streams from one modality to another at low levels of granularity; 3) a hierarchical fusion decoder (HFD) in which hierarchical attention networks are designed to progressively fuse multisource features carrying adaptive streams from other modalities to generate a target sequence; 4) a fusion forget gate (FFG) (detailed in Figure 3) in which a memory vector and a forget vector are created for the information streams in the cross fusion block to alleviate interference from long-range redundant multimodal information.

Figure 1: An example taken from the How2 dataset, showing the audio transcript, the video, and the reference summary. The audio transcript does not mention "parrot", only "bird" or "allister"; the complete summary has to be derived from multiple sources.
We build our proposed model on both RNN-based (Sutskever et al., 2014) and transformer-based (Vaswani et al., 2017) encoder-decoder architectures and evaluate our approach on the large-scale public multimodal summarization dataset, How2 (Sanabria et al., 2018). Experiments show that our model achieves new state-of-the-art performance. Comprehensive ablation experiments and visualization analysis demonstrate the effectiveness of our multistage fusion schema and forgetting module.
Additionally, we evaluate model performance with ASR-output transcripts. We use an automatic speech recognition (ASR) system (Google-Speech-V2) to generate audio transcripts (word error rate > 30%) to replace the ground-truth transcripts provided by the How2 dataset. Experiments show that our model still achieves performance close to the model trained with ground-truth transcripts, and significantly outperforms the state-of-the-art system, which indicates the advantage of our model in the absence of ground-truth transcript annotations.
The extracted ASR-output transcripts and code will be released on https://github.com/forkarinda/MFN.
The above abstractive summarization research mainly focuses on text and images. Sanabria et al. (2018) first released the How2 dataset for multimodal abstractive summarization of open-domain videos. The dataset provides multisource information, including video, audio, text transcription and human-generated summaries. This task is more challenging due to the diversity of multimodal information in the video and the complexity of the video feature space. The task was also added to the How2 Challenge at the 2019 ICML workshop, which we focus on in this paper. A similar task is video captioning (Venugopalan et al., 2015a,b), which mainly places emphasis on the use of visual information to generate descriptions, whereas this task focuses on how to make full use of multisource, multimodal long inputs to obtain a summary and additionally needs ground-truth transcripts. Recent methods use multiencoder-decoder RNNs to process multisource inputs but lack the interaction and complementarity between multisource modalities and the ability to resist the flow of multimodal noise. To handle the above two challenges, we introduce our multistage fusion model.

Multistage Fusion with Forget Gate
In this section, we will explain our model in detail. The overall architecture of our proposed model is shown in Figure 2, and the fusion forget gate inside is illustrated in Figure 3. Specifically, multistage fusion consists of the cross fusion block and hierarchical fusion decoder, which aims to model the correlation and complementarity between modalities spontaneously. In addition, the fusion forget gate is applied in the cross fusion block to filter the flow of redundant information streams. We build our model based on the RNN and transformer encoder-decoder architectures, respectively.

Problem Definition
Our multimodal summarization system takes a video and a ground-truth or ASR-output audio transcription as input and generates a textual summary that describes the most salient part of the video. Formally, the transcript is a sequence of word tokens $T = (t_1, ..., t_n)$, and the video representation is denoted by $V = (v_1, ..., v_m)$, where each $v_j$ is a feature vector extracted by a pretrained model. The output summary is denoted as a sequence of word tokens $S = (s_1, ..., s_l)$ consisting of several sentences. The task aims to predict the best summary sequence by finding

$$\hat{S} = \arg\max_{S} p(S \mid T, V; \theta),$$

where $\theta$ is the set of trainable parameters.

Multisource Encoders
Encoding Video. The video encoding features are obtained by a pretrained action recognition model: a ResNeXt-101 3D convolutional neural network (Hara et al., 2018) trained for recognizing 400 different human actions in the Kinetics dataset (Kay et al., 2017).
The video representation features denoted by $V = (v_1, ..., v_m)$ are extracted every 16 nonoverlapping frames, where each $v_j$ is a 2048-dimensional vector.
We add learnable position embeddings for video features.
Encoding Transcript. For the RNN encoder, we use a bidirectional GRU (Cho et al., 2014) to encode the text and obtain a contextualized representation for each word:

$$h_i = \big[\overrightarrow{\mathrm{GRU}}\big(e(t_i), \overrightarrow{h}_{i-1}\big);\ \overleftarrow{\mathrm{GRU}}\big(e(t_i), \overleftarrow{h}_{i+1}\big)\big].$$

For the transformer encoder, we employ a bidirectional transformer encoder (Vaswani et al., 2017) in which each layer is composed of a multihead self-attention sublayer followed by a feedforward sublayer, with residual connections (He et al., 2016) and layer normalization (Ba et al., 2016):

$$\tilde{H}^{(l)} = \mathrm{LN}\big(\mathrm{MultiHead}(H^{(l-1)}) + H^{(l-1)}\big), \quad H^{(l)} = \mathrm{LN}\big(\mathrm{FFN}(\tilde{H}^{(l)}) + \tilde{H}^{(l)}\big).$$

We use learnable position embeddings instead of sinusoidal position embeddings.
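As a concrete, simplified illustration of the bidirectional encoding step, the sketch below runs a plain tanh RNN cell (standing in for the GRU cell, which has extra gating) over a toy sequence in both directions and concatenates the forward and backward states. All shapes, weights, and the `rnn_step` cell are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def rnn_step(h, x, W_h, W_x, b):
    # Simple tanh RNN cell as a stand-in for the GRU cell used in the paper.
    return np.tanh(h @ W_h + x @ W_x + b)

def bidirectional_encode(X, d_h, seed=0):
    """Encode a sequence X of shape (n, d_x) into contextual states (n, 2*d_h)
    by running one recurrent pass per direction and concatenating the states."""
    rng = np.random.default_rng(seed)
    n, d_x = X.shape
    W_h = rng.normal(0, 0.1, (d_h, d_h))
    W_x = rng.normal(0, 0.1, (d_x, d_h))
    b = np.zeros(d_h)
    fwd, bwd = [], []
    h = np.zeros(d_h)
    for i in range(n):               # left-to-right pass
        h = rnn_step(h, X[i], W_h, W_x, b)
        fwd.append(h)
    h = np.zeros(d_h)
    for i in reversed(range(n)):     # right-to-left pass
        h = rnn_step(h, X[i], W_h, W_x, b)
        bwd.append(h)
    bwd.reverse()
    return np.concatenate([np.stack(fwd), np.stack(bwd)], axis=-1)

T = np.random.default_rng(1).normal(size=(5, 8))  # 5 "tokens", 8-dim embeddings
H = bidirectional_encode(T, d_h=16)
assert H.shape == (5, 32)  # each token gets a 2*d_h contextual state
```

Each position's output thus carries context from both directions, which is what the cross fusion generator below consumes as "low-level signals".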

Cross Fusion Generator
The cross fusion generator (CFG) is used to correlate meaningful elements across modalities. We apply the CFG to generate adaptive fusion information from one modality encoding to another. The CFG learns two cross-modal attention maps, one from text to video and the other from video to text. It is inspired by parallel co-attention (Lu et al., 2016), which computes an affinity matrix between two sequences; in contrast, we apply two unidirectional matrices instead of assigning shared parameters to both directions, and use scaled dot-product attention (Vaswani et al., 2017). In each cross-modal attention map, the low-level signals from the source modality are transformed into key and value pairs that interact with the target modality as a query. Following the two maps, the CFG is divided into the video-to-text fusion generator (V2TFG) and the text-to-video fusion generator (T2VFG), which are detailed as follows.

Text-to-Video Fusion Generator (T2VFG). The T2VFG generates the video information most relevant to the low-level text features via a text-to-video cross-modal attention map. The cross-modal attention consists of text queries $Q_T = T W_{Q_T}$ and video key and value pairs $K_T = V W_{K_T}$, $V_T = V$. The contextual video vector derived from the cross-modal attention map is calculated by

$$\tilde{V} = \mathrm{softmax}\left(\frac{Q_T K_T^\top}{\sqrt{d_k}}\right) V_T W_\alpha,$$

where the common spatial parameter $W_\alpha$ is used to simplify the calculations.

Video-to-Text Fusion Generator (V2TFG). Similar to the T2VFG, the V2TFG generates the latent adaptive text information stream for the video modality; the difference between the V2TFG and the T2VFG is that they flow in opposite directions. We transform the low-level video features into queries $Q_V = V W_{Q_V}$ and the text into key and value pairs $K_V = T W_{K_V}$, $V_V = T$, and compute

$$\tilde{T} = \mathrm{softmax}\left(\frac{Q_V K_V^\top}{\sqrt{d_k}}\right) V_V W_\beta,$$

where $W_\beta$ is a mapping of the text flowing to the video.
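A minimal NumPy sketch of one direction of the CFG is shown below: queries come from the target modality, keys/values from the source modality, combined with scaled dot-product attention. The weight `W_out` plays the role of the output projection (the paper's $W_\alpha$ / $W_\beta$); reusing the same weights for both directions here is purely for brevity, whereas the model learns separate, unshared parameters per direction.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_fusion(target, source, W_q, W_k, W_out):
    """One direction of the CFG: queries from the target modality,
    keys/values from the source modality, scaled dot-product attention."""
    Q = target @ W_q                        # (n, d)
    K = source @ W_k                        # (m, d)
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))     # (n, m) cross-modal attention map
    return (A @ source) @ W_out             # adaptive stream, aligned to target length

rng = np.random.default_rng(0)
n, m, d = 6, 10, 16
T = rng.normal(size=(n, d))                 # text encodings
V = rng.normal(size=(m, d))                 # video encodings
W_q, W_k, W_out = (rng.normal(0, 0.1, (d, d)) for _ in range(3))

V_stream = cross_fusion(T, V, W_q, W_k, W_out)  # T2VFG: video info per text token
T_stream = cross_fusion(V, T, W_q, W_k, W_out)  # V2TFG: text info per video frame
assert V_stream.shape == (n, d) and T_stream.shape == (m, d)
```

Note that each adaptive stream has the sequence length of its *target* modality, which is what later allows position-wise gating and concatenation in the FFG and feature-level fusion.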

Fusion Forget Gate
Although the CFG builds an unsupervised low-level signal alignment between the original multisource features, noisy modality information generated by the CFG is hard to suppress. In particular, when an entire modality cannot guide the task at all, the forced normalization of the softmax function in the attention structure makes the fusion vector generated from the noisy modality hard to suppress. For this reason, we propose a fusion forget gate (FFG) to filter the low-level cross-modal adaptation information of each modality generated by the CFG. The FFG reads the original modal signals as well as the adaptation information derived from other modalities, and determines whether the adaptation information is noise and whether it matches the original modality. As shown in Figure 3, we assign a video FFG and a text FFG to receive the bidirectional adaptation information originating from the CFG.
Specifically, the FFG creates a memory vector and a forget gate to control the flow of noisy and mismatched information. Let $\tilde{V}$ and $\tilde{T}$ denote the adaptive video and text streams generated by the CFG. First, we project the concatenated source and target modality embeddings and activate them with a sigmoid function to obtain a forget vector:

$$f_T = \sigma(W_T [T; \tilde{V}] + b_T), \quad f_V = \sigma(W_V [V; \tilde{T}] + b_V).$$

Then the adaptation information passes through a linear mapping to obtain a memory vector, which prevents essential information from being weighted down by the scaling limit of the sigmoid function, whose range is 0 to 1:

$$m_T = W_1 \tilde{V} + b_1, \quad m_V = W_2 \tilde{T} + b_2.$$

We apply the element-wise product of the memory vector and the forget vector to represent the cross-modal adaptive stream after FFG filtering:

$$\tilde{V}_{FFG} = f_T \odot m_T, \quad \tilde{T}_{FFG} = f_V \odot m_V,$$

where $\odot$ denotes element-wise multiplication and $W_V$, $W_T$, $W_1$, $W_2$, $b_V$, $b_T$, $b_1$ and $b_2$ are trainable parameters.
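The gating step can be sketched as follows for one modality: a sigmoid forget vector is computed from the concatenated original encoding and adaptive stream, a linear memory vector is computed from the adaptive stream alone, and the two are multiplied element-wise. Weight names and dimensions are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fusion_forget_gate(orig, adapt, W_f, b_f, W_m, b_m):
    """FFG sketch for one modality: `orig` is the original modality encoding,
    `adapt` the cross-modal adaptive stream from the CFG (same sequence length).
    A forget vector gates a linearly mapped memory vector position by position."""
    f = sigmoid(np.concatenate([orig, adapt], axis=-1) @ W_f + b_f)  # forget vector in (0, 1)
    m = adapt @ W_m + b_m                                            # unbounded memory vector
    return f * m                                                     # element-wise gating

rng = np.random.default_rng(0)
n, d = 6, 16
orig = rng.normal(size=(n, d))     # e.g. text encodings T
adapt = rng.normal(size=(n, d))    # e.g. adaptive video stream for the text modality
W_f = rng.normal(0, 0.1, (2 * d, d)); b_f = np.zeros(d)
W_m = rng.normal(0, 0.1, (d, d));     b_m = np.zeros(d)
filtered = fusion_forget_gate(orig, adapt, W_f, b_f, W_m, b_m)
assert filtered.shape == (n, d)
```

The separate memory projection is the key design point: if the sigmoid output were applied to the raw adaptive stream directly, useful signals would always be scaled down, never amplified.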

Feature-Level Fusion
This module combines the low-level signal $T$/$V$ of the original modality with the matching adaptive stream from the other modality. The fusion vector flowing through the CFG and FFG has the same sequence length as the original modality, so we apply a concat-and-forward layer with a ReLU activation function. In addition, we add a residual connection inside the fusion layer to deepen the network's memory of the original modality:

$$T_F = \mathrm{ReLU}(W_1 [T; \tilde{V}_{FFG}] + b_1) + T, \quad V_F = \mathrm{ReLU}(W_2 [V; \tilde{T}_{FFG}] + b_2) + V,$$

where $\tilde{V}_{FFG}$ and $\tilde{T}_{FFG}$ denote the FFG-filtered adaptive video and text streams, and $W_1$, $W_2$, $b_1$, $b_2$ are trainable parameters.
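A minimal sketch of this layer, assuming illustrative shapes: concatenate the original encoding with the filtered stream, project, apply ReLU, and add the original encoding back as a residual.

```python
import numpy as np

def feature_level_fusion(orig, stream, W, b):
    """Concat-and-forward layer with ReLU and a residual connection back to the
    original modality, as in the feature-level fusion module sketch."""
    z = np.concatenate([orig, stream], axis=-1) @ W + b
    return np.maximum(z, 0.0) + orig   # ReLU, then residual

rng = np.random.default_rng(0)
n, d = 6, 16
T = rng.normal(size=(n, d))            # original text encodings
V_stream = rng.normal(size=(n, d))     # FFG-filtered video-to-text adaptive stream
W = rng.normal(0, 0.1, (2 * d, d)); b = np.zeros(d)
T_F = feature_level_fusion(T, V_stream, W, b)
assert T_F.shape == T.shape            # fused output keeps the original shape
```

Because ReLU outputs are non-negative, the residual path guarantees the original modality signal is never erased, only augmented.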

Hierarchical Fusion Decoder
The HFD receives multimodal information of different granularity from the multisource inputs and generates a target sequence. Inspired by hierarchical attention (Libovickỳ and Helcl, 2017), the HFD transforms the decoder hidden states and multisource encodings into a context vector through three attention maps: video attention, text attention, and attention over multimodal attention (AoMA). At each decoding time step $t$, the decoder hidden state $h_t$ attends separately to the video/text encodings $V_F$/$T_F$, which carry aligned multimodal information, to calculate the video/text context vectors:

$$C_V = \mathrm{Att}(h_t, V_F), \quad C_T = \mathrm{Att}(h_t, T_F).$$

Then, a second attention mechanism is constructed over the two context vectors to compute a higher-level context vector; we concatenate the two contexts and apply a new MLP attention:

$$C_C = \mathrm{AoMA}(h_t, [C_V; C_T]).$$

The context vector of hierarchical multimodal fusion is finally combined with the decoder hidden state to compute an output for attending the next decoder layer or calculating the vocabulary distribution.
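The two-level structure can be sketched as follows. For brevity the sketch uses plain dot-product attention everywhere, whereas the RNN variant of the model uses MLP attention, so this illustrates the hierarchy rather than the exact scoring function.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(h, M):
    """Dot-product attention of a decoder state h (d,) over a memory M (n, d)."""
    a = softmax(M @ h)     # (n,) attention weights
    return a @ M           # (d,) context vector

def hierarchical_context(h, V_F, T_F):
    """Two-level fusion: per-modality context vectors first, then a second
    attention (AoMA) over the two contexts decides how much each modality
    contributes at this decoding step."""
    C_V = attend(h, V_F)             # video context
    C_T = attend(h, T_F)             # text context
    C = np.stack([C_V, C_T])         # (2, d): the two contexts as a tiny memory
    return attend(h, C)              # attention over multimodal attention

rng = np.random.default_rng(0)
d = 16
h = rng.normal(size=d)               # decoder hidden state at step t
V_F = rng.normal(size=(10, d))       # fused video encodings
T_F = rng.normal(size=(6, d))        # fused text encodings
C_c = hierarchical_context(h, V_F, T_F)
assert C_c.shape == (d,)
```

The second attention level is what lets the decoder down-weight an entire modality at a given step, e.g. lean on the transcript when the current frame is uninformative.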
Corresponding to the two encoders introduced in Section 3.2, we design RNN-based and transformer-based decoding strategies. The formulas and model diagrams of the two structures are detailed in Appendix A.1.

How2 Dataset
We evaluate our method on the How2 dataset (Sanabria et al., 2018). The How2 dataset is a large-scale dataset of open-domain videos, spanning different topics, such as cooking, sports, indoor/outdoor activities, and music. It consists of 79,114 how-to instructional videos with an average length of 1.5 minutes and a total of 2,000 hours, accompanied by corresponding ground-truth English transcripts with an average length of 291 words, crowdsourced Portuguese translations of transcripts and user-generated summaries with an average length of 33 words. The statistics are shown in Figure 4 and Table 1.

Audio Recognition
We also extract audio transcripts by a speech recognition system (Google-Speech-V2). The word error rate (WER) of the speech-recognition output on the How2 test data is 32.9%.
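For reference, WER is the word-level Levenshtein distance (substitutions + insertions + deletions) divided by the reference length. A small self-contained implementation, with an illustrative example unrelated to the actual How2 evaluation:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / len(reference),
    computed with a word-level Levenshtein dynamic program."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("parrot" -> "bird") plus one deletion over a 4-word reference:
assert wer("file the parrot nails", "file the bird") == 0.5
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why "WER > 30%" transcripts can still retain most of the reference content.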

Baseline Models
We compare our model with the following baseline models of single or multiple modalities:

S2S (Luong et al., 2015): a standard sequence-to-sequence architecture using an RNN encoder-decoder with a global attention mechanism.
PG (See et al., 2017): a commonly used encoder-decoder summarization model with attention (Bahdanau et al., 2015), which combines copying words from the source document and generating words from a vocabulary.
FT: a strong baseline that applies a transformer-based encoder-decoder model to a flat sequence.
VideoRNN: an RNN-based sequence-to-sequence model that takes only video features as input.

MT: a transformer-based encoder-decoder architecture receiving sequence features of video for end-to-end dense video captioning.

HA (RNN/Transformer) (Palaskar et al., 2019): a multisource sequence-to-sequence model with a hierarchical attention approach to combine textual and visual modalities, which is currently the state-of-the-art method for the multimodal summarization task on the How2 dataset.

Implementation Details
For the RNN-based models, we uniformly use a 2-layer GRU with 128-dimensional word embeddings and 256-dimensional hidden states for each direction. We truncate the maximum text sequence length to 600.
For the transformer-based models, we uniformly use a 4-layer transformer of 512 dimensions with 8 heads. We truncate the maximum text sequence length to 800, and the maximum video sequence length to 1024.
For both architectures, we use the cross-entropy loss and the Adam optimizer (Kingma and Ba, 2015). The initial learning rate is set to 1.5e-4. All trainable parameters are randomly initialized with Kaiming initialization (He et al., 2015). Training of the proposed models is conducted on {1, 2} GeForce RTX 2080 Ti GPUs for 50 epochs with a batch size of {4, 16}. During decoding, we use beam search with a beam size of 6 and a length penalty with α = 1 (Wu et al., 2016).
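The length penalty referenced here is the GNMT one from Wu et al. (2016): beam hypothesis log-probabilities are divided by $lp(Y) = ((5 + |Y|)/6)^\alpha$ so longer candidates are not unfairly penalized by their lower raw log-probabilities. A small sketch of the rescoring step (beam bookkeeping omitted):

```python
def length_penalty(length, alpha=1.0):
    """GNMT-style length penalty (Wu et al., 2016): ((5 + |Y|) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha

def rescore(log_prob, length, alpha=1.0):
    """Normalized beam score: sum of token log-probs divided by the penalty."""
    return log_prob / length_penalty(length, alpha)

# With alpha = 1, a 13-token hypothesis is divided by (5 + 13) / 6 = 3.
assert length_penalty(13) == 3.0
assert rescore(-9.0, 13) == -3.0
```

With alpha = 0 the penalty is 1 for every length (no normalization); alpha = 1, as used here, fully normalizes by the smoothed length.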
For a fair comparison, following Palaskar et al. (2019), all methods take as input the same 2048-dimensional video features extracted from a ResNeXt-101 3D convolutional neural network (Hara et al., 2018); the vocabulary is built from the How2 data, and no pre-trained word embeddings are used.

Model Performance
We adopt multiple automatic metrics to comprehensively evaluate model performance: BLEU-1/2/3/4 (Papineni et al., 2002), ROUGE-1/2/L (Lin, 2004), METEOR (Banerjee and Lavie, 2005) and CIDEr (Vedantam et al., 2015). Table 2 shows the results for the different models on the How2 dataset, and Table 3 shows model performance when automatic transcripts obtained from a speech recognition system are used instead of the ground-truth transcripts provided by the dataset. The results show that our proposed model achieves state-of-the-art performance on every evaluation metric with both the RNN-based and transformer-based models. It can also be seen that the performance of the pure video modality models is modest, because the frozen video features are extracted from a task-independent pretrained model.
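To make the ROUGE-L numbers reported below concrete, here is a toy re-implementation of the metric's core: the longest common subsequence (LCS) between reference and candidate, turned into an F-measure. This simplified version uses beta = 1 and a single reference; the scores in our tables come from a standard evaluation library, which differs in details such as the F-measure weighting.

```python
def lcs(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if a[i - 1] == b[j - 1]
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return dp[len(a)][len(b)]

def rouge_l_f1(reference, candidate):
    """Simplified ROUGE-L F-measure (beta = 1) from LCS recall and precision."""
    ref, cand = reference.split(), candidate.split()
    common = lcs(ref, cand)
    if common == 0:
        return 0.0
    recall, precision = common / len(ref), common / len(cand)
    return 2 * recall * precision / (recall + precision)

assert rouge_l_f1("the cat sat", "the cat sat") == 1.0   # exact match
assert rouge_l_f1("the cat sat", "dog ran") == 0.0       # no overlap
```

Because LCS only requires in-order (not contiguous) matches, ROUGE-L rewards summaries that preserve the reference's word order even when phrasing differs.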
In particular, Table 3 shows that while the performances of all the prior models trained with ASR-output transcripts drop sharply due to the high error rate (WER = 32.9%) of speech recognition, our model still performs close to the models trained with ground-truth transcripts. When using ASR-output transcripts, our framework outperforms HA by 8.3 BLEU-4 points, 7.4 ROUGE-L points, 3.9 METEOR points, and 71.8 CIDEr points on the RNN-based architecture, and by 6.9 BLEU-4 points, 4.9 ROUGE-L points, 2.7 METEOR points, and 42.8 CIDEr points on the transformer-based architecture, which fully shows the effectiveness of our approach.

Table 3: Results on the How2 test set. ASR-output transcripts are used to replace the provided ground-truth transcripts. The down arrow (↓) indicates the performance degradation when using the ASR-output transcript instead of the ground-truth transcript under the same model.

Table 5: Ablation analysis on RNN-based models. ASR-output transcripts are used to replace the provided ground-truth transcripts.

Table 6: Ablation analysis on the How2 test set.

ASR-output Transcript: first thing you have to do is attach the thread to the hook . what do you want to do as security . i suggest . lacrosse and fatherhood . and then wrap . backwards that way . just enough to catch . that . standing . piece of the . trader . therefore raps is usually good . and then . you can just depends on what you doing you can leave it hanging out you can clip it off close there . but you're now . my thread is not good . come loose . some radio star attachment other materials . sometimes you can go in wrap it all the way back . no just make sure you get it on there secure . rabbits back the other way . and then start retiring or if you want to start . start time back here grab it back and keep it back . but the trick is a just make those first couple of laps trap that . then go back this way few times . and then either continue back to the back of the head . turn up to the front . and um . the . gives a good song . foundation to start time

Ground-truth Transcript: alvin dedeux : first thing you have to do is attach the thread to the hook , and what you want to do is secure it . i usually just lay it across in front of the hook and then wrap backwards that way just enough to catch that standing piece of the thread there . three or four wraps is usually good and then you can just , depending on what you 're doing , you can leave it hanging or you could clip it off close there . but now , my thread is not going to come loose so i 'm ready to start attaching my other materials . sometimes you can go ahead and wrap it all the way back , just make sure you got it on there secure , wrap it back the other way and then start your tying . or if you want to start tying back here , you 'd wrap it back here and keep it back here . but the trick is to just make those first couple of wraps , trap that thread and then go back this way a few times . and then either continue back to the back of the hook or up to the front . and that gives you a good solid foundation to start tying your fly .

Summary: watch and learn how to tie thread to a hook to help with fly tying as explained by out expert in this free how-to video on fly tying tips and techniques .

Figure 5: An example taken from the How2 test set. For the extracted ASR-output transcripts, we use the period "." as the separator of the automatically segmented audio clips.

Ablations
The purpose of this study is to examine the role of the proposed multistage fusion and fusion forget gate (FFG). We divide the fusion process into transcript-to-video fusion (T2VF) and video-to-transcript fusion (V2TF) in the cross fusion block, the following FFG, and the final HFD, and retrain our approach with one or more of them ablated.
• We retrain only T2VF and only V2TF and replace HFD with a standard decoder to handle single-source multimodal encodings.
• We add the FFG to the above T2VF and V2TF models separately.
• We retain T2VF, V2TF and HFD, and remove all the FFGs of the full model.

Table 4 lists the results on the How2 dataset. We can observe the following: 1) except that V2TF's performance is weaker than the single-text modality on RNN {1a}, the performances of all the V2TF and T2VF models {3a, 1b, 3b} exceed those of the single-modality models; 2) compared with using only V2TF or T2VF, using V2TF and T2VF together with the HFD {5a, 5b} further improves the model; 3) when the FFG is added, the performances of all the fusion structures improve, which is particularly evident in the RNN-based models; 4) one-way fusion structures with the FFG {2a, 4a, 2b, 4b} alone achieve comparable and even better performance than HA. These results demonstrate the effectiveness of the multistage fusion and the inner FFG. Table 5 lists the results of using the ASR-output transcript instead of the provided ground-truth transcript. The observations are similar to those from Table 4. In particular, we see a greater performance gain from the FFG when using the high-noise ASR-output transcript than when using the ground-truth transcript, which further verifies the ability of the FFG to resist the flow of multimodal noise.
Additionally, we also evaluate 1) the effect of model depth and 2) the effect of FFG on HFD. We deepen the model depth, and apply FFG to the multimodal context representation generated by the AoMA in HFD. The results in Table 6 indicate that the two measures do not improve model performance.

Table 7: Generated results for the example in Figure 5, with ROUGE-L (R-L) scores.

Modality | Method | R-L | Output
--- | Reference | --- | watch and learn how to tie thread to a hook to help with fly tying as explained by out expert in this free how-to video on fly tying tips and techniques .
Ground-truth transcript | FT | 0.543 | learn about attaching the thread in fly tying and other fly fishing tips in this free how-to video on fly tying tips and techniques .
Video | MT | 0.468 | learn how to attach a backing tail to fly fishing backing in this free how-to video on fly tying and techniques .
Ground-truth transcript+Video | HA (RNN) | 0.557 | learn from our expert how to attach a hook to fly tying in this free how-to video on fly tying tips and techniques .
Ground-truth transcript+Video | HA (Trm) | 0.559 | learn about using a bobbin in fly tying from our expert in this free how-to video on techniques for and making fly tying nymphs .
Ground-truth transcript+Video | Proposed (RNN) | 0.582 | watch and learn from an expert how to attach the thread to fly tying in this free how-to video on fly tying tips and techniques .
Ground-truth transcript+Video | Proposed (Trm) | 0.574 | learn some great tips on attaching the thread to the fly fishing in this free how-to video on fly tying tips and techniques .
ASR-output transcript+Video | HA (RNN) | 0.487 | tying a knot for fly fishing is easy with these tips , get expert advice on woodworking in this free video .
ASR-output transcript+Video | HA (Trm) | 0.501 | tying a knot onto a knot , make sure the snap is secure and connected to the hoop knot . attach a french braid to a knot with tips from an experienced handyman in this free video on fly tying .
ASR-output transcript+Video | Proposed (RNN) | 0.561 | watch and learn from our expert on fly fishing tips in this free how-to video on fly tying tips and techniques .
ASR-output transcript+Video | Proposed (Trm) | 0.550 | learn how to use a wrapped knot to wrap a fly fishing knot in this free how-to video on fly tying tips and techniques .

Qualitative Analysis
We provide example outputs from the trained models. The example is taken from the How2 test set; its ground-truth transcript and extracted ASR-output transcript are shown in Figure 5, and Table 7 lists the generated results. We can observe that: 1) compared to the single-modality models, the multimodality models generate more accurate and fluent content; 2) when using the ground-truth transcript, both HA and our proposed model generate accurate and fluent summaries; 3) when using ASR-output transcripts, our proposed model still generates a relatively accurate summary, while the content generated by HA is not accurate enough, which intuitively illustrates the advantage of our model in the absence of ground-truth transcripts.
To better understand what our model has learned, we take the sample shown in Figure 1 to visualize the FFG and the cross-attention in the CFG. We sum the FFG weights and use the color depth of each word to represent how strongly the FFG controls the flow of video information to that word, and we illustrate the interaction between video and text by displaying the video frame with the highest transcript-to-video attention when generating the adaptive video streams. As shown in Figure 6, in the input segment we can observe the following: 1) for words related to the summary such as "file" and "nails", the FFG retains the video stream, whereas for phrases such as "once again", the FFG forgets most of the video information; 2) for the words that the FFG remembers strongly, the corresponding video frame is correlated with them; for example, "file allister's nails" points to a close-up of manicuring the parrot's nails.

Conclusions
We introduce a multistage fusion network with a fusion forget gate for generating text summaries for open-domain videos. We propose a multistep fusion schema to model fine-grained interactions between multisource modalities and a fusion forget gate module to handle the flow of multimodal noise in multisource long sequences. Experiments on the How2 dataset show the effectiveness of the proposed models. Furthermore, when using high-noise speech recognition transcription, our model still achieves performance close to that of the ground-truth transcription model, which reduces the manual annotation cost of transcripts.

A.1 Hierarchical Fusion Decoder Details

RNN-based HFD. At each decoding time step $t$, the decoder reads the embedding of the previous output token to compute a new hidden state $h_t$, which is defined as:

$$h_t = \mathrm{GRU}(h_{t-1}, e(s_{t-1})).$$

The context vectors of each modality are first calculated by (we adopt an MLP attention for the RNN-based methods):

$$C_V = \mathrm{MLPAtt}(h_t, V_F), \quad C_T = \mathrm{MLPAtt}(h_t, T_F).$$

Then the second attention, AoMA, over the video context vector $C_V$ and the text context vector $C_T$ is implemented as:

$$C_C = \mathrm{MLPAtt}(h_t, [C_V; C_T]).$$

The context vector $C_C$ of multimodal fusion and the decoder state $h_t$ are merged to obtain the output state $y_{t+1}$:

$$y_{t+1} = W \tanh(W_3 [C_C; h_t] + b),$$

where $W_1$ and $W_2$ parameterize the MLP attention and $W_3$, $W$ and $b$ are trainable parameters.
Transformer-based HFD. The transformer-based HFD follows a similar strategy to the RNN-based one; we mainly introduce how it absorbs multimodal information. It first passes the target token embeddings $x_t$ through the masked multi-head self-attention and a residual connection to obtain the hidden state vector $h_t$:

$$h_t = \mathrm{LN}\big(\mathrm{SelfAtt}(x_t) + x_t\big).$$

Then $h_t$ is transformed into a query that separately attends to the key and value pairs mapped from the encodings of each modality via multi-head encoder-decoder attention:

$$C_V = \mathrm{MultiHead}(h_t, V_F, V_F), \quad C_T = \mathrm{MultiHead}(h_t, T_F, T_F).$$

Similarly, the generated multimodal context vectors are fused by AoMA:

$$C_c = \mathrm{AoMA}(h_t, [C_V; C_T]).$$

The final output state passes through the feedforward and add-&-norm layers as in the standard transformer:

$$y_{t+1} = W_2\,\mathrm{ReLU}\big(W_1 (C_c + h_t) + b_1\big) + b_2 + C_c + h_t, \qquad (26)$$

where $W_1$, $W_2$, $b_1$ and $b_2$ are trainable parameters.
In addition, we use the ROUGE evaluation library https://github.com/neural-dialogue-metrics/rouge, which supports the evaluation of the ROUGE series of metrics (ROUGE-N, ROUGE-L and ROUGE-W).