MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical Attention

This paper presents MAST, a new model for Multimodal Abstractive Text Summarization that utilizes information from all three modalities – text, audio and video – in a multimodal video. Prior work on multimodal abstractive text summarization only utilized information from the text and video modalities. We examine the usefulness and challenges of deriving information from the audio modality and present a sequence-to-sequence trimodal hierarchical attention-based model that overcomes these challenges by letting the model pay more attention to the text modality. MAST outperforms the current state of the art model (video-text) by 2.51 points in terms of Content F1 score and 1.00 points in terms of Rouge-L score on the How2 dataset for multimodal language understanding.


Introduction
In recent years, there has been a dramatic rise in information access through videos, facilitated by a proportional increase in the number of video-sharing platforms. This has led to an enormous amount of information accessible to help with our day-to-day activities. The accompanying transcripts or the automatic speech-to-text transcripts for these videos present the same information in the textual modality. However, all this information is often lengthy and sometimes incomprehensible because of verbosity. These limitations in user experience and information access are addressed by recent advancements in the field of multimodal text summarization.
Multimodal text summarization is the task of condensing this information from the interacting modalities into an output summary. The generated output summary may be unimodal or multimodal (Zhu et al., 2018). The textual summary may, in turn, be extractive or abstractive. The task of extractive multimodal text summarization involves the selection and concatenation of the most important sentences in the input text without altering the sentences or their sequence in any way. Li et al. (2017) made the selection of these important sentences using visual and acoustic cues from the corresponding visual and auditory modalities. On the other hand, the task of abstractive multimodal text summarization involves identifying the theme of the input data and generating words based on a deeper understanding of the material. This is a harder problem, which has been alleviated by advancements in abstractive text summarization techniques (Rush et al., 2015; See et al., 2017; Liu and Lapata, 2019). Sanabria et al. (2018) introduced the How2 dataset for large-scale multimodal language understanding, and Palaskar et al. (2019) were able to produce state of the art results for multimodal abstractive text summarization on this dataset. They utilized a sequence-to-sequence hierarchical attention based technique (Libovickỳ and Helcl, 2017) for combining textual and image features to produce a textual summary from the multimodal input. Moreover, they used speech to generate speech-to-text transcriptions with pre-trained speech recognizers; however, the audio itself did not supplement the other modalities.
Though the previous work in abstractive multimodal text summarization has been promising, it has not yet been able to capture the effects of combining the audio features. Our work improves upon this shortcoming by examining the benefits and challenges of introducing the audio modality as part of our solution. We hypothesize that the audio modality can impart additional useful information for the text summarization task by letting the model pay more attention to words that are spoken with a certain tone or level of emphasis. Through our experiments, we were able to show that not all modalities contribute equally to the output. We found a higher contribution of text, followed by video and then by audio. This formed the motivation for our MAST model, which places higher importance on text input while generating the output summary. MAST is able to produce a more illustrative summary of the original text (see Table 1) and achieves state of the art results.

Table 1: Summaries generated for the same video by different models.

Original text: let's talk now about how to bait a tip up hook with a maggot. typically, you're going to be using this for pan fish. not a real well known or common technique but on a given day it could be the difference between not catching fish and catching fish. all you do, you take your maggot, you can use meal worms, as well, which are much bigger, which are probably more well suited for this because this is a rather large hook. you would just, again, put that hook right through the maggot. with a big hook like this, i would probably put ten of these on it, just line the whole thing. this is going to be more of a technique for pan fish, such as, perch and sunfish, some of your smaller fish but if you had maggots, like this, or a meal worm, or two, on a hook like this, this would be a fantastic setup for trout, as well.

Text only: ice fishing is used for ice fishing. learn about ice fishing bait with tips from an experienced fisherman artist in this free fishing video.

Video-Text: learn about the ice fishing bait in this ice fishing lesson from an experienced fisherman.

MAST: maggots are good for catching perch. learn more about ice fishing bait in this ice fishing lesson from an experienced fisherman.
In summary, our primary contributions are:
• Introduction of the audio modality for abstractive multimodal text summarization.
• Examination of the challenges of utilizing audio information and of its contribution to the generated summary.
• A novel state of the art model, MAST, for the task of multimodal abstractive text summarization.

Methodology
In this section we describe (1) the dataset used, (2) the modalities, and (3) our MAST model's architecture. The code for our model is available online.

Dataset
We use the 300h version of the How2 dataset (Sanabria et al., 2018) of open-domain videos. The dataset consists of about 300 hours of short instructional videos spanning different domains such as cooking, sports, indoor/outdoor activities, music, and more. A human-generated transcript accompanies each video, and a 2 to 3 sentence summary is available for every video, written to generate interest in a potential viewer. The 300h version is used instead of the 2000h version because the audio modality information is only available for the 300h subset. The dataset is divided into the training, validation and test sets. The training set consists of 13,168 videos totaling 298.2 hours. The validation set consists of 150 videos totaling 3.2 hours, and the test set consists of 175 videos totaling 3.7 hours. A more detailed description of the dataset has been given by Sanabria et al. (2018). For our experiments, we took 12,798 videos for the training set, 520 videos for the validation set and 127 videos for the test set.

Modalities
We use the following three inputs corresponding to the three modalities:
• Audio: We concatenate 40-dimensional Kaldi (Povey et al., 2011) filter bank features, extracted from 16 kHz raw audio using a 25 ms time window with a 10 ms frame shift, with the 3-dimensional pitch features extracted from the dataset, obtaining a final sequence of 43-dimensional audio features.
• Text: We use the transcripts corresponding to each video. All text is normalized and lower-cased.
• Video: We use a 2048-dimensional feature vector per group of 16 frames, extracted from the videos using a ResNeXt-101 3D CNN trained to recognize 400 different actions (Hara et al., 2018). This yields a sequence of feature vectors per video.

MAST is a sequence-to-sequence model that uses information from all three modalities: audio, text and video. The modality information is encoded using Modality Encoders, followed by a Trimodal Hierarchical Attention Layer, which combines this information using a three-level hierarchical attention approach. It attends to two pairs of modalities (δ) (Audio-Text and Video-Text), then to the modalities within each pair (β and γ), and then to the individual features within each modality (α). The decoder utilizes this combination of modalities to generate the output distribution over the vocabulary.
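As a concrete reference for the shapes involved, the following sketch builds dummy inputs with the feature dimensions described above. The function name and the sequence lengths are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Hypothetical per-video inputs matching the feature dimensions above.
# Sequence lengths (frames, words, frame groups) are illustrative only.
def make_dummy_inputs(n_audio_frames=3000, n_words=200, n_video_groups=40):
    audio = np.zeros((n_audio_frames, 43))    # 40 filter bank dims + 3 pitch dims
    text = np.zeros(n_words, dtype=np.int64)  # token ids into the vocabulary
    video = np.zeros((n_video_groups, 2048))  # one ResNeXt-101 vector per 16 frames
    return audio, text, video

audio, text, video = make_dummy_inputs()
```

Note that the audio sequence is an order of magnitude longer than the text and video sequences, an imbalance that becomes relevant when attending over the audio modality.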

The MAST architecture consists of three components: the Modality Encoders, the Trimodal Hierarchical Attention Layer and the Trimodal Decoder.

Modality Encoders
The text is embedded with an embedding layer and encoded using a bidirectional GRU encoder. The audio and video features are encoded using bidirectional LSTM encoders. This gives us the individual output encodings corresponding to all modalities at each encoder timestep. The tokens $t_i^{(k)}$ corresponding to modality $k$ are encoded by the corresponding modality encoder, producing a sequence of hidden states $h_i^{(k)}$ for each encoder timestep $i$.

Trimodal Hierarchical Attention Layer
We build upon the hierarchical attention approach proposed by Libovickỳ and Helcl (2017) to combine the modalities. On each decoder timestep $i$, the attention distribution ($\alpha$) and the context vector for the $k$-th modality are first computed independently as:

$$e_{ij}^{(k)} = v_a^\top \tanh\left(W_a s_i + U_a h_j^{(k)} + b_{att}\right)$$

$$\alpha_{ij}^{(k)} = \frac{\exp(e_{ij}^{(k)})}{\sum_{m=1}^{N_k} \exp(e_{im}^{(k)})}$$

$$c_i^{(k)} = \sum_{j=1}^{N_k} \alpha_{ij}^{(k)} h_j^{(k)}$$

where $s_i$ is the decoder hidden state at the $i$-th decoder timestep, $h_j^{(k)}$ is the encoder hidden state at the $j$-th encoder timestep, $N_k$ is the number of encoder timesteps for the $k$-th modality, and $e_{ij}^{(k)}$ is the attention energy corresponding to them. $W_a$ and $U_a$ are trainable projection matrices, $v_a$ is a weight vector and $b_{att}$ is the bias term.
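The per-modality attention step above can be sketched in numpy as follows. This is a minimal illustration with toy dimensions and random arrays in place of trained parameters; all names are assumptions for the sketch:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# First-level (alpha) attention for one modality k at decoder timestep i.
# W_a, U_a, v_a, b_att stand in for the trainable parameters.
def modality_attention(s_i, H_k, W_a, U_a, v_a, b_att):
    # e_ij = v_a^T tanh(W_a s_i + U_a h_j + b_att), one energy per encoder step j
    energies = np.array([v_a @ np.tanh(W_a @ s_i + U_a @ h_j + b_att) for h_j in H_k])
    alpha = softmax(energies)   # attention distribution over encoder timesteps
    context = alpha @ H_k       # weighted sum of encoder hidden states
    return alpha, context

rng = np.random.default_rng(0)
d_dec, d_enc, d_att, N_k = 128, 256, 128, 10
s_i = rng.standard_normal(d_dec)
H_k = rng.standard_normal((N_k, d_enc))
alpha, c = modality_attention(
    s_i, H_k,
    rng.standard_normal((d_att, d_dec)),
    rng.standard_normal((d_att, d_enc)),
    rng.standard_normal(d_att),
    rng.standard_normal(d_att),
)
```

The same computation runs once per modality, producing one context vector $c_i^{(k)}$ per modality at every decoder timestep.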
We now look at two different strategies of combining information from the modalities. The first is a simple extension of the hierarchical attention combination. The second is the strategy used in MAST, which combines modalities using three levels of hierarchical attention.
1. TrimodalH2: To obtain our first baseline model (TrimodalH2), with a 2-level attention hierarchy, the context vectors of all three modalities are combined using a second layer of attention, and its context vector is computed by the hierarchical attention combination of Libovickỳ and Helcl (2017):

$$e_i^{(k)} = v_b^\top \tanh\left(W_b s_i + U_b^{(k)} c_i^{(k)}\right)$$

$$\eta_i^{(k)} = \frac{\exp(e_i^{(k)})}{\sum_{m} \exp(e_i^{(m)})}$$

$$d_i = \sum_{k} \eta_i^{(k)} U_c^{(k)} c_i^{(k)}$$

where $\eta^{(k)}$ is the hierarchical attention distribution over the modalities and $U_b^{(k)}$ and $U_c^{(k)}$ are modality-specific projection matrices.

MAST:
To obtain our MAST model, the context vectors for audio-text and video-text are combined using a second layer of hierarchical attention mechanisms ($\beta$ and $\gamma$) and their context vectors are computed separately. For each pair $l \in \{\text{audio-text}, \text{video-text}\}$, the pairwise context vector is computed analogously to the modality-level attention above (shown here with $\beta$ for audio-text; video-text is symmetric with $\gamma$):

$$e_i^{(k,l)} = v_b^\top \tanh\left(W_b s_i + U_b^{(k)} c_i^{(k)}\right), \quad k \in l$$

$$\beta_i^{(k)} = \frac{\exp(e_i^{(k,l)})}{\sum_{m \in l} \exp(e_i^{(m,l)})}$$

$$d_i^{(l)} = \sum_{k \in l} \beta_i^{(k)} U_c^{(k)} c_i^{(k)}$$

where $d_i^{(l)}$, $l \in \{\text{audio-text}, \text{video-text}\}$, is the context vector obtained for the corresponding pair-wise modality combination.
Finally, these audio-text and video-text context vectors are combined using the third and final attention layer (δ). With this trimodal hierarchical attention architecture, we combine the textual modality twice with the other two modalities in a pair-wise manner, and this allows the model to pay more attention to the textual modality while incorporating the benefits of the other two modalities.
$$c_i^f = \sum_{l} \delta_i^{(l)} U_d^{(l)} d_i^{(l)}$$

where $c_i^f$ is the final context vector at the $i$-th decoder timestep and $\delta_i^{(l)}$ is the third-level attention distribution over the pairwise context vectors.
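The three attention levels can be sketched end to end as follows. For brevity this sketch scores each context vector with a plain dot product against the decoder state instead of the learned projections, which is a simplifying assumption:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Attend over a small set of context vectors with the given query vector.
# Used here for the beta, gamma and delta levels of the hierarchy.
def combine(query, contexts):
    energies = np.array([query @ c for c in contexts])
    w = softmax(energies)
    mixed = sum(wi * ci for wi, ci in zip(w, contexts))
    return w, mixed

rng = np.random.default_rng(1)
s = rng.standard_normal(256)  # decoder hidden state (toy dimension)
c_audio, c_text, c_video = (rng.standard_normal(256) for _ in range(3))

_, d_audio_text = combine(s, [c_audio, c_text])            # beta level
_, d_video_text = combine(s, [c_video, c_text])            # gamma level
delta, c_final = combine(s, [d_audio_text, d_video_text])  # delta level
```

Because the text context vector enters both pairwise combinations, it contributes twice to the final mixture, which is exactly how the architecture biases attention toward the text modality.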

Trimodal Decoder
We use a GRU-based conditional decoder (Firat and Cho, 2016) to generate the final vocabulary distribution at each timestep. At each timestep, the decoder has the aggregate information from all the modalities. The trimodal decoder focuses on the modality combination, followed by the individual modality, then focuses on the particular information inside that modality. Finally, it uses this information along with information from previous timesteps, which is passed on to two linear layers to generate the next word from the vocabulary.
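The final output step can be sketched as below. This assumes the final context vector and decoder hidden state are concatenated before the two linear layers, which is an assumption of the sketch; the dimensions follow the Experiments section and the weights are random placeholders:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# One decoder output step: combine decoder state and final context vector,
# then pass through two linear layers to get a vocabulary distribution.
def output_step(dec_hidden, context, W1, b1, W2, b2):
    x = np.concatenate([dec_hidden, context])
    h = np.tanh(W1 @ x + b1)       # first linear layer + nonlinearity
    return softmax(W2 @ h + b2)    # distribution over the vocabulary

rng = np.random.default_rng(2)
V, H = 49329, 128  # vocabulary size and decoder hidden size from the paper
p = output_step(
    rng.standard_normal(H), rng.standard_normal(256),
    rng.standard_normal((H, H + 256)), rng.standard_normal(H),
    rng.standard_normal((V, H)), rng.standard_normal(V),
)
```

At inference time the next word is chosen from this distribution via beam search rather than greedy argmax.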

Experiments
We train Trimodal Hierarchical Attention (MAST) and TrimodalH2 models on the 300h version of the How2 dataset, using all three modalities. We also train Hierarchical Attention models considering Audio-Text and Video-Text modalities, as well as simple Seq2Seq models with attention for each modality individually as baselines. As observed by Palaskar et al. (2019), the Pointer Generator model (See et al., 2017) does not perform as well as Seq2Seq models on this dataset, hence we do not use that as a baseline in our experiments. We consider another transformer-based baseline for the text modality, BertSumAbs (Liu and Lapata, 2019).
For all our experiments (except for the BertSumAbs baseline), we use the nmtpytorch toolkit (Caglayan et al., 2017). The source and target vocabulary consists of 49,329 words, on which we train our word embeddings. We use the NLL loss and the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.0004, and train the models for 50 epochs. We generate our summaries using beam search with a beam size of 5, and then evaluate them using the ROUGE metric (Lin, 2004) and the Content F1 metric (Palaskar et al., 2019).
In our experiments, the text is embedded with an embedding layer of size 256 and then encoded using a bidirectional GRU encoder with a hidden layer of size 128, which gives us a 256-dimensional output encoding corresponding to the text at each timestep. The audio and video frames are encoded using bidirectional LSTM encoders (Hochreiter and Schmidhuber, 1997) with a hidden layer of size 128, which gives a 256-dimensional output encoding corresponding to the audio and video features at each timestep. Finally, the GRU-based conditional decoder uses a hidden layer of size 128 followed by two linear layers which transform the decoder output to generate the final output vocabulary distribution.
To improve the generalization of our model, we use two dropout layers within the Text Encoder and one dropout layer on the output of the conditional decoder, all with a probability of 0.35. We also use implicit regularization through an early stopping mechanism on the validation loss with a patience of 40 epochs.
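The patience-based early stopping rule can be sketched as follows; `should_stop` is a hypothetical helper for illustration, not part of nmtpytorch:

```python
# Stop training when the validation loss has not improved for
# `patience` consecutive epochs (40 in our experiments).
def should_stop(val_losses, patience=40):
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    return len(val_losses) - 1 - best_epoch >= patience
```

Each entry of `val_losses` is the validation loss after one epoch; the rule compares the current epoch against the epoch with the lowest loss so far.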

Challenges of using audio modality
The first challenge comes with obtaining a good representation of the audio modality that adds value beyond the text modality for the task of text summarization. As found by Mohamed (2014), DNN acoustic models prefer features that change smoothly in both time and frequency, such as the log mel-frequency spectral coefficients (MFSC), over the decorrelated mel-frequency cepstral coefficients (MFCC). MFSC features make it easier for DNNs to discover linear relations as well as higher order causes of the input data, leading to better overall system performance. Hence we do not consider MFCC features in our experiments and use the filter bank features instead.
The second challenge arises due to the larger number of parameters that a model needs when handling the audio information. The number of parameters in the Video-Text baseline is 16.95 million as compared to 32.08 million when we add audio. This is because of the high number of input timesteps in the audio modality encoder, which makes learning trickier and more time-consuming.
To demonstrate these challenges, as an experiment, we group the audio features into bins by averaging roughly 30 consecutive input timesteps and train our MAST model. This makes the number of audio timesteps comparable to the number of video and text timesteps. While we observe an improvement in computational efficiency, this model achieves lower performance than the baseline Video-Text model, as described in Table 2 (MAST-Binned). We also train Audio only and Audio-Text models, which fail to beat the Text only baseline. We observe that the generated summaries of the Audio only model are similar and repetitive, indicating that the model failed to learn useful information relevant to the task of text summarization.
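The binning experiment can be sketched as mean-pooling over consecutive audio frames. This is a simplified illustration under the assumption that plain averaging is used; a trailing partial bin is zero-padded before averaging:

```python
import numpy as np

# Average every `bin_size` consecutive audio frames so the audio sequence
# length becomes comparable to the text and video sequence lengths.
def bin_audio(features, bin_size=30):
    T, d = features.shape
    n_bins = -(-T // bin_size)  # ceiling division
    padded = np.zeros((n_bins * bin_size, d))
    padded[:T] = features       # zero-pad the last partial bin, if any
    return padded.reshape(n_bins, bin_size, d).mean(axis=1)

# 3000 audio frames of 43-dim features collapse to 100 pooled timesteps
binned = bin_audio(np.ones((3000, 43)))
```

This reduces the audio encoder's input length by a factor of 30, at the cost of smearing out the frame-level detail that the full model can attend to.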

Results
Our results are given in Table 2. To demonstrate the contribution of the various modalities towards the output summary, we experiment with the three modalities taken individually as well as in combination. Text only, Video only and Audio only are attention-based S2S models with their respective modality features as encoder inputs. To situate the efficacy of the encoder-decoder architecture for our task, we use BertSumAbs (Liu and Lapata, 2019) as a BERT-based baseline for abstractive text summarization. Audio-Text and Video-Text are S2S models with a hierarchical attention layer. The Video-Text model of Palaskar et al. (2019) is compared on the 300h version instead of the 2000h version of the dataset because the audio modality is only available in the former. The TrimodalH2 model adds the audio modality in the second level of hierarchical attention. The MAST-Binned model groups the features of the audio modality for computational efficiency. These models show alternative methods for utilizing audio modality information.

We evaluate our models with the ROUGE metric (Lin, 2004) and the Content F1 metric (Palaskar et al., 2019). The Content F1 metric is the F1 score of the content words in the summaries based on a monolingual alignment. It is calculated using the METEOR toolkit (Denkowski and Lavie, 2011) by setting zero weight to function words (δ), equal weights to precision and recall (α), and no cross-over penalty (γ) for generated words. Additionally, a set of catchphrases (in, this, free, video, learn, how, tips, expert) which appear in most summaries and act like function words rather than content words are removed from the reference and hypothesis summaries as a post-processing step. The metric ignores the fluency of the output, but gives an estimate of the amount of useful content words the model is able to capture in the output.
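The catchphrase-removal post-processing step can be sketched as below. The word list comes from the text above; the helper name is illustrative:

```python
# Catchphrases that appear in most How2 summaries and behave like
# function words; they are stripped before computing Content F1.
CATCHPHRASES = {"in", "this", "free", "video", "learn", "how", "tips", "expert"}

def strip_catchphrases(summary: str) -> str:
    # Keep only tokens that are not catchphrases, preserving order.
    return " ".join(w for w in summary.split() if w not in CATCHPHRASES)

strip_catchphrases("learn how to fish in this free video")  # → "to fish"
```

After this filtering, only content-bearing words remain to be aligned between the reference and hypothesis summaries.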

Discussion
As observed from the scores for the Text only model, the text modality contains the most information relevant to the final summary, followed by the video and then the audio modality. The scores obtained by combining the audio-text and video-text modalities indicate the same. The transformer-based model, BertSumAbs, fails to perform well because of the smaller amount of text data available to fine-tune the model.
We also observe that combining the text and audio modalities leads to a lower ROUGE score than the Text Only model, which indicates that the plain hierarchical attention model fails to learn well over the audio modality by itself. This observation is in line with the result obtained by the TrimodalH2 model, where we simply extend the hierarchical attention approach to three modalities.

Usefulness of audio modality
The MAST and TrimodalH2 models achieve a higher Content F1 score than the Video-Text baseline, indicating that the model learns to extract more useful content by utilizing information from the audio modality corresponding to the characteristics of speech, in line with our initial hypothesis, as illustrated in Table 1. However, the TrimodalH2 model, which simply adds the audio modality in the second level of hierarchical attention, fails to outperform the Video-Text baseline in terms of ROUGE scores. Our architecture lets the MAST model choose between paying attention to the different pairings of the other modalities with the text modality. This forces the model to pay more attention to the text modality, thereby overcoming the shortcoming of the TrimodalH2 model and achieving better ROUGE scores, while maintaining a similar Content F1 score compared to TrimodalH2.

Attention distribution across modalities
To understand the importance of individual modalities and their combinations, we plot their attention distributions at different levels of the attention hierarchy across the decoder timesteps. Figure 4a corresponds to the attention weights computed at the top level of the hierarchy. Through these visualizations, we observe that the text modality dominates the generation of the output summary, with less attention given to the audio and video modalities (the latter being more important). These findings support the extra importance given to the text modality in the MAST model during its interaction with the other modalities. Figures 4b and 4d highlight the modest gains through the audio modality and the challenge in its appropriate usage.

Performance across video durations
We also look at how our model performs for different video durations in our test set. Figure 3 shows the variation in the Rouge-L scores across different videos for MAST and the Video-Text baseline. The figure shows videos binned into seven groups of 25 seconds by duration. We can observe from the quartile distribution that MAST outperforms the baseline in five of the seven groups, gives similar performance for videos with a duration between 75-100 seconds, and underperforms for videos with a duration between 150-175 seconds. However, looking at the distribution of video durations in our test set (Figure 2), we observe that MAST outperforms the baseline for the vast majority of videos across durations.

Abstractive text summarization
Abstractive summarization of documents was traditionally achieved by paraphrasing and fusing multiple sentences along with their grammatical rewriting (Woodsend and Lapata, 2012). This was later improved by taking inspiration from human comprehension capabilities when Fang and Teufel (2014) implemented the model of human comprehension and summarization proposed by Kintsch and Van Dijk (1978). They did this by identifying these concepts in text through the application of co-reference resolution, named entity recognition and semantic similarity detection, implemented as a two-step competition.
The real stimulus to the field of abstractive summarization was provided by the application of neural encoder-decoder architectures. Rush et al. (2015) were among the first to achieve state-of-the-art results on the Gigaword (Graff et al., 2003) and DUC-2004 (Over et al., 2007) datasets and established the importance of end-to-end deep learning models for abstractive summarization. Their work was later improved upon by See et al. (2017), who used copying from the source text to remove the problem of incorrect generation of facts in the summary, as well as a coverage mechanism to curb the problem of repetition of words in the generated summary.

Pretrained language models
Another breakthrough for the field of natural language processing came with the use of pre-trained language models for various downstream language tasks. Pre-trained language models like BERT (Devlin et al., 2018) introduced masked language modelling, which allowed models to learn interactions between left and right context words. These models have significantly changed the way word embeddings are generated by training contextual embeddings rather than static embeddings. Liu and Lapata (2019) showed how BERT could be used for text summarization and proposed a new fine-tuning schedule for abstractive summarization which adopts different optimizers for the encoder and the decoder to alleviate the mismatch between the two. BERT models typically require large amounts of annotated data to produce state-of-the-art results. Recent works, like GAN-BERT (Croce et al., 2020), focus on solving this problem.

Advancements in speech recognition and computer vision
Parallel advancements in the fields of speech recognition and computer vision have given us successful methods to extract useful features from speech and images. Peddinti et al. (2015) built a robust acoustic model for speech recognition using a time-delay neural network and achieved state-of-the-art results in the IARPA ASpIRE Challenge. Similarly, with the advancements in convolutional neural networks, the field of computer vision has progressed significantly. He et al. (2016) introduced deep residual networks, and 3D CNN architectures such as the ResNeXt-101 model we use have since been trained on large-scale action recognition datasets like Kinetics (Kay et al., 2017).

Summarization beyond text
The advancements in these fields have in turn also facilitated text summarization. Rott and Červa (2016) used only the input audio to generate textual summaries, while Sah et al. (2017) were among the first to show the possibility of summarizing long videos and then annotating the summarized video to obtain a textual summary. These models, however, were not able to capture the information of other modalities to obtain the output textual summary, and their limitations led to the increasing use of multimodal data. A major hindrance in the field of multimodal text summarization was the lack of datasets. Li et al. (2017) created an asynchronous benchmark dataset with human-annotated summaries for 500 videos. Sanabria et al. (2018) introduced the large-scale How2 dataset, which Palaskar et al. (2019) used to present an abstractive summary of open-domain videos. These models, however, are not completely multimodal since they do not utilize the audio information. A major focus of our work is to highlight the importance of using audio data as input and to incorporate it in a truly multimodal manner.

Conclusion
In this work, we presented MAST, a state of the art sequence-to-sequence model that uses information from all three modalities (audio, text and video) to generate abstractive multimodal text summaries. It uses a Trimodal Hierarchical Attention layer to utilize information from all modalities. We explored the role played by adding the audio modality and compared MAST with several baseline models, demonstrating the effectiveness of our approach.
In the future, we would like to extend this work by looking at alternate audio modality representations, including using neural networks for audio feature extraction, and also explore the use of transformers for end-to-end attention-based learning. We also aim to explore the application of MAST to