Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning

A major challenge for video captioning is to combine audio and visual cues. Existing multi-modal fusion methods have shown encouraging results in video understanding. However, the temporal structures of multiple modalities at different granularities are rarely explored, and how to selectively fuse the multi-modal representations at different levels of detail remains uncharted. In this paper, we propose a novel hierarchically aligned cross-modal attention (HACA) framework to learn and selectively fuse both global and local temporal dynamics of different modalities. Furthermore, for the first time, we validate the superior performance of deep audio features on the video captioning task. Finally, our HACA model significantly outperforms the previous best systems and achieves new state-of-the-art results on the widely used MSR-VTT dataset.


Introduction
Video captioning, the task of automatically generating a natural-language description of a video, is a crucial challenge in both the NLP and vision communities. In addition to visual features, audio features can also play a key role in video captioning. Figure 1 shows an example where the captioning system makes a mistake when analyzing only visual features. In this example, it could be very hard even for a human to correctly determine whether the girl is singing or talking by only watching without listening. Thus, to describe the video content accurately, a good understanding of the audio signature is a must.
Figure 1: A video captioning example. Ground truth: "A girl sings to a song." Video only: "A woman is talking in a room." Video + audio: "A girl is singing a song."

Several multi-modal fusion techniques have been applied to video understanding (e.g., Yang et al., 2017), but these techniques do not learn the cross-modal attention and thus fail to selectively attend to a certain modality when producing the descriptions.
Another issue is that little effort has been devoted to utilizing the temporal transitions of different modalities at varying granularities of analysis. The temporal structures of a video are inherently layered, since a video usually contains temporally sequential activities (e.g., a video where a person reads a book, then throws it on the table; next, he pours a glass of milk and drinks it). There are strong temporal dependencies among those activities. Meanwhile, understanding each of them requires understanding many action components (e.g., pouring a glass of milk is a complicated action sequence). Therefore, we hypothesize that it is beneficial to learn and align both the high-level (global) and low-level (local) temporal transitions of multiple modalities.
Moreover, prior work only employed hand-crafted audio features (e.g., MFCC) for video captioning (Ramanishka et al., 2016; Xu et al., 2017; Hori et al., 2017). While deep audio features have shown superior performance on some audio processing tasks like audio event classification, their use in video captioning needs to be validated.

Figure 2: Overview of our HACA framework. Note that in the encoding stage, for the sake of simplicity, the step size of the high-level LSTM in both hierarchical attentive encoders is 2 here, but in practice it is usually set much longer. In the decoding stage, we only show the computations at time step t (the decoders behave the same at other time steps).
In this paper, we propose a novel hierarchically aligned cross-modal attentive network (HACA) to learn and align both global and local contexts among different modalities of the video. The goal is to overcome the issues mentioned above and generate better descriptions of the input videos. Our contributions are fourfold: (1) we invent a hierarchical encoder-decoder network to adaptively learn the attentive representations of multiple modalities, including visual attention, audio attention, and decoder attention; (2) our proposed model is capable of aligning and fusing both the global and local contexts of different modalities for video understanding and sentence generation; (3) we are the first to utilize deep audio features for video captioning and empirically demonstrate their effectiveness over hand-crafted MFCC features; and (4) we achieve the new state of the art on the MSR-VTT dataset.
Among the network architectures for video captioning (Yao et al., 2015; Venugopalan et al., 2015b), sequence-to-sequence models (Venugopalan et al., 2015a) have shown promising results. Pan et al. (2016) introduced a hierarchical recurrent encoder to capture temporal visual features at different levels. Yu et al. (2016) proposed a hierarchical decoder for paragraph generation, and most recently Wang et al. (2018) invented a hierarchical reinforced framework to generate the caption phrase by phrase. But none tried to model and align the global and local contexts of different modalities as we do. Our HACA model not only learns the representations of different modalities at different granularities, but also aligns and dynamically fuses them both globally and locally with hierarchically aligned cross-modal attentions.

Proposed Model
Our HACA model is an encoder-decoder framework comprising multiple hierarchical recurrent neural networks (see Figure 2). Specifically, in the encoding stage, the model has one hierarchical attentive encoder for each input modality, which learns and outputs both the local and global representations of the modality. (In this paper, visual and audio features are used as the input and hence there are two hierarchical attentive encoders as shown in Figure 2; it should be noted, however, that the model seamlessly extends to more than two input modalities.) In the decoding stage, we employ two cross-modal attentive decoders: the local decoder and the global decoder. The global decoder attempts to align the global contexts of different modalities and learn the global cross-modal fusion context. Correspondingly, the local decoder learns a local cross-modal fusion context, combines it with the output from the global decoder, and predicts the next word.

Feature Extractors
To exploit visual and audio cues, we use pretrained convolutional neural network (CNN) models to extract deep visual features and deep audio features, respectively. More specifically, we utilize the ResNet model for image classification (He et al., 2016) and the VGGish model for audio classification.

Attention Mechanism
For a better understanding of the following sections, we first introduce the soft attention mechanism. Given a feature sequence $(x_1, x_2, ..., x_n)$ and a running recurrent neural network (RNN), the context vector $c_t$ at time step $t$ is computed as a weighted sum over the sequence:

$c_t = \sum_{k=1}^{n} \alpha_{tk} x_k$

These attention weights $\{\alpha_{tk}\}$ can be learned by the attention mechanism proposed in (Bahdanau et al., 2014), which gives higher weights to certain features that allow better prediction of the system's internal state.
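As a concrete illustration, the following is a minimal NumPy sketch of this soft attention in its additive (Bahdanau-style) scoring form; the weight names W, U, and v are our own, not from the paper:

```python
import numpy as np

def soft_attention(h_t, features, W, U, v):
    """Additive soft attention (illustrative sketch).

    h_t:      current RNN hidden state, shape (d_h,)
    features: sequence (x_1, ..., x_n), shape (n, d_x)
    W, U, v:  hypothetical scoring parameters
    Returns the context vector c_t = sum_k alpha_tk * x_k and the weights.
    """
    # Unnormalized scores e_tk = v^T tanh(W h_t + U x_k)
    scores = np.tanh(features @ U.T + h_t @ W.T) @ v   # shape (n,)
    # Softmax over the sequence yields the attention weights alpha_tk
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    # Context vector: weighted sum of the input features
    return alpha @ features, alpha
```

The weights are non-negative and sum to one, so the context vector is a convex combination of the input features.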

Hierarchical Attentive Encoder
Inspired by Pan et al. (2016), the hierarchical attentive encoder consists of two LSTMs. The input to the low-level LSTM is a sequence of temporal features $\{f_i\}$, $i \in \{1, ..., n\}$:

$o^{e_L}_i, h^{e_L}_i = e_L(f_i, h^{e_L}_{i-1})$

where $e_L$ is the low-level encoder LSTM, whose output and hidden state at step $i$ are $o^{e_L}_i$ and $h^{e_L}_i$ respectively. As shown in Figure 2, unlike a stacked two-layer LSTM, the high-level LSTM here operates at a lower temporal resolution and runs one step every $s$ time steps. Thus it learns the temporal transitions of the segmented feature chunks of size $s$. Furthermore, an attention mechanism is employed at the connection between these two LSTMs. It learns the context vector of the low-level LSTM's outputs over the current feature chunk, which is then taken as the input to the high-level LSTM at step $j$:

$c_j = \sum_{i=(j-1)s+1}^{js} \alpha_{ji}\, o^{e_L}_i, \qquad o^{e_H}_j, h^{e_H}_j = e_H(c_j, h^{e_H}_{j-1})$

where $e_H$ denotes the high-level LSTM whose output and hidden state at step $j$ are $o^{e_H}_j$ and $h^{e_H}_j$. Since we utilize both visual and audio features, there are two hierarchical attentive encoders ($v$ for visual features and $a$ for audio features). Hence four sets of representations are learned in the encoding stage: high-level and low-level visual feature sequences ($\{o^{v_H}_j\}$ and $\{o^{v_L}_i\}$), and high-level and low-level audio feature sequences ($\{o^{a_H}_j\}$ and $\{o^{a_L}_i\}$).
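To make the chunked two-level computation concrete, here is a toy NumPy sketch. A plain tanh RNN cell stands in for the paper's LSTMs, and mean-pooling stands in for the chunk-level attention; all parameter names are illustrative:

```python
import numpy as np

def rnn_step(x, h, Wx, Wh):
    # A plain tanh RNN cell stands in for the paper's LSTMs (sketch only).
    return np.tanh(x @ Wx + h @ Wh)

def hierarchical_encoder(features, s, d, rng):
    """Two-level encoder sketch: the low-level RNN runs at every time step,
    while the high-level RNN ticks once per chunk of s steps, consuming a
    summary of that chunk's low-level outputs (mean-pooling here replaces
    the attention used in the paper)."""
    n, d_in = features.shape
    Wx_l, Wh_l = rng.standard_normal((d_in, d)), rng.standard_normal((d, d))
    Wx_h, Wh_h = rng.standard_normal((d, d)), rng.standard_normal((d, d))
    h_l, h_h = np.zeros(d), np.zeros(d)
    low_outputs, high_outputs, chunk = [], [], []
    for i in range(n):
        h_l = rnn_step(features[i], h_l, Wx_l, Wh_l)
        low_outputs.append(h_l)
        chunk.append(h_l)
        if (i + 1) % s == 0:              # high-level RNN runs every s steps
            c_j = np.mean(chunk, axis=0)  # chunk summary (attention in the paper)
            h_h = rnn_step(c_j, h_h, Wx_h, Wh_h)
            high_outputs.append(h_h)
            chunk = []
    return np.array(low_outputs), np.array(high_outputs)
```

For n input steps and chunk size s, the low-level sequence has n outputs while the high-level sequence has n/s, matching the coarser temporal resolution described above.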

Globally and Locally Aligned Cross-modal Attentive Decoder

In the decoding stage, the representations of different modalities at the same granularity are aligned separately with individual attentive decoders. That is, one decoder is employed to align the high-level features and learn a high-level (global) cross-modal embedding. Since the high-level features capture the temporal transitions of larger chunks and focus on long-range contexts, we call the corresponding decoder the global decoder ($d_G$). Similarly, the companion local decoder ($d_L$) is used to align the low-level (local) features that attend to fine-grained and local dynamics. At each time step $t$, the attentive decoders learn the corresponding visual and audio contexts using the attention mechanism (see Figure 2). In addition, our attentive decoders also compute attention over their own previous hidden states and learn aligned decoder contexts $c^{d_L}_t$ and $c^{d_G}_t$:

$c^{d}_t = \sum_{k=1}^{t-1} \alpha_{tk} h^{d}_k, \qquad d \in \{d_L, d_G\}$

Paulus et al. (2017) also show that decoder attention can mitigate the phrase repetition issue. Each decoder is equipped with a cross-modal attention, which learns the attention over the contexts of different modalities. The cross-modal attention module selectively attends to different modalities and outputs a fusion context $c^f_t$:

$c^f_t = \beta_{tv} W_v c^v_t + \beta_{ta} W_a c^a_t + \beta_{td} W_d c^d_t$

where $c^v_t$, $c^a_t$, and $c^d_t$ are the visual, audio, and decoder contexts at step $t$ respectively; $W_v$, $W_a$, and $W_d$ are learnable matrices; and $\beta_{tv}$, $\beta_{ta}$, and $\beta_{td}$ can be learned in a similar manner to the attention mechanism in Section 2.2.
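A minimal NumPy sketch of this cross-modal fusion follows; the scoring vector `w_score` is our own simplification of how the modality weights beta might be computed, and all parameter names are hypothetical:

```python
import numpy as np

def cross_modal_fusion(c_v, c_a, c_d, W_v, W_a, W_d, w_score):
    """Cross-modal attention fusion (illustrative sketch).

    c_v, c_a, c_d: visual, audio, and decoder contexts at one time step
    W_v, W_a, W_d: learnable projections into a shared space
    w_score:       hypothetical scoring vector producing the beta weights
    """
    # Project each modality context into a common dimension
    projected = np.stack([c_v @ W_v, c_a @ W_a, c_d @ W_d])   # (3, d)
    # Score each modality and softmax to get beta_tv, beta_ta, beta_td
    scores = projected @ w_score                               # (3,)
    beta = np.exp(scores - scores.max())
    beta /= beta.sum()
    # c_f = beta_v W_v c_v + beta_a W_a c_a + beta_d W_d c_d
    return beta @ projected, beta
```

Because the beta weights form a softmax distribution over the three modalities, the decoder can lean on audio, vision, or its own language context as appropriate for the current word.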
The global decoder $d_G$ directly takes as input the concatenation of the global fusion context $c^{f_G}_t$ and the embedding of the word $w_{t-1}$ generated at the previous time step:

$o^{d_G}_t, h^{d_G}_t = d_G([c^{f_G}_t; e(w_{t-1})], h^{d_G}_{t-1})$

The global decoder's output $o^{d_G}_t$ is a latent embedding which represents the aligned global temporal transitions of multiple modalities. In contrast, the local decoder $d_L$ receives the latent embedding $o^{d_G}_t$, mixes it with the local fusion context $c^{f_L}_t$, and then learns a uniform representation $o^{d_L}_t$ to predict the next word:

$o^{d_L}_t, h^{d_L}_t = d_L([o^{d_G}_t; c^{f_L}_t], h^{d_L}_{t-1})$
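One decoding step of this two-level scheme can be sketched in NumPy as follows; a plain tanh cell stands in for the two decoder LSTMs, and every name in `params` is our own invention for illustration:

```python
import numpy as np

def rnn_step(x, h, Wx, Wh):
    # A toy tanh cell standing in for the decoder LSTMs (sketch only).
    return np.tanh(x @ Wx + h @ Wh)

def decode_step(c_fG, w_prev_emb, h_G, c_fL, h_L, params):
    """One step of the global+local decoder (illustrative sketch).

    c_fG:       global fusion context at this step
    w_prev_emb: embedding of the previously generated word
    c_fL:       local fusion context at this step
    h_G, h_L:   previous hidden states of the global/local decoders
    """
    # Global decoder: input is [global fusion context; previous word embedding]
    h_G = rnn_step(np.concatenate([c_fG, w_prev_emb]), h_G,
                   params["Wx_G"], params["Wh_G"])
    # Local decoder: input is [global decoder output; local fusion context]
    h_L = rnn_step(np.concatenate([h_G, c_fL]), h_L,
                   params["Wx_L"], params["Wh_L"])
    return h_G, h_L   # h_L feeds the word-prediction softmax
```

Note how the global output is injected into the local decoder at every step, so local word choices are always conditioned on the long-range cross-modal context.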

Cross-Entropy Loss Function
The probability distribution of the next word is

$P(w_t \mid w_{1:t-1}) = \mathrm{softmax}(W_p\, o^{d_L}_t)$

where $W_p$ is the projection matrix and $w_{1:t-1}$ is the generated word sequence before step $t$. Let $\theta$ be the model parameters and $w^*_{1:T}$ be the ground-truth word sequence; then the cross-entropy loss is

$L(\theta) = -\sum_{t=1}^{T} \log P(w^*_t \mid w^*_{1:t-1}; \theta)$
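This loss is the standard sequence cross-entropy; a small NumPy sketch (names are ours) makes the computation explicit:

```python
import numpy as np

def caption_xent_loss(logits, target_ids):
    """Sequence cross-entropy loss (illustrative sketch).

    logits:     (T, V) pre-softmax scores W_p o_t at each decoding step
    target_ids: (T,) indices of the ground-truth words w*_1..w*_T
    """
    # Numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # L = -sum_t log P(w*_t | w*_{1:t-1})
    return -log_probs[np.arange(len(target_ids)), target_ids].sum()
```

For uniform logits the loss is exactly T log V, a handy sanity check when wiring up a decoder.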

Experimental Setup
Dataset and Preprocessing We evaluate our model on the MSR-VTT dataset, which contains 10,000 video clips (6,513 for training, 497 for validation, and the remaining 2,990 for testing). Each video has 20 human-annotated reference captions collected via Amazon Mechanical Turk. To extract the visual features, the pretrained ResNet model (He et al., 2016) is applied to video frames sampled at 3 fps. For the audio features, we process the raw WAV files using the pretrained VGGish model as suggested in [1].

Evaluation Metrics
We adopt four diverse automatic evaluation metrics: BLEU, METEOR, ROUGE-L, and CIDEr-D, which are computed using the standard evaluation code from the MS-COCO server (Chen et al., 2015).  Training Details All the hyperparameters are tuned on the validation set. The maximum number of frames is 50, and the maximum number of audio segments is 20. For the visual hierarchical attentive encoder (HAE), the low-level encoder is a bidirectional LSTM with hidden dimension 512 (128 for the audio HAE), and the high-level encoder is an LSTM with hidden dimension 256 (64 for the audio HAE), whose chunk size s is 10 (4 for the audio HAE). The global decoder is an LSTM with hidden dimension 256 and the local decoder is an LSTM with hidden dimension 1024. The maximum step size of the decoders is 16. We use word embeddings of size 512. Moreover, we adopt Dropout (Srivastava et al., 2014) with rate 0.5 for regularization. The gradients are clipped to the range [-10, 10]. We initialize all the parameters with a uniform distribution in the range [-0.08, 0.08]. The Adadelta optimizer (Zeiler, 2012) is used with batch size 64. The learning rate is initially set to 1 and then reduced by a factor of 0.5 when the current CIDEr score does not surpass the previous best for 4 epochs. The maximum number of epochs is set to 50, and the training data is shuffled at each epoch. Scheduled sampling (Bengio et al., 2015) is employed to train the models. Beam search of size 5 is used during test-time inference.
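The learning-rate rule above (halve when validation CIDEr stalls for 4 epochs) can be replayed with a few lines of Python; this is only our reading of the stated schedule, not the authors' code:

```python
def lr_schedule(cider_history, lr0=1.0, factor=0.5, patience=4):
    """Sketch of the stated schedule: start at lr0 and multiply the
    learning rate by `factor` whenever the validation CIDEr has not
    beaten the best score for `patience` consecutive epochs."""
    lr, best, wait, lrs = lr0, float("-inf"), 0, []
    for cider in cider_history:
        if cider > best:
            best, wait = cider, 0     # new best: reset the patience counter
        else:
            wait += 1
            if wait >= patience:      # stalled for `patience` epochs: decay
                lr *= factor
                wait = 0
        lrs.append(lr)
    return lrs
```

For example, a validation CIDEr trace that improves for three epochs and then plateaus triggers the first halving four epochs after the plateau begins.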

Comparison with State Of The Arts
In Table 1, we first list the top-3 results from the MSR-VTT Challenge 2017: v2t navigator, Aalto (Shetty and Laaksonen, 2016), and VideoLAB (Ramanishka et al., 2016). We then compare with the state-of-the-art methods on the MSR-VTT dataset: CIDEnt-RL (Pasunuru and Bansal, 2017), Dense-Cap (Shen et al., 2017), and HRL (Wang et al., 2018). Our HACA model significantly outperforms all the previous methods and achieves a new state of the art on BLEU-4, METEOR, and ROUGE-L scores. Notably, we improve the BLEU-4 score from 41.4 to 43.1. Our CIDEr score is the second best, lower only than that of CIDEnt-RL, which directly optimizes the CIDEr score during training with reinforcement learning. Note that all the results of our HACA method reported here are obtained by supervised learning only.

Result Analysis
We also evaluate several baselines to validate the effectiveness of the components in our HACA framework (see Our Models in Table 1). ATT(v) is a generic attention-based encoder-decoder model that attends to the visual features only. CM-ATT is a cross-modal attentive model, which contains one individual encoder for each input modality and employs a cross-modal attention module to fuse the contexts of different modalities. CM-ATT(va) denotes the CM-ATT model with visual and audio attention, while CM-ATT(vad) has an additional decoder attention. As presented in Table 1, our ATT(v) model achieves results comparable with the top-ranked entries from the MSR-VTT challenge. Comparing ATT(v) and CM-ATT(va), we observe a substantial improvement from exploiting the deep audio features and adding cross-modal attention. The results of CM-ATT(vad) further demonstrate that decoder attention is beneficial for video captioning. To test the strength of the aligned attentive decoders, we also report the results of the HACA(w/o align) model, which shares almost the same architecture as the HACA model, except that it has only one decoder receiving both the global and local contexts. Our HACA model obtains superior performance, which demonstrates the effectiveness of the context alignment mechanism.

Figure 3: Learning curves of the CIDEr scores on the validation set. Note that greedy decoding is used during training, while beam search is employed at test time, thus the testing scores are higher than the validation scores here.

Effect of Deep Audio Features
In order to validate the superiority of the deep audio features in video captioning, we report the performance of different audio features applied in the CM-ATT model in Table 2. Evidently, the deep VGGish audio features work better than the hand-crafted MFCC audio features for the video captioning task. This also highlights the importance of audio features for understanding and describing a video.

Learning Curves
For a more intuitive view of the model capacity, we plot the learning curves of the CIDEr scores on the validation set in Figure 3. Three models are presented: HACA, HACA(w/o align), and CM-ATT. They are trained on the same input modalities and all paired with visual, audio, and decoder attentions. We can observe that the HACA model performs consistently better than the others and has the largest model capacity.

Conclusion
We introduce a generic architecture for video captioning which learns the aligned cross-modal attention globally and locally. It can be plugged into the existing reinforcement learning methods for video captioning to further boost the performance. Moreover, in addition to the deep visual and audio features, features from other modalities can also be incorporated into the HACA framework, such as optical flow and C3D features.