Low-Rank HOCA: Efficient High-Order Cross-Modal Attention for Video Captioning

This paper addresses the challenging task of video captioning, which aims to generate descriptions for video data. Recently, attention-based encoder-decoder structures have been widely used in video captioning. In the existing literature, the attention weights are often built from the information of an individual modality, while the association relationships between multiple modalities are neglected. Motivated by this, we propose a video captioning model with High-Order Cross-Modal Attention (HOCA), where the attention weights are calculated from a high-order correlation tensor to capture the frame-level cross-modal interaction of different modalities. Furthermore, we introduce Low-Rank HOCA, which adopts tensor decomposition to reduce the extremely large space requirement of HOCA, leading to a practical and efficient implementation in real-world applications. Experimental results on two benchmark datasets, MSVD and MSR-VTT, show that Low-Rank HOCA establishes a new state-of-the-art.


Introduction
Video captioning has drawn much attention from natural language processing and computer vision researchers (Venugopalan et al., 2014; Bin et al., 2016; Ramanishka et al., 2016; Zanfir et al., 2016). As videos typically consist of multiple modalities (image, motion, audio, etc.), video captioning is actually a multimodal learning task. The abundant information carried by these modalities is highly beneficial for video captioning models. However, how to effectively learn the association relationships between different modalities remains a challenging problem.
In the context of deep learning based video captioning, multimodal attention mechanisms (Xu et al., 2017; Hori et al., 2017; Wang et al., 2018c) are shown to help deliver superior performance. Bahdanau Attention (Bahdanau et al., 2014) is widely used to calculate the attention weights from the features of individual modalities, because it was originally proposed for machine translation, which can be considered a unimodal task. However, video captioning is a multimodal task, and different modalities are able to provide complementary cues to each other when calculating the attention weights. For example, as shown in Fig. 1, one part of the video shows two men wrestling and another part shows the referee coaching the match. With only the image modality, the model may pay equal attention to the frames with the competitors and those with the referee. With the additional motion modality, the attention mechanism weighs one of them more heavily, resulting in a model with more focused attention.
Figure 1: Ground-truth captions of the example video. GT1: a referee coaches a wrestling match. GT2: boys are wrestling in front of a crowd.
Motivated by the above observations, in this paper we propose an attention mechanism called High-Order Cross-Modal Attention (HOCA) for video captioning, which makes full use of the different modalities in video data by capturing their structural relationships. The key idea of HOCA is to consider the information of the other modalities when calculating the attention weights for one modality, in contrast to Bahdanau Attention. In addition, we propose a low-rank version of HOCA which significantly reduces the computational complexity of HOCA. Specifically, the attention weights of HOCA are computed from the similarity tensor between modalities, fully exploiting the correlation information of different modalities at each time step. Because the space requirement of the high-order tensor increases exponentially in the number of modalities, and inspired by tensor decomposition, we adopt a low-rank correlation structure across modalities in HOCA so that it scales well to an increasing number of modalities. This improvement largely reduces the algorithmic complexity of HOCA while giving good results in our empirical study.
Our contributions can be summarized as: (1) We propose High-Order Cross-Modal Attention (HOCA), which is a novel multimodal attention mechanism for video captioning. Compared with the Bahdanau Attention, HOCA captures the frame-level interaction between different modalities when computing the attention weights, leading to an effective multimodal modeling.
(2) Considering the scalability to the increasing number of modalities, we propose Low-Rank HOCA, which employs tensor decomposition to enable an efficient implementation of High-Order Cross-Modal Attention.
(3) Experimental results show that our method outperforms the state-of-the-art methods on video captioning, demonstrating its effectiveness. In addition, the theoretical and experimental complexity analyses show that Low-Rank HOCA can implement multimodal correlation efficiently with acceptable computing cost.

Attention Mechanism
Encoder-decoder structures have been widely used in sequence transformation tasks. Some models also connect the encoder and decoder through an attention mechanism (Bahdanau et al., 2014; Luong et al., 2015). In natural language processing (NLP), Bahdanau et al. (2014) first propose the soft attention mechanism to adaptively learn the context vector of the target keys/values. Such attention mechanisms are used in conjunction with a recurrent network. In the context of video captioning, many methods (Hori et al., 2017; Xu et al., 2017) use Bahdanau Attention to learn the context vector of the temporal features of video data, including the recent state-of-the-art method, the hierarchically aligned cross-modal attentive network (HACA) (Wang et al., 2018c). Such methods ignore the structural relationships between modalities when computing the attention weights.

Video Captioning
Compared with image data, video data have rich multimodal information, such as image, motion, audio, semantic object, and text. Therefore, the key is how to utilize this information. In the literature of video captioning, Xu et al. (2017) and Hori et al. (2017) propose hierarchical multimodal attention, which selectively attends to a certain modality when generating descriptions. Shen et al. (2017) and Gan et al. (2017) adopt multi-label learning with weak supervision to extract semantic features of video data. Wang et al. (2018b) propose optimizing the metrics directly with hierarchical reinforcement learning. Another approach extracts five types of features to develop a multimodal video captioning method and achieves promising results. Wang et al. (2018a) reconstruct the input video features from the output of the decoder, increasing the consistency between descriptions and video features. Wang et al. (2018c) propose the hierarchically aligned multimodal attention (HACA) to selectively fuse both global and local temporal dynamics of different modalities.
None of the methods mentioned above utilizes the interaction information of different modalities to calculate the attention weights. Motivated by this observation, we present HOCA and Low-Rank HOCA for video captioning. Different from the widely used Bahdanau Attention, our methods fully exploit the video representation of different modalities and their frame-level interaction information when computing the attention weights.
We introduce our methods in the following sections: Section 3 introduces the details of Bahdanau Attention based multimodal video captioning (background); Section 4 gives the derivations of HOCA and Low-Rank HOCA, and the encoder-decoder structure that we propose for video captioning; Sections 5 and 6 present the experimental settings and results of our methods.
The number of input modalities is $n$. $I_l$ denotes the features of the $l$-th modality, with $I_l \in \mathbb{R}^{d_l \times t_l}$, where $t_l$ denotes the temporal length and $d_l$ denotes the feature dimension. The corresponding output is a word sequence.

Encoder
The input video is fed to multiple feature extractors, which can be CNNs pre-trained for classification tasks, such as Inception-ResNet-v2 (Szegedy et al., 2017), I3D (Carreira and Zisserman, 2017), and VGGish. Each extractor corresponds to one modality. The extracted features are sent to a Bi-LSTM (Hochreiter and Schmidhuber, 1997), which processes sequence data and captures information from both the forward and backward directions. The output of the Bi-LSTM is kept as the encoded features.

Decoder
The decoder aims to generate the word sequence by utilizing the features provided by the encoder and the context information. At time step $t$, the decoder output $h_t$ is obtained from the input word $y_t$ and the previous output $h_{t-1}$. We treat $h_t$ as the query $q$, and the features of the different modalities are weighted by Bahdanau Attention separately, as shown in Fig. 2(a). For the $l$-th modality, this yields the attention weights $\alpha_t^{I_l} \in \mathbb{R}^{t_l}$, where $(I_l)_{r_l}$ denotes the $r_l$-th time step of $I_l$ and $(\alpha_t^{I_l})_{r_l}$ is the corresponding weight. We combine the attention weights and the features to obtain the context vector $\phi_t(I_l)$ of the $l$-th modality (Eqn. 4 and 5); the context vectors of the other modalities are obtained in the same way. We then integrate the context vectors of the $n$ modalities to predict the word, where $p_t$ denotes the probability distribution over words at time step $t$.
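The per-modality attention step described above can be sketched in NumPy as follows. This is a minimal illustration assuming the standard additive (Bahdanau) form; the parameter names `W_f`, `W_q`, `b`, and `w`, the attention size `a`, and the random inputs are hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def bahdanau_attention(features, query, W_f, W_q, b, w):
    """Additive attention for one modality.

    features: (d_l, t_l) matrix I_l, one column per time step.
    query:    (d_q,) decoder state h_t.
    Returns the weights alpha (t_l,) and the context vector (d_l,).
    """
    # Map features and query into a common space (assumed form), then tanh.
    hidden = np.tanh(W_f @ features + (W_q @ query + b)[:, None])  # (a, t_l)
    scores = w @ hidden                                            # (t_l,)
    alpha = softmax(scores)
    # Context vector: attention-weighted sum of the feature columns.
    context = features @ alpha
    return alpha, context

rng = np.random.default_rng(0)
d_l, t_l, d_q, a = 6, 5, 4, 8          # illustrative sizes only
I_l = rng.normal(size=(d_l, t_l))
h_t = rng.normal(size=d_q)
alpha, phi = bahdanau_attention(
    I_l, h_t,
    rng.normal(size=(a, d_l)), rng.normal(size=(a, d_q)),
    np.zeros(a), rng.normal(size=a),
)
```

Note that the weights of one modality depend only on that modality's features and the query, which is the limitation the cross-modal attention is designed to address.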

HOCA and Low-Rank HOCA
When given multimodal features, Bahdanau Attention and its variants process the different modalities separately, so the interaction between modalities is ignored. We propose HOCA and Low-Rank HOCA to exploit this information. In addition, the tensor decomposition used in Low-Rank HOCA reduces the complexity of the high-order correlation.

High-Order Cross-Modal Attention (HOCA)
Different modalities in video data are able to provide complementary cues for each other and should be integrated effectively. Inspired by the correlation matrix that is widely used in natural language understanding (Seo et al., 2016), we use a high-order correlation tensor to model the interaction between different modalities. Fig. 2 shows the generalized form of the structure for $n$ modalities; the core module is the "tensor multiplication" operation that we define. After a nonlinear mapping similar to Eqn. 3 with query $h_t$, the features of the $n$ modalities lie in a $d$-dimensional common space. Note that, for convenience, we still use $I_l$ to represent the features of the $l$-th modality (instead of $I_{l,t}$) and omit $t$, which denotes the time step of the decoder, in the derivations of Sections 4.1 and 4.2.
Let $\alpha_n^{I_l}$ denote the target attention weights of modality $I_l$. We obtain $\alpha_n^{I_l}$ through the high-order correlation tensor $C_n$ between the $n$ modalities. $C_n$ is obtained as:
$$(C_n)_{r_1,\ldots,r_n} = 1_d^{\top}\Big[\Lambda_{i=1}^{n} (I_i)_{r_i}\Big] \quad (7)$$
$$C_n = \otimes\{I_1, \ldots, I_n\} \quad (8)$$
where the $(r_1, \ldots, r_n)$-th entry of $C_n$ is the inner product of the $r_1$-th column (time step) of $I_1$, the $r_2$-th column (time step) of $I_2$, ..., and the $r_n$-th column (time step) of $I_n$. $\bullet$ denotes element-wise multiplication and $\Lambda$ denotes the element-wise multiplication $\bullet$ applied to a sequence of tensors. $\otimes$ is the tensor product operator, and we use it to define a new operation $\otimes\{\cdot, \ldots, \cdot\}$ for the "tensor multiplication" of multiple matrices. $1_d \in \mathbb{R}^{d}$ is a vector of ones. We use the vectors $1_d$ and $1_{t_i}$ to denote the summation operation along the $d$ and $t_i$ dimensions. In this situation, Eqn. 7 and Eqn. 8 are equivalent. The attention weights of $I_l$ are calculated as:
$$(\alpha_n^{I_l})_{r_l} = \frac{\exp\Big(\sum\big[W_{n-1}^{I_l} \bullet (C_n^{I_l})_{r_l}\big]\Big)}{\sum_{r'_l=1}^{t_l} \exp\Big(\sum\big[W_{n-1}^{I_l} \bullet (C_n^{I_l})_{r'_l}\big]\Big)} \quad (9)$$
where $(C_n^{I_l})_{r_l} = (C_n)_{:,\ldots,r_l,\ldots,:,:}$ is an $(n-1)$-order tensor that holds the correlation values between the $r_l$-th time step of $I_l$ and the different time steps of the other $n-1$ modalities. $W_{n-1}^{I_l}$ is also an $(n-1)$-order tensor; it has the same shape and denotes the relative importance of the correlation values.
$\sum[\cdot]$ denotes the summation over all elements of a high-order tensor. For simplicity, the "weighted sum" module can be considered as a linear layer in multiple dimensions.
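To make the tensor multiplication and the weighted sum concrete, the following NumPy sketch builds the correlation tensor for three modalities and derives the attention weights of one of them. It is an illustration under assumptions: the sizes, the random features, and the helper names `correlation_tensor` and `hoca_weights` are hypothetical, and a softmax over the weighted sums is assumed for the normalization.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def correlation_tensor(I1, I2, I3):
    """C[r1, r2, r3] = sum over d of I1[d, r1] * I2[d, r2] * I3[d, r3]."""
    return np.einsum("da,db,dc->abc", I1, I2, I3)

def hoca_weights(C, W, l):
    """Attention weights of modality l (0-based index).

    C: correlation tensor with one axis per modality.
    W: importance tensor over the axes of the other modalities.
    """
    C_l = np.moveaxis(C, l, 0)                   # bring the t_l axis to the front
    scores = np.tensordot(C_l, W, axes=W.ndim)   # weighted sum over the other axes
    return softmax(scores)

rng = np.random.default_rng(0)
d = 8                                            # common feature dimension
F, M, S = (rng.normal(size=(d, t)) for t in (4, 5, 3))
C = correlation_tensor(F, M, S)                  # shape (4, 5, 3)
W_F = rng.normal(size=(5, 3))                    # importance of (motion, audio) pairs
alpha_F = hoca_weights(C, W_F, l=0)              # weights over the 4 image steps
```

The tensor `C` has one entry per combination of time steps, which is the source of the space cost that the low-rank variant avoids.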

Low-Rank HOCA
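One of the main drawbacks of HOCA is the generation of the high-order tensor: its size, $\prod_{i=1}^{n} t_i$, increases exponentially with the number of modalities, resulting in a large amount of computation. Therefore, we implement the multimodal correlation between the different modalities in a more efficient way with a low-rank approximation, which has been widely used in the vision and language community (Lei et al., 2015; Liu et al., 2018). Following Eqn. 7 and 8, we rewrite $(C_n^{I_l})_{r_l}$ as¹:
$$(C_n^{I_l})_{r_l} = \otimes\{I_1 \bullet (I_l)_{r_l},\ I_2,\ \ldots,\ I_{l-1},\ I_{l+1},\ \ldots,\ I_n\} \quad (10)$$
Here the element-wise multiplication operator $\bullet$ is applied to the vector $(I_l)_{r_l}$ and the matrix $I_1$: each column of $I_1$ is multiplied by $(I_l)_{r_l}$. The vector $(I_l)_{r_l}$ is multiplied into the first factor only. We assume a low-rank factorization of the tensor $W_{n-1}^{I_l}$. Specifically, $W_{n-1}^{I_l}$ is decomposed into a sum of $k$ rank-1 tensors,
$$W_{n-1}^{I_l} = \sum_{j=1}^{k} \otimes\{w_j^{I_i}\}_{i=1,\, i \neq l}^{n} \quad (11)$$
where $w_j^{I_i} \in \mathbb{R}^{1 \times t_i}$. Note that we set $k$ to a constant value and use the recovered low-rank tensor to approximate $W_{n-1}^{I_l}$. The weighted sum in the numerator of Eqn. 9 can be further derived as:
$$\sum\big[W_{n-1}^{I_l} \bullet (C_n^{I_l})_{r_l}\big] = \sum\Big[\Big(\sum_{j=1}^{k} \otimes\{w_j^{I_i}\}_{i \neq l}\Big) \bullet \otimes\{I_1 \bullet (I_l)_{r_l},\ I_2,\ \ldots,\ I_n\}\Big] \quad (12)$$
$$= \sum_{j=1}^{k} \sum\Big[\otimes\{I_1 \bullet (I_l)_{r_l} \bullet w_j^{I_1},\ I_2 \bullet w_j^{I_2},\ \ldots,\ I_n \bullet w_j^{I_n}\}\Big] \quad (13)$$
where the $l$-th factor is omitted from each list. Since $I_i$ is a matrix and $w_j^{I_i}$ is a vector, we directly multiply $I_i$ and the corresponding $w_j^{I_i}$¹: each row of the matrix $I_i$ is multiplied element-wise by the vector $w_j^{I_i}$.
In Eqn. 13, we first apply the tensor multiplication to correlate the different time steps of all modalities. During this process, we sum over the $d$ dimension to obtain the elements of the high-order tensor. Second, we sum all the elements. For convenience, we change the order of the operations: we first sum over the temporal dimension and then sum over the $d$ dimension¹:
$$\sum\big[W_{n-1}^{I_l} \bullet (C_n^{I_l})_{r_l}\big] = \sum_{j=1}^{k} 1_d^{\top}\Big[(I_l)_{r_l} \bullet \Lambda_{i=1,\, i \neq l}^{n} \big((I_i \bullet w_j^{I_i})\, 1_{t_i}\big)\Big] \quad (15)$$
Letting $(\bar{I}_i)_j = (I_i \bullet w_j^{I_i})\, 1_{t_i}$ denote the global information of $I_i$ with importance factor $w_j^{I_i}$, Eqn. 12 can be further derived as:
$$\sum\big[W_{n-1}^{I_l} \bullet (C_n^{I_l})_{r_l}\big] = 1_d^{\top}\Big[(I_l)_{r_l} \bullet \sum_{j=1}^{k} \Lambda_{i=1,\, i \neq l}^{n} (\bar{I}_i)_j\Big]$$
Because the elements of the feature carry different information, we use a linear layer $w^{I_l}$ to replace $1_d$:
$$(w^{I_l})^{\top}\Big[(I_l)_{r_l} \bullet \sum_{j=1}^{k} \Lambda_{i=1,\, i \neq l}^{n} (\bar{I}_i)_j\Big] \quad (17)$$
The detailed structure of Low-Rank HOCA is shown in Fig. 2(c).
¹ The details of the propositions are shown in the supplementary materials.
Figure 3(b): Detailed structure of the MAF module. Each modality has three types of attention weights (unary, binary, and ternary), where "Ternary" denotes that the number of modalities is 3 in HOCA and Low-Rank HOCA, and "Binary" and "Unary" correspond to 2 and 1. The unary attention is equal to Bahdanau Attention. In addition, "Binary(f,m)" denotes the binary attention weights with image and motion. We utilize trainable parameters to determine the importance of these attention weights when integrating them.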
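The low-rank derivation can be checked numerically. The sketch below, for three modalities, compares the weighted sum computed from the full correlation tensor with the low-rank computation that never materializes that tensor. It is a sanity check under assumptions: the sizes, the random inputs, and the function names `full_scores` and `low_rank_scores` are hypothetical, and the importance tensor is constructed to be exactly rank $k$ so that both sides must agree.

```python
import numpy as np

def full_scores(I_l, others, W):
    """Weighted sums from the full tensor (two other modalities assumed)."""
    # C[r_l, a, b] = sum_d I_l[d, r_l] * others[0][d, a] * others[1][d, b]
    C = np.einsum("dl,da,db->lab", I_l, others[0], others[1])
    return np.tensordot(C, W, axes=2)            # (t_l,)

def low_rank_scores(I_l, others, factors, w_l=None):
    """Same scores without building the tensor.

    factors[i] has shape (k, t_i): the k rank-1 factor vectors of modality i.
    w_l is the optional linear layer replacing the all-ones vector.
    """
    d = I_l.shape[0]
    k = factors[0].shape[0]
    pooled = np.ones((d, k))
    for I_i, w_i in zip(others, factors):
        pooled *= I_i @ w_i.T                    # global information per rank, (d, k)
    fused = pooled.sum(axis=1)                   # sum over the k ranks, (d,)
    if w_l is None:
        w_l = np.ones(d)                         # plain summation over d
    return (w_l * fused) @ I_l                   # (t_l,)

rng = np.random.default_rng(0)
d, k = 8, 2
F, M, S = (rng.normal(size=(d, t)) for t in (4, 5, 3))
w_M, w_S = rng.normal(size=(k, 5)), rng.normal(size=(k, 3))
W = np.einsum("ja,jb->ab", w_M, w_S)             # rank-k importance tensor
dense = full_scores(F, [M, S], W)
cheap = low_rank_scores(F, [M, S], [w_M, w_S])
```

With `w_l` left at its default, `dense` and `cheap` agree up to floating-point error; passing a learned `w_l` corresponds to replacing the all-ones vector with a linear layer.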

Complexity Analysis
We analyze the space complexity of Bahdanau Attention, HOCA, and Low-Rank HOCA in Fig. 2, focusing on the trainable variables and the output of each layer. For convenience, we start from the output of the tanh layer, since the preceding structures of the three methods are the same.
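As a rough numerical illustration of the scaling argument in this analysis, the sketch below counts the entries of the full correlation tensor and the entries associated with the linear(t) layer of Low-Rank HOCA (the term $\sum_i t_i \times k + n \times k \times d$ given in this section). Assuming a temporal length of 80 for every modality is a simplification made here for illustration; $d = 512$ and $k = 1$ follow the experimental settings.

```python
from math import prod

def tensor_entries(lengths):
    # Number of entries in the full correlation tensor: product of the t_i.
    return prod(lengths)

def low_rank_entries(lengths, k, d):
    # Factor vectors (sum of t_i times k) plus pooled outputs (n * k * d).
    n = len(lengths)
    return sum(lengths) * k + n * k * d

two = [80, 80]          # assumed temporal lengths, two modalities
three = [80, 80, 80]    # assumed temporal lengths, three modalities

full_2, full_3 = tensor_entries(two), tensor_entries(three)
low_2 = low_rank_entries(two, k=1, d=512)
low_3 = low_rank_entries(three, k=1, d=512)
```

Adding the third modality multiplies the full tensor size by 80, while the low-rank count grows only by one additive term.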

Bahdanau Attention
The size of the trainable variable in the second linear(d) layer is $d \times 1$, and the size of the output is $\sum_{i=1}^{n} t_i$. The space complexity is $O(\sum_{i=1}^{n} t_i + d)$.
HOCA. The size of the output of the tensor multiplication is $\prod_{i=1}^{n} t_i$. The weighted sum module adds its own trainable variables and output on top of this, so the space requirement grows at least as fast as $\prod_{i=1}^{n} t_i$.
Low-Rank HOCA. The rank is set to $k$. The size of the trainable variable and the corresponding output in the linear(t) layer is $\sum_{i=1}^{n} t_i \times k + n \times k \times d$. The size of the output in the $B_l$ and mul layers is $n \times d + \sum_{i=1}^{n} t_i \times d$. The second linear(d) layer adds its trainable variable and the corresponding output.
Bahdanau Attention and Low-Rank HOCA both scale linearly in the number of modalities, while HOCA scales exponentially. Therefore, the complexity of HOCA explodes when the number of modalities is large, and Low-Rank HOCA solves this problem effectively.
In this section, we introduce our encoder-decoder structure combined with high-order attention (HOCA and Low-Rank HOCA). As shown in Fig. 3(a), the features of the different modalities, i.e. image (F), motion (M), and audio (S), are extracted in the encoder. These features are sent to the decoder to generate words. The "MAF" module performs HOCA and Low-Rank HOCA on the features of the different modalities with $h_t$. As shown in Fig. 3(b), each modality has three types of attention weights, i.e. unary, binary, and ternary, which correspond to the number of modalities (1, 2, or 3) involved in HOCA and Low-Rank HOCA (note that unary attention is equal to Bahdanau Attention). In some cases, not all the modalities are effective; for example, the binary attention weights of image and motion are more accurate for silent videos.
We use $\alpha_{t,3}^{F}$, $\alpha_{t,3}^{M}$, $\alpha_{t,3}^{S}$ to denote the ternary weights of the three modalities at time step $t$. They are computed by applying HOCA to the three sets of features with the query $h_t$, where $F$, $M$, $S$ denote the features of image, motion, and audio, respectively, and HOCA can be replaced by Low-Rank HOCA. We obtain the binary and unary weights in the same way. For the different attention weights, we utilize trainable variables to determine their importance. Taking the image modality as an example, the fused weights combine the four types of weights, where $\theta_1, \ldots, \theta_4$ are trainable variables, $\alpha_{t,1}^{F}$ denotes the unary weights, and $\alpha_{t,2}^{F,M}$ and $\alpha_{t,2}^{F,S}$ denote the binary weights with motion and audio, respectively. Once we have the attention weights of the three modalities, we calculate the context vectors $\phi_t(F)$, $\phi_t(M)$, $\phi_t(S)$ following Eqn. 4 and 5.
To further determine the relative importance of the multiple modalities, we apply a hierarchical attention mechanism to the context vectors in the "Fusion" module. This gives the attention weight $\beta_t^{F}$ of the image modality; $\beta_t^{M}$ and $\beta_t^{S}$ are obtained in the same way. We integrate the context vectors and the corresponding weights to predict the word. The optimization goal is to minimize the cross-entropy loss, defined as the loss accumulated over all the time steps, where $p_t^{*}$ denotes the probability of the ground-truth word at time step $t$, $T$ denotes the length of the description, and $V$ denotes the original video data.
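The training objective described above can be sketched as the negative log-probability of the ground-truth word accumulated over the $T$ time steps. The function name `caption_loss` and the toy distributions are illustrative assumptions; in the model, each row would be the predicted distribution $p_t$.

```python
import numpy as np

def caption_loss(step_probs, target_ids):
    """Accumulated cross-entropy over all time steps.

    step_probs: (T, vocab) array, each row a probability distribution p_t.
    target_ids: (T,) indices of the ground-truth words.
    """
    # p*_t: probability assigned to the ground-truth word at each step.
    p_star = step_probs[np.arange(len(target_ids)), target_ids]
    return -np.sum(np.log(p_star))

# Toy example with T = 2 steps and a vocabulary of 2 words.
probs = np.array([[0.5, 0.5],
                  [0.25, 0.75]])
targets = np.array([0, 1])
loss = caption_loss(probs, targets)
```

The loss is zero only when every ground-truth word receives probability 1, and it grows as the model assigns less probability to the reference words.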

Datasets and Metrics
We evaluate video captioning on two standard datasets, MSVD (Chen and Dolan, 2011) and MSR-VTT, both provided by Microsoft Research, and compare with several state-of-the-art methods. MSVD includes 1970 video clips. Each clip lasts about 10 to 25 seconds and is annotated with about 40 English sentences. MSR-VTT has 10000 video clips; each clip is annotated with 20 English sentences. We follow the protocol commonly used in previous work and use four common metrics in the evaluation: BLEU4, ROUGE, METEOR, and CIDEr.

Preprocessing and Experimental Setting
We sample each video to 80 frames for extracting image features. For extracting motion features, we first divide the raw video data into video chunks centered on the 80 sampled frames. Each video chunk includes 64 frames. For extracting audio features, we obtain the audio file from the raw video data with FFmpeg. For both datasets, we use a pre-trained Inception-ResNet-v2 (Szegedy et al., 2017) to extract image features from the sampled frames, and we keep the activations from the penultimate layer. In addition, we use a pre-trained I3D (Carreira and Zisserman, 2017) to extract motion features from the video chunks. We employ the activations from the last convolutional layer and apply mean pooling in the temporal dimension. We use the pre-trained VGGish to extract audio features. On MSR-VTT, we also utilize the GloVe embedding of the auxiliary video category labels to initialize the decoder state.
The hidden size is 512 for all LSTMs. The attention layer size for image, motion, and audio attention is also 512. The dropout rate for the input and output of the LSTM decoder is 0.5. The rank is set to 1. In the training stage, we use the Adam algorithm (Kingma and Ba, 2014) to optimize the loss function; the learning rate is set to 0.0001. In the testing stage, we use beam search with a beam width of 5. We use a pre-trained word2vec embedding to initialize the word vectors. Each word is represented as a 300-dimensional vector. Words that are not in the word2vec matrix are initialized randomly. All the experiments are run on 4 GTX 1080Ti GPUs.

Experimental Results
Table 1 shows the results of different variants of HOCA and Low-Rank HOCA. HOCA-U, HOCA-B, and HOCA-T denote the models with only unary, binary, and ternary attention, respectively. HOCA-UB and HOCA-UBT denote the models with original HOCA and more types of attention mechanisms. The prefix "L-HOCA" denotes the model with Low-Rank HOCA.
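The frame sampling and chunking described above can be sketched as index arithmetic. Uniform sampling with `np.linspace` and clipping of chunk indices at the video boundaries are assumptions made for this illustration; the text does not specify how frames are selected or how chunks near the start and end of a video are handled.

```python
import numpy as np

def sample_frame_indices(num_frames, num_samples=80):
    # Uniformly spaced frame indices across the video (assumed strategy).
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)

def chunk_indices(center, num_frames, chunk_len=64):
    # A chunk of chunk_len frames centered on a sampled frame; indices that
    # fall outside the video are clipped to the first or last frame.
    start = center - chunk_len // 2
    idx = np.arange(start, start + chunk_len)
    return np.clip(idx, 0, num_frames - 1)

num_frames = 300                                   # hypothetical video length
centers = sample_frame_indices(num_frames)         # 80 image-feature frames
chunks = [chunk_indices(c, num_frames) for c in centers]  # 80 motion chunks
```

Each of the 80 chunks would then be passed to the motion feature extractor, giving image and motion sequences of the same temporal length.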

Impact of Cross-Modal Attention
We observe that the models with only one type of attention mechanism (U, B, or T) in the decoder achieve relatively poor results on both datasets. However, when we combine them, the performance improves significantly, especially on ROUGE and CIDEr. HOCA-UBT and L-HOCA-UBT, which combine unary, binary, and ternary attention, achieve relatively promising results on all the metrics. We argue that HOCA-UBT and L-HOCA-UBT can learn appropriate ratios of all types of attention mechanisms based on the specific video-description pairs, while the other variants only focus on one or two types. In addition, the low-rank models (L-HOCA-UB and L-HOCA-UBT) score better than the original models (HOCA-UB and HOCA-UBT). On the one hand, we utilize $w^{I_l}$ to replace $1_d$ in Eqn. 17, fully mining the different information carried by the elements of the feature; on the other hand, the low-rank approximation is effective.
Table 2 shows the results of different methods on MSR-VTT and MSVD, including ours (L-HOCA-UBT) and some state-of-the-art methods, such as LSTM-TSA (Pan et al., 2017), TDDF, SCN (Gan et al., 2017), MM-TGM, Dense Caption (Shen et al., 2017), RecNet (Wang et al., 2018a), HRL (Wang et al., 2018b), and HACA (Wang et al., 2018c).
From Table 2, we find that our model (L-HOCA-UBT) is competitive with the state-of-the-art methods. On MSVD, L-HOCA-UBT outperforms SCN, TDDF, RecNet, LSTM-TSA, and MM-TGM on all the metrics. In particular, L-HOCA-UBT achieves 86.1% on CIDEr, an improvement of 5.7% over MM-TGM. On MSR-VTT, we make a similar observation: L-HOCA-UBT outperforms RecNet, HRL, MM-TGM, Dense Caption, and HACA on all the metrics.
Table 3: Computing cost of different methods, where "space" denotes the memory space requirement and "training time" denotes the total time for training. We evaluate them on MSR-VTT. Note that the measurements are for the whole model, not only the attention module.

Computing Cost
The theoretical complexity of the different attention mechanisms is analyzed in Section 4.3. In practice, we use the experimental settings mentioned above; the batch size and the maximum number of epochs are set to 25 and 100, respectively. The training time and memory space requirement are shown in Table 3. L-HOCA-UBT has a smaller space requirement and a lower time cost than HOCA-UBT. In addition, the computing cost of L-HOCA-UBT is close to that of HOCA-U (Bahdanau Attention). The results demonstrate the advantage of Low-Rank HOCA.

Rank Setting
We also evaluate the impact of different rank values. We show the results on MSVD in Fig. 5. The red and green lines represent HOCA-UBT and L-HOCA-UBT, respectively. We find that the CIDEr of L-HOCA-UBT fluctuates only slightly as the rank changes, and that a small rank value can achieve competitive results with high efficiency.
Fig. 4 shows some qualitative results of our method. We compare the descriptions generated by HOCA-U and L-HOCA-UBT. GT represents "Ground Truth". Benefiting from the high-order correlation of multiple modalities, L-HOCA-UBT can generate more accurate descriptions that are close to GT.

Conclusion
In this paper, we have proposed a new cross-modal attention mechanism called HOCA for video captioning. HOCA integrates the information of the other modalities into the inference of the attention weights of the current modality. Furthermore, we have introduced Low-Rank HOCA, which scales well to an increasing number of modalities. The experimental results on two standard datasets have demonstrated the effectiveness of our approach.

A Supplemental Material
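A.1 Proposition 1 for Eqn. 13 in the paper
Proposition 1: Suppose that we have $n$ matrices, $I_1, I_2, \ldots, I_n$, and $n$ vectors, $w_1, w_2, \ldots, w_n$, with $I_l \in \mathbb{R}^{d \times t_l}$ and $w_l \in \mathbb{R}^{1 \times t_l}$. Then
$$\otimes\{w_1, \ldots, w_n\} \bullet \otimes\{I_1, \ldots, I_n\} = \otimes\{I_1 \bullet w_1, \ldots, I_n \bullet w_n\}$$
Proof: We use $C_l$ and $C_r$ to denote the left side and the right side of the equation, respectively, and compare the two tensors element-wise. Following Eqn. 7 and 8 in the paper, the $(r_1, \ldots, r_n)$-th entry of $C_l$ is expressed as
$$(C_l)_{r_1,\ldots,r_n} = \Big[\prod_{i=1}^{n} (w_i)_{r_i}\Big]\, 1_d^{\top}\Big[\Lambda_{i=1}^{n} (I_i)_{r_i}\Big] = 1_d^{\top}\Big[\Lambda_{i=1}^{n} (w_i)_{r_i} (I_i)_{r_i}\Big] = (C_r)_{r_1,\ldots,r_n}$$
where $(I_i)_{r_i}$ is a vector that denotes the $r_i$-th column of $I_i$, and $(w_i)_{r_i}$ is the $r_i$-th value of the vector $w_i$. Since $(w_i)_{r_i}$ is a single element, we can directly multiply it with the corresponding vector $(I_i)_{r_i}$.
The proposition is proven and is used to convert Eqn. 12 to Eqn. 13 in the paper.

A.2 Proposition 2 for Eqn. 15 in the paper
Proposition 2: Suppose that we have $n$ matrices, $I_1, I_2, \ldots, I_n$, with $I_l \in \mathbb{R}^{d \times t_l}$. Then
$$\sum\big[\otimes\{I_1, \ldots, I_n\}\big] = 1_d^{\top}\Big[\Lambda_{i=1}^{n} (I_i\, 1_{t_i})\Big]$$
Here we use the vectors $1_d$ and $1_{t_i}$, which consist of ones, to represent the summation operation for the matrix $I_i$ in the $d$ and $t_i$ dimensions, respectively.
Proof: We use $v_l$ to denote the left side of the equation and $v_r$ to denote the right side of the equation. We can express $v_l$ as
$$v_l = \sum_{r_1=1}^{t_1} \cdots \sum_{r_n=1}^{t_n} C_{r_1,r_2,\ldots,r_n} \quad (30)$$
Following Eqn. 7 and 8 in the paper, we can express $C_{r_1,r_2,\ldots,r_n}$ as
$$C_{r_1,r_2,\ldots,r_n} = 1_d^{\top}\Big[\Lambda_{i=1}^{n} (I_i)_{r_i}\Big] \quad (32)$$
We apply Eqn. 32 to Eqn. 30:
$$v_l = 1_d^{\top}\Big[\Lambda_{i=1}^{n} \sum_{r_i=1}^{t_i} (I_i)_{r_i}\Big] = 1_d^{\top}\Big[\Lambda_{i=1}^{n} (I_i\, 1_{t_i})\Big] = v_r$$
The proposition is proven and is used to convert Eqn. 13 to Eqn. 15 in the paper.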

A.3 Learning Curves
We show the learning curves of CIDEr on the validation set in Fig. 6 and observe that L-HOCA-UBT performs better than HOCA-UBT and HOCA-U when the training converges. Note that we use greedy search during training and beam search during testing, so the testing scores are higher.

A.4 Visualization of Attention Weights
We also visualize the attention weights in the multiple attentive fusion (MAF) module. As shown in Fig. 7, HOCA-UBT obtains a more accurate ratio of each modality than HOCA-U; e.g. for the word "man", HOCA-U assigns a higher score to the motion modality, which contradicts human intuition.
Figure 7: Visualization of the attention weights in the multiple attentive fusion (MAF) module. The red bar denotes the image modality, the green bar denotes the motion modality, and the blue bar denotes the audio modality.