Dual Low-Rank Multimodal Fusion

Tensor-based fusion methods have been proven effective in multimodal fusion tasks. However, existing tensor-based methods make poor use of the fine-grained temporal dynamics of multimodal sequential features. Motivated by this observation, this paper proposes a novel multimodal fusion method called Fine-Grained Temporal Low-Rank Multimodal Fusion (FT-LMF). FT-LMF correlates the features of individual time steps between multiple modalities, but it involves multiplications of high-order tensors in its calculation. This paper further proposes Dual Low-Rank Multimodal Fusion (Dual-LMF) to reduce the computational complexity of FT-LMF through low-rank tensor approximation along dual dimensions of the input features. Dual-LMF is conceptually simple and practically effective and efficient. Empirical studies on benchmark multimodal analysis tasks show that our proposed methods outperform state-of-the-art tensor-based fusion methods with similar computational complexity.


Introduction
Multimodal fusion aims to integrate the information of multiple modalities into a compact but informative representation. It is fundamentally important for real-world multimodal applications like speech translation (Yuhas et al., 1989), emotion recognition (De Silva et al., 1997; Chen et al., 2018), and sentiment analysis (Morency et al., 2011). It is very challenging because it requires correlating the semantics of multiple modalities in an effective and efficient way. Recently, several methods have been proposed to learn joint embeddings of multiple modalities (Fukui et al., 2016; Nojavanasghari et al., 2016).

There are two lines of fusion methods: early fusion and late fusion. In this paper, we mainly focus on the former, which aims to integrate the information of different modalities before it is processed by the model. Earlier work on early fusion employs a simple concatenation of input features (Pérez-Rosas et al., 2013; Park et al., 2014; Zadeh et al., 2016b). To construct a more compact representation, the Tensor Fusion Network (TFN) averages the features of each modality along the temporal dimension and transforms the multimodal features into a high-order tensor which is used for subsequent tasks. Although TFN achieves better performance than simple concatenation, its computational complexity increases exponentially with the number of modalities. Liu et al. (2018) further propose Low-Rank Multimodal Fusion (LMF), which employs low-rank approximation to reconstruct the high-order tensor. However, these tensor-based methods neglect the fine-grained temporal dynamics, which carry rich structured information for multimodal modeling. For example, if the facial expression of a speaker is happy at time step t, he is more likely to speak in a positive tone at time step t + ∆t. The features of different time steps and different modalities are correlated.

* Equal contribution. † Corresponding author.
Motivated by this observation, in this paper we introduce Fine-Grained Temporal Low-Rank Multimodal Fusion (FT-LMF). Instead of averaging the features along the temporal dimension, we associate the features of individual time steps between different modalities to form a high-order tensor. The tensor is then embedded to a low-dimensional matrix for subsequent tasks. Compared with LMF, FT-LMF is able to capture the cross-modal interactions at a finer granularity on the temporal dimension.
Since FT-LMF involves multiplications of high-order tensors in its calculation, its computational complexity increases exponentially with the number of modalities. To tackle this problem, we further introduce Dual Low-Rank Multimodal Fusion (Dual-LMF), which approximates the high-order tensor using low-rank tensor decomposition along both the temporal and non-temporal dimensions of the input features. We show that Dual-LMF has a linear complexity w.r.t. the number of modalities. In experiments, we validate FT-LMF and Dual-LMF on four benchmark multimodal analysis datasets, where they show promising results in comparison with the state-of-the-art methods. The contributions of this paper can be summarized as follows:
(1) To address the neglect of fine-grained temporal dynamics in existing tensor-based fusion methods, we propose Fine-Grained Temporal Low-Rank Multimodal Fusion (FT-LMF), which correlates the features of different time steps between all the modalities.
(2) To reduce the computational complexity of FT-LMF, we propose Dual Low-Rank Multimodal Fusion (Dual-LMF), which employs low-rank decomposition to approximate the high-order tensor along its dual dimensions.
(3) Experimental results show that our methods outperform state-of-the-art methods on different multimodal analysis tasks.

Related Work
Multimodal analysis has attracted much attention recently. Thanks to high-quality open-source datasets like CMU-MOSI, POM, YouTube, and ICT-MMMO, many effective methods have been proposed and comprehensively evaluated. The key to multimodal analysis is the fusion of multimodal information. Generally, there are two lines of fusion methods: early fusion and late fusion. Early fusion methods integrate the features of different modalities before feeding them to the model; for instance, concatenating the features (Zadeh et al., 2016b) is a simple way. However, the intra-modal dynamics cannot be effectively captured, and the temporal information of a single modality is ignored in early fusion. Late fusion methods (Nojavanasghari et al., 2016) utilize the information of each single modality for inference, and then ensemble the predictions by majority voting or weighted averaging (Wörtwein and Scherer, 2017). Unfortunately, the inter-modal interactions are not modeled in late fusion.
To address the drawbacks of the above methods, Pham et al. (2019) investigate learning joint representations via cyclic translations from source to target modalities, using only the source modality for prediction during testing. TFN and its successors propose to embed multiple feature vectors into a high-order tensor to improve the modeling of inter-modal relationships. However, the computational complexity of TFN increases exponentially with the number of modalities. LMF (Liu et al., 2018) reduces the complexity of TFN by applying low-rank decomposition to the high-order tensor, but it simply averages the feature matrices along the temporal dimension or selects the feature vector of a single time step, ignoring the rich fine-grained temporal information.
In this paper, we develop Fine-Grained Temporal Low-Rank Multimodal Fusion (FT-LMF) to correlate the features of different time steps between modalities. Furthermore, considering that the computational complexity of FT-LMF increases exponentially with the number of modalities, we propose Dual Low-Rank Multimodal Fusion (Dual-LMF), which applies low-rank decomposition to both dimensions of the input features. Our methods improve the performance on several tasks, i.e., multimodal sentiment analysis and speaker traits recognition, at an acceptable computational cost.

Tensor Fusion Network
We start by introducing TFN, which performs multimodal fusion only on the non-temporal dimension of the input features. Suppose that the feature space of the m-th modality is R^{d_m×t_m} and the number of modalities is M. TFN randomly chooses one time step from the features of each modality and denotes it as v_m ∈ R^{d_m}. As shown in Fig. 1, TFN transforms the input vectors v_1, v_2, ..., v_M into a high-order tensor and then maps it back to a low-dimensional vector. The input tensor \tilde{V} formed by the unimodal representations is calculated as:

\tilde{V} = v_1 ⊗ v_2 ⊗ ... ⊗ v_M,    (1)

where ⊗ denotes the tensor outer product over a set of vectors and \tilde{V} ∈ R^{∏_{m=1}^{M} d_m} is the hybrid representation of the input vectors. Following the conventional setting of neural networks, \tilde{V} is fed to a fully-connected layer for dimension reduction:

h = W_h · \tilde{V} + b_h,    (2)

where W_h ∈ R^{d_h × ∏_{m=1}^{M} d_m} and b_h ∈ R^{d_h}. It is obvious that the computational complexity of TFN increases exponentially with the number of modalities.
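As a concrete illustration, the outer-product fusion of Eqns. 1-2 can be sketched in a few lines of NumPy. The shapes, variable names, and random inputs below are illustrative assumptions, not the authors' released code; the bias is omitted as in the paper.

```python
import numpy as np

def tfn_fuse(vectors, W_h):
    """Sketch of TFN-style fusion: outer product of unimodal vectors,
    then a fully-connected projection back to d_h dimensions."""
    V = vectors[0]
    for v in vectors[1:]:
        V = np.tensordot(V, v, axes=0)   # tensor outer product (Eqn. 1)
    # V has prod(d_m) entries, so W_h grows exponentially with M (Eqn. 2)
    return W_h @ V.reshape(-1)

rng = np.random.default_rng(0)
d = [4, 3, 5]                                  # d_m for M = 3 modalities
vs = [rng.standard_normal(dm) for dm in d]
W_h = rng.standard_normal((8, 4 * 3 * 5))      # d_h = 8
h = tfn_fuse(vs, W_h)
print(h.shape)                                 # (8,)
```

Note that the fully-connected layer alone holds d_h × ∏ d_m weights, which is the exponential blow-up the paper refers to.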

Low-Rank Multimodal Fusion
To reduce the complexity of TFN, LMF (Liu et al., 2018) utilizes low-rank decomposition to approximate the high-order tensor W_h, as shown in Fig. 2. (In practice, the bias b_h is absorbed by concatenating a scalar value of 1 to v_m; thus we omit b_h in the subsequent derivations of this paper.) LMF first divides the (M+1)-order tensor W_h into a series of M-order tensors:

W_h = [W_h^1; W_h^2; ...; W_h^{d_h}].    (3)

To efficiently calculate the tensor multiplication W_h^i · \tilde{V}, LMF applies low-rank decomposition:

W_h^i = \sum_{r=1}^{R} \otimes_{m=1}^{M} (W_h^i)_{m,r},    (4)

where (W_h^i)_{m,r} ∈ R^{d_m×1} and R is the rank. W_h^i · \tilde{V} is then computed based on Eqns. 1 and 4:

W_h^i · \tilde{V} = \Sigma\big( W_h^i • \tilde{V} \big) = \sum_{r=1}^{R} \prod_{m=1}^{M} (W_h^i)_{m,r}^T v_m,    (5)

where • denotes element-wise multiplication and \Sigma denotes the summation over all the elements of a high-order tensor. To facilitate reading, we rewrite Eqn. 5 for all output channels as

W_h · \tilde{V} = \sum_{r=1}^{R} \Lambda_{m=1}^{M} \big[ (W_h)_{m,r}^T v_m \big],    (6)

where \Lambda_{m=1}^{M} denotes the element-wise multiplication • over a sequence of tensors; for instance, \Lambda_{m=1}^{3} x_m = x_1 • x_2 • x_3. Here (W_h)_{m,r} ∈ R^{d_m×d_h} stacks the vectors (W_h^i)_{m,r} ∈ R^{d_m×1} over the output channels i. Through this low-rank approximation of the high-order tensor, LMF scales linearly with the number of modalities.
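The low-rank trick of Eqn. 6 can be verified numerically: the rank-R factors reproduce the full tensor contraction without ever materializing the ∏ d_m-sized tensor. The sketch below uses hypothetical shapes and is an illustration of the technique, not the released LMF implementation.

```python
import numpy as np

def lmf_fuse(vectors, factors):
    """Low-rank fusion following the form of Eqn. 6.
    vectors: M unimodal vectors v_m of shape (d_m,).
    factors: M factor tensors (W_h)_{m,r} stacked as shape (R, d_m, d_h).
    Returns h of shape (d_h,) without building the full tensor."""
    # (W_h)_{m,r}^T v_m for every rank r -> (R, d_h) per modality
    proj = [np.einsum('rdh,d->rh', F, v) for F, v in zip(factors, vectors)]
    out = proj[0]
    for p in proj[1:]:
        out = out * p                  # element-wise product Lambda over modalities
    return out.sum(axis=0)             # sum over the rank dimension

rng = np.random.default_rng(1)
d, d_h, R = [4, 3, 5], 6, 2
vs = [rng.standard_normal(dm) for dm in d]
Fs = [rng.standard_normal((R, dm, d_h)) for dm in d]
h = lmf_fuse(vs, Fs)
print(h.shape)                         # (6,)
```

A brute-force check (reconstructing W_h^i via Eqn. 4 and contracting with the outer product of Eqn. 1) yields the same result, which is exactly the equivalence stated in Eqn. 5.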
FT-LMF

Instead of averaging the features along the temporal dimension, FT-LMF applies an LMF-style fusion to every group of time steps between multiple modalities. We use

(V_1)_{l_1} ⊗ (V_2)_{l_2} ⊗ ... ⊗ (V_M)_{l_M}    (7)

to denote the correlation result of the selected time steps of the M modalities, where V_m ∈ R^{d_m×t_m} denotes the feature matrix of the m-th modality and (V_m)_{l_m} denotes the l_m-th time step of V_m. Since the output space of LMF is R^{d_h}, collecting the outputs of all the groups yields a high-order tensor H ∈ R^{∏_{m=1}^{M} t_m × d_h} which carries the interactive information of different time steps between modalities. Following Eqn. 2, we calculate the values of the tensor as:

H[l_1, l_2, ..., l_M] = \sum_{r=1}^{R} \Lambda_{m=1}^{M} \big[ (W_h)_{m,r}^T (V_m)_{l_m} \big].    (8)

We then map H to a 2-D matrix with a fully-connected layer:

K = W_k · H + b_k,    (9)

where the spaces of W_k and b_k are R^{d_k × ∏_{m=1}^{M} t_m} and R^{d_k}, respectively; thus the space of K is R^{d_k×d_h}. For the convenience of subsequent derivations, we rewrite Eqn. 9 channel-wise as

K^i = W_k · H^i + b_k,    (10)

where H^i ∈ R^{∏_{m=1}^{M} t_m} is one channel of H and K^i ∈ R^{d_k} is the corresponding channel of K. In practice, the parameters W_k are generated by an attention mechanism to better capture the importance of each time-step group:

W_k = \exp(\tilde{W}_k) \,/\, \sum \exp(\tilde{W}_k),    (11)

where the numerator is element-wise divided by the denominator.
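For M = 3 modalities, the FT-LMF computation (Eqns. 8-11) can be sketched as follows. All shapes are hypothetical, and the attention of Eqn. 11 is assumed here to be a softmax over time-step groups; this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def ft_lmf(V, factors, scores):
    """FT-LMF sketch for three modalities.
    V: feature matrices V_m of shape (d_m, t_m).
    factors: rank-R factors (W_h)_{m,r}, shape (R, d_m, d_h) per modality.
    scores: unnormalized attention scores for W_k, shape (d_k, t1*t2*t3).
    Returns K of shape (d_k, d_h)."""
    # (W_h)_{m,r}^T (V_m)_{l_m} for every time step -> (R, t_m, d_h)
    P = [np.einsum('rdh,dt->rth', F, Vm) for F, Vm in zip(factors, V)]
    # Eqn. 8: H[l1, l2, l3, :] correlates every triple of time steps
    H = np.einsum('rah,rbh,rch->abch', P[0], P[1], P[2])
    # Eqn. 11 (assumed softmax): numerator element-wise divided by denominator
    W_k = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    # Eqn. 9: map H to a 2-D matrix K (bias omitted)
    return W_k @ H.reshape(-1, H.shape[-1])

rng = np.random.default_rng(2)
d, t, d_h, d_k, R = [4, 3, 5], [2, 2, 2], 6, 1, 2
V = [rng.standard_normal((dm, tm)) for dm, tm in zip(d, t)]
Fs = [rng.standard_normal((R, dm, d_h)) for dm in d]
scores = rng.standard_normal((d_k, 2 * 2 * 2))
K = ft_lmf(V, Fs, scores)
print(K.shape)                        # (1, 6)
```

The tensor H has ∏ t_m × d_h entries, which is the exponential cost the next section removes.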
FT-LMF shown in Fig. 3 is able to capture the fine-grained temporal interactions between different modalities, while the computational complexity of its high-order tensor H increases exponentially with the number of modalities. To tackle this problem, we further propose Dual-LMF as discussed in the next section.

Dual-LMF
Based on FT-LMF, Dual-LMF further performs low-rank decomposition on both the temporal and the non-temporal dimensions. First, we follow LMF and divide the (M+1)-order tensor W_k into a series of M-order tensors, the number of which is d_k:

W_k = [W_k^1; W_k^2; ...; W_k^{d_k}].    (13)

Each channel of Eqn. 10 can then be written as

W_k^j · H^i = \sum_{l_1, ..., l_M} W_k^j[l_1, ..., l_M] \, H^i[l_1, ..., l_M],    (14)

where H^i[l_1, l_2, ..., l_M] ∈ R^1 is an element of H^i ∈ R^{∏_{m=1}^{M} t_m}. We apply low-rank decomposition to each W_k^j:

W_k^j = \sum_{r_2=1}^{R_2} \otimes_{m=1}^{M} (W_k^j)_{m,r_2},    (16)

where (W_k^j)_{m,r_2} ∈ R^{t_m×1} and R_2 is the rank on the temporal dimension. We substitute Eqn. 16 into Eqn. 14:

W_k^j · H^i = \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} \prod_{m=1}^{M} (W_k^j)_{m,r_2}^T \big( (W_h^i)_{m,r_1}^T V_m \big)^T,    (17)

where R_1 denotes the rank of the non-temporal decomposition of Eqn. 4 (written R there). We treat both (W_k^j)_{m,r_2} and (W_h^i)_{m,r_1}^T V_m as t_m-dimensional vectors in the intermediate steps of the derivation, while the final form uses the original sizes, i.e., (W_k^j)_{m,r_2} ∈ R^{t_m×1} and (W_h^i)_{m,r_1}^T V_m ∈ R^{1×t_m}. Referring to the derivations of LMF and employing a simple transformation to Eqn. 17, we obtain the output fusion matrix:

W_k H = \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} \Lambda_{m=1}^{M} \big[ (W_k)_{m,r_2}^T \big( (W_h)_{m,r_1}^T V_m \big)^T \big],    (18)

where, similar to Eqn. 11, (W_k)_m ∈ R^{t_m×(R_2×d_k)}, which stacks (W_k^j)_{m,r_2} over j and r_2, is computed with an element-wise attention mechanism in which the numerator is element-wise divided by the denominator. Thanks to the low-rank decomposition on both the temporal and non-temporal dimensions of the input features, Dual-LMF shown in Fig. 4 is much more efficient than FT-LMF and scales well with an increasing number of modalities.
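The upshot of Eqns. 17-18 is that K can be computed without ever materializing H. A NumPy sketch for three modalities follows; the stacked factor layouts are illustrative assumptions, not the authors' code.

```python
import numpy as np

def dual_lmf(V, factors_h, factors_k):
    """Dual-LMF sketch: low-rank factors on BOTH dimensions.
    V: feature matrices V_m of shape (d_m, t_m).
    factors_h: non-temporal factors (W_h)_{m,r1}, shape (R1, d_m, d_h).
    factors_k: temporal factors (W_k)_{m,r2}, shape (R2, t_m, d_k).
    Returns K of shape (d_k, d_h); the prod(t_m)-sized tensor H is never built."""
    terms = []
    for Vm, Fh, Fk in zip(V, factors_h, factors_k):
        proj = np.einsum('rdh,dt->rth', Fh, Vm)             # (W_h)_{m,r1}^T V_m
        # contract the temporal dimension with the temporal factors
        terms.append(np.einsum('rth,stk->rskh', proj, Fk))  # (R1, R2, d_k, d_h)
    out = terms[0]
    for term in terms[1:]:
        out = out * term            # element-wise product over modalities
    return out.sum(axis=(0, 1))     # sum over both ranks -> (d_k, d_h)

rng = np.random.default_rng(3)
d, t, d_h, d_k, R1, R2 = [3, 2, 4], [2, 3, 2], 5, 2, 2, 2
V = [rng.standard_normal((dm, tm)) for dm, tm in zip(d, t)]
Fh = [rng.standard_normal((R1, dm, d_h)) for dm in d]
Fk = [rng.standard_normal((R2, tm, d_k)) for tm in t]
K = dual_lmf(V, Fh, Fk)
print(K.shape)                      # (2, 5)
```

One can check that this agrees with the brute-force route of first building H (Eqn. 8) and W_k (Eqn. 16) explicitly and then contracting them, which is precisely the equivalence Eqn. 17 asserts.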
Datasets

CMU-MOSI (Zadeh et al., 2016a) is created for sentiment analysis. It contains 93 long videos annotated with sentiment labels in the range [-3, 3]. During training and testing, we divide the videos into 2199 chunks for label alignment. Following the existing work, we divide the whole dataset into three parts for training, validation, and testing. Note that the same speaker does not appear in multiple sets.
POM (Park et al., 2014) is created for speaker traits recognition. It contains 903 movie review videos and each video is annotated with 16 speaker traits, including confident, passionate, voice pleasant, dominant, credible, vivid, expertise, entertaining, reserved, trusting, relaxed, outgoing, thorough, nervous, persuasive and humorous.
YouTube (Morency et al., 2011) is created for sentiment analysis. It contains 47 videos from the social media website YouTube and each video is annotated at the segment level for sentiment.
ICT-MMMO (Wöllmer et al., 2013) is created for sentiment analysis. It contains 370 movie review videos and each video is annotated at the video level for sentiment.

Features
In this paper, we follow the existing methods to do empirical studies on three different modalities, including audio, visual, and text. In addition, P2FA (Yuan and Liberman, 2008) is utilized to align the three modalities at the word granularity. The visual and audio features are aligned by computing their average value over the utterance interval of each word.
To extract audio, visual, and text features, we follow LMF. Specifically, for the audio modality, we use COVAREP (Degottex et al., 2014) to extract a set of low-level acoustic features. For the visual modality, we use Facet (iMotions, 2017) to extract a set of visual features for each frame. For the text modality, we use pre-trained 300-dimensional GloVe word vectors (Pennington et al., 2014) as word representations.
For audio and visual features, we use a 2-layer feed-forward neural network to handle the features of all time steps. For text features, we use an LSTM (Hochreiter and Schmidhuber, 1997) to capture the semantic information. After encoding the features, we send them to fusion models.
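The per-time-step feed-forward encoder can be sketched as below; the LSTM text encoder is a standard component and is not reimplemented here, and all shapes and names are hypothetical.

```python
import numpy as np

def ffn_encode(X, W1, b1, W2, b2):
    """2-layer feed-forward encoder applied independently to every time step.
    X: raw audio/visual features of shape (d_in, t); returns (d_out, t)."""
    H = np.maximum(W1 @ X + b1[:, None], 0.0)   # hidden layer with ReLU
    return W2 @ H + b2[:, None]

rng = np.random.default_rng(4)
X = rng.standard_normal((10, 20))               # d_in = 10, t = 20 time steps
W1, b1 = rng.standard_normal((16, 10)), np.zeros(16)
W2, b2 = rng.standard_normal((8, 16)), np.zeros(8)
Y = ffn_encode(X, W1, b1, W2, b2)
print(Y.shape)                                  # (8, 20)
```

Because the encoder is applied per time step, the temporal dimension t survives encoding, which is what FT-LMF and Dual-LMF rely on.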

Metrics
For different datasets, we compare the methods under different metrics. For CMU-MOSI, we report Mean Absolute Error (MAE), Pearson correlation (Corr), binary accuracy, F1-score, and 7-class accuracy. For POM, we report the average MAE, average Corr, and average binary accuracy over the speaker traits. For YouTube, we report 3-class accuracy and F1-score. For ICT-MMMO, we report binary accuracy and F1-score.
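These regression-derived metrics can be sketched as below. The exact conventions (e.g., how zero scores are binarized) vary across papers, so this is an illustrative implementation rather than the evaluation script of any particular work.

```python
import numpy as np

def mosi_metrics(pred, gold):
    """Sketch of common CMU-MOSI metrics computed from real-valued scores."""
    mae = np.abs(pred - gold).mean()
    corr = np.corrcoef(pred, gold)[0, 1]              # Pearson correlation
    acc2 = (np.sign(pred) == np.sign(gold)).mean()    # binary accuracy
    # 7-class accuracy: round scores into the integer range [-3, 3]
    acc7 = (np.clip(np.round(pred), -3, 3) == np.clip(np.round(gold), -3, 3)).mean()
    return mae, corr, acc2, acc7

pred = np.array([1.2, -2.4, 0.6, 2.9])
gold = np.array([1.0, -2.0, -1.0, 3.0])
mae, corr, acc2, acc7 = mosi_metrics(pred, gold)
print(round(mae, 3), acc2)   # 0.575 0.75
```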

Model and Optimization
For a fair comparison, we implement FT-LMF and Dual-LMF similarly to LMF, except that we keep all the time steps of the three modalities. The output of FT-LMF and Dual-LMF is K ∈ R^{d_k×d_h}. In the experiments, we set d_k to 1 and d_h to the number of attributes. We employ the MAE loss to optimize the learnable variables.

Experimental Setting
For CMU-MOSI, the output dimension is 1. We train the model for at most 500 epochs; if the validation MAE does not decrease for 20 epochs, we stop training. The other hyper-parameters (i.e., hidden size, learning rate, batch size) are determined by grid search. The best hyper-parameters differ among TFN, LMF, FT-LMF, and Dual-LMF.
For POM, the output dimension is 16, since we treat the prediction of the 16 speaker traits as a multi-label task. We also train the model for at most 500 epochs with a patience of 20. The other hyper-parameters are determined by grid search.
For YOUTUBE and ICT-MMMO, the output dimension is 1. We also train the model for at most 500 epochs and the patience is 20. The other hyper-parameters are determined by the grid search method.

Comparison Baselines
We use TFN and LMF (Liu et al., 2018) as our baselines. In addition, we compare our methods with several other state-of-the-art methods that employ simple feature encoders, such as LSTMs and fully-connected layers, since we also use these simple encoders and focus on the fusion method before the final prediction.
SVM is trained on simply concatenated multimodal features for prediction (Zadeh et al., 2016b; Park et al., 2014; Pérez-Rosas et al., 2013).

DF (Nojavanasghari et al., 2016) uses multiple fully-connected layers to predict the results for each modality respectively, and ensembles the results. Another baseline correlates multiple modalities with a proposed context-dependent fusion method.

MV-LSTM (Rajagopalan et al., 2016) is an extension to LSTM, designed to model both view-specific and cross-view dynamics by partitioning the internal representations to mirror the multiple input modalities.

MCTN (Pham et al., 2019) learns joint representations via cyclic translations from source to target modalities and only uses the source modality for prediction during testing.
MARN discovers the interactions between modalities through time with a neural module called the Multi-attention Block and stores them in a hybrid memory component called the Long-short Term Hybrid Memory. Although MARN considers temporal information, it is not tensor-based.

Results

Table 1 shows the performances of the methods on the CMU-MOSI and POM datasets. On CMU-MOSI, FT-LMF and Dual-LMF outperform the state-of-the-art methods on MAE, Corr, and Acc-7, and Dual-LMF has a better overall performance than FT-LMF. On POM, we report the average performances over the 16 speaker traits and find that Dual-LMF outperforms the state-of-the-art methods on all the metrics. Table 2 shows the performances of the methods on the ICT-MMMO and YouTube datasets. The observed results are similar to those on POM. The promising empirical results demonstrate the effectiveness of our methods.

Effect of Fine-Grained Temporal Information
To further validate the effect of fine-grained temporal information, we show the performances of FT-LMF and Dual-LMF with different time-step sizes. In our experiments, t_v = t_a = t_l = 20, and we select a series of time-step sizes, including 2, 4, 10, and 20, for comparison. Note that FT-LMF, Dual-LMF, and LMF are equivalent when the time-step size is 20. As shown in Fig. 5, the performances of the models improve as the step size decreases. The results demonstrate the effectiveness of incorporating fine-grained temporal dynamics into the multimodal fusion scheme.

Space Complexity
We analyze the space complexity of the different methods theoretically. Following the assumptions in the approach section, we focus on the sizes of the learnable variables and the output of each layer. Note that we omit variables of relatively small size, e.g., the biases.
TFN: The size of a vector of the m-th modality is d_m; therefore, the size of the high-order tensor is ∏_{m=1}^{M} d_m, and the size of the variables in the fully-connected layer is d_h × ∏_{m=1}^{M} d_m.

LMF: We map all the vectors to a dimension of d_h. With the rank set to R, the size of the variables in the linear layers is R × d_h × Σ_{m=1}^{M} d_m.

FT-LMF: We use LMF for ∏_{m=1}^{M} t_m groups of time steps in total. The size of the variables in the fully-connected layer of LMF is d_h × R × Σ_{m=1}^{M} d_m; the size of the generated high-order tensor is ∏_{m=1}^{M} t_m × d_h; and the size of the variables in the subsequent attention-based fully-connected layer is d_k × ∏_{m=1}^{M} t_m.

Dual-LMF: The size of the variables in the linear layers is R_1 × d_h × Σ_{m=1}^{M} d_m + R_2 × d_k × Σ_{m=1}^{M} t_m.

With respect to the number of modalities, we can easily find that TFN and FT-LMF have an exponential space complexity, while LMF and Dual-LMF have a linear space complexity.

Practical FLOPs

Table 4 shows the floating point operations (FLOPs) of the different methods on CMU-MOSI. Specifically, we use a set of hyper-parameters with t_v = t_a = t_l = 20. The FLOPs of TFN and FT-LMF are much higher than those of LMF and Dual-LMF.
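The parameter-count analysis above can be made concrete with a small helper. The formulas below follow this section's reconstruction (sums of d_m and t_m for the linear-complexity methods, products for the exponential ones) and should be read as a sketch of the analysis, not exact counts for any implementation.

```python
import math

def param_counts(d, t, d_h, d_k, R, R1, R2):
    """Rough learnable-parameter counts per the space-complexity analysis.
    d, t: per-modality feature sizes d_m and time lengths t_m."""
    tfn = d_h * math.prod(d)                         # FC over the full tensor
    lmf = R * d_h * sum(d)                           # low-rank factors only
    ft_lmf = R * d_h * sum(d) + d_k * math.prod(t)   # + attention FC over all groups
    dual = R1 * d_h * sum(d) + R2 * d_k * sum(t)     # low-rank on both dimensions
    return tfn, lmf, ft_lmf, dual

# e.g. three modalities with d_m = 32 and t_m = 20 each
counts = param_counts([32, 32, 32], [20, 20, 20], 8, 1, 4, 4, 4)
print(counts)
```

Even at this small scale, the product terms of TFN and FT-LMF dominate the sum terms of LMF and Dual-LMF, mirroring the FLOPs gap reported in Table 4.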

Empirical Study on Rank Value
The selection of the rank is important in multimodal fusion. We use the hyper-parameters mentioned above and evaluate Dual-LMF with combinations of different rank values R_1 and R_2, starting from R_1 = R_2 = 1 and gradually increasing them. The results on CMU-MOSI are shown in Fig. 6. We find that neither R_1 nor R_2 alone determines the final performance, so a careful selection of R_1 and R_2 is necessary. In addition, Dual-LMF with low rank values can achieve results similar to those with high rank values, while the computational complexity is reduced.

Conclusion
In this paper, we have proposed novel multimodal fusion methods, FT-LMF and Dual-LMF, for multimodal analysis tasks. FT-LMF is a fine-grained version of Low-Rank Multimodal Fusion which associates the features of individual time steps between multiple modalities. Based on FT-LMF, Dual-LMF performs low-rank tensor approximation along dual dimensions of the input features to reduce the exponential computational complexity of FT-LMF to a linear complexity w.r.t. the number of modalities. The experimental results show that our methods achieve superior performance compared with the state-of-the-art methods at a similar computational cost.

A.1 Reproducibility

We run the experiments on a GTX 1080Ti. The main hyper-parameters include the audio, video, and text hidden dimensions, the audio, video, and text dropout rates, the learning rate, the weight decay, and the ranks R_1 and R_2. Grid search is employed to find an appropriate combination of parameters. For each method, we randomly try 2000 combinations, since the model is small and the running time is short, as shown in Table 4. The feature extraction method and the division of the training and test sets follow prior work. If the paper is accepted, we promise to release the source code and the best-performing hyper-parameters.

A.2 Derivations for Eqn. 6

The two formations of W_h^i · \tilde{V} in Eqn. 5 are equal, just with different operation orders: the former applies the summation (\Sigma) first, while the latter applies the multiplication (•) between different elements first. Rearranging the terms accordingly yields the final formation of W_h^i · \tilde{V} in Eqn. 6.

A.3 Derivations for Eqn. 17

W_k^j · H^i can be rewritten by expanding Eqn. 14 with Eqns. 8 and 16; proceeding similarly to Eqns. 21, 24, and 27, we obtain the final formation of W_k^j · H^i in Eqn. 17.