Modality-based Factorization for Multimodal Fusion

We propose a novel method, Modality-based Redundancy Reduction Fusion (MRRF), for understanding and modulating the relative contribution of each modality in multimodal inference tasks. This is achieved by obtaining an (M+1)-way tensor to consider the high-order relationships between M modalities and the output layer of a neural network model. Applying a modality-based tensor factorization method, which adopts different factors for different modalities, removes information present in a modality that can be compensated by other modalities with respect to model outputs. This helps to understand the relative utility of information in each modality. In addition, it leads to a less complicated model with fewer parameters and can therefore be applied as a regularizer to avoid overfitting. We have applied this method to three different multimodal datasets in sentiment analysis, personality trait recognition, and emotion recognition. We are able to recognize relationships and the relative importance of different modalities in these tasks, and achieve a 1% to 4% improvement on several evaluation measures compared to the state-of-the-art for all three tasks.


Introduction
Multimodal data fusion is a desirable method for many machine learning tasks where information is available from multiple source modalities, typically achieving better predictions through integration of information from different modalities. Multimodal integration can handle missing data from one or more modalities. Since some modalities can include noise, it can also lead to more robust prediction. Moreover, since some information may not be visible in some modalities, or a single modality may not be powerful enough for a specific task, considering multiple modalities often improves performance (Potamianos et al., 2003; Soleymani et al., 2012; Kampman et al., 2018).
For example, humans assign personality traits to each other, as well as to virtual characters by inferring personality from diverse cues, both behavioral and verbal, suggesting that a model to predict personality should take into account multiple modalities such as language, speech, and visual cues.
Our method, Modality-based Redundancy Reduction multimodal Fusion (MRRF), builds on recent work in multimodal fusion, utilizing first an outer product tensor of input modalities to better capture inter-modality dependencies (Zadeh et al., 2017), and a recent approach to reduce the number of elements in the resulting tensor through low-rank factorization (Liu et al., 2018). Whereas the factorization used in (Liu et al., 2018) applies a single compression rate across all modalities, we instead use Tucker's tensor decomposition (see the Methodology section), which allows different compression rates for each modality. This allows the model to adapt to variations in the amount of useful information between modalities. Modality-specific factors are chosen by maximizing performance on a validation set.
Applying a modality-based factorization method removes redundant information duplicated across modalities, leading to fewer parameters with minimal information loss. Through maximizing performance on a validation set, our method can work as a regularizer, leading to a less complicated model and reducing overfitting. In addition, our modality-based factorization approach helps to understand the differences in useful information between modalities for the task at hand.
We evaluate the performance of our approach using sentiment analysis, personality detection, and emotion recognition from audio, text and video frames. The method reduces the number of parameters, which requires fewer training samples, providing efficient training for the smaller datasets and accelerating both training and prediction. Our experimental results demonstrate that the proposed approach can make notable improvements, in terms of accuracy, mean average error (MAE), correlation, and F1 score, especially for the applications with more complicated inter-modality relations.
We further study the effect of different compression rates for different modalities. Our results on the importance of each modality for each task support the previous results on the usefulness of each modality for personality recognition, emotion recognition and sentiment analysis.
In the sequel, we first describe related work. We elaborate on the details of our proposed method in the Methodology section. In the following section we go on to describe our experimental setup. In the Results section, we compare the performance of MRRF and state-of-the-art baselines on three different datasets and discuss the effect of compression rate on each modality. Finally, we provide a brief conclusion of the approach and the results. Supplementary materials describe the methodology in greater detail.
Notation: The operator ⊗ denotes the outer product, where z_1 ⊗ … ⊗ z_M for z_i ∈ R^{d_i} yields an M-way tensor in R^{d_1 × … × d_M}. The operator ×_k, for a given k, denotes the k-mode product of a tensor R ∈ R^{r_1 × r_2 × … × r_M} and a matrix W ∈ R^{d_k × r_k}, written W ×_k R, which yields a tensor in R^{r_1 × … × r_{k−1} × d_k × r_{k+1} × … × r_M}.
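To make the two operations concrete, the following is a minimal numpy sketch (a stand-in illustration, not the paper's implementation) of the outer product of modality vectors and the k-mode product:

```python
import numpy as np

# Outer product z1 ⊗ z2 ⊗ z3 of three modality vectors -> a 3-way tensor.
z1, z2, z3 = np.ones(2), np.arange(3.0), np.ones(4)
T = np.einsum('i,j,k->ijk', z1, z2, z3)        # shape (2, 3, 4)

def mode_product(R, W, k):
    """k-mode product W x_k R: contract mode k of R (size r_k) with a
    matrix W of shape (d_k, r_k); mode k of the result has size d_k."""
    out = np.tensordot(W, R, axes=([1], [k]))  # contracted axis moves to front
    return np.moveaxis(out, 0, k)

R = np.random.rand(2, 3, 4)
W = np.random.rand(5, 3)                       # d_k = 5, r_k = 3
P = mode_product(R, W, k=1)                    # shape (2, 5, 4)
```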
According to the recent work by (Baltrušaitis et al., 2018), the techniques for multimodal fusion can be divided into early, late and hybrid approaches. Early approaches combine the multimodal features immediately by simply concatenating them (D'mello and Kory, 2015). Late fusion combines the decision for each modality (either classification or regression) by voting (Morvant et al., 2014), averaging (Shutova et al., 2016) or a weighted sum of the outputs of the learned models (Glodek et al., 2011; Shutova et al., 2016). The hybrid approach combines the prediction by early fusion with unimodal predictions.
It has been observed that early fusion (feature-level fusion) concentrates on the intra-modality information rather than inter-modality information (Zadeh et al., 2017), due to the fact that inter-modality information can be more complicated at the feature level and dominates the learning process. On the other hand, these fusion approaches are not powerful enough to extract the inter-modality integration model and are limited to some simple combining methods (Zadeh et al., 2017). Zadeh et al. (2017) proposed combining n modalities by computing an n-way tensor as a tensor product of the n different modality representations, followed by a flattening operation, in order to include 1st-order to n-th order inter-modality relations. This is then fed to a neural network model to make predictions. The authors show that their proposed method improves accuracy by considering both inter-modality and intra-modality relations. However, the generated representation has a very large dimension, which leads to a very large hidden layer and therefore a huge number of parameters.
The authors of (Poria et al., 2017a,b; Zadeh et al., 2018a,b) introduce attention mechanisms utilizing the contextual information available from the utterances of each speaker. They require additional information, such as the identity of the speaker and the sequence of utterance sentiments, while integrating the multimodal data. Since these methods, unlike our proposed method, need additional information that might not be available in the general scenario, we do not include them in our experiments.
Low Rank Factorization: Recently, (Liu et al., 2018) proposed a factorization approach in order to achieve a factorized version of the weight tensor, leading to fewer parameters while maintaining model accuracy. They use a CANDECOMP/PARAFAC decomposition (Carroll and Chang, 1970; Harshman, 1970), which decomposes a tensor as

W = Σ_{i=1}^{r} λ_i w_1^{(i)} ⊗ … ⊗ w_M^{(i)},   (1)

where ⊗ is the outer product operator and the λ_i are scalar weights combining the rank-1 components. This approach uses the same compression rate for all modalities, i.e. the rank r is shared across all the modalities, and cannot allow for varying compression rates between modalities. Previous studies have found that some modalities are more informative than others (De Silva et al., 1997; Kampman et al., 2018), suggesting that allowing different compression rates for different modalities should improve performance.
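The shared-rank CP/PARAFAC form above can be sketched in numpy for M = 3 modalities (illustrative shapes and names, not the LMF implementation; note that the single rank r appears in every factor):

```python
import numpy as np

rng = np.random.default_rng(0)
r, d1, d2, d3 = 4, 3, 5, 2
lam = rng.normal(size=r)              # scalar weights λ_i
W1 = rng.normal(size=(r, d1))         # rank-1 factor vectors per modality
W2 = rng.normal(size=(r, d2))
W3 = rng.normal(size=(r, d3))

# W = Σ_i λ_i · w1_i ⊗ w2_i ⊗ w3_i  -- all modes share the same rank r.
W_cp = np.einsum('i,ia,ib,ic->abc', lam, W1, W2, W3)
```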

Tucker Factorization for Multimodal Learning
Modality-based Redundancy Reduction Fusion (MRRF): We use Tucker's tensor decomposition method (Tucker, 1966; Hitchcock, 1927), which decomposes an M-way tensor W ∈ R^{d_1 × d_2 × … × d_M} into a core tensor R ∈ R^{r_1 × r_2 × … × r_M} and M matrices W_i ∈ R^{d_i × r_i}, with r_i ≤ d_i:

W = R ×_1 W_1 ×_2 W_2 … ×_M W_M.   (2)
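A minimal numpy sketch of this Tucker form for M = 3 (sizes are illustrative): the small core tensor R plus one factor matrix per mode, where the per-mode ranks r_i may differ, requires far fewer parameters than the dense tensor it reconstructs:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = (4, 5, 6), (2, 3, 2)                             # per-mode ranks differ
R = rng.normal(size=r)                                  # core tensor
Ws = [rng.normal(size=(d[k], r[k])) for k in range(3)]  # W_i in R^{d_i x r_i}

# W = R x_1 W_1 x_2 W_2 x_3 W_3, written as one contraction here.
W_full = np.einsum('abc,ia,jb,kc->ijk', R, *Ws)

dense_params = 4 * 5 * 6
tucker_params = sum(dk * rk for dk, rk in zip(d, r)) + 2 * 3 * 2
```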
For M modalities with representations D_1, D_2, …, D_M of sizes d_1, d_2, …, d_M, the M-modal tensor fusion approach proposed by the authors of (Zadeh et al., 2017) leads to a tensor D = D_1 ⊗ D_2 ⊗ … ⊗ D_M. The authors proposed flattening this tensor layer in the deep network, which results in loss of the information contained in the tensor structure. In this paper, we propose to avoid the flattening and follow Eq. 3 with a weight tensor W ∈ R^{h × d_1 × d_2 × … × d_M}, which leads to an output layer H of size h.

H = W D,   (3)

where the product contracts W with the fused tensor D along all modality dimensions. The above equation suffers from a large number of parameters, O(h ∏_{i=1}^{M} d_i), which requires a large number of training samples, substantial time and space, and easily overfits. In order to reduce the number of parameters, we propose to use Tucker's tensor decomposition (Tucker, 1966; Hitchcock, 1927), as shown in Eq. 4, which works as a low-rank regularizer (Fazel, 2002).
W = R ×_1 W_1 ×_2 W_2 … ×_M W_M,   (4)

where the decomposition is applied along the M modality modes of W and the output dimension h is kept uncompressed in the core tensor R ∈ R^{h × r_1 × … × r_M}. The non-diagonal core tensor R maintains inter-modality information after compression, unlike the factorization proposed by (Liu et al., 2018), which loses part of the inter-modality information.

Proposed MRRF framework
We propose Modality-based Redundancy Reduction Fusion (MRRF), a tensor fusion and factorization method allowing for modality-specific compression rates, combining the power of tensor fusion methods with a reduced parameter complexity. Without loss of generality, we will consider the number of modalities to be 3 in this discussion.
Our method first forms an outer product tensor from input modalities D, then projects this via a tensor W to a feature vector H passed as input to a neural network which performs the desired inference task.
The trainable projection tensor W represents a large number of parameters, and in order to reduce this number, we propose to use Tucker's tensor decomposition (Tucker, 1966; Hitchcock, 1927), which works as a low-rank regularizer (Fazel, 2002). This results in a decomposition of W into a core tensor R of reduced dimensionality and three modality-specific matrices W_i.
where ×_k is the k-mode product of a tensor and a matrix. Equation 5 can then be re-written so that each modality representation D_i is first projected by its factor matrix W_i, giving Z = (W_1^T D_1) ⊗ (W_2^T D_2) ⊗ … ⊗ (W_M^T D_M), after which the core tensor R is contracted with Z to produce H. See Figure 1 for an overview of this process for the case of three separate channels for audio, text, and video. In practice we flatten tensors Z and R to reduce this last operation to a matrix multiplication. Further details of the decomposition strategy can be found in the supplementary materials.
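Under the shapes assumed above, this factorized forward pass can be sketched in numpy (an illustrative sketch, not the training code): each modality is projected down by its factor matrix first, then contracted with the non-diagonal core, so the full (h, d_1, d_2, d_3) weight tensor is never materialized.

```python
import numpy as np

rng = np.random.default_rng(2)
d1, d2, d3, h = 4, 5, 6, 8              # modality sizes and output size
r1, r2, r3 = 2, 2, 3                    # per-modality compression ranks

z1, z2, z3 = rng.normal(size=d1), rng.normal(size=d2), rng.normal(size=d3)
# Factor matrices stored transposed, shape (r_i, d_i), for convenient matvec.
W1 = rng.normal(size=(r1, d1))
W2 = rng.normal(size=(r2, d2))
W3 = rng.normal(size=(r3, d3))
Rcore = rng.normal(size=(h, r1, r2, r3))   # non-diagonal core tensor

# Project each modality first, then contract Z = (W1 z1) ⊗ (W2 z2) ⊗ (W3 z3)
# with the core; equivalent to applying the dense weight tensor to D.
H = np.einsum('habc,a,b,c->h', Rcore, W1 @ z1, W2 @ z2, W3 @ z3)
```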
Note that a simple outer product of the input features leads only to the high-order trimodal dependencies.In order to also obtain the unimodal and bimodal dependencies, the input feature vectors for each modality are padded by 1.This also provides a constant element whose corresponding factors in W act as a bias vector.
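The effect of the padding can be seen in a toy bimodal example (made-up numbers): the padded outer product contains the unimodal terms and a constant bias entry alongside the bimodal terms.

```python
import numpy as np

z1, z2 = np.array([2.0, 3.0]), np.array([5.0])
p1, p2 = np.append(z1, 1.0), np.append(z2, 1.0)  # pad each modality with 1
Z = np.einsum('i,j->ij', p1, p2)                 # 3 x 2 padded bimodal tensor
```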
Algorithm 1 shows the whole MRRF process:

Algorithm 1: MRRF
1: Compute the fused tensor D from the padded modality representations D_1, D_2, …, D_M.
2: Decompose the weight tensor W into a core tensor R and matrices W_1, W_2, …, W_M in order to transform the high-dimensional tensor D to the output H.
3: Use the Adam optimizer for the differentiable tensor factorization layer to find the unknown parameters W_1, W_2, …, W_M, R.
The original tensor fusion approach as proposed in (Zadeh et al., 2017) flattened the tensor D, which results in loss of the information included in the tensor structure; this is avoided in our approach. Liu et al. (2018) developed a similar approach to ours using a diagonal core tensor R, losing much inter-modality information. Our non-diagonal core tensor maintains key inter-modality information after compression.
Note that the factorization step is task dependent, included in the deep network structure and learned during network training. Thus, for follow-up learning tasks, we would learn a new factorization specific to the task at hand, typically also estimating optimal compression ratios as described in the discussion section. In this process, any shared, helpful information is retained, as demonstrated by our results.
Analysis of parameter complexity: Following our proposed approach, we have decomposed the trainable tensor W into a core tensor R and substantially smaller trainable factor matrices W_i, requiring O(Σ_{i=1}^{M} d_i r_i + h ∏_{i=1}^{M} r_i) parameters in total. The number of parameters in the proposed approach is thus substantially smaller than in the simple tensor fusion (TF) approach and comparable to the LMF approach.
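As a back-of-the-envelope comparison, the asymptotic counts can be evaluated for illustrative sizes (the helper names, example dimensions, and ranks below are assumptions; constants and bias terms are omitted):

```python
from math import prod

def tf_params(d, h):
    """Dense tensor fusion: full (h x d1 x ... x dM) weight tensor."""
    return h * prod(d)

def lmf_params(d, h, r):
    """LMF with a single shared rank r: O(sum_i r * h * d_i)."""
    return sum(r * h * di for di in d)

def mrrf_params(d, h, r):
    """MRRF: per-modality factors (d_i x r_i) plus an (h x prod r_i) core."""
    return sum(di * ri for di, ri in zip(d, r)) + h * prod(r)

d, h = (32, 32, 300), 64
counts = (tf_params(d, h), lmf_params(d, h, 4), mrrf_params(d, h, (4, 4, 8)))
```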
Experimental Setup

Datasets
We perform our experiments on the following multimodal datasets: CMU-MOSI (Zadeh et al., 2016), POM (Park et al., 2014), and IEMOCAP (Busso et al., 2008) for sentiment analysis, speaker trait recognition, and emotion recognition, respectively. These tasks can be performed by integrating both the verbal and nonverbal behaviors of the speakers.
The CMU-MOSI dataset is annotated on a seven-step scale as highly negative, negative, weakly negative, neutral, weakly positive, positive, and highly positive, which can be considered as a 7-class classification problem with labels in the range [−3, +3]. The dataset consists of 2199 annotated opinion utterances from 93 distinct YouTube movie reviews, each containing several opinion segments. Segments average 4.2 seconds in length.
The POM dataset is composed of 903 movie review videos.Each video is annotated with the following speaker traits: confident, passionate, voice pleasant, dominant, credible, vivid, expertise, entertaining, reserved, trusting, relaxed, outgoing, thorough, nervous, persuasive and humorous.
The IEMOCAP dataset is a collection of 151 videos of recorded dialogues, with 2 speakers per session for a total of 302 videos across the dataset. Each segment is annotated for the presence of 9 emotions (angry, excited, fear, sad, surprised, frustrated, happy, disgust and neutral).
Each dataset consists of three modalities, namely language, visual, and acoustic. The visual and acoustic features are calculated by taking the average of their feature values over the word time interval (Chen et al., 2017). In order to perform time alignment across modalities, the three modalities are aligned using P2FA (Yuan and Liberman, 2008) at the word level.
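The averaging step can be sketched as follows (a hypothetical illustration with made-up frame times and features, not the actual preprocessing code): given word boundaries from the alignment, frame-level features falling inside each word's interval are averaged.

```python
import numpy as np

frame_times = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5])  # frame timestamps (s)
frame_feats = np.arange(12.0).reshape(6, 2)             # 6 frames x 2 features

def word_feature(start, end):
    """Average all frame features whose timestamp falls in [start, end)."""
    mask = (frame_times >= start) & (frame_times < end)
    return frame_feats[mask].mean(axis=0)
```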
Pre-trained 300-dimensional GloVe word embeddings (Chen et al., 2017) were used to extract the language feature representations, encoding a sequence of transcribed words into a sequence of vectors.
Visual features for each frame (sampled at 30 Hz) are extracted using the Facet library, which provides 20 facial action units, 68 facial landmarks, head pose, gaze tracking and HOG features (Zhu et al., 2006).

Model Architecture
Similarly to (Liu et al., 2018), we use a simple model architecture for extracting the representations for each modality. We used three unimodal sub-embedding networks to extract representations z_a, z_v and z_l for each modality, respectively. For the acoustic and visual modalities, the sub-embedding network is a simple 2-layer feed-forward neural network, and for language, we used a long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997).
We tuned the layer sizes, the learning rates and the compression rates by grid search, checking the mean average error on the validation set. We trained our model using the Adam optimizer (Kingma and Ba, 2014). All models were implemented with PyTorch (Paszke et al., 2017).

Experimental Results and Comparing with State-of-the-art
We compared our proposed method with three baseline methods. Concat fusion (CF) (Baltrušaitis et al., 2018) is a simple concatenation of the different modalities followed by a linear combination. The tensor fusion approach (TF) (Zadeh et al., 2017) computes a tensor including uni-modal, bi-modal, and tri-modal combination information. LMF (Liu et al., 2018) is a tensor fusion method that performs tensor factorization using the same rank for all the modalities in order to reduce the number of parameters. Our proposed method instead uses different factors for each modality.
In Table 2, we present mean average error (MAE), the correlation between predictions and true scores, binary accuracy (Acc-2), multi-class accuracy (Acc-7) and F1 measure. The proposed approach outperforms the baseline approaches on nearly all metrics, with marked improvements in Happy and Neutral recognition. The reason is that the inter-modality information for these emotions is more complicated than for the other emotions, requiring a non-diagonal core tensor to extract it. It is worth noting that for the equivalent setting, with equal ranks for all modalities, the result of the proposed method is always marginally better than that of the LMF method.

Investigating the Effect of Compression Rate on Each Modality
In this section, we aim to investigate the amount of redundant information in each modality. To do this, after obtaining a tensor which includes the combinations of all modalities at equal sizes, we factorize a single dimension of the tensor while keeping the sizes for the other modalities fixed. By observing how performance changes with compression rate, one can find how much redundant information is contained in the corresponding modality relative to the other modalities.
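The probing loop can be sketched as follows (the rank values and the `evaluate` callback are hypothetical placeholders; in practice `evaluate` would train MRRF at the given ranks and return validation accuracy):

```python
# Full (uncompressed) per-modality sizes, illustrative values only.
FULL = {'audio': 8, 'video': 8, 'text': 8}

def sweep(modality, sizes, evaluate):
    """Compress one modality at a time, holding the others at full size,
    and record the score returned by `evaluate` for each compressed size."""
    results = {}
    for r in sizes:
        ranks = {**FULL, modality: r}   # only one modality is compressed
        results[r] = evaluate(ranks)
    return results

# Toy evaluate: pretend accuracy scales with the compressed size.
curve = sweep('video', [1, 2, 4, 8], lambda rk: rk['video'] / 8)
```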
The results can be seen in Figs. 2, 3 and 4. The horizontal axis is the compressed size and the vertical axis shows the accuracy for each modality. Note that due to the padding of each D_i with 1, we have used r_i + 1 as the new embedding size.
The first clear observation from the modality diagrams is that each modality behaves differently as it is compressed, meaning each has a different amount of information that cannot be compensated by the non-compressed modalities. In other words, high accuracy when a modality is highly compressed means that there is a lot of redundant information in that modality: the information loss resulting from factorization could be compensated by the other modalities, so performance was not reduced.
Fig. 2 shows results for the CMU-MOSI sentiment analysis dataset. For this dataset, a notable decrease in accuracy can be seen when compressing the video modality, while the audio and text modalities are not notably sensitive to compression. This shows that, for sentiment analysis on the CMU-MOSI dataset, the information in the video modality cannot be compensated by the other modalities, whereas most of the information in the audio and language modalities is covered by the video modality. In other words, the video contains essential information for this task, whereas information from audio and language can be recovered from video.
Fig. 3 shows the average accuracy over the 16 speaker traits for the POM personality trait recognition dataset. For this dataset too, each of the modalities behaves differently under different compression rates. We can see that the audio modality includes more non-redundant information for personality recognition, as accuracy is highly affected by audio compression. In addition, there is a notable accuracy reduction only when the language modality is highly compressed, indicating a small amount of non-redundant information in language for this task. Note that the POM data does not contain sufficient information for an effective analysis of the 16 individual traits.
Fig. 4 shows the results for the IEMOCAP emotion recognition dataset for each of the four emotional categories: happy, angry, sad, and neutral.
Looking at the sad category, we see notable accuracy reduction at small sizes (high compression) for all the modalities, showing that each contains at least some non-redundant information. However, high compression of the audio and especially the language modality results in strong accuracy reduction, whereas video compression results in relatively minor reduction. It can be concluded that for this emotion the language modality has the most non-redundant information and the video modality very little: its information can be compensated by the other two modalities. Moving on to the angry emotion, small sizes (high compression) result in accuracy reduction for the audio and language modalities, showing that they contain some non-redundant information, with the audio modality containing more. Again, the information in video can be almost completely compensated by the other two modalities.
By comparing the highest accuracy values for the various emotion categories, we observe that neutral is hard to predict in comparison to the other categories. Again, the audio and language modalities both include non-redundant information, leading to a severe accuracy reduction under high compression of these modalities, with video containing almost no information not compensated by audio and language.
The happy category is the easiest emotion to predict; it suffers only slightly for very small sizes of the audio, video, and language modalities, indicating a small amount of non-redundant information in all modalities.

Conclusion
We proposed a tensor fusion method for multimodal media analysis, obtaining an (M+1)-way tensor to consider the high-order relationships between M input modalities and the output layer. Our modality-based factorization method removes the redundant information in this high-order dependency structure and leads to fewer parameters with minimal loss of information. In addition, the modality-based factorization approach helps to understand the relative quantities of non-redundant information in each modality through investigating sensitivity to modality-specific compression rates. As the proposed compression method leads to a less complicated model, it can be applied as a regularizer, avoiding overfitting.
We have provided experimental results for combining acoustic, text, and visual modalities for three different tasks: sentiment analysis, personality trait recognition, and emotion recognition. We have seen that the modality-based tensor compression approach improves the results in comparison to the simple concatenation method, the tensor fusion method, and tensor fusion using the same factorization rank for all modalities, as proposed in the LMF method. In other words, the proposed method enjoys the same benefits as the tensor fusion method while avoiding its large number of parameters, which leads to a more complex model, needs many training samples, and is more prone to overfitting. We have investigated the effect of the compression rate on single modalities while fixing the other modalities, helping to understand the amount of useful non-redundant information in each modality. Moreover, we have evaluated our method by comparing the results with state-of-the-art methods, achieving a 1% to 4% improvement across multiple measures for the different tasks.
In future work, we will investigate the relation between dataset size and compression rate by applying our method to larger datasets.This will help to understand the trade-off between the model size and available training data, allowing more efficient training and avoiding under-and overfitting.
As data with more and more modalities becomes available, both finding a trade-off between cost and performance and utilizing the available modalities effectively and efficiently will be vital. Exploring compression methods promises to help identify and remove highly redundant modalities.

Figure 2: CMU-MOSI sentiment analysis dataset: Effect of different compression rates on accuracy for single modalities.

Figure 3: POM personality recognition dataset: Effect of different compression rates on accuracy for single modalities.

Figure 4: IEMOCAP Emotion Recognition Dataset: Effect of different compression rates on accuracy for single modalities.
parameters. Concat fusion (CF) leads to a layer size of O(Σ_{i=1}^{M} d_i) and O(Σ_{i=1}^{M} d_i · h) parameters. The tensor fusion approach (TF) leads to a layer size of O(∏_{i=1}^{M} d_i) and O(∏_{i=1}^{M} d_i · h) parameters. The LMF approach (Liu et al., 2018) requires training O(Σ_{i=1}^{M} r · h · d_i) parameters, where r is the rank used for all the modalities.

Table 1: The speaker-independent data splits for training, validation, and test sets.

Table 2: Results for sentiment analysis on CMU-MOSI, emotion recognition on IEMOCAP, and personality trait recognition on POM (CF, TF, and LMF stand for concat, tensor, and low-rank fusion, respectively).