Tensor Fusion Network for Multimodal Sentiment Analysis

Multimodal sentiment analysis is an increasingly popular research area, which extends the conventional language-based definition of sentiment analysis to a multimodal setup where other relevant modalities accompany language. In this paper, we pose the problem of multimodal sentiment analysis as modeling intra-modality and inter-modality dynamics. We introduce a novel model, termed Tensor Fusion Networks, which learns both such dynamics end-to-end. The proposed approach is tailored for the volatile nature of spoken language in online videos as well as accompanying gestures and voice. In the experiments, our model outperforms state-of-the-art approaches for both multimodal and unimodal sentiment analysis.


Introduction
Multimodal sentiment analysis (Morency et al., 2011; Zadeh et al., 2016b) is an increasingly popular area of affective computing research (Poria et al., 2017) that focuses on generalizing text-based sentiment analysis to opinionated videos, where three communicative modalities are present: language (spoken words), visual (gestures), and acoustic (voice).
This generalization is particularly vital to the part of the NLP community dealing with opinion mining and sentiment analysis, since there is a growing trend of sharing opinions in videos instead of text, especially on social media (Facebook, YouTube, etc.). The central challenge in multimodal sentiment analysis is to model the inter-modality dynamics: the interactions between language, visual and acoustic behaviors that change the perception of the expressed sentiment. Figure 1 illustrates these complex inter-modality dynamics. The utterance "This movie is sick" can be ambiguous (either positive or negative) by itself, but if the speaker is also smiling at the same time, then it will be perceived as positive. On the other hand, the same utterance with a frown would be perceived negatively. A person speaking loudly "This movie is sick" would still be ambiguous. These examples illustrate bimodal interactions. Examples of trimodal interactions are shown in Figure 1, where a loud voice increases the sentiment to strongly positive. The complexity of inter-modality dynamics is shown in the second trimodal example, where the utterance "This movie is fair" is still weakly positive, given the strong influence of the word "fair".
Figure 1: Unimodal, bimodal and trimodal interaction in multimodal sentiment analysis.
† means equal contribution
A second challenge in multimodal sentiment analysis is efficiently exploring intra-modality dynamics of a specific modality (unimodal interaction). Intra-modality dynamics are particularly challenging for the language analysis since multimodal sentiment analysis is performed on spoken language. A spoken opinion such as "I think it was alright . . . Hmmm . . . let me think . . . yeah . . . no . . . ok yeah" almost never happens in written text. This volatile nature of spoken opinions, where proper language structure is often ignored, complicates sentiment analysis. Visual and acoustic modalities also contain their own intra-modality dynamics which are expressed through both space and time.
Previous work in multimodal sentiment analysis does not account for both intra-modality and inter-modality dynamics directly; instead, it performs either early fusion (a.k.a. feature-level fusion) or late fusion (a.k.a. decision-level fusion). Early fusion consists of simply concatenating multimodal features, mostly at the input level (Morency et al., 2011; Pérez-Rosas et al., 2013). This fusion approach does not allow the intra-modality dynamics to be efficiently modeled, because inter-modality dynamics can be more complex at the input level and can dominate the learning process or result in overfitting. Late fusion, instead, consists of training unimodal classifiers independently and performing decision voting (Zadeh et al., 2016a). This prevents the model from learning inter-modality dynamics efficiently, since it assumes that simple weighted averaging is a proper fusion approach.
In this paper, we introduce a new model, termed Tensor Fusion Network (TFN), which learns both the intra-modality and inter-modality dynamics end-to-end. Inter-modality dynamics are modeled with a new multimodal fusion approach, named Tensor Fusion, which explicitly aggregates unimodal, bimodal and trimodal interactions. Intra-modality dynamics are modeled through three Modality Embedding Subnetworks, for language, visual and acoustic modalities, respectively.
In our extensive set of experiments, we show (a) that TFN outperforms previous state-of-the-art approaches for multimodal sentiment analysis, (b) the characteristics and capabilities of our Tensor Fusion approach for multimodal sentiment analysis, and (c) that each of our three Modality Embedding Subnetworks (language, visual and acoustic) also outperforms state-of-the-art unimodal sentiment analysis approaches.

Related Work
Multimodal Sentiment Analysis is an emerging research area that integrates verbal and nonverbal behaviors into the detection of user sentiment.
Audio-Visual Emotion Recognition is closely tied to multimodal sentiment analysis (Poria et al., 2017). Both audio and visual features have been shown to be useful in the recognition of emotions (Ghosh et al., 2016a). Using facial expressions and audio cues jointly has been the focus of many recent studies (Glodek et al., 2011;Valstar et al., 2016;Nojavanasghari et al., 2016).
Multimodal Machine Learning has been a growing trend in machine learning research that is closely tied to the studies in this paper. Creative and novel applications of using multiple modalities have been among successful recent research directions in machine learning (You et al., 2016;Donahue et al., 2015;Antol et al., 2015;Specia et al., 2016;Tong et al., 2017).

CMU-MOSI Dataset
The Multimodal Opinion Sentiment Intensity (CMU-MOSI) dataset is an annotated dataset of video opinions from YouTube movie reviews (Zadeh et al., 2016a). Sentiment annotation closely follows the annotation scheme of the Stanford Sentiment Treebank (Socher et al., 2013), where sentiment is annotated on a seven-step Likert scale from very negative to very positive. However, whereas the Stanford Sentiment Treebank is segmented by sentence, the CMU-MOSI dataset is segmented by opinion utterances to accommodate spoken language, where sentence boundaries are not as clear as in text. There are 2199 opinion utterances for 93 distinct speakers in CMU-MOSI. There are on average 23.2 opinion segments in each video, and each segment has an average length of 4.2 seconds. There are a total of 26,295 words in the opinion utterances. These utterances are annotated for sentiment by five Mechanical Turk annotators. The final agreement between the annotators is high, with a Krippendorff's alpha of α = 0.77. Figure 2 shows the distribution of sentiment across different opinions and different opinion sizes. The CMU-MOSI dataset facilitates three prediction tasks, each of which we address in our experiments: 1) Binary Sentiment Classification, 2) Five-Class Sentiment Classification (similar to Stanford Sentiment Treebank fine-grained classification, with the seven-step scale mapped to five classes), and 3) Sentiment Regression in the range [−3, 3]. For sentiment regression, we report Mean Absolute Error (lower is better) and correlation (higher is better) between the model predictions and the regression ground truth.
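The two regression metrics named above can be sketched in a few lines of plain Python. This is only an illustration: the text says "correlation" without naming the coefficient, so the use of Pearson's r here is an assumption, and the function names and toy scores are hypothetical.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between labels and predictions (lower is better)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_correlation(y_true, y_pred):
    """Pearson's r between labels and predictions (higher is better).

    Assumption: the text only says "correlation"; Pearson's r is one common choice.
    """
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


# Toy sentiment scores in [-3, 3]; these are made-up values, not dataset entries.
labels = [-3.0, -1.0, 0.0, 2.0, 3.0]
predictions = [-2.0, -1.5, 0.5, 1.0, 2.5]

mae = mean_absolute_error(labels, predictions)
corr = pearson_correlation(labels, predictions)
print(f"MAE = {mae:.2f}, correlation = {corr:.3f}")
```

Both functions expect equal-length sequences with non-constant values; a constant sequence would make the correlation undefined.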

Tensor Fusion Network
Our proposed TFN consists of three major components: 1) Modality Embedding Subnetworks take as input unimodal features, and output a rich modality embedding. 2) Tensor Fusion Layer explicitly models the unimodal, bimodal and trimodal interactions using a 3-fold Cartesian product from modality embeddings. 3) Sentiment Inference Subnetwork is a network conditioned on the output of the Tensor Fusion Layer and performs sentiment inference. Depending on the task from Section 3 the network output changes to accommodate binary classification, 5-class classification or regression. Input to the TFN is an opinion utterance which includes three modalities of language, visual and acoustic. The following three subsections describe the TFN subnetworks and their inputs in detail.

Modality Embedding Subnetworks
Spoken Language Embedding Subnetwork: Spoken text differs from written text (reviews, tweets) in compositionality and grammar. We revisit the spoken opinion: "I think it was alright . . . Hmmm . . . let me think . . . yeah . . . no . . . ok yeah". This form of opinion rarely occurs in written language, but variants of it are very common in spoken language. The first part conveys the actual message, and the rest is the speaker thinking out loud, eventually agreeing with the first part. The key factor in dealing with this volatile nature of spoken language is to build models that are capable of operating in the presence of unreliable and idiosyncratic speech traits by focusing on the important parts of speech.
Our proposed approach to deal with the challenges of spoken language is to learn a rich representation of spoken words at each word interval and use it as input to a fully connected deep network (Figure 3). This rich representation for the $i$th word contains information from the beginning of the utterance through time, as well as the $i$th word itself. This way, as the model discovers the meaning of the utterance through time, if it encounters unusable information in word $i+1$ and an arbitrary number of words after it, the representation up to word $i$ is not diluted or lost. Also, if the model encounters usable information again, it can recover by embedding it in the long short-term memory (LSTM). The time-dependent encodings are usable by the rest of the pipeline by simply focusing on relevant parts using the nonlinear affine transformation of the time-dependent embeddings, which can act as a dimension-reducing attention mechanism.
To formally define our proposed Spoken Language Embedding Subnetwork ($U_l$), let $l = \{l_1, l_2, l_3, \ldots, l_{T_l};\ l_t \in \mathbb{R}^{300}\}$, where $T_l$ is the number of words in an utterance, be the set of spoken words represented as a sequence of 300-dimensional GloVe word vectors (Pennington et al., 2014). An LSTM network (Hochreiter and Schmidhuber, 1997) with a forget gate (Gers et al., 2000) is used to learn time-dependent language representations $h_t$ for the words. $h_l$ is a matrix of language representations formed from the concatenation of $h_1, h_2, h_3, \ldots, h_{T_l}$:
$$h_l = [h_1; h_2; h_3; \ldots; h_{T_l}]$$
$h_l$ is then used as input to a fully-connected network that generates the language embedding $z_l$:
$$z_l = U_l(l; W_l) \in \mathbb{R}^{128}$$
where $W_l$ is the set of all weights in the $U_l$ network (including $W_{l_d}$, $W_{l_e}$, $W_{l_{fc}}$, and $b_{l_{fc}}$) and $\sigma$ is the sigmoid function.
Visual Embedding Subnetwork: Since opinion videos consist mostly of speakers talking to the audience through a close-up camera, the face is the most important source of visual information.
The speaker's face is detected for each frame (sampled at 30 Hz), and indicators of the seven basic emotions (anger, contempt, disgust, fear, joy, sadness, and surprise) and two advanced emotions (frustration and confusion) (Ekman, 1992) are extracted using the FACET facial expression analysis framework. A set of 20 Facial Action Units (Ekman et al., 1980), indicating detailed muscle movements on the face, is also extracted using FACET. Estimates of head position, head rotation, and 68 facial landmark locations are also extracted per frame using OpenFace.
Let the visual features $\hat{v}_j = [v_j^1, v_j^2, v_j^3, \ldots, v_j^p]$ for frame $j$ of the utterance video contain the set of $p$ visual features, with $T_v$ the total number of video frames in the utterance. We perform mean pooling over the frames to obtain the expected visual features $v = \frac{1}{T_v} \sum_{j=1}^{T_v} \hat{v}_j$. $v$ is then used as input to the Visual Embedding Subnetwork $U_v$. Since the information extracted from videos using FACET is rich, a deep neural network is sufficient to produce meaningful embeddings of the visual modality. We use a deep neural network with three hidden layers of 32 ReLU units and weights $W_v$. Empirically, we observed that making the model deeper or increasing the number of neurons in each layer does not lead to better visual performance. The subnetwork output provides the visual embedding $z_v$:
$$z_v = U_v(v; W_v) \in \mathbb{R}^{32}$$
Acoustic Embedding Subnetwork: For each opinion utterance audio, a set of acoustic features is extracted using the COVAREP acoustic analysis framework (Degottex et al., 2014), including 12 MFCCs, pitch tracking and Voiced/UnVoiced segmenting features (using the additive noise robust Summation of Residual Harmonics (SRH) method (Drugman and Alwan, 2011)), glottal source parameters (estimated by glottal inverse filtering based on GCI synchronous IAIF (Drugman et al., 2012; Alku, 1992; Alku et al., 2002, 1997; Titze and Sundberg, 1992; Childers and Lee, 1991)), peak slope parameters (Degottex et al., 2014), maxima dispersion quotients (MDQ) (Kane and Gobl, 2013), and estimations of the $R_d$ shape parameter of the Liljencrants-Fant (LF) glottal model (Fujisaki and Ljungqvist, 1986). These extracted features capture different characteristics of the human voice and have been shown to be related to emotions (Ghosh et al., 2016b). For each opinion segment with $T_a$ audio frames (sampled at 100 Hz, i.e., every 10 ms), we extract the set of $q$ acoustic features $\hat{a}_j = [a_j^1, a_j^2, a_j^3, \ldots, a_j^q]$ for audio frame $j$ in the utterance.
We perform mean pooling per utterance on these extracted acoustic features to obtain the expected acoustic features $a = \frac{1}{T_a} \sum_{j=1}^{T_a} \hat{a}_j$. Here, $a$ is the input to the Acoustic Embedding Subnetwork $U_a$. Since COVAREP also extracts rich features from audio, a deep neural network is sufficient to model the acoustic modality. Similar to $U_v$, $U_a$ is a network with three layers of 32 ReLU units and weights $W_a$.
Here, we also empirically observed that making the model deeper or increasing the number of neurons in each layer does not lead to better performance. The subnetwork produces the acoustic embedding $z_a$:
$$z_a = U_a(a; W_a) \in \mathbb{R}^{32}$$
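The visual and acoustic subnetworks described above share one recipe: mean-pool the per-frame features, then pass the pooled vector through three layers of 32 ReLU units. The NumPy sketch below illustrates that forward pass under stated assumptions: the feature counts (20 and 12), the frame counts, the He-style random initialisation, and all function names are hypothetical, and the weights are untrained.

```python
import numpy as np

HIDDEN_UNITS = 32  # from the text: 32 ReLU units per layer
NUM_LAYERS = 3     # from the text: three layers


def mean_pool(frames):
    """Collapse per-frame features of shape (T, p) into one vector of shape (p,)."""
    return np.asarray(frames, dtype=float).mean(axis=0)


def init_subnetwork(input_dim, seed):
    """Create random dense layers (the initialisation scheme is an assumption)."""
    rng = np.random.default_rng(seed)
    dims = [input_dim] + [HIDDEN_UNITS] * NUM_LAYERS
    return [
        (rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_in, d_out)), np.zeros(d_out))
        for d_in, d_out in zip(dims[:-1], dims[1:])
    ]


def embed(features, params):
    """Apply each dense layer followed by a ReLU and return the last activation."""
    x = features
    for weights, bias in params:
        x = np.maximum(x @ weights + bias, 0.0)
    return x


rng = np.random.default_rng(0)
visual_frames = rng.normal(size=(90, 20))     # toy sizes: T_v = 90 frames, p = 20
acoustic_frames = rng.normal(size=(300, 12))  # toy sizes: T_a = 300 frames, q = 12

z_v = embed(mean_pool(visual_frames), init_subnetwork(20, seed=1))
z_a = embed(mean_pool(acoustic_frames), init_subnetwork(12, seed=2))
print(z_v.shape, z_a.shape)  # (32,) (32,)
```

Because pooling happens before the network, utterances of any length map to the same fixed-size input, which is what lets a plain feed-forward network handle variable-length video and audio.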

Tensor Fusion Layer
While previous work in multimodal research has used feature concatenation as an approach for multimodal fusion, we aim to build a fusion layer in TFN that disentangles unimodal, bimodal and trimodal dynamics by modeling each of them explicitly. We call this layer Tensor Fusion, which is defined as the following vector field using a three-fold Cartesian product:
$$\left\{ (z_l, z_v, z_a) \;\middle|\; z_l \in \begin{bmatrix} z_l \\ 1 \end{bmatrix},\ z_v \in \begin{bmatrix} z_v \\ 1 \end{bmatrix},\ z_a \in \begin{bmatrix} z_a \\ 1 \end{bmatrix} \right\}$$
The extra constant dimension with value 1 generates the unimodal and bimodal dynamics. Each neural coordinate $(z_l, z_v, z_a)$ can be seen as a 3-D point in the 3-fold Cartesian space defined by the language, visual, and acoustic embedding dimensions $[z_l\ 1]^T$, $[z_v\ 1]^T$, and $[z_a\ 1]^T$. This definition is mathematically equivalent to a differentiable outer product between the language representation $z_l$, the visual representation $z_v$, and the acoustic representation $z_a$:
$$z_m = \begin{bmatrix} z_l \\ 1 \end{bmatrix} \otimes \begin{bmatrix} z_v \\ 1 \end{bmatrix} \otimes \begin{bmatrix} z_a \\ 1 \end{bmatrix}$$
Here $\otimes$ indicates the outer product between vectors, and $z_m \in \mathbb{R}^{129 \times 33 \times 33}$ is the 3D cube of all possible combinations of unimodal embeddings, with seven semantically distinct subregions shown in Figure 4. The first three subregions $z_l$, $z_v$, and $z_a$ are unimodal embeddings from the Modality Embedding Subnetworks, forming unimodal interactions in Tensor Fusion. Three subregions $z_l \otimes z_v$, $z_l \otimes z_a$, and $z_v \otimes z_a$ capture bimodal interactions in Tensor Fusion. Finally, $z_l \otimes z_v \otimes z_a$ captures trimodal interactions. Early fusion, commonly used in multimodal research dealing with language, vision and audio, can be seen as a special case of Tensor Fusion with only unimodal interactions. Since Tensor Fusion is mathematically formed by an outer product, it has no learnable parameters. We empirically observed that although the output tensor is high dimensional, chances of overfitting are low.
We argue that this is due to the fact that the output neurons of Tensor Fusion are easy to interpret and semantically very meaningful (i.e., the manifold that they lie on is not complex but just high dimensional). Thus, it is easy for the subsequent layers of the network to decode the meaningful information.
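A minimal NumPy sketch of the Tensor Fusion operation may make the subregion structure concrete. It assumes embedding sizes of 128, 32 and 32 (so that the fused tensor is 129 × 33 × 33) and appends the constant 1 as the last coordinate of each vector; that position is a choice made for this illustration, and the function name is hypothetical.

```python
import numpy as np


def tensor_fusion(z_l, z_v, z_a):
    """Three-way outer product of the embeddings, each extended with a constant 1."""
    l1 = np.concatenate([z_l, [1.0]])
    v1 = np.concatenate([z_v, [1.0]])
    a1 = np.concatenate([z_a, [1.0]])
    # z_m[i, j, k] = l1[i] * v1[j] * a1[k]; no learnable parameters are involved.
    return np.einsum("i,j,k->ijk", l1, v1, a1)


rng = np.random.default_rng(0)
z_l = rng.normal(size=128)  # stand-in for the language embedding
z_v = rng.normal(size=32)   # stand-in for the visual embedding
z_a = rng.normal(size=32)   # stand-in for the acoustic embedding

z_m = tensor_fusion(z_l, z_v, z_a)
print(z_m.shape)  # (129, 33, 33)

# With the 1 stored last, index -1 selects "constant" along an axis, so:
unimodal_language = z_m[:-1, -1, -1]        # equals z_l
bimodal_language_visual = z_m[:-1, :-1, -1]  # equals the outer product of z_l and z_v
trimodal = z_m[:-1, :-1, :-1]                # all three-way products
```

Slicing with the constant coordinate recovers the unimodal and bimodal subregions exactly, which is why plain concatenation of the three embeddings is contained in the fused tensor.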

Sentiment Inference Subnetwork
After the Tensor Fusion layer, each opinion utterance can be represented as a multimodal tensor $z_m$. We use a fully connected deep neural network called the Sentiment Inference Subnetwork $U_s$, with weights $W_s$, conditioned on $z_m$. The architecture of the network consists of two layers of 128 ReLU activation units connected to a decision layer. The likelihood function of the Sentiment Inference Subnetwork is defined as follows, where $\phi$ is the sentiment prediction:
$$\arg\max_{\phi} p(\phi \mid z_m; W_s) = \arg\max_{\phi} U_s(z_m; W_s)$$
In our experiments, we use three variations of the $U_s$ network. The first network is trained for binary sentiment classification, with a single sigmoid output neuron and binary cross-entropy loss. The second network is designed for five-class sentiment classification, and uses a softmax probability function with categorical cross-entropy loss. The third network uses a single sigmoid output with mean-squared error loss to perform sentiment regression.
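The three output variants can be sketched as one forward pass with a task-specific decision layer. This is an untrained illustration, not the authors' implementation: the tiny 9 × 5 × 5 input tensor stands in for the full fused tensor to keep the weight matrices small, the initialisation is arbitrary, all names are hypothetical, and rescaling the sigmoid output to [−3, 3] for regression is an assumption, since the text does not say how the sigmoid is mapped to the label range.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x):
    shifted = np.exp(x - x.max())  # subtract the max for numerical stability
    return shifted / shifted.sum()


def init_inference_net(input_dim, output_dim, hidden=128, seed=0):
    """Two hidden layers of 128 units plus a decision layer (random, untrained)."""
    rng = np.random.default_rng(seed)
    dims = [input_dim, hidden, hidden, output_dim]
    return [
        (rng.normal(0.0, 0.01, size=(d_in, d_out)), np.zeros(d_out))
        for d_in, d_out in zip(dims[:-1], dims[1:])
    ]


def infer(z_m, params, task):
    """Flatten the fused tensor, apply the ReLU layers, then the task-specific output."""
    x = z_m.reshape(-1)
    for weights, bias in params[:-1]:
        x = relu(x @ weights + bias)
    weights, bias = params[-1]
    logits = x @ weights + bias
    if task == "binary":
        return float(sigmoid(logits)[0])        # probability of positive sentiment
    if task == "five_class":
        return softmax(logits)                  # distribution over five classes
    if task == "regression":
        # Assumption: rescale the sigmoid from (0, 1) to the label range (-3, 3).
        return float(6.0 * sigmoid(logits)[0] - 3.0)
    raise ValueError(f"unknown task: {task}")


toy_z_m = np.random.default_rng(1).normal(size=(9, 5, 5))  # small stand-in tensor
p_positive = infer(toy_z_m, init_inference_net(toy_z_m.size, 1), "binary")
class_probs = infer(toy_z_m, init_inference_net(toy_z_m.size, 5), "five_class")
score = infer(toy_z_m, init_inference_net(toy_z_m.size, 1), "regression")
```

With the full 129 × 33 × 33 tensor, the first weight matrix alone has 140,481 × 128 entries, so nearly all of the subnetwork's parameters sit in that first layer.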

Experiments
In this paper, we devise three sets of experiments, each addressing a different research question:
Experiment 1: We compare our TFN with previous state-of-the-art approaches in multimodal sentiment analysis.
Experiment 2: We study the importance of the TFN subtensors and the impact of each individual modality (see Figure 4). We also compare with the commonly-used early fusion approach.
Experiment 3: We compare the performance of our three modality-specific networks (language, visual and acoustic) with state-of-the-art unimodal approaches.
Section 5.4 describes our experimental methodology, which is kept constant across all experiments. Section 6 discusses our results in more detail with a qualitative analysis.
Table 1: Comparison with state-of-the-art approaches for multimodal sentiment analysis. TFN outperforms both neural and non-neural approaches, as shown by $\Delta^{SOTA}$.

E1: Multimodal Sentiment Analysis
In this section, we compare the performance of the TFN model with previously proposed multimodal sentiment analysis models. We compare to the following baselines:
C-MKL The Convolutional MKL-based model is a multimodal sentiment classification model which uses a CNN to extract textual features and uses multiple kernel learning for sentiment analysis. It is the current state of the art (SOTA) on CMU-MOSI.
SAL-CNN Select-Additive Learning is a multimodal sentiment analysis model that attempts to prevent identity-dependent information from being learned in a deep neural network. We retrain the model for 5-fold cross-validation using the code provided by the authors on GitHub.
SVM-MD (Zadeh et al., 2016b) is an SVM model trained on multimodal features using early fusion. The models used in (Morency et al., 2011) and (Pérez-Rosas et al., 2013) similarly use an SVM on concatenated multimodal features. We also present the results of a Random Forest, RF-MD, to compare to another non-neural approach.
The results of the first experiment are reported in Table 1. TFN outperforms previously proposed neural and non-neural approaches. This difference is especially visible in the case of 5-class classification.

E2: Tensor Fusion Evaluation
Table 2 shows the results of our ablation study. The first three rows show the performance of each modality when no inter-modality dynamics are modeled. From this first experiment, we observe that the language modality is the most predictive.
Table 2: Comparison of TFN with its subtensor variants. All the unimodal, bimodal and trimodal subtensors are important. TFN also outperforms early fusion.
As a second set of ablation experiments, we test our TFN approach when only the bimodal subtensors are used ($\text{TFN}_{bimodal}$) or when only the trimodal subtensor is used ($\text{TFN}_{trimodal}$). We observe that bimodal subtensors are more informative when used without other subtensors. The most interesting comparison is between our full TFN model and a variant ($\text{TFN}_{notrimodal}$) where the trimodal subtensor is removed (but all the unimodal and bimodal subtensors are present). We observe a big improvement for the full TFN model, confirming the importance of the trimodal dynamics and the need for all components of the full tensor.
We also perform a comparison with the early fusion approach ($\text{TFN}_{early}$) by simply concatenating all three modality embeddings $\langle z_l, z_a, z_v \rangle$ and passing the result directly as input to $U_s$. This approach is depicted on the left side of Figure 4. Looking at the results in Table 2, we see that our TFN approach outperforms the early fusion approach².
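One way to picture the ablation variants is as different selections of subregions from the fused tensor. The sketch below is a hypothetical reading, not the authors' procedure: it assumes each variant feeds the flattened, concatenated subregions to the inference network, that the constant 1 is stored as the last coordinate, and that the embeddings have sizes 128, 32 and 32. All names are illustrative.

```python
import numpy as np


def fuse(z_l, z_v, z_a):
    """Outer product of the three embeddings, each extended with a trailing 1."""
    extended = [np.append(z, 1.0) for z in (z_l, z_v, z_a)]
    return np.einsum("i,j,k->ijk", *extended)


def subtensors(z_m):
    """Slice the seven subregions out of the fused tensor (index -1 is the constant)."""
    return {
        "l": z_m[:-1, -1, -1],
        "v": z_m[-1, :-1, -1],
        "a": z_m[-1, -1, :-1],
        "lv": z_m[:-1, :-1, -1],
        "la": z_m[:-1, -1, :-1],
        "va": z_m[-1, :-1, :-1],
        "lva": z_m[:-1, :-1, :-1],
    }


# Which subregions each variant keeps (a hypothetical mapping for illustration).
VARIANTS = {
    "early": ["l", "v", "a"],
    "bimodal": ["lv", "la", "va"],
    "trimodal": ["lva"],
    "notrimodal": ["l", "v", "a", "lv", "la", "va"],
    "full": ["l", "v", "a", "lv", "la", "va", "lva"],
}


def variant_input(z_m, name):
    """Flatten and concatenate the subregions selected for one variant."""
    parts = subtensors(z_m)
    return np.concatenate([parts[key].ravel() for key in VARIANTS[name]])


rng = np.random.default_rng(0)
z_m = fuse(rng.normal(size=128), rng.normal(size=32), rng.normal(size=32))
sizes = {name: variant_input(z_m, name).size for name in VARIANTS}
print(sizes)
```

Under these assumptions the early fusion input has 192 values, the bimodal regions add 9,216, and the trimodal region alone accounts for 131,072 of the 140,480 non-constant entries.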

E3: Modality Embedding Subnetworks Evaluation
In this experiment, we compare the performance of our Modality Embedding Subnetworks with state-of-the-art approaches for language-based, visual-based and acoustic-based sentiment analysis.

Language Sentiment Analysis
We selected the following state-of-the-art approaches to include variety in their techniques, based on dependency parsing (RNTN), distributional representation of text (DAN), and convolutional approaches (DynamicCNN). When possible, we retrain them on the CMU-MOSI dataset (performances of the original pre-trained models are shown in parentheses in Table 3) and compare them to our language-only $\text{TFN}_{language}$.
Table 3: Language Sentiment Analysis. Comparison with state-of-the-art approaches for language sentiment analysis. $\Delta^{SOTA}_{language}$ shows improvement.
² We also performed other comparisons with variants of the early fusion model $\text{TFN}_{early}$ where we increased the number of parameters and neurons to replicate the numbers from our TFN model. In all cases, the performances were similar to $\text{TFN}_{early}$ (and lower than our TFN model). Because of space constraints, we could not include them in this paper.
RNTN (Socher et al., 2013) The Recursive Neural Tensor Network is among the most well-known sentiment analysis methods proposed for both binary and multi-class sentiment analysis that uses dependency structure.
DAN (Iyyer et al., 2015) The Deep Average Network approach is a simple but efficient sentiment analysis model that uses information only from distributional representation of the words and not from the compositionality of the sentences.
DynamicCNN (Kalchbrenner et al., 2014) DynamicCNN is among the state-of-the-art models in text-based sentiment analysis which uses a convolutional architecture adopted for the semantic modeling of sentences.
CMKL-L, SAL-CNN-L and SVM-MD-L are the multimodal models from Section 5.1 using only the language modality.
Results in Table 3 show that our model using only the language modality outperforms state-of-the-art approaches on the CMU-MOSI dataset. While previous models are well-studied and suitable for sentiment analysis in written language, they underperform in modeling the sentiment in spoken language. We suspect that this underperformance has three causes: RNTN and similar approaches rely heavily on dependency structure, which may not be present in spoken language; DAN and similar sentence embedding approaches can easily be diluted by words that may not relate directly to sentiment or meaning; D-CNN and similar convolutional approaches rely on spatial proximity of related words, which may not always be present in spoken language.

Visual Sentiment Analysis
We compare the performance of our models using visual information ($\text{TFN}_{visual}$) with the following well-known approaches in visual sentiment analysis and emotion recognition (retrained for sentiment analysis):
3DCNN (Byeon and Kwak, 2014) A network using a 3D CNN is trained on the face of the speaker. The face of the speaker is extracted every 6 frames, resized to 64 × 64, and used as the input to the proposed network.
CNN-LSTM (Ebrahimi Kahou et al., 2015) is a recurrent model that at each time step performs convolutions over the facial region and feeds the output to an LSTM. Face processing is similar to 3DCNN.
LSTM-FA Similar to both baselines above, information extracted by FACET every 6 frames is used as input to an LSTM with a memory dimension of 100 neurons.
The results in Table 5 show that $U_v$ is able to outperform state-of-the-art approaches on visual sentiment analysis.

Acoustic Sentiment Analysis
We compare the performance of our models using acoustic information ($\text{TFN}_{acoustic}$) with the following well-known approaches in audio sentiment analysis and emotion recognition (retrained for sentiment analysis):
HL-RNN (Lee and Tashev, 2015) uses an LSTM on high-level audio features. We use the same features extracted for $U_a$, averaged over time slices of every 200 intervals.
Adieu-Net (Trigeorgis et al., 2016) is an end-to-end approach for emotion recognition in audio that directly uses PCM features.
SER-LSTM (Lim et al., 2016) is a model that uses recurrent neural networks on top of convolution operations on spectrogram of audio.
SAL-CNN-A, SVM-MD-A, CMKL-A, RF-A use only acoustic modality in multimodal baselines from Section 5.1.

Methodology
All the models in this paper are tested using the five-fold cross-validation proposed by CMU-MOSI (Zadeh et al., 2016a). All of our experiments are performed independent of speaker identity: no speaker is shared between train and test sets, so that the model generalizes to unseen speakers in the real world. The best hyperparameters are chosen using grid search based on model performance on a validation set (using the last 4 videos in the train fold). The TFN model is trained using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 5e-4. The $U_v$, $U_a$ and $U_s$ subnetworks are regularized using dropout on all hidden layers with p = 0.15 and an L2 norm coefficient of 0.01. The train, test and validation folds are exactly the same for all baselines.
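The splitting protocol can be sketched as follows, assuming splits are made at the video level and that each speaker appears in exactly one video, so that holding out videos also holds out speakers. The round-robin fold assignment is only a stand-in for the predefined CMU-MOSI folds, which are not reproduced here, and the video identifiers and function names are hypothetical.

```python
def assign_folds(video_ids, n_folds=5):
    """Round-robin assignment of videos to folds (stand-in for the official folds)."""
    return [video_ids[k::n_folds] for k in range(n_folds)]


def split_for_fold(video_ids, fold_index, n_folds=5, n_valid_videos=4):
    """Return (train, valid, test) video lists for one cross-validation fold."""
    test = assign_folds(video_ids, n_folds)[fold_index]
    held_out = set(test)
    train_all = [v for v in video_ids if v not in held_out]
    # From the text: the validation set is the last 4 videos of the train fold.
    valid = train_all[-n_valid_videos:]
    train = train_all[:-n_valid_videos]
    return train, valid, test


videos = [f"video_{i:02d}" for i in range(20)]  # toy identifiers
for k in range(5):
    train, valid, test = split_for_fold(videos, k)
    print(k, len(train), len(valid), len(test))  # 12 4 4 for every fold
```

Splitting by video rather than by utterance is what keeps all segments from one recording on the same side of the train/test boundary.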

Qualitative Analysis
We analyze the impact of our proposed TFN multimodal fusion approach by comparing it with the early fusion approach $\text{TFN}_{early}$ and the three unimodal models. Table 6 shows examples taken from the CMU-MOSI dataset. Each example is described with the spoken words as well as the acoustic and visual behaviors. The sentiment predictions and the ground truth labels range between strongly negative (-3) and strongly positive (+3).
Table 6: Examples from the CMU-MOSI dataset. The ground truth sentiment labels are between strongly negative (-3) and strongly positive (+3). For each example, we show the prediction output of the three unimodal models ($\text{TFN}_{acoustic}$, $\text{TFN}_{visual}$ and $\text{TFN}_{language}$), the early fusion model $\text{TFN}_{early}$ and our proposed TFN approach. $\text{TFN}_{early}$ seems to be mostly replicating the language modality, while our TFN approach successfully integrates inter-modality dynamics to predict the sentiment level.
As a first general observation, the early fusion model $\text{TFN}_{early}$ shows a strong preference for the language modality and seems to neglect the inter-modality dynamics. We can see this trend by comparing it with the language unimodal model $\text{TFN}_{language}$. In comparison, our TFN approach seems to capture more complex interactions through bimodal and trimodal dynamics and thus performs better. Specifically, in the first example, the utterance is weakly negative, with the speaker referring to the lack of funny jokes in the movie. This example contains a bimodal interaction where the visual modality shows a negative expression (frowning), which is correctly captured by our TFN approach.
In the second example, the spoken words are ambiguous since the model has no clue what a B is except a token, but the acoustic and visual modalities bring complementary evidence. Our TFN approach correctly identifies this trimodal interaction and predicts a positive sentiment. The third example is interesting since it shows an interaction where language predicts a positive sentiment but the strong negative visual behaviors bring the final prediction of our TFN approach almost to a neutral sentiment. The fourth example shows how the acoustic modality also influences our TFN predictions.

Conclusion
We introduced a new end-to-end fusion method for sentiment analysis which explicitly represents unimodal, bimodal, and trimodal interactions between behaviors. Our experiments on the publicly available CMU-MOSI dataset produced state-of-the-art performance when compared against both neural and non-neural multimodal approaches. Furthermore, our approach brings state-of-the-art results for language-only, visual-only and acoustic-only sentiment analysis on CMU-MOSI.