Seq2Seq2Sentiment: Multimodal Sequence to Sequence Models for Sentiment Analysis

Multimodal machine learning is a core research area spanning the language, visual, and acoustic modalities. The central challenge in multimodal learning is learning representations that can process and relate information from multiple modalities. In this paper, we propose two methods for unsupervised learning of joint multimodal representations using sequence to sequence (Seq2Seq) methods: a Seq2Seq Modality Translation Model and a Hierarchical Seq2Seq Modality Translation Model. We also explore several variations on the multimodal inputs and outputs of these Seq2Seq models. Our experiments on multimodal sentiment analysis using the CMU-MOSI dataset indicate that our methods learn informative multimodal representations that outperform the baselines on multimodal sentiment analysis, particularly in the bimodal case, where our model improves the F1 score by twelve points. We also discuss future directions for multimodal Seq2Seq methods.


Introduction
Sentiment analysis, which involves identifying a speaker's sentiment, is an open research problem. The majority of work in this field has focused on unimodal methodologies, primarily textual analysis, where investigation was limited to identifying the usage of words in positive and negative scenarios. However, unimodal textual sentiment analysis through the usage of words, phrases, and their interdependencies was found to be insufficient for extracting affective content from textual opinions (Rosas et al., 2013). As a result, there has been a recent push towards using statistical methods to extract, from the video and audio modalities, additional behavioral cues not present in the language modality. This research field is known as multimodal sentiment analysis; it extends the conventional text-based definition of sentiment analysis to a multimodal setup where different modalities contribute to modeling the sentiment of the speaker. For example, (Kaushik et al., 2013) explores modalities such as audio, while (Wöllmer et al., 2013) explores a multimodal approach to predicting sentiment. This push has been further bolstered by the advent of multimodal social media platforms, such as YouTube, Facebook, and VideoLectures, which are used to express personal opinions on a worldwide scale. As a result, several multimodal datasets, such as CMU-MOSI (Zadeh et al., 2016) and later CMU-MOSEI (Zadeh et al., 2018c), ICT-MMMO (Wöllmer et al., 2013), and YouTube (Morency et al., 2011), take advantage of the abundance of multimodal data on the Internet. At the same time, neural network based multimodal models have been proposed that are highly effective at learning multimodal representations for multimodal sentiment analysis (Zadeh et al., 2018a,b).
Recent progress has been limited to supervised learning using labeled data, and does not take advantage of the abundant unlabeled data on the Internet. Our work addresses this gap through unsupervised representation learning. We attempt to learn a multimodal representation of our data in a structured paradigm and explore whether a joint multimodal representation trained via unsupervised learning can improve performance on multimodal sentiment analysis. While representation learning has been an area of rapid research in recent years, there has been limited work that explores the multimodal setting. To this end, we propose two methods, a Seq2Seq Modality Translation Model and a Hierarchical Seq2Seq Modality Translation Model, for unsupervised learning of multimodal representations. Our results show that multimodal representations learned with our Seq2Seq modality translation method outperform the baselines and achieve improved performance on multimodal sentiment analysis.

Related Work
In the past, approaches to text-based emotion and sentiment recognition relied mainly on rule-based techniques, bag of words (BoW) modeling or the SNoW architecture (Chaumartin, 2007) with a large sentiment or emotion lexicon (Mishne et al., 2005), or statistical approaches that assume the availability of a large dataset annotated with polarity or emotion labels.
Multimodal sentiment analysis has gained considerable research interest over the last few years. Probably the most challenging task in multimodal sentiment analysis is finding a joint representation of multiple modalities. This problem has been approached in a number of ways; earlier works such as (Ngiam et al., 2011; Lazaridou et al., 2015; Kiros et al., 2014) made early progress in this direction.
Recently, more advanced neural network models have been proposed to learn multimodal representations. The Multi-View LSTM (MV-LSTM) (Rajagopalan et al., 2016) exploits fusion and temporal relationships by partitioning memory cells and gates into multiple regions corresponding to different views. The Tensor Fusion Network presents an efficient Cartesian-product-based method that takes into account intramodal and intermodal relations between the video, audio, and text of reviews to create a novel feature representation for each utterance. The Gated Multimodal Embedding model uses reinforcement learning to train an on-off switch that decides what values the video and audio components take: noisy modalities are turned off while clean modalities pass through. (Zadeh et al., 2018a) utilizes external multimodal memory mechanisms to store multimodal information and create multimodal representations through time, and (Zadeh et al., 2018b) proposed multiple attention coefficient assignments to represent multiple cross-modal interactions. However, all of these methods are purely supervised approaches to multimodal sentiment analysis and do not leverage unsupervised data and generative approaches for learning multimodal representations.
Besides supervised approaches, generative methods based on generative adversarial networks (GANs) (Goodfellow et al., 2014) have attracted significant interest for learning the joint distribution between two or more modalities (Donahue et al., 2016; Gan et al., 2017). Another way to handle multimodal problems is to view them as conditional problems that learn to map one modality to another (Mirza and Osindero, 2014; Kingma et al., 2014; Pandey and Dukkipati, 2017). Our work can be viewed as an extension of the conditional approach, as both use unsupervised learning; however, ours differs in that it takes into account the sequential dependency within each modality.
Finally, attention-based layers have proved to be effective tools for boosting the performance of neural network models, for example in neural machine translation (Klein et al.; Bahdanau et al., 2014; Luong et al., 2015), speech recognition (Sriram et al., 2017), and image captioning (Xu et al., 2015). Our work also employs this mechanism to better handle long-term dependencies in variable-length sequences.

Problem Formulation
Consider a dataset with data X = (X_text, X_audio, X_video), where X_text, X_audio, and X_video stand for the text, audio, and video modality inputs, respectively.
Typically such a dataset is indexed by videos: with n videos, we have X = (X_1, X_2, ..., X_n) and corresponding labels Y = (Y_1, Y_2, ..., Y_n). To simplify the problem, we align the input based on words. Researchers often segment each video into smaller clips that last a couple of seconds rather than minutes, as in prior work. After such alignment and segmentation, the inputs of each modality have equal length within a video. For example, for the i-th video we have X_text^i = (w_i^(1), w_i^(2), ..., w_i^(T_i)), where w_i^(t) stands for the t-th word and T_i is the length of the i-th video's text input, a.k.a. its number of time steps; note that different videos have different numbers of time steps. Similarly, for this video we have an audio input sequence X_audio^i = (a_i^(1), a_i^(2), ..., a_i^(T_i)) and a video input sequence X_video^i = (v_i^(1), v_i^(2), ..., v_i^(T_i)). In this work we tackle the input learning problem: we want to learn an embedding representation over the text, audio, and video modalities, X_i = f(X_text^i, X_audio^i, X_video^i). In our baseline model, the function f is simply concatenation at the time-step level: x_i^(t) = [w_i^(t); a_i^(t); v_i^(t)]. In our proposed method, we learn X_i with a Seq2Seq model: instead of computing an embedding for each time step, we embed the whole sequence. For simplicity, in what follows we omit the video-segment index i, so the input becomes X = (x_1, x_2, ..., x_T) and the labels become Y = (y_1, y_2, ..., y_T).
We use a Recurrent Neural Network (RNN) such as an LSTM (Hochreiter and Schmidhuber, 1997) or GRU (Chung et al., 2015) to model this sequence. In detail, the RNN has a stack of K hidden layers h = (h^1, h^2, ..., h^K), each containing D hidden neurons. Denoting weights by W and biases by b, the first layer, which connects directly to the input, is h_t^1 = H(W^1 x_t + b^1, h_{t-1}^1), where H is the RNN cell function (for an LSTM, H comprises the input, forget, and output gates and the cell state). At hidden layer k ∈ [2, K], h_t^k = H(W^k h_t^{k-1} + b^k, h_{t-1}^k). Optionally, we apply a soft attention mechanism on top of the last hidden layer h^K, with a weight W_α shared over the T time steps, to obtain the attention output α = Σ_{t=1}^{T} a_t h_t^K, where a_t = softmax_t(W_α h_t^K).

Figure 1: Seq2Seq model with input (X_1, ..., X_N) and output (Y_1, ..., Y_T). Seq2Seq makes use of the whole input sequence in the decoding phase for every token Y_i. If the attention model (yellow) is used, then for each Y_i the decoder learns a separate weight with respect to each input token, determining which tokens it should "attend" to more.
With attention, the last hidden layer's output becomes α (otherwise we take h_T^K), and the last output layer produces the regression score: score = W_y α + b_y. Finally, we calculate the loss with respect to the labels; as in prior work, we choose Mean Absolute Error (MAE) as our loss and train with stochastic gradient descent.
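As a concrete illustration, the concatenation baseline f and the attention-pooled regression head described above can be sketched in NumPy. The feature widths and random weights below are stand-ins of our own choosing; a trained model would learn the weights and use real RNN states in place of h_K.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 300-d text, 74-d audio, 47-d video features
# for one word-aligned segment of T = 12 time steps.
T, d_text, d_audio, d_video = 12, 300, 74, 47
X_text = rng.normal(size=(T, d_text))
X_audio = rng.normal(size=(T, d_audio))
X_video = rng.normal(size=(T, d_video))

# Baseline f: concatenation at the time-step level, x_t = [w_t; a_t; v_t].
X = np.concatenate([X_text, X_audio, X_video], axis=1)   # (T, 421)

# Soft attention over the T steps of a (stand-in) last hidden layer h_K.
D = 64
h_K = rng.normal(size=(T, D))          # stand-in for the top RNN layer's outputs
W_alpha = rng.normal(size=(D,)) * 0.1  # attention weight shared across time steps
scores = h_K @ W_alpha                 # (T,) unnormalized attention scores
a = np.exp(scores - scores.max())
a /= a.sum()                           # softmax over time
alpha = a @ h_K                        # (D,) attention-pooled summary

# Regression head: score = W_y alpha + b_y (trained with MAE in the paper).
W_y, b_y = rng.normal(size=(D,)) * 0.1, 0.0
sentiment = float(W_y @ alpha + b_y)
print(X.shape, alpha.shape, sentiment)
```

The attention weights a sum to one, so alpha is a convex combination of the per-step hidden states rather than just the final state.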

Proposed Approach
In this section we describe the approaches we take to improve affect recognition through learning multimodal representations.

Seq2Seq Modality Translation Model
The Seq2Seq Modality Translation Model aims to learn multimodal representations that can be used for discriminative tasks. While Seq2Seq models have predominantly been used for machine translation (Bahdanau et al., 2014; Luong et al., 2015), we extend their usage to the realm of multimodal machine learning, where we translate one modality to another, or translate a joint representation to another single or joint representation. To do so, we propose a Seq2Seq modality translation model with an attention mechanism.

Figure 2: The green boxes denote the joint representation learned by Seq2Seq models: the joint representation of modalities A and B is fed into another Seq2Seq model, which in turn learns the joint representation of AB and another modality C. Finally, the joint representation of ABC is fed into an RNN to predict sentiment.

The embedded representation encodes information from both modalities involved. As a result, this representation can be used for tasks that require a joint representation across multiple modalities. The details are given in Algorithm 1.
Algorithm 1 Seq2Seq Modality Translation (X and Y are two modalities; S is the sentiment sequence)
Phase 1: Train Seq2Seq
  Ỹ, E ← Seq2Seq(X, Y)
  loss ← cross_entropy(Ỹ, Y)
  Backpropagate to update parameters
Phase 2: Sentiment Regression
  R ← E
  score ← Regression(R)
  loss ← MAE(score, S)
  Backpropagate to update parameters

Formally, the Seq2Seq Modality Translation Model consists of two separate steps, encoding and decoding, each typically carried out by a single RNN or a stack of them. The model accepts variable-length inputs X and Y, and the network is trained to maximize the conditional translation probability p(Y | X). The encoder encodes the whole input sequence X into an embedded representation; the hidden state at each time step depends on the previous hidden state and the input sequence (refer to Figure 1): h_t = H(h_{t-1}, x_t). The encoder's output is the final hidden state of the encoding RNN, E = h_N, where N is the length of the input sequence X. The decoder decodes one token Y_i at a time, conditioned on E and all previously decoded tokens: p(Y_i | E, Y_1, ..., Y_{i-1}). The Seq2Seq training target is to find the translation sequence as close to the ground truth Y as possible: Ŷ = argmax_Y p(Y | X). While there are other decoding strategies such as random sampling or greedy search (Neubig, 2017), we use the traditional beam search approach (Sutskever et al., 2014).
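The two phases of Algorithm 1 can be sketched as follows, with simple tanh RNN cells standing in for the paper's LSTM/GRU encoder and decoder, and untrained random weights; only the shapes and data flow are meant to be faithful.

```python
import numpy as np

rng = np.random.default_rng(1)

def rnn_encode(X, W_x, W_h, b):
    """Simple tanh RNN standing in for the LSTM/GRU encoder."""
    h = np.zeros(W_h.shape[0])
    for x_t in X:                       # h_t depends on h_{t-1} and x_t
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h                            # E: final hidden state = embedding

def rnn_decode(E, W_h, W_o, steps):
    """Decoder sketch: every step conditions on E through the state."""
    h, outs = E.copy(), []
    for _ in range(steps):
        h = np.tanh(W_h @ h)
        outs.append(W_o @ h)            # predicted token of the target modality
    return np.stack(outs)

d_x, d_y, D, T = 74, 47, 32, 10         # e.g. audio -> video translation
X = rng.normal(size=(T, d_x))           # source modality sequence
W_x = rng.normal(size=(D, d_x)) * 0.1
W_h = rng.normal(size=(D, D)) * 0.1

# Phase 1: translate X to Y; the embedding E = embed(X+Y) is what we keep.
E = rnn_encode(X, W_x, W_h, np.zeros(D))
Y_hat = rnn_decode(E, rng.normal(size=(D, D)) * 0.1,
                   rng.normal(size=(d_y, D)) * 0.1, T)

# Phase 2: feed E into a regressor for the sentiment score (MAE-trained).
w_reg = rng.normal(size=(D,)) * 0.1
score = float(w_reg @ E)
print(E.shape, Y_hat.shape, score)
```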

Hierarchical Seq2Seq Modality Translation Model
The Seq2Seq Modality Translation Model only learns a joint representation of two modalities X and Y. While this is a strong starting point, we believe an approach that captures the joint interactions among all modalities X, Y, Z is more effective at modeling the full distribution of the multimodal data and therefore more useful for regression or classification. In response, we propose the Hierarchical Seq2Seq Modality Translation Model, which learns a joint multimodal representation. Once the Seq2Seq Modality Translation Model is trained for two modalities X and Y, we obtain the intermediate representation E_XY, the joint representation of (X, Y). E_XY is in turn treated as the input sequence for the next Seq2Seq Modality Translation Model, which decodes the third modality Z. The final multimodal representation E_XYZ represents the joint representation of (X, Y, Z). The Hierarchical Seq2Seq Modality Translation Model is described in Algorithm 2.
Algorithm 2 Hierarchical Seq2Seq Modality Translation (X, Y, Z are three modalities; S is the sentiment sequence)
Phase 1: Train Seq2Seq for 2 modalities
  Ỹ, E_XY ← Seq2Seq(X, Y)
  loss ← cross_entropy(Ỹ, Y)
  Backpropagate to update parameters
Phase 2: Train Seq2Seq for 3 modalities
  Z̃, E_XYZ ← Seq2Seq(E_XY, Z)
  loss ← cross_entropy(Z̃, Z)
  Backpropagate to update parameters
Phase 3: Sentiment Regression
  score ← Regression(E_XYZ)
  loss ← MAE(score, S)
  Backpropagate to update parameters

This strategy is also illustrated in Figure 2. The output of the second Seq2Seq model is the input to the final RNN model, which is trained with the MAE loss function to predict regression sentiment scores.
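A minimal sketch of the hierarchical chaining follows, under the assumption that the intermediate representation E_XY is the first encoder's hidden-state sequence; all weights are random stand-ins for trained Seq2Seq models.

```python
import numpy as np

rng = np.random.default_rng(2)

def encode_seq(X, W_x, W_h):
    """tanh-RNN encoder returning the full hidden-state sequence (a sketch
    standing in for a trained Seq2Seq encoder's intermediate representation)."""
    D = W_h.shape[0]
    h, H = np.zeros(D), []
    for x_t in X:
        h = np.tanh(W_x @ x_t + W_h @ h)
        H.append(h)
    return np.stack(H)                  # (T, D)

T, d_text, D = 10, 300, 32
X_text = rng.normal(size=(T, d_text))

# Phase 1: Seq2Seq from text to audio; keep the hidden sequence as E_XY.
E_XY = encode_seq(X_text, rng.normal(size=(D, d_text)) * 0.02,
                  rng.normal(size=(D, D)) * 0.1)

# Phase 2: a second Seq2Seq takes E_XY as its input sequence and decodes
# video; its final encoder state E_XYZ is the trimodal representation.
E_XYZ_seq = encode_seq(E_XY, rng.normal(size=(D, D)) * 0.1,
                       rng.normal(size=(D, D)) * 0.1)
E_XYZ = E_XYZ_seq[-1]

# Phase 3: sentiment regression on E_XYZ (MAE loss in the paper).
w_reg = rng.normal(size=(D,)) * 0.1
score = float(w_reg @ E_XYZ)
print(E_XY.shape, E_XYZ.shape)
```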

Experimental Setup
We explored the applications of this model on the CMU-MOSI dataset (Zadeh et al., 2016). We implemented a baseline LSTM model following prior work. Our implementation uses 66.67% of the data for training, from which we take a 15.15% held-out set for validation; the remaining 33.33% is used for testing. Finally, we evaluated our proposed model against the baseline results generated by this implementation, comparing results across the various multimodal configurations and measuring performance with precision, recall, and F1 scores.

Dataset and Input Modalities
The dataset that we use to explore applications of our model is the CMU Multimodal Opinion-level Sentiment Intensity dataset (CMU-MOSI). The dataset contains video, audio, and transcriptions of 89 different speakers in 93 different videos, divided into 2199 separate opinion segments. Each segment has an associated sentiment label in the range from -3 to 3: the low end of the spectrum (-3) indicates strongly negative sentiment, whereas the high end indicates strongly positive sentiment (+3), and a rating of 0 indicates neutral sentiment. The CMU-MOSI dataset is currently the subject of much research (Zadeh et al., 2018a,b); the current state of the art, achieved with a context-aware model across entire videos, is an F1 score of 80.3. The state of the art using only individual segments is achieved by (Zadeh et al., 2018a) with an F1 score of 77.3.
With respect to the raw features given as inputs to our model, we perform feature extraction in the same manner as prior work. In the text domain, pretrained 300-dimensional GloVe embeddings (Pennington et al., 2014) are used to represent the textual tokens. In the audio domain, low-level acoustic features, including 12 Mel-frequency cepstral coefficients (MFCCs), pitch tracking and voiced/unvoiced segmenting features (Drugman and Alwan, 2011), glottal source parameters (Childers and Lee, 1991; Drugman et al., 2012; Alku, 1992; Alku et al., 1997, 2002), peak slope parameters, and maxima dispersion quotients (Kane and Gobl, 2013), were extracted automatically using COVAREP (Degottex et al., 2014). Finally, in the video domain, Facet (iMotions, 2017) is used to extract per-frame basic and advanced emotions and facial action units as indicators of facial muscle movement (Ekman, 1992; Ekman et al., 1980). In situations where the same time alignment between different modalities is required, we choose the granularity of the input to be at the level of words. The words are aligned with audio using P2FA (Yuan and Liberman, 2008) to obtain their exact utterance times; the visual and acoustic modalities are then aligned to words using these utterance times.
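The word-level alignment step can be illustrated with a toy example: acoustic or visual frames are averaged over each word's utterance interval (as produced by a forced aligner such as P2FA). The frame rate, feature width, and word intervals below are made up for illustration.

```python
import numpy as np

def align_to_words(frames, frame_times, word_intervals):
    """frames: (N, d) per-frame features; word_intervals: [(start, end), ...].
    Returns one averaged feature vector per word."""
    aligned = []
    for start, end in word_intervals:
        mask = (frame_times >= start) & (frame_times < end)
        # Average all frames within the word; zeros if none fall inside.
        aligned.append(frames[mask].mean(axis=0) if mask.any()
                       else np.zeros(frames.shape[1]))
    return np.stack(aligned)

frame_times = np.arange(0.0, 2.0, 0.1)          # 20 frames at 10 fps
frames = np.tile(frame_times[:, None], (1, 3))  # 3-d features = frame time, easy to check
words = [(0.0, 0.5), (0.5, 1.2), (1.2, 2.0)]    # hypothetical word intervals (s)

aligned = align_to_words(frames, frame_times, words)
print(aligned)
```

Because each feature equals its frame time here, the output rows are simply the mean times of the frames falling inside each word interval.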

Baselines
We use an LSTM model implemented in three different ways, one for each grouping of the modalities. First, in the unimodal domain, we run sentiment regression based solely on one modality; second, in the bimodal domain, the input is the concatenation of a pair of modalities; finally, in the trimodal domain, we concatenate all three modalities. This baseline not only serves as a benchmark for comparing our results but also acts as a starting point for our code development: any improvements in our metrics are strictly a result of the representations we have learned, not of structural changes in our model.
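The three baseline groupings can be enumerated mechanically; the feature widths below (300-d text, 74-d audio, 47-d video) match the features described earlier, while the helper names are ours.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
T = 10
# Illustrative per-time-step feature matrices for the three modalities.
feats = {"T": rng.normal(size=(T, 300)),   # text
         "A": rng.normal(size=(T, 74)),    # audio
         "V": rng.normal(size=(T, 47))}    # video

# The baseline LSTM is run on every grouping: each single modality, each
# pair concatenated per time step, and all three concatenated.
groupings = [c for r in (1, 2, 3) for c in combinations("TAV", r)]
inputs = {"+".join(g): np.concatenate([feats[m] for m in g], axis=1)
          for g in groupings}
for name, x in inputs.items():
    print(name, x.shape)
```

This yields seven input configurations (three unimodal, three bimodal, one trimodal), each of which would be fed to the same LSTM architecture.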

Multimodal Model Variations
Throughout our experimentation, we apply the algorithms in Section 4 with several intuitive variations of how to translate modalities. Below are all the approaches we try to maximize our chances of learning a strong representation. For bimodal models, we translate one modality into another. For example, A → V stands for translating from Audio to Video; we take the embedding state, which we refer to as embed(A+V), to predict sentiment. Here we employ the Seq2Seq Modality Translation Model of Algorithm 1.
For trimodal models, there are more variations. First, since we have three different modalities and Seq2Seq can only translate one modality to another, we use the Hierarchical Seq2Seq Modality Translation Model of Algorithm 2: e.g., we translate from T to A to obtain the joint representation embed(T+A), and then continue the translation from embed(T+A) to the remaining modality V, which yields the joint representation embed(T+A+V) used for sentiment prediction.
Second, we reuse the Seq2Seq Modality Translation Model to translate a concatenation of two modalities to the third, e.g., concat(T+V) to A, and vice versa, e.g., translating from A back to concat(T+V).
Finally, we use the Seq2Seq Modality Translation Model to translate from a concatenation of two modalities to a concatenation of another two. In this setting, at least one modality is repeated; based on previous work and our own experience, we favor the text modality (T) over the other two and repeat it.
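The variations above can be summarized as explicit (source, target) configurations; the notation below is ours, not from the paper's code, and is meant only to make the enumeration concrete.

```python
# Bimodal: translate one modality into another and keep the encoder embedding.
bimodal = [("A", "V"), ("V", "A"),
           ("T", "A"), ("A", "T"),
           ("T", "V"), ("V", "T")]

# Trimodal variation 1: hierarchical chaining, e.g. T -> A gives embed(T+A),
# then embed(T+A) -> V gives embed(T+A+V).
hierarchical = [("T", "A", "V"), ("T", "V", "A"), ("A", "V", "T")]

# Trimodal variation 2: concatenate two modalities and translate to the
# third, or the reverse direction.
concat_to_one = [("T+V", "A"), ("T+A", "V"), ("A+V", "T"),
                 ("A", "T+V"), ("V", "T+A"), ("T", "A+V")]

# Trimodal variation 3: concat-to-concat, repeating one modality
# (preferably text, per the discussion above).
concat_to_concat = [("T+V", "T+A"), ("T+A", "T+V")]

for src, tgt in bimodal + concat_to_one + concat_to_concat:
    print(f"translate {src} -> {tgt}")
```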

Baseline Unimodal Results
We see with the baseline model, as shown in Table 1, that the text modality is by far the most discriminative when it comes to detecting emotion. This implies that users rely heavily on their word choice and language to convey meaning and emotion. While this may be true, other works such as (Zadeh et al., 2018a) have achieved higher scores by combining the different modalities. This implies that with careful model construction we should be able to improve upon our baseline unimodal results by integrating additional modalities into our model.

Baseline Multimodal Results
The results of our baseline multimodal approaches are shown in Table 2 for the bimodal case and Table 3 for the trimodal case. Among the multimodal baselines, the model combining the three modalities of text, speech, and video performed best. The baseline combining text and audio came second, followed closely by the combined text and video model, while the model combining video and audio came last by a significant margin. This corroborates our unimodal baseline result that text is the most discriminative modality in this dataset.
On the whole, we see that when all three modalities work in concert we get the best result in a multimodal context. However, it is worth noting that we were not able to match our unimodal baseline with our multimodal models. This implies there is still more to be drawn from our data when constructing our model, and generally more work to be done. We believe that incorporating a stronger, more robust representation of our data will benefit later attempts at classification, though we view this as out of scope here, since the focus of this work is on learning informative representations.

Table 3: Trimodal results with 3 metrics: Precision, Recall and F-Score (F1)

Analysis of Baseline Failure Cases
The common trend we see among all baseline models is the consistent failure to identify extreme cases of either positive or negative emotion. We attribute this phenomenon to two factors. First, there are very few highly positive (+3) and highly negative (-3) examples in the training data, so the trained models are heavily biased against selecting +3 or -3 ratings. Second, our baseline models perform categorical classification as opposed to regression or ordinal classification. We address this by training the model to treat prediction as a regression task rather than a categorical classification task.
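A small example of the proposed fix: treat prediction as regression over the interval [-3, 3] and discretize afterwards by rounding and clipping into the seven CMU-MOSI sentiment classes.

```python
import numpy as np

def score_to_class(score):
    """Map a continuous regression score to one of the seven classes
    {-3, ..., +3} by rounding to the nearest integer and clipping."""
    return int(np.clip(np.rint(score), -3, 3))

examples = [-3.7, -0.4, 0.2, 1.6, 2.51, 4.0]
classes = [score_to_class(s) for s in examples]
print(classes)  # [-3, 0, 0, 2, 3, 3]
```

Extreme out-of-range predictions still land in the ±3 classes instead of being unreachable, which is the behavior categorical training struggled to produce.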

Bimodal Seq2Seq Results
Our bimodal models require two modalities, one for the encoding step and another for the decoding step. We explored several different encoder/decoder frameworks for these models. The first representations we explored were generated by encoding exactly one modality and then decoding exactly one different modality. The results of this approach are included in Table 2. Here we can see that the Seq2Seq Modality Translation Model consistently outperforms the baseline method in terms of F1, and outperforms it in terms of precision and recall in several cases, but not all.

Trimodal Seq2Seq Results
We try all variations mentioned in Section 5.3; the full breakdown of the results can be found in Table 3. While the Hierarchical Seq2Seq Modality Translation Model is a natural extension of the normal Seq2Seq Modality Translation Model, it does not perform well on the CMU-MOSI dataset. In contrast, the normal non-hierarchical model with concatenation variations does improve performance, and in particular beats the baseline (on F1 score only) with the model that translates from concat(T+V) to concat(T+A) in the 7-class case. As mentioned in Section 5.3, we favor the text (T) modality and repeat it in this setting because it typically contributes most to sentiment prediction; repeating the video or audio modality instead causes performance to drop dramatically. One possible reason for this behavior is the scarcity of training data. At every phase of Seq2Seq translation, we have only 1289 training samples, 230 validation samples, and 269 test samples, and Seq2Seq models typically require more data to train well. This effect is exacerbated in the hierarchical cases, where we train two phases of Seq2Seq. We expect performance to improve on larger datasets, or if we pretrain our model on another dataset before applying it to CMU-MOSI.

Discussion
The language modality is the most discriminative and the most important for learning multimodal representations. While we outperform the baseline multimodal approach, we were unable to outperform the baseline unimodal text approach. From these results it is clear that the text modality is the most discriminative of the modalities, yet the models described here are not able to fully isolate its importance. Merging the modalities into a shared representation space likely decreases the resolution of the text domain and thus its modeling power. This is why we believe the top performing multimodal models are those that incorporate the text domain most heavily (see Tables 2 and 3).
It is worth noting that some of the learned representations were quite poor when used for prediction. For example, representations learned using only audio and video generally performed poorly. This is to be expected, since these modalities are known to be less discriminative than the language modality. At the same time, some of the worst performing representations were those learned on top of an existing embedding. We believe this is because such a representation loses the resolution of the two domains from which the source embedding was learned, focusing instead on learning the best representation for predicting the final modality.

Future Directions
This research opens up a promising direction in joint unsupervised learning of multimodal representations and supervised learning of multimodal temporal data. We propose the following extensions that could improve performance. Firstly, using a Variational Autoencoder (VAE) (Kingma and Welling, 2013) in conjunction with an LSTM encoder/decoder model (as in a VAE Seq2Seq model) would be an interesting avenue to explore, since VAEs have been shown to learn better representations than vanilla autoencoders (Kingma and Welling, 2013; Pu et al., 2016).
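The suggested VAE extension amounts to replacing the deterministic encoder output with a sampled latent via the reparameterization trick; the sketch below uses random stand-in weights and omits training (a real model would optimize a reconstruction loss plus the KL term shown).

```python
import numpy as np

rng = np.random.default_rng(4)

# Instead of using the Seq2Seq encoder's final hidden state E directly,
# predict a mean and log-variance and sample the representation.
D, Z = 32, 16
E = rng.normal(size=(D,))                        # encoder output (stand-in)
W_mu = rng.normal(size=(Z, D)) * 0.1
W_logvar = rng.normal(size=(Z, D)) * 0.1
mu, logvar = W_mu @ E, W_logvar @ E

# Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I), so the
# sampling step stays differentiable with respect to mu and logvar.
eps = rng.normal(size=(Z,))
z = mu + np.exp(0.5 * logvar) * eps

# KL divergence to the standard normal prior, the extra term a VAE adds
# to the reconstruction loss.
kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
print(z.shape, kl)
```

The KL term is always non-negative, acting as a regularizer that keeps the latent distribution close to the prior.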
Secondly, since our method for multimodal representation learning is unsupervised, we could take advantage of larger external datasets to pretrain the multimodal representations before fine-tuning on CMU-MOSI. We believe this would boost performance because CMU-MOSI offers limited training data (2199 segments in total). Candidate datasets include the Persuasion Opinion Multimodal (POM) dataset (Park et al., 2014), with 1000 total videos (longer than segments), and the IEMOCAP dataset, with 10000 total segments. Since these datasets also consist of monologue speaker videos, we expect the learned multimodal representations to generalize.
Thirdly, our method does not train the combined model end to end: the representations are generated in one training run, and the sentiment classification model is trained separately. Exploring an end-to-end version of this model could yield better performance, since we could additionally fine-tune the learned multimodal representation for sentiment analysis.

Conclusion
To conclude, this paper investigates the problem of multimodal representation learning to leverage the abundance of unlabeled multimedia data available on the Internet. We presented two methods for unsupervised learning of joint multimodal representations using multimodal Seq2Seq models: the Seq2Seq Modality Translation Model and the Hierarchical Seq2Seq Modality Translation Model. These intermediate multimodal representations can then be used for multimodal downstream tasks. Our experiments indicate that the multimodal representations learned by our Seq2Seq modality translation method are highly informative and achieve improved performance on multimodal sentiment analysis.