SWAFN: Sentimental Words Aware Fusion Network for Multimodal Sentiment Analysis

Multimodal sentiment analysis aims to predict the sentiment of language text with the help of other modalities, such as vision and acoustic features. Previous studies focused on learning the joint representation of multiple modalities and ignored some useful knowledge contained in the language modality. In this paper, we incorporate sentimental words knowledge into the fusion network to guide the learning of the joint representation of multimodal features. Our method consists of two components: a shallow fusion part and an aggregation part. For the shallow fusion part, we use a crossmodal coattention mechanism to obtain bidirectional context information for each pair of modalities and produce the fused shallow representations. For the aggregation part, we design a sentimental words classification task, trained through multitask learning, to guide the deep fusion of the three modalities and obtain the final sentimental words aware fusion representation. We carry out several experiments on the CMU-MOSI, CMU-MOSEI and YouTube datasets. The experimental results show that introducing sentimental words prediction as an auxiliary task improves the fusion representation of multiple modalities.


Introduction
Multimodal sentiment analysis is the task of predicting the sentiment of a video, an image or a text based on multiple modal features. By exploiting the contributions that different modalities make to each other, multimodal sentiment analysis has achieved significant results and attracted the attention of many researchers in recent years.
The main challenge of multimodal sentiment analysis is to capture a better fusion of different modalities. Previous studies have proposed fusion methods from different points of view. Some methods focus on improving the LSTM structure to learn the interactions of different modal features from the uni-stage and multi-stage views. Zadeh et al. (2018a) propose a Memory Fusion Network to learn both the view-specific interactions and the cross-view interactions. The Recurrent Multistage Fusion Network models cross-modal interactions using a multistage fusion approach. Other methods focus on exploiting the expressiveness of tensors for multimodal representation. The Tensor Fusion Network explicitly models the unimodal, bimodal and trimodal interactions through a 3-fold Cartesian product from modality embeddings. More recently, other methods have been proposed and achieve new state-of-the-art results (Pham et al., 2019; Mai et al., 2019; Wang et al., 2019; Tsai et al., 2019).
Although previous studies achieved good results, two points can still be improved. (1) We find that most previous methods fuse from one direction; that is, when two modalities are fused, their representations are combined directly into a new representation. This fusion strategy ignores the long-range context information of each modality. As the example in Figure 1 shows, for the fusion of the language modality and the vision modality, if we capture the context information of each modality from both directions, we can obtain more sufficient fusion information.
(2) Few previous studies explicitly explored the knowledge contained in the language text, which can help the fusion of different modalities thanks to the rich information in the language.
To this end, we propose a Sentimental Words Aware Fusion Network (SWAFN) for multimodal sentiment analysis. More specifically, we first use LSTM to encode the original features of the three modalities. Then we use the coattention mechanism (Xiong et al., 2017) to learn the co-dependent representation between language and each of the other modalities separately by capturing the attention contexts of each modality. We call this bimodal fusion between language and the other modalities the shallow fusion part. Figure 1 illustrates crossmodal coattention for the language and vision modalities. Finally, we design a sentimental words prediction task as an auxiliary task, trained through multitask learning, to guide the aggregation of the shallow fusion of multiple modal features and obtain the final sentimental words aware deep fusion representation.
The main contributions of this work are as follows: 1) We propose to use crossmodal coattention to learn the long-range context information of each pair of modalities and obtain more sufficient fusion information for multiple modalities. We also design a sentimental words prediction auxiliary task to guide the fusion of multiple modal features and learn a sentimental words aware final representation. To the best of our knowledge, this is the first time that multi-task learning is applied in multimodal sentiment analysis.
2) We conduct several experiments on different public datasets and show that our model is effective for multimodal sentiment analysis. In addition, we carry out a series of experiments to investigate the contribution of different modalities, the impact of the shallow fusion, and the final fusion after integrating the auxiliary task.

Related Work
The key problem of multimodal sentiment analysis is to bridge the gap between different modalities and learn an effective fusion of multimodal features. In recent years, with the successful application of neural networks in many tasks, different sophisticated fusion approaches have been proposed and achieve significant results.
Fusion methods based on improved LSTM structure. Some previous studies improve the LSTM structure to learn the interactions of different modality features within the same timestep and across timesteps. The Gated Multimodal Embedding LSTM with Temporal Attention model consists of two modules: a Gated Multimodal Embedding, which aims to alleviate the fusion difficulty when there are noisy modalities, and an LSTM with temporal attention, which performs word-level fusion. Zadeh et al. (2018c) propose a Multi-attention Recurrent Network, in which the LSTHM (an extension of LSTM) is used to store view-specific dynamics of the assigned modality and cross-view dynamics related to the assigned modality, and the Multi-attention Block is used to discover cross-view dynamics across different modalities. Zadeh et al. (2018a) propose a Memory Fusion Network which employs LSTM to learn view-specific interactions and an attention mechanism called the Delta-memory Attention Network to identify the cross-view interactions. The Recurrent Multistage Fusion Network models cross-modal interactions using a multi-stage fusion approach.
Figure 2: The whole architecture of our model. We first use the coattention mechanism to learn the bidirectional long-range context information between the language modality and each of the other modalities separately. Then we integrate a sentimental words classification task into the model through multitask learning to guide the learning and aggregation of multimodal fusion.
Fusion methods based on tensor structure. Instead of using improved LSTM-based models, some previous studies exploit the expressiveness of tensors for multimodal representation. The Tensor Fusion Network explicitly models the interactions of different modalities through a 3-fold Cartesian product from modality embeddings. The Low-rank Multimodal Fusion network first obtains the unimodal representations and then performs low-rank multimodal fusion to improve efficiency.
Previous state-of-the-art fusion methods. More recently, Pham et al. (2019) explore a method of translations between modalities to learn joint representations, in which a cycle consistency loss is used to ensure that the joint representations retain maximal information from all modalities. Different from most previous studies, which directly fuse features at the holistic level, Mai et al. (2019) propose a "divide, conquer and combine" strategy to perform multimodal fusion hierarchically, which considers both local and global interactions. In order to model expressive nonverbal representations, Wang et al. (2019) propose a Recurrent Attended Variation Embedding Network which analyzes the fine-grained visual and acoustic patterns and dynamically shifts word representations according to nonverbal cues. Tsai et al. (2019) introduce a Multimodal Factorization Model which factorizes representations into two sets of independent factors, multimodal discriminative factors and modality-specific generative factors, and propose a joint generative-discriminative objective to optimize across multimodal data and labels.
Although previous studies have proposed many effective multimodal fusion approaches, few have explored the possibility of using the knowledge in language through a multi-task learning framework in multimodal sentiment analysis. In this paper, we design an auxiliary task to guide the model to learn a multimodal representation that is aware of sentimental words information.

Our Model
In this section, we will describe our model in more detail. Section 3.1 introduces the crossmodal coattention, section 3.2 introduces the sentimental words prediction auxiliary task, section 3.3 introduces the sentimental words aware representation and section 3.4 describes the model training. Figure 2 shows the whole architecture of our model.

CrossModal Coattention
Given the word embeddings of the language modality and the raw features of the acoustic and vision modalities, denoted as $X_L = \{l_1, l_2, \ldots, l_T\}$, $X_A = \{a_1, a_2, \ldots, a_T\}$ and $X_V = \{v_1, v_2, \ldots, v_T\}$ respectively, we use LSTM to model the temporal information of the three modalities as intra-modal encoding. The resulting LSTM hidden state outputs of the three modalities are denoted as $H_L$, $H_A$ and $H_V$ respectively.
After obtaining the encoded features $H_L$, $H_A$ and $H_V$, we use coattention (Xiong et al., 2017) to learn the bimodal fusion between the language modality and each of the other modalities. First, we use a non-linear projection layer to transform the encoded language representation to the same dimension as that of the other modalities so that coattention can be performed, as shown in equation (1).
The coattention mechanism attends to the language modality and the other modality (i.e., vision or acoustic) simultaneously and learns the bimodal fusion. First, an affinity matrix is computed, which contains the affinity scores for all pairs of language hidden states and vision (or acoustic) hidden states. The softmax function is then used to normalize the affinity matrix row-wise to produce the attention weights $A_V$ (or $A_A$) across the language text for each timestep of the vision (or acoustic) features, and column-wise to produce the attention weights $A_L$ across the vision (or acoustic) features for each word, as shown in equations (2)-(4).
Next, we compute the attention contexts of the language features based on the attention weights of each timestep of the vision (or acoustic) features, as shown in equation (5). Similarly, we compute the attention contexts $A_L H_V$ of the vision (or acoustic) features based on the attention weights of each word of the language features. Following Xiong et al. (2017), we also compute the summaries $A_L C_V$ to map the encoding of the vision (or acoustic) features into the space of the language feature encoding. The corresponding operation is shown in equation (6).
Here, $C_{L\&V}$ is defined as the co-dependent representation of the language modality and the vision modality, and [ ] denotes the concatenation operation. Using the same coattention operation for the language and acoustic modalities, we obtain $C_{L\&A}$ as the co-dependent representation of the language modality and the acoustic modality. The bimodal fusions $C_{L\&V}$ and $C_{L\&A}$ are regarded as a kind of shallow fusion, because the trimodal fusion and the knowledge in the language modality are not yet well captured.
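The coattention steps described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the orientation of the affinity matrix (rows index the other modality, columns index words) and the use of a plain dot product as the affinity score are assumptions made for illustration, and the language states are assumed to have already been projected to the same dimension as the other modality.

```python
import math

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def softmax_rows(m):
    out = []
    for row in m:
        mx = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def coattention(h_lang, h_other):
    """h_lang: T_l x d language states; h_other: T_o x d vision/acoustic states.

    Returns a T_l x 2d co-dependent representation (one row per word).
    """
    # Affinity scores for every (other timestep, word) pair. ASSUMPTION: dot product.
    affinity = matmul(h_other, transpose(h_lang))      # T_o x T_l
    # Row-wise softmax: weights over words for each timestep of the other modality.
    a_other = softmax_rows(affinity)                    # T_o x T_l
    # Column-wise softmax: weights over the other modality for each word.
    a_lang = softmax_rows(transpose(affinity))          # T_l x T_o
    # Attention contexts of the language features, one per other-modality timestep.
    c_other = matmul(a_other, h_lang)                   # T_o x d
    # Attention contexts of the other modality, one per word.
    contexts = matmul(a_lang, h_other)                  # T_l x d
    # Summaries that map the contexts back into the language space.
    summaries = matmul(a_lang, c_other)                 # T_l x d
    # Concatenate along the feature axis.
    return [ctx + summ for ctx, summ in zip(contexts, summaries)]

h_l = [[0.1, 0.3], [0.5, -0.2], [0.0, 0.4]]   # 3 words, d = 2 (toy values)
h_v = [[0.2, 0.1], [-0.3, 0.6]]               # 2 vision timesteps, d = 2
c_lv = coattention(h_l, h_v)
print(len(c_lv), len(c_lv[0]))                # 3 rows, 4 features per row
```

Because both directions share one affinity matrix, each word sees a summary of the other modality and each timestep of the other modality sees a summary of the sentence, which is the bidirectional context the section describes.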

Sentimental Words Prediction Auxiliary Task
In addition to using other modalities to assist the language modality, we find that the sentimental words information in the language modality can also be incorporated into the fusion model to learn a richer multimodal representation. In this paper, we design a word-level classification task which determines whether each word is a sentimental word. Specifically, we use Bing Liu's Opinion Lexicon as the knowledge source; it contains a negative-words list and a positive-words list, from which we obtain the label of each word. We first merge the two lists into a single sentimental word list. A word that appears in this list is a sentimental word; otherwise, it is not. We then build the auxiliary task as a multi-label classification task, because each sentence in the language modality may contain more than one sentimental word. Note that the word-level classification task and the sentiment analysis task share the same language encoding layer, as shown in Figure 2. We input $H_L$ into a fully-connected layer with a row-wise squash activation function (Sabour et al., 2017) to prepare it for word-level classification. The squash function ensures that short vectors are shrunk to almost zero length and long vectors are shrunk to a length slightly below 1. We expect the squash function to learn a representation in which the length of each vector represents the probability that the corresponding word is a sentimental word.
Here, $W_w$ is the weight and $b_w$ is the bias of this fully-connected layer. The output $H_{words}$ is then input to a multi-label classification layer, as shown in equation (9).
Here, $W_{words}$ is the weight and $b_{words}$ is the bias. $y_{words} \in \mathbb{R}^T$ denotes whether each word is a sentimental word.
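The labeling rule for the auxiliary task can be sketched as follows. The function name and the lowercasing step are assumptions for illustration, and the two word lists below are tiny hypothetical stand-ins for the positive and negative lists of the lexicon, not its actual contents.

```python
def label_sentimental_words(tokens, positive_words, negative_words):
    """Return one 0/1 label per token: 1 if the token is in the merged lexicon."""
    # Merge the two lists into a single sentimental word list.
    lexicon = set(positive_words) | set(negative_words)
    # ASSUMPTION: tokens are matched case-insensitively.
    return [1 if tok.lower() in lexicon else 0 for tok in tokens]

# Hypothetical toy lists standing in for the real lexicon files.
positive = ["loved"]
negative = ["unbelievably", "shocked"]

sentence = "And i was unbelievably shocked how much i loved it".split()
print(label_sentimental_words(sentence, positive, negative))
# -> [0, 0, 0, 1, 1, 0, 0, 0, 1, 0]
```

A sentence can receive several positive labels at once, which is why the task is framed as multi-label classification over the words of the sentence.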
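The behaviour of the squash activation can be checked with a short sketch. It follows the standard definition from Sabour et al. (2017), in which a vector s is rescaled to length ||s||^2 / (1 + ||s||^2) while keeping its direction; the helper names and the small `eps` guard against division by zero are assumptions added here.

```python
import math

def squash(vec, eps=1e-9):
    """Rescale vec to length ||vec||^2 / (1 + ||vec||^2), preserving direction."""
    sq_norm = sum(v * v for v in vec)
    norm = math.sqrt(sq_norm)
    # eps only protects the all-zero vector from a division by zero.
    scale = sq_norm / (1.0 + sq_norm) / (norm + eps)
    return [scale * v for v in vec]

def length(vec):
    return math.sqrt(sum(v * v for v in vec))

short_vec = squash([0.01, 0.0])   # short input: shrunk to almost zero length
long_vec = squash([30.0, 40.0])   # long input: length slightly below 1
print(length(short_vec), length(long_vec))
```

Applied row-wise to the word-level hidden states, this keeps every vector length in the interval [0, 1), which is what allows the length to be read as a probability-like score.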

Sentimental Words Aware Multimodal Representation
As described in section 3.1, we obtain the bimodal fusion between language and each of the other modalities separately. As mentioned earlier, we view this fusion as a shallow fusion, because we believe that the rich semantic information in the language can be used to learn the deep fusion and aggregation of different modalities. As demonstrated by many previous studies (Poria et al., 2017a; Mai et al., 2019), the language modality often plays a dominant role among the three modalities. We therefore concatenate $C_{L\&V}$, $C_{L\&A}$ and $H_L$, and input the result to an LSTM layer to aggregate the two kinds of bimodal fusion representation and the intra-modality encoding of language, obtaining $H_{agg}$, as shown in equation (10).
During the training of the auxiliary task, we expect $H_{words} = \{h^w_1, h^w_2, \ldots, h^w_T\}$ to learn whether each word is a sentimental word, so that the representations of sentimental words can be distinguished from those of non-sentimental words. For sentiment analysis, the sentimental words are usually the key clues for determining the sentiment. However, in some cases a sentence contains sentimental words with different polarities, and we need to decide which sentimental words contribute more to sentiment prediction. Thus, to let the auxiliary task take effect, we use the word-level representation $H_{words}$ to learn the contribution of each word and guide the learning of multimodal fusion, as shown in equations (11)-(13).
Here, $W_a$ and $W_u$ are trainable weights and $b_a$ is the bias. Note that we use the learned attention weights to perform a weighted sum over the multimodal fusion representation $H_{agg}$, obtaining $S_{att}$, which is the sentimental words information aware representation.
In addition to $S_{att}$, which is learned under the guidance of the auxiliary task, we perform average pooling on $H_{agg}$ to obtain the global multimodal information, denoted as $S_{avg}$. Finally, we concatenate $S_{att}$ and $S_{avg}$ to form the final representation. The final representation is input to a fully-connected layer and a prediction layer to obtain the sentiment prediction, as shown in equations (14)-(15).
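The pooling step can be sketched as below. The scoring function is an assumption: the sketch uses a common additive form, score_t = W_u · tanh(W_a h_t + b_a) followed by a softmax over words, which matches the named parameters but is not guaranteed to be the exact form of equations (11)-(13). All function names are hypothetical.

```python
import math

def attention_weights(h_words, w_a, b_a, w_u):
    """One weight per word, computed from the word-level representation."""
    scores = []
    for h in h_words:
        # ASSUMPTION: additive scoring, tanh(W_a h + b_a) projected by W_u.
        hidden = [math.tanh(sum(w * x for w, x in zip(row, h)) + b)
                  for row, b in zip(w_a, b_a)]
        scores.append(sum(w * u for w, u in zip(w_u, hidden)))
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(h_agg, weights):
    """Concatenate the attention-weighted sum and the average of h_agg."""
    dim = len(h_agg[0])
    s_att = [sum(a * h[j] for a, h in zip(weights, h_agg)) for j in range(dim)]
    s_avg = [sum(h[j] for h in h_agg) / len(h_agg) for j in range(dim)]
    return s_att + s_avg

h_words = [[0.0, 0.0], [0.0, 0.0]]   # toy word-level states (T = 2)
h_agg = [[1.0, 2.0], [3.0, 4.0]]     # toy aggregated states (T = 2, d = 2)
weights = attention_weights(h_words, w_a=[[1.0, 1.0]], b_a=[0.0], w_u=[1.0])
print(fuse(h_agg, weights))          # 4 values: weighted sum followed by average
```

The key design point is that the weights come from the word-level representation trained by the auxiliary task, while the values being pooled come from the multimodal aggregation.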

Model Training
Because both classification and regression tasks for sentiment analysis are evaluated on the CMU-MOSI dataset, we use the L1 loss for training the sentiment analysis tasks on CMU-MOSI, as shown in equation (16). Here, $y_s^i$ and $\hat{y}_s^i$ are the true sentiment and the predicted sentiment of the i-th sample respectively, and N is the number of training samples. For the CMU-MOSEI and YouTube datasets, following Pham et al. (2019), we use cross-entropy as the training loss.
For the word-level classification task, we use the binary cross-entropy loss, as shown in equation (17). Here, $y_w^i$ and $\hat{y}_w^i$ are the true label and the predicted label of the i-th sample respectively, and T is the length of the language sentence.
The overall loss of our model is the weighted sum of $Loss_{sa}$ and $Loss_{sw}$, as shown in equation (18), where $\alpha \in (0, 1)$ is a hyperparameter.
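The training objective can be sketched as follows. The L1 and binary cross-entropy terms use their standard mean-reduced definitions; the way the two terms are combined is an assumption, since the text only states that the overall loss is a weighted sum with a hyperparameter in (0, 1). The sketch takes Loss = Loss_sa + alpha * Loss_sw; a convex combination would be an equally plausible reading.

```python
import math

def l1_loss(y_true, y_pred):
    """Mean absolute difference between true and predicted sentiment scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def bce_loss(labels, probs, eps=1e-12):
    """Mean binary cross-entropy over word-level 0/1 labels."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def overall_loss(loss_sa, loss_sw, alpha):
    # ASSUMPTION: the auxiliary loss is scaled by alpha and added to the main loss.
    return loss_sa + alpha * loss_sw

loss_sa = l1_loss([1.0, -2.0], [0.5, -1.0])   # sentiment regression term
loss_sw = bce_loss([1, 0], [0.5, 0.5])        # sentimental words term
print(overall_loss(loss_sa, loss_sw, alpha=0.2))
```

Keeping alpha below 1 lets the sentiment prediction remain the dominant objective while the word-level task acts as a regularizing signal.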

Dataset
We use CMU-MOSI, CMU-MOSEI and YouTube as our experimental datasets, which are extensively used in previous studies. Following most previous studies, GloVe embeddings (Pennington et al., 2014) are used to represent the language features, the visual features are extracted with the Facet library, and the acoustic features are extracted using COVAREP (Degottex et al., 2014). CMU-MOSI (Zadeh et al., 2016) contains 93 videos from YouTube, each of which expresses a speaker's opinions towards a movie. The videos are split into 2199 clips. We train our model on 52 videos (1284 clips), validate on 10 videos (229 clips) and test on 31 videos (686 clips). The sentiment label of each clip is a number in [-3, 3], where the integer values represent strongly positive (+3), positive (+2), weakly positive (+1), neutral (0), weakly negative (-1), negative (-2) and strongly negative (-3). CMU-MOSEI (Zadeh et al., 2018b) consists of 22,413 video clips of movie reviews from YouTube. There are 15290, 2291 and 4832 clips in the training set, validation set and test set respectively. YouTube (Morency et al., 2011) consists of 269 video clips, of which 173, 36 and 60 are in the training set, validation set and test set respectively.
For the CMU-MOSI dataset, we run binary classification, multi-class classification and regression experiments. For the regression task, we report Mean Absolute Error (MAE) and Pearson's Correlation (Correlation). For binary classification, we report accuracy and F1 score, while for multi-class classification we only report accuracy, which is consistent with most previous studies. For the CMU-MOSEI and YouTube datasets, we consider positive, negative and neutral sentiments following Mai et al. (2019) and report accuracy and F1 score. For all metrics, higher values represent better performance, except for MAE.
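Two of the metrics above can be sketched in a few lines. Pearson's correlation uses its standard definition; the binary accuracy function assumes that scores are split into non-negative versus negative at zero, which is one common convention for labels in [-3, 3] and is not stated in the text. Function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def binary_accuracy(y_true, y_pred):
    # ASSUMPTION: a score >= 0 counts as the positive class.
    hits = sum((t >= 0) == (p >= 0) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

true_scores = [3.0, -1.0, 0.5, -2.0]   # toy sentiment labels in [-3, 3]
pred_scores = [2.0, 1.0, 0.8, -1.5]    # toy model outputs
print(pearson(true_scores, pred_scores), binary_accuracy(true_scores, pred_scores))
```

Reporting both kinds of metric matters because a model can rank clips well (high correlation) while still misplacing the decision boundary, or the reverse.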

Baseline Models
We use the following methods as our baseline models for experiments. Firstly

Experimental Results
In this section, we present the experimental results and the analysis of our model on the CMU-MOSI, YouTube and CMU-MOSEI datasets.
Experimental results on CMU-MOSI dataset. We summarize the experimental results of different models on the CMU-MOSI dataset in Table 1. As shown in Table 1, our model achieves competitive performance compared with the best baseline model, HFFN, on accuracy and F1 score for binary classification. For the regression task, our model achieves the best performance among the baselines on both mean absolute error (MAE) and correlation (Corr). Specifically, our model outperforms MCTN by 2.9% on MAE and 2.1% on correlation, which are significant improvements. For the 7-class classification task, our model also achieves the best performance among the baseline models, outperforming RMFN by 1.8% and MFM by 3.9% on accuracy. The experimental results on the CMU-MOSI dataset show that our approach brings larger improvements on the regression and 7-class classification tasks than on the binary classification task.
Experimental results on YouTube dataset. Table 2 shows the experimental results of our model and the baseline models on the YouTube dataset. Although the YouTube dataset is very small, our model achieves the best performance on both accuracy and F1 score compared with the baseline models: it outperforms MCTN by 3.3% on accuracy and 0.9% on F1 score, and outperforms the previous state-of-the-art model MFM by 1.7% on accuracy and 0.9% on F1 score. Because the training samples are very limited, many baseline models may overfit the training set. Our model achieves better performance, indicating better generalization ability.

Table 2: Experimental results on the YouTube and CMU-MOSEI datasets.
Model                                YouTube-Acc  YouTube-F1  MOSEI-Acc  MOSEI-F1
MV-LSTM (Rajagopalan et al., 2016)   45.8         43.3        -          -
BC-LSTM (Poria et al., 2017a)        45.0         45.1        60.77      59.04
TFN                                  45.0         41.0        59.40      57.33
CAT-LSTM (Poria et al., 2017b)       -            -           60.72      58.83
MARN (Zadeh et al., 2018c)           48.3         44.9        -          -
MFN (Zadeh et al., 2018a)            51.7         51.6        -          -
CHFusion (Majumder et al., 2018)     -            -           58.45      56.90
LMF                                  -            -           60.27      53.87
MCTN (Pham et al., 2019)

Experimental results on CMU-MOSEI dataset. For the CMU-MOSEI dataset, following Mai et al. (2019), we conduct experiments on the 3-class classification task. We present the experimental results of different models in Table 2. Our model achieves the best performance on both accuracy and F1 score, outperforming HFFN by 0.66% on accuracy and 0.25% on F1 score. CMU-MOSEI is the largest of the three datasets, and the differences in performance between models are not very large. For example, the accuracy of the baselines ranges between 60.2% and 60.8%, except for TFN and CHFusion. However, the F1 score of the baseline models ranges between 53.87% and 59.07%, as some baseline models achieve much lower F1 scores than accuracy. Our model achieves good performance on both accuracy and F1 score. The overall experimental results on the three datasets show the effectiveness of our model.

Discussion
In this section, we investigate the impact of different modalities on the performance of the final model. We also conduct a case study to investigate how the auxiliary task guides the learning of the attention weights of sentimental words in the sentence.

Investigation of the Contribution of Different Modalities
In order to investigate the impact of different modalities on our model, we carry out a series of experiments to compare the performance of our model using unimodal, bimodal and multimodal features respectively. The results are shown in Table 3.
First, we conduct experiments with the model using only unimodal features, where language (no auxiliary task) uses only the language representation and language (+auxiliary task) is combined with the sentimental words classification task. The model using the language modality outperforms the models using the acoustic modality or the vision modality by a significant margin. This is probably because the language features are word embeddings trained on a large-scale corpus, while the audio and video features are extracted manually. The language modality thus contains much richer information than the other modalities.
Second, we compare the models with bimodal features. When the language modality is combined with the acoustic modality or the vision modality, the performance improves on some metrics compared with using only the language modality, but not on all of them. However, when audio features and video features are used as input, the performance of the model is still much worse than that of using only the language modality, suggesting that language is the dominant modality in this task. When all three modalities are combined, our model achieves further improvements compared with using bimodal features.
Finally, as can be seen in Table 3, for all combinations of modalities, the models with the auxiliary task outperform the models without it, which suggests that the sentimental words classification auxiliary task indeed plays an important role in our model. In addition, the proposed crossmodal coattention mechanism, which learns the interaction between different modalities, also makes a significant contribution. Owing to the sufficient modality fusion and the cooperation of the auxiliary task, our model achieves its final promising performance.

Case Study
As mentioned before, we propose a sentimental words classification task as an auxiliary task to guide the fusion of multiple modalities and, in turn, to learn more precise attention weights for the sentimental words in the sentence. In order to investigate how the auxiliary task guides the learning of attention weights, we conduct a case study on two instances.
Figure 3 presents the attention weights learned by our model with the auxiliary task (SWAFN) and without the auxiliary task (denoted as SWAFN(∆)). The first line of each example gives the predicted labels of the word-level classification task: "N" means the word is predicted as not a sentimental word, and "Y" means the word is predicted as a sentimental word. For example, the sentence "And i was unbelievably shocked how much i loved it" contains three sentimental words, "unbelievably", "shocked" and "loved", of which "unbelievably" and "shocked" are negative and "loved" is positive. The third and fourth lines of each example are the attention weights of each word learned by the SWAFN model and the SWAFN(∆) model. We can see that SWAFN pays most of its attention to the three sentimental words and assigns the largest weight to the word that directly reflects the sentiment of the sentence. In contrast, SWAFN(∆) pays most attention to the word "shocked", which is a negative word, so it predicts the wrong sentiment label. A similar observation can be made in the other instance.
The observation in Figure 3 indicates that the sentimental words classification auxiliary task guides the model to pay more attention to sentimental words than to other words when predicting sentiment, and to recognize which sentimental words reflect the sentiment directly. With more accurate attention weights, SWAFN can summarize a more effective representation and thus achieves better performance than SWAFN(∆).

Conclusion
In this paper, we propose a Sentimental Words Aware Fusion Network (SWAFN), which first applies the crossmodal coattention mechanism to learn long-range context information and then uses a sentimental words classification auxiliary task to guide the learning of the final sentimental words aware multimodal fusion representation. The experimental results on several datasets show the effectiveness of our model. The results and the case study also demonstrate that the proposed sentimental words classification auxiliary task is an effective way to use external knowledge to help the model learn a more powerful multimodal representation. In the future, we will consider incorporating more external language knowledge to obtain better multimodal fused representations.