Multilogue-Net: A Context Aware RNN for Multi-modal Emotion Detection and Sentiment Analysis in Conversation

Sentiment analysis and emotion detection in conversation are key to a number of real-world applications, with different applications leveraging different kinds of data to achieve reasonably accurate predictions. Multimodal emotion detection and sentiment analysis can be particularly useful, as applications can use whichever subset of the available modalities their data provides to produce relevant predictions. Current multimodal systems fail to leverage and capture the context of the conversation through all modalities, the current speaker and listener(s) in the conversation, and the relevance and relationship between the available modalities through an adequate fusion mechanism. In this paper, we propose a recurrent neural network architecture that attempts to address all of these drawbacks, and keeps track of the context of the conversation, interlocutor states, and the emotions conveyed by the speakers in the conversation. Our proposed model outperforms the state of the art on two benchmark datasets on a variety of accuracy and regression metrics.


Introduction
Multi-modal emotion detection and sentiment analysis in conversation has been gathering a lot of attention recently, considering its potential use cases owing to the rapid growth of online social media platforms such as YouTube, Facebook, Instagram, Twitter etc. (Chen et al., 2017; Poria et al., 2016, 2017b; Zadeh et al., 2016), especially knowing that information obtained from any combination of more than one of the available modalities (e.g. text, audio, video) can be used to produce meaningful results.
The current state-of-the-art systems for multimodal emotion detection and sentiment analysis do not make the best use of the available modalities: they do not treat the modalities in accordance with the information they are capable of holding (e.g. textual information is significantly more likely to hold contextual information than audio or video features are), they lack an adequate fusion mechanism, and they fail to effectively capture the context of a conversation in a multimodal setting. In addition to the lack of proper usage of the available modalities, models also fail to effectively capture the flow of a conversation, the separation between speaker and listener states, and the emotional effect a speaker's utterance has on the listener(s) in dyadic conversations.
(The majority of the work was done when the author was an intern at NVIDIA Graphics, Bengaluru, India.)
Our proposed model, Multilogue-Net, takes insight from Poria et al. (2019) and assumes that the sentiment or emotion governing a particular utterance predominantly depends on four factors: interlocutor state, interlocutor intent, the preceding and future emotions, and the context of the conversation. Of these, interlocutor intent is particularly difficult to model due to its dependency on prior knowledge about the speaker, but modelling the other three separately, yet in an interrelated manner, was theorized to produce meaningful results if captured effectively. The key intention was to simulate the setting in which an utterance is said, and to use the actual utterance at that point to gain better insight into the emotion and sentiment of that utterance. The model uses information from all modalities, learning multiple state vectors (representing interlocutor state) for a given utterance, followed by a pairwise attention mechanism (Ghosal et al. 2018) that attempts to learn the relationship between all pairs of the available modalities.
The model uses two gated recurrent units (GRUs) (Chung et al. 2014) for each modality to model interlocutor state and emotion, and an interconnected context network consisting of one GRU per available modality, so that a different learned context representation is modelled for each modality. The incoming utterance text representation is fed to the context GRU of every modality, in concatenation with the state GRU output of the previous timestamp, to update the context representation of that modality (the text representation has been theorized to be the best representative of context, independent of which modality's context GRU it is acting as an input to). The incoming utterance representation of each modality is fed to the state GRU of that modality in concatenation with all the context GRU outputs of that modality until that timestamp; the output of the state GRU is taken to be an encoding of the interlocutor state as conveyed by that modality. This encoding is fed into the emotion GRU, which acts as the decoder, to produce the emotion representation of that utterance. It is followed by a pairwise attention mechanism over the emotion representations, which is intended to emphasize attributes of the learned representations common to multiple modalities and deemphasize attributes learned differently by the different modality representations.
The use of the text representation as input to the context GRUs has been observed to be key to the results, as the context of the conversation is better captured by textual information than by audio or video information. We believe that Multilogue-Net performs better than the current state of the art (Ghosal et al. 2018) on multimodal datasets because of a better context representation that leverages all available modalities.
The remaining sections of the paper are arranged as follows: Section 2 discusses related work; Section 3 discusses the model in detail; Section 4 provides experimental results, dataset details, and analysis; and Section 5 discusses potential future work and concludes the paper.

Related Work
Multimodal emotion recognition and sentiment analysis have long attracted attention in multiple fields such as natural language processing, psychology, and cognitive science (Picard 2010). Previous work has studied factors of variation that have a more direct correlation with emotion, such as Ekman (1993), who found a correlation between emotion and facial cues. Many studies also focus extensively on emotions and their relationship with one another, such as Plutchik's wheel of emotions, which defines eight primary emotion types, each of which has a multitude of emotions as subtypes.
Early work leveraging multimodal information for emotion recognition includes Rothkrantz (2008), who fused acoustic information with visual cues for emotion recognition, and Wollmer et al. (2010), who used contextual information for emotion recognition in a multimodal setting. More recently, deep recurrent neural networks have been used to make the best of the learned representations of the available modalities and give effective and accurate emotion and sentiment predictions. Poria et al. (2017) successfully used RNN-based deep networks for multimodal emotion recognition, which was followed by multiple other works (Chen et al. 2017; Zadeh et al. 2018a; 2018b) giving results far better than what was seen before. Recent works also include Hazarika et al. (2018), who used memory networks for emotion recognition in dyadic conversations, where two distinct memory networks enabled inter-speaker interaction.
Some works such as DialogueRNN (Majumder et al. 2018), though focused on emotion recognition and sentiment analysis using a single modality (text), work very well in a multimodal setting when the text representation is simply replaced with a concatenated vector of all the modality representations. DialogueRNN effectively leveraged the separation between the speakers by maintaining two independent gated recurrent units to keep track of the interlocutor states, while also effectively capturing context in the conversation, yielding state-of-the-art performance on unimodal data. Even though DialogueRNN was able to give reasonably good results on multimodal data, the lack of an adequate fusion mechanism and the lack of focus on a multimodal representation held its multimodal performance back.
Apart from the kind of works shown before, where a methodology or a model was proposed, works such as that of Poria et al. (2019) discussed extensively the research challenges and advancements in emotion detection in conversation and gave a comprehensive overview of the problem.
Most recently, Ghosal et al. (2018) introduced the idea of learning the relationship between pairs of all available modalities using pairwise attention in a multimodal setting, where similar attributes learned by multiple modalities are emphasized and differences between the modality representations are deemphasized. Pairwise attention proved to be highly effective, yielding state-of-the-art performance on multimodal data with just simple representations for each modality.

Problem Formulation
Let there be P participants $p_1, p_2, \ldots, p_P$ in the conversation. The problem is defined such that every utterance $u_1, u_2, \ldots, u_N$ uttered by any participant is allotted a sentiment score along with a predicted emotion label (one of happy, sad, angry, surprise, disgust, and fear). Each utterance corresponds to a particular participant of the conversation, allowing this formulation of the problem to also capture the average sentiment of a participant in the conversation. Making predictions over utterances also avoids problems that arise when predictions are made for a fixed time interval, such as classification during long moments of silence.
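For every utterance $u_t(p)$, where p is the party who uttered the utterance, there exist three independent representations $t_t \in \mathbb{R}^{D_t}$, $a_t \in \mathbb{R}^{D_a}$, and $v_t \in \mathbb{R}^{D_v}$ (text, audio, and video respectively), obtained using the feature extractors mentioned in Section 4.2.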
This gives us our overall formulation of the problem, which is to learn a function that takes as input three independent representations of a particular utterance, information regarding the previous emotional state of the participant, and a representation of the current context of the conversation, and maps them to an output prediction of a sentiment score and an emotion label. Apart from the function that maps to the sentiment score and emotion label, one also needs to pay heed to how the context representation and participant state representation are updated. This also plays a large role in the performance of the model, as the emotional state of a participant and the context of the conversation at that point are key to the emotional understanding of that utterance.
In the rest of the paper, variables will follow the convention that subscripts denote timestamps and superscripts denote the modality, unless mentioned otherwise.

Model Details
Modelling was done under the assumption that the sentiment or emotion of an utterance predominantly depends on four factors, as mentioned before:
• Interlocutor state
• Interlocutor intent
• Context of the conversation until that point
• Previous interlocutor states and emotions of a particular participant in the conversation
The proposed model attempts to model three of these four factors explicitly, and assumes that interlocutor intent will be modelled implicitly during training. Interlocutor state is modelled using a state network (sGRU), a context network (cGRU) keeps track of the context of the conversation, and an emotion network (eGRU) keeps track of the emotion representation of a particular participant. Finally, a pairwise attention mechanism, which uses the emotion representations of all modalities at a particular timestamp, is used to leverage the important modalities and the relevant combinations of modalities for emotion or sentiment prediction at that timestamp.
Every utterance has three independent feature representations (text, audio, and video features) $t_t \in \mathbb{R}^{D_t}$, $a_t \in \mathbb{R}^{D_a}$, and $v_t \in \mathbb{R}^{D_v}$. Each of these feature representations is treated and operated on independently until the pairwise attention mechanism. The model consists of two GRUs (a state GRU and an emotion GRU) for every modality and participant, and one context GRU per modality that is common to all participants in the conversation (if p is the number of participants and m is the number of modalities, the model has 2mp + m GRUs). The inputs at the current timestamp, along with the previous state representations, are used to update each of the three states (interlocutor, context, and emotion), of which the emotion states are operated on by the fusion mechanism to produce the emotion class or sentiment score. The state updates for a particular timestamp are described in Figure 1.
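The GRU count stated above can be checked with a few lines of Python. This is only an illustrative sketch; the function name `count_grus` is a hypothetical helper and not part of the model.

```python
def count_grus(num_participants: int, num_modalities: int) -> int:
    """Number of GRUs in the model: one state GRU and one emotion GRU per
    (modality, participant) pair, plus one shared context GRU per modality,
    i.e. 2mp + m."""
    per_participant = 2 * num_modalities * num_participants  # sGRU + eGRU
    shared_context = num_modalities                           # cGRU
    return per_participant + shared_context


# Dyadic conversation with text, audio, and video.
print(count_grus(num_participants=2, num_modalities=3))  # 15
```

For a dyadic conversation with three modalities this gives 12 state and emotion GRUs plus 3 context GRUs.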
Each GRU plays a particular role, attempting to encode a different aspect of the utterance in the conversation. The role and purpose of each GRU are described below.
Context GRU (cGRU) - The context GRU for each modality aims to capture the context of the conversation by jointly encoding the utterance representation of that modality at timestamp t ($t_t \in \mathbb{R}^{D_t}$, $a_t \in \mathbb{R}^{D_a}$, or $v_t \in \mathbb{R}^{D_v}$) and the speaker state GRU output of that modality from the previous timestamp. This accounts for inter-speaker and inter-utterance dependencies to produce an effective context representation. The current utterance $(t_t, a_t, v_t)$ changes the state of that speaker from $(s_t^{t}, s_t^{a}, s_t^{v})$ to $(s_{t+1}^{t}, s_{t+1}^{a}, s_{t+1}^{v})$. To capture this change in context we use a GRU cell cGRU with output size $D_c$ for each modality, which takes the previous context vector as its hidden state and the concatenation of $t_t$, $a_t$, $v_t$ with $(s_t^{t}, s_t^{a}, s_t^{v})$ as its input (Eqs. (1)-(3)). Here $D_c$ is the size of the context vectors $c_{t+1}^{t}$, $c_{t+1}^{a}$, $c_{t+1}^{v}$; $D_t$, $D_a$, and $D_v$ are the sizes of the utterance representations of text, audio, and video respectively; $\oplus$ represents the concatenation operation; $D_s^{t}$, $D_s^{a}$, $D_s^{v}$ are the sizes of the state vectors $(s_{t+1}^{t}, s_{t+1}^{a}, s_{t+1}^{v})$; and all GRU weight shapes are such that they produce the expected output shapes given the input shapes.
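State GRU (sGRU) - The network keeps track of the participants involved in a conversation by employing $P \times M$ sGRUs, where P is the number of participants in the conversation and M is the number of available modalities. The sGRU associated with a participant outputs fixed-size vectors which serve as encodings of the interlocutor state, and which are used directly both for emotion and sentiment prediction and for updating the context vectors. All state vectors are initialized to null at the first timestamp. For a timestamp t, the state vector of participant p and modality $m \in \{t, a, v\}$ is updated using the input feature representation of that modality and attention over all the context vectors until that timestamp. The attention mechanism over the context vectors is described in Fig. 2 and by the following equations:
$$\beta = m_t^{\top} W_\alpha \, [c_1^{m}, c_2^{m}, \ldots, c_{t-1}^{m}] \quad (4)$$
$$\alpha = \mathrm{softmax}(\beta) \quad (5)$$
$$\bar{c}_t^{m} = \alpha \, [c_1^{m}, c_2^{m}, \ldots, c_{t-1}^{m}]^{\top} \quad (6)$$
where $m \in \{t, a, v\}$, $m_t \in \mathbb{R}^{D_m}$ is the utterance representation of modality m at timestamp t, $W_\alpha \in \mathbb{R}^{D_m \times D_c}$, $\alpha \in \mathbb{R}^{(t-1)}$, and $\bar{c}_t^{m} \in \mathbb{R}^{D_c}$. In Eq. (4), we calculate attention scores over the context representations of all previous utterances, highlighting the relative importance of each previous context vector to $m_t$. A softmax layer is applied in Eq. (5) to amplify this relative importance, and in Eq. (6) the final output of attention over context is calculated by pooling the previous context vectors with $\alpha$.
We then employ $\bar{c}_t^{t}$, $\bar{c}_t^{a}$, $\bar{c}_t^{v}$ to update $s_t^{t}$, $s_t^{a}$, $s_t^{v}$ to $s_{t+1}^{t}$, $s_{t+1}^{a}$, $s_{t+1}^{v}$ on the basis of the incoming utterance $\{t_t, a_t, v_t\}$ and the attended context representations, using GRU cells $\mathrm{sGRU}^{t}$, $\mathrm{sGRU}^{a}$, and $\mathrm{sGRU}^{v}$ of output sizes $D_s^{t}$, $D_s^{a}$, $D_s^{v}$ respectively:
$$s_{t+1}^{t} = \mathrm{sGRU}^{t}(s_t^{t}, (t_t \oplus \bar{c}_t^{t})) \quad (7)$$
$$s_{t+1}^{a} = \mathrm{sGRU}^{a}(s_t^{a}, (a_t \oplus \bar{c}_t^{a})) \quad (8)$$
$$s_{t+1}^{v} = \mathrm{sGRU}^{v}(s_t^{v}, (v_t \oplus \bar{c}_t^{v})) \quad (9)$$
where $D_s^{t}$, $D_s^{a}$, $D_s^{v}$ are the sizes of the state vectors $s_{t+1}^{t}$, $s_{t+1}^{a}$, $s_{t+1}^{v}$; $D_t$, $D_a$, and $D_v$ are the sizes of the utterance representations of text, audio, and video respectively; $\oplus$ represents the concatenation operation; and all GRU weight shapes are such that they produce the expected output shapes given the input shapes.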
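The attention over previous context vectors described for the state GRU (score, softmax, pool) can be sketched in NumPy as follows. This is a minimal illustration under assumed shapes, not the authors' implementation: the names `attend_over_context`, `m_t`, `W_alpha`, and `past_contexts` are hypothetical, the weights are random, and the previous context vectors are stored as rows.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def attend_over_context(m_t, W_alpha, past_contexts):
    """Attention of one utterance representation over previous context vectors.

    m_t:           (D_m,)        utterance representation of one modality
    W_alpha:       (D_m, D_c)    attention weight matrix
    past_contexts: (t - 1, D_c)  previous context vectors, one per row
    """
    scores = m_t @ W_alpha @ past_contexts.T  # (t - 1,) raw attention scores
    alpha = softmax(scores)                   # relative importance of each context
    pooled = alpha @ past_contexts            # (D_c,) attention-pooled context
    return alpha, pooled


rng = np.random.default_rng(0)
D_m, D_c, num_past = 4, 5, 3
m_t = rng.normal(size=D_m)
W_alpha = rng.normal(size=(D_m, D_c))
past_contexts = rng.normal(size=(num_past, D_c))

alpha, pooled = attend_over_context(m_t, W_alpha, past_contexts)
print(alpha.shape, pooled.shape)  # (3,) (5,)
```

The pooled vector has the size of a single context vector regardless of how many utterances precede the current one, which is what allows it to be concatenated with the utterance representation as a fixed-size state GRU input.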
The intended purpose of using the concatenated utterance and context representations as the input to $\mathrm{sGRU}^{t}$, $\mathrm{sGRU}^{a}$, $\mathrm{sGRU}^{v}$ is to model the dependency of the speaker state on the context of the conversation as understood from the utterances until that point, along with the utterance representation at that point. The output of the sGRU for modality m and timestamp t serves as an encoding of the speaker state as conveyed by modality m at time t.
Emotion GRU (eGRU) - The emotion GRU serves as the decoder for the encoding produced by the associated state GRU. It uses the eGRU output of the previous timestamp and the encoding provided by the sGRU to produce an emotion or sentiment representation, which is then used by the pairwise attention mechanism to produce the relevant output for prediction. At timestamp (t+1) the emotion vectors are updated as:
$$e_{t+1}^{t} = \mathrm{eGRU}^{t}(e_t^{t}, s_{t+1}^{t}) \quad (10)$$
$$e_{t+1}^{a} = \mathrm{eGRU}^{a}(e_t^{a}, s_{t+1}^{a}) \quad (11)$$
$$e_{t+1}^{v} = \mathrm{eGRU}^{v}(e_t^{v}, s_{t+1}^{v}) \quad (12)$$
where $D_e^{t}$, $D_e^{a}$, $D_e^{v}$ are the sizes of the emotion vectors $e_{t+1}^{t}$, $e_{t+1}^{a}$, $e_{t+1}^{v}$; $D_t$, $D_a$, and $D_v$ are the sizes of the utterance representations of text, audio, and video respectively; $D_s^{t}$, $D_s^{a}$, $D_s^{v}$ are the sizes of the state vectors $s_{t+1}^{t}$, $s_{t+1}^{a}$, $s_{t+1}^{v}$; and all GRU weight shapes are such that they produce the expected output shapes given the input shapes.
The resulting emotion vector can be used for both sentiment and emotion prediction.
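The encoder-decoder relationship between the state GRU and the emotion GRU described above can be sketched for a single modality and participant. This is a hedged illustration only: `GRUCell` is a hand-written minimal cell with random, untrained weights, and `state_and_emotion_step` and all sizes are hypothetical choices rather than values from the paper.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GRUCell:
    """Minimal GRU cell with randomly initialised (untrained) weights."""

    def __init__(self, input_size, hidden_size, rng):
        # Index 0: update gate, 1: reset gate, 2: candidate state.
        self.W = rng.normal(scale=0.1, size=(3, hidden_size, input_size))
        self.U = rng.normal(scale=0.1, size=(3, hidden_size, hidden_size))
        self.b = np.zeros((3, hidden_size))

    def __call__(self, h_prev, x):
        z = sigmoid(self.W[0] @ x + self.U[0] @ h_prev + self.b[0])
        r = sigmoid(self.W[1] @ x + self.U[1] @ h_prev + self.b[1])
        h_cand = np.tanh(self.W[2] @ x + self.U[2] @ (r * h_prev) + self.b[2])
        return (1.0 - z) * h_prev + z * h_cand


def state_and_emotion_step(m_t, context, s_prev, e_prev, s_gru, e_gru):
    """One timestamp for one modality: the state GRU encodes the utterance
    concatenated with the attended context, and the emotion GRU decodes it."""
    s_next = s_gru(s_prev, np.concatenate([m_t, context]))
    e_next = e_gru(e_prev, s_next)
    return s_next, e_next


rng = np.random.default_rng(0)
D_m, D_c, D_s, D_e = 6, 5, 4, 3  # hypothetical sizes
s_gru = GRUCell(D_m + D_c, D_s, rng)
e_gru = GRUCell(D_s, D_e, rng)

# State and emotion vectors start at zero, as at the first timestamp.
s_next, e_next = state_and_emotion_step(
    rng.normal(size=D_m), rng.normal(size=D_c),
    np.zeros(D_s), np.zeros(D_e), s_gru, e_gru)
print(s_next.shape, e_next.shape)  # (4,) (3,)
```

In the full model one such pair of cells would exist per modality and participant, with the emotion outputs of all modalities passed on to the fusion mechanism.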
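Pairwise Attention Mechanism - The emotion GRUs produce M vectors for each timestamp (where M is the number of modalities available). Pairwise attention is then applied over these vectors to produce the final prediction output. In our case pairwise attention is calculated over the three pairs $(v, a)$, $(t, v)$, and $(t, a)$. Let $E^{m} \in \mathbb{R}^{n \times D}$, $m \in \{t, a, v\}$, denote the emotion representations of modality m for the n utterances of a conversation, where D is the common size of the modality representations entering the fusion mechanism. Pairwise attention for $(v, a)$ is calculated as follows:
$$M_1 = E^{v} (E^{a})^{\top}, \quad M_2 = E^{a} (E^{v})^{\top} \quad (13)$$
$$N_1 = \mathrm{softmax}(M_1), \quad N_2 = \mathrm{softmax}(M_2) \quad (14)$$
$$O_1 = N_1 E^{a}, \quad O_2 = N_2 E^{v} \quad (15)$$
$$A_1 = O_1 \odot E^{v}, \quad A_2 = O_2 \odot E^{a} \quad (16)$$
$$\mathrm{PA}(v, a) = A_1 \oplus A_2 \quad (17)$$
where $M_1, M_2 \in \mathbb{R}^{n \times n}$; $N_1, N_2 \in \mathbb{R}^{n \times n}$; $O_1, O_2 \in \mathbb{R}^{n \times D}$; $\mathrm{PA}(v, a) \in \mathbb{R}^{n \times 2D}$; $\odot$ represents multiplicative gating; and $\oplus$ represents concatenation. In Eq. (13) a pair of matching matrices $M_1, M_2$ is computed over the two representations, accounting for cross-modality information. In Eq. (14) a softmax over the matching matrices gives the relative importance of the features in both modalities, followed by the calculation of modality-wise attentive representations in Eq. (15). Finally, a multiplicative gating function following Dhingra et al. (2016) is computed between the multi-modal utterance-specific representations of each individual modality and the other modality (Eq. (16)). This element-wise matrix multiplication assists in attending to the important components of multiple modalities and utterances. The attention matrices are then concatenated to produce $\mathrm{PA}(v, a)$ (Eq. (17)).
Prediction Layer - The prediction layer varies based on whether a sentiment or an emotion prediction is expected. In both cases, the three pairwise attention outputs at that timestamp are concatenated along with the emotion GRU outputs at that timestamp ($e_t^{t}$, $e_t^{a}$, and $e_t^{v}$), and the concatenated layer is passed through a fully connected layer followed by a softmax or tanh layer, depending on the nature of the expected prediction. For sentiment prediction between -1 and +1 at timestamp t, the output layer is:
$$x_t = \mathrm{PA}_t(v, a) \oplus \mathrm{PA}_t(t, v) \oplus \mathrm{PA}_t(t, a) \oplus e_t^{t} \oplus e_t^{a} \oplus e_t^{v}$$
$$\hat{y}_t = \tanh(W^{\top} x_t)$$
where $\mathrm{PA}_t(\cdot, \cdot)$ denotes the row of the pairwise attention output at timestamp t and $W \in \mathbb{R}^{9D \times 1}$. For emotion prediction we use a fully connected layer along with a final softmax layer to calculate the 6 emotion class probabilities from the hidden representation $h_t$:
$$h_t = \mathrm{ReLU}(W_h x_t + b_h)$$
$$P_t = \mathrm{softmax}(W_{smax} h_t + b_{smax})$$
where $W_h \in \mathbb{R}^{D_h \times 9D}$; $b_h \in \mathbb{R}^{D_h}$; $W_{smax} \in \mathbb{R}^{c \times D_h}$; $b_{smax} \in \mathbb{R}^{c}$; $P_t \in \mathbb{R}^{c}$, with c = 6 emotion classes.
Loss functions - We used categorical cross-entropy with L2 regularization as the loss function for emotion prediction, and mean squared error (MSE) as the loss function for sentiment prediction.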
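The pairwise attention steps described in this section (matching matrices, softmax, attentive representations, multiplicative gating, concatenation) can be sketched in NumPy for one pair of modalities. This is a minimal sketch under assumptions: each modality is taken to be a matrix with one row per utterance and a common feature size, the softmax is applied row-wise, and the names `pairwise_attention`, `X`, and `Y` are hypothetical.

```python
import numpy as np


def row_softmax(M: np.ndarray) -> np.ndarray:
    """Numerically stable softmax applied independently to each row."""
    shifted = M - M.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def pairwise_attention(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Bi-modal attention between two (num_utterances, D) representations."""
    M1 = X @ Y.T                     # matching matrix, X against Y
    M2 = Y @ X.T                     # matching matrix, Y against X
    N1, N2 = row_softmax(M1), row_softmax(M2)
    O1 = N1 @ Y                      # attentive representation for X
    O2 = N2 @ X                      # attentive representation for Y
    A1 = O1 * X                      # multiplicative gating
    A2 = O2 * Y
    return np.concatenate([A1, A2], axis=1)  # (num_utterances, 2 * D)


rng = np.random.default_rng(0)
num_utterances, D = 4, 3
video = rng.normal(size=(num_utterances, D))
audio = rng.normal(size=(num_utterances, D))

fused = pairwise_attention(video, audio)
print(fused.shape)  # (4, 6)
```

With three modalities, three such outputs of width 2D plus the three emotion vectors of width D give the 9D-wide input to the prediction layer.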
We used the Adam optimizer (Kingma and Ba 2014), a mini-batch gradient descent-based method, to train both networks. Hyperparameters were optimized using grid search.
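The two training objectives named above can be written out in plain Python. This is an illustrative sketch, not the training code: `cross_entropy_with_l2`, `mean_squared_error`, the flat `weights` list, and the regularization strength `l2_lambda` are hypothetical stand-ins.

```python
import math


def cross_entropy_with_l2(probs, labels, weights, l2_lambda):
    """Mean categorical cross-entropy plus an L2 penalty on the weights.

    probs:   list of per-class probability lists, one per utterance
    labels:  list of true class indices
    weights: flat list of model weights (hypothetical stand-in)
    """
    nll = -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)
    penalty = l2_lambda * sum(w * w for w in weights)
    return nll + penalty


def mean_squared_error(predictions, targets):
    """MSE between predicted and true sentiment scores."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))  # 2.0
```

The first function would apply to the six-way emotion classifier and the second to the continuous sentiment output.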

Datasets
We evaluate our model using two benchmark datasets, namely the CMU Multi-modal Opinion-level Sentiment Intensity (CMU-MOSI) corpus (Zadeh et al., 2016) and CMU-MOSEI (Zadeh et al., 2018c).
CMU-MOSEI - In the CMU-MOSEI dataset, labels are in a continuous range of -3 to +3 and are accompanied by an emotion label, one of six emotions. In this work we also project the instances of CMU-MOSEI onto a two-class classification setup, where values ≥ 0 signify positive sentiment and values < 0 signify negative sentiment. We call the resulting metric A2 accuracy (accuracy with 2 classes). Along with this we also report results for continuous-range prediction between -3 and +3, and for emotion prediction with the 6 emotion labels for each utterance in CMU-MOSEI. We use A2 as a metric to be consistent with previously published work on the CMU-MOSEI dataset (Ghosal et al. 2018; Zadeh et al., 2018c). CMU-MOSEI has further been used for the other comprehensive experiments due to its larger size and easier feature extraction.
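The two-class projection used for A2 accuracy can be made concrete with a short sketch. The function names `to_two_class` and `a2_accuracy` are hypothetical; only the thresholding rule (≥ 0 positive, < 0 negative) comes from the text.

```python
def to_two_class(score: float) -> int:
    """Project a sentiment score in [-3, +3] to 1 (positive) or 0 (negative)."""
    return 1 if score >= 0 else 0


def a2_accuracy(predicted_scores, true_scores):
    """Fraction of utterances whose predicted and true scores share a class."""
    matches = sum(
        to_two_class(p) == to_two_class(t)
        for p, t in zip(predicted_scores, true_scores)
    )
    return matches / len(true_scores)


print([to_two_class(s) for s in (-3.0, -0.2, 0.0, 2.4)])  # [0, 0, 1, 1]
```

Note that a score of exactly 0 falls in the positive class under this rule.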

Unimodal Feature Extraction
CMU-MOSEI - We use the CMU-Multi-modal Data SDK (Zadeh et al., 2018a) for feature extraction. For the MOSEI dataset, sentiment label-level features were provided, where text features were extracted using GloVe embeddings, visual features using Facets, and acoustic features using OpenSMILE. We then compute the average of the sentiment label-level features in an utterance to obtain the utterance-level features. For each sentiment label-level feature, the dimension of the feature vector is set to 300 (text), 35 (visual), and 384 (acoustic).
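The averaging step that turns finer-grained feature vectors into a single utterance-level vector can be sketched as follows. This is an assumption-laden illustration: `utterance_level_features` is a hypothetical helper, and the input is simply a list of equal-length feature vectors belonging to one utterance.

```python
def utterance_level_features(label_level_features):
    """Average a list of equal-length feature vectors element-wise."""
    count = len(label_level_features)
    return [sum(column) / count for column in zip(*label_level_features)]


# Three hypothetical 35-dimensional visual feature vectors for one utterance.
visual = [[float(i + j) for j in range(35)] for i in range(3)]
averaged = utterance_level_features(visual)
print(len(averaged))  # 35
```

The output keeps the per-modality dimensionality (300, 35, or 384), so utterances of any length map to fixed-size inputs.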
CMU-MOSI - In contrast, for the MOSI dataset we use the utterance-level features provided by Poria et al. (2017b). These utterance-level features represent the outputs of a convolutional neural network (Karpathy et al., 2014)

Experiments
We evaluate our proposed approach on CMU-MOSI (test data) and CMU-MOSEI (dev data). On CMU-MOSI we report accuracy and F1 score, and on CMU-MOSEI we report accuracy, F1 score, mean absolute error (MAE), Pearson score (r), and accuracies on the emotion labels. Due to the lack of speaker information in CMU-MOSI we were not able to use the CMU-Multi-modal Data SDK for sentiment label extraction, and therefore could not evaluate our approach on CMU-MOSI on mean absolute error and Pearson score. We use the final output from the emotion GRU, with sizes $D_e^{t}$, $D_e^{a}$, $D_e^{v}$ for the text, audio, and video modalities respectively. A dense layer with the same size as the emotion GRU output is used after the emotion GRU for the fusion mechanism, so that all of the modalities have the same size. We set dropout = 0.5 for both datasets as a measure of regularization. We employ the ReLU activation function in the dense layers, and softmax activation in the final classification layer. For training the network, we set the batch size to 32 for CMU-MOSI and 128 for MOSEI. We train for up to 60 epochs for all metrics; classification results are reported for the epoch showing maximum F1 score, and regression results for the epoch showing maximum Pearson score.
(Table legend, for CMU-MOSI (Zadeh et al., 2016) and MOSEI (Zadeh et al., 2018c): Tr: train set; Dv: development set; Ts: test set.)
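The regression metrics and the epoch-selection rule described above can be sketched in plain Python. This is only an illustration: `mean_absolute_error`, `pearson_r`, and `best_epoch` are hypothetical helpers, and the example values are made up for demonstration rather than taken from the experiments.

```python
import math


def mean_absolute_error(predictions, targets):
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


def pearson_r(predictions, targets):
    """Pearson correlation coefficient between predictions and targets."""
    n = len(targets)
    mean_p = sum(predictions) / n
    mean_t = sum(targets) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predictions, targets))
    var_p = sum((p - mean_p) ** 2 for p in predictions)
    var_t = sum((t - mean_t) ** 2 for t in targets)
    return cov / math.sqrt(var_p * var_t)


def best_epoch(scores_per_epoch):
    """Index of the epoch with the highest score (F1 or Pearson r)."""
    return max(range(len(scores_per_epoch)), key=scores_per_epoch.__getitem__)


# Made-up per-epoch validation scores, for illustration only.
print(best_epoch([0.10, 0.50, 0.30]))  # 1
```

Reporting all metrics at the epoch selected by one score, rather than the best value of each metric independently, keeps the reported numbers tied to a single set of model weights.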
Results have also been reported for usage of two of the three available modalities. Unimodal performance has not been reported as the focus of the paper is the effective usage of multimodal data. In a unimodal setting the model would not be using the fusion mechanism and the output would be equivalent to having a few dense layers after the emotion GRU to directly output the final prediction.
F1 scores were not reported by the models used for comparison, but are reported here to allow additional comparison with any prior models using the CMU-MOSI dataset.
Table 2 - Multilogue-Net performance on CMU-MOSI in comparison with the current and previous state-of-the-art on the dataset. A2 indicates accuracy with 2 classes, and F1 indicates F1 score.
Table 2 shows the performance of Multilogue-Net on the CMU-MOSI dataset, compared to the current state-of-the-art (Ghosal et al. 2018) and the previous state-of-the-art (Poria et al. 2017). Our model consistently outperforms the previous state-of-the-art, but outperforms the current state-of-the-art on only one of the subsets of the modalities. In comparison to MMMU-BA, our model also falls short in multimodal performance. We theorize that the model underperforms because of the low number of training examples, as CMU-MOSI consists of only 93 conversations, out of which 62 were used for training. The proposition that performance suffers due to a lack of training examples is backed by the results on MOSEI (shown in a comparative setting in Tables 3 and 4), where the model consistently outperforms the current state-of-the-art.

As shown in Table 2, highest performance is achieved in a tri-modal setting, as expected, with text + video and text + audio having F1 scores and accuracies slightly lower. Video + audio on the other hand shows significantly lower performance, suggesting text is key in emotion and context understanding. Our understanding of why video + audio performance betters MMMU-BA is the usage of text for building context, strengthening the intuition that text is key in multimodal understanding.
The comparative results for MOSEI are shown in Tables 3 and 4, where it is evident that Multilogue-Net consistently outperforms the current state-of-the-art models on the majority of the metrics. Since the MOSEI dataset was introduced recently, extensive experimentation has not yet been done on it, with the results on MOSEI mostly coming only from MMMU-BA (Ghosal et al. 2018) and Graph-MFN (Zadeh et al., 2018c) [...] further proving the importance of text in a multimodal setting. We theorize that Multilogue-Net works best on most metrics on MOSEI because of its effective usage of the available modalities, its better capture of context, and the larger corpus of conversational data available for training.
Table 3 - Multilogue-Net performance on MOSEI sentiment labels compared to previous state-of-the-art models on regression and accuracy metrics. For all metrics apart from MAE, higher values are better; for MAE, lower values are better.
Dependency on preceding and future utterances: One of the crucial components of Multilogue-Net is its three context GRUs, which lead to a much more focused attention mechanism that concentrates on the most relevant utterances and modalities, leading to fewer misclassifications. Also, using all GRUs bidirectionally takes into account immediate and distant future emotions, building a correlation between the final emotion in a segment of a conversation and the current utterance, which helps produce better predictions.
Analysis of the attention mechanism: The attention mechanism used as the fusion mechanism has been analysed by Ghosal et al. (2018), who originally proposed the pairwise attention mechanism. In their analysis, certain utterances were found to focus on particular modalities, and it was observed that the textual features of the first few utterances were relatively the most helpful.
The utility of the fusion mechanism has also been described through experiments with and without it, where the model without attention consistently underperforms relative to the model with attention, with a statistical t-test showing that the improvements due to the fusion mechanism are statistically significant. Pairwise attention was also shown to be more effective than combinations of self-attentions on single modalities, and than triplet-wise attention (3 modalities at a time). Table 5 shows the results of Multilogue-Net with and without the attention mechanism, further supporting the effectiveness of pairwise attention as the fusion mechanism of choice.
Error Analysis: On performing error analysis we observe that, in the classification problems, most errors on sentiment labels occur due to the underrepresentation of 0's in MOSI, and most misclassifications on emotion labels in MOSEI happen between related emotions (Disgust and Sadness). The dimensional predictions on MOSEI sentiment have been observed to perform particularly badly on values around 0, with slightly positive sentiment commonly mistaken for slightly negative sentiment and vice versa.
Table 5 - Multilogue-Net performance on MOSEI with and without the pairwise attention mechanism. When the pairwise attention mechanism is not used, the modality representations are concatenated before the prediction layer.

Conclusion
In this paper we have presented an RNN architecture for multi-modal sentiment analysis and emotion detection in conversation. In contrast to the current state-of-the-art, our model focuses on effectively capturing the context of a conversation and treats each modality independently, taking into account the information a particular modality is capable of holding. Our model outperforms the current state-of-the-art on most metrics on MOSEI, and has extensively focused on the multi-modal setting for sentiment analysis and emotion detection.
The model can be further extended to increase both the number of modalities (e.g. IR imaging data or polygraph data) and the number of participants in the conversation. Due to the lack of datasets containing these extensions with emotion or sentiment labels, we leave this to future work.
Table 4 - Multilogue-Net performance on MOSEI emotion labels compared with that of Graph-MFN on weighted accuracy and F1 score. MOSEI emotion label results were presented by only one model, and comprehensive results have not been published for the same.