Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph

Analyzing human multimodal language is an emerging area of research in NLP. Intrinsically this language is multimodal (heterogeneous), sequential and asynchronous; it consists of the language (words), visual (expressions) and acoustic (paralinguistic) modalities, all in the form of asynchronous coordinated sequences. From a resource perspective, there is a genuine need for large-scale datasets that allow for in-depth studies of this form of language. In this paper we introduce CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI), the largest dataset for sentiment analysis and emotion recognition to date. Using data from CMU-MOSEI and a novel multimodal fusion technique called the Dynamic Fusion Graph (DFG), we conduct experiments to investigate how modalities interact with each other in human multimodal language. Unlike previously proposed fusion techniques, DFG is highly interpretable and achieves competitive performance compared to the previous state of the art.


Introduction
Theories of language origin identify the combination of language and nonverbal behaviors (the visual and acoustic modalities) as the prime form of communication utilized by humans throughout evolution (Müller, 1866). In natural language processing, this form of language is regarded as human multimodal language. Modeling multimodal language has recently become a central research direction in both NLP and multimodal machine learning (Pham et al., 2018; Tsai et al., 2018; Hazarika et al., 2018; Chen et al., 2017). Studies strive to model the dual dynamics of multimodal language: intra-modal dynamics (dynamics within each modality) and cross-modal dynamics (dynamics across different modalities). However, from a resource perspective, previously proposed multimodal language datasets have severe shortcomings in the following aspects. Diversity in the training samples: diversity in training samples is crucial for comprehensive multimodal language studies due to the complexity of the underlying distribution. This complexity is rooted in the variability of intra-modal and cross-modal dynamics for the language, visual and acoustic modalities (Rajagopalan et al., 2016). Previously proposed datasets for multimodal language are generally small in size due to the difficulties associated with data acquisition and the costs of annotation. Variety in the topics: variety in topics opens the door to generalizable studies across different domains. Models trained on only a few topics generalize poorly, as language and nonverbal behaviors tend to change based on the impression of the topic on speakers' internal mental state. Diversity of speakers: much like writing styles, speaking styles are highly idiosyncratic. Training models on only a few speakers can lead to degenerate solutions where models learn the identity of speakers as opposed to a generalizable model of multimodal language (Wang et al., 2016). Variety in annotations: having multiple labels to predict allows for studying the relations between labels. Another positive aspect of having a variety of labels is that it allows for multi-task learning, which has shown excellent performance in past research.
Our first contribution in this paper is to introduce the largest dataset of multimodal sentiment and emotion recognition, called CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI). CMU-MOSEI contains 23,453 annotated video segments from 1,000 distinct speakers and 250 topics. Each video segment contains a manual transcription aligned with the audio at the phoneme level. All the videos are gathered from online video sharing websites. The dataset is currently a part of the CMU Multimodal Data SDK and is freely available to the scientific community through GitHub.
Our second contribution is an interpretable fusion model called the Dynamic Fusion Graph (DFG), used to study the nature of cross-modal dynamics in multimodal language. DFG contains built-in efficacies that are directly related to how modalities interact. These efficacies are visualized and studied in detail in our experiments. Aside from interpretability, DFG achieves superior performance compared to previously proposed models for multimodal sentiment and emotion recognition on CMU-MOSEI.

Background
In this section we compare the CMU-MOSEI dataset to previously proposed datasets for modeling multimodal language. We then describe the baselines and recent models for sentiment analysis and emotion recognition.

Comparison to other Datasets
We compare CMU-MOSEI to an extensive pool of datasets for sentiment analysis and emotion recognition. The following datasets include a combination of language, visual and acoustic modalities as their input data.

Multimodal Datasets
CMU-MOSI (Zadeh et al., 2016b) is a collection of 2,199 opinion video clips, each annotated with sentiment in the range [-3,3]. CMU-MOSEI is the next generation of CMU-MOSI. ICT-MMMO (Wöllmer et al., 2013) consists of online social review videos annotated at the video level for sentiment. YouTube (Morency et al., 2011) contains videos from the social media website YouTube that span a wide range of product reviews and opinion videos. MOUD (Perez-Rosas et al., 2013) consists of product review videos in Spanish. Each video consists of multiple segments labeled as displaying positive, negative or neutral sentiment. IEMOCAP (Busso et al., 2008) consists of 151 videos of recorded dialogues, with 2 speakers per session, for a total of 302 videos across the dataset. Each segment is annotated for the presence of 9 emotions (angry, excited, fear, sad, surprised, frustrated, happy, disappointed and neutral) as well as valence, arousal and dominance.

Language Datasets
Stanford Sentiment Treebank (SST) (Socher et al., 2013) includes fine-grained sentiment labels for phrases in the parse trees of sentences collected from movie review data. While SST has a larger pool of annotations, we only consider the root-level annotations for comparison. Cornell Movie Review (Pang et al., 2002) is a collection of 2,000 movie-review documents and sentences labeled with respect to their overall sentiment polarity or subjective rating. The Large Movie Review dataset (Maas et al., 2011) contains text from highly polar movie reviews. Sanders Tweets Sentiment (STS) consists of 5,513 hand-classified tweets, each classified with respect to one of four topics: Microsoft, Apple, Twitter, and Google.

Visual and Acoustic Datasets
The Vera am Mittag (VAM) corpus consists of recordings from the TV talk show "Vera am Mittag" (Grimm et al., 2008). This audio-visual data is labeled on a continuous-valued scale for three emotion primitives: valence, activation and dominance. VAM-Audio and VAM-Faces are subsets that contain only acoustic and only visual inputs respectively. RECOLA (Ringeval et al., 2013) consists of 9.5 hours of audio, visual, and physiological (electrocardiogram and electrodermal activity) recordings of online dyadic interactions. Mimicry (Bilakhia et al., 2015) consists of audiovisual recordings of human interactions in two situations: while discussing a political topic and while playing a role-playing game. AFEW (Dhall et al., 2012, 2015) is a dynamic temporal facial expressions corpus consisting of clips extracted from movies in close-to-real-world conditions. A detailed comparison of CMU-MOSEI to the datasets in this section is presented in Table 1. CMU-MOSEI has a longer total duration as well as a larger number of data points in total. Furthermore, CMU-MOSEI has a larger variety in the number of speakers and topics. It has all three modalities provided, as well as annotations for both sentiment and emotions.

Baseline Models
Modeling multimodal language has been the subject of studies in NLP and multimodal machine learning. Notable approaches are listed as follows and indicated with a symbol for reference in the Experiments and Discussion section (Section 5).
# MFN (Memory Fusion Network) (Zadeh et al., 2018a) synchronizes multimodal sequences using a multi-view gated memory that stores intra-view and cross-view interactions through time.
∎ MARN (Multi-attention Recurrent Network) (Zadeh et al., 2018b) models intra-modal and multiple cross-modal interactions by assigning multiple attention coefficients. Intra-modal and cross-modal interactions are stored in a hybrid LSTM memory component.
* TFN (Tensor Fusion Network) (Zadeh et al., 2017) models inter-modal and intra-modal interactions by creating a multi-dimensional tensor that captures unimodal, bimodal and trimodal interactions.
◇ MV-LSTM (Multi-View LSTM) (Rajagopalan et al., 2016) is a recurrent model that designates regions inside an LSTM to different views of the data.
§ EF-LSTM (Early Fusion LSTM) concatenates the inputs from different modalities at each time step and uses that as the input to a single LSTM (Hochreiter and Schmidhuber, 1997; Graves et al., 2013; Schuster and Paliwal, 1997). In the case of unimodal models, EF-LSTM refers to a single LSTM.

CMU-MOSEI Dataset
Understanding expressed sentiment and emotions are two crucial factors in human multimodal language. We introduce a novel dataset for multimodal sentiment and emotion recognition called CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI). In the following subsections, we first explain the details of the CMU-MOSEI data acquisition, followed by details of annotation and feature extraction.

Data Acquisition
Social multimedia presents a unique opportunity for acquiring large quantities of data from various speakers and topics. Users of social multimedia websites often post their opinions in the form of monologue videos: videos with only one person in front of the camera discussing a certain topic of interest. Each video inherently contains three modalities: language in the form of spoken text, visual via perceived gestures and facial expressions, and acoustic through intonations and prosody.
During our automatic data acquisition process, videos from YouTube are analyzed for the presence of one speaker in the frame using face detection to ensure the video is a monologue. We limit the videos to setups where the speaker's attention is exclusively towards the camera by rejecting videos that have moving cameras (such as cameras mounted on bikes or selfie recordings while walking). We use a diverse set of 250 frequently used topics in online videos as the seed for acquisition. We restrict the number of videos acquired from each channel to a maximum of 10. This resulted in discovering 1,000 identities from YouTube. The definition of an identity is a proxy for the number of channels, since accurate identification requires quadratic manual annotations, which is infeasible for a high number of speakers. Furthermore, we limited the videos to those with manual and properly punctuated transcriptions provided by the uploader. The final pool of acquired videos included 5,000 videos, which were then manually checked for quality of video, audio and transcript by 14 expert judges over three months. The judges also annotated each video for gender and confirmed that each video is an acceptable monologue. A set of 3,228 videos remained after manual quality inspection. We also performed automatic checks on the quality of video and transcript, discussed in Section 3.3, using facial feature extraction confidence and forced alignment confidence. Furthermore, we balanced the gender in the dataset using the data provided by the judges (57% male to 43% female). This constitutes the final set of raw videos in CMU-MOSEI. The topics covered in the final set of videos are shown in Figure 1. The videos were tokenized into sentences using the punctuation markers manually provided in the transcripts. Due to the high quality of the transcripts, using punctuation markers gave better sentence quality than using the Stanford CoreNLP tokenizer (Manning et al., 2014). This was verified on a set of 20 random videos by two experts. After tokenization, a set of 23,453 sentences was chosen as the final set of sentences in the dataset. This was achieved by restricting each identity to contribute at least 10 and at most 50 sentences to the dataset. Table 2 shows high-level summary statistics of the CMU-MOSEI dataset.

Annotation
Annotation of CMU-MOSEI closely follows the annotation of CMU-MOSI (Zadeh et al., 2016a) and the Stanford Sentiment Treebank (Socher et al., 2013). Each sentence is annotated for sentiment on a [-3,3] Likert scale: [-3: highly negative, -2: negative, -1: weakly negative, 0: neutral, +1: weakly positive, +2: positive, +3: highly positive]. The Ekman emotions (Ekman et al., 1980) of {happiness, sadness, anger, fear, disgust, surprise} are annotated on a [0,3] Likert scale for the presence of emotion x: [0: no evidence of x, 1: weakly x, 2: x, 3: highly x]. The annotation was carried out by 3 crowdsourced judges from the Amazon Mechanical Turk platform. To avert implicitly biasing the judges and to capture the raw perception of the crowd, we avoided extensive annotation training and instead provided the judges with a five-minute training video on how to use the annotation system. All the annotations were carried out only by master workers with an approval rate higher than 98%, to assure high-quality annotations.
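To make the label space concrete, the sketch below shows one plausible way to bin the continuous [-3,3] sentiment annotations into the discrete class sets used in evaluation (e.g., 2-class and 7-class accuracy). The binning boundaries here are an illustrative assumption, not the dataset's official mapping:

```python
def sentiment_to_class(score, num_classes=7):
    """Map a [-3, 3] Likert sentiment score to a class index.

    7-class: round to the nearest integer and shift to 0..6.
    2-class: negative vs. non-negative.
    (Illustrative mapping; the exact binning used for the CMU-MOSEI
    benchmarks may differ.)
    """
    if num_classes == 7:
        return int(round(score)) + 3   # -3..3 -> 0..6
    if num_classes == 2:
        return int(score >= 0)         # 0 = negative, 1 = non-negative
    raise ValueError("unsupported class count")
```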
Figure 2 shows the distribution of sentiment and emotions in the CMU-MOSEI dataset. The sentiment distribution shows a slight shift in favor of positive sentiment, similar to the distributions of CMU-MOSI and SST. We believe this reflects an implicit bias of online opinions toward the positive, since it is also present in CMU-MOSI. The emotion histogram shows different prevalence for different emotions. The most common category is happiness, with more than 12,000 positive sample points. The least prevalent emotion is fear, with almost 1,900 positive sample points, which is still an acceptable number for machine learning studies.

Extracted Features
Data points in CMU-MOSEI come in video format with one speaker in front of the camera. The extracted features for each modality are as follows (for the other benchmarks we extract the same features). Language: all videos have manual transcriptions. GloVe word embeddings (Pennington et al., 2014) were used to extract word vectors from the transcripts. Words and audio are aligned at the phoneme level using the P2FA forced alignment model (Yuan and Liberman, 2008). Following this, the visual and acoustic modalities are aligned to the words by interpolation. Since the utterance duration of words in English is usually short, this interpolation does not lead to substantial information loss.
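As an illustration of this alignment step, the sketch below averages frame-level visual or acoustic features over each word's time span. The `align_to_words` helper and its interface are hypothetical; the CMU Multimodal Data SDK provides its own alignment implementation:

```python
import numpy as np

def align_to_words(frame_feats, frame_times, word_spans):
    """Average frame-level features over each word's (start, end) span.

    frame_feats: (num_frames, feat_dim) array of visual/acoustic features.
    frame_times: (num_frames,) timestamps of each frame in seconds.
    word_spans:  list of (start, end) times from forced alignment.
    Returns one feature vector per word. (Illustrative sketch only.)
    """
    aligned = []
    for start, end in word_spans:
        mask = (frame_times >= start) & (frame_times < end)
        if mask.any():
            aligned.append(frame_feats[mask].mean(axis=0))
        else:
            # No frame falls inside the span: fall back to the nearest frame.
            idx = np.argmin(np.abs(frame_times - (start + end) / 2))
            aligned.append(frame_feats[idx])
    return np.stack(aligned)
```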
Visual: frames are extracted from the full videos at 30Hz. The bounding box of the face is extracted using the MTCNN face detection algorithm (Zhang et al., 2016). We extract facial action units through the Facial Action Coding System (FACS) (Ekman et al., 1980). Extracting these action units allows for accurate tracking and understanding of facial expressions (Baltrušaitis et al., 2016). We also extract a set of six basic emotions purely from static faces using Emotient FACET (iMotions, 2017). MultiComp OpenFace (Baltrušaitis et al., 2016) is used to extract a set of 68 facial landmarks, 20 facial shape parameters, facial HoG features, head pose, head orientation and eye gaze. Finally, we extract face embeddings from commonly used facial recognition models such as DeepFace (Taigman et al., 2014), FaceNet (Schroff et al., 2015) and SphereFace (Liu et al., 2017).

Multimodal Fusion Study
From a linguistic perspective, understanding the interactions between the language, visual and acoustic modalities in multimodal language is a fundamental research problem. While previous works have been successful with respect to accuracy metrics, they have not created new insights into how fusion is performed, in terms of which modalities are related and how modalities engage in an interaction during fusion. Specifically, to understand the fusion process one must first understand the n-modal dynamics (Zadeh et al., 2017). n-modal dynamics state that there exist different combinations of modalities and that all of these combinations must be captured to better understand multimodal language. In this paper, we define building the n-modal dynamics as a hierarchical process and propose a new fusion model called the Dynamic Fusion Graph (DFG). DFG is easily interpretable through what are called efficacies in its graph connections. To utilize this new fusion model in a multimodal language framework, we build upon the Memory Fusion Network (MFN) by replacing the original fusion component in the MFN with our DFG. We call the resulting model the Graph Memory Fusion Network (Graph-MFN). Once the model is trained end to end, we analyze the efficacies in the DFG to study the fusion mechanism learned for modalities in multimodal language. In addition to being an interpretable fusion mechanism, Graph-MFN also outperforms previously proposed state-of-the-art models for sentiment analysis and emotion recognition on CMU-MOSEI.

Dynamic Fusion Graph
In this section we discuss the internal structure of the proposed Dynamic Fusion Graph (DFG) neural model (Figure 3). DFG has the following properties: 1) it explicitly models the n-modal interactions, 2) it does so with an efficient number of parameters (as opposed to previous approaches such as Tensor Fusion (Zadeh et al., 2017)), and 3) it can dynamically alter its structure and choose the proper fusion graph based on the importance of each n-modal dynamic during inference. We assume the set of modalities to be M = {(l)anguage, (v)ision, (a)coustic}. The unimodal dynamics are denoted as {l}, {v}, {a}, the bimodal dynamics as {l, v}, {v, a}, {l, a}, and the trimodal dynamics as {l, v, a}. These dynamics are in the form of latent representations and are each considered a vertex inside a graph G = (V, E), with V the set of vertices and E the set of edges. A directional neural connection is established between two vertices v_i and v_j only if v_i ⊂ v_j. For example, {l} ⊂ {l, v}, which results in a connection between <language> and <language, vision>. This connection is denoted as an edge e_ij. A network D_j takes as input all v_i that satisfy the neural connection formula above for v_j.
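The subset-based connectivity rule can be made concrete with a short sketch that enumerates the vertices and directed edges of the graph. This is a structural illustration only; `build_dfg_edges` is a hypothetical helper, and `"tau"` stands for the output vertex:

```python
from itertools import combinations

def build_dfg_edges(modalities=("l", "v", "a")):
    """Enumerate DFG vertices (non-empty modality subsets) and the
    directed edges v_i -> v_j that exist whenever v_i is a proper
    subset of v_j, plus an edge from every vertex to the output
    vertex tau. (A structural sketch of the graph, not the neural model.)
    """
    vertices = [frozenset(c)
                for r in range(1, len(modalities) + 1)
                for c in combinations(modalities, r)]
    # Proper-subset connections, e.g. {l} -> {l, v}.
    edges = [(vi, vj) for vi in vertices for vj in vertices if vi < vj]
    # Every vertex also connects to the output vertex tau.
    edges += [(v, "tau") for v in vertices]
    return vertices, edges
```

For M = {l, v, a} this yields 7 internal vertices and 19 edges (12 subset connections plus 7 connections to the output vertex), matching the counts reported for the DFG.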
We define an efficacy for each edge e_ij, denoted as α_ij. v_i is multiplied by α_ij before being used as input to D_j. Each α is a sigmoid-activated probability neuron which indicates how strong or weak the connection is between v_i and v_j. The αs are the main source of interpretability in DFG. The vector of all αs is inferred using a deep neural network D_α which takes as input the singleton vertices in V (l, v, and a). We leave it to the supervised training objective to learn the parameters of D_α and make good use of efficacies, thus dynamically controlling the structure of the graph. The singleton vertices are chosen for this purpose since they have no incoming edges and thus no associated efficacies (no efficacy is needed to infer the singleton vertices).

Figure 4: Overview of the Graph Memory Fusion Network (Graph-MFN) pipeline. Graph-MFN replaces the fusion block in MFN with a Dynamic Fusion Graph (DFG). For a description of the variables and memory formulation, please refer to the original Memory Fusion Network paper (Zadeh et al., 2018a).
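The efficacy computation can be sketched as follows, assuming a single linear layer in place of the deep network D_α; the dimensions and parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dfg_efficacies(h_l, h_v, h_a, W, b):
    """Sketch of D_alpha: a network over the concatenated singleton
    vertices that emits one sigmoid-activated efficacy per edge.
    (A single linear layer here; the paper uses a deep network.)"""
    x = np.concatenate([h_l, h_v, h_a])
    return sigmoid(W @ x + b)  # shape (19,): one efficacy per edge

# Toy dimensions: 4-d singleton representations, 19 efficacies.
h_l, h_v, h_a = (rng.standard_normal(4) for _ in range(3))
W = rng.standard_normal((19, 12)) * 0.1
b = np.zeros(19)
alphas = dfg_efficacies(h_l, h_v, h_a, W, b)
```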
The same singleton vertices l, v, and a are the inputs to the DFG. In the next section we discuss how these inputs are given to the DFG. All vertices are connected to the output vertex T_t of the network via edges scaled by their respective efficacies. The overall structure of the vertices, edges and respective efficacies is shown in Figure 3. There are a total of 8 vertices (counting the output vertex), 19 edges and subsequently 19 efficacies.

Graph-MFN
To test the performance of DFG, we use a recurrent architecture similar to the Memory Fusion Network (MFN). MFN is a recurrent neural model with three main components: 1) System of LSTMs: a set of parallel LSTMs, with each LSTM modeling a single modality. 2) Delta-memory Attention Network: a fusion component that highlights cross-modal dynamics by assigning coefficients to them. 3) Multi-view Gated Memory: a component that stores the output of multimodal fusion. We replace the Delta-memory Attention Network with DFG and refer to the modified model as the Graph Memory Fusion Network (Graph-MFN). Figure 4 shows the overall architecture of Graph-MFN.
Similar to MFN, Graph-MFN employs a system of LSTMs for modeling individual modalities. c_l, c_v, and c_a represent the memories of the LSTMs for the language, vision and acoustic modalities respectively. D_m, m ∈ {l, v, a}, is a fully connected deep neural network that takes in h_m^[t-1,t], the LSTM representation across two consecutive timestamps, which allows the network to track changes in memory dimensions across time. The outputs of D_l, D_v and D_a are the singleton vertices for the DFG. The DFG models cross-modal interactions and encodes the cross-modal representations in its output vertex T_t for storage in the Multi-view Gated Memory u_t. The Multi-view Gated Memory functions using a network D_u that transforms T_t into a proposed memory update û_t. γ_1 and γ_2 are the Multi-view Gated Memory's retain and update gates respectively, and are learned using the networks D_γ1 and D_γ2. Finally, a network D_z transforms T_t into a multimodal representation z_t to update the system of LSTMs. The output of Graph-MFN in all the experiments is the output of each LSTM, h_m^T, as well as the contents of the Multi-view Gated Memory at time T (the last recurrence timestep), u_T. This output is subsequently connected to a classification or regression layer for the final prediction (for sentiment and emotion recognition).
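The Multi-view Gated Memory update can be sketched as follows, assuming the gated form used in the original MFN (retain gate scaling the previous memory plus update gate scaling the squashed proposal); the function name and the tanh squashing are our illustrative assumptions:

```python
import numpy as np

def gated_memory_update(u_prev, u_hat, gamma1, gamma2):
    """Multi-view Gated Memory update (as in the original MFN sketch):
    the retain gate gamma1 scales the previous memory and the update
    gate gamma2 scales the tanh-squashed proposed update u_hat.
    All arguments are same-shaped vectors; the gates lie in (0, 1)."""
    return gamma1 * u_prev + gamma2 * np.tanh(u_hat)
```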

Experiments and Discussion
In our experiments, we seek to evaluate how modalities interact during multimodal fusion by studying the efficacies of DFG through time.
Table 3 shows the results on CMU-MOSEI. Accuracy is reported as A_x, where x is the number of sentiment classes, along with the F1 measure. For regression we report MAE and correlation (r). For emotion recognition, due to the natural imbalance across the various emotions, we use weighted accuracy (Tong et al., 2017) and the F1 measure. Graph-MFN shows superior performance in sentiment analysis and competitive performance in emotion recognition. Therefore, DFG is both an effective and an interpretable model for multimodal fusion.
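For reference, a common formulation of weighted accuracy for imbalanced binary labels averages the per-class recalls, so a majority-class classifier scores only 0.5. We believe this matches the intent of the Tong et al. (2017) metric, but the sketch below is illustrative:

```python
def weighted_accuracy(y_true, y_pred):
    """Weighted (class-balanced) accuracy for a binary emotion label:
    the mean of the true-positive rate and the true-negative rate.
    (A common formulation; illustrative, not the paper's exact code.)"""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)
```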
To better understand the internal fusion mechanism between modalities, we visualize the behavior of the learned DFG efficacies in Figure 5 for various cases (deep red denotes high efficacy and deep blue denotes low efficacy).
Multimodal Fusion has a Volatile Nature: The first observation is that the structure of the DFG changes case by case, and within each case over time. As a result, the model seems to selectively prioritize certain dynamics over others. For example, in case (I), where all modalities are informative, all efficacies seem to be high, implying that the DFG is able to find useful information in unimodal, bimodal and trimodal interactions. However, in cases (II) and (III), where the visual modality is either uninformative or contradictory, the efficacies of v → (l, v), v → (l, a, v) and (l, a) → (l, a, v) are reduced, since no meaningful interactions involve the visual modality.
Priors in Fusion: Certain efficacies remain unchanged across cases and across time. These are priors of human multimodal language that the DFG learns. For example, the model always seems to prioritize fusion between language and audio in (l → l, a) and (a → l, a). Conversely, DFG gives low values to efficacies that rely unilaterally on language or audio alone: the (l → τ) and (a → τ) efficacies seem to be consistently low. On the other hand, the visual modality appears to have a partially isolated behavior. In the presence of informative visual information, the model increases the efficacy of (v → τ), although the values of the other visual efficacies also increase.

Trace of Multimodal Fusion:
We trace the dominant path that every modality undergoes during fusion: 1) language tends to first fuse with audio via (l → l, a), and the language and acoustic modalities together then engage in higher-level fusions such as (l, a → l, a, v). Intuitively, this is aligned with the close ties between language and audio through word intonations. 2) The visual modality seems to engage in fusion only if it contains meaningful information. In cases (I) and (IV), all the paths involving the visual modality are relatively active, while in cases (II) and (III) the paths involving the visual modality have low efficacies. 3) The acoustic modality is mostly present in fusion with the language modality. However, unlike language, the acoustic modality also appears to fuse with the visual modality if both modalities are meaningful, such as in case (I).
An interesting observation is that in almost all cases the efficacies of the unimodal connections to the terminal vertex T are low, implying that T prefers not to rely on just one modality. Also, DFG always prefers to perform fusion between language and audio: in most cases both l → l, a and a → l, a have high efficacies; intuitively, in most natural scenarios the language and acoustic modalities are highly aligned. Both of these cases show unchanging behaviors, which we believe DFG has learned as natural priors of the human communicative signal.
With these observations, we believe that DFG has successfully learned how to manage its internal structure to model human communication.

Conclusion
In this paper we presented the largest dataset for multimodal sentiment analysis and emotion recognition, called CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI). CMU-MOSEI consists of 23,453 annotated sentences from more than 1,000 online speakers and 250 different topics. The dataset expands the horizons of human multimodal language studies in NLP. One such study was presented in this paper, where we analyzed the structure of multimodal fusion in sentiment analysis and emotion recognition. This was done using a novel interpretable fusion mechanism called the Dynamic Fusion Graph (DFG). In our studies we investigated how modalities interact with each other using the built-in efficacies of DFG. Aside from this analysis of fusion, DFG was trained in the Memory Fusion Network pipeline and showed superior performance in sentiment analysis and competitive performance in emotion recognition.

Figure 1 :
Figure 1: The diversity of topics of videos in CMU-MOSEI, displayed as a word cloud. Larger words indicate more videos from that topic. The three most frequent topics are reviews (16.2%), debate (2.9%) and consulting (1.8%), while the remaining topics are almost uniformly distributed.

Figure 2 :
Figure 2: Distribution of sentiment and emotions in the CMU-MOSEI dataset. The distribution shows a natural skew towards more frequently used emotions. However, the least frequent emotion, fear, still has 1,900 data points, which is an acceptable number for machine learning studies.

Figure 3 :
Figure 3: The structure of the Dynamic Fusion Graph (DFG) for the three modalities {(l)anguage, (v)ision, (a)coustic}. Dashed lines in the DFG show the dynamic connections between vertices controlled by the efficacies (α).

Figure 5 :
Figure 5: Visualization of DFG efficacies across time. The efficacies (and thus the DFG structure) change over time as the DFG is exposed to new information. The DFG is able to choose which n-modal dynamics to rely on. It also learns priors about human communication, since certain efficacies (thus edges in the DFG) remain unchanged across time and across data points.

Table 1 :
Comparison of the CMU-MOSEI dataset with previous sentiment analysis and emotion recognition datasets. #S denotes the number of annotated data points. #Sp is the number of distinct speakers. Mod indicates the subset of modalities present from {(l)anguage, (v)ision, (a)udio}. The Sent and Emo columns indicate the presence of sentiment and emotion labels. TL denotes the total number of video hours.