Multimodal Language Analysis with Recurrent Multistage Fusion

Computational modeling of human multimodal language is an emerging research area in natural language processing spanning the language, visual and acoustic modalities. Comprehending multimodal language requires modeling not only the interactions within each modality (intra-modal interactions) but more importantly the interactions between modalities (cross-modal interactions). In this paper, we propose the Recurrent Multistage Fusion Network (RMFN) which decomposes the fusion problem into multiple stages, each of them focused on a subset of multimodal signals for specialized, effective fusion. Cross-modal interactions are modeled using this multistage fusion approach which builds upon intermediate representations of previous stages. Temporal and intra-modal interactions are modeled by integrating our proposed fusion approach with a system of recurrent neural networks. The RMFN displays state-of-the-art performance in modeling human multimodal language across three public datasets relating to multimodal sentiment analysis, emotion recognition, and speaker traits recognition. We provide visualizations to show that each stage of fusion focuses on a different subset of multimodal signals, learning increasingly discriminative multimodal representations.


Introduction
Computational modeling of human multimodal language is an emerging research area in natural language processing. This research area focuses on modeling tasks such as multimodal sentiment analysis (Morency et al., 2011), emotion recognition (Busso et al., 2008), and personality traits recognition (Park et al., 2014). The multimodal temporal signals include the language (spoken words), visual (facial expressions, gestures) and acoustic (prosody, vocal expressions) modalities. At their core, these multimodal signals are highly structured with two prime forms of interactions: intra-modal and cross-modal interactions (Rajagopalan et al., 2016). Intra-modal interactions refer to information within a specific modality, independent of other modalities. Examples include the arrangement of words in a sentence according to the generative grammar of a language (Chomsky, 1957) or the sequence of facial muscle activations that presents a frown. Cross-modal interactions refer to interactions between modalities. Examples include the simultaneous co-occurrence of a smile with a positive sentence or the delayed occurrence of laughter after the end of a sentence. Modeling these interactions lies at the heart of human multimodal language analysis and has recently become a central research direction in multimodal natural language processing (Liu et al., 2018; Pham et al., 2018), multimodal speech recognition (Gupta et al., 2017; Harwath and Glass, 2017; Kamper et al., 2017), as well as multimodal machine learning (Tsai et al., 2018; Srivastava and Salakhutdinov, 2012; Ngiam et al., 2011).
Figure 1: At each recursive stage, a subset of multimodal signals is highlighted and then fused with previous fusion representations. The first fusion stage selects the neutral word and frowning behaviors, which create an intermediate representation reflecting negative emotion when fused together. The second stage selects the loud voice behavior, which is locally interpreted as emphasis before being fused with previous stages into a strongly negative representation. Finally, the third stage selects the shrugging and speech elongation behaviors, which reflect ambivalence and, when fused with previous stages, are interpreted as a representation for the disappointed emotion.
Recent advances in cognitive neuroscience have demonstrated the existence of multistage aggregation across human cortical networks and functions (Taylor et al., 2015), particularly during the integration of multisensory information (Parisi et al., 2017). At later stages of cognitive processing, higher-level semantic meaning is extracted from phrases, facial expressions, and tone of voice, eventually leading to the formation of higher-level cross-modal concepts (Parisi et al., 2017; Taylor et al., 2015). Inspired by these discoveries, we hypothesize that the computational modeling of cross-modal interactions also requires a multistage fusion process. In this process, cross-modal representations can build upon the representations learned during earlier stages. This decreases the burden on each stage of multimodal fusion and allows each stage of fusion to be performed in a more specialized and effective manner.
In this paper, we propose the Recurrent Multistage Fusion Network (RMFN), which automatically decomposes the multimodal fusion problem into multiple recursive stages. At each stage, a subset of multimodal signals is highlighted and fused with previous fusion representations (see Figure 1). This divide-and-conquer approach decreases the burden on each fusion stage, allowing each stage to be performed in a more specialized and effective way. This is in contrast with conventional fusion approaches, which usually model interactions over all multimodal signals together in one iteration (e.g., early fusion). In RMFN, temporal and intra-modal interactions are modeled by integrating our new multistage fusion process with a system of recurrent neural networks. Overall, RMFN jointly models intra-modal and cross-modal interactions for multimodal language analysis and is differentiable end-to-end.
We evaluate RMFN on three different tasks related to human multimodal language: sentiment analysis, emotion recognition, and speaker traits recognition across three public multimodal datasets. RMFN achieves state-of-the-art performance in all three tasks. Through a comprehensive set of ablation experiments and visualizations, we demonstrate the advantages of explicitly defining multiple recursive stages for multimodal fusion.

Related Work
Previous approaches in human multimodal language modeling can be categorized as follows:
Non-temporal Models: These models simplify the problem by using feature-summarizing temporal observations. Each modality is represented by averaging temporal information through time, as shown for language-based sentiment analysis (Iyyer et al., 2015; Chen et al., 2016) and multimodal sentiment analysis (Abburi et al., 2016; Nojavanasghari et al., 2016; Zadeh et al., 2016; Morency et al., 2011). Conventional supervised learning methods are utilized to discover intra-modal and cross-modal interactions without specific model design (Wang et al., 2016; Poria et al., 2016). These approaches have trouble modeling long sequences since the average statistics do not properly capture the temporal intra-modal and cross-modal dynamics (Xu et al., 2013).
Multimodal Temporal Graphical Models: The application of graphical models in sequence modeling has been an important research problem. Hidden Markov Models (HMMs) (Baum and Petrie, 1966), Conditional Random Fields (CRFs) (Lafferty et al., 2001), and Hidden Conditional Random Fields (HCRFs) were shown to work well on modeling sequential data from the language (Misawa et al., 2017; Ma and Hovy, 2016; Huang et al., 2015) and acoustic (Yuan and Liberman, 2008) modalities. These temporal graphical models have also been extended for modeling multimodal data. Several methods have been proposed, including multi-view HCRFs where the potentials of the HCRF are designed to model data from multiple views (Song et al., 2012), multi-layered CRFs with latent variables to learn hidden spatiotemporal dynamics from multi-view data (Song et al., 2012), and multi-view Hierarchical Sequence Summarization models that recursively build up hierarchical representations (Song et al., 2013).
Multimodal Temporal Neural Networks: More recently, with the advent of deep learning, Recurrent Neural Networks (Elman, 1990; Jain and Medsker, 1999) have been used extensively for language and speech based sequence modeling (Zilly et al., 2016; Soltau et al., 2016), sentiment analysis (Socher et al., 2013; dos Santos and Gatti, 2014; Glorot et al., 2011; Cambria, 2016), and emotion recognition (Bertero et al., 2016; Lakomkin et al., 2018). Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997a) have also been extended to multimodal settings (Rajagopalan et al., 2016), including by learning binary gating mechanisms to remove noisy modalities. Recently, more advanced models were proposed to model both intra-modal and cross-modal interactions. These use Bayesian ranking algorithms (Herbrich et al., 2007) to model both person-independent and person-dependent features, generative-discriminative objectives to learn either joint (Pham et al., 2018) or factorized multimodal representations (Tsai et al., 2018), external memory mechanisms to synchronize multimodal data (Zadeh et al., 2018a), or low-rank tensors to approximate expensive tensor products (Liu et al., 2018). All these methods assume that cross-modal interactions should be discovered all at once rather than across multiple stages, where each stage solves a simpler fusion problem. Our empirical evaluations show the advantages of the multistage fusion approach.

Recurrent Multistage Fusion Network
In this section we describe the Recurrent Multistage Fusion Network (RMFN) for multimodal language analysis (Figure 2). Given a set of modalities $M$, the signal from each modality $m \in M$ is represented as a temporal sequence $X^m = \{x^m_1, x^m_2, \dots, x^m_T\}$, where $x^m_t$ is the input at time $t$. Each sequence $X^m$ is modeled with an intra-modal recurrent neural network (see subsection 3.3 for details). At time $t$, each intra-modal recurrent network outputs a unimodal representation $h^m_t$. The Multistage Fusion Process uses a recursive approach to fuse all unimodal representations $h^m_t$ into a cross-modal representation $z_t$, which is then fed back into each intra-modal recurrent network.

Multistage Fusion Process
The Multistage Fusion Process (MFP) is a modular neural approach that performs multistage fusion to model cross-modal interactions. Multistage fusion is a divide-and-conquer approach which decreases the burden on each stage of multimodal fusion, allowing each stage to be performed in a more specialized and effective way. The MFP has three main modules: HIGHLIGHT, FUSE and SUMMARIZE.
Two modules are repeated at each stage: HIGHLIGHT and FUSE. The HIGHLIGHT module identifies a subset of multimodal signals from the unimodal representations that will be used for that stage of fusion. The FUSE module then performs two subtasks simultaneously: a local fusion of the highlighted features and integration with representations from previous stages. Both the HIGHLIGHT and FUSE modules are realized using memory-based neural networks, which enable coherence between stages and storage of previously modeled cross-modal interactions. As a final step, the SUMMARIZE module takes the multimodal representation of the final stage and translates it into a cross-modal representation $z_t$. Figure 1 shows an illustrative example of multistage fusion. The HIGHLIGHT module selects the "neutral words" and "frowning" expression for the first stage. The local and integrated fusion at this stage creates a representation reflecting negative emotion. For stage 2, the HIGHLIGHT module identifies the acoustic feature "loud voice". The local fusion at this stage interprets it as an expression of emphasis, which is fused with the previous fusion results to represent a strong negative emotion. Finally, the features "shrug" and "speech elongation" are highlighted and locally interpreted as "ambivalence". The integration with previous stages then gives a representation closer to "disappointed".

Module Descriptions
In this section, we present the details of the three multistage fusion modules: HIGHLIGHT, FUSE and SUMMARIZE. Multistage fusion begins with the concatenation of intra-modal network outputs $h_t = \bigoplus_{m \in M} h^m_t$. We use the superscript $[k]$ to denote the index of each stage $k = 1, \dots, K$ during $K$ total stages of multistage fusion. Let $\Theta$ denote the neural network parameters across all modules.
HIGHLIGHT: At each stage $k$, a subset of the multimodal signals represented in $h_t$ is automatically highlighted for fusion. Formally, this module is defined by the process function $f_H$:
$$a^{[k]}_t = f_H(h_t;\ a^{[1:k-1]}_t, \Theta)$$
where, at stage $k$, $a^{[k]}_t$ is a set of attention weights which are inferred based on the previously assigned attention weights $a^{[1:k-1]}_t$. As a result, the highlights at a specific stage $k$ depend on previous highlights. To fully encapsulate these dependencies, the attention assignment process is performed in a recurrent manner using an LSTM which we call the HIGHLIGHT LSTM. The initial HIGHLIGHT LSTM memory at stage 0, $c^{\mathrm{HIGHLIGHT}[0]}_t$, is initialized from the intra-modal representations $h_t$. This allows the memory mechanism of the HIGHLIGHT LSTM to dynamically adjust to the intra-modal representations $h_t$. The output of the HIGHLIGHT LSTM, $h^{\mathrm{HIGHLIGHT}[k]}_t$, is softmax activated to produce attention weights $a^{[k]}_t$ at every stage $k$ of the multistage fusion process:
$$a^{[k]}_{t,j} = \frac{\exp\left(h^{\mathrm{HIGHLIGHT}[k]}_{t,j}\right)}{\sum_{d} \exp\left(h^{\mathrm{HIGHLIGHT}[k]}_{t,d}\right)}$$
and $a^{[k]}_t$ is fed as input into the HIGHLIGHT LSTM at stage $k+1$. Therefore, the HIGHLIGHT LSTM functions as a decoder LSTM (Sutskever et al., 2014; Cho et al., 2014) in order to capture the dependencies on previous attention assignments. Highlighting is performed by element-wise multiplying the attention weights $a^{[k]}_t$ with the concatenated intra-modal representations $h_t$:
$$\tilde{h}^{[k]}_t = h_t \odot a^{[k]}_t$$
where $\odot$ denotes the Hadamard product and $\tilde{h}^{[k]}_t$ are the attended multimodal signals that will be used for the fusion at stage $k$.
Figure 2: At each stage, the HIGHLIGHT module identifies a subset of multimodal signals and the FUSE module performs local fusion before integration with previous fusion representations. The SUMMARIZE module translates the representation at the final stage into a cross-modal representation $z_t$ to be fed back into the intra-modal recurrent networks.
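The highlighting step described above (softmax over the HIGHLIGHT LSTM output, followed by an element-wise product with $h_t$) can be illustrated with a minimal sketch. The function names `softmax` and `highlight`, and the toy input values, are assumptions made for illustration; the LSTM that would produce the scores is not modeled here and is replaced by a fixed list.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def highlight(h_t, lstm_output):
    """Return attention weights a and attended signals h_tilde = h_t * a.

    lstm_output stands in for the HIGHLIGHT LSTM output at one stage
    (hypothetical input; no LSTM is implemented in this sketch).
    """
    if len(h_t) != len(lstm_output):
        raise ValueError("h_t and lstm_output must have the same length")
    a = softmax(lstm_output)
    h_tilde = [h * w for h, w in zip(h_t, a)]  # Hadamard product
    return a, h_tilde


# Toy values (hypothetical): a 4-dimensional concatenated representation.
h_t = [0.5, -1.0, 2.0, 0.1]
scores = [0.0, 0.0, 2.0, 0.0]
a, h_tilde = highlight(h_t, scores)
```

Because the weights come from a softmax, they are positive and sum to one, so highlighting rescales every dimension of $h_t$ rather than hard-selecting a subset.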

FUSE: The highlighted multimodal signals are fused locally and, at the same time, integrated with fusion representations from previous stages. Formally, this module is defined by the process function $f_F$:
$$s^{[k]}_t = f_F(\tilde{h}^{[k]}_t;\ s^{[1:k-1]}_t, \Theta)$$
where $s^{[k]}_t$ denotes the integrated fusion representations at stage $k$. We employ a FUSE LSTM to simultaneously perform the local fusion and the integration with previous fusion representations. The FUSE LSTM input gate enables a local fusion while the FUSE LSTM forget and output gates enable integration with previous fusion results. The initial FUSE LSTM memory at stage 0 is denoted $c^{\mathrm{FUSE}[0]}_t$.
SUMMARIZE: After the final stage, the SUMMARIZE module translates the fusion representation into the cross-modal representation $z_t$, which is the final output of the multistage fusion process and represents all cross-modal interactions discovered at time $t$. The summarized cross-modal representation is then fed into the intra-modal recurrent networks as described in subsection 3.3.
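To make the control flow of the three modules concrete, the following sketch runs $K$ stages of HIGHLIGHT and FUSE and then SUMMARIZE. It is only an outline under stated assumptions: the step functions `toy_highlight_step`, `toy_fuse_step` and `toy_summarize` are hypothetical placeholders for the HIGHLIGHT LSTM, FUSE LSTM and SUMMARIZE network, and the initial states (memory copied from `h_t`, zero fusion memory, uniform attention before stage 1) are assumptions rather than the paper's initialization.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multistage_fusion(h_t, num_stages, highlight_step, fuse_step, summarize):
    """Run K stages of HIGHLIGHT and FUSE, then SUMMARIZE the final stage."""
    n = len(h_t)
    highlight_state = list(h_t)   # assumed: memory derived directly from h_t
    fuse_state = [0.0] * n        # assumed: zero initial fusion memory
    a_prev = [1.0 / n] * n        # assumed: uniform input before stage 1
    attentions, stage_outputs = [], []
    for _ in range(num_stages):
        # HIGHLIGHT: recurrent step conditioned on the previous attention.
        scores, highlight_state = highlight_step(a_prev, highlight_state)
        a_k = softmax(scores)
        h_tilde = [h * a for h, a in zip(h_t, a_k)]
        # FUSE: local fusion plus integration with earlier stages via state.
        s_k, fuse_state = fuse_step(h_tilde, fuse_state)
        attentions.append(a_k)
        stage_outputs.append(s_k)
        a_prev = a_k
    z_t = summarize(stage_outputs[-1])
    return z_t, attentions, stage_outputs


def toy_highlight_step(a_prev, state):
    """Placeholder for the HIGHLIGHT LSTM: lowers scores of attended dims."""
    new_state = [c - a for c, a in zip(state, a_prev)]
    return list(new_state), new_state


def toy_fuse_step(h_tilde, state):
    """Placeholder for the FUSE LSTM: decayed running state plus new input."""
    new_state = [math.tanh(0.5 * s + x) for s, x in zip(state, h_tilde)]
    return list(new_state), new_state


def toy_summarize(s_final):
    """Placeholder for SUMMARIZE: identity mapping."""
    return list(s_final)


h_t = [0.5, -1.0, 2.0, 0.1]
z_t, attentions, stage_outputs = multistage_fusion(
    h_t, 3, toy_highlight_step, toy_fuse_step, toy_summarize
)
```

Passing the step functions as arguments keeps the loop independent of how the recurrent cells are implemented, so trained LSTM cells could be substituted for the toy placeholders without changing the stage logic.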

System of Long Short-term Hybrid Memories
To integrate the cross-modal representations $z_t$ with the temporal intra-modal representations, we employ a system of Long Short-term Hybrid Memories (LSTHMs) (Zadeh et al., 2018b). The LSTHM extends the LSTM formulation to include the cross-modal representation $z_t$ in a hybrid memory component. In the LSTHM update, $\sigma$ is the (hard-)sigmoid activation function, $\tanh$ is the hyperbolic tangent activation function, and $\odot$ denotes the Hadamard product. $i$, $f$ and $o$ are the input, forget and output gates respectively. $\bar{c}^m_{t+1}$ is the proposed update to the hybrid memory $c^m_t$ at time $t+1$ and $h^m_t$ is the time-distributed output of each modality. The cross-modal representation $z_t$ is modeled by the Multistage Fusion Process as discussed in subsection 3.2. The hybrid memory $c^m_t$ contains both intra-modal interactions from individual modalities $x^m_t$ as well as the cross-modal interactions captured in $z_t$.
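For orientation, an LSTHM-style update can be sketched as a standard LSTM in which the gates and the proposed memory update additionally receive $z_t$. The weight and bias names ($W$, $U$, $V$, $b$) below are assumed notation, $\bar{c}^m_{t+1}$ denotes the proposed memory update, and the exact time indexing should be checked against Zadeh et al. (2018b).

```latex
\begin{align}
i^m_{t+1} &= \sigma\left(W^m_i x^m_{t+1} + U^m_i h^m_t + V^m_i z_t + b^m_i\right) \\
f^m_{t+1} &= \sigma\left(W^m_f x^m_{t+1} + U^m_f h^m_t + V^m_f z_t + b^m_f\right) \\
o^m_{t+1} &= \sigma\left(W^m_o x^m_{t+1} + U^m_o h^m_t + V^m_o z_t + b^m_o\right) \\
\bar{c}^m_{t+1} &= W^m_{\bar{c}} x^m_{t+1} + U^m_{\bar{c}} h^m_t + V^m_{\bar{c}} z_t + b^m_{\bar{c}} \\
c^m_{t+1} &= f^m_{t+1} \odot c^m_t + i^m_{t+1} \odot \tanh\left(\bar{c}^m_{t+1}\right) \\
h^m_{t+1} &= o^m_{t+1} \odot \tanh\left(c^m_{t+1}\right)
\end{align}
```

The only difference from a plain LSTM in this sketch is the $V^m z_t$ term, which is how cross-modal information would enter the memory of each modality.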

Optimization
The multimodal prediction task is performed using a final representation $E$ which integrates (1) the last outputs from the LSTHMs and (2) the last cross-modal representation $z_T$. Formally, $E$ is defined as:
$$E = \left(\bigoplus_{m \in M} h^m_T\right) \oplus z_T$$
where $\oplus$ denotes vector concatenation. $E$ can then be used as a multimodal representation for supervised or unsupervised analysis of multimodal language. It summarizes all modeled intra-modal and cross-modal representations from the multimodal sequences. RMFN is differentiable end-to-end, which allows the network parameters $\Theta$ to be learned using gradient descent approaches.
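As a small illustration of how the final representation is assembled, the sketch below concatenates the last LSTHM output of each modality with the last cross-modal representation. The function name `final_representation`, the modality keys, the concatenation order and the toy vectors are all assumptions for illustration.

```python
def final_representation(last_outputs, z_T,
                         modality_order=("language", "visual", "acoustic")):
    """Concatenate the last per-modality outputs h^m_T, then append z_T.

    The modality order is an assumption; any fixed order works as long as
    it is used consistently during training and inference.
    """
    E = []
    for name in modality_order:
        E.extend(last_outputs[name])
    E.extend(z_T)
    return E


# Toy vectors (hypothetical sizes and values).
last_outputs = {"language": [0.1, 0.2], "visual": [0.3], "acoustic": [0.4, 0.5]}
z_T = [0.9, 1.0]
E = final_representation(last_outputs, z_T)
```

The resulting vector has one entry per dimension of every $h^m_T$ plus the dimensions of $z_T$, and would typically be passed to a task-specific prediction layer.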

Experimental Setup
To evaluate the performance and generalization of RMFN, three domains of human multimodal language were selected: multimodal sentiment analysis, emotion recognition, and speaker traits recognition.

Datasets
All datasets consist of monologue videos. The speaker's intentions are conveyed through three modalities: language, visual and acoustic.

Multimodal Features and Alignment
GloVe word embeddings (Pennington et al., 2014), Facet (iMotions, 2017) and COVAREP (Degottex et al., 2014) are extracted for the language, visual and acoustic modalities respectively. Forced alignment is performed using P2FA (Yuan and Liberman, 2008) to obtain the exact utterance times

Baseline Models
We compare to the following models for multimodal machine learning:
MFN (Zadeh et al., 2018a) synchronizes multimodal sequences using a multi-view gated memory. It is the current state of the art on CMU-MOSI and POM.
MARN (Zadeh et al., 2018b) models intra-modal and cross-modal interactions using multiple attention coefficients and hybrid LSTM memory components.
GME-LSTM(A) learns binary gating mechanisms to remove noisy modalities that are contradictory or redundant for prediction.
TFN models unimodal, bimodal and trimodal interactions using tensor products.

Evaluation Metrics
For classification, we report accuracy Ac, where c denotes the number of classes, and F1 score. For regression, we report Mean Absolute Error (MAE) and Pearson's correlation r. For MAE, lower values indicate stronger performance. For all remaining metrics, higher values indicate stronger performance.
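The metrics above follow standard definitions, which can be sketched in plain Python as below. The function names and the toy labels are illustrative assumptions, and the F1 shown is the binary F1 for a single positive class; the averaging scheme used for multi-class F1 in the experiments is not specified here.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference; lower is better."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy labels and scores (hypothetical).
labels = [1, 0, 1, 1, 0]
preds = [1, 0, 0, 1, 1]
acc = accuracy(labels, preds)
f1 = f1_binary(labels, preds)
mae = mean_absolute_error([1.0, -2.0, 0.5], [0.5, -1.0, 0.5])
r = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```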

Performance on Multimodal Language
Results on CMU-MOSI, IEMOCAP and POM are presented in Tables 1, 2 and 3 respectively. We achieve state-of-the-art or competitive results for all domains, highlighting RMFN's capability in human multimodal language analysis. We observe that RMFN does not improve results on the IEMOCAP neutral emotion, where the model outperforming RMFN is a memory-based fusion baseline (Zadeh et al., 2018a). We believe that this is because neutral expressions are quite idiosyncratic. Some people may always look angry given their facial configuration (e.g., natural eyebrow raises of actor Jack Nicholson). In these situations, it becomes useful to compare the current image with a memorized or aggregated representation of the speaker's face. Our proposed multistage fusion approach can easily be extended to memory-based fusion methods.
From Table 4, we observe that increasing the number of stages K increases the model's capability to model cross-modal interactions up to a certain point (K = 3) in our experiments. Further increases led to decreases in performance, and we hypothesize this is due to overfitting on the dataset.
Q3: To compare multistage against independent modeling of cross-modal interactions, we pay close attention to the performance comparison with respect to MARN, which models multiple cross-modal interactions all at once (see Table 5). RMFN shows improved performance, indicating that multistage fusion is both effective and efficient for human multimodal language modeling.
Q4: RMFN (no MFP) represents a system of LSTHMs without the integration of $z_t$ from the MFP to model cross-modal interactions. From Table 5, RMFN (no MFP) is outperformed by RMFN, confirming that modeling cross-modal interactions is crucial in analyzing human multimodal language.
Q5: RMFN (no HIGHLIGHT) removes the HIGHLIGHT module from the MFP during multistage fusion. From Table 5, RMFN (no HIGHLIGHT) underperforms, indicating that highlighting multimodal representations using attention weights is important for modeling cross-modal interactions.

Visualizations
Using an attention assignment mechanism during the HIGHLIGHT process gives more interpretability to the model since it allows us to visualize the attended multimodal signals at each stage and time step (see Figure 3). Using RMFN trained on the CMU-MOSI dataset, we plot the attention weights across the multistage fusion process for three videos in CMU-MOSI. Based on these visualizations we first draw the following general observations on multistage fusion:
Across stages: Attention weights change their behaviors across the multiple stages of fusion. Some features are highlighted by earlier stages while other features are used in later stages. This supports our hypothesis that RMFN learns to specialize in different stages of the fusion process.
Across time: Attention weights vary over time and adapt to the multimodal inputs. We observe that the attention weights are similar if the input contains no new information. As soon as new multimodal information comes in, the highlighting mechanism in RMFN adapts to these new inputs.
Priors: Based on the distribution of attention weights, we observe that the language and acoustic modalities seem the most commonly highlighted. This represents a prior over the expression of sentiment in human multimodal language and is closely related to the strong connections between language and speech in human communication (Kuhl, 2000).
Inactivity: Some attention coefficients are not active (always orange) throughout time. We hypothesize that these corresponding dimensions carry only intra-modal dynamics and are not involved in the formation of cross-modal interactions.

Qualitative Analysis
In addition to the general observations above, Figure 3 shows three examples where multistage fusion learns cross-modal representations across three different scenarios.
Synchronized Interactions: In Figure 3(a), the language features are highlighted corresponding to the utterance of the word "fun" that is highly indicative of sentiment (t = 5). This sudden change is also accompanied by a synchronized highlighting of the acoustic features. We also notice that the highlighting of the acoustic features lasts longer across the 3 stages since it may take multiple stages to interpret all the new acoustic behaviors (elongated tone of voice and phonological emphasis).
Asynchronous Trimodal Interactions: In Figure 3(b), the language modality displays ambiguous sentiment: "delivers a lot of intensity" can be inferred as either positive or negative. We observe that the circled attention units in the visual and acoustic features correspond to the asynchronous presence of a smile (t = 2:5) and phonological emphasis (t = 3) respectively. These nonverbal behaviors resolve ambiguity in language and result in an overall display of positive sentiment. We further note the coupling of attention weights that highlight the language, visual and acoustic features across stages (t = 3:5), further emphasizing the coordination of all three modalities during multistage fusion despite their asynchronous occurrences.
Bimodal Interactions: In Figure 3(c), the language modality is better interpreted in the context of acoustic behaviors. The disappointed tone and soft voice provide the nonverbal information useful for sentiment inference. This example highlights the bimodal interactions (t = 4:7) in alternating stages: the acoustic features are highlighted more in earlier stages while the language features are highlighted increasingly in later stages.

Conclusion
This paper proposed the Recurrent Multistage Fusion Network (RMFN), which decomposes the multimodal fusion problem into multiple stages, each focused on a subset of multimodal signals. Extensive experiments across three publicly available datasets reveal that RMFN is highly effective in modeling human multimodal language. In addition to achieving state-of-the-art performance on all datasets, our comparisons and visualizations reveal that the multiple stages coordinate to capture both synchronous and asynchronous multimodal interactions. In future work, we are interested in merging our model with memory-based fusion methods since they have complementary strengths as discussed in subsection 5.1.