Multimodal Routing: Improving Local and Global Interpretability of Multimodal Language Analysis

Human language can be expressed through multiple sources of information known as modalities, including tone of voice, facial gestures, and spoken words. Recent multimodal learning methods with strong performance on human-centric tasks such as sentiment analysis and emotion recognition are often black-box, with very limited interpretability. In this paper we propose Multimodal Routing, which dynamically adjusts weights between input modalities and output representations differently for each input sample. Multimodal Routing can identify the relative importance of both individual modalities and cross-modal features. Moreover, the weight assignment by routing allows us to interpret modality-prediction relationships not only globally (i.e., general trends over the whole dataset) but also locally for each single input sample, while keeping competitive performance compared to state-of-the-art methods.


Introduction
Human language contains multimodal cues, including textual (e.g., spoken or written words), visual (e.g., body gestures), and acoustic (e.g., voice tones) modalities. It acts as a medium for human communication, and modeling it has advanced areas spanning affect recognition (Busso et al., 2008), media description (Lin et al., 2014), and multimedia information retrieval (Abu-El-Haija et al., 2016). Modeling multimodal sources requires understanding the relative importance of not only each single modality (defined as unimodal explanatory features) but also the interactions between modalities (defined as bimodal or trimodal explanatory features) (Büchel et al., 1998). Recent work (Liu et al., 2018; Williams et al., 2018; Ortega et al., 2019) proposed methods to fuse information across modalities and yielded superior performance, but these models are often black-box, with very limited interpretability.

* indicates equal contribution. Code is available at https://github.com/martinmamql/multimodal_routing.

Interpretability matters. It allows us to identify crucial explanatory features for predictions. Such knowledge can provide insights into multimodal learning, improve model design, or help debug a dataset. This interpretability is useful at two levels: the global and the local level. Global interpretation reflects the general (averaged) trends of explanatory feature importance over the whole dataset. Local interpretation is arguably harder but can give a high-resolution view of feature importance for each individual sample during training and inference. These two levels of interpretability should provide an understanding of unimodal, bimodal, and trimodal explanatory features.

Figure 2: Overview of Multimodal Routing, which contains encoding, routing, and prediction stages. We consider only two input modalities in this example. The encoding stage computes unimodal and bimodal explanatory features from the inputs of different modalities. The routing stage iteratively performs concept updates and routing adjustment. The prediction stage decodes the concepts into the model's prediction. The routing associates the text and visual-text features with negative sentiment in the left example, and the visual and visual-text features with positive sentiment in the right example before making predictions.
In this paper we address both local and global interpretability of unimodal, bimodal, and trimodal explanatory features by presenting Multimodal Routing. In human multimodal language, such routing dynamically changes the weights between modalities and output labels for each sample, as shown in Fig. 1. The most significant contribution of Multimodal Routing is its ability to establish local weights between modality features and labels dynamically for each input sample during training and inference, thus providing a local interpretation for each sample.
Our experiments focus on two tasks, sentiment analysis and emotion recognition, using two benchmark multimodal language datasets, IEMOCAP (Busso et al., 2008) and CMU-MOSEI. We first study how our model compares with state-of-the-art methods on these tasks. More importantly, we provide local interpretation by qualitatively analyzing the adjusted local weights for each sample. We then analyze the global interpretation using statistical techniques to reveal the features that are crucial for prediction on average. Such interpretation at different resolutions strengthens our understanding of multimodal language learning.

Related Work
Multimodal language learning is based on the fact that humans integrate multiple sources of information, such as acoustic, textual, and visual signals, to learn language (McGurk and MacDonald, 1976; Ngiam et al., 2011; Baltrušaitis et al., 2018). Recent advances in modeling multimodal language using deep neural networks are not interpretable (Wang et al., 2019; Tsai et al., 2019a). Linear methods like Generalized Additive Models (GAMs) (Hastie, 2017) do not offer local interpretability. Even though we could use post hoc methods (which interpret predictions given an arbitrary model) such as LIME (Ribeiro et al., 2016), SHAP (Lundberg and Lee, 2017), and L2X (Chen et al., 2018) to interpret these black-box models, these methods are designed to detect the contributions only of unimodal features, not of bimodal or trimodal explanatory features. It has been shown that in human communication, modality interactions are more important than individual modalities (Engle, 1998).
Two recent methods, Graph-MFN and the Multimodal Factorized Model (MFM) (Tsai et al., 2019b), attempted to interpret the relationships between modality interactions and learning for human language. Nonetheless, Graph-MFN did not separate the contributions of unimodal and multimodal explanatory features, and MFM only analyzed the trimodal interaction feature. Neither can interpret how single modalities and modality interactions contribute to the final prediction at the same time.
Our method is inspired by and related to Capsule Networks (Sabour et al., 2017; Hinton et al., 2018), which perform routing between layers of capsules. Each capsule is a group of neurons that encapsulates spatial information as well as the probability of an object being present. In contrast, our method performs routing between multimodal features (i.e., unimodal, bimodal, and trimodal explanatory features) and concepts of the model's decision.

Method
The proposed Multimodal Routing contains the three stages shown in Fig. 2: encoding, routing, and prediction. The encoding stage encodes raw inputs (speech, text, and visual data) into unimodal, bimodal, and trimodal features. The routing stage contains a routing mechanism (Sabour et al., 2017; Hinton et al., 2018), which 1) updates hidden representations and 2) adjusts the local weights between each feature and each hidden representation by pairwise similarity. Following previous work (Mao et al., 2019), we call these hidden representations "concepts", and each of them is associated with a specific prediction label (in our case a sentiment or an emotion). Finally, the prediction stage takes the inferred concepts to produce the model prediction.

Multimodal Routing
We use v(isual), a(coustic), and t(ext) to denote the three commonly considered modalities in human multimodal language. Let $x = \{x_a, x_v, x_t\}$ represent the multimodal input. $x_a \in \mathbb{R}^{T_a \times d_a}$ is an audio stream with time length $T_a$ and feature dimension $d_a$ (at each time step). Similarly, $x_v \in \mathbb{R}^{T_v \times d_v}$ is the visual stream and $x_t \in \mathbb{R}^{T_t \times d_t}$ is the text stream. In this paper, we consider multiclass or multilabel prediction tasks for multimodal language modeling, in which we use $y \in \mathbb{R}^J$ to denote the ground-truth label with $J$ being the number of classes or labels, and $\hat{y}$ to represent the model's prediction. Our goal is to find the relative importance of the contributions from unimodal features (e.g., $x_a$ itself), bimodal features (e.g., the interaction between $x_a$ and $x_v$), and trimodal features (e.g., the interaction among $x_a$, $x_v$, and $x_t$) to the model prediction $\hat{y}$.
Encoding Stage. The encoding stage encodes the multimodal inputs $\{x_a, x_v, x_t\}$ into explanatory features. We use $f_i \in \mathbb{R}^{d_f}$ to denote the features, with $i \in \{a, v, t\}$ for unimodal features, $i \in \{av, vt, ta\}$ for bimodal features, and $i \in \{avt\}$ for the trimodal interaction; $d_f$ is the dimension of the features. Specifically, $f_a = F_a(x_a; \theta)$, $f_{av} = F_{av}(x_a, x_v; \theta)$, and $f_{avt} = F_{avt}(x_a, x_v, x_t; \theta)$, with $\theta$ the parameters of the encoding functions $F$. The Multimodal Transformer (MulT) (Tsai et al., 2019a) is adopted as the design of the encoding functions $F_i$. Here the trimodal function $F_{avt}$ encodes sequences from all three modalities into a unified representation, $F_{av}$ encodes the acoustic and visual modalities, and $F_a$ encodes the acoustic input. Next, $p_i \in [0, 1]$ is a scalar representing how strongly each feature $f_i$ is activated in the model. Similar to $f_i$, we also use MulT to encode $p_i$ from the input $x_i$. That is, $p_a = P_a(x_a; \theta')$, $p_{av} = P_{av}(x_a, x_v; \theta')$, and $p_{avt} = P_{avt}(x_a, x_v, x_t; \theta')$, with $\theta'$ the parameters of MulT and $P_i$ the corresponding encoding functions (details in the Supplementary).
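The interface of the encoding stage can be sketched as below. The actual $F_i$ and $P_i$ are MulT networks; here `encoders` and `activators` are hypothetical stand-ins that only illustrate which inputs feed which feature, not the authors' implementation.

```python
import numpy as np

def encode(x_a, x_v, x_t, encoders, activators):
    """Encoding-stage sketch. `encoders[i]` and `activators[i]` stand in for
    the MulT-based functions F_i and P_i; this dict-based interface is a
    hypothetical simplification, not the paper's code."""
    inputs = {
        'a': (x_a,), 'v': (x_v,), 't': (x_t,),                 # unimodal
        'av': (x_a, x_v), 'vt': (x_v, x_t), 'ta': (x_t, x_a),  # bimodal
        'avt': (x_a, x_v, x_t),                                # trimodal
    }
    f = {i: encoders[i](*xs) for i, xs in inputs.items()}      # features f_i
    p = {i: activators[i](*xs) for i, xs in inputs.items()}    # activations p_i in [0, 1]
    return f, p
```

Each of the seven keys indexes one explanatory feature $f_i$ and one activation scalar $p_i$, matching the seven routing inputs used below.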
Routing Stage. The goal of routing is to infer interpretable hidden representations (termed here "concepts") for each output label. The first step of routing is to initialize the concepts with equal weights, so that all explanatory features are equally important. The core of routing is then an iterative process that encourages each explanatory feature to be assigned to only one concept (in reality it is a soft assignment), namely the one with which it shows high similarity. Formally, each concept $c_j \in \mathbb{R}^{d_c}$ is a one-dimensional vector of dimension $d_c$. Linear weights $r_{ij}$, which we term routing coefficients, are defined between each concept $c_j$ and each explanatory feature $f_i$.
The first half of routing, which we call routing adjustment, finds a new assignment (i.e., the routing coefficients) between the input features and the concepts by taking a softmax of their dot products over all concepts; thus only the coefficient of a feature showing high similarity with a concept (a sentiment or an emotion in our case) will be close to 1, instead of all features being assigned to all concepts. This helps local interpretability, because we can always distinguish important explanatory features from unimportant ones. The second half of routing, which we call concept update, updates the concepts by linearly aggregating the input features weighted by the routing coefficients, so that the result is local to each input sample.
- Routing adjustment. We define the routing coefficient $r_{ij} \in [0, 1]$ by measuring the similarity between $f_i W_{ij}$ and $c_j$:

$$s_{ij} = (f_i W_{ij})^\top c_j, \qquad r_{ij} = \frac{\exp s_{ij}}{\sum_{j'} \exp s_{ij'}}.$$

We note that $r_{ij}$ is normalized over all concepts $c_j$. Hence, it is a coefficient that takes a high value only when $f_i$ is in agreement with $c_j$ but not with $c_{j'}$, where $j' \neq j$.

Procedure 1: Multimodal Routing
1: Initialize concepts with uniform weights
2: for $t$ iterations do
3:   /* Routing Adjustment */
4:   for all features $i$ and concepts $j$: $s_{ij} \leftarrow (f_i W_{ij})^\top c_j$
5:   for all features $i$: $r_{ij} \leftarrow \exp s_{ij} / \sum_{j'} \exp s_{ij'}$
6:   /* Concept Update */
7:   for all concepts $j$: $c_j \leftarrow \sum_i p_i \, r_{ij} \, (f_i W_{ij})$
8: return $\{c_j\}$
- Concept update. After obtaining $p_i$ from the encoding stage, we update the concepts $c_j$ using a weighted average:

$$c_j = \sum_i p_i \, r_{ij} \, (f_i W_{ij}),$$

which updates the concepts based on the routing weights by summing the input features $f_i$, projected by the weight matrices $W_{ij}$ into the space of the $j$th concept; $c_j$ is thus essentially a linear aggregation of the activated, routed features. We summarize the routing procedure in Procedure 1, which returns the concepts ($c_j$) given the explanatory features ($f_i$), local weights ($W_{ij}$), and activations ($p_i$). First, we initialize the concepts with uniform weights. Then, we iteratively perform routing-coefficient adjustment and concept updates. Finally, we return the updated concepts.
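The iterative procedure above can be sketched in plain NumPy. This is a reading of Procedure 1 under assumed shapes (one feature vector per $i$, one projection matrix per feature-concept pair), not the authors' released code.

```python
import numpy as np

def multimodal_routing(features, activations, W, concepts_init, num_iters=3):
    """Sketch of Procedure 1 (assumed shapes, not the paper's implementation).

    features:      dict i -> f_i of shape (d_f,), i in {a, v, t, av, vt, ta, avt}
    activations:   dict i -> scalar p_i in [0, 1]
    W:             dict i -> list of J matrices W_ij, each of shape (d_f, d_c)
    concepts_init: initial concepts, shape (J, d_c)
    Returns the updated concepts (J, d_c) and routing coefficients (num_features, J).
    """
    keys = sorted(features)
    concepts = concepts_init.copy()
    J = concepts.shape[0]
    for _ in range(num_iters):
        # Routing adjustment: s_ij = (f_i W_ij) . c_j, then softmax over concepts j.
        s = np.array([[features[i] @ W[i][j] @ concepts[j] for j in range(J)]
                      for i in keys])
        r = np.exp(s - s.max(axis=1, keepdims=True))   # numerically stable softmax
        r /= r.sum(axis=1, keepdims=True)
        # Concept update: c_j <- sum_i p_i r_ij (f_i W_ij).
        concepts = np.stack([
            sum(activations[i] * r[k, j] * (features[i] @ W[i][j])
                for k, i in enumerate(keys))
            for j in range(J)])
    return concepts, r
```

Because the softmax normalizes each row of `r` over concepts, each feature's routing mass sums to 1, which is what makes the coefficients directly comparable across concepts.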
Prediction Stage. The prediction stage takes the inferred concepts to make the prediction $\hat{y}$. Here, we apply a linear transformation to each concept $c_j$ to obtain the logits. Specifically, the $j$th logit is formulated as

$$\mathrm{logit}_j = o_j^\top c_j,$$

where $o_j \in \mathbb{R}^{d_c}$ is the weight of the linear transformation for the $j$th concept. Then, a Softmax (for the multiclass task) or Sigmoid (for the multilabel task) function is applied to the logits to obtain the prediction $\hat{y}$.
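A minimal sketch of the prediction stage, assuming the $o_j$ vectors are stacked into a matrix `O` of shape (J, d_c) (the name `O` is ours):

```python
import numpy as np

def predict(concepts, O, multilabel=False):
    """Prediction-stage sketch: logit_j = o_j^T c_j, then softmax (multiclass)
    or elementwise sigmoid (multilabel)."""
    logits = np.einsum('jd,jd->j', concepts, O)  # per-concept dot products
    if multilabel:
        return 1.0 / (1.0 + np.exp(-logits))     # sigmoid per label
    e = np.exp(logits - logits.max())            # numerically stable softmax
    return e / e.sum()
```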

Interpretability
In this section, we provide a framework for locally interpreting the relative importance of unimodal, bimodal, and trimodal explanatory features to the model prediction for different samples, by interpreting the routing coefficients $r_{ij}$, which represent the weight assignment between feature $f_i$ and concept $c_j$. We also provide methods to globally interpret the model across the whole dataset.

Local Interpretation
The goal of local interpretation is to understand how the importance of modality and modality-interaction features changes across different multimodal samples. In eq. (3), a decision logit is an addition of the contributions from the unimodal $\{f_a, f_v, f_t\}$, bimodal $\{f_{av}, f_{vt}, f_{ta}\}$, and trimodal $\{f_{avt}\}$ explanatory features. The particular contribution from feature $f_i$ to the $j$th logit is $p_i \, r_{ij} \, o_j^\top (f_i W_{ij})$, which is large only when $p_i$, $r_{ij}$, and $o_j^\top (f_i W_{ij})$ are all large. Intuitively, any of these three scenarios requires high similarity between a modality feature and a concept vector, which represents a specific sentiment or emotion. Note that $p_i$, $r_{ij}$, and $f_i$ are covariates, while $o_j$ and $W_{ij}$ are parameters of the model. Since different input samples yield distinct $p_i$ and $r_{ij}$, we can locally interpret $p_i$ and $r_{ij}$ as the effects of the modality feature $f_i$ contributing to the $j$th logit of the model, which is roughly the confidence of predicting the $j$th sentiment or emotion. We will show examples of local interpretations in the Interpretation Analysis section.
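The per-feature contribution described above reduces to one scalar per (feature, concept) pair. A direct, hedged reading of that term (variable names mirror the paper's symbols; the function itself is ours):

```python
import numpy as np

def local_contribution(f_i, p_i, r_ij, W_ij, o_j):
    """Contribution of feature f_i to the j-th logit: p_i * r_ij * o_j^T (f_i W_ij).
    f_i: (d_f,), W_ij: (d_f, d_c), o_j: (d_c,); p_i and r_ij are scalars."""
    return p_i * r_ij * (o_j @ (f_i @ W_ij))
```

Comparing these scalars across all seven features $i$ for a fixed sample and label $j$ yields the kind of local importance map visualized in Fig. 3.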

Global Interpretation
To globally interpret Multimodal Routing, we analyze $\bar{r}_{ij}$, the average value of the routing coefficients $r_{ij}$ over the entire dataset. Since eq. (3) considers a linear effect of $f_i$, $p_i$, and $r_{ij}$ on $\mathrm{logit}_j$, $\bar{r}_{ij}$ represents the average assignment from feature $f_i$ to the $j$th logit. Instead of reporting the raw values of $\bar{r}_{ij}$, we give a statistical interpretation of $\bar{r}_{ij}$ using confidence intervals, which provide a range of plausible coefficients with probability guarantees. Similar tests on $p_i$ and $p_i r_{ij}$ are provided in the Supplementary Materials. We choose confidence intervals over p-values because they provide much richer information (Ranstam, 2012; Du Prel et al., 2009). Suppose we have $n$ data points with corresponding coefficients $r_{ij} = \{r_{ij,1}, r_{ij,2}, \cdots, r_{ij,n}\}$. If $n$ is large enough and $r_{ij}$ has finite mean and finite variance (which holds since $r_{ij} \in [0, 1]$ is bounded), then by the Central Limit Theorem $\bar{r}_{ij}$ (i.e., the mean of the $r_{ij}$) approximately follows a Normal distribution:

$$\bar{r}_{ij} \sim \mathcal{N}\!\left(\mu, \frac{s_n^2}{n}\right),$$

where $\mu$ is the true mean of $r_{ij}$ and $s_n^2$ is the sample variance of the $r_{ij}$. Using eq. (4), we can construct a confidence interval for $\bar{r}_{ij}$. We use 95% confidence in our analysis.
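The normal-approximation interval above is straightforward to compute; a minimal sketch (the helper name and `z = 1.96` for 95% confidence are standard choices, not from the paper):

```python
import numpy as np

def mean_confidence_interval(samples, z=1.96):
    """CLT-based confidence interval for the mean routing coefficient:
    mean +/- z * s_n / sqrt(n), with z = 1.96 giving ~95% confidence."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    mean = samples.mean()
    half = z * samples.std(ddof=1) / np.sqrt(n)  # s_n is the sample std
    return mean - half, mean + half
```

If the interval for $\bar{r}_{ij}$ lies entirely above $1/J$, the feature-concept connection is flagged as significantly stronger than uniform routing, which is the test used in the Global Interpretation Analysis section.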

Experiments
In this section, we first provide details of the experiments we perform and compare our proposed model with state-of-the-art (SOTA) methods as well as baseline models. We include the interpretability analysis in the next section.

Datasets
We perform experiments on two publicly available benchmarks for human multimodal affect recognition: CMU-MOSEI and IEMOCAP (Busso et al., 2008). CMU-MOSEI contains 23,454 movie review video clips taken from YouTube. For each clip, there are two tasks: sentiment prediction (multiclass classification) and emotion recognition (multilabel classification). For the sentiment prediction task, each sample is labeled with an integer score in the range [−3, 3], indicating highly negative sentiment (−3) to highly positive sentiment (+3). We use the same metrics as prior work: seven-class accuracy (Acc7: seven-class classification over the integers in [−3, 3]), binary accuracy (Acc2: two-class classification in {−1, +1}), and the F1 score of predictions. For the emotion recognition task, each sample is labeled with one or more emotions from {Happy, Sad, Angry, Fear, Disgust, Surprise}. We report six-class accuracy (multilabel accuracy of predicting the six emotion labels) and the F1 score.
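The two sentiment accuracy metrics can be sketched as follows. Note that conventions vary across papers (e.g., whether the zero class is excluded from Acc2), so the exact thresholds here are our assumption, not a statement of the benchmark's official definition:

```python
import numpy as np

def acc7(pred, true):
    """Seven-class accuracy over integer sentiment scores in [-3, 3]
    (rounding to the nearest integer class; assumed convention)."""
    return float(np.mean(np.rint(pred) == np.rint(true)))

def acc2(pred, true):
    """Binary accuracy: negative vs. non-negative sentiment (one common
    convention; some works instead exclude zero-labeled samples)."""
    return float(np.mean((np.asarray(pred) >= 0) == (np.asarray(true) >= 0)))
```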
IEMOCAP consists of 10K video clips for human emotion analysis. Each clip is evaluated and then assigned one or more emotion labels, making it a multilabel learning task. Following prior work (Tsai et al., 2019a; Tripathi et al., 2018; Jack et al., 2014), we report on four emotions (happy, sad, angry, and neutral), with four-class accuracy and F1 score as metrics.
For both datasets, we use features extracted with the public SDK at https://github.com/A2Zadeh/CMU-MultimodalSDK, covering the textual (GloVe word embeddings (Pennington et al., 2014)), visual (Facet (iMotions, 2019)), and acoustic (COVAREP (Degottex et al., 2014)) modalities. The acoustic and visual features are processed to be aligned with the words (i.e., the text features). We present results using this word-aligned setting in this paper, but our method can also work on unaligned multimodal language sequences. The train, validation, and test splits follow previous work (Wang et al., 2019; Tsai et al., 2019a).

Ablation Study and Baseline Models
We provide two ablated interpretable methods as baselines. The first is based on the Generalized Additive Model (GAM) (Hastie, 2017), which directly sums the unimodal, bimodal, and trimodal features and then applies a linear transformation to obtain a prediction; this is equivalent to using only the weights $p_i$ and no routing coefficients. The second, denoted Multimodal Routing*, performs only one routing iteration (by setting $t = 1$ in Procedure 1) and does not iteratively adjust the routing and update the concepts.
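The GAM ablation can be sketched in a few lines; `A` and `b` are assumed names for the final linear layer's parameters, and the sketch omits the encoder details:

```python
import numpy as np

def gam_baseline(features, activations, A, b):
    """GAM ablation sketch: sum the activated features p_i * f_i over all i,
    then apply a single linear transformation. No routing coefficients."""
    z = sum(activations[i] * features[i] for i in features)  # shape (d_f,)
    return A @ z + b                                         # shape (J,)
```

Contrasting this with Procedure 1 makes the ablation concrete: GAM keeps the per-feature activations $p_i$ but drops the per-sample assignment $r_{ij}$ between features and concepts.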

Results and Discussions
We trained our model on a single RTX 2080 GPU. We use 7 layers in the Multimodal Transformer and a batch size of 32. The model is trained with the Adam optimizer and an initial learning rate of $10^{-4}$.
CMU-MOSEI sentiment. Table 1 summarizes the results on this dataset. We first compare all the interpretable methods. Multimodal Routing enjoys a performance improvement over both GAM (Hastie, 2017), a linear model on encoded features, and Multimodal Routing*, a non-iterative feed-forward net with the same parameters as Multimodal Routing. The improvement suggests the proposed iterative routing obtains a more robust prediction by dynamically associating the features with the concepts of the model's predictions. Next, compared to the non-interpretable methods, Multimodal Routing outperforms the EF-LSTM, LF-LSTM, and RAVEN models and performs competitively with MulT (Tsai et al., 2019a). Our method thus achieves competitive performance with the added advantage of local and global interpretability (see the analysis in the later sections). The configuration of our model is in the supplementary file.
CMU-MOSEI emotion. We report the results in Table 2. We do not include RAVEN (Wang et al., 2019) and MulT (Tsai et al., 2019a) since they did not report CMU-MOSEI emotion results. Compared with all baselines, Multimodal Routing again performs competitively on most metrics. We note that the distribution of labels is skewed (e.g., there are disproportionately few samples labeled "surprise"). This skewness results in all models ending up never predicting "surprise", and thus the same accuracy for "surprise" across all approaches.
IEMOCAP emotion. Looking at the IEMOCAP results in Table 1, we see trends similar to CMU-MOSEI sentiment and emotion: Multimodal Routing achieves performance close to the SOTA method. We see a performance drop for the emotion "happy", but our model outperforms the SOTA method for the emotion "angry".

Interpretation Analysis
In this section, we revisit our initial research question: how do we locally identify the importance or contributions of unimodal features and of bimodal or trimodal interactions? We provide examples of how Multimodal Routing can be used to see the variation in contributions. We first present the local interpretation and then the global interpretation.

Local Interpretation Analysis
We show how our model makes decisions locally for each specific input sample by examining the inferred coefficients $p_i r_{ij}$. Different samples produce different $p_i$ and $r_{ij}$, and their product represents how each feature vector contributes to the final prediction locally, thus providing local interpretability. We provide such an analysis on examples from CMU-MOSEI sentiment prediction and emotion recognition, illustrated in Fig. 3. For sentiment prediction, we show samples with true labels of neutral (0), most negative (−3), and most positive (+3) sentiment. For emotion recognition, we illustrate examples with the true labels "happy", "sad", and "disgust". In the rightmost spectrum, a color leaning towards red stands for a high association, while a color leaning towards blue suggests a low association.
In the upper-left example in Fig. 3, a speaker is introducing the movie Sweeney Todd. He says the movie is a musical and suggests that those who dislike musicals not see it. Since he offers no personal judgment on whether he likes or dislikes the movie, his sentiment is classified as neutral (0), even though the text modality (i.e., the transcript) contains a "don't". In the visual modality (i.e., the video), he frowns when he mentions that the movie is a musical, but we cannot conclude his sentiment is neutral from the visual modality alone. Looking at vision and text together (their interaction), the confidence in neutral is high. The model gives the text-vision interaction feature a high value of $p_i r_{ij}$, indicating that it contributes strongly to the prediction, which confirms our reasoning above.
Similarly, in the bottom-left example, the speaker shares her experience of auditioning for a Broadway show. She recounts a very detailed and successful experience of her own and mentions "love" in her audition monologue, which is present in the text. She also has a dramatic smile and a happy tone. We believe all modalities play a role in the prediction. Accordingly, the trimodal interaction feature contributes significantly to the prediction of happiness under our model.
Notably, looking at the six examples overall, we can see that each individual sample bears a different pattern of feature importance, even when the sentiment is the same. This makes routing a useful debugging and interpretation tool. For global interpretation, all these samples are averaged, giving more of a general trend.

Global Interpretation Analysis
Here we analyze the global interpretation of Multimodal Routing. Given the averaged routing coefficients $\bar{r}_{ij}$, generated and aggregated locally from samples, we want to know the overall connection between each modality or modality interaction and each concept across the whole dataset. To evaluate these routing coefficients, we compare them to uniform weighting, i.e., $1/J$ where $J$ is the number of concepts. To perform this analysis, we provide confidence intervals for each $\bar{r}_{ij}$. If an interval excludes $1/J$, we can interpret the corresponding feature as distinguishably significant. See the Supplementary for a similar analysis of $p_i r_{ij}$ and $p_i$.

Table 3: Global interpretation (quantitative results) for Multimodal Routing: confidence intervals of $\bar{r}_{ij}$, sampled from the CMU-MOSEI sentiment task (top) and emotion task (bottom). We bold the values that have the largest mean for each emotion and are significantly larger than uniform routing ($1/J = 1/7 \approx 0.143$).
First, we provide confidence intervals of $\bar{r}_{ij}$ sampled from CMU-MOSEI sentiment and compare them with the value $1/J$. From the top part of Table 3, we can see that our model identified the language modality for neutral sentiment predictions; the acoustic modality for extremely negative predictions (row $r_a$, column −3); and the text-acoustic bimodal interaction for extremely positive predictions (row $r_{ta}$, column 3). Similarly, we analyze $\bar{r}_{ij}$ sampled from CMU-MOSEI emotion (bottom part of Table 3). Our model identified the text modality for predicting the emotion fear (row $r_t$, column Fear; the same indexing applies to later cases), the acoustic modality for disgust, the text-acoustic interaction for surprise, and the acoustic-visual interaction for angry. For the emotions happy and sad, either the trimodal interaction has the most significant connection, or the routing does not differ significantly among modalities.
Interestingly, these results echo previous research. In both the sentiment and emotion cases, acoustic features are crucial for predicting negative sentiment or emotions. This aligns well with results in behavioral science (Lima et al., 2013). Furthermore, Livingstone and Russo (2018) showed that the intensity of the emotion angry is stronger in the acoustic-visual channel than in either the acoustic or the visual modality alone in human speech.

Conclusion
In this paper, we presented Multimodal Routing to identify the contributions of unimodal, bimodal, and trimodal explanatory features to predictions in a local manner. For each specific input, our method dynamically associates an explanatory feature with a prediction if the feature explains the prediction well. We then interpret our approach by analyzing the routing coefficients, showing great variation in feature importance across samples. We also conduct a global interpretation over the whole datasets, showing that acoustic features are crucial for predicting negative sentiment or emotions, and acoustic-visual interactions are crucial for predicting the emotion angry. These observations align with prior work in psychological research. The advantage of both local and global interpretation is achieved without much loss of performance compared to SOTA methods. We believe this work sheds light on the advantages of understanding human behaviors from a multimodal perspective, and takes a step towards more interpretable multimodal language models.

All experiments use the same hyper-parameter configuration in this paper.

A.3 Remarks on CMU-MOSEI Sentiment
Our model poses the problem as classification and predicts only integer labels, so we do not provide the mean absolute error and correlation metrics.

A.4 Remarks on CMU-MOSEI Emotion
Due to the introduction of concepts in our model, we transform the CMU-MOSEI emotion recognition task from a regression problem (every emotion has a score in [0, 3] indicating how strong the evidence for that emotion is) into a classification problem. For each sample with six emotion scores, we label every emotion with a score greater than zero as present in the sample. A data sample then has a multilabel annotation.
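The conversion described above is a simple thresholding step; a minimal sketch (the function name is ours):

```python
import numpy as np

def scores_to_multilabel(scores):
    """Convert six per-emotion intensity scores in [0, 3] into binary presence
    labels: an emotion is marked present iff its score is greater than zero."""
    return (np.asarray(scores, dtype=float) > 0).astype(int)
```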

A.5 Global Interpretation Result
We analyze the global interpretation of both the CMU-MOSEI sentiment and emotion tasks.

CMU-MOSEI Sentiment
The analysis of the routing coefficients $\bar{r}_{ij}$ is included in the main paper. Here we analyze $p_i$ (Table 6) and the products $p_i r_{ij}$ (Table 4). As in the analysis in the main paper, our model relies on the acoustic modality for extremely negative predictions (row $r_a$, column −3) and the text-acoustic bimodal interaction for extremely positive predictions (row $r_{ta}$, column 3). Sentiments that are neutral or less extreme are predicted by contributions from many different modalities and interactions. The activation table shows high activation values (> 0.8) for most modalities and interactions except $p_{vl}$.
CMU-MOSEI Emotion

As above, we analyze $p_i$ (Table 7) and the products $p_i r_{ij}$ (Table 5). The results are very similar to those for $\bar{r}_{ij}$. We see strong connections between the audio-visual interaction and angry, the text modality and fear, the audio modality and disgust, and the text-audio interaction and surprise. The activation table shows high activation values (> 0.8) for most modalities and interactions except $p_{vl}$, the same as for CMU-MOSEI sentiment.