Multimodal Transformer for Unaligned Multimodal Language Sequences

Human language is often multimodal, comprising a mixture of natural language, facial gestures, and acoustic behaviors. However, two major challenges exist in modeling such multimodal human language time-series data: 1) inherent data non-alignment due to variable sampling rates for the sequences from each modality; and 2) long-range dependencies between elements across modalities. In this paper, we introduce the Multimodal Transformer (MulT) to generically address the above issues in an end-to-end manner without explicitly aligning the data. At the heart of our model is the directional pairwise crossmodal attention, which attends to interactions between multimodal sequences across distinct time steps and latently adapts streams from one modality to another. Comprehensive experiments on both aligned and non-aligned multimodal time-series show that our model outperforms state-of-the-art methods by a large margin. In addition, empirical analysis suggests that the proposed crossmodal attention mechanism in MulT is able to capture correlated crossmodal signals.


Introduction
Human language possesses not only spoken words but also nonverbal behaviors from vision (facial attributes) and acoustic (tone of voice) modalities (Gibson et al., 1994). This rich information helps us understand human behaviors and intents. Nevertheless, the heterogeneities across modalities often increase the difficulty of analyzing human language. For example, the receptors for audio and vision streams may operate at different receiving frequencies, and hence we may not obtain an optimal mapping between them. A frowning face may relate to a pessimistic word spoken in the past. That is to say, multimodal language sequences are often "unaligned" in nature and require inferring long-term dependencies across modalities, which raises the question of how to perform efficient multimodal fusion.
* Equal contribution.
To address the above issues, in this paper we propose the Multimodal Transformer (MulT), an end-to-end model that extends the standard Transformer network (Vaswani et al., 2017) to learn representations directly from unaligned multimodal streams. At the heart of our model is the crossmodal attention module, which attends to the crossmodal interactions at the scale of entire utterances. This module latently adapts streams from one modality to another (e.g., vision → language) by repeatedly reinforcing one modality's features with those from the other modalities, regardless of the need for alignment. In comparison, one common way of tackling unaligned multimodal sequences is forced word-alignment before training (Poria et al., 2017; Zadeh et al., 2018a,b; Tsai et al., 2019; Pham et al., 2019; Gu et al., 2018): the visual and acoustic features are manually preprocessed by aligning them to the resolution of words. These approaches then model the multimodal interactions on the (already) aligned time steps and thus do not directly consider long-range crossmodal contingencies of the original features. We note that such word-alignment not only requires feature engineering that involves domain knowledge; in practice, it may also not always be feasible, as it entails extra meta-information about the datasets (e.g., the exact time ranges of words or speech utterances). We illustrate the difference between word-alignment and the crossmodal attention inferred by our model in Figure 1.
For evaluation, we perform a comprehensive set of experiments on three human multimodal language benchmarks: CMU-MOSI (Zadeh et al., 2016), CMU-MOSEI (Zadeh et al., 2018b), and IEMOCAP (Busso et al., 2008). Our experiments show that MulT achieves state-of-the-art (SOTA) results not only in the commonly evaluated word-aligned setting but also in the more challenging unaligned scenario, outperforming prior approaches by a margin of 5%-15% on most of the metrics. In addition, empirical qualitative analysis further suggests that the crossmodal attention used by MulT is capable of capturing correlated signals across asynchronous modalities.

Related Works
Human Multimodal Language Analysis. Prior work on analyzing human multimodal language lies in the domain of inferring representations from multimodal sequences spanning language, vision, and acoustic modalities. Unlike learning multimodal representations from static domains such as image and textual attributes (Ngiam et al., 2011; Srivastava and Salakhutdinov, 2012), human language contains time-series and thus requires fusing time-varying signals (Liang et al., 2018; Tsai et al., 2019). Earlier work used an early fusion approach to concatenate input features from different modalities (Lazaridou et al., 2015; Ngiam et al., 2011) and showed improved performance compared to learning from a single modality. More recently, more advanced models were proposed to learn representations of human multimodal language. For example, Gu et al. (2018) used hierarchical attention strategies to learn multimodal representations, Wang et al. (2019) adjusted the word representations using accompanying non-verbal behaviors, Pham et al. (2019) learned robust multimodal representations using a cyclic translation objective, and Dumpala et al. (2019) explored cross-modal autoencoders for audio-visual alignment. These previous approaches relied on the assumption that multimodal language sequences are already aligned at the resolution of words and considered only short-term multimodal interactions. In contrast, our proposed method requires no alignment assumption and defines crossmodal interactions at the scale of the entire sequences.

Transformer Network. The Transformer network (Vaswani et al., 2017) was first introduced for neural machine translation (NMT) tasks, where the encoder and decoder sides each leverage a self-attention (Parikh et al., 2016; Lin et al., 2017; Vaswani et al., 2017) transformer. After each layer of the self-attention, the encoder and decoder are connected by an additional decoder sublayer where the decoder attends to each element of the source text for each element of the target text. We refer the reader to (Vaswani et al., 2017) for a more detailed explanation of the model. In addition to NMT, transformer networks have also been successfully applied to other tasks, including language modeling (Dai et al., 2018; Baevski and Auli, 2019), semantic role labeling (Strubell et al., 2018), word sense disambiguation (Tang et al., 2018), learning sentence representations (Devlin et al., 2018), and video activity recognition (Wang et al., 2018). This paper draws strong inspiration from the NMT transformer and extends it to a multimodal setting. Whereas the NMT transformer focuses on unidirectional translation from source to target texts, human multimodal language time-series are neither as well-represented nor as discrete as word embeddings, with sequences of each modality having vastly different frequencies. Therefore, we propose not to explicitly translate from one modality to the others (which could be extremely challenging), but to latently adapt elements across modalities via attention. Our model (MulT) therefore has no encoder-decoder structure; instead, it is built up from multiple stacks of pairwise and bidirectional crossmodal attention blocks that directly attend to low-level features (while removing the self-attention). Empirically, we show that our proposed approach improves over the standard transformer on various human multimodal language tasks.

Proposed Method
In this section, we describe our proposed Multimodal Transformer (MulT) (Figure 2) for modeling unaligned multimodal language sequences. At a high level, MulT merges multimodal time-series via a feed-forward fusion process from multiple directional pairwise crossmodal transformers. Specifically, each crossmodal transformer (introduced in Section 3.2) serves to repeatedly reinforce a target modality with the low-level features from another source modality by learning the attention across the two modalities' features. A MulT architecture hence models all pairs of modalities with such crossmodal transformers, followed by sequence models (e.g., self-attention transformers) that predict using the fused features.
The core of our proposed model is the crossmodal attention module, which we first introduce in Section 3.1. Then, in Sections 3.2 and 3.3, we present in detail the various ingredients of the MulT architecture (see Figure 2) and discuss the difference between crossmodal attention and classical multimodal alignment.

Crossmodal Attention
We consider two modalities $\alpha$ and $\beta$, with two (potentially non-aligned) sequences from each of them denoted $X_\alpha \in \mathbb{R}^{T_\alpha \times d_\alpha}$ and $X_\beta \in \mathbb{R}^{T_\beta \times d_\beta}$, respectively. For the rest of the paper, $T_{(\cdot)}$ and $d_{(\cdot)}$ are used to represent sequence length and feature dimension, respectively. Inspired by the decoder transformer in NMT (Vaswani et al., 2017) that translates one language to another, we hypothesize that a good way to fuse crossmodal information is to provide a latent adaptation across modalities; i.e., $\beta$ to $\alpha$. Note that the modalities considered in our paper may span very different domains such as facial attributes and spoken words.
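We define the Queries as $Q_\alpha = X_\alpha W_{Q_\alpha}$, Keys as $K_\beta = X_\beta W_{K_\beta}$, and Values as $V_\beta = X_\beta W_{V_\beta}$, where $W_{Q_\alpha} \in \mathbb{R}^{d_\alpha \times d_k}$, $W_{K_\beta} \in \mathbb{R}^{d_\beta \times d_k}$, and $W_{V_\beta} \in \mathbb{R}^{d_\beta \times d_v}$ are weights. The latent adaptation from $\beta$ to $\alpha$ is presented as the crossmodal attention $Y_\alpha := \text{CM}_{\beta \to \alpha}(X_\alpha, X_\beta) \in \mathbb{R}^{T_\alpha \times d_v}$:

$$Y_\alpha = \text{CM}_{\beta \to \alpha}(X_\alpha, X_\beta) = \text{softmax}\left(\frac{Q_\alpha K_\beta^\top}{\sqrt{d_k}}\right) V_\beta = \text{softmax}\left(\frac{X_\alpha W_{Q_\alpha} W_{K_\beta}^\top X_\beta^\top}{\sqrt{d_k}}\right) X_\beta W_{V_\beta}. \tag{1}$$

Note that $Y_\alpha$ has the same length as $Q_\alpha$ (i.e., $T_\alpha$), but is meanwhile represented in the feature space of $V_\beta$. Specifically, the scaled (by $\sqrt{d_k}$) softmax in Equation (1) computes a score matrix $\text{softmax}(\cdot) \in \mathbb{R}^{T_\alpha \times T_\beta}$, whose $(i, j)$-th entry measures the attention given by the $i$-th time step of modality $\alpha$ to the $j$-th time step of modality $\beta$. Hence, the $i$-th time step of $Y_\alpha$ is a weighted summary of $V_\beta$, with the weights determined by the $i$-th row of $\text{softmax}(\cdot)$. We call Equation (1) a single-head crossmodal attention, which is illustrated in Figure 3(a).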
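The single-head crossmodal attention of Equation (1) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function names, the random weights, and the toy dimensions are all assumptions chosen only to make the shapes visible.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row-wise maximum for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def crossmodal_attention(x_alpha, x_beta, w_q, w_k, w_v):
    """Single-head crossmodal attention from source beta to target alpha."""
    q = x_alpha @ w_q                            # (T_alpha, d_k)
    k = x_beta @ w_k                             # (T_beta, d_k)
    v = x_beta @ w_v                             # (T_beta, d_v)
    d_k = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d_k))    # (T_alpha, T_beta)
    return weights @ v, weights                  # output is (T_alpha, d_v)

# Toy, hypothetical sizes: the two sequences have different lengths and widths.
rng = np.random.default_rng(0)
T_alpha, T_beta, d_alpha, d_beta, d_k, d_v = 5, 12, 8, 6, 4, 3
x_alpha = rng.normal(size=(T_alpha, d_alpha))
x_beta = rng.normal(size=(T_beta, d_beta))
w_q = rng.normal(size=(d_alpha, d_k))
w_k = rng.normal(size=(d_beta, d_k))
w_v = rng.normal(size=(d_beta, d_v))

y_alpha, attn = crossmodal_attention(x_alpha, x_beta, w_q, w_k, w_v)
```

The output keeps the target length `T_alpha` even though the source has `T_beta` steps, which is why no prior alignment of the two sequences is needed.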
Following prior works on transformers (Vaswani et al., 2017; Devlin et al., 2018; Dai et al., 2018), we add a residual connection to the crossmodal attention computation. Then, another positionwise feed-forward sublayer is injected to complete a crossmodal attention block (see Figure 3(b)). Each crossmodal attention block adapts directly from the low-level feature sequence (i.e., $Z^{[0]}_\beta$ in Figure 3(b)) and does not rely on self-attention, which makes it different from the NMT encoder-decoder architecture (Vaswani et al., 2017; Shaw et al., 2018) (i.e., taking intermediate-level features). We argue that performing adaptation from low-level features helps our model preserve the low-level information for each modality. We leave the empirical study of adapting from intermediate-level features (i.e., $Z^{[i-1]}_\beta$) to the Ablation Study in Section 4.3.

Overall Architecture
Three major modalities are typically involved in multimodal language sequences: language ($L$), video ($V$), and audio ($A$) modalities. We denote with $X_{\{L,V,A\}} \in \mathbb{R}^{T_{\{L,V,A\}} \times d_{\{L,V,A\}}}$ the input feature sequences (and the dimensions thereof) from these 3 modalities. With these notations, in this subsection, we describe in greater detail the components of the Multimodal Transformer and how crossmodal attention modules are applied.
Temporal Convolutions. To ensure that each element of the input sequences has sufficient awareness of its neighboring elements, we pass the input sequences through a 1D temporal convolutional layer:

$$\hat{X}_{\{L,V,A\}} = \text{Conv1D}(X_{\{L,V,A\}}, k_{\{L,V,A\}}) \in \mathbb{R}^{T_{\{L,V,A\}} \times d} \tag{2}$$

where $k_{\{L,V,A\}}$ are the sizes of the convolutional kernels for modalities $\{L, V, A\}$, and $d$ is a common dimension. The convolved sequences are expected to contain the local structure of the sequence, which is important since the sequences are collected at different sampling rates. Moreover, since the temporal convolutions project the features of different modalities to the same dimension $d$, the dot-products are admissible in the crossmodal attention module.
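As a rough illustration of this step, the sketch below projects three sequences with different lengths and feature sizes to a common dimension with a 1D convolution over time. The zero padding that keeps the length at $T$, the odd kernel sizes, and all sizes except the 300-dimensional language and 74-dimensional audio features are assumptions made for the example.

```python
import numpy as np

def temporal_conv1d(x, kernel):
    """Apply a 1D convolution over time.

    x:      (T, d_in) input sequence for one modality.
    kernel: (k, d_in, d) weights with an odd kernel size k.
    Returns a (T, d) sequence; zero padding keeps the length at T.
    """
    k = kernel.shape[0]
    pad = k // 2
    x_padded = np.pad(x, ((pad, pad), (0, 0)))
    # windows[t] holds the k time steps centred on step t: shape (T, k, d_in).
    windows = np.stack([x_padded[t:t + k] for t in range(x.shape[0])])
    # Sum over the kernel positions and the input features.
    return np.einsum("tki,kio->to", windows, kernel)

rng = np.random.default_rng(0)
d = 4  # common dimension shared by all modalities (toy value)
# Hypothetical (length, feature size, kernel size) for L, V, and A.
shapes = {"L": (10, 300, 1), "V": (30, 35, 3), "A": (40, 74, 3)}
projected = {}
for name, (T, d_in, k) in shapes.items():
    x = rng.normal(size=(T, d_in))
    kernel = rng.normal(size=(k, d_in, d)) / np.sqrt(k * d_in)
    projected[name] = temporal_conv1d(x, kernel)
```

After this step every modality has feature width `d`, so dot-products between any pair of modalities are well defined, while the sequence lengths remain different.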
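Positional Embedding. To enable the sequences to carry temporal information, following (Vaswani et al., 2017), we augment $\hat{X}_{\{L,V,A\}}$ with positional embedding (PE):

$$Z^{[0]}_{\{L,V,A\}} = \hat{X}_{\{L,V,A\}} + \text{PE}(T_{\{L,V,A\}}, d) \tag{3}$$

where $\text{PE}(T_{\{L,V,A\}}, d) \in \mathbb{R}^{T_{\{L,V,A\}} \times d}$ computes the (fixed) embeddings for each position index, and $Z^{[0]}_{\{L,V,A\}}$ are the resulting low-level position-aware features for different modalities. We leave more details of the positional embedding to Appendix A.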
Crossmodal Transformers. Based on the crossmodal attention blocks, we design the crossmodal transformer that enables one modality to receive information from another modality. In the following, we use the example of passing vision ($V$) information to language ($L$), which is denoted by "$V \to L$". We fix all the dimensions ($d_{\{\alpha,\beta,k,v\}}$) for each crossmodal attention block as $d$.
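Each crossmodal transformer consists of $D$ layers of crossmodal attention blocks (see Figure 3(b)). Formally, a crossmodal transformer computes feed-forwardly for $i = 1, \ldots, D$ layers:

$$\begin{aligned}
Z^{[0]}_{V \to L} &= Z^{[0]}_L \\
\hat{Z}^{[i]}_{V \to L} &= \text{CM}^{[i],\text{mul}}_{V \to L}\left(\text{LN}(Z^{[i-1]}_{V \to L}), \text{LN}(Z^{[0]}_V)\right) + \text{LN}(Z^{[i-1]}_{V \to L}) \\
Z^{[i]}_{V \to L} &= f_{\theta^{[i]}_{V \to L}}\left(\text{LN}(\hat{Z}^{[i]}_{V \to L})\right) + \text{LN}(\hat{Z}^{[i]}_{V \to L})
\end{aligned} \tag{4}$$

where $f_\theta$ is a positionwise feed-forward sublayer parametrized by $\theta$, and $\text{CM}^{[i],\text{mul}}_{V \to L}$ means a multi-head (see (Vaswani et al., 2017) for more details) version of $\text{CM}_{V \to L}$ at layer $i$ (note: $d$ should be divisible by the number of heads). LN means layer normalization (Ba et al., 2016).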
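A compact sketch of a $D$-layer $V \to L$ crossmodal transformer is given below. It is a simplification under stated assumptions: attention is single-head rather than multi-head, layer normalization has no learned gain or bias, the feed-forward sublayer is a two-layer ReLU network, and the placement of LN around the residual connections is one reasonable reading of the layer update. All parameter names are hypothetical.

```python
import numpy as np

def layer_norm(z, eps=1e-5):
    # Normalize each time step over its feature dimension (no learned scale).
    mean = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mean) / np.sqrt(var + eps)

def softmax(scores):
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(target, source, p):
    # Queries come from the target; Keys and Values come from the source.
    q, k, v = target @ p["w_q"], source @ p["w_k"], source @ p["w_v"]
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def feed_forward(z, p):
    # Positionwise two-layer network applied to every time step.
    return np.maximum(z @ p["w_1"], 0.0) @ p["w_2"]

def crossmodal_transformer(z_target0, z_source0, layers):
    z = z_target0
    src = layer_norm(z_source0)  # every layer attends to the low-level source
    for p in layers:
        z_norm = layer_norm(z)
        z_hat = attention(z_norm, src, p) + z_norm        # attention + residual
        z_hat_norm = layer_norm(z_hat)
        z = feed_forward(z_hat_norm, p) + z_hat_norm      # feed-forward + residual
    return z

def init_layer(rng, d, d_ff):
    scale = 1.0 / np.sqrt(d)
    return {
        "w_q": rng.normal(size=(d, d)) * scale,
        "w_k": rng.normal(size=(d, d)) * scale,
        "w_v": rng.normal(size=(d, d)) * scale,
        "w_1": rng.normal(size=(d, d_ff)) * scale,
        "w_2": rng.normal(size=(d_ff, d)) / np.sqrt(d_ff),
    }

rng = np.random.default_rng(0)
T_L, T_V, d, D = 7, 20, 8, 3              # toy sizes
z_l0 = rng.normal(size=(T_L, d))          # low-level language features
z_v0 = rng.normal(size=(T_V, d))          # low-level vision features
layers = [init_layer(rng, d, 4 * d) for _ in range(D)]
z_v_to_l = crossmodal_transformer(z_l0, z_v0, layers)
```

Note that the source argument is never updated inside the loop: each layer draws its Keys and Values from the same low-level source features, while only the target sequence is refined.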
In this process, each modality keeps updating its sequence via low-level external information from the multi-head crossmodal attention module. At every level of the crossmodal attention block, the low-level signals from the source modality are transformed to a different set of Key/Value pairs to interact with the target modality. Empirically, we find that the crossmodal transformer learns to correlate meaningful elements across modalities (see Section 4 for details). The eventual MulT is based on modeling every pair of crossmodal interactions. Therefore, with 3 modalities (i.e., $L$, $V$, $A$) in consideration, we have 6 crossmodal transformers in total (see Figure 2).

Self-Attention Transformers and Prediction.
As a final step, we concatenate the outputs from the crossmodal transformers that share the same target modality to yield $Z_{\{L,V,A\}} \in \mathbb{R}^{T_{\{L,V,A\}} \times 2d}$. For example, $Z_L = [Z^{[D]}_{V \to L}; Z^{[D]}_{A \to L}]$. Each of them is then passed through a sequence model to collect temporal information to make predictions. We choose the self-attention transformer (Vaswani et al., 2017). Eventually, the last elements of the sequence models are extracted to pass through fully-connected layers to make predictions.
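The fusion step can be sketched as follows. To keep the example short, the per-target sequence model (the self-attention transformer) is omitted, and concatenating the three last elements before a single linear layer is an assumption about how the fully-connected head is fed; names and sizes are hypothetical.

```python
import numpy as np

def fuse_and_predict(cm_outputs, w_out, b_out):
    """cm_outputs maps each target modality to its two crossmodal outputs."""
    last_steps = []
    for target in ("L", "V", "A"):
        # Feature-wise concatenation of the two streams: (T_target, 2d).
        z_target = np.concatenate(cm_outputs[target], axis=-1)
        # A sequence model would be applied to z_target here (omitted).
        last_steps.append(z_target[-1])           # last element, shape (2d,)
    features = np.concatenate(last_steps)          # shape (6d,)
    return features @ w_out + b_out

rng = np.random.default_rng(0)
d = 4
lengths = {"L": 10, "V": 30, "A": 40}              # toy sequence lengths
cm_outputs = {
    target: [rng.normal(size=(T, d)), rng.normal(size=(T, d))]
    for target, T in lengths.items()
}
w_out = rng.normal(size=(6 * d, 1))
b_out = np.zeros(1)
prediction = fuse_and_predict(cm_outputs, w_out, b_out)
```

Because only the last element of each target sequence is used, the three modalities can keep their own lengths all the way to the prediction head.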

Discussion about Attention & Alignment
When modeling unaligned multimodal language sequences, MulT relies on crossmodal attention blocks to merge signals across modalities. In prior works, by contrast, the multimodal sequences were (manually) aligned to the same length before training (Zadeh et al., 2018b; ... (Yu et al., 2016)). We illustrate their differences in Figure 4.

Experiments
In this section, we empirically evaluate the Multimodal Transformer (MulT) on three datasets that are frequently used to benchmark human multimodal affect recognition in prior works (Pham et al., 2019; Tsai et al., 2019; Liang et al., 2018). Our goal is to compare MulT with prior competitive approaches on both word-aligned (which almost all prior works employ) and unaligned (which is more challenging, and which MulT is generically designed for) multimodal language sequences.

Datasets and Evaluation Metrics
Each task consists of a word-aligned (processed in the same way as in prior works) and an unaligned version. For both versions, the multimodal features are extracted from the textual (GloVe word embeddings (Pennington et al., 2014)), visual (Facet (iMotions, 2017)), and acoustic (COVAREP (Degottex et al., 2014)) data modalities. A more detailed introduction to the features is included in Appendix D.
For the word-aligned version, following (Zadeh et al., 2018a; Tsai et al., 2019; Pham et al., 2019), we first use P2FA (Yuan and Liberman, 2008) to obtain the aligned timesteps (segmented w.r.t. words) for audio and vision streams, and we then perform averaging on the audio and vision features within these time ranges. All sequences in the word-aligned case have length 50. The process remains the same across all the datasets. On the other hand, for the unaligned version, we keep the original audio and visual features as extracted, without any word-segmented alignment or manual subsampling. As a result, the lengths of each modality vary significantly, where audio and vision sequences may contain more than 1,000 time steps. We elaborate on the three tasks below.

CMU-MOSI & MOSEI. CMU-MOSI (Zadeh et al., 2016) is a human multimodal sentiment analysis dataset consisting of 2,199 short monologue video clips (each lasting the duration of a sentence). Acoustic and visual features of CMU-MOSI are extracted at a sampling rate of 12.5 and 15 Hz, respectively (while textual data are segmented per word and expressed as discrete word embeddings). Meanwhile, CMU-MOSEI (Zadeh et al., 2018b) is a sentiment and emotion analysis dataset made up of 23,454 movie review video clips taken from YouTube (about 10× the size of CMU-MOSI). The unaligned CMU-MOSEI sequences are extracted at a sampling rate of 20 Hz for acoustic and 15 Hz for vision signals.
For both CMU-MOSI and CMU-MOSEI, each sample is labeled by human annotators with a sentiment score from -3 (strongly negative) to 3 (strongly positive). We evaluate the model performances using various metrics, in agreement with those employed in prior works: 7-class accuracy (i.e., $\text{Acc}_7$: sentiment score classification in $\mathbb{Z} \cap [-3, 3]$), binary accuracy (i.e., $\text{Acc}_2$: positive/negative sentiments), F1 score, mean absolute error (MAE) of the score, and the correlation of the model's prediction with human annotations. Both tasks are frequently used to benchmark models' ability to fuse multimodal information.
IEMOCAP. IEMOCAP (Busso et al., 2008) consists of 10K videos for human emotion analysis. As suggested by Wang et al. (2019), 4 emotions (happy, sad, angry, and neutral) were selected for emotion recognition. Unlike CMU-MOSI and CMU-MOSEI, this is a multilabel task (e.g., a person can be sad and angry simultaneously). Its multimodal streams use fixed sampling rates for audio (12.5 Hz) and vision (15 Hz) signals. We follow (Poria et al., 2017; Wang et al., 2019; Tsai et al., 2019) to report the binary classification accuracy and the F1 score of the predictions.
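For concreteness, the CMU-MOSI and CMU-MOSEI sentiment metrics can be computed along the following lines. This is a sketch under assumptions rather than the evaluation script used in prior works: a score of exactly 0 is treated as positive, F1 is computed for the positive class only (a weighted variant may be used in practice), and the function name and sample scores are made up.

```python
import numpy as np

def sentiment_metrics(pred, gold):
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    # 7-class accuracy: clip to [-3, 3] and round to the nearest integer class.
    pred7 = np.round(np.clip(pred, -3, 3))
    gold7 = np.round(np.clip(gold, -3, 3))
    # Binary labels; treating a score of exactly 0 as positive is an assumption.
    pred_pos, gold_pos = pred >= 0, gold >= 0
    tp = np.sum(pred_pos & gold_pos)
    fp = np.sum(pred_pos & ~gold_pos)
    fn = np.sum(~pred_pos & gold_pos)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom > 0 else 0.0
    return {
        "acc7": float(np.mean(pred7 == gold7)),
        "acc2": float(np.mean(pred_pos == gold_pos)),
        "f1": float(f1),
        "mae": float(np.mean(np.abs(pred - gold))),
        "corr": float(np.corrcoef(pred, gold)[0, 1]),
    }

# Four made-up predictions and human scores.
metrics = sentiment_metrics([2.6, -1.2, 0.4, -2.9], [3.0, -1.0, -1.0, -2.0])
```

On these four samples, two rounded scores match (7-class accuracy 0.5), three signs match (binary accuracy 0.75), and the mean absolute error is 0.725.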

Baselines
We choose Early Fusion LSTM (EF-LSTM) and Late Fusion LSTM (LF-LSTM) as baseline models, as well as Recurrent Attended Variation Embedding Network (RAVEN) (Wang et al., 2019) and Multimodal Cyclic Translation Network (MCTN) (Pham et al., 2019), which achieved SOTA results on various word-aligned human multimodal language tasks. To compare the models comprehensively, we adapt the connectionist temporal classification (CTC) (Graves et al., 2006) method to the prior approaches (e.g., EF-LSTM, MCTN, RAVEN) that cannot be applied directly to the unaligned setting. Specifically, these models train to optimize the CTC alignment

Quantitative Analysis
Word-Aligned Experiments. We first evaluate MulT on the word-aligned sequences, the "home turf" of prior approaches modeling human multimodal language (Sheikh et al., 2018; Tsai et al., 2019; Pham et al., 2019; Wang et al., 2019). The upper parts of Tables 1, 2, and 3 show the results of MulT and baseline approaches on the word-aligned task. With similar model sizes (around 200K parameters), MulT outperforms the other competitive approaches on different metrics on all tasks, with the exception of the "sad" class results on IEMOCAP.
Unaligned Experiments. Next, we evaluate MulT on the same set of datasets in the unaligned setting. Note that MulT can be directly applied to unaligned multimodal streams, while the baseline models (except for LF-LSTM) require an additional alignment module (e.g., a CTC module).
The results are shown in the bottom parts of Tables 1, 2, and 3. On the three benchmark datasets, MulT improves upon the prior methods (some with CTC) by 10%-15% on most attributes. Empirically, we find that MulT converges faster to better results at training when compared to other competitive approaches (see Figure 5). In addition, while we note that in general there is a performance drop on all models when we shift from the word-aligned to unaligned multimodal time-series, the impact on MulT is much smaller than on the other approaches. We hypothesize that such a performance drop occurs because the asynchronous (and much longer) data streams introduce more difficulty in recognizing important features and computing the appropriate attention.
1 All experiments are conducted on 1 GTX-1080Ti GPU. The code for our model and experiments can be found at https://github.com/yaohungt/Multimodal-Transformer
Ablation Study. To further study the influence of the individual components in MulT, we perform a comprehensive ablation analysis using the unaligned version of CMU-MOSEI. The results are shown in Table 4. First, we consider the performance when using only unimodal transformers (i.e., language, audio, or vision only). We find that the language transformer outperforms the other two by a large margin. For example, for the $\text{Acc}^h_2$ metric, the model improves from 65.6 to 77.4 when comparing the audio-only to the language-only unimodal transformer. This fact aligns with the observations in prior work (Pham et al., 2019), where the authors found that a good language network could already achieve good performance at inference time.
Figure 6 (caption excerpt): We found that the crossmodal attention has learned to correlate certain meaningful words (e.g., "movie", "disappointing") with segments of stronger visual signals (typically stronger facial motions or expression change), despite the lack of alignment between original L/V sequences. Note that due to temporal convolution, each textual/visual feature contains the representation of nearby elements.
Second, we consider 1) a late-fusion transformer that feature-wise concatenates the last elements of three self-attention transformers; and 2) an early-fusion self-attention transformer that takes in a temporal concatenation of three asynchronous sequences $[\hat{X}_L, \hat{X}_V, \hat{X}_A] \in \mathbb{R}^{(T_L + T_V + T_A) \times d_q}$ (see Section 3.2). Empirically, we find that both EF- and LF-Transformer (which fuse multimodal signals) outperform unimodal transformers.
Finally, we study the importance of individual crossmodal transformers according to the target modalities (i.e., using only the crossmodal transformers that target $L$, $V$, or $A$). As shown in Table 4, we find that crossmodal attention modules consistently improve over the late- and early-fusion transformer models in most metrics on unaligned CMU-MOSEI. In particular, among the three crossmodal transformers, the one where language ($L$) is the target modality works best. We additionally study the effect of adapting intermediate-level instead of low-level features from the source modality in crossmodal attention blocks (similar to the NMT encoder-decoder architecture but without self-attention; see Section 3.1). While MulT leveraging intermediate-level features still outperforms models in other ablative settings, we empirically find that adapting from low-level features works best. The ablations suggest that crossmodal attention concretely benefits MulT with better representation learning.

Qualitative Analysis
To understand how crossmodal attention works while modeling unaligned multimodal data, we empirically inspect what kind of signals MulT picks up by visualizing the attention activations. Figure 6 shows an example of a section of the crossmodal attention matrix on layer 3 of the $V \to L$ network of MulT (the original matrix has dimension $T_L \times T_V$; the figure shows the attention corresponding to approximately a 6-sec short window of that matrix). We find that crossmodal attention has learned to attend to meaningful signals across the two modalities. For example, stronger attention is given to the intersection of words that tend to suggest emotions (e.g., "movie", "disappointing") and drastic facial expression changes in the video (start and end of the above vision sequence). This observation supports one of the aforementioned advantages of MulT over conventional alignment (see Section 3.3): crossmodal attention enables MulT to directly capture potentially long-range signals, including those on the off-diagonals of the attention matrix.

Discussion
In this paper, we propose the Multimodal Transformer (MulT) for analyzing human multimodal language. At the heart of MulT is the crossmodal attention mechanism, which provides a latent crossmodal adaptation that fuses multimodal information by directly attending to low-level features in other modalities. Whereas prior approaches focused primarily on aligned multimodal streams, MulT serves as a strong baseline capable of capturing long-range contingencies, regardless of the alignment assumption. Empirically, we show that MulT exhibits the best performance when compared to prior methods.
We believe the results of MulT on unaligned human multimodal language sequences suggest many exciting possibilities for its future applications (e.g., Visual Question Answering tasks, where the input signals are a mixture of static and time-evolving signals). We hope the emergence of MulT will encourage further exploration of tasks where alignment used to be considered necessary, but where crossmodal attention might be an equally (if not more) competitive alternative.

A Positional Embedding
A purely attention-based transformer network is order-invariant. In other words, permuting the order of an input sequence does not change the transformer's behavior or alter its output. One solution to this weakness is to embed the positional information into the hidden units (Vaswani et al., 2017).
Following (Vaswani et al., 2017), we encode the positional information of a sequence of length $T$ via the sin and cos functions with frequencies dictated by the feature index. In particular, we define the positional embedding (PE) of a sequence $X \in \mathbb{R}^{T \times d}$ (where $T$ is length) as a matrix where:

$$\text{PE}[i, 2j] = \sin\left(\frac{i}{10000^{2j/d}}\right), \qquad \text{PE}[i, 2j+1] = \cos\left(\frac{i}{10000^{2j/d}}\right)$$

for $i = 1, \ldots, T$ and $j = 0, \ldots, \lfloor d/2 \rfloor$. Therefore, each feature dimension (i.e., column) of PE consists of positional values that exhibit a sinusoidal pattern. Once computed, the positional embedding is added directly to the sequence so that $X + \text{PE}$ encodes the elements' position information at every time step.
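A minimal NumPy sketch of such a sinusoidal positional embedding is shown below, using the standard form of Vaswani et al. (2017) with sine values in the even columns and cosine values in the odd columns. It assumes an even feature dimension $d$ and lets $j$ range over $0, \ldots, d/2 - 1$ so that every column index stays within $d$; the function name is hypothetical.

```python
import numpy as np

def positional_embedding(T, d):
    """Return a (T, d) matrix of sinusoidal position values (d must be even)."""
    assert d % 2 == 0, "this sketch assumes an even feature dimension"
    pos = np.arange(1, T + 1)[:, None]         # positions i = 1, ..., T
    j = np.arange(d // 2)[None, :]             # frequency index per column pair
    angles = pos / (10000.0 ** (2 * j / d))    # shape (T, d/2)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)               # even columns: PE[i, 2j]
    pe[:, 1::2] = np.cos(angles)               # odd columns:  PE[i, 2j+1]
    return pe

T, d = 6, 8
x = np.zeros((T, d))                           # stand-in for a feature sequence
x_with_position = x + positional_embedding(T, d)
```

Since the embedding depends only on $T$ and $d$, it can be computed once and reused for every sequence of the same shape.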

B Connectionist Temporal Classification
Connectionist Temporal Classification (CTC) (Graves et al., 2006) was first proposed for unsupervised speech-to-text alignment. In particular, CTC is often combined with the output of a recurrent neural network, which enables the model to train end-to-end and simultaneously infer speech-text alignment without supervision. For ease of explanation, suppose the CTC module is aiming at aligning an audio signal sequence $[a_1, a_2, a_3, a_4, a_5, a_6]$ with length 6 to a textual sequence "I am really really happy" with length 5. In this example, we refer to audio as the source and text as the target signal, noting that the sequence lengths may differ between the source and the target; we also see that the output sequence may have repeated elements (i.e., "really"). The CTC (Graves et al., 2006) module we use comprises two components: the alignment predictor and the CTC loss.
First, the alignment predictor is often chosen as a recurrent network such as an LSTM, which operates on the source sequence and then outputs the probability of each source element being one of the unique words in the target sequence or an empty word (i.e., x). In our example, for each individual audio signal, the alignment predictor provides a vector of length 5 giving the probability of being aligned to [x, 'I', 'am', 'really', 'happy'].
Next, the CTC loss considers the negative log-likelihood loss from only the proper alignment for the alignment predictor outputs. The proper alignment, in our example, can be results such as iii) ['I', 'am', x, 'really', x, 'happy']. When the CTC loss is minimized, it implies the source signals are properly aligned to target signals.
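Once an alignment predictor has been trained, its probabilities can be used to re-weight the source sequence into a sequence of target length. The sketch below illustrates this idea only; it is not the authors' code. It assumes that column 0 of the probability matrix is the blank symbol and that the remaining columns index target positions, and the function name, sizes, and random inputs are hypothetical.

```python
import numpy as np

def pseudo_align(source, probs):
    """Re-weight a source sequence into a target-length sequence.

    source: (T_src, d) source features (e.g., audio).
    probs:  (T_src, 1 + T_tgt) alignment probabilities; column 0 is the blank.
    Returns a (T_tgt, d) pseudo-aligned sequence.
    """
    weights = probs[:, 1:]        # drop the blank column: (T_src, T_tgt)
    return weights.T @ source     # each target step mixes the source steps

rng = np.random.default_rng(0)
T_src, T_tgt, d = 6, 5, 3         # six audio steps, five words, toy feature size
source = rng.normal(size=(T_src, d))
logits = rng.normal(size=(T_src, 1 + T_tgt))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
pseudo_aligned = pseudo_align(source, probs)
```

With a hard (one-hot) alignment, each target step simply copies the source step assigned to it, and source steps assigned to the blank are dropped.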
To sum up, in the experiments that adopt the CTC module, we train the alignment predictor while minimizing the CTC loss. Then, excluding the probability of blank words, we multiply the probability outputs from the alignment predictor with the source signals. The source signal hence results in a pseudo-aligned target signal. In our example, the audio signal is then transformed into an audio signal $[a_1, a_2, a_3, a_4, a_5]$ with sequence length 5, which is pseudo-aligned to ['I', 'am', 'really', 'really', 'happy'].

C Hyperparameters
Table 5 shows the settings of the various MulTs that we train on human multimodal language tasks. As previously mentioned, the models are kept at roughly the same sizes as in prior works for the purpose of fair comparison. For hyperparameters such as the dropout rate and the number of heads in the crossmodal attention module, we perform a basic grid search. We decay the learning rate by a factor of 10 when the validation performance plateaus.

D Features
The features for multimodal datasets are extracted as follows:
- Language. We convert video transcripts into pre-trained GloVe word embeddings (glove.840B.300d) (Pennington et al., 2014). The embedding is a 300-dimensional vector.
- Audio. We use COVAREP (Degottex et al., 2014) for extracting low-level acoustic features. The features include 12 Mel-frequency cepstral coefficients (MFCCs), pitch tracking and voiced/unvoiced segmenting features, glottal source parameters, peak slope parameters, and maxima dispersion quotients. The dimension of the feature is 74.