Integrating Multimodal Information in Large Pretrained Transformers

Recent Transformer-based contextual word representations, including BERT and XLNet, have shown state-of-the-art performance in multiple disciplines within NLP. Fine-tuning the trained contextual models on task-specific datasets has been the key to achieving superior performance downstream. While fine-tuning these pre-trained models is straightforward for lexical applications (applications with only the language modality), it is not trivial for multimodal language (a growing area in NLP focused on modeling face-to-face communication). More specifically, this is because pre-trained models do not have the necessary components to accept the two extra modalities, visual and acoustic. In this paper, we propose an attachment to BERT and XLNet called Multimodal Adaptation Gate (MAG). MAG allows BERT and XLNet to accept multimodal nonverbal data during fine-tuning. It does so by generating a shift to the internal representation of BERT and XLNet; a shift that is conditioned on the visual and acoustic modalities. In our experiments, we study the commonly used CMU-MOSI and CMU-MOSEI datasets for multimodal sentiment analysis. Fine-tuning MAG-BERT and MAG-XLNet significantly boosts the sentiment analysis performance over previous baselines as well as over language-only fine-tuning of BERT and XLNet. On the CMU-MOSI dataset, MAG-XLNet achieves human-level multimodal sentiment analysis performance for the first time in the NLP community.


Introduction
Human face-to-face communication flows as a seamless integration of language, acoustic, and vision modalities. In ordinary everyday interactions, we utilize all these modalities jointly to convey our intentions and emotions. Understanding this face-to-face communication falls within a rapidly growing NLP research area called multimodal language analysis (Zadeh et al., 2018b). The biggest challenge in this area is to efficiently model the three pillars of communication together. This gives artificial intelligence systems the capability to comprehend multi-sensory information without disregarding nonverbal factors. In many applications such as dialogue systems and virtual reality, this capability is crucial to maintaining a high quality of user interaction.
The recent success of contextual word representations in NLP is largely credited to new Transformer-based (Vaswani et al., 2017) models such as BERT (Devlin et al., 2018) and XLNet. These Transformer-based models have shown performance improvements across downstream tasks (Devlin et al., 2018). However, their true downstream potential comes from fine-tuning the pre-trained models for particular tasks (Devlin et al., 2018). This is often done easily for lexical datasets, which exhibit the language modality only. However, fine-tuning for multimodal language is neither trivial nor yet studied, simply because both BERT and XLNet expect only linguistic input. Therefore, in applying BERT and XLNet to multimodal language, one must either (a) forfeit the nonverbal information and fine-tune for language, or (b) simply extract word representations and proceed to use a state-of-the-art model for multimodal studies.
In this paper, we present a successful framework for fine-tuning BERT and XLNet for multimodal input. Our framework allows the BERT and XLNet core structures to remain intact, and only attaches a carefully designed Multimodal Adaptation Gate (MAG) to the models. Using an attention conditioned on the nonverbal behaviors, MAG essentially maps the informative visual and acoustic factors to a vector with a trajectory and magnitude. During fine-tuning, this adaptation vector modifies the internal state of BERT and XLNet, allowing the models to seamlessly adapt to the multimodal input. In our experiments we use the CMU-MOSI (Zadeh et al., 2016) and CMU-MOSEI (Zadeh et al., 2018d) datasets of multimodal language, with a specific focus on the core NLP task of multimodal sentiment analysis. We compare the performance of MAG-BERT and MAG-XLNet to scenarios (a) and (b) above in both classification and regression sentiment analysis. Our findings demonstrate that fine-tuning these advanced pre-trained Transformers using MAG yields consistent improvement, even though BERT and XLNet were never trained on multimodal data.
The contributions of this paper are therefore summarized as:
• We propose an efficient framework for fine-tuning BERT and XLNet for multimodal language data. This framework uses a component called Multimodal Adaptation Gate (MAG) that introduces minimal overhead to both the models.
• MAG-BERT and MAG-XLNet set new state of the art in both CMU-MOSI and CMU-MOSEI datasets, when compared to scenarios (a) and (b). For CMU-MOSI, MAG-XLNet achieves performance on par with reported human performance.

Related Works
The studies in this paper are related to the following research areas:

Multimodal Language Analyses
Multimodal language analysis is a recent research trend in natural language processing (Zadeh et al., 2018b) that helps us understand language from the modalities of text, vision and acoustic. These analyses have particularly focused on the tasks of sentiment analysis, emotion recognition (Zadeh et al., 2018d), and personality trait recognition (Park et al., 2014). Works in this area often focus on novel multimodal neural architectures (Pham et al., 2019; Hazarika et al., 2018) and multimodal fusion approaches (Tsai et al., 2018). Related to the content of this paper, we discuss some of the models in this domain, including TFN, MARN, MFN, RMFN and MulT. Tensor Fusion Network (TFN) creates a multi-dimensional tensor to explicitly capture all possible interactions between the three modalities: unimodal, bimodal and trimodal. Multi-attention Recurrent Network (MARN) (Zadeh et al., 2018c) uses three separate hybrid LSTM memories that have the ability to propagate the cross-modal interactions. Memory Fusion Network (MFN) (Zadeh et al., 2018a) synchronizes the information from three separate LSTMs through a multi-view gated memory. Recurrent Memory Fusion Network (RMFN) captures the nuanced interactions among the modalities in a multi-stage manner, giving each stage the ability to focus on a subset of signals. Multimodal Transformer for Unaligned Multimodal Language Sequences (MulT) (Tsai et al., 2019) deploys three Transformers, one for each modality, to capture the interactions with the other two modalities in a self-attentive manner. The information from the three Transformers is aggregated through late fusion.

Pre-trained Language Representations
Learning word representations from large corpora has been an active research area in the NLP community (Mikolov et al., 2013; Pennington et al., 2014). GloVe (Pennington et al., 2014) and Word2Vec (Mikolov et al., 2013) contributed to advancing the state of the art of many NLP tasks. A major drawback of these word representations is their non-contextual nature. Recently, contextual language representation models trained on large text corpora have achieved state-of-the-art results on several NLP tasks including question answering, sentiment classification, part-of-speech (POS) tagging and similarity modeling (Peters et al., 2018; Devlin et al., 2018). The first two notable contextual representation based models were ELMO (Peters et al., 2018) and GPT (Radford et al., 2018). However, they only captured unidirectional context and therefore missed more nuanced interactions among the words of a sentence. BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018) outperforms both ELMO and GPT since it can provide better representations by capturing bidirectional context using Transformers. XLNet gives new contextual representations by building an auto-regressive model capable of capturing all possible factorizations of the input. Fine-tuning pre-trained models for BERT and XLNet has been a key factor in achieving state-of-the-art performance for downstream tasks. Even though previous works have explored using BERT to model multimodal data (Sun et al., 2019), to the best of our knowledge, directly fine-tuning BERT or XLNet for multimodal data has not been explored in previous works.

BERT and XLNet
To better understand the proposed multimodal framework in this paper, we first present an overview of both the BERT and XLNet models. We start by quickly formalizing the operations within Transformer and Transformer-XL models, followed by an overview of BERT and XLNet.

Transformer
Transformer is a non-recurrent neural architecture designed for modeling sequential data (Vaswani et al., 2017). The superior performance of Transformer model is largely credited to a Multi-head Self-Attention module. Using this module, each element of a sequence is attended by conditioning on all the other sequence elements. Figure 2 summarizes internal operations of a Transformer layer (for M such layers). Commonly, a Transformer uses an encoder-decoder paradigm. A stack of encoders is followed by a stack of decoders to map an input sequence to an output sequence. An additional embedding step with Positional Input Embedding is applied before the input goes through the stack of encoders and decoders.
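As a rough illustration of the self-attention idea described above, the following sketch computes single-head scaled dot-product attention over a toy sequence. It is a simplification under stated assumptions: the learned query, key, and value projections and the multiple heads of an actual Transformer are omitted, and the function names and toy inputs are illustrative only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the inputs themselves
    (identity projections), which a real Transformer would learn.
    """
    d = len(x[0])
    out = []
    for q in x:
        # Similarity of this element to every element of the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # Each output is a weighted average of all sequence elements.
        out.append([sum(w * v[c] for w, v in zip(weights, x)) for c in range(d)])
    return out

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(seq)
```

Because the attention weights form a convex combination, every output vector lies within the range spanned by the input vectors.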

Transformer-XL
Transformer-XL is an extension of the Transformer which offers two improvements: a) it enhances the capability of the Transformer to capture long-range dependencies (specifically for the case of context fragmentation), and b) it improves the capability to predict the first few symbols (which are often crucial for the rest of the sequence). It does so with a recurrence mechanism designed to pass context information from one segment to the next and a relative positional encoding mechanism to enable state reuse without causing temporal confusion.

BERT
BERT is a successful language model that provides rich contextual word representations (Devlin et al., 2018). It follows an auto-encoding approach (masking out a portion of the input tokens and predicting those tokens based on all other non-masked tokens) and thus learns a vector representation for the masked-out tokens in the process. We use the variant of BERT used for Single Sentence Classification Tasks. First, input embeddings are generated from a sequence of word-piece tokens by adding token embeddings, segment embeddings and position embeddings. Then multiple Encoder layers are applied on top of these input embeddings. Each Encoder has a Multi-Head Attention layer and a Feed Forward layer, each followed by a residual connection with layer normalization. A special [CLS] token is prepended to the input token sequence. So, for an input sequence of length N, we get N + 1 vectors from the last Encoder layer; the first of those vectors is used to predict the label of the input after it undergoes an affine transformation.

XLNet
XLNet sets out to improve two critical aspects of the BERT model: a) the independence assumed among the masked-out tokens and b) the pretrain-finetune discrepancy between training and inference, since inference inputs do not have masked-out tokens. XLNet is an auto-regressive model and is therefore free from the need to mask out certain tokens. However, auto-regressive models usually capture only unidirectional context (either forward or backward). XLNet can learn bidirectional context by maximizing likelihood over all possible permutations of the factorization order. In essence, it randomly samples multiple factorization orders and trains the model on each of those orders. Therefore, it can model the input by taking all possible permutations into consideration (in expectation).
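To make the idea of a sampled factorization order concrete, the sketch below draws one random order over token positions and records which positions each token may condition on under that order. This is only an illustration of the bookkeeping: the function names are hypothetical, and no model or likelihood is involved.

```python
import random

def sample_factorization_order(n, rng):
    """Return a random permutation of the positions 0..n-1."""
    order = list(range(n))
    rng.shuffle(order)
    return order

def visible_context(order):
    """Map each position to the set of positions preceding it in `order`.

    Under a given factorization order, a token is predicted from exactly
    the tokens that come earlier in that order, regardless of where they
    sit in the original sequence.
    """
    context = {}
    seen = []
    for pos in order:
        context[pos] = set(seen)
        seen.append(pos)
    return context

rng = random.Random(0)
order = sample_factorization_order(4, rng)
context = visible_context(order)
```

Averaged over many sampled orders, every position ends up conditioning on tokens to both its left and its right, which is how an auto-regressive objective can still expose bidirectional context.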
XLNet utilizes two key ideas from Transformer-XL: relative positioning and the segment recurrence mechanism. Like BERT, it also has an Input Embedder followed by multiple Encoders. The Embedder converts the input tokens into vectors after adding token embedding, segment embedding and relative positional embedding information. Each Encoder consists of a Multi-Head Attention layer and a Feed Forward layer, each followed by a residual addition and normalization layer. The Embedder output is fed into the Encoders to get a contextual representation of the input.
Figure 1: Multimodal Adaptation Gate (MAG) takes as input a lexical input vector, as well as its visual and acoustic accompaniments. Subsequently, an attention over lexical and nonverbal dimensions is used to fuse the multimodal data into another vector, which is subsequently added to the input lexical vector (shifting).

Multimodal Adaptation Gate (MAG)
In multimodal language, a lexical input is accompanied by visual and acoustic information, namely the gestures and prosody co-occurring with language. Consider a semantic space that captures latent concepts (positions in the latent space) for individual words. In the absence of multimodal accompaniments, the semantic space is directly conditioned on the language manifold. Simply put, each word falls within some part of this semantic space, depending only on the meaning of the word in a linguistic structure (i.e., a sentence). Nonverbal behaviors can have an impact on the meaning of words, and therefore on the position of words in this semantic space. Together, language and nonverbal accompaniments decide on the new position of the word in the semantic space. In this paper, we regard this new position as the sum of the language-only position and a displacement vector: a vector with trajectory and magnitude that shifts the language-only position of the word to the new position in light of nonverbal behaviors. This is the core philosophy behind the Multimodal Adaptation Gate (MAG). A particularly appealing implementation of such displacement is studied in RAVEN (Wang et al., 2018), where displacements are calculated using cross-modal self-attention to highlight relevant nonverbal information. Figure 1 shows the MAG studied in this paper. Essentially, a MAG unit receives three inputs: one purely lexical, one visual, and one acoustic. Let the triplet $(Z_i, A_i, V_i)$ denote these inputs for the $i$th word in a sequence. We break this displacement into bimodal factors $[Z_i; A_i]$ and $[Z_i; V_i]$ by concatenating the lexical vector with acoustic and visual information respectively, and use them to produce two gating vectors $g_i^v$ and $g_i^a$:
$$g_i^v = R(W_{gv}[Z_i; V_i] + b_v)$$
$$g_i^a = R(W_{ga}[Z_i; A_i] + b_a)$$
where $W_{gv}$ and $W_{ga}$ are weight matrices for the visual and acoustic modalities, and $b_v$ and $b_a$ are scalar biases. $R(x)$ is a non-linear activation function.
These gates highlight the relevant information in visual and acoustic modality conditioned on the lexical vector.
We then create a nonverbal displacement vector $H_i$ by fusing together $A_i$ and $V_i$ multiplied by their respective gating vectors:
$$H_i = g_i^a \cdot (W_a A_i) + g_i^v \cdot (W_v V_i) + b_H$$
where $W_a$ and $W_v$ are weight matrices for acoustic and visual information respectively and $b_H$ is the bias vector. Subsequently, we use a weighted summation between $Z_i$ and its nonverbal displacement $H_i$ to create a multimodal vector $\bar{Z}_i$:
$$\bar{Z}_i = Z_i + \alpha H_i, \qquad \alpha = \min\left(\frac{\|Z_i\|_2}{\|H_i\|_2}\,\beta,\; 1\right)$$
where $\beta$ is a hyper-parameter selected through the cross-validation process. $\|Z_i\|_2$ and $\|H_i\|_2$ denote the $L_2$ norms of the $Z_i$ and $H_i$ vectors respectively. We use the scaling factor $\alpha$ so that the effect of the nonverbal shift $H_i$ remains within a desirable range. Finally, we apply layer normalization and dropout to $\bar{Z}_i$.
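The gating, displacement, and scaled-shift steps described above can be sketched in plain Python as follows. This is a minimal sketch under several assumptions rather than the reference implementation: the gates are taken to be R(W [Z; modality] + b) with R chosen as ReLU, the scaling factor is taken to be α = min(β‖Z‖/‖H‖, 1), the toy dimensions and random weights are arbitrary, layer normalization has no learned parameters, and dropout is left out.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def l2_norm(x):
    return math.sqrt(sum(xi * xi for xi in x))

def layer_norm(x, eps=1e-5):
    """Normalize x to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def mag(z, a, v, p, beta=1.0, eps=1e-6):
    """Return the shifted lexical vector and the scaling factor alpha."""
    # Gating vectors from the bimodal factors; `z + v` is list
    # concatenation, i.e. [Z; V]. R is assumed to be ReLU here.
    g_v = [max(0.0, s + p["b_v"]) for s in matvec(p["W_gv"], z + v)]
    g_a = [max(0.0, s + p["b_a"]) for s in matvec(p["W_ga"], z + a)]
    # Nonverbal displacement: gated projections of A and V plus a bias.
    proj_a = matvec(p["W_a"], a)
    proj_v = matvec(p["W_v"], v)
    h = [ga * pa + gv * pv + b
         for ga, pa, gv, pv, b in zip(g_a, proj_a, g_v, proj_v, p["b_H"])]
    # Scale the shift so that it cannot dominate the lexical vector.
    alpha = min(beta * l2_norm(z) / (l2_norm(h) + eps), 1.0)
    shifted = [zi + alpha * hi for zi, hi in zip(z, h)]
    # Dropout is omitted: this corresponds to inference-time behavior.
    return layer_norm(shifted), alpha

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_z, d_a, d_v = 4, 3, 2  # toy dimensions, not the ones used in the paper
params = {
    "W_gv": random_matrix(d_z, d_z + d_v, rng), "b_v": 0.0,
    "W_ga": random_matrix(d_z, d_z + d_a, rng), "b_a": 0.0,
    "W_a": random_matrix(d_z, d_a, rng),
    "W_v": random_matrix(d_z, d_v, rng),
    "b_H": [0.0] * d_z,
}
z = [0.5, -1.0, 0.25, 2.0]
a = [0.1, 0.2, 0.3]
v = [1.0, -1.0]
z_bar, alpha = mag(z, a, v, params)
```

With all-zero nonverbal inputs and a zero bias, the displacement vanishes and the output reduces to the layer-normalized lexical vector, which matches the intuition that MAG only moves a word when nonverbal evidence is present.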

MAG-BERT
MAG-BERT applies MAG to a chosen layer of the BERT network (Figure 2 demonstrates the structure of MAG-BERT as well as MAG-XLNet). Essentially, at each layer, BERT contains a lexical vector for the $i$th word in the sequence. For the same word, nonverbal accompaniments are also available in the multimodal language setup. MAG essentially forms an attachment to the desired layer in BERT; an attachment that allows multimodal information to leak into the BERT model and displace the lexical vectors. The operations within MAG allow the lexical vectors within BERT to adapt to multimodal information by changing their positions within the semantic space. Aside from the attachment of MAG, no change is made to the BERT structure. Given a language sequence of length N, $L = [L_1, L_2, \dots, L_N]$, carrying word-piece tokens, a [CLS] token is prepended to $L$ so that we can use it later for class label prediction. Then, we input $L$ to the Input Embedder, which outputs $E = [E_{CLS}, E_1, E_2, \dots, E_N]$ after adding token, segment and position embeddings. Then, we input $E$ to the first Encoding layer and apply $j$ Encoders to it successively. After that encoding process, we get the output $Z^j = [Z^j_{CLS}, Z^j_1, Z^j_2, \dots, Z^j_N]$, which denotes the Lexical Embeddings after $j$ layers of Encoding.
For injecting audio-visual information into these embeddings, we prepare a sequence of triplets $[(Z^j_i, A_i, V_i) : \forall i \in \{CLS, 1, \dots, N\}]$. Each of these triplets is passed through the Multimodal Adaptation Gate, which transforms the $i$th triplet into $\bar{Z}^j_i$, a unified multimodal representation of the corresponding Lexical Embedding.
As there are $M = 12$ Encoder layers in our BERT model, we input $\bar{Z}^j = [\bar{Z}^j_{CLS}, \bar{Z}^j_1, \bar{Z}^j_2, \dots, \bar{Z}^j_N]$ to the next Encoder and apply the remaining $M - j$ Encoder layers to it successively. At the end, we get $\bar{Z}^M$ from the $M$th Encoder layer. As the first element $\bar{Z}^M_{CLS}$ represents the [CLS] token, it has the information necessary to make a class label prediction. Therefore, $\bar{Z}^M_{CLS}$ goes through an affine transformation to produce a single real value which can be used to predict a class label.
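The control flow of MAG-BERT described above (j Encoders, then MAG at every position, then the remaining Encoders, then an affine map on the [CLS] vector) can be sketched as below. Everything apart from that control flow is a stand-in chosen for illustration: `toy_encoder` and `toy_mag` are hypothetical placeholders for a real Transformer Encoder and a real MAG unit, only three layers are used instead of M = 12, and feeding zero nonverbal features at the [CLS] position is an assumption.

```python
def mag_bert_forward(embeddings, acoustic, visual, encoders, mag_fn, j, w_out, b_out):
    """Run j encoders, apply MAG per position, run the remaining encoders,
    then apply an affine map to the [CLS] vector (position 0 for BERT)."""
    z = embeddings
    for layer in encoders[:j]:
        z = layer(z)
    # Shift every lexical vector using its co-occurring nonverbal features.
    z = [mag_fn(z_i, a_i, v_i) for z_i, a_i, v_i in zip(z, acoustic, visual)]
    for layer in encoders[j:]:
        z = layer(z)
    cls = z[0]
    return sum(w * c for w, c in zip(w_out, cls)) + b_out

def toy_encoder(seq):
    """Placeholder encoder: add the sequence mean to every vector, so that
    each position depends on all others (a crude stand-in for attention)."""
    dim = len(seq[0])
    mean = [sum(vec[c] for vec in seq) / len(seq) for c in range(dim)]
    return [[vec[c] + mean[c] for c in range(dim)] for vec in seq]

def toy_mag(z, a, v):
    """Placeholder gate: shift z by a small multiple of the nonverbal sum."""
    shift = 0.1 * (sum(a) + sum(v))
    return [zi + shift for zi in z]

encoders = [toy_encoder] * 3          # the paper uses M = 12 layers
embeddings = [[1.0, 0.0], [0.0, 1.0]]  # [CLS] plus one word-piece token
acoustic = [[0.0], [1.0]]              # zeros at [CLS] are an assumption
visual = [[0.0], [1.0]]
w_out, b_out = [1.0, 1.0], 0.0

score = mag_bert_forward(embeddings, acoustic, visual, encoders, toy_mag, 1, w_out, b_out)
```

Even in this toy setting, the nonverbal shift applied to the word token at layer j propagates to the [CLS] vector through the later layers, which is why the final prediction can reflect multimodal information.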

MAG-XLNet
Like MAG-BERT, MAG-XLNet also has the capability of injecting audio-visual information at any of its layers using MAG. At each position $j$ of any of its layers, it holds the lexical vector corresponding to that position. Utilizing the audio-visual information available for that position, it can invoke MAG to get an appropriately shifted lexical vector in the multimodal space. Although it mostly follows the general paradigm presented in Figure 2 verbatim, it uses the XLNet-specific Embedder and Encoders. One other key difference is the position of the [CLS] token. Unlike in BERT, the [CLS] token is appended at the right end of the input token sequence, and therefore in all the intermediate representations, the vector corresponding to [CLS] will be the rightmost one. Following the same logic, the output from the final Encoding layer will be $\bar{Z}^M = [\bar{Z}^M_1, \bar{Z}^M_2, \dots, \bar{Z}^M_N, \bar{Z}^M_{CLS}]$. The last item, $\bar{Z}^M_{CLS}$, can be used for class label prediction after it goes through an affine transformation.

Experiments
In this section we outline the experiments in this paper. We first start by describing the datasets, followed by description of extracted features, baselines, and experimental setup.

CMU-MOSI Dataset
CMU-MOSI (CMU Multimodal Opinion Sentiment Intensity) is a dataset of multimodal language specifically focused on multimodal sentiment analysis (Zadeh et al., 2016). CMU-MOSI contains 2199 video segments taken from 93 YouTube movie review videos. The dataset has real-valued high-agreement sentiment intensity annotations in the range [−3, +3].

Computational Descriptors
For each modality, the following computational descriptors are available:
Language: We transcribe the videos using the YouTube API, followed by manual correction.
Acoustic: COVAREP (Degottex et al., 2014) is used to extract the following relevant features: fundamental frequency, quasi open quotient, normalized amplitude quotient, glottal source parameters (H1H2, Rd, Rd conf), VUV, MDQ, the first 3 formants, PSP, HMPDM 0-24 and HMPDD 0-12, spectral tilt/slope of wavelet responses (peak/slope), MCEP 0-24.
Visual: For the visual modality, the Facet library (iMotions, 2017) is used to extract a set of visual features including facial action units, facial landmarks, head pose, gaze tracking and HOG features.
For each word, we align all three modalities following the convention established in prior work. First, the word alignment between language and audio is obtained using forced alignment (Yuan and Liberman, 2008). Afterwards, the boundaries of each word determine the co-occurring visual and acoustic features (FACET and COVAREP). Subsequently, for each word, the co-occurring acoustic and visual features are averaged across each feature, yielding the $A_i$ and $V_i$ vectors corresponding to word $i$.
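The word-level averaging step can be sketched as follows: given word boundaries from forced alignment and a stream of timestamped feature frames, average the frames that fall inside each word. The function name, the half-open interval convention, and the zero-vector fallback for words with no frames are assumptions made for this illustration.

```python
def average_features_per_word(word_spans, frame_times, frame_features):
    """Average the feature frames whose timestamp lies in [start, end)
    for each (start, end) word span."""
    dim = len(frame_features[0])
    averaged = []
    for start, end in word_spans:
        frames = [f for t, f in zip(frame_times, frame_features) if start <= t < end]
        if frames:
            averaged.append([sum(f[c] for f in frames) / len(frames) for c in range(dim)])
        else:
            # Assumption: fall back to a zero vector when no frame
            # co-occurs with the word.
            averaged.append([0.0] * dim)
    return averaged

word_spans = [(0.0, 0.5), (0.5, 1.0)]       # seconds, from forced alignment
frame_times = [0.1, 0.3, 0.6]               # timestamps of feature frames
frame_features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
word_vectors = average_features_per_word(word_spans, frame_times, frame_features)
```

The same routine would be applied once to the acoustic frames and once to the visual frames, producing one acoustic and one visual vector per word.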

Baseline Models
We compare the performance of MAG-BERT and MAG-XLNet to a variety of state-of-the-art models for multimodal language analysis. These models are trained using extracted BERT and XLNet word embeddings as their language input:
TFN (Tensor Fusion Network) explicitly models both intra-modality and inter-modality dynamics by creating a multi-dimensional tensor that captures unimodal, bimodal and trimodal interactions across three modalities.
MARN (Multi-attention Recurrent Network) models view-specific interactions using hybrid LSTM memories and cross-modal interactions using a Multi-Attention Block (MAB) (Zadeh et al., 2018c).
MFN (Memory Fusion Network) has three separate LSTMs to model each modality separately and a multi-view gated memory to synchronize among them (Zadeh et al., 2018a).
RMFN (Recurrent Memory Fusion Network) captures intra-modal and inter-modal information in a recurrent multi-stage fashion.
MulT (Multimodal Transformer for Unaligned Multimodal Language Sequence) uses three sets of Transformers and combines their output in a late fusion manner to model a multimodal sequence (Tsai et al., 2019). We use the aligned variant of the originally proposed model, which achieves superior performance over the unaligned variant.
We also compare our model to fine-tuned BERT and XLNet using language modality only to measure the success of the MAG framework.

Experimental Design
All the models in this paper are trained using the Adam (Kingma and Ba, 2014) optimizer with learning rates chosen from {0.001, 0.0001, 0.00001}. For MulT, we use {3, 5, 7} layers in the network and {1, 3, 5} attention heads. All models use the designated validation set of CMU-MOSI for finding the best hyper-parameters.
We perform two different evaluation tasks on the CMU-MOSI dataset: i) binary classification and ii) regression. We formulate the task as a regression problem and report the mean absolute error (MAE) and the correlation of model predictions with true labels. In addition, we convert the regression outputs into categorical values to obtain binary classification accuracy (BA) and F1 score. Higher values mean better performance for all metrics except MAE. We use two evaluation conventions for BA and F1, one used in (Zadeh et al., 2018d) and one used in (Tsai et al., 2019). Table 1 shows the results of the experiments in this paper. We summarize the observations from this table in the following subsections.
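The metrics described above can be sketched in plain Python as follows. Two points are assumptions rather than details given in the text: the two binary conventions are taken to be negative versus non-negative, and negative versus positive with zero-labeled examples dropped; and F1 is computed for the positive class only, whereas published results may use a weighted variant.

```python
import math

def mae(preds, labels):
    """Mean absolute error between predictions and labels."""
    return sum(abs(p - y) for p, y in zip(preds, labels)) / len(labels)

def pearson_corr(preds, labels):
    """Pearson correlation between predictions and labels."""
    n = len(labels)
    mp, my = sum(preds) / n, sum(labels) / n
    cov = sum((p - mp) * (y - my) for p, y in zip(preds, labels))
    sp = math.sqrt(sum((p - mp) ** 2 for p in preds))
    sy = math.sqrt(sum((y - my) ** 2 for y in labels))
    return cov / (sp * sy)

def binary_metrics(preds, labels, exclude_zero=False):
    """Binary accuracy and positive-class F1 from real-valued scores."""
    pairs = list(zip(preds, labels))
    if exclude_zero:
        # Assumed convention: drop neutral examples, negative vs. positive.
        pairs = [(p, y) for p, y in pairs if y != 0]
        pred_pos = [p > 0 for p, _ in pairs]
        true_pos = [y > 0 for _, y in pairs]
    else:
        # Assumed convention: negative vs. non-negative.
        pred_pos = [p >= 0 for p, _ in pairs]
        true_pos = [y >= 0 for _, y in pairs]
    tp = sum(p and t for p, t in zip(pred_pos, true_pos))
    fp = sum(p and not t for p, t in zip(pred_pos, true_pos))
    fn = sum(t and not p for p, t in zip(pred_pos, true_pos))
    acc = sum(p == t for p, t in zip(pred_pos, true_pos)) / len(pairs)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return acc, f1

preds = [2.1, -0.5, 0.3, -1.2]    # toy regression outputs in [-3, +3]
labels = [3.0, -1.0, 0.0, 1.0]    # toy sentiment intensities
error = mae(preds, labels)
corr = pearson_corr(preds, labels)
acc_all, f1_all = binary_metrics(preds, labels)
acc_nz, f1_nz = binary_metrics(preds, labels, exclude_zero=True)
```

Reporting both conventions side by side makes results comparable with prior work regardless of which convention that work followed.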

Performance of MAG-BERT
Across all metrics on the CMU-MOSI dataset, we observe that MAG-BERT outperforms state-of-the-art multimodal models that use BERT word embeddings. Furthermore, MAG-BERT also outperforms fine-tuned BERT. This shows that the MAG component allows the BERT model to adapt to multimodal information during fine-tuning, thus achieving superior performance.

Performance of MAG-XLNet
A performance trend similar to that of MAG-BERT is also observed for MAG-XLNet. Besides outperforming the baselines and fine-tuned XLNet, MAG-XLNet achieves near-human level performance on the CMU-MOSI dataset. Furthermore, we train MulT using the fine-tuned XLNet embeddings and get the following performance: 83.6/85.3, 82.6/84.2, 0.810, 0.759, which is lower than both MAG-XLNet and XLNet. It is notable that the p-value for Student's t-test between MAG-XLNet and XLNet in Table 1 is lower than 10e-5 for all the metrics.
The motivation behind the experiments reported in Table 1 is as follows: we extracted word embeddings from pre-trained BERT and XLNet models and trained the baseline models using those embeddings. Since BERT and XLNet are often perceived to provide better word embeddings than GloVe, it is not fair to compare MAG-BERT/MAG-XLNet with previous models trained with GloVe embeddings. Therefore, we retrain previous works using BERT and XLNet embeddings to establish a fair comparison between the approach proposed in this paper and previous work. Based on the information from Table 1, we observe that the MAG-BERT/MAG-XLNet models substantially outperform the various baseline models using BERT/XLNet/GloVe embeddings.
Table 1 (caption fragment): [...] (Zadeh et al., 2018c) and the right side is measures calculated based on (Tsai et al., 2019). Human performance for CMU-MOSI is reported as (Zadeh et al., 2018a).
Table 3: Examples from the CMU-MOSI dataset. The ground truth sentiment labels are between strongly negative (-3) and strongly positive (+3). For each example, we show the Ground Truth and prediction output of both the MAG-XLNet and XLNet. XLNet seems to be replicating language modality mostly while MAG-XLNet is integrating the non-verbal information successfully.

Adaptation at Different Layers
We also study the effect of applying MAG at different encoder layers of XLNet. Specifically, we first apply MAG to the output of the embedding layer. Subsequently, we apply MAG to layer j ∈ {1, 4, 6, 8, 12} of XLNet. Then, we apply MAG at all the XLNet layers. From Table 2, we observe that earlier layers are more suitable for the application of MAG. We believe that earlier layers allow for better integration of the multimodal information, as they allow the word shifting to happen from the beginning of the network. If the semantics of words should change based on the nonverbal accompaniments, then the initial layers should reflect the semantic shift; otherwise, those layers are only working unimodally. Besides, the higher layers of BERT learn more abstract and higher-level information about the syntactic and semantic structure of linguistic features (Coenen et al., 2019). Since the acoustic and visual information present in our model corresponds to each word in the utterance, it will be more difficult for MAG to shift a vector extracted from a later layer, because that vector's information will be very abstract in nature.

Input-level Concatenation and Addition
From Table 2, we see that both input-level concatenation and addition of modalities perform poorly. For Concatenation, we simply concatenate all the modalities. For Addition, we add the audio and visual information to the language embedding after mapping both of them to the language dimension. These results demonstrate the rationale behind using an advanced fusion mechanism like MAG.
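For contrast with MAG, the two input-level baselines can be sketched as below. The text does not say how the audio and visual features are mapped to the language dimension for Addition, so the linear maps `W_a` and `W_v` here are an assumption, as are the function names and toy values.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def concat_fusion(z, a, v):
    """Concatenation baseline: stack all modalities into one vector."""
    return z + a + v  # list concatenation

def addition_fusion(z, a, v, W_a, W_v):
    """Addition baseline: project audio and visual features to the
    language dimension (assumed linear maps), then add them to z."""
    proj_a = matvec(W_a, a)
    proj_v = matvec(W_v, v)
    return [zi + pa + pv for zi, pa, pv in zip(z, proj_a, proj_v)]

z = [1.0, 2.0]                     # toy language embedding
a = [1.0]                          # toy acoustic features
v = [2.0, 0.0]                     # toy visual features
W_a = [[1.0], [0.0]]               # maps 1-d audio to 2-d
W_v = [[0.0, 1.0], [1.0, 0.0]]     # maps 2-d visual to 2-d

concatenated = concat_fusion(z, a, v)
added = addition_fusion(z, a, v, W_a, W_v)
```

Neither baseline conditions the nonverbal contribution on the lexical vector or bounds its magnitude, which are the two things the gating and scaling in MAG provide.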

Results on Comparable Datasets
We also perform experiments on the CMU-MOSEI dataset (Zadeh et al., 2018d) to study the generalization of our approach to other multimodal language datasets. Unlike CMU-MOSI, which has sentiment annotations at the utterance level, CMU-MOSEI has sentiment annotations at the sentence level. The experimental methodology for CMU-MOSEI is similar to that of the original paper. For the sake of comparison, we limit ourselves to comparing the binary accuracy and F1 score for the top 3 models in Table 1.