Modulated Fusion using Transformer for Linguistic-Acoustic Emotion Recognition

This paper aims to bring a new lightweight yet powerful solution for the task of Emotion Recognition and Sentiment Analysis. We propose two architectures based on Transformers and modulation that combine the linguistic and acoustic inputs from a wide range of datasets to challenge, and sometimes surpass, the state of the art in the field. To demonstrate the efficiency of our models, we carefully evaluate their performance on the IEMOCAP, MOSI, MOSEI and MELD datasets. The experiments can be directly replicated and the code is fully open for future research.


Introduction
Understanding expressed sentiment and emotions is crucial to modeling human multimodal language, yet predicting affective states from multimedia remains a challenging task. Emotion recognition has been studied on different types of signals, typically audio, video and text. Deep Learning techniques allow the development of novel paradigms that use these different signals in one model to leverage joint information extraction from different sources. These models usually require a fusion between modalities, a crucial step to compute expressive multimodal features used by a classifier to output probabilities over the possible answers.
In this paper, we propose an architecture based on two stages: an independent sequential stage based on LSTM (Hochreiter and Schmidhuber, 1997) where modality features are computed separately, and a second hierarchical stage based on Transformer (Vaswani et al., 2017) where we iteratively compute and fuse new multimodal representations. This paper proposes the fusion between the acoustic and linguistic features through attention modulation (Yu et al., 2019) and linear modulation (Dumoulin et al., 2018), a powerful tool to shift and scale the feature maps of one modality given the representation of another.
The association of this horizontal-vertical encoding and modulated fusion shows strong results across a wide range of datasets for emotion recognition and sentiment analysis. In addition to its performance, the modulation requires no or very few learnable parameters, making it fast and easy to train. The paper is structured as follows: we first present the related work used for comparison in our experiments in section 2, then briefly present the different datasets in section 3. We then describe our sequential feature extraction based on LSTM in section 4 and the two hierarchical modulated fusion models, the Modulated Attention Transformer (MAT) and the Modulated Normalization Transformer (MNT), in section 5. Finally, we explain the experimental settings in section 6 and report the results of our model variants in section 7.

Related Work
The presented related work is used for comparison for our experiments. We proceed to briefly describe their proposed models.
First, Zadeh et al. (2018b) proposed a novel multimodal fusion technique called the Dynamic Fusion Graph (DFG) to study the nature of crossmodal dynamics in multimodal language. DFG contains built-in efficacies that are directly related to how modalities interact.
To capture the context of the conversation through all modalities, the current speaker and listener(s) in the conversation, and the relevance and relationship between the available modalities through an adequate fusion mechanism, Shenoy and Sardana (2020) proposed a recurrent neural network architecture that attempts to take into account all the mentioned drawbacks, and keeps track of the context of the conversation, interlocutor states, and the emotions conveyed by the speakers in the conversation. Pham et al. (2019) presented a model that learns robust joint representations by cyclic translations between modalities (MCTN), that achieved strong results on various word-aligned human multimodal language tasks. Wang et al. (2019) proposed the Recurrent Attended Variation Embedding Network (RAVEN) to model expressive nonverbal representations by analyzing the fine-grained visual and acoustic patterns that occur during word segments. In addition, they seek to capture the dynamic nature of nonverbal intents by shifting word representations based on the accompanying nonverbal behaviors.
The related work that is probably the closest to ours is the Multimodal Transformer (Tsai et al., 2019; Delbrouck et al., 2020), because it also uses Transformer-based solutions to encode the modalities. Nonetheless, we differ in many ways. First, their best solutions and reported scores use visual support. Secondly, they use a Transformer for cross-modality encoding for every modality pair; this amounts to 6 Transformer modules (2 pairs per modality), while we only use two Transformers (one per modality). Finally, each output pair is concatenated to go through a second stage of Transformer encoding. We also differ on how the features are extracted: they base their solution on CNN while we use LSTM. It is important to note that we compare our results to their word-unaligned scores, as we do not use word-alignment either.

IEMOCAP dataset
IEMOCAP (Busso et al., 2008) is a multimodal dataset of dyadic conversations of actors. The modalities recorded are Audio, Video and Motion Capture data. All conversations were segmented, transcribed and annotated with two different emotional types of labels: emotion categories (6 basic emotions (Ekman, 1999) -happiness, sadness, anger, surprise, fear, disgust -plus frustrated, excited and neutral) and continuous emotional dimensions (valence, arousal and dominance).
For categorical labels, the annotators could also select "other" if they found the emotion could not be described with one of the adjectives. The categorical labels were given by 3-4 evaluators, and a majority vote was used to obtain the final label. In case of a tie, the segment was considered inconsistent in terms of inter-evaluator agreement; 7532 segments out of the 10039 segments reached agreement.
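The agreement rule can be sketched as follows. This is a minimal illustration, assuming each segment comes with a non-empty list of categorical labels from its evaluators and that "majority" means the single most frequent label; the function name `majority_label` is hypothetical and not part of the IEMOCAP tooling.

```python
from collections import Counter

def majority_label(labels):
    """Return the most frequent label, or None when the top count is tied."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: no inter-evaluator agreement for this segment
    return counts[0][0]

agreed = majority_label(["sad", "sad", "neutral"])       # two of three evaluators agree
rejected = majority_label(["sad", "angry", "neutral"])   # three-way tie
```

Segments for which the function returns None would be the ones discarded for lack of agreement.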
To be comparable to previous research, we use four categories: neutral, sad, happy and angry. The happy category is obtained by merging the utterances labeled excited and happy (Yoon et al., 2018). We obtain a total of 5531 utterances: 1636 happy, 1084 sad, 1103 angry and 1708 neutral. The train-test split is made according to Poria et al. (2017), as it seems to be the norm for recent works.

CMU-MOSI dataset
The CMU-MOSI dataset (Zadeh et al., 2016) is a collection of video clips containing opinions. The collected videos come from YouTube and were selected with metadata using the #vlog hashtag (for video-blog), which describes a specific type of video that often contains people expressing their opinion. The resulting dataset includes clips with speakers of different ethnicities, all speaking in English. The speech was manually transcribed, and the transcriptions were aligned with the audio at word level. The videos were annotated for sentiment on a 7-point Likert scale (from -3 to 3) by five workers per video using Amazon's Mechanical Turk.

CMU-MOSEI dataset
MOSEI (Zadeh et al., 2018c) is the next generation of the MOSI dataset. Its authors also took advantage of online videos containing expressed opinions. They analyzed videos with a face detection algorithm and selected videos with only one speaker whose attention is directed to the camera.
They used a set of 250 different keywords to scrape the videos and kept a maximum of 10 videos for each one with manual transcription included. The dataset was then manually curated to keep only data with good quality. It is annotated with a 7-point Likert scale as well as the six basic emotion categories (Ekman, 1999).

MELD dataset
The Multimodal EmotionLines Dataset (MELD) contains dialogue instances that encompass the audio and visual modalities along with text. MELD has more than 1400 dialogues and 13000 utterances from the Friends TV series. Multiple speakers participate in the dialogues. Each utterance in a dialogue is labeled with one of these seven emotions: Anger, Disgust, Sadness, Joy, Neutral, Surprise and Fear. MELD also has a sentiment annotation (positive, negative or neutral) for each utterance.

Feature extractions
This section describes the linguistic and acoustic features used as input to our proposed modulated fusions based on Transformers. The extraction is performed independently for each sample of a dataset. We denote the extracted linguistic features as x and the acoustic features as y. In the end, both x and y have a size [T, C], where T is the temporal axis size and C the feature size. It is important to note that T is different for each sample, while C is a hyper-parameter.

Linguistic
A sentence is tokenized and lowercased. We remove special characters and punctuation. We build our vocabulary from the train-set of the datasets and embed each word in a vector of 300 dimensions using GloVe (Pennington et al., 2014). If a word from the validation or test-set is not present in our vocabulary, we replace it with the unknown token "unk". Each sentence is run through a unidirectional one-layered LSTM of size C. The size of each linguistic example x is therefore [T, C], where T is the number of words in the sentence.
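The text preprocessing described above can be sketched as follows. This is only an illustration under stated assumptions: the exact tokenizer and the set of characters removed are not specified here, so the regular expression and the helper names `tokenize`, `build_vocab` and `encode` are hypothetical. The GloVe lookup and the LSTM are left out.

```python
import re

def tokenize(sentence):
    """Lowercase, replace anything that is not a letter, digit or space, then split."""
    # Assumption: "special characters and punctuation" means every non-alphanumeric character.
    return re.sub(r"[^a-z0-9\s]", " ", sentence.lower()).split()

def build_vocab(train_sentences):
    """Map every training-set word to an index; index 0 is reserved for "unk"."""
    vocab = {"unk": 0}
    for sentence in train_sentences:
        for word in tokenize(sentence):
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Replace out-of-vocabulary words with the index of the unknown token."""
    return [vocab.get(word, vocab["unk"]) for word in tokenize(sentence)]

vocab = build_vocab(["You fell asleep!", "I fell."])
ids = encode("You fell down!", vocab)   # "down" is unseen and maps to "unk"
```

The resulting indices would then select rows of the 300-dimensional embedding table before the LSTM.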

Acoustic features
In the literature of multimodal emotion recognition, many works use hand-designed acoustic feature sets that capture information about prosody and vocal quality, such as the ComParE (Computational Paralinguistics Challenge) feature sets from the Interspeech conference.
However, with the evolution of deep learning models, lower-level features such as mel-spectrograms have been shown to be very powerful for speech-related tasks such as speech recognition and speech synthesis. In this work we extract mel-spectrograms with the same procedure as a typical seq2seq Text-to-Speech system.
Specifically, our mel-spectrograms were extracted with the same procedure as in (Tachibana et al., 2018) with the librosa Python library (McFee et al., 2015) with 80 filter banks (the embedding size is therefore 80). A temporal reduction is then applied by selecting one frame every 16 frames. Each spectrogram is then run through a unidirectional one-layered LSTM of size C. The size of each acoustic example y is therefore [T, C], where T is the number of frames in the spectrogram.
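The temporal reduction can be illustrated with a short sketch. It assumes the spectrogram is stored as a [frames, 80] array and that the first frame of every group of 16 is the one kept; both choices, as well as the function name `reduce_frames`, are assumptions. The random array only stands in for a real mel-spectrogram.

```python
import numpy as np

def reduce_frames(mel, step=16):
    """Keep one frame every `step` frames along the temporal axis."""
    return mel[::step]

# Stand-in for a mel-spectrogram with 1000 frames and 80 filter banks.
mel = np.random.default_rng(0).normal(size=(1000, 80))
reduced = reduce_frames(mel)
print(reduced.shape)  # (63, 80)
```

The feature size of 80 is untouched; only the sequence fed to the LSTM becomes 16 times shorter.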

Models
This section describes the model variants evaluated in our experiments. First, we describe the projection (P) of the features extracted in section 4 over emotion and sentiment classes without using any Transformer. This corresponds to the baseline for our experiments. Secondly, we present the Naive Transformer (NT) model, a Transformer-based encoding where the inputs are encoded separately: the linguistic and acoustic features do not interact with each other and there is no modulated fusion. Finally, we present the two highlights of the paper, the Modulated Attention Transformer (MAT) and the Modulated Normalization Transformer (MNT), two solutions where the encoded linguistic representation modulates the entire process of the acoustic encoding.

Projection
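Given the linguistic features x and acoustic features y extracted in section 4, we define the projection as a two-step process. First, we use an attention-reduce mechanism over each modality, and then fuse both modality vectors using a simple element-wise sum.
Att. Reduce The attention-reduce mechanism consists of a soft-attention over itself followed by a weighted sum computed according to the attention weights. If we consider the feature input x of size [T, C]:
$$a = \mathrm{softmax}(f_a(x)), \qquad \bar{x} = \sum_{t=1}^{T} a_t \, x_t \qquad (1)$$
where f_a is the learned soft-attention scoring function that assigns one scalar to each of the T rows of x. After this reduce mechanism, the inputs become vectors x̄ and ȳ of size [1, C]. We can then apply the element-wise sum as follows:
$$p = \mathrm{softmax}(W_p \, \mathrm{LayerNorm}(\bar{x} + \bar{y})) \qquad (2)$$
where p is the distribution of probabilities over possible answers, W_p is the learned projection matrix and LayerNorm denotes Layer Normalization (Ba et al., 2016). If we assume the input feature x has the shape [T, C], for each feature channel c ∈ {1, 2, ..., C}:
$$\mu_c = \frac{1}{T}\sum_{t=1}^{T} x_{t,c}, \qquad \sigma_c^2 = \frac{1}{T}\sum_{t=1}^{T} (x_{t,c} - \mu_c)^2 \qquad (3)$$
Finally, for each channel, we have learnable parameters γ_c and β_c, such that:
$$\mathrm{LayerNorm}(x)_{t,c} = \gamma_c \, \frac{x_{t,c} - \mu_c}{\sigma_c} + \beta_c \qquad (4)$$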
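A minimal NumPy sketch of the projection may help to follow the shapes. It is an illustration under assumptions, not the released implementation: the soft-attention is taken to be a single learned vector scoring each row (`w_x`, `w_y`), the fused vector has a single row so its normalization statistics are taken over its C features, and all weights are random stand-ins.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def att_reduce(x, w):
    """Score each of the T rows of x, softmax the scores, return the weighted sum."""
    a = softmax(x @ w)   # [T] attention weights that sum to 1
    return a @ x         # [C] weighted sum of the rows

def layer_norm(v, eps=1e-6):
    # Assumption: statistics over the C features, learnable scale and shift omitted.
    return (v - v.mean()) / np.sqrt(v.var() + eps)

def project(x, y, w_x, w_y, w_p):
    fused = att_reduce(x, w_x) + att_reduce(y, w_y)   # element-wise sum, [C]
    return softmax(layer_norm(fused) @ w_p)           # probabilities over classes

rng = np.random.default_rng(0)
T_x, T_y, C, n_classes = 7, 12, 16, 4
x, y = rng.normal(size=(T_x, C)), rng.normal(size=(T_y, C))
w_x, w_y = rng.normal(size=C), rng.normal(size=C)
w_p = rng.normal(size=(C, n_classes))
p = project(x, y, w_x, w_y, w_p)
```

Because both modalities are reduced to a vector of size C before the sum, x and y may have different temporal lengths.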

Naive Transformer
The Naive Transformer model consists of stacking a Transformer on top of the linguistic and acoustic features extracted in section 4 before the projection of section 5.1. The Transformers are independent and their respective input features do not interact with each other. A Transformer is composed of a stack of B identical blocks, each with its own set of training parameters. Each block has two sub-layers. There is a residual connection around each of the two sub-layers, followed by layer normalization (Ba et al., 2016). The output of each sub-layer can be written like this:
$$\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x)) \qquad (5)$$
where Sublayer(x) is the function implemented by the sub-layer itself. In traditional Transformers, the two sub-layers are respectively a multi-head self-attention mechanism and a simple Multi-Layer Perceptron (MLP). The attention mechanism consists of a Key K and a Query Q that interact together to output an attention map applied to the Value V:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{C}}\right) V \qquad (6)$$
In the case of self-attention, K, Q and V are the same input. If this input is of size T × C, the operation QK^⊤ results in a square attention matrix of size T × T containing the affinity between each pair of rows. The expression √C is a scaling factor. The multi-head attention (MHA) is the idea of stacking several self-attention modules attending to the information from different representation sub-spaces at different positions:
$$\mathrm{MHA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V) \qquad (7)$$
A sub-space is defined as a slice of the feature dimension k (here k = C). In the case of four heads, a slice would be of size k/4. The idea is to produce different sets of attention weights for different feature sub-spaces. In the context of Transformers, Q, K and V are x for the linguistic Transformer and y for the acoustic Transformer. Throughout the MHA, the feature size of x and y remains unchanged, namely C.
The MLP consists of two layers of respective sizes [C → C] and [C → C]. After encoding through the blocks, the outputs x̃ and ỹ can be used by the projection layer (section 5.1) for classification. In Figure 2, we show the encoding of the linguistic features x and its corresponding output x̃.
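The attention mechanism of this section can be sketched in a few lines of NumPy. The sketch makes simplifying assumptions: the learned transformation matrices of each head are omitted, so a head is just a slice of the feature axis, and the scaling uses the size of the slice as in Vaswani et al. (2017). The function names are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention; q is [T_q, d], k and v are [T_k, d]."""
    scores = q @ k.T / np.sqrt(q.shape[-1])   # [T_q, T_k] affinity matrix
    return softmax(scores) @ v                # [T_q, d]

def multi_head(q, k, v, n_heads):
    """Split the feature axis into n_heads slices and attend within each slice."""
    slices = zip(np.split(q, n_heads, axis=1),
                 np.split(k, n_heads, axis=1),
                 np.split(v, n_heads, axis=1))
    heads = [attention(qs, ks, vs) for qs, ks, vs in slices]
    return np.concatenate(heads, axis=1)      # feature size is unchanged

x = np.random.default_rng(0).normal(size=(5, 8))   # T = 5 words, C = 8
out = multi_head(x, x, x, n_heads=4)               # self-attention: Q = K = V = x
```

With four heads and C = 8, each head attends over a slice of size 2, and the concatenation restores the original size [T, C].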

Modulated Fusion
The Modulated Fusion consists of modulating the encoding of the acoustic features y given the encoded linguistic features x̃. This modulation in the acoustic Transformer allows for an early fusion of both modalities, whose result is ỹ. The modulation can be performed through the Multi-Head Attention or the Layer Normalization. Afterwards, the outputs x̃ and ỹ are used as input of the projection from section 5.1. We proceed to describe both approaches in the next sub-sections.

Modulated Attention Transformer
To modulate the acoustic self-attention by the linguistic output, we switch the key K and value V of the self-attention from y to x̃. The operation QK^⊤ results in an attention map that acts like an affinity matrix between the rows of the modality matrices x̃ and y. This computed alignment is applied over the Value V (now x̃) and finally we add the residual connection y. The following equation describes the new attention sub-layer in the acoustic Transformer:
$$y = \mathrm{LayerNorm}(y + \mathrm{MHA}(y, \tilde{x}, \tilde{x})) \qquad (8)$$
For the operation QK^⊤ to work, as well as the residual connection (the addition), the feature sizes C of x̃ and y must be equal. This can be adjusted with the different transformation matrices of the MHA module or the LSTM size of section 4. If we consider that x̃ is of size [T_x, C] and y of size [T_y, C], then the sizes of the matrix multiplication operations of this modulated attention can be written as follows (where × denotes matrix multiplication):
$$y \times \tilde{x}^\top = [T_y, C] \times [C, T_x] = [T_y, T_x] \qquad (9)$$
$$(9) \times \tilde{x} = [T_y, T_x] \times [T_x, C] = [T_y, C] \qquad (10)$$
$$(10) + y = [T_y, C] + [T_y, C] = [T_y, C] \qquad (11)$$
where equation 11 denotes the (y + MHA(y, x̃, x̃)) operation.
We call the Modulated Attention Transformer "MAT" in the experiments.
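The shape bookkeeping of the modulated attention can be checked with a small NumPy sketch. It is a simplification under assumptions: a single head without learned transformation matrices, and a normalization computed per channel over the temporal axis without learnable scale and shift. The name `modulated_attention` and the random inputs are illustrative only.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(h, eps=1e-6):
    # Assumption: per-channel statistics over the temporal axis.
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

def modulated_attention(y, x_enc):
    """Queries come from the acoustic features y, keys and values from x_enc."""
    C = y.shape[1]
    align = softmax(y @ x_enc.T / np.sqrt(C))   # [T_y, T_x] alignment
    attended = align @ x_enc                    # [T_y, C]
    return layer_norm(y + attended)             # residual connection, [T_y, C]

rng = np.random.default_rng(0)
x_enc = rng.normal(size=(6, 32))   # encoded linguistic features, T_x = 6
y = rng.normal(size=(20, 32))      # acoustic features, T_y = 20
y_mod = modulated_attention(y, x_enc)
```

The output keeps the temporal length of the acoustic input, so the rest of the acoustic Transformer block is unaffected by the length of the sentence.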

Modulated Normalization Transformer
It is possible to modulate the normalization layers by predicting two scalars per block from x̃, namely ∆γ and ∆β, that will be added to the learnable parameters of equation 4:
$$\hat{\gamma}_c = \gamma_c + \Delta\gamma, \qquad \hat{\beta}_c = \beta_c + \Delta\beta \qquad (12)$$
where ∆γ, ∆β = MLP(x̃) and the MLP has one layer of sizes [C, 4 × B]. Two pairs of scalars per block are predicted, so no scalars are shared amongst normalization layers.
We update the layer normalization equation accordingly:
$$\mathrm{LayerNorm}(y)_{t,c} = \hat{\gamma}_c \, \frac{y_{t,c} - \mu_c}{\sigma_c} + \hat{\beta}_c \qquad (13)$$
where μ_c and σ_c are the mean and standard deviation of channel c. The Modulated Normalization is a computationally efficient and powerful method to modulate neural activations. It enables the linguistic output to manipulate entire acoustic feature maps by scaling them up or down, negating them, or shutting them off. As there are only two parameters per feature map, the total number of new training parameters is small. This makes the Modulated Normalization a very scalable method.
We call the Modulated Normalization Transformer "MNT" in the experiments.
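The modulated normalization can be sketched as follows. Several points are assumptions made for the illustration: the encoded linguistic features are mean-pooled over time before the one-layer MLP (the pooling is not specified here), the statistics are computed per channel over the temporal axis, and the weights are random stand-ins. The function name is hypothetical.

```python
import numpy as np

def modulated_layer_norm(y, gamma, beta, d_gamma, d_beta, eps=1e-6):
    """Normalize each channel of y over time, then apply the shifted scale and offset."""
    mu = y.mean(axis=0, keepdims=True)    # [1, C]
    var = y.var(axis=0, keepdims=True)    # [1, C]
    y_hat = (y - mu) / np.sqrt(var + eps)
    return (gamma + d_gamma) * y_hat + (beta + d_beta)

rng = np.random.default_rng(0)
C, B = 16, 4
x_enc = rng.normal(size=(6, C))           # encoded linguistic features
y = rng.normal(size=(20, C))              # acoustic features
w = rng.normal(size=(C, 4 * B)) * 0.1     # one-layer MLP of sizes [C, 4 x B]

# Assumption: mean-pool x_enc over time, then predict every scalar at once.
deltas = (x_enc.mean(axis=0) @ w).reshape(B, 2, 2)   # block, norm layer, (d_gamma, d_beta)
d_gamma, d_beta = deltas[0, 0]                       # first norm layer of the first block
out = modulated_layer_norm(y, np.ones(C), np.zeros(C), d_gamma, d_beta)
```

Setting `d_gamma` to the negative of `gamma` zeroes the scale and shuts the feature maps off, while a larger negative value negates them.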

Experimental settings
We train our models using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 1e-4 and a mini-batch size of 32. If the accuracy score on the validation set does not increase for a given epoch, we apply a learning-rate decay of factor 0.5. We decay our learning rate up to 2 times. Afterwards, we use an early-stop of 10 epochs on accuracy. Results presented in this paper are from the averaged predictions of at most 10 models.
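One possible reading of this schedule is sketched below. It assumes that "does not increase" is measured against the best validation accuracy seen so far and that the early-stop counter only starts once both decays have been used; the function name `schedule` and this interpretation are assumptions, not the released training loop.

```python
def schedule(val_accuracies, lr=1e-4, factor=0.5, max_decays=2, patience=10):
    """Replay per-epoch validation accuracies; return (final lr, stop epoch or None)."""
    best, decays, stale = float("-inf"), 0, 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best, stale = acc, 0      # improvement: reset the early-stop counter
            continue
        if decays < max_decays:
            lr *= factor              # no improvement: halve the learning rate
            decays += 1
        else:
            stale += 1                # decays exhausted: count stale epochs
            if stale >= patience:
                return lr, epoch      # early stop
    return lr, None

final_lr, stop_epoch = schedule([0.5] + [0.4] * 12)
```

In this example the learning rate is halved at epochs 1 and 2, and training stops at epoch 12 after ten further epochs without improvement.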
Unless stated otherwise, the LSTM size C (and therefore the Transformer size) is 512. We use B = 2 Transformer blocks for P and NT models and B = 4 for MNT and MAT models. We use 8 attention heads regardless of the model or the modality encoded. The inner size of the Transformer MLP is set to 2048. We apply a dropout of 0.1 on the output of each block iteration, and 0.5 on the summed input of the projection layer (equation 2).

Results
We present the results on four sentiment and emotion recognition datasets: IEMOCAP, MOSEI, MOSI and MELD. For each dataset, the results are presented in terms of the popular metrics used for the dataset. Most of the time, F1-score is used, and sometimes the weighted F1-scores to take into account the imbalance between emotion or sentiment classes.
IEMOCAP We first compare the precision, recall and unweighted F1-scores of our two model variants on IEMOCAP in Table 3. We notice that our MAT model comes on top.

Table 1: Results of the 4-emotions task of IEMOCAP. Prec. stands for precision and F1 is the unweighted F1-score.
If we compare the F1-score per class (Table 2), we notice that our MAT model outperforms previous research, the biggest margin being in the happy category. The model MulT (Tsai et al., 2019) still comes on top in the neutral category.
We can see in Figure 4 that our MNT model has a really good recall on the neutral category but MAT significantly outperforms MNT in the happy category. However, the happy class surprisingly remains a challenge for the models presented. Our MAT model predicted "angry" around 17% of the time when the true class was happy. Conversely, our model predicted "happy" 19% of the time when the true label was "sad" and 17% of the time when the true class was "angry". This is still a significant margin of error for such contradictory labels. It shows that visual cues might be necessary to further improve the performance.
MOSI MOSI is a small dataset with few training examples. To train such models, regularization is usually needed to avoid overfitting the training-set. In our case, dropout was enough to top the state-of-the-art results on this dataset.
Even if the dataset is a bit unbalanced between the binary answers (positive and negative), weighting the loss accordingly did not improve the results. It shows that our model variants manage to efficiently discriminate between both classes.
MOSEI MOSEI is a relatively large-scale dataset. We expect to see a more noticeable difference of score between our Modulated Transformer variants and the Naive Transformer and Projection baselines.
For the emotion task in Table 4, MNT comes on top with a noticeable improvement over the state of the art in the Surprise and Fear categories.

MELD
MELD is a dataset for Emotion Recognition in Conversation. Even if our approaches do not take the context into account, they lead to interesting results. More precisely, our variants are able to detect difficult emotions, such as fear and disgust, even though these are present in very low quantity in the training and test-set.
We can see in Table 6 that even if we use neither the contextual nor the speaker information, our models achieve good results in two categories: fear and disgust. To help understand these results, we give two MELD examples in Figure 5. In the top example, one is unlikely to answer "anger" to the sentence "you fell asleep!" without context; it could be surprise or fear. This is why our "anger" score is really low. In the bottom example, "you have no idea how loud they are" could very well be "anger" too, but happens to be labeled "disgust".

It is possible that our model, without any prior or contextual bias about an utterance, classifies sentences similar to "you fell asleep" or "you have no idea how" as "disgust" or "fear". Further analysis of why our model performs so well could shed light on this odd behavior. We also fall short on the sad and surprise categories compared to GCN, showing that a variant of our proposed models that takes the context into account could lead to competitive results.

Further analysis
A few supplementary comments can be made about the results. First, we notice that the hierarchical structure of the network brought by the Transformers did bring improvements across all datasets. Indeed, even the NT model brings a significant performance boost compared to the P model, which only consists of an LSTM and the projection layer. A very nice property of our solutions is that few Transformer layers are required in the best settings found. It usually varies from 2 to 4 layers, allowing our solutions to converge very rapidly.
Another point is that the MAT variant requires neither additional training parameters nor additional computational power (as shown in Table 7): the solution only switches one input of the Multi-Head Attention from one modality matrix to another. For MNT, the Transformer block implements only 2 normalization layers, therefore the conditional layer must only compute 2048 scalars (given C is 512) for ∆γ and ∆β, or roughly 1 million parameters per block. This solution grows linearly with the hidden size, but we got better results with C = 512 rather than 1024.
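As a rough check of these figures, assume that the conditional layer is a single linear map from the C-dimensional linguistic representation to one ∆γ and one ∆β per channel for each of the 2 normalization layers of a block, with bias terms ignored. Under that assumption:

$$2 \times 2 \times C = 2 \times 2 \times 512 = 2048 \ \text{scalars}, \qquad C \times 2048 = 512 \times 2048 = 1{,}048{,}576 \approx 10^6 \ \text{weights}.$$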
The difference between the MAT and MNT variants is slim, but it seems that MAT is more suitable for binary sentiment classification. The alignment computed by the modulated attention between the linguistic and acoustic modalities proves to be an acceptable solution for 2-class problems, but seems to fall short for more nuanced classification such as multi-class emotion recognition. MNT seems more suitable for that task, as shown for MOSEI and MELD. A potential issue for MAT is that we work with shallow architectures (B = 4) compared to recent NLP solutions like BERT using up to 48 layers. In the scope of the datasets presented, we do not have enough samples to train such architectures. It is possible that MNT adjusts better with shallow layers because it can modulate entire feature maps twice per block.

Conclusions
In this paper, we propose two different architectures, MAT (Modulated Attention Transformer) and MNT (Modulated Normalization Transformer), for the task of emotion recognition and sentiment analysis. They are based on Transformers and use two modalities: linguistic and acoustic.
The performance of our methods was thoroughly studied by comparison with a Naive Transformer baseline and the most relevant related works on several datasets suited for our experiments.
We showed that our Transformer baseline, which encodes both modalities separately, already performs well compared to the state of the art. The solutions including modulation of one modality from the other show a higher performance. Overall, the architectures offer an efficient, lightweight and scalable solution that challenges, and sometimes surpasses, the previous works in the field.