Towards Emotion-aided Multi-modal Dialogue Act Classification

The task of Dialogue Act Classification (DAC), which aims to capture communicative intent, has been studied extensively. But these studies limit themselves to text. Non-verbal features (change of tone, facial expressions, etc.) can provide cues to identify DAs, thus stressing the benefit of incorporating multi-modal inputs in the task. Also, the emotional state of the speaker has a substantial effect on the choice of the dialogue act, since conversations are often influenced by emotions. Hence, the effect of emotion on the automatic identification of DAs also needs to be studied. In this work, we address the role of both multi-modality and emotion recognition (ER) in DAC. DAC and ER help each other by way of multi-task learning. One of the major contributions of this work is a new multi-modal, emotion-aware dialogue act dataset called EMOTyDA, collected from open-sourced dialogue datasets. To demonstrate the utility of EMOTyDA, we build an attention based (self, inter-modal, inter-task) multi-modal, multi-task Deep Neural Network (DNN) for joint learning of DAs and emotions. We show empirically that multi-modality and multi-tasking achieve better DAC performance than uni-modal and single task DAC variants.


Introduction
Dialogue Act Classification (DAC) is concerned with deciding the type, i.e., the communicative intention (question, statement, command, etc.), of the speaker's utterance. DAC is very important in the context of discourse structure, which in turn supports intelligent dialogue systems, conversational speech transcription and so on. Considerable work has been done on classical Machine Learning (ML) based DAC (Stolcke et al., 2000), (Verbree et al., 2006), etc. and on Deep Learning (DL) based DAC.
* The authors have contributed equally.
Humans are emotional entities. A speaker's emotional state considerably influences the intended or pragmatic content of an utterance (Barrett et al., 1993). An utterance such as "Okay sure" or "Ya right" can be considered as "agreement" or, in case of sarcasm, "disagreement". For expressive DAs such as "greeting", "thanking", "apologizing", etc., the speaker's feeling or emotion can assist in recognizing the true communicative intent and vice versa. Thus, it is important to consider the speaker's emotion when deciding on the DA.
There is considerable work on ER (Cowie et al., 2001), (Jain et al., 2018), etc. and on adapting Virtual Agents (VAs) to act accordingly (Zhou et al., 2018), (Fung et al., 2018), etc. But very little research addresses the impact of emotion while deciding the DA of an utterance (Novielli and Strapparava, 2013), (Bosma and André, 2004). As DAs primarily dictate the flow of any dialogue conversation (be it human-human or human-computer), such synergy of ER and DAC is required. Research has also shown the benefit of combining text and non-verbal cues (Poria et al., 2017b), (Poria et al., 2017a), etc., for solving various Natural Language Processing (NLP) tasks. The main advantage of integrating other modalities with text is the use of behavioral signs present in the acoustic (vocal modulations) and visual (facial expression) modalities. In addition, the various modalities offer important signals to better identify the speaker's communicative intention and emotional state. This will in effect help create sturdier and more reliable DAC models.
In this paper, we study the influence of emotion on the identification of DAs, by utilizing the combination of text, vocal modulations and facial expressions for task-independent conversations. DAC is our primary task, assisted by Emotion Recognition (ER) as an auxiliary task. We implement an attention based multi-modal, multi-tasking DNN to do joint modeling of DAC and ER. Also, we introduce a new dataset to help advance research in multi-modal DAC.
The key contributions of this paper are as follows:
i. We curate a new dataset called EMOTyDA for facilitating multi-modal DAC research with high-quality annotations, including emotionally aided cues and conversational context features. We believe this dataset will advance research in multi-modal DAC;
ii. We point to different scenarios where a discrepancy in DAC is evident across different modalities, thus showing the importance of multi-modal approaches to DAC;
iii. We show, using various instances, the usefulness of considering the emotional state of the user while identifying DAs. Consequently, we deduce that EMOTyDA will lead to a novel sub-task for future research: emotion aware DAC;
iv. We propose an attention based (self, inter-modal, inter-task) multi-task, multi-modal DNN for jointly optimizing the DAC and ER tasks and show its benefit over single task DAC variants. Through this, we also establish that multi-modal DAC performs significantly better than uni-modal DAC.

Related Works
The tasks of ER and DAC are extensively explored.
Dialogue Act Frameworks: DAC has been investigated since the late 90s (Reithinger and Klesen, 1997), (Stolcke et al., 1998) and early 2000s (Stolcke et al., 2000), (Grau et al., 2004). Much of this research, however, uses chat transcripts with only the text mode, partly due to the unavailability of multi-modal open-source datasets. In (Khanpour et al., 2016), the authors apply stacked LSTMs to classify speech acts. A hierarchical network based approach using Bi-LSTMs and a CRF has also been developed. A contextual self-attention system fused with hierarchical recurrent units was proposed by the authors of (Raheja and Tetreault, 2019) to develop a sequence label classifier. A Convolutional Network based approach has also been proposed to capture long-range interactions that span a series of words. In (Saha et al., 2019), the authors proposed several ML and DL based approaches such as Conditional Random Fields, clustering and word embeddings to identify DAs. However, all these works identify DAs by utilizing solely the textual modality without the use of emotional cues.
Emotion aware DAs. Within a multi-modal setting, little work is available in the literature that studies the impact of emotional state on the evaluation of DAs. The effect of integrating facial features as a way of identifying emotion to classify DAs was examined by the authors of (Boyer et al., 2011). They demonstrated their work on tutorial dialogue sessions, which are typically task-oriented, and applied logistic regression to identify DAs. But they studied only cognitive-affective states such as confusion and flow as the emotional categories to learn DAs. In (Novielli and Strapparava, 2013), the authors examined the impact of affect analysis in DA evaluation for an unsupervised DAC model. They made use of lexicon based features from WordNet Affect and SentiWordNet, mapping them to emotion labels to model the DAs in an LSA based approach. The authors of (Ihasz and Kryssanov, 2018) also inspected the impact of emotions mediated with intention or DAs for an in-game Japanese dialogue. Their goal was to construct DA-emotion combinations from the pre-annotated corpus. However, such stringent associations or dis-associations amongst DA-emotion pairs may not truly hold for real life conversations.

Dataset
To facilitate and enhance research in multi-modal DAC assisted with user emotion, we introduce a new dataset (EMOTyDA) consisting of short videos of dialogue conversations manually annotated with their DAs along with their pre-annotated emotions.

Data Collection
To gather potentially emotion-rich conversations and explore their effect on DAC, we scanned the literature for existing multi-modal ER datasets. During our initial search, we obtained several multi-modal ER datasets, which include Youtube (Morency et al., 2011), MOUD (Pérez-Rosas et al., 2013), IEMOCAP (Busso et al., 2008), ICT-MMMO (Wöllmer et al., 2013), CMU-MOSI (Zadeh et al., 2016), CMU-MOSEI (Zadeh et al., 2018) and MELD (Poria et al., 2019), etc. However, we zeroed in on the IEMOCAP and MELD datasets for further investigation of our problem statement. The reason behind this choice was that all the remaining datasets mentioned above are primarily monologues involving opinions and product reviews, whereas our research requires task-independent dyadic or multi-party conversations to analyze its full potential. Neither of these two datasets is annotated for the corresponding DAs.
Also, benchmark DAC datasets such as Switchboard (SWBD) (Godfrey et al., 1992) and ICSI Meeting Recorder (Shriberg et al., 2004) consist of text and audio-based conversations, whereas TRAINS (Heeman and Allen, 1995) consists of solely text-based conversations with no emotional tags. The HCRC Map Task corpus (Anderson et al., 1991) additionally encompasses the audio modality with the transcripts, but the corpus itself has task-oriented conversations and is not annotated with emotion tags. It is to be noted that task-oriented conversations generally restrict the presence of diverse tags which are commonly encountered in task-independent conversations.
To the best of our knowledge, at the time of writing, there was no sizable and open-access DA and emotion annotated multi-modal dialogue data. Thus, the MELD and IEMOCAP datasets have been manually annotated for the corresponding DAs to encourage and promote novel research on multi-modal DAC and to build a multi-tasking system that allows the DA and emotion of an utterance to be learned jointly.

Data Annotation
Over the years, the SWBD-DAMSL tag-set comprising 42 DAs, developed by (Jurafsky, 1997), has been used widely for the task of DAC for task-independent dyadic conversations such as the SWBD corpus. Thus, we use the SWBD-DAMSL tag-set as the base for conceiving the tag-set for the EMOTyDA dataset, since both these datasets contain task-independent conversations. Of the 42 SWBD-DAMSL tags, the 12 most commonly occurring tags have been used to annotate utterances of the EMOTyDA dataset. The choice of 12 tags is because of the limited size of the EMOTyDA dataset in comparison to the SWBD corpus: it is highly likely that many of the tags of the SWBD-DAMSL tag-set will never appear in the EMOTyDA dataset due to the smaller number of utterances and the lower diversity of occurrence of such fine-grained tags. The 12 chosen tags are Greeting (g), Question (q), Answer (ans), Statement-Opinion (o), Statement-Non-Opinion (s), Apology (ap), Command (c), Agreement (ag), Disagreement (dag), Acknowledge (a), Backchannel (b) and Others (oth).
For the current work, we have selected a subset of 1039 dialogues from MELD amounting to 9989 utterances and the entire IEMOCAP dataset of 302 dialogues amounting to 9376 utterances to curate the EMOTyDA dataset. Details of the original MELD and IEMOCAP datasets are provided in the Appendix 6. Three annotators, all graduates in English linguistics, were assigned to annotate the utterances with the appropriate DAs out of the 12 chosen tags. They were asked to annotate these utterances by only viewing the available video and considering the dialogue history, without access to the pre-annotated emotion tags. This was done to ensure that the dataset does not get biased by specific DA-emotion pairs. An inter-annotator score of over 80% was considered reliable agreement. It was determined based on the count of utterances for which more than two annotators agreed on a particular tag. To remove the discrepancy in the number of emotion tags between the IEMOCAP and MELD datasets, we mapped the joy tag of MELD to the happy tag of IEMOCAP, finally settling on the 10 tags from IEMOCAP for the EMOTyDA dataset.
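The agreement check and the emotion tag mapping described above can be illustrated with a minimal sketch. The votes below are hypothetical, and the `min_votes` threshold is an assumption exposed as a parameter, since the exact counting rule used by the authors is not fully specified here.

```python
from collections import Counter

def majority_label(votes, min_votes=2):
    """Return the most common tag if at least `min_votes` annotators chose it, else None."""
    tag, count = Counter(votes).most_common(1)[0]
    return tag if count >= min_votes else None

def agreement_rate(all_votes, min_votes=2):
    """Fraction of utterances on which at least `min_votes` annotators agree."""
    agreed = sum(1 for votes in all_votes if majority_label(votes, min_votes) is not None)
    return agreed / len(all_votes)

# Hypothetical votes from three annotators for four utterances (tags from the 12-tag set).
votes = [
    ("q", "q", "q"),
    ("s", "o", "s"),
    ("ag", "dag", "a"),
    ("g", "g", "g"),
]
labels = [majority_label(v) for v in votes]
rate = agreement_rate(votes)

# Emotion harmonization stated in the text: MELD's "joy" is mapped to IEMOCAP's "happy".
EMOTION_MAP = {"joy": "happy"}

def harmonize(emotion):
    """Map a MELD emotion tag onto the IEMOCAP tag-set; other tags pass through unchanged."""
    return EMOTION_MAP.get(emotion, emotion)
```

Raising `min_votes` to 3 would count only unanimous utterances, which gives a stricter agreement score on the same votes.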

Emotion-DA Dataset: EMOTyDA
The EMOTyDA dataset now comprises 1341 dyadic and multi-party conversations, resulting in a total of 19,365 utterances or annotated videos with the corresponding DA and emotion tags considering the dialogue history. The dataset contains approximately 22 hours of recordings. Source distribution and major speaker statistics of the dataset are shown in Figures 3a and 3b, respectively. Since the DAC and ER tasks are known to exploit contextual features, i.e., dialogue history, utterances in the dataset are accompanied by their corresponding contextual utterances, which are typically the preceding dialogue turns by the speakers participating in the dialogue. Each of the utterances contains three modalities: video, audio, and text. All the utterances are also accompanied by their speaker identifiers. Table 1 shows a few utterances along with the corresponding DA and emotion labels from the proposed dataset. Distributions of DA and emotion labels across the source datasets are shown in Figures 1a and 1b, respectively.

Qualitative Aspects
In the current work, we seek to analyze the effect of emotion in classifying DAs. Also, DAC in text often requires extra information that can be obtained from the associated modalities. Below, we analyze some samples that require emotion aided and multi-modal reasoning. We use a few instances from our proposed dataset to support our claim that DAs are often expressed in a multi-modal way and depend on the emotional state of the speaker.
Role of Emotion. In Figure 2b, we present two instances from the dataset where the emotional state of the user seems beneficial in deciding the DA of an utterance. In the first example, the sad and dismal state of the speaker leads it to acknowledge the presence of the hearer. In the second case, the angry emotional state of the speaker forces her to disagree with the opinions or suggestions of the people involved in the conversation. These examples illustrate the importance of having emotional information, as emotions affect the communicative intention or DA of the speaker. The presence of emotion in our dataset provides models with the ability to use additional information while reasoning about DAs.
Role of Multi-modality. Figure 2a shows two cases where the DA is articulated through incongruity between modalities. In the first instance, the facial modality implies anger or fury, whereas the textual modality lacks any visible sign of displeasure; on the contrary, it indicates an agreement. So, the textual claim does not match the facial features.
In the second case, the textual modality hints at pure agreement, whereas the audio modality expresses a sarcastic appeal. In both these cases, there exists inconsistency between modalities, which acts as a strong indicator that multi-modal information is also important in providing additional cues for identifying DAs. The availability of complementary information across multiple modalities improves the model's ability to learn discriminatory patterns that are responsible for this complex process.

Proposed Methodology
This section describes the proposed multi-task, multi-modal approach followed by the implementation details.

Multi-modal Feature Extraction
Here, we discuss the process of multi-modal feature extraction.
Textual Features. The transcriptions available for each video form the source of the textual modality (see footnote 2). To extract textual features, pretrained GloVe (Pennington et al., 2014) embeddings of dimension 300 have been used to represent words as word vectors. The resultant word embeddings of each word are concatenated to obtain a final utterance representation. While it is indeed possible to use more advanced textual encoding techniques (e.g., convolutional or recurrent neural networks), we decided to use the same pre-trained extractive strategy as in the case of the other modalities.
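As a rough illustration of the lookup-and-concatenate step, the sketch below stacks one vector per token into a fixed-length utterance matrix. The toy 3-dimensional vectors, the zero vector for unknown words and the zero padding to `max_len` are assumptions made for the example; the paper uses pretrained 300-dimensional GloVe vectors.

```python
# Toy 3-dimensional "embeddings"; the paper uses pretrained 300-dimensional GloVe vectors.
EMBEDDINGS = {
    "okay": [0.1, 0.2, 0.3],
    "sure": [0.4, 0.5, 0.6],
}
DIM = 3
UNK = [0.0] * DIM  # assumption: out-of-vocabulary words map to a zero vector

def utterance_representation(utterance, max_len):
    """Look up each token and stack the vectors into a max_len x DIM matrix (zero-padded)."""
    tokens = utterance.lower().split()[:max_len]
    rows = [list(EMBEDDINGS.get(tok, UNK)) for tok in tokens]
    rows += [[0.0] * DIM for _ in range(max_len - len(rows))]
    return rows

# One row per token position, so the result has shape (max_len, DIM).
U = utterance_representation("Okay sure", max_len=4)
```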
Audio Features. To elicit features from the audio, openSMILE (Eyben et al., 2010), an open-source software, has been used. The features obtained by openSMILE include maxima dispersion quotients (Kane and Gobl, 2013), glottal source parameters, several low-level descriptors (LLD) such as voice intensity, voice quality (e.g., jitter and shimmer), MFCC, voiced/unvoiced segmented features (Drugman and Alwan, 2011), pitch and their statistics (e.g., root quadratic mean, mean, etc.), 12 Mel-frequency coefficients, etc. All the above features are then concatenated together to form a $d_q = 256$ dimensional representation for each window. The final audio representation of each utterance ($A$) is obtained by concatenating the obtained $d_q$-dimensional vectors for every window, i.e., $A \in \mathbb{R}^{w \times d_q}$, where $w$ represents the total number of window segments.
2 Original datasets with their videos and transcripts are downloaded from: https://github.com/SenticNet/MELD, https://sail.usc.edu/iemocap/iemocap_release.htm
Video Features. To elicit visual features for each of the $f$ frames from the video of an utterance, we use a pool layer of an ImageNet (Deng et al., 2009) pretrained ResNet-152 (He et al., 2016) image classification model. Initially, each of the frames is preprocessed, which includes resizing and normalizing. So, the visual representation of each utterance ($F$) is obtained by concatenating the obtained $d_f = 4096$ dimensional feature vector for every frame, i.e., $F \in \mathbb{R}^{f \times d_f}$ (Castro et al., 2019), (Illendula and Sheth, 2019), (Poria et al., 2017b), (Poria et al., 2017a).
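The shapes of the audio and video representations can be made concrete with a small sketch. It only stacks per-window and per-frame vectors into matrices and checks their sizes; the feature values are placeholders, and the window and frame counts are hypothetical, since the actual vectors come from openSMILE and ResNet-152.

```python
def stack_features(vectors, dim):
    """Stack per-segment feature vectors into a (num_segments x dim) matrix, checking sizes."""
    for v in vectors:
        if len(v) != dim:
            raise ValueError(f"expected {dim} features, got {len(v)}")
    return [list(v) for v in vectors]

def shape(matrix):
    """Return (rows, columns) of a list-of-lists matrix."""
    return (len(matrix), len(matrix[0]) if matrix else 0)

D_Q, D_F = 256, 4096  # per-window and per-frame dimensions stated in the text

# Placeholder features: w = 5 audio windows and f = 3 video frames (hypothetical counts).
audio_windows = [[0.0] * D_Q for _ in range(5)]
video_frames = [[0.0] * D_F for _ in range(3)]

A = stack_features(audio_windows, D_Q)  # shape (w, d_q)
F = stack_features(video_frames, D_F)   # shape (f, d_f)
```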

Network Architecture
The proposed network consists of three main components: (i) Modality Encoders (ME), which take as input the uni-modal features (extracted above) and produce as outputs the individual modality encodings, (ii) a Triplet Attention Subnetwork (TAS) that encompasses self, inter-modal and inter-task attention, and (iii) a classification layer that contains the outputs of both the tasks (DAC and ER).

Modality Encoders
In this section, we discuss how different modalities are encoded in the architectural framework.
Textual Modality. The utterance representation ($U$) obtained from the extracted textual features (discussed above) is passed through three different Bi-directional LSTMs (Bi-LSTMs) (Hochreiter and Schmidhuber, 1997) to sequentially encode these representations into hidden states and learn different semantic dependency based features pertaining to the different tasks, i.e., DAC and ER. The first Bi-LSTM learns DAC features that are tuned in accordance with the emotion features. The second learns features for the ER task regulated by the learning of DA features. The third Bi-LSTM learns private features for the task of DAC which are not influenced by the features learnt from emotion.
For each word, the corresponding forward and backward hidden states are concatenated, giving a hidden state matrix $H \in \mathbb{R}^{n \times 2d}$, where $d$ represents the number of hidden units in each LSTM and $n$ is the sequence length. Thus, the three hidden state matrices obtained from the three Bi-LSTMs are $H_1$, $H_2$ and $H_3$. These representations are then passed through three fully connected layers, each of dimension $d_c$, to learn attention of different variants.
Audio and Video Modalities. The extracted audio and video features ($A$ and $F$) are also passed through three fully connected layers, each of dimension $d_c$, to learn attention of different variants.

Triplet Attention Subnetwork
We use a similar concept as in (Vaswani et al., 2017), where the authors proposed to compute attention as mapping a query and a set of key-value pairs to an output. The output is estimated as a weighted sum of the values, where the weight assigned to each value is calculated by a compatibility function of the query with its corresponding key. So, the representations obtained from each of the modality encoders above, each passed through three fully-connected layers, are termed as queries and keys of dimension $d_k = d_c$ and values of dimension $d_v = d_c$. We now have five triplets of $(Q, K, V)$: the first three are from the textual modality encoder (one each for DA shared, DA private and Emotion shared), followed by one each from the audio and video encoders. These triplets are then used in different combinations to compute attention scores meant for specific purposes, which include self attention, inter-modal attention and inter-task attention.
Self Attention. We compute self attention (SA) for each of these triplets as the matrix multiplication of its queries with its corresponding keys.
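A minimal sketch of the self-attention score computation follows, using plain Python lists. The encoder output `H` and the projection weights are hypothetical toy values, the fully connected layers are shown without bias or activation, and no scaling of the scores is applied because the text does not mention one.

```python
def matmul(a, b):
    """Multiply two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def linear(x, w):
    """Bias-free fully connected layer: (n x d_in) times (d_in x d_c)."""
    return matmul(x, w)

# Hypothetical encoder output with n = 2 positions and 2 features.
H = [[1.0, 0.0],
     [0.0, 2.0]]

# Hypothetical projection weights (d_c = 2); in a trained model these are learned.
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[0.0, 1.0], [1.0, 0.0]]
W_v = [[1.0, 1.0], [1.0, 1.0]]

# One (Q, K, V) triplet obtained from three fully connected layers.
Q, K, V = linear(H, W_q), linear(H, W_k), linear(H, W_v)

# Self-attention scores: queries multiplied with the transposed keys, shape n x n.
SA = matmul(Q, transpose(K))
```

In the full model, this computation would be repeated for each of the five triplets.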
Inter-modal Attention. We compute inter-modal attention (IMA) amongst triplets of all the modalities for the multi-task setting as the matrix multiplication of combinations of queries and keys of different modalities using Equation 4. In this manner, we obtain five IMA scores. This is done in order to identify significant contributions amongst different modalities to learn optimal features for an utterance.
Inter-task Attention. We compute inter-task attention (ITA) amongst triplets of different tasks from the textual modality as the matrix multiplication of combinations of queries and keys of different tasks using Equation 4. In this manner, we obtain three ITA scores: $ITA_{12} \in \mathbb{R}^{n \times n}$, $ITA_{21} \in \mathbb{R}^{n \times n}$ and $ITA_{31} \in \mathbb{R}^{n \times n}$. This is done in order to learn joint features of an utterance for the identification of DAs and emotions.
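Inter-modal and inter-task scores differ from self attention only in that the queries and keys come from different triplets. The sketch below shows this under stated assumptions: the matrices are toy values, the variable names are hypothetical, and which query-key pairs are actually combined in the model is not specified here.

```python
def matmul(a, b):
    """Multiply two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def cross_scores(queries, keys):
    """Attention scores between queries of one triplet and keys of another."""
    return matmul(queries, transpose(keys))

# Hypothetical projections with d_c = 2.
Q_text_da = [[1.0, 0.0], [0.0, 1.0]]             # text, DA branch, n = 2 positions
K_text_emo = [[1.0, 1.0], [0.0, 1.0]]            # text, emotion branch, n = 2 positions
K_audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # audio, w = 3 windows

# Inter-task scores: both operands come from the textual modality, shape n x n.
ITA = cross_scores(Q_text_da, K_text_emo)

# Inter-modal scores: text queries against audio keys, shape n x w in this toy setup.
IMA = cross_scores(Q_text_da, K_audio)
```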

Fusion of Attentions.
We then apply softmax to all the computed attention scores to squash them into the range [0, 1], so that the entries with maximum contribution get the highest probability values and vice versa. We then compute the attention output matrices for the different tasks and modalities by weighting the corresponding values with these normalized scores, where each attention output $A \in \mathbb{R}^{n \times d_c}$. We thus obtain 13 attention outputs from the corresponding attention scores (five from SA, five from IMA and three from ITA), e.g., $SA \in \mathbb{R}^{n \times d_c}$ for $SA_1$. Next, we take the mean of different attention outputs in varying combinations to obtain a representation for each of the modalities and tasks, where each representation $M \in \mathbb{R}^{1 \times d_c}$. Next, we focus on learning appropriate weights to combine these representations into the final sentence representation for each of the tasks to be optimized jointly, where * represents the dot product of two vectors. Finally, we obtain the sentence representation ($S$) for each of the tasks.
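The normalization and pooling steps can be sketched as follows. The row-wise softmax and the weighted sum of values follow the attention formulation of (Vaswani et al., 2017) referenced in the text; pooling by averaging over the rows to reach a $1 \times d_c$ vector is an assumption made for this illustration, and the learned fusion weights are not shown.

```python
import math

def softmax_rows(scores):
    """Apply a numerically stable softmax to each row of a score matrix."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def matmul(a, b):
    """Multiply two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def mean_over_rows(matrix):
    """Average an n x d_c matrix over its rows, giving a single d_c-dimensional vector."""
    n = len(matrix)
    return [sum(col) / n for col in zip(*matrix)]

# Hypothetical uniform scores and values for n = 2 positions and d_c = 2.
scores = [[0.0, 0.0], [0.0, 0.0]]
V = [[1.0, 3.0], [3.0, 5.0]]

weights = softmax_rows(scores)   # each row sums to 1
A_out = matmul(weights, V)       # attention output, shape n x d_c
M = mean_over_rows(A_out)        # pooled representation of length d_c
```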

Classification Layer
The outputs of the TAS, i.e., the sentence representations for each of the tasks ($S_{DA}$ and $S_E$), are connected to a fully-connected layer, which in turn feeds the output neurons for both the tasks (DAC and ER). The errors computed from each of these channels are back-propagated jointly to the preceding layers of the model in order to learn the joint features of both the tasks, thereby allowing them to benefit from the TAS layer. As the main aim of this study is to learn DAs with the help of emotion, the performance of the DAC task also depends on the quality of features learned for the ER task, with useful and better features assisting the collective learning process and vice versa.
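Joint back-propagation from two output channels is commonly realized by summing the per-task losses. The sketch below assumes an unweighted sum of two categorical cross-entropy terms; the `emo_weight` parameter is hypothetical, and the logits are placeholder values rather than model outputs.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Categorical cross-entropy for a single example with a one-hot target."""
    return -math.log(probs[target_index])

def joint_loss(da_logits, da_target, emo_logits, emo_target, emo_weight=1.0):
    """Sum of the DAC loss and the (optionally weighted) ER loss."""
    da = cross_entropy(softmax(da_logits), da_target)
    emo = cross_entropy(softmax(emo_logits), emo_target)
    return da + emo_weight * emo

NUM_DA, NUM_EMO = 12, 10  # output sizes of the two channels

# Placeholder logits: an untrained model that is uniform over all classes.
da_logits = [0.0] * NUM_DA
emo_logits = [0.0] * NUM_EMO
loss = joint_loss(da_logits, 3, emo_logits, 7)
```

Setting `emo_weight` to 0 recovers a single task DAC loss, which makes the contrast with the multi-task objective explicit.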

Implementation
The EMOTyDA dataset was divided into train and test sets with an 80%-20% split. The statistics of the train and test sets are shown in Table 2. For all the experiments conducted, the same train and test sets were employed to allow a fair comparison between all approaches. For encoding the textual modality, a Bi-LSTM layer with 200 memory cells was used, followed by dropout with a rate of 0.1. Fully-connected layers of dimension 300 were used in all the subsequent layers. The first and the second channels contain 12 and 10 output neurons, respectively, for the DA and the emotion tags. The categorical cross-entropy loss function is used in both the channels. A learning rate of 0.01 was found to be optimum. The Adam optimizer was used in the final experimental setting. All these values were selected after a thorough sensitivity analysis of the parameters.
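A fixed, reusable 80%-20% split can be sketched as below. Splitting at the utterance level, the shuffling, and the seed value are assumptions for illustration; the actual train and test statistics are those reported in Table 2.

```python
import random

def train_test_split(items, test_fraction=0.2, seed=13):
    """Deterministically split items into train and test lists."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_fraction)
    test_idx = set(indices[:n_test])
    train = [x for i, x in enumerate(items) if i not in test_idx]
    test = [x for i, x in enumerate(items) if i in test_idx]
    return train, test

# Hypothetical utterance ids, one per utterance in the dataset.
utterance_ids = list(range(19365))
train, test = train_test_split(utterance_ids)
```

Because the seed is fixed, every experiment that calls the function with the same arguments sees the same partition.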

Results and Analysis
EMOTyDA contains both dyadic and multi-party dialogues, so, in addition to the whole dataset, we performed experiments on dyadic and multi-party conversations separately for the multi-task framework with different modalities. Additionally, we provide results of the multi-task framework with varying combinations of the different attentions to analyze the effectiveness of each attention on the entire EMOTyDA dataset. Along with this, we also include results of some simple baselines such as feature level, hidden state level and hypothesis level concatenation. It is to be noted that the purpose of the current work is to examine the effect of emotion while deciding the DA of an utterance from multiple modalities. We, therefore, do not focus on enhancements or analysis of the ER task and view it as an auxiliary task aiding the primary task, i.e., DAC. Accordingly, the results and findings are reported with respect to only the DAC task and its different combinations.
Table 3 shows the results of all the various models. As visible, the textual modality provides the best results amongst the uni-modal variants. The addition of audio and visual features individually improves this uni-modal baseline. The combination of visual and textual features achieves the best score throughout all the combinations of the dataset. The tri-modal variant is not able to attain the best score, supposedly because of the suboptimal performance of the audio modality, though it still improves the performance compared to all the uni-modal baselines. Figure 5 shows the heatmap visualization of the tri-modal variant to highlight the contributions of different modalities.
Table 4: Results of various baseline models for the multi-task framework for the EMOTyDA dataset
Figure 5: Heatmap visualization of sample utterances for the tri-modal variant. V, A and T represent attention scores of video, audio and textual features, respectively. Sample utterances - u1: "I am not in the least bit drunk.", u2: "There's a lot of people looking for jobs.", u3: "It was ridiculous. Completely ridiculous.", u4: "You don't have to explain.", u5: "No, Rachel doesn't want me to..."
As is also evident from the results, the multi-task variant performs consistently well throughout all the experiments compared to its single task DAC variant. As a baseline, we also show that using emotion as a feature in the single task DAC counterpart doesn't outperform the proposed multi-task variant. This shows that the joint optimization of both these tasks boosts the performance of DAC. Table 4 shows the results of a few simple baselines along with the ablation study of the different attentions used in the proposed framework, to highlight the importance and effectiveness of each of the attentions for the whole EMOTyDA dataset. As seen from the table, the combination of all three attention mechanisms, i.e., SA, IMA and ITA, yields the best results, thus stressing the role of incorporating across-task and across-modal relationships. Figure 6 shows the visualization of the learned weights of different words for a sample utterance for the single task DAC as well as the multi-task model, to highlight the importance of incorporating ER as an auxiliary task. The true DA label of the utterance in Figure 6 is disagreement, with anger as the emotion. With the multi-task approach, the attention is laid on appropriate disagreement bearing words, whereas with the single task approach, attention is learnt on agreement words such as yes, which here has just been used in a sarcastic way to disagree. It is also observed that the experiments with dyadic conversations attain better results as compared to multi-party conversations. This is supposedly due to the constant change of speakers in multi-party conversations, which misleads the classifier into learning suboptimal features, thus stressing the role of speaker information as a valuable cue for DAC.
Table 5: Sample utterances with their predicted labels for the best performing multi-task (MT) (T+V) model and its single task (ST) DAC variants; these examples show that ER as an auxiliary task helps DAC achieve better performance in MT.
Figure 6: The visualization of the learned weights for an utterance - u1: "Oh yes, yes I am, you can't stop me." for the best performing model (T+V), single task DAC (baseline) and multi-task DAC+ER (proposed) model
Error Analysis. Plausible reasons behind the faults in the DA prediction are as follows: (i) Skewed dataset: The occurrence of most of the tags in the proposed dataset is very low, i.e., the dataset is skewed as shown in Figure 1a. This is consistent with real-world task-independent conversations, where some tags occur less frequently than others; (ii) Composite and longer utterances: Most of the utterances in the dataset are long and also composite in nature, encompassing diversified intentions in a single utterance. In such cases, it becomes difficult to learn features for discrete DAs; (iii) Mis-classification of emotion labels: Mis-classification of the DAs can be attributed to the mis-classification of the emotions for that particular utterance. Some examples of this are shown in Table 5.

Conclusion and Future Work
In this paper, we investigate the role of emotion and multi-modality in determining the DAs of an utterance. To enable research on these aspects, we create a novel dataset, EMOTyDA, that contains emotion-rich videos of dialogues collected from various open-source datasets and manually annotated with DAs. We also propose an attention based (self, inter-modal, inter-task) multi-modal, multi-task framework for joint optimization of DAs and emotions. Results show that multi-modality and multi-tasking boost the performance of DA identification compared to its uni-modal and single task DAC variants. In the future, conversation history, speaker information and fine-grained modality encodings can be incorporated to predict DAs with more accuracy and precision.