EmotionX-AR: CNN-DCNN autoencoder based Emotion Classifier

In this paper, we model emotions in the EmotionLines dataset using a convolutional-deconvolutional autoencoder (CNN-DCNN) framework. We show that adding a joint reconstruction loss improves performance. In quantitative evaluation, the jointly trained network, augmented with linguistic features, reports the best accuracies for predicting emotion in text, namely joy, sadness, anger, and neutral.


Introduction
Emotion recognition in content is an extensively studied area. It deals with associating words, phrases or documents with various categories of emotions. The importance of emotion analysis in human communication and interactions has been discussed by Picard (1997). Historically studied using multi-modal data, the study of human emotion from text and other published content has become an important topic in language understanding. Word correlation with social and psychological processes is discussed by Pennebaker (2011). Preotiuc-Pietro et al. (2017) studied personality and psycho-demographic preferences through Facebook and Twitter content. The analysis of emotion in interpersonal communication such as emails, chats and longer written articles is necessary for various applications including the study of consumer behavior and psychology, understanding audiences, and opinions in computational social science, and more recently for dialogue systems and conversational agents. This is an active research space today.
In contrast to sentiment analysis, emotion analysis in user-generated content such as tweets (Dodds et al., 2011), blogs (Aman and Szpakowicz, 2007) and chats remains a space less trodden. The WASSA-2017 task on emotion intensity (Mohammad and Bravo-Marquez, 2017) aims at detecting the intensity of emotion felt by the author of a tweet, whereas other work (Alm et al., 2005; Aman and Szpakowicz, 2007; Brooks et al., 2013; Neviarouskaya et al., 2009; Bollen et al., 2011) assigns discrete binary labels to text instances for emotion classification. Typical discrete categories are a subset of those proposed by Ekman (1992), namely anger, joy, surprise, disgust, sadness, and fear.
Paper Structure: The remainder of the paper is organized as follows. We summarize the EmotionLines dataset in Section 2. Section 3 describes the different parts of our system. We present our experiments in Section 4. Section 5 discusses the results of our final system submitted to the EmotionX challenge. Finally, we present conclusions and future directions in Section 6.

Data
The EmotionLines dataset contains dialogues from the Friends TV series and EmotionPush chat logs. The Friends TV scripts and the EmotionPush chat logs each contain 1,000 dialogues, split into training (720), development (80), and testing (200) sets. In order to preserve the completeness of each dialogue, the corpus was divided by dialogues, not by utterances. Refer to Chen et al. (2018) for details on the dataset collection and construction.
The EmotionX task on the EmotionLines dialogue dataset tries to capture the flow of emotion in a conversation. Given a dialogue, the task requires participants to determine the emotion of each utterance (in that dialogue) among four label candidates: joy, sadness, anger, and neutral.

System Description
In this section, we provide the technical details of our model.

Architecture Overview
We propose a joint learning framework for emotion detection built on a convolutional encoder (CNN). We introduce a joint learning objective in which the network learns (1) the utterance text (the data itself) and (2) the emotion information from the labeled data (EmotionLines) together. The CNN, along with a deconvolutional decoder (DCNN), provides the mechanism for text reconstruction, i.e. for learning the text sequences. The learned encoding, augmented with linguistic features, acts as the input feature space for emotion detection.
Consider a text input $d$ to the model. Each word $w_d^t$ in $d$ is embedded into a $k$-dimensional representation $e_t = E[w_d^t]$, where $E$ is a learned matrix. The embedding layer is passed through a CNN encoder to create a fixed-length vector $h_L$ for the entire input text $d$. This latent representation, appended with linguistic features, is then sent to a fully connected layer with a softmax classifier on top. Along with this, $h_L$ is also fed to a deconvolutional decoder which attempts to reconstruct $d$ from the latent vector. Therefore, the final loss function for the model, $\alpha_{ae} L_{ae} + (1 - \alpha_{ae}) L_c$, is a combination of the classification error $L_c$ and the reconstruction error $L_{ae}$, explained in the following subsections.

CNN-DCNN Autoencoder
Zhang et al. (2017) introduce a sequence-to-sequence convolutional encoder followed by a deconvolutional decoder (CNN-DCNN) framework for learning latent representations from text data. Their proposed framework outperforms RNN-based networks for text reconstruction and semi-supervised classification tasks. We leverage their network in our work.
Convolutional Encoder. A CNN with $L$ layers, inspired by Radford et al. (2015), is used to encode the document into a latent representation vector $h_L$. The first $L - 1$ convolutional layers create a feature map which is fed into a fully-connected layer implemented as a convolutional layer. This final layer produces the latent representation $h_L$, which acts as a fixed-dimensional summarization of the document.
Deconvolutional Decoder. We use the deconvolutional decoder introduced by Zhang et al. (2017) as is for our model. The reconstruction loss is defined as
$$L_{ae} = -\sum_{d \in D} \sum_{t} \log p(\hat{w}_d^t = w_d^t),$$
where $D$ is the set of observed sentences, and $w_d^t$ and $\hat{w}_d^t$ correspond to the words in the input and output sequences respectively.

Linguistic Features
Here, we explain the various linguistic features used in our network. Inspired by Chhaya et al. (2018), we use 68 linguistic features divided into 4 sub-groups: Lexical, Syntactic, Derived and Affect-based.
The lexical and syntactic features include features such as 'averageNumberofWords per sentence' and 'number of capitalizations'. Features that help quantify the readability of text form the derived features; this set contains features like Hedges, Contractions, and Readability scores. The fourth group comprises the affect-related features. These features are lexica-based and quantify the amount of affective content present in the text. All features used by Pavlick et al. (2016) for formality detection and by Danescu et al. (2013) for politeness detection are included in our analysis. We use Stanford CoreNLP and TextBlob for feature extraction and pre-processing.
Lexical and Syntactic Features: The lexical features capture various counts associated with the content, such as '#Question Marks' and 'Average Word Length'. Syntactic features include NER-based features, the number of blank lines, and text density, which is defined as
$$\rho = \frac{\#(\text{sentences})}{1 + \#(\text{lines})},$$
where $\rho$ is the text density, #(sentences) denotes the number of sentences in the text content, and #(lines) the number of lines, including blank lines, in the text message. Prior work in NLP relies extensively on these features.
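To make the count-based features concrete, the following minimal sketch computes a handful of them for a single message. The function name, the tokenization regex, and the feature keys are illustrative assumptions, not the extraction pipeline used in our system (which relies on Stanford CoreNLP and TextBlob).

```python
import re

def lexical_features(text: str) -> dict:
    """Compute a few count-based features for one utterance or message."""
    # Naive tokenization: runs of letters and apostrophes count as words.
    words = re.findall(r"[A-Za-z']+", text)
    lines = text.split("\n")
    return {
        "num_question_marks": text.count("?"),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "num_capitalized_words": sum(1 for w in words if w[0].isupper()),
        "num_blank_lines": sum(1 for line in lines if not line.strip()),
    }

features = lexical_features("Oh my God!\n\nAre you serious?")
print(features)
```

Each value would become one dimension of the linguistic feature vector; the remaining features follow the same pattern of simple per-message statistics.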
Derived: Readability Features: The derived features capture information such as the readability of text, the existence of hedges, subjectivity, contractions and sign-offs. Subjectivity, contractions and hedges are based on the TextBlob implementation. Readability is measured using the Flesch-Kincaid readability score, a measure of the ease of reading of a given piece of text. We use the textstat package in Python for the implementation.
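As a rough illustration of the readability feature, the sketch below computes the Flesch-Kincaid grade level from its published formula, 0.39 (words / sentences) + 11.8 (syllables / words) - 15.59. The choice of the grade-level variant, the vowel-group syllable heuristic, and the function names are assumptions made for this example; the textstat package has its own, more careful implementation.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level with naive sentence and syllable splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

simple = "The cat sat. The dog ran."
dense = "Quantitative evaluation demonstrates considerable improvement."
print(flesch_kincaid_grade(simple), flesch_kincaid_grade(dense))
```

Short sentences built from monosyllabic words receive a low grade, while long polysyllabic sentences receive a high one, which is the signal this feature contributes.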
Psycholinguistic Features: The affect features used in our analysis include:
1. Valence-Arousal-Dominance (VAD) Model (Mehrabian, 1980): We use Warriner's lexicon (Warriner et al., 2013) for these features. This lexicon contains real-valued scores for Valence, Arousal, and Dominance (VAD) on a scale of 1 to 9 each for 13,915 English words. The values 1, 5, and 9 correspond to low, moderate (i.e. neutral), and high values for each dimension respectively.
2. PERMA Model (Seligman, 2011): PERMA is a scale to measure positivity and well-being in humans. The model defines 5 dimensions: Positive Emotions, Engagement, Relationships, Meaning, and Accomplishments as quantifiers and indicators of positivity and well-being. Schwartz et al. (2013) published a PERMA lexicon, which we use in our work.
3. Formality Lists: We use the formality list provided by Brooke et al. (2010) for our experiments. It contains a set of words usually used to express formality or informality in text.
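A lexicon-based affect feature can be sketched as a lookup followed by an average. The three lexicon entries below are hypothetical placeholders rather than actual Warriner scores, and both the averaging scheme and the neutral fallback are assumptions made for illustration.

```python
from statistics import mean

# Hypothetical entries: word -> (valence, arousal, dominance), each on a 1-9 scale.
# Real scores would be loaded from the Warriner et al. (2013) lexicon.
TOY_VAD = {
    "happy": (8.5, 6.0, 7.0),
    "angry": (2.5, 7.0, 5.0),
    "table": (5.0, 3.0, 5.0),
}
NEUTRAL = (5.0, 5.0, 5.0)  # midpoint of the 1-9 scale

def vad_features(tokens, lexicon=TOY_VAD):
    """Average valence, arousal and dominance over tokens found in the lexicon."""
    hits = [lexicon[t.lower()] for t in tokens if t.lower() in lexicon]
    if not hits:
        # Assumption: back off to the neutral midpoint when no word matches.
        return NEUTRAL
    return tuple(mean(dim) for dim in zip(*hits))

scores = vad_features("I am so happy and not angry".split())
print(scores)
```

The same lookup-and-aggregate pattern would apply to the PERMA lexicon and the formality list, with counts or averages per dimension.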

Supervised Classification
Traditional affective language studies focus on analyzing features including lexical (Pennebaker et al., 2001), syntactic, and psycholinguistic features to detect emotions. We augment the latent vector produced by the CNN encoder with the set of linguistic features (Section 3.3) to capture emotions.
Let $h'$ denote the representation vector for linguistic features extracted from the input data $d$. $h'$ is normalized and concatenated with $h_L$ to derive $h'' = h_L \oplus h'$. $h''$ is passed to the fully connected layer with a softmax classifier on top, producing a probability $p_n$ for each neuron in the softmax layer; $y_n$ denotes the ground truth for the corresponding class $n$.
We use a cross-entropy based class-wise loss, given by
$$L_{c_n} = -\left[ y_n \log p_n + (1 - y_n) \log (1 - p_n) \right].$$
Since EmotionLines suffers from class imbalance, we give higher weight ($w_n$) to the losses incurred on data samples of minority classes:
$$\frac{1}{w_n} = \frac{a_n}{\sum_{i=1}^{N} a_i},$$
where $a_n$ denotes the number of samples of class $n$ in the training set and $N$ is the number of classes. Finally, we use a weighted loss function,
$$L_c = \sum_{n=1}^{N} w_n L_{c_n}.$$
Table 1 provides a summary of the features considered. N-grams and other semantic features are ignored as they introduce domain-specific biases. Word embeddings are treated separately and considered as raw features to train a supervised model.
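The class-weighting idea can be sketched in a few lines. The sketch assumes that the class-wise loss is the binary cross-entropy of each softmax output and that $w_n$ is the inverse relative frequency of class $n$; the class counts and function names are hypothetical and chosen only for illustration.

```python
import math

def class_weights(class_counts):
    """Inverse-frequency weights: w_n = (sum_i a_i) / a_n (assumed form)."""
    total = sum(class_counts)
    return [total / a_n for a_n in class_counts]

def weighted_loss(probs, target, weights, eps=1e-12):
    """Weighted sum of class-wise cross-entropy terms for one sample.

    probs   -- softmax outputs p_n, one per class
    target  -- index of the ground-truth class (y_n = 1 there, 0 elsewhere)
    weights -- per-class weights w_n
    """
    loss = 0.0
    for n, (p_n, w_n) in enumerate(zip(probs, weights)):
        y_n = 1.0 if n == target else 0.0
        ce_n = -(y_n * math.log(p_n + eps) + (1.0 - y_n) * math.log(1.0 - p_n + eps))
        loss += w_n * ce_n
    return loss

# Hypothetical training counts for (neutral, joy, sadness, anger).
weights = class_weights([600, 200, 100, 100])
print(weights)
```

With these weights, an equally confident prediction costs more when the true class is rare, which pushes the classifier away from always predicting the majority class.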

Joint learning
The CNN-DCNN network learns the text information, i.e. the sequences, while the linguistic features capture the emotional aspect. Joint learning introduces the mechanism to learn shared representations during network training. We implement joint learning by simultaneously optimizing for both sequence reconstruction (CNN-DCNN) and emotion detection (linguistic features). The combined loss function is given by
$$L = \alpha_{ae} L_{ae} + (1 - \alpha_{ae}) L_c,$$
where $\alpha_{ae}$ is a balancing hyperparameter with $0 \le \alpha_{ae} \le 1$. The higher the value of $\alpha_{ae}$, the more importance is given to the reconstruction loss $L_{ae}$ during training, and vice versa.
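The effect of the balancing hyperparameter can be seen in a small sketch of the combined objective. The function name and the loss values are hypothetical; in training, both terms would come from the same mini-batch.

```python
def joint_loss(loss_ae: float, loss_c: float, alpha_ae: float = 0.5) -> float:
    """Convex combination alpha_ae * L_ae + (1 - alpha_ae) * L_c."""
    if not 0.0 <= alpha_ae <= 1.0:
        raise ValueError("alpha_ae must lie in [0, 1]")
    return alpha_ae * loss_ae + (1.0 - alpha_ae) * loss_c

# Hypothetical per-batch loss values, for illustration only.
for alpha in (0.0, 0.5, 1.0):
    print(alpha, joint_loss(loss_ae=2.0, loss_c=1.0, alpha_ae=alpha))
```

Setting the weight to 0 recovers a purely supervised classifier, while setting it to 1 trains only the autoencoder.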

Experiments
In this section, we show the experimental evaluation of our system on the EmotionLines dataset.

Experimental Setup
CNN encoder with MLP Classifier: We use 300-dimensional pre-trained GloVe word embeddings (Pennington et al., 2014) as input to the model. The encoder contains two convolutional layers. The size of the latent representation is set to 600. The MLP classifier contains one fully-connected layer followed by a softmax layer.
Joint Training: We set $\alpha_{ae} = 0.5$, as this gives equal importance to both objectives and reports the best results.
Linguistic Features: We concatenate the full set of 68 linguistic features with the latent representation for emotion detection.

Results
Table 2 shows the results for models trained on individual training sets using our weighted loss function. The performance is evaluated using both the weighted accuracy (WA) and the unweighted accuracy (UWA), as defined by the challenge authors (Chen et al., 2018):
$$WA = \sum_{c \in C} p_c a_c, \qquad UWA = \frac{1}{|C|} \sum_{c \in C} a_c,$$
where $a_c$ denotes the accuracy of emotion class $c$, $p_c$ denotes the percentage of utterances in emotion class $c$, and $C$ is the set of emotion classes.
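The two metrics can be computed as follows, assuming WA is the class-share-weighted sum of per-class accuracies and UWA is their plain mean. The accuracies and class shares below are hypothetical numbers, not results from our experiments.

```python
def weighted_accuracy(acc, share):
    """WA = sum_c p_c * a_c, with p_c the fraction of utterances in class c."""
    return sum(share[c] * acc[c] for c in acc)

def unweighted_accuracy(acc):
    """UWA = (1 / |C|) * sum_c a_c, the plain mean of per-class accuracies."""
    return sum(acc.values()) / len(acc)

# Hypothetical per-class accuracies and class shares (shares sum to 1).
acc = {"neutral": 0.80, "joy": 0.60, "sadness": 0.40, "anger": 0.40}
share = {"neutral": 0.70, "joy": 0.15, "sadness": 0.05, "anger": 0.10}

wa = weighted_accuracy(acc, share)
uwa = unweighted_accuracy(acc)
print(round(wa, 4), round(uwa, 4))
```

Because WA is dominated by the majority class, a model can trade some Neutral accuracy for minority-class gains and still improve UWA, which is the trade-off examined in this section.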
Adding a reconstruction loss to the classification loss improves performance. We attribute this to the improved generalizability provided by a semi-supervised loss. Concatenating linguistic features improves minority class accuracies for both Friends TV dialogues and EmotionPush chats. The improvements due to the joint loss and linguistic features are more significant for the EmotionPush chat log dataset. Accuracies of the majority class (Neutral) take a considerable hit with the addition of the joint loss (J) and linguistic features (L) for both datasets, whereas minority emotions like Sadness and Anger consistently benefit from the addition of linguistic features.
Table 3 contains results for models trained on both Friends and EmotionPush training data. The increase in training data, even though from a different domain, improves performance for the Joy and Anger emotions. Accuracy on Sadness dips significantly for EmotionPush. Overall WA and UWA also increase slightly for the Friends dataset.

EmotionX Submission and Analysis
We implement an ensemble of the four model variants trained on the Friends + EmotionPush data as our final submission for the EmotionX challenge. We arrive at the final class predictions using Algorithm 1. For each test sample, we find the models for which the maximum output probability associated with a class is greater than a threshold of 0.75 (High Confidence). Predictions from this subset are considered the candidate high-confidence classes. The most common class in this subset is taken as the final prediction for the EmotionX submission. If the subset is empty, a similar approach is followed with a reduced threshold of 0.50 (Moderate Confidence). Predictions for samples which do not satisfy either of the above thresholds are termed Low Confidence Predictions.
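The thresholded voting procedure described above can be sketched as follows. The function name and data layout are assumptions, and the fallback for low-confidence samples (a plain majority vote over all models) is our assumption for this sketch, since the text does not specify how those samples are resolved.

```python
from collections import Counter

def ensemble_predict(model_probs, thresholds=(0.75, 0.50)):
    """Confidence-thresholded majority vote for one test sample.

    model_probs -- one dict per model, mapping class label to probability
    Returns (predicted_label, confidence_band).
    """
    # Each model votes for its most probable class.
    votes = [max(p.items(), key=lambda kv: kv[1]) for p in model_probs]
    for band, threshold in zip(("high", "moderate"), thresholds):
        confident = [label for label, prob in votes if prob > threshold]
        if confident:
            return Counter(confident).most_common(1)[0][0], band
    # Assumption: with no confident model, every model votes regardless of confidence.
    return Counter(label for label, _ in votes).most_common(1)[0][0], "low"

# Hypothetical outputs of four model variants for one utterance.
sample = [
    {"neutral": 0.80, "joy": 0.10, "sadness": 0.05, "anger": 0.05},
    {"neutral": 0.30, "joy": 0.60, "sadness": 0.05, "anger": 0.05},
    {"neutral": 0.78, "joy": 0.12, "sadness": 0.05, "anger": 0.05},
    {"neutral": 0.40, "joy": 0.45, "sadness": 0.10, "anger": 0.05},
]
print(ensemble_predict(sample))
```

Only the two models above the 0.75 threshold take part in the vote here, so less certain models cannot outvote confident ones.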
The results on the test set for both datasets are shown in Table 4. Comparison with the best results in each class shows that, for the Friends dataset, our model ranks first for all emotions except Neutral, whereas for the EmotionPush dataset we perform well on Joy and Anger. Our model had the best unweighted accuracy (UWA) for both datasets in the EmotionX challenge.
Table 5: An example dialogue from the Friends dataset with corresponding predictions and labels.

Error Analysis
Our model does not explicitly import contextual information from other utterances in the conversation. Therefore, quite expectedly, we found that most of the utterances misclassified by our model occur in dialogues where the current utterance does not exhibit the emotion it is tagged with.
Another set of errors occurs where the whole conversation is not able to explain the respective emotions of each utterance. Table 5 shows an example conversation where it might be difficult for even a human to classify the utterances without the associated multi-modal cues.

Conclusion and Future Work
We propose a CNN-DCNN autoencoder based approach for emotion detection on the EmotionLines dataset. We show that the addition of a semi-supervised loss improves performance. We propose multiple linguistic features which are concatenated to the latent encoded representation for classification. The results show that our model detects emotions successfully. The network, using a weighted classification loss function, tries to handle the class imbalance in the dataset.
In the future, we plan to include results of modeling emotion on the whole dialogue using an LSTM layer over our network. We will experiment with concatenating subsets of linguistic features to better estimate the contribution of each feature group. We also plan to use data-augmentation techniques such as back-translation and word substitution using WordNet and word embeddings in order to handle class imbalance in the dataset.