ELiRF-UPV at SemEval-2019 Task 3: Snapshot Ensemble of Hierarchical Convolutional Neural Networks for Contextual Emotion Detection

This paper describes the approach developed by the ELiRF-UPV team at SemEval 2019 Task 3: Contextual Emotion Detection in Text. We have developed a Snapshot Ensemble of 1D Hierarchical Convolutional Neural Networks to extract features from 3-turn conversations in order to perform contextual emotion detection in text. This Snapshot Ensemble is obtained by averaging the models selected by a Genetic Algorithm that optimizes the evaluation measure. The proposed ensemble obtains better results than a single model and it obtains competitive and promising results on Contextual Emotion Detection in Text.


Introduction
The emotion detection problem arises in conversational interactions among two or more agents, when one agent is interested in knowing the emotional state of another agent involved in the conversation. Detecting emotions is difficult when the content is expressed only as text, because facial expressions, hand gestures, voice modulations, etc. are missing. Moreover, the task becomes more complex if emotions must be detected on a short piece of text without any context, because the context can act as an emotion modifier of a given turn in the conversation.
Although researchers have mainly focused on emotion detection in text without context (Klinger et al., 2018), typically extracted from social media, a few recent works approach emotion detection in conversations by using context information (Hazarika et al., 2018b) (Majumder et al., 2018) (Hazarika et al., 2018a). These contextual systems work on long conversations involving different users, and they use multimodal data (specifically text, audio and video) to address the emotion detection problem in large multi-party conversations.
In this work, we present an approach to SemEval 2019 Task 3: Contextual Emotion Detection in Text (Chatterjee et al., 2019). This task is a simplification of the text emotion detection problem in conversations, where each conversation has only three utterances. Only two users are involved in each conversation: the first and third turns correspond to the first user and the second turn corresponds to the second user. The goal of this task is to predict the emotion of the third turn. We propose a Snapshot Ensemble (SE) of 1D Hierarchical Convolutional Neural Networks (HCNN) trained to extract useful information from 3-turn conversations. Our system was designed following some ideas of (Morris and Keltner, 2000) and (Majumder et al., 2018). Concretely, we consider the inter-turn and self-turn dependencies (Morris and Keltner, 2000) along with the context given by the preceding utterances (Majumder et al., 2018) to determine the emotion of a given turn.

Preprocessing
For the tokenization process, our system used TweetTokenizer from NLTK (Loper and Bird, 2002). In addition, we performed some other actions. All the text was transformed to lowercase. Multiple spaces were converted to a single space. URLs were replaced by the tag "url". Repeated punctuation marks were collapsed into a single one (e.g., "???" → "?"). In order to extract semantic representations of the Unicode emojis, they are replaced by their description using the Common Locale Data Repository (CLDR) Short Name (e.g., an emoji is replaced by "grinning face with star eyes"). Moreover, non-relevant and common words are removed from these descriptions ("grinning face with star eyes" → "grinning star eyes").
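The normalization steps above can be sketched as follows. This is only an illustration under stated assumptions, not the system's actual code: tokenization with TweetTokenizer is omitted, Unicode character names from the standard library stand in for the CLDR Short Names, the emoji test is a rough heuristic, and the list of common words is hypothetical.

```python
import re
import unicodedata

# Hypothetical list of common words dropped from emoji descriptions.
COMMON_WORDS = {"face", "with"}


def is_emoji(ch):
    # Rough heuristic (assumption): treat "Symbol, other" characters as emojis.
    return unicodedata.category(ch) == "So"


def describe_emoji(ch):
    # Assumption: the Unicode character name stands in for the CLDR Short Name.
    name = unicodedata.name(ch, "").lower()
    return " ".join(w for w in name.split() if w not in COMMON_WORDS)


def preprocess(text):
    text = text.lower()
    # Replace URLs by the tag "url" (done before collapsing punctuation).
    text = re.sub(r"https?://\S+|www\.\S+", "url", text)
    # Collapse repeated punctuation marks into a single one.
    text = re.sub(r"([!?.,])\1+", r"\1", text)
    # Replace each emoji by its textual description.
    text = "".join(f" {describe_emoji(c)} " if is_emoji(c) else c for c in text)
    # Convert multiple spaces to a single space.
    return re.sub(r"\s+", " ", text).strip()
```

The output of `preprocess` would then be passed to the tokenizer. URLs are replaced first so that dots inside an address are not touched by the punctuation rule.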

Word Embeddings
It is well known that word embeddings (WE) learned from the same domain as a downstream task usually lead to better results than general-domain WE. Since we did not have enough sentences from the task to learn word embeddings from them, we used embeddings learned from Twitter posts, because we considered that the characteristics of tweets are similar to the language of the task. Both have a noisy nature and share common features of internet language (slang, letter homophones, onomatopoeic spelling, emojis, lexical errors, etc.). Therefore, we used 400-dimensional WE obtained from a skip-gram model trained with 400 million tweets gathered from 1/3/2013 to 28/2/2014 (Godin et al., 2015).

Hierarchical Convolutional Neural Networks
We considered several characteristics of the task in order to design our system. First, the utterances are short and there are many short-term dependencies among their words. Therefore, we propose to use 1D Convolutional Neural Networks (CNN) (Kim, 2014) to extract a rich semantic representation of each utterance. Second, the conversations are composed of only 3 utterances, so models with a high capacity for learning long contexts are not required. Thus, we propose to use another CNN on top of the first CNN, which extracts sentence representations, in order to obtain a representation of the conversation. We called this approach Hierarchical Convolutional Neural Networks (HCNN), following the work of (Yang et al., 2016).
As input to the model, each utterance $j$ (composed of at most $N$ words) in a conversation $i$ is arranged in a matrix $M_j \in \mathbb{R}^{N \times d}$, where each row corresponds to a word of utterance $j$, represented by its $d$-dimensional WE. As each conversation is a sequence of three utterances, a conversation is arranged in a 3-dimensional matrix in which channel $j$ is the representation of utterance $j$, i.e., for conversation $i$, $M_i \in \mathbb{R}^{3 \times N \times d}$. 1D Dropout (Srivastava et al., 2014) was applied to all the matrices of $M_i$ to augment the dataset, deleting words of each utterance with probability $p = 0.3$.
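The input arrangement and the word-level dropout described above can be illustrated with a small NumPy sketch. The function names, the zero-padding of short utterances and the zero vector for unknown words are assumptions made for the example. The rescaling of kept values that some dropout implementations apply is also left out.

```python
import numpy as np


def build_conversation_tensor(utterances, embeddings, max_len, dim):
    """Arrange a conversation as an array of shape (turns, max_len, dim).

    Assumptions: short utterances are zero-padded, long ones are truncated,
    and unknown words are mapped to a zero vector.
    """
    tensor = np.zeros((len(utterances), max_len, dim), dtype=np.float32)
    for j, tokens in enumerate(utterances):
        for k, token in enumerate(tokens[:max_len]):
            tensor[j, k] = embeddings.get(token, np.zeros(dim))
    return tensor


def word_dropout(tensor, p, rng):
    """Delete whole word vectors (rows) with probability p."""
    keep = rng.random(tensor.shape[:2]) >= p  # one decision per word
    return tensor * keep[..., None]


# Toy usage with a hypothetical 4-dimensional vocabulary.
toy_embeddings = {"hi": np.ones(4), "there": np.full(4, 2.0)}
toy_conversation = [["hi", "there"], ["hi"], ["there", "hi", "hi"]]
M_i = build_conversation_tensor(toy_conversation, toy_embeddings, max_len=3, dim=4)
M_i_augmented = word_dropout(M_i, p=0.3, rng=np.random.default_rng(0))
```

Dropping entire rows rather than individual coordinates is what makes this act as word deletion, which matches the data augmentation reading given in the text.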
Given the representation $M_i$ of conversation $i$, a CNN with kernels of different sizes is applied to each utterance independently, in order to obtain a composition of word embeddings that can extract semantic/emotional properties from each utterance. At this first level, we use $f_1 = 256$ kernels of sizes {2, 4, 6}, whose weights are shared among the three channels. From that, three new matrices are obtained for each utterance. These matrices capture relevant features for each kernel size and utterance. These features are pooled into a vector by using 1D Global Max Pooling (GMP).
The three vectors resulting from the previous level were concatenated as rows to obtain a matrix representation of conversation $i$ composed of the CNN maps of its sentences, $W_i \in \mathbb{R}^{3 \times f_1}$. We considered that conversation features could be relevant for the task. At this level, in order to extract these relevant features and following the ideas in (Morris and Keltner, 2000) (Majumder et al., 2018), the system is intended to take into account the context and potentially the emotions given by preceding utterances to determine the emotion expressed by the last utterance. To do this, a CNN with $f_2 = 256$ kernels of sizes {1, 2, 3} was used. The size of the filters is crucial to understand which features the system is capable of learning.
Concretely, kernels of size 3 capture semantic/emotional features over the whole context (the full conversation); kernels of size 2 capture inter-turn features and semantic/emotional features of preceding and later utterances given a context of two utterances; and kernels of size 1 capture self-turn features and semantic/emotional features of each utterance independently.
On the output maps of this second CNN, GMP is used in order to extract the most relevant features from each dimension, and the resulting vectors are concatenated. Later, a fully connected layer $L_1$ with 512 neurons is used to fuse the concatenated vectors. Finally, to obtain a probability distribution over the $C$ classes ({happy, sad, angry, others}), we use a softmax fully connected layer $L_2$. Figure 1 shows the proposed model architecture.
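To make the shapes of the two levels concrete, the following NumPy sketch runs a forward pass with random weights and small toy dimensions. Several details are assumptions, since the text does not specify them: the ReLU activations, the absence of convolution biases, and merging the three pooled vectors of each utterance by an element-wise maximum so that each row has $f_1$ values. It should be read as an illustration of the data flow, not as the trained model.

```python
import numpy as np


def conv1d_gmp(x, kernels):
    """Valid 1D convolution, ReLU (assumed), then global max pooling.

    x: (length, in_dim); kernels: (size, in_dim, filters) -> (filters,)
    """
    size = kernels.shape[0]
    windows = np.stack([x[t:t + size] for t in range(x.shape[0] - size + 1)])
    feature_map = np.maximum(np.einsum("psd,sdf->pf", windows, kernels), 0.0)
    return feature_map.max(axis=0)


def init_params(rng, d, f1, f2, hidden, n_classes, scale=0.1):
    """Random weights; kernel sizes follow the text, the rest is illustrative."""
    return {
        "level1": [rng.normal(0, scale, (s, d, f1)) for s in (2, 4, 6)],
        "level2": [rng.normal(0, scale, (s, f1, f2)) for s in (1, 2, 3)],
        "L1_w": rng.normal(0, scale, (3 * f2, hidden)),
        "L1_b": np.zeros(hidden),
        "L2_w": rng.normal(0, scale, (hidden, n_classes)),
        "L2_b": np.zeros(n_classes),
    }


def hcnn_forward(conversation, params):
    """conversation: (3, N, d) -> probabilities over the classes."""
    rows = []
    for utterance in conversation:  # level-1 weights are shared across turns
        pooled = [conv1d_gmp(utterance, k) for k in params["level1"]]
        rows.append(np.max(pooled, axis=0))  # assumption: max over kernel sizes
    W = np.stack(rows)  # (3, f1), one row per utterance
    # Level 2: kernels of sizes 1, 2 and 3 slide over the three turns.
    pooled2 = [conv1d_gmp(W, k) for k in params["level2"]]
    h = np.concatenate(pooled2)  # (3 * f2,)
    h = np.maximum(h @ params["L1_w"] + params["L1_b"], 0.0)  # layer L1
    logits = h @ params["L2_w"] + params["L2_b"]  # layer L2
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()  # softmax


rng = np.random.default_rng(0)
params = init_params(rng, d=8, f1=16, f2=16, hidden=32, n_classes=4)
probs = hcnn_forward(rng.normal(size=(3, 10, 8)), params)
```

With three rows in `W`, a kernel of size 3 sees the whole conversation at once, a kernel of size 2 sees the two pairs of adjacent turns, and a kernel of size 1 sees each turn on its own, which mirrors the interpretation of the kernel sizes given above.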

Snapshot Ensemble
Generally, ensemble models outperform single models in similar tasks (Duppada et al., 2018) (Rozental et al., 2018). Therefore, we decided to use ensemble methods instead of trying different architectures. We used the ideas of Snapshot Ensemble (SE) (Huang et al., 2017) to combine HCNN trained until reaching good and diverse local minima by using SGD and a cosine learning rate with $T = 24$ training iterations, $M = 6$ learning cycles, and initial learning rate $\alpha = 0.4$.
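As an illustration, the cyclic cosine schedule of Huang et al. (2017) with these values can be written in a few lines. The sketch assumes that the rate is updated once per training iteration and that iterations are numbered from 1; the text does not state the update granularity.

```python
import math


def cosine_lr(t, total_iters=24, cycles=6, alpha0=0.4):
    """Learning rate at 1-indexed iteration t under cyclic cosine annealing."""
    cycle_len = math.ceil(total_iters / cycles)
    position = (t - 1) % cycle_len  # 0 at the start of every cycle
    return alpha0 / 2 * (math.cos(math.pi * position / cycle_len) + 1)


# One value per training iteration: 6 cycles of 4 iterations each.
schedule = [cosine_lr(t) for t in range(1, 25)]
```

Under these assumptions the rate restarts at 0.4 at iterations 1, 5, 9, and so on, and decays within each cycle of four iterations.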
From this training method, we took 24 snapshots (one for each training iteration). From the set of snapshots $S = \{ s_i \mid 1 \le i \le 24 \wedge s_i : \mathbb{R}^{3 \times N \times d} \to \mathbb{R}^{C} \}$, we generate 4 different systems:

1. Best snapshot of all iterations
2. Average of all snapshots
3. Average of best snapshot at each learning cycle
4. Average of genetically selected snapshots
Here $x$ and $y$ are the input and the target, respectively, and $g(S)_i$ is the decision of a genetic algorithm to include the snapshot $s_i$ in the ensemble. We used this method to discretely select ($g(S)_i \in \{0, 1\}$) which snapshots are well suited for the final averaging ensemble, which tries to optimize $\mu F_1$. The genetic algorithm (Mitchell, 1998) starts with a population of 400 individuals, which are crossed using two-point crossover, mutated with bit-flip mutation and selected by tournament selection over 100 generations. Moreover, this algorithm addresses a multi-objective problem: it must reach combinations of snapshots whose averaged predictions yield high values of $\mu F_1$ while minimizing the number of models in the ensemble (the final genetic ensemble is composed of 6 systems, i.e., as many systems as learning cycles). These decisions were taken to reduce the risk of overfitting while learning the ensemble, i.e., we prioritize simpler ensembles composed of discretely selected snapshots.
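The genetic selection of snapshots can be sketched in plain Python as below. This is a simplified illustration rather than the implementation used in the system. The two objectives are collapsed into one score with a hypothetical `size_penalty`, and the tournament size, mutation rate and the `score_fn` callback are assumptions. In practice `score_fn` would compute $\mu F_1$ of the averaged predictions on held-out data.

```python
import random


def average_ensemble(snapshot_probs, mask):
    """Average the class probabilities of the snapshots selected by a 0/1 mask."""
    chosen = [p for p, bit in zip(snapshot_probs, mask) if bit]
    n_samples, n_classes = len(chosen[0]), len(chosen[0][0])
    return [[sum(p[i][c] for p in chosen) / len(chosen) for c in range(n_classes)]
            for i in range(n_samples)]


def two_point_crossover(a, b, rng):
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]


def flip_bit(individual, rate, rng):
    return [1 - g if rng.random() < rate else g for g in individual]


def tournament(population, scores, k, rng):
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=lambda idx: scores[idx])]


def select_snapshots(n_snapshots, score_fn, pop_size=400, generations=100,
                     size_penalty=0.001, mutation_rate=0.05, seed=0):
    """Return a 0/1 mask over snapshots; the best mask found so far is kept."""
    rng = random.Random(seed)

    def fitness(individual):
        if not any(individual):
            return float("-inf")  # an empty ensemble is not allowed
        # Assumption: scalarized objective, high score with few models.
        return score_fn(individual) - size_penalty * sum(individual)

    population = [[rng.randint(0, 1) for _ in range(n_snapshots)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]
        offspring = []
        while len(offspring) < pop_size:
            a = tournament(population, scores, 3, rng)
            b = tournament(population, scores, 3, rng)
            c1, c2 = two_point_crossover(a, b, rng)
            offspring += [flip_bit(c1, mutation_rate, rng),
                          flip_bit(c2, mutation_rate, rng)]
        population = offspring[:pop_size]
        best = max(population + [best], key=fitness)
    return best


# Toy usage: a hypothetical score that rewards snapshots 3, 7 and 11.
toy_score = lambda mask: sum(mask[i] for i in (3, 7, 11)) / 3
toy_mask = select_snapshots(12, toy_score, pop_size=40, generations=20, seed=1)
```

The penalty term is one simple way to express the preference for small ensembles. A Pareto-based multi-objective selection would be another option, and the text does not say which one was used.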

Analysis of Results
In order to evaluate different configurations of our system, we used the development set given by the task organizers. On this development set, an ablation analysis on a single HCNN was carried out in order to observe whether the input Dropout and the incorporation of the $L_1$ layer yield better results (the capacity of HCNN must be greater when including both techniques). The results of this ablation analysis are shown in Table 1. The Vanilla system is a single HCNN without input Dropout or the $L_1$ layer. It can be observed that the systems with Dropout and $L_1$ outperformed the Vanilla version of HCNN in terms of µP, µR and $\mu F_1$. In terms of µP, the systems which incorporate $L_1$ achieved better results. However, although Dropout + $L_1$ obtained the best improvement in terms of µP, the highest µR was obtained using only Dropout. This could indicate that data augmentation is useful to increase µR, but that more network capacity is required to handle this augmentation in order to also increase µP.
These results were obtained by using a single HCNN with Adam as the update rule (Kingma and Ba, 2014) with the default learning rate. However, the SE training mode with vanilla SGD and a cosine learning rate, along with the proposed ensemble generation, allows the Dropout + $L_1$ system to reach better results. In this case, the best single model (Best snapshot) obtained in the SE training mode provided higher µR than Dropout + $L_1$ at the expense of a reduction in µP. This improvement of 3 points in µR also yields an increase in $\mu F_1$.
Among the ensembles, only Genetic Average improves on the Best snapshot and Dropout + $L_1$ systems in all the metrics. This is due to a large increase in µR. This suggests that it is possible to improve the $\mu F_1$ results by balancing µP and µR.
The other ensembles obtain lower results in terms of µR and $\mu F_1$ than Best snapshot, which is a single model. Moreover, all SE systems (including Best snapshot) except Genetic Average are less accurate (lower µP) than Dropout + $L_1$. However, all of them considerably improved µR.
Because the SE HCNN models generally outperformed the best single model, Dropout + $L_1$, in terms of $\mu F_1$ on the development set, we submitted all these systems to be evaluated on the test set. The results are shown in Table 3. It can be seen that the best system is Genetic Average, the same behavior observed on the development set. Although Best snapshot is more accurate than the ensembles (higher µP), two of the three ensembles yield better $\mu F_1$ results. Moreover, a large degradation in the results is observed: all systems go from 77 $\mu F_1$ on the development set to 74 $\mu F_1$ on the test set.
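For reference, the micro-averaged measures used throughout this analysis can be computed as in the sketch below. It assumes that the averages are taken over the three emotion classes only, leaving out "others", as in the task's evaluation setup. The function name and the label strings are illustrative.

```python
def micro_prf(gold, pred, emotion_classes=("happy", "sad", "angry")):
    """Micro-averaged precision, recall and F1 over the emotion classes."""
    pairs = list(zip(gold, pred))
    # True positives: correct predictions of an emotion class.
    tp = sum(1 for g, p in pairs if g == p and p in emotion_classes)
    # False positives: an emotion class predicted for a different gold label.
    fp = sum(1 for g, p in pairs if p in emotion_classes and g != p)
    # False negatives: a gold emotion class that was not predicted.
    fn = sum(1 for g, p in pairs if g in emotion_classes and g != p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy usage with hypothetical labels.
gold = ["happy", "sad", "others", "angry", "others"]
pred = ["happy", "others", "sad", "angry", "others"]
precision, recall, f1 = micro_prf(gold, pred)
```

Because true positives, false positives and false negatives are pooled across classes before dividing, a gain in µR with only a small loss in µP translates directly into a higher $\mu F_1$, which is the trade-off discussed above.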

Conclusion and Future Work
In this paper, we have presented Snapshot Ensembles of Hierarchical Convolutional Neural Networks to address SemEval 2019 Task 3: Contextual Emotion Detection in Text. Our system is based on the use of a Genetic Algorithm to ensemble different snapshots of the same model. This ensemble outperformed single models and also classical snapshot ensembles, obtaining competitive results in the addressed task.
Since in the proposed system the semantic and emotional information is only provided by the representation of the words and the utterances, as future work we plan to study different word and sentence embeddings. It would also be interesting to incorporate other emotional or sentiment features such as Sentiment Unit (Radford et al., 2017), DeepMoji (Felbo et al., 2017), Sentiment Specific WE (Tang et al., 2014), or polarity lexicons. Moreover, we are also interested in working with more powerful word embeddings such as BERT (Devlin et al., 2018) in order to incorporate a richer semantic word representation.