Mutux at SemEval-2018 Task 1: Exploring Impacts of Context Information On Emotion Detection

This paper describes MuTuX, our system designed for Task 1-5a of SemEval-2018, emotion classification of tweets. The system aims at exploring the potential of context information of terms for emotion analysis. A recurrent neural network is adopted to capture the context information of terms in tweets; only term features and their sequential relations are used in our system. Our submission ranks 16th out of 35 systems on the task of emotion detection in English-language tweets.


Introduction
Emotion analysis on social media is attracting more and more research interest (Strapparava and Mihalcea, 2008; Balahur et al., 2011; Agrawal and An, 2012; Wang et al., 2012; Hasan et al., 2014a; Canales and Martínez-Barco, 2014) from industry and academia. Commercial applications such as product recommendation, online retailing, and marketing are turning their interest from traditional sentiment analysis to emotion analysis as well. Emotion analysis is generally taken as a multi-label classification problem: given a piece of text, such as a tweet, it assigns several labels such as depressed, sad, angry, and so on to it (Mohammad et al., 2018) based on the meaning contained in the text.
Techniques related to emotion detection can be divided into lexicon-based approaches (Valitutti, 2004; Strapparava and Mihalcea, 2008; Balahur et al., 2011) and machine learning approaches (Hasan et al., 2014b; Wang et al., 2012; Roberts et al., 2012; Suttles and Ide, 2013). Lexicon-based approaches leverage lexical resources to detect emotions; the resources can be keywords (Hasan et al., 2014a), WordNet-Affect (Valitutti, 2004), and so on. Machine learning approaches generally take emotion detection as a classification problem using SVM, neural networks (Abdul-Mageed and Ungar, 2017; Bravo-Marquez et al., 2016), naive Bayes, decision trees, KNN, and so on, or use unsupervised techniques such as LSA (Deerwester et al., 1990; Wang and Zheng, 2013; Gill et al., 2008), pLSA, or NMF to transform the feature space into a more suitable one before conducting classification. The main challenges of emotion analysis of tweets are the following: 1. Informal language used on social media does not always obey formal grammar, which makes traditional grammatical features less reliable for detecting emotions on social media.
2. New words are frequently created on social media, making it difficult to understand their emotional meaning even for a human being.
To address the challenges above, we use a recurrent neural network to make use of terms, sequential information, and contextual information simultaneously for emotion detection. We believe that contextual information can partly alleviate both the new-term problem and the broken-grammar problem. As input to the recurrent neural network, a pre-trained embedding is used as the initial representation of each term.

External Resource
We used only one external resource in our analysis: the pre-trained word2vec word embedding (Mikolov et al., 2013) provided by Google. It is trained on part of the Google News dataset (about 100 billion words) and contains 300-dimensional vectors for 3 million words and phrases.

System Description
To explore the limit of term features with an RNN for emotion detection, we did not use any features other than term embeddings. The system could be improved by using features such as emojis or emoticons; we will conduct further analysis in the future by addressing the problems of combining different feature spaces. The main steps and features of the system submitted to SemEval-2018 Task 1 are described in this section.

Preprocessing
Since the method heavily depends on the terms that appear in the text, the corpus is carefully pre-processed as described below.
• Normalization Each word in each tweet is converted to lowercase. Non-linguistic content such as URLs, emojis, emoticons, and user names is removed (important features such as emojis and emoticons will be explored in the future).
• Tokenization Each tweet is split into a word sequence. No stemming is applied, since some inflected word forms may convey more apparent emotions than their base forms.
• Stop-word Removal The NLTK toolkit is leveraged to remove stop words from tweets. Other meaningless tokens, such as single characters, digits, and their combinations, are also eliminated.
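The preprocessing steps above can be sketched as follows. The regular expressions and the small stop-word list are illustrative stand-ins (our system uses NLTK's English stop-word list), not the exact rules of the submitted system.

```python
import re

# Illustrative stand-in for NLTK's English stop-word list.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "i", "my"}

def preprocess(tweet):
    # Normalization: lowercase, strip URLs and @user mentions.
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)           # remove user names
    # Tokenization: split into word-like tokens (no stemming).
    tokens = re.findall(r"[a-z0-9']+", text)
    # Stop-word removal, plus single characters and digit-only tokens.
    return [t for t in tokens
            if t not in STOP_WORDS
            and len(t) > 1
            and not t.isdigit()]

print(preprocess("I absolutely LOVE this!!! https://t.co/xyz @friend 100"))
# → ['absolutely', 'love', 'this']
```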

Embedding Usage
Word embeddings are a widely used semantic representation of words for almost any neural-network-based text analysis approach. A vector of real numbers represents a single word's distributional semantics in the embedding space. Since the space is generated by a language model, words that are functionally similar in a language are close to each other in the embedding space; for example, "cat" and "dog" could be close in the space.
In this system, a tweet is represented by concatenating embeddings of the words in it.
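This representation can be sketched as below, with a toy vocabulary of random vectors standing in for the Google News word2vec embeddings (an assumption for illustration only; out-of-vocabulary handling via a zero vector is likewise illustrative):

```python
import numpy as np

EMB_DIM = 300
rng = np.random.default_rng(0)
# Toy vocabulary: random vectors stand in for pre-trained word2vec.
vocab = {w: rng.normal(size=EMB_DIM) for w in ["so", "happy", "today"]}
UNK = np.zeros(EMB_DIM)  # out-of-vocabulary terms map to a zero vector

def tweet_matrix(tokens):
    # One 300-d embedding per token, stacked into an (M, 300) matrix
    # that is fed to the GRU as a sequence.
    return np.stack([vocab.get(t, UNK) for t in tokens])

m = tweet_matrix(["so", "happy", "today"])
print(m.shape)  # (3, 300)
```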

Our Approach
The system submitted is based on a recurrent neural network, specifically a GRU.

The Basic Idea
Lexicons play the key role in lexicon-based approaches and in bag-of-features machine learning approaches for emotion analysis. However, in addition to emotion lexicons, we believe that linguistic characteristics may also contribute a lot to emotion analysis. For example, context around an emotion word, such as negation, could invert the emotion of the utterance if it is neglected. The order of the terms in a sentence also plays an important role in understanding its meaning, and hence in uncovering the true emotions. Newly created terms missing from the vocabulary, or grammatically incorrect utterances, can also lead to poor performance of traditional emotion analysis approaches. By modeling long-term dependencies of terms inside a tweet and fusing the semantics of the terms and their contexts with a GRU, a type of recurrent neural network, we hope that the above problems can be alleviated in the new representation space.

Problem Statement
We take emotion analysis as a multi-label classification problem in our system. A tweet x_i is represented as a sequence of terms x_i = (w_1, ..., w_M), where M is the length of the tweet. Given a tweet x_i, the task aims at predicting its labels as y_i, where y_i is a d-dimensional Boolean vector, y_i ∈ B^d, with d = 11 in this case. Each dimension y_ij indicates one emotion label of the 11-label space: anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, and trust, respectively. For example, y_i0 = 1 means the emotion anger is detected in the tweet x_i, and y_i8 = 0 means the emotion sadness does not appear in the tweet.
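The label encoding can be illustrated as follows, with the emotion order matching the 11-label space above:

```python
EMOTIONS = ["anger", "anticipation", "disgust", "fear", "joy", "love",
            "optimism", "pessimism", "sadness", "surprise", "trust"]

def encode_labels(active):
    # Boolean vector y_i in B^11: 1 if the emotion is present, else 0.
    return [1 if e in active else 0 for e in EMOTIONS]

y = encode_labels({"anger", "disgust"})
print(y)  # [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```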

The model architecture
The architecture of the model is shown in Figure 1. The model is composed of three layers. The input of the network is the pre-trained embedding of each term occurring in a tweet. The sequence of term embeddings is then turned into a tweet-level representation by a classical GRU; the output of the GRU is fed directly into a linear perceptron layer, which maps the tweet representation into a class representation. The outputs of the linear perceptrons are then passed through a sigmoid function to obtain the final predictions.

The GRU Layer
The input of the GRU layer is the sequence of term embeddings of the tweet. We denote by H = h_1, ..., h_n the input sequence of length n, where h_i ∈ R^d is the term representation for the i-th token w_i. The new representation of the whole tweet r_i is then obtained from h_i via a GRU network:

r_i = (1 - z_i) ⊙ r_{i-1} + z_i ⊙ r̃_i,

where

r̃_i = tanh(W_r h_i + U_r (g_i ⊙ r_{i-1})),
g_i = σ(W_g h_i + U_g r_{i-1}),
z_i = σ(W_z h_i + U_z r_{i-1}).

Here, g_i and z_i are the reset and update gates, respectively, which control the information flow from the previous timestamp; ⊙ denotes element-wise multiplication and σ is the sigmoid function. W_g, U_g, W_z, U_z, W_r, and U_r are weight matrices to be learned for transforming r_{i-1} and h_i into the gate units. By applying the GRU to h_i, the tweet representation r_i ∈ R^K encodes the context information and the historical information simultaneously.
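A single GRU step with the reset and update gates described above can be sketched in NumPy as follows. The random weight matrices are stand-ins for learned parameters, and the dimensions match our configuration (d = 300, K = 200); this is an illustration of the recurrence, not the trained system.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d, K = 300, 200            # embedding dim, GRU hidden dim (as in our setup)
rng = np.random.default_rng(1)
# Weight matrices W_*, U_* are learned in practice; random here.
Wg, Wz, Wr = (rng.normal(scale=0.01, size=(K, d)) for _ in range(3))
Ug, Uz, Ur = (rng.normal(scale=0.01, size=(K, K)) for _ in range(3))

def gru_step(h_i, r_prev):
    g = sigmoid(Wg @ h_i + Ug @ r_prev)             # reset gate
    z = sigmoid(Wz @ h_i + Uz @ r_prev)             # update gate
    r_cand = np.tanh(Wr @ h_i + Ur @ (g * r_prev))  # candidate state
    return (1 - z) * r_prev + z * r_cand            # new tweet representation

r = np.zeros(K)
for h in rng.normal(size=(5, d)):   # 5 toy term embeddings
    r = gru_step(h, r)
print(r.shape)  # (200,)
```

The final state r summarizes the whole sequence, which is what the perceptron layer consumes.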

The Perceptron Layer
With the output of the GRU, a vector r_j ∈ R^K representing the overall information of a tweet, we use a perceptron layer together with a sigmoid activation function to map the tweet from the feature space to the label space:

ŷ_j = σ(W r_j + b),

where W and b are the weight matrix and bias of the perceptron layer. The predicted label vector ŷ_j of each tweet t_j is then compared with the true label vector y_j on the training data to guide the training process with an appropriate loss function.
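The mapping from the tweet representation to the 11 label decisions can be sketched as below; the random weights stand in for learned parameters, and the 0.5 decision threshold is an assumption for illustration, not a detail stated for the submitted system.

```python
import numpy as np

K, NUM_LABELS = 200, 11
rng = np.random.default_rng(2)
W = rng.normal(scale=0.01, size=(NUM_LABELS, K))  # learned in practice
b = np.zeros(NUM_LABELS)

def predict(r_j, threshold=0.5):
    scores = 1.0 / (1.0 + np.exp(-(W @ r_j + b)))  # sigmoid per label
    return (scores >= threshold).astype(int)       # multi-label decision

y_hat = predict(rng.normal(size=K))
print(y_hat.shape)  # (11,)
```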

The Loss Function
Binary cross-entropy loss can be used for multi-label classification problems; it is computed as:

L = - Σ_i Σ_{j=1}^{d} [ y_ij log ŷ_ij + (1 - y_ij) log(1 - ŷ_ij) ],

where y_i is the true label vector of the tweet x_i, and ŷ_i is its predicted label vector.
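A direct NumPy transcription of binary cross-entropy over a batch (summing over labels and averaging over tweets is one common convention, assumed here; the clipping constant is purely for numerical safety):

```python
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-12):
    # y_true, y_pred: arrays of shape (num_tweets, num_labels)
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    per_label = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return per_label.sum(axis=1).mean()     # sum over labels, mean over tweets

y = np.array([[1, 0, 1]])
p = np.array([[0.9, 0.1, 0.8]])
print(round(bce_loss(y, p), 4))  # → 0.4339
```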

Training
To train the model, we use the training data provided by SemEval-2018 Task 1, which includes 6,839 human-labeled English tweets for Subtask 1. A set of 887 labeled English tweets is also available for development; we use this set for validation. The trained model is then applied to the testing set, which contains 3,260 unlabeled tweets. A vocabulary is generated by extracting terms from the training, validation, and testing sets to ensure coverage. The parameter configuration giving the best system performance on the validation set is as follows:
• Hidden Dimension The initial embedding of each term has 300 dimensions, as we adopt the pre-trained word embedding trained on part of the Google News dataset (https://code.google.com/archive/p/word2vec/). The hidden dimension of the GRU is set to 200, which gives the best validation result.
• Maximum Tweet Length The length of each tweet varies; we regularize it with a maximum limit of 30 meaningful terms after the preprocessing steps. A tweet longer than that is trimmed, and a shorter one is padded with zeros.
• Learning Rate We adopt the Adam optimizer to train the submitted model. The learning rate is set to 0.0001 for the best system performance on the validation set.
• Dropout Rate Dropout is reported to have effects similar to boosting approaches in neural-network-based models. Dropout is applied to the linear perceptron layer with a rate of 0.4 at the optimum.
• Batch Size The batch size also affects the performance of the proposed system. The optimum is obtained with a batch size of 20.
The number of epochs is used to terminate the training process once the optimum is obtained. The terminating condition depends not only on the values of the loss function but also on the transient performance on the validation set. Random factors, such as the initial states of various random variables, also have an impact. In our experiments, the optimum is achieved at the 3rd epoch; it may vary with different initial states of other parameters.
Using the model parameters above, which produced the best performance on the validation set, we predicted the labels of each tweet in the testing set. The evaluation results provided by SemEval-2018 are described in the next section.

Results
Among all 35 systems that participated in the emotion classification subtask of Task 1 (Mohammad et al., 2018), our only submission ranked 16th on the evaluation metric of accuracy, and 19th on both micro-averaged F1 and macro-averaged F1, as shown in Table 1.

Conclusion
We have presented a GRU-based multi-label classifier that leverages context information and historical information for emotion analysis. It outperforms the unigram SVM baseline consistently on all three evaluation metrics, even though only term features and a pre-trained word embedding are used. Key factors such as emojis, emoticons, emotion lexicons, and multi-layer neural structures will be explored in the future for further analysis.