RNN for Affects at SemEval-2018 Task 1: Formulating Affect Identification as a Binary Classification Problem

Written communication lacks the multimodal features, such as posture, gesture, and gaze, that make it easier to model affective states. On social media such as Twitter, the information that can be mined is even more limited because of character limits. These limitations make short social media posts challenging to understand. In this paper, we present an approach that utilizes multiple binary classifiers, each representing a different affective category, to model Twitter posts (i.e., tweets). We train domain-independent recurrent neural network models without any outside information such as affect lexicons. We then use these domain-independent binary ranking models to evaluate the applicability of such deep learning models to the affect identification task. This approach allows different model architectures and parameter settings for each affect category instead of building a single multi-label classifier. The contributions of this paper are twofold: we show that modeling tweets with a small training set is possible with RNNs, and we show that formulating affect identification as a binary classification task is highly effective.

Introduction
Social media platforms allow users to share information, communicate with other users, learn about new products, and get the latest news. The importance of social media data increases as social media usage grows every year (Duggan, 2015). Twitter is one such platform, where users can write short posts as well as share links. Twitter is also used for getting news (Center, 2017).
One aspect of modeling social media posts is focusing on the emotional states of users. There have been many efforts to determine affective states (Schwarz and Clore, 1983) and their effects on human behavior in domains ranging from education (Sidney et al., 2005) to health care (Lisetti et al., 2003). For Twitter, this problem is even more challenging because the information source is limited to the number of characters allowed in a single post and multimodal features (e.g., posture, gesture, and eye gaze) are not available.
In this paper, we formulate the affect identification task as a binary classification problem and investigate the applicability and effectiveness of domain-independent deep learning models and features. Our dataset includes eleven affect categories (i.e., anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, and trust) for each tweet. The presence of one affect category in a tweet does not preclude the presence of another (e.g., joy and optimism can both be present in a tweet). We represent each affect category as one class and build a binary classifier for each class. A recurrent neural network is trained for each affect category, and no domain-dependent features such as affect lexicons are used. Our goal is to evaluate a generic model for different affective states.
Binary models have been successfully applied to several applications including action recognition in videos (Can and Manmatha, 2013), prediction of whether or not a tweet will be retweeted (Hong et al., 2011), and topic classification (Joachims, 1998). In this paper, we describe our approach for affect recognition of English tweets (Task E-c: Detecting Emotions), a subcategory of Task 1 in the SemEval 2018 challenge (Mohammad et al., 2018).

Corpus
In this paper, we use English tweets that have been annotated with affect categories (Mohammad et al., 2018). The dataset contains emojis, hashtags, and the textual content of tweets; however, it does not include user ids. The training, validation, and test splits were made by the task organizers. Figure 1 shows the three most frequently used emojis in each class and their frequencies in the training set.

Breakdown of Emojis to Classes
Because visual cues are important in predicting affective states, we pay attention to one form of visual cue: emojis. Here we present some of our findings for different affect categories.
• Trust: emojis are not frequently used, so this class is not easy to identify through emojis.
• Sadness: the sobbing face emoji is, as expected, the most common one, but interestingly the laughing with joy emoji is the second most common. The weary face emoji is also very common in sadness: 56.16% of all weary face emojis are used in this class.
• Anger and disgust share the same pattern: the most common emoji is the laughing with joy emoji and the second most common is the sobbing face emoji. That a joy emoji is the most commonly used one in these affective classes is notable and may indicate irony. The third most common emoji in these two classes is also the same: the rage emoji.
• An emoji that is intuitively associated with love (heart eyes) actually occurs more often in joy tweets than in love tweets.
• An unexpected finding concerns the fire emoji: the joy and optimism classes account for a large portion of all fire emojis in the training set (46.7% and 36.7%, respectively).
• The affective class that uses the most emojis is joy.

Methodology
Since each tweet in the data is labeled for eleven affect categories (i.e., anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, and trust), we created eleven datasets with the same tweets but with different class information. For example, the first dataset has a value of one (i.e., positive) for tweets that show anger and zero (i.e., negative) for those that do not.
The other datasets are created in the same way for the remaining affect classes. By building one model for each affect category, we formulated the affect identification problem as a binary classification task. At test time, we obtained predictions from every category-specific model and fused the results to obtain a single result for each tweet.
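The dataset construction and fusion steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy tweets, and the fusion rule (a tweet receives every affect whose binary model predicts one) are assumptions, since the text does not specify how the per-category predictions are fused.

```python
AFFECTS = ["anger", "anticipation", "disgust", "fear", "joy", "love",
           "optimism", "pessimism", "sadness", "surprise", "trust"]


def make_binary_datasets(tweets, label_sets):
    """Build one binary dataset per affect; every dataset holds the same tweets."""
    return {
        affect: [(tweet, int(affect in labels))
                 for tweet, labels in zip(tweets, label_sets)]
        for affect in AFFECTS
    }


def fuse_predictions(per_affect_predictions):
    """Merge one 0/1 prediction list per affect into one label set per tweet.

    Assumed fusion rule: keep every affect whose model predicted 1.
    """
    n_tweets = len(next(iter(per_affect_predictions.values())))
    return [
        {affect for affect in AFFECTS if per_affect_predictions[affect][i] == 1}
        for i in range(n_tweets)
    ]


# Toy data invented for illustration; not taken from the corpus.
tweets = ["so happy today", "this is awful"]
label_sets = [{"joy", "optimism"}, {"anger", "disgust"}]
datasets = make_binary_datasets(tweets, label_sets)

# Pretend each binary model reproduces its gold labels, then fuse them.
perfect_predictions = {a: [label for _, label in datasets[a]] for a in AFFECTS}
fused = fuse_predictions(perfect_predictions)
```

With perfect per-category predictions, the fused output recovers the original label sets, which shows that the eleven binary views lose no label information.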

Training Binary Classification Models
The advantage of using a binary classification model for each affect category is that each model can be trained independently, enabling different model architectures and parameters. For example, one category may benefit from a deeper model while another obtains its best results with a shallower model. The models therefore do not have to be the same for every affect class.

Model Architecture
We built a separate RNN model for each affect category, resulting in eleven classifiers. For the classifiers, we used three GRU layers, two of which are bidirectional. To build a more generalized model, a dropout of 0.2 is used in each layer. Each bidirectional layer contains 100 neurons and the final encoding layer has 50 neurons.

Training Auto-Encoder
Because the dataset is not very large, we wanted the classifiers to learn as much information as possible without overfitting. Therefore, we built an auto-encoder from the tweets' content (i.e., unlabeled tweets with no affect categories). The goal of the auto-encoder is to obtain weights that can be used in the classifiers. As shown in Figure 2, we used the trained weights from the auto-encoder as the starting point for the binary classifiers. To convert the text-generating auto-encoder into a classifier, we added a softmax layer.

Features
For modeling affect categories in tweets, we use only the words and emojis. No domain-dependent features, or features that are aware of the task at hand (e.g., affect lexicons), are used, as our goal is to determine how well a generic RNN model can perform on the affect recognition task.

Emojis
To represent emojis as embeddings, we used the pre-trained embeddings from the emoji2vec package (Eisner et al., 2016).

Word Embeddings
For this study, an embedding length of 200 is used. We utilized pre-trained global vectors trained on tweets (Pennington et al., 2014).

Hashtags
Hashtags carry a lot of semantic information about tweets. However, most of that information is lost if a hashtag cannot be found in the word embeddings. Therefore, we followed a greedy approach for dividing hashtags into their constituent words. Once the # is removed, we check whether the content of the hashtag is present in the vocabulary in its entirety. If the vocabulary has the hashtag content, we use it. If not, more processing is done. Starting from the beginning of the content, we advance a pointer, searching for a valid word that spans index 0 to the pointer. Once the indices 0 to j delimit a substring that is a valid word, we continue the search recursively on the rest of the content (i.e., from j+1 to the end of the string). The words that are found are added to the list of words that represent the hashtag. We then represent those words as embeddings.
Because this approach greedily finds the shortest possible words contained within the hashtag, it is not guaranteed to capture the correct semantics every time. For example, #feelsadforyou is correctly divided into ['feel', 'sad', 'for', 'you']; however, #toniteinasheville ("tonite in asheville") becomes ['tonite', 'in', 'as', 'he', 'ville'], which is not correct. Achieving perfect semantics would require human labeling; we therefore used the greedy approach and observed that utilizing hashtag contents significantly improves the effectiveness of the models.
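The greedy segmentation can be sketched as below. This is a minimal sketch under stated assumptions rather than the authors' code: the function names and the toy vocabulary are hypothetical, lowercasing the hashtag is an assumption, and so is the handling of dead ends (the sketch backtracks to a longer prefix, and keeps the whole content as a single token when no segmentation exists), since the text does not describe those cases.

```python
def segment(text, vocab):
    """Split text into vocabulary words, always trying the shortest prefix first.

    Returns a list of words, or None when no full segmentation exists.
    """
    if not text:
        return []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in vocab:
            # A valid word spans [0, end); recurse on the remainder.
            rest = segment(text[end:], vocab)
            if rest is not None:
                return [prefix] + rest
    return None


def split_hashtag(hashtag, vocab):
    """Return the list of words that represent a hashtag."""
    content = hashtag.lstrip("#").lower()  # lowercasing is an assumption
    if content in vocab:
        return [content]  # the whole content is a known word
    words = segment(content, vocab)
    # Assumed fallback: keep the unsplit content if nothing matches.
    return words if words is not None else [content]


# Toy vocabulary standing in for the word-embedding vocabulary.
vocab = {"feel", "sad", "for", "you", "tonite", "in",
         "as", "he", "ville", "asheville"}

good = split_hashtag("#feelsadforyou", vocab)
bad = split_hashtag("#toniteinasheville", vocab)
```

With this vocabulary, `good` is `['feel', 'sad', 'for', 'you']` and `bad` is `['tonite', 'in', 'as', 'he', 'ville']`, reproducing the failure case above: the shortest-first rule commits to "as" before the longer word "asheville" is ever considered.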

Results on Validation Set
The accuracies of binary classification models for each affect category are presented in Figure 3. We compare the models' performances with majority baselines where the percentage of the class value that occurs most is taken as the majority baseline for each class.
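As a small illustration of the majority baseline described above, the following sketch computes it for one binary label column. The function name and the toy labels are assumptions made for this example.

```python
def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent class value."""
    positives = sum(labels)
    negatives = len(labels) - positives
    return max(positives, negatives) / len(labels)


# Toy label column: 1 positive tweet out of 20 (invented for illustration).
toy_labels = [1] + [0] * 19
baseline = majority_baseline(toy_labels)
```

For this toy column the baseline is 0.95: a model that always predicts the negative class is already right on 19 of 20 tweets, which is why accuracy alone can look strong for rare affect categories.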

Results
In this section, we report the results on the test set and discuss them.

Experimental Results
For all of our experiments, we used the SAS Deep Learning Toolkit. We utilized an environment with 4 workers and 24 threads in each worker; the mini-batch size per thread on each worker was 6. The Adam optimizer was used in all experiments.
On the test set, the proposed model achieved 0.398 accuracy, 0.539 micro-avg F1, and 0.358 macro-avg F1. A random baseline achieves 0.185 accuracy, 0.307 micro-avg F1, and 0.285 macro-avg F1. The generic RNN model thus more than doubles the accuracy of the random baseline and improves micro-avg F1 by 0.232 and macro-avg F1 by 0.073.
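To make the difference between the two F1 averages concrete, the sketch below computes micro- and macro-averaged F1 from 0/1 label matrices. It uses the standard definitions and is not the official task scorer; the function names, the toy matrices, and the convention that F1 is 0 when a class has no true or predicted positives are assumptions.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0.0 when there is nothing to score (assumed)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred):
    """gold, pred: lists of rows with one 0/1 entry per affect category."""
    n_classes = len(gold[0])
    counts = []
    for c in range(n_classes):
        tp = sum(1 for g, p in zip(gold, pred) if g[c] == 1 and p[c] == 1)
        fp = sum(1 for g, p in zip(gold, pred) if g[c] == 0 and p[c] == 1)
        fn = sum(1 for g, p in zip(gold, pred) if g[c] == 1 and p[c] == 0)
        counts.append((tp, fp, fn))
    # Macro: average the per-class F1 scores, so every class weighs the same.
    macro = sum(f1(*c) for c in counts) / n_classes
    # Micro: pool the counts over all classes, so frequent classes dominate.
    micro = f1(sum(c[0] for c in counts),
               sum(c[1] for c in counts),
               sum(c[2] for c in counts))
    return micro, macro


# Toy example with 3 classes and 4 tweets; the rare third class is missed.
gold = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [1, 0, 1]]
pred = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [1, 0, 0]]
micro, macro = micro_macro_f1(gold, pred)
```

In this toy case the micro average is 0.8 while the macro average is 5/9, because the missed rare class contributes a per-class F1 of zero. A gap in the same direction appears in the reported scores (0.539 micro versus 0.358 macro), which is consistent with weaker performance on rare categories, although the toy data does not establish that.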

Discussion
Some of the affect categories have very few positive examples, so it is very difficult for the classifiers to learn the nuances of those affects. For example, the surprise and trust categories have majority baselines of 96.05% and 95.15%, respectively. In other words, only 4-5% of all training set observations have these affect categories as true.
As can be seen in Figure 4, when the number of positive observations is limited, the classifiers tend to produce more false negatives. For affect categories with a dominant majority class value, we also experimented with sampling so that the numbers of positive and negative examples were equal. However, that made the dataset significantly smaller, making it even more difficult for the RNN models to learn distinctions. Rather than using smaller datasets or including external data, we prefer to employ binary models. One of the main advantages of binary models over multi-label models is that they deal better with the uneven distribution of positive examples across classes.
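The effect of balanced sampling on dataset size can be illustrated with a short sketch. It assumes random undersampling of the larger class, which is one way to equalize the classes; the text does not state which sampling scheme was used, and the function name, seed, and toy data are hypothetical.

```python
import random


def balance_by_undersampling(examples, seed=0):
    """examples: list of (tweet, label) pairs with label 0 or 1.

    Keeps an equal number of positives and negatives by randomly
    dropping examples from the larger class.
    """
    positives = [ex for ex in examples if ex[1] == 1]
    negatives = [ex for ex in examples if ex[1] == 0]
    n = min(len(positives), len(negatives))
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)
    return balanced


# Toy dataset with 5% positives (invented for illustration).
examples = [(f"tweet {i}", 1 if i < 5 else 0) for i in range(100)]
balanced = balance_by_undersampling(examples)
```

Here a 100-tweet dataset with five positives shrinks to ten tweets after balancing, which illustrates why this kind of sampling leaves little data to train on for rare categories.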

Conclusion
Affect identification without visual cues is a challenging task, since text is the only source of information available to machine learning models. The problem becomes more challenging on Twitter, where the text is limited by the number of characters.
This paper presented a simple yet effective approach for classifying the affect categories of tweets. The main motivation of this paper was to evaluate how well a domain-independent RNN model can perform at classifying affects. Therefore, no domain-dependent sources of information such as affect lexicons or pre-trained affect features were used. We built a binary classification model for each affect category. The results showed that RNNs are powerful enough to outperform the baselines significantly, even without prior knowledge about the domain and with a relatively small dataset.