TeamUNCC at SemEval-2018 Task 1: Emotion Detection in English and Arabic Tweets using Deep Learning

Task 1 in the International Workshop SemEval-2018, Affect in Tweets, introduces five subtasks (El-reg, El-oc, V-reg, V-oc, and E-c) to detect the intensity of emotions in English, Arabic, and Spanish tweets. This paper describes TeamUNCC's system for detecting emotions in English and Arabic tweets. Our approach is novel in that we use the same architecture for all five subtasks in both English and Arabic. The main input to the system is a combination of word2vec and doc2vec embeddings and a set of psycholinguistic features (e.g., from the AffectiveTweets Weka package). We apply a fully connected neural network architecture and obtain performance results that show substantial improvements in Spearman correlation scores over the baseline models provided by the Task 1 organizers (ranging from 0.03 to 0.23). TeamUNCC's system ranks third in subtask El-oc and fourth in the other subtasks for Arabic tweets.


Introduction
The rise and diversity of social microblogging channels encourage people to express their feelings and opinions on a daily basis. Consequently, sentiment analysis and emotion detection have gained the interest of researchers in natural language processing and in other fields including political science, marketing, communication, the social sciences, and psychology (Mohammad and Bravo-Marquez, 2017; Agarwal et al., 2011; Chin et al., 2016). Sentiment analysis refers to classifying a subjective text as positive, neutral, or negative; emotion detection recognizes the types of feelings expressed in text, such as anger, joy, fear, and sadness (Agarwal et al., 2011; Ekman, 1993).
SemEval is the International Workshop on Semantic Evaluation, which evolved from SensEval. The purpose of this workshop is to evaluate semantic analysis systems; SemEval-2018 is the 12th workshop in the series. Task 1 in this workshop presents five subtasks with annotated datasets of English, Arabic, and Spanish tweets. Participating teams are asked to determine the intensity of emotions in text. Further details about Task 1 and the datasets appear in Section 3.
Our system covers all five subtasks for both English and Arabic. The input to the system consists of word embedding vectors (Mikolov et al., 2013a), which are fed into a fully connected neural network architecture to obtain the results. In addition, all subtasks except the last one use document-level doc2vec embeddings (Le and Mikolov, 2014), concatenated with different feature vectors. The models built for detecting emotions in Arabic tweets ranked third in subtask El-oc and fourth in the other subtasks. We use both the original Arabic tweets and their English translations as input. The performance of the system for all subtasks in both languages shows substantial improvements in Spearman correlation scores over the baseline models provided by the Task 1 organizers, ranging from 0.03 to 0.23.
The remainder of this paper is organized as follows: Section 2 gives a brief overview of existing work on social media emotion and sentiment analysis, including for the English and Arabic languages. Section 3 presents the requirements of SemEval Task 1 and the provided datasets. Section 4 examines TeamUNCC's system for determining the presence and intensity of emotion in text. Section 5 summarizes the key findings of the study and the evaluations. Section 6 concludes with future directions for this research.


Related Work

Sentiment and Emotion Analysis: Sentiment analysis was first explored in 2003 by Nasukawa and Yi (Nasukawa and Yi, 2003). Interest in studying and building models for sentiment analysis and emotion detection on social microblogging platforms has increased significantly in recent years (Kouloumpis et al., 2011; Pak and Paroubek, 2010; Oscar et al., 2017; Jimenez-Zafra et al., 2017). Going beyond the task of merely classifying tweets as positive or negative, several approaches to detecting emotions were presented in previous research (Mohammad and Kiritchenko, 2015; Tromp and Pechenizkiy, 2014; Mohammad, 2012). Mohammad and Bravo-Marquez (2017) introduced the WASSA-2017 shared task of detecting the intensity of emotion felt by the speaker of a tweet. The state-of-the-art system in that competition (Goel et al., 2017) used an ensemble of three different deep neural network-based models, representing tweets as word2vec embedding vectors. In our system, we add doc2vec embedding vectors and classify tweets into ordinal classes of emotions as well as performing multi-label emotion classification.
Arabic Emotion Analysis: The growth of the Arabic language on social microblogging platforms, especially on Twitter, and the significant role of the Arab region in international politics and in the global economy have led researchers to investigate the mining and analysis of sentiments and emotions in Arabic tweets (Abdullah and Hadzikadic, 2017; Assiri et al., 2016). The challenges facing researchers in this area fall under two main headings: a lack of annotated resources and the Arabic language's complex morphology relative to other languages (Assiri et al., 2015). Although much recent research has been dedicated to detecting emotions in English content, to our knowledge there are few studies for Arabic content. Rabie and Sturm (2014) collected and annotated data and applied different preprocessing steps specific to the Arabic language. They also used a simplification of SVM (known as SMO) and the Naive Bayes classifier. Two other related works (Kiritchenko et al., 2016; Rosenthal et al., 2017) presented shared tasks to identify the overall sentiment of tweets, or of phrases taken from tweets, in both English and Arabic. Our work uses state-of-the-art approaches of deep learning and word/document embeddings.

Task Description and Datasets
SemEval-2018 Task 1, Affect in Tweets, presents five subtasks (El-reg, El-oc, V-reg, V-oc, and E-c). The subtasks provide training and testing Twitter datasets in the English, Arabic, and Spanish languages (Mohammad et al., 2018). Task 1 asks the participants to predict the intensity of emotions and sentiments in the testing datasets. It also includes a multi-label emotion classification subtask for tweets. This paper focuses on determining emotions in English and Arabic tweets. Figure 1 shows the number of tweets in both the training and testing datasets for the individual subtasks. We note that subtasks El-reg and El-oc share the same datasets with different annotations, as do subtasks V-reg and V-oc.

The description of each subtask is:

El-reg: Determine the intensity of an emotion in a tweet as a real-valued score between 0 (least emotion intensity) and 1 (most emotion intensity).
El-oc: Classify the intensity of an emotion (anger, joy, fear, or sadness) in the tweet into one of four ordinal classes (0: no emotion; 1; 2; and 3: high emotion).
V-reg: Determine the intensity of sentiment or valence (V) in a tweet as a real-valued score between 0 (most negative) and 1 (most positive).
V-oc: Classify the sentiment intensity of a tweet into one of seven ordinal classes, corresponding to various levels of positive and negative sentiment intensity (3: very positive mental state can be inferred; 2; 1; 0; -1; -2; and -3: very negative mental state can be inferred).

E-c: Classify the tweet as 'neutral or no emotion' or as one, or more, of eleven given emotions (anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, and trust).

The TeamUNCC System
Our team, TeamUNCC, is the only team that participated in all subtasks of Task 1 of SemEval-2018 for both English and Arabic tweets. Subtasks El-reg and V-reg are similar in that they determine the intensity of an emotion or a sentiment (respectively) in a tweet as a real-valued score, while subtasks El-oc and V-oc classify the intensity of the emotion or the sentiment (respectively) into ordinal classes. Our system, designed for these subtasks, shares most features and components across them; the fifth subtask, E-c, uses fewer of these elements. Figure 2 shows the general structure of our system. The system's components are detailed in the following subsections: Section 4.1 describes the system's input and preprocessing, Section 4.2 lists the feature vectors, Section 4.3 details the neural network architecture, and Section 4.4 discusses the output.

Input and Preprocessing
EngTweets: The original English tweets in the training and testing datasets have been tokenized by converting the sentences into words, and all uppercase letters have been converted to lowercase. The preprocessing step also includes stemming the words and removing extraneous white space. Punctuation marks have been treated as individual words (".,?!:;()[]#@'), while contractions (wasn't, aren't) were left untreated.
ArTweets: The original Arabic tweets in training and testing datasets have been tokenized, white spaces have been removed, and the punctuation marks have been treated as individual words (".,?!:;()[]#@').
TraTweets: The Arabic tweets have been translated to English using the Python translation tool translate 3.5.0 (https://pypi.python.org/pypi/translate). Next, the preprocessing steps applied to EngTweets are also applied to TraTweets.
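The tokenization steps above can be sketched as follows. The paper does not name the exact tokenizer or stemmer, so this is a minimal illustration (stemming omitted) in which the punctuation-splitting regex is our assumption:

```python
import re

# Punctuation marks treated as individual tokens (from the list above).
PUNCT_RE = re.compile(r'([".,?!:;()\[\]#@])')

def preprocess(tweet: str) -> list[str]:
    """Lowercase, split punctuation into separate tokens, and drop
    extraneous whitespace. Apostrophes are not split, so contractions
    (wasn't, aren't) stay untreated, as described above."""
    text = PUNCT_RE.sub(r" \1 ", tweet.lower())
    return text.split()  # split() also removes extraneous whitespace

print(preprocess("Wasn't  that GREAT?!"))
```

For ArTweets, the same punctuation splitting and whitespace removal apply, without lowercasing or stemming.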

Feature Vectors
AffectTweets-145: Each tweet, in either EngTweets or TraTweets, is represented as a 145-dimensional vector by concatenating three vectors obtained from the AffectiveTweets Weka package (Mohammad and Bravo-Marquez, 2017; Bravo-Marquez et al., 2014): 43 features are extracted using the TweetToLexiconFeatureVector attribute, which calculates attributes for a tweet using a variety of lexical resources; a two-dimensional vector comes from the sentiment strength feature of the same package; and the final 100-dimensional vector is obtained by vectorizing the tweets with the embeddings attribute, also from the same package.
Doc2Vec-300: Each tweet is represented as a 300-dimensional vector by concatenating two vectors of 150 dimensions each, using document-level embeddings ('doc2vec') (Le and Mikolov, 2014; Lau and Baldwin, 2016). The vectors for the words in the tweet are averaged to attain each 150-dimensional representation of the tweet.
Word2Vec-300: Each tweet is represented as a 300-dimensional vector using the pretrained word2vec embedding model trained on Google News (Mikolov et al., 2013b); for Arabic tweets, we use the pretrained embedding model trained on Arabic tweets (Twt-SG) (Soliman et al., 2017).
PaddingWord2Vec-300: Each word in a tweet is represented as a 300-dimensional vector, using the same pretrained word2vec embedding models as in Word2Vec-300. Each tweet is represented as a matrix with a fixed number of rows, equal to the maximum tweet length in the dataset, and a standard 300 columns, padded with zero vectors.
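A pure-NumPy sketch of these last two representations. The embedding table here is a toy stand-in for the pretrained GoogleNews/Twt-SG models, and averaging word vectors to obtain the single Word2Vec-300 tweet vector is our assumption:

```python
import numpy as np

# Toy embedding table standing in for a pretrained word2vec model.
rng = np.random.default_rng(0)
lookup = {w: rng.standard_normal(300) for w in ["i", "love", "this"]}

def word2vec_300(tokens):
    """One 300-dim vector per tweet (here: mean of its word vectors)."""
    vecs = [lookup[t] for t in tokens if t in lookup]
    return np.mean(vecs, axis=0) if vecs else np.zeros(300)

def padding_word2vec_300(tokens, max_len):
    """Stack word vectors into a fixed (max_len, 300) matrix, padding
    the remaining rows with zero vectors."""
    mat = np.zeros((max_len, 300))
    for i, t in enumerate(tokens[:max_len]):
        mat[i] = lookup.get(t, np.zeros(300))
    return mat

print(word2vec_300(["i", "love", "this"]).shape)     # (300,)
print(padding_word2vec_300(["i", "love"], 30).shape) # (30, 300)
```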

Network Architecture
Dense-Network: The input 445-dimensional vector feeds into a fully connected neural network with three dense hidden layers. The activation function for each layer is ReLU (Maas et al., 2013), with 256, 256, and 80 neurons per layer, respectively. The output layer consists of one sigmoid neuron, which predicts the intensity of the emotion or the sentiment between 0 and 1. Two dropouts (0.3, 0.5) are applied after the first and second layers, respectively. For optimization, we use the SGD (Stochastic Gradient Descent) optimizer (lr=0.01, decay=1e-6, and momentum=0.9), optimizing for the 'mse' loss function and the 'accuracy' metric. Early stopping is also applied to obtain the best results.

Figure 2: The structure for our system.

LSTM-Network: The input vector feeds an LSTM of 256 neurons that passes the vector to a fully connected neural network of two hidden layers and two dropouts (0.3, 0.5). The first hidden layer has 256 neurons, the second has 80, and both use the ReLU activation function. The output layer consists of one sigmoid neuron, which predicts the intensity of the emotion or the sentiment between 0 and 1. For optimization, we use the SGD optimizer (lr=0.01, decay=1e-6, and momentum=0.9), optimizing for the 'mse' loss function and the 'accuracy' metric, with early stopping to obtain the best results.
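A hedged Keras sketch of the two architectures described above; the dropout placement in the LSTM network is our reading of the text, and the learning-rate decay argument is omitted because its name differs across Keras versions:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_dense_network(input_dim=445):
    """Three ReLU hidden layers (256, 256, 80) with dropouts 0.3 and 0.5
    after the first and second layers, and one sigmoid output neuron."""
    model = keras.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(80, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    # The paper also sets decay=1e-6 on SGD (argument name varies by version).
    sgd = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
    model.compile(optimizer=sgd, loss="mse", metrics=["accuracy"])
    return model

def build_lstm_network(max_len, embed_dim=300):
    """LSTM(256) feeding two ReLU hidden layers (256, 80); placing one
    dropout after each hidden layer is an assumption."""
    model = keras.Sequential([
        layers.Input(shape=(max_len, embed_dim)),
        layers.LSTM(256),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(80, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),
    ])
    sgd = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
    model.compile(optimizer=sgd, loss="mse", metrics=["accuracy"])
    return model
```

Training would pass a `keras.callbacks.EarlyStopping` callback to `model.fit` to apply the early stopping mentioned above.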

Output
Subtasks El-reg, El-oc, V-reg, and V-oc: These four subtasks for each language (English and Arabic) share the same structure, shown in Figure 2; the only difference is in the output stage. Each subtask passes the tweets to three different models that produce three predictions. See Table 1 and Table 2 for more comprehensive details on how each prediction is produced for English and Arabic, respectively. The average of the predictions for each tweet is a real-valued number between 0 and 1. This output is the final output for subtasks El-reg and V-reg, while subtasks El-oc and V-oc map this real-valued number to one of the ordinal classes shown in Section 3. We note that El-reg and El-oc share the same datasets, as do V-reg and V-oc. Therefore, we found the ranges of values for each ordinal class by comparing the datasets; Table 3 shows these ranges.

Subtask E-c: In this subtask, our system makes only one prediction; see Figure 3 for more details on the prediction process. The input is EngTweets for English and ArTweets for Arabic. We use Word2Vec-300 as the feature vector, with GoogleNews embeddings for English tweets and Twt-SG for Arabic tweets. The network architecture is Dense-Network. This process is applied for each of the eleven emotions: anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, and trust. The output for each individual tweet is a real-valued number between 0 and 1, which is normalized to 1 (contains the emotion) if it is greater than 0.5 or to 0 (no emotion) otherwise.
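The output stage can be sketched as follows; the bin edges below are illustrative placeholders, not the actual ranges derived from the datasets (Table 3):

```python
def ensemble(p1, p2, p3):
    """Average the three model predictions (a real value in [0, 1])."""
    return (p1 + p2 + p3) / 3.0

def to_ordinal(score, upper_edges):
    """Map a real-valued score to an ordinal class: the index of the
    first bin whose upper edge exceeds the score."""
    for cls, upper in enumerate(upper_edges):
        if score < upper:
            return cls
    return len(upper_edges)

def to_binary(score):
    """E-c normalization: 1 (contains the emotion) if score > 0.5, else 0."""
    return 1 if score > 0.5 else 0

# Hypothetical four-class (El-oc style) edges -- NOT the Table 3 values:
EDGES = [0.25, 0.5, 0.75]
print(to_ordinal(ensemble(0.2, 0.4, 0.6), EDGES))  # average 0.4 -> class 1
```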

Evaluations and Results
Each participating system in subtasks El-reg, El-oc, V-reg, and V-oc has been scored using the Spearman correlation score; subtask E-c has been scored using an accuracy metric. Table 6 shows the performance of our system on El-reg and El-oc for each emotion, with the average score, for both English and Arabic. Table 7 shows the results for subtasks V-reg, V-oc, and E-c. The performance of our system beats that of the baseline model provided by the Task's organizers; see Figure 4 for the difference between the two. Our system ranks third in subtask El-oc for Arabic and fourth in subtasks El-reg, V-reg, V-oc, and E-c for Arabic. It is worth mentioning that these results were obtained using the task datasets alone, without any external data.
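For reference, the official Spearman correlation metric can be computed with SciPy; the values here are toy scores, not task data:

```python
from scipy.stats import spearmanr

gold = [0.1, 0.4, 0.6, 0.9]  # toy gold intensity scores
pred = [0.2, 0.3, 0.7, 0.8]  # toy system predictions
rho, _ = spearmanr(gold, pred)
print(rho)  # 1.0 -- the two rankings agree perfectly
```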

Conclusion
In this paper, we have presented our system that participated in Task 1 of SemEval-2018. Our system is unique in that we use the same underlying architecture for all subtasks in both languages, English and Arabic, to detect the intensity of emotions and sentiments in tweets. The performance of the system on each subtask beats the performance of the baseline model, indicating that our approach is promising. The system ranked third in El-oc for Arabic and fourth in the other subtasks for Arabic.
In this system, we used word2vec and doc2vec embedding models together with feature vectors extracted from the tweets using the AffectiveTweets Weka package; these vectors feed the deep neural network layers that produce the predictions.
In future work, we will add emotion and valence detection for the Spanish language to our system by applying the same approaches used for Arabic. We also want to investigate the Arabic feature attributes in order to enhance performance for this language.