ISCLAB at SemEval-2018 Task 1: UIR-Miner for Affect in Tweets

This paper presents the UIR-Miner system for the emotion and sentiment analysis evaluation on Twitter at SemEval-2018. Our system consists of four main modules: a preprocessing module, a stacking module for emotion and sentiment intensity prediction, an LSTM network module for multi-label classification, and a hierarchical attention network module for emotion and sentiment classification. Under the SemEval-2018 metrics, our system obtains final scores of 0.636, 0.531, 0.731, 0.708, and 0.408 on the 5 subtasks, respectively.


Introduction
Recently, social media platforms such as Twitter and Facebook have become more and more popular. Through these platforms, online users share their opinions and emotions. Therefore, the analysis of "affect" information in social media has attracted much interest from both academia and industry.
However, short texts usually consist of informal expressions with many casual forms and emoticons, which brings great challenges to such research.
For this purpose, SemEval organizes an evaluation of sentiment analysis on tweets. This year marks the fifth edition, which consists of new subtasks, including an emotion intensity regression task, an emotion intensity ordinal classification task, a sentiment intensity regression task, a sentiment intensity ordinal classification task, and an emotion classification task. We participated in SemEval-2018 Task 1 for English, i.e., Affect in Tweets. Our system treats EI-reg and V-reg (subtasks A and C) as regression problems, predicting emotion and sentiment intensity with regression models, while it treats EI-oc and V-oc (subtasks B and D) as categorization problems, classifying each tweet into its corresponding emotion and sentiment category with hierarchical attention networks. Moreover, subtask E, i.e., E-c, is treated as a multi-label classification task. This paper is organized as follows. Section 2 overviews the framework of our system. Section 3 describes the methods for subtasks A and C. Section 4 describes the hierarchical attention networks for subtasks B and D. Subtask E is introduced in Section 5. Section 6 presents the evaluation results. Section 7 concludes this paper.

System Overview
The architecture of UIR-Miner is shown in Figure 1. The UIR-Miner system comprises 4 modules:
(1) Preprocessing module: performs data cleaning, topic classification, and tweet embedding;
(2) Regression module: builds an ensemble regression model from several basic models to calculate emotion intensity and sentiment intensity, i.e., subtasks A and C;
(3) Classification module: constructs an LSTM network with a multi-layer attention mechanism for emotion and sentiment categorization, i.e., subtasks B and D;
(4) Multi-label classification module: builds an LSTM network for subtask E.

Preprocessing
Our system first preprocesses the tweet data; the main steps are as follows.
• Delete unrelated text, including ids, mentions, stop words, and meaningless punctuation combinations.
• Normalize synonymous word forms, e.g., replacing "cant" and "can't" with "cannot".
• Extract emoticons from tweets through regular expressions, and retain the emotional ones.
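The three steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' exact code; the normalization map and the emoticon pattern are simplified assumptions.

```python
import re

# Hypothetical normalization map and emoticon pattern (assumptions for illustration).
NORMALIZE = {"cant": "cannot", "can't": "cannot"}
EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPp]")

def preprocess(tweet: str):
    # 1. delete unrelated text: mentions and meaningless punctuation runs
    text = re.sub(r"@\w+", "", tweet)
    text = re.sub(r"[!?.]{3,}", " ", text)
    # 2. normalize synonymous word forms
    tokens = [NORMALIZE.get(t.lower(), t) for t in text.split()]
    # 3. extract emoticons via a regular expression and keep them separately
    emoticons = EMOTICON_RE.findall(tweet)
    return tokens, emoticons
```

In practice the deletion rules and emoticon inventory would be richer; this only shows the shape of the pipeline.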

Word embedding
In the preprocessing, we use pre-trained GloVe word embeddings (Pennington et al., 2014), in which each word $w_{ij}$ is represented by a 200-dimensional vector, $i \in [1, L]$, $j \in [1, T]$. Here, $i$ denotes the position of the sentence in the tweet and $L$ is the maximum number of sentences per tweet; $j$ denotes the position of the word in the sentence and $T$ is the maximum number of words per sentence. We set $T = 140$ and $L = 5$.
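The embedding lookup described above can be sketched as follows, mapping a tweet to an $L \times T \times 200$ tensor with zero-padding. The file path and helper names are hypothetical; only the dimensions ($T = 140$, $L = 5$, 200-d vectors) come from the paper.

```python
import numpy as np

DIM, MAX_WORDS, MAX_SENTS = 200, 140, 5  # 200-d GloVe, T = 140, L = 5

def load_glove(path):
    # Parse a GloVe text file: "word v1 v2 ... v200" per line (path is hypothetical).
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def embed_tweet(sentences, vectors):
    # Map a tweet (a list of tokenized sentences) to an L x T x DIM tensor,
    # zero-padding short sentences and out-of-vocabulary words.
    out = np.zeros((MAX_SENTS, MAX_WORDS, DIM), dtype=np.float32)
    for i, sent in enumerate(sentences[:MAX_SENTS]):
        for j, word in enumerate(sent[:MAX_WORDS]):
            out[i, j] = vectors.get(word, np.zeros(DIM, dtype=np.float32))
    return out
```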

Subtask A and C
This section will describe the methods for subtasks A and C. Given a tweet and an emotion E (or a sentiment V), the task is to determine the intensity of E (or V) that best represents the mental state of the tweeter, a real-valued score between 0 and 1. We treat both subtask A and subtask C as regression problems.
On the whole, we use a stacking framework to enhance the accuracy of the final prediction. The original features, including hashtags, emoticons, and n-gram features, are selected as input to the stacking model. The stacking model is divided into two layers: the base layer and the stacking layer. In the base layer, we choose four basic regressors for their excellent performance. In the stacking layer, we use an SVM model, specifically NuSVR, which can control its error rate. Finally, we obtain the final intensity value.
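A minimal two-layer stacking sketch in scikit-learn is shown below. The paper names NuSVR for the stacking layer; the base models here (Lasso, Random Forest, SVR) are drawn from the candidates listed in Section 3.2, and the out-of-fold construction of meta-features is a standard choice we assume rather than the authors' exact procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVR, NuSVR

def fit_stacking(X, y):
    base = [Lasso(alpha=0.01),
            RandomForestRegressor(n_estimators=50, random_state=0),
            SVR()]
    # out-of-fold predictions keep training targets from leaking into the meta layer
    meta_X = np.column_stack([cross_val_predict(m, X, y, cv=3) for m in base])
    for m in base:
        m.fit(X, y)
    meta = NuSVR(nu=0.5).fit(meta_X, y)   # stacking layer: NuSVR as in the paper
    return base, meta

def predict_stacking(base, meta, X):
    meta_X = np.column_stack([m.predict(X) for m in base])
    return meta.predict(meta_X)
```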

Feature Selection
Since there are many irregular expressions in tweets, we combine features including emoticons, hashtags, and special punctuation. In our system, we mainly select the following features:
• Hashtags: the number of hashtags in one tweet;
• Ill format: the presence of ill-formed words with some characters replaced by *;
• Punctuation: the number of contiguous sequences of exclamation marks, question marks, and both exclamation and question marks; whether the last token contains an exclamation or question mark;
• Emoticons: the presence of positive and negative emoticons at any position in the tweet; whether the last token is an emoticon;
• OOV: the ratio of out-of-vocabulary words;
• Elongated words: the presence of sentiment words with one character repeated more than two times, e.g., 'cooool';
• URL: whether the tweet contains a URL;
• Reply or Retweet: whether the current tweet is a reply or a retweet.
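An illustrative extractor for these surface features might look as follows; the emoticon sets, the URL test, and the exact regular expressions are simplified assumptions, not the paper's implementation.

```python
import re

# Tiny illustrative emoticon inventories (assumptions).
POS_EMO, NEG_EMO = {":)", ":D", ";)"}, {":(", ":'("}

def extract_features(tweet: str, vocab: set):
    tokens = tweet.split()
    return {
        "n_hashtags": sum(t.startswith("#") for t in tokens),
        "ill_format": int(bool(re.search(r"\w\*+\w", tweet))),       # e.g. "f*ck"
        "n_punct_runs": len(re.findall(r"[!?]{2,}", tweet)),
        "last_exclaim": int(bool(tokens) and tokens[-1][-1:] in "!?"),
        "has_pos_emo": int(any(t in POS_EMO for t in tokens)),
        "has_neg_emo": int(any(t in NEG_EMO for t in tokens)),
        "last_is_emo": int(bool(tokens) and tokens[-1] in POS_EMO | NEG_EMO),
        "oov_ratio": sum(t.lower() not in vocab for t in tokens) / max(len(tokens), 1),
        "has_elongated": int(bool(re.search(r"(\w)\1{2,}", tweet))),  # "cooool"
        "has_url": int("http" in tweet),
    }
```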

Stacking Model
To avoid overfitting, we test 6 basic models to construct our stacking model:
• (Zhang et al., 2013)
• L: Lasso Regressor (Tibshirani, 1996)
• M: MLP Regressor (Pal and Mitra, 1992)
• R: Random Forest Regressor (Ho, 1995)
• S: SVR (Vapnik, 1995)
To achieve the best performance, we also compare different combinations of the basic models under the Mean Squared Error (MSE) metric in the stacking method; the experimental results are shown in Table 1. Since Stacking 6 achieves the best performance, we use the same setting in our system.
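The combination-selection step described above amounts to picking the candidate whose development-set predictions minimize MSE. A minimal sketch, assuming each candidate combination's dev predictions are already computed:

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error between gold intensities and predictions
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def best_combo(dev_y, combo_preds):
    # combo_preds: {combination name: dev-set predictions}; pick the lowest-MSE one
    return min(combo_preds, key=lambda name: mse(dev_y, combo_preds[name]))
```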

Hierarchical Attention Networks for Subtask B and D
This section will introduce our hierarchical attention model for subtasks B and D. Given a tweet and an emotion category E (or a sentiment category V), the task is to classify the tweet into one of the ordinal classes of intensity of E (or V) that best represents the mental state of the tweeter. Note that the number of categories of E is 4, while that of V is 7. In our system, we treat both subtask B and subtask D as classification problems. Each tweet contains several sentences, which are composed of several words. To better represent the emotion or sentiment semantics, we utilize the hierarchical structure of a tweet to capture contextual information both within and across sentences. The architecture is shown in Figure 2.
We build a hierarchical model with two layers, a word layer and a sentence layer. Since words and sentences are highly sensitive to their contexts, recurrent neural networks based on bidirectional long short-term memory (BiLSTM) (Hochreiter and Schmidhuber, 1997) are applied on both layers to obtain tweet representations. Furthermore, since the words in one sentence, or the different sentences in a given tweet, can indicate different emotion or sentiment intensities, attention mechanisms are added to both layers (Xu et al., 2015) to better represent the semantics. We then use softmax as the activation function for the final categorization.

BiLSTM-based Word Encoder
A word-level BiLSTM (Hochreiter and Schmidhuber, 1997) is used to represent each word. The BiLSTM consists of a forward LSTM and a backward LSTM. The forward LSTM reads the sentence from $w_{i1}$ to $w_{iT}$ and represents word $w_{it}$ as $\overrightarrow{h}_{it} = \overrightarrow{\mathrm{LSTM}}(w_{it})$, $t \in [1, T]$. The backward LSTM reads the sentence from $w_{iT}$ to $w_{i1}$ and represents word $w_{it}$ as $\overleftarrow{h}_{it} = \overleftarrow{\mathrm{LSTM}}(w_{it})$, $t \in [T, 1]$. Word $w_{it}$ can then be annotated by concatenating the forward and backward information, $h_{it} = [\overrightarrow{h}_{it}, \overleftarrow{h}_{it}]$. The LSTM equations are as follows:
$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$
$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$
$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$
$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
$h_t = o_t \odot \tanh(c_t)$
where $i$, $f$, and $o$ are the input gate, forget gate, and output gate, $\sigma$ is the logistic sigmoid function, $\odot$ denotes elementwise multiplication, $\tanh$ is the network output activation function, and softmax is used for categorization. To better support Twitter, we input word embeddings of 200 dimensions and set the maximum number of words in a sentence to 140.
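For concreteness, the gate equations and the forward/backward concatenation can be written out in NumPy. This is a didactic sketch with arbitrarily shaped, externally supplied parameters, not the trained network from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # W, U, b stack the input, forget, output, and candidate-cell parameters
    z = W @ x + U @ h_prev + b
    d = h_prev.shape[0]
    i, f, o = sigmoid(z[:d]), sigmoid(z[d:2*d]), sigmoid(z[2*d:3*d])
    c_tilde = np.tanh(z[3*d:])
    c = f * c_prev + i * c_tilde   # c_t = f_t * c_{t-1} + i_t * c~_t
    h = o * np.tanh(c)             # h_t = o_t * tanh(c_t)
    return h, c

def bilstm(xs, params_fwd, params_bwd, d):
    # Concatenate forward and backward hidden states for each word position.
    def run(seq, params):
        h, c = np.zeros(d), np.zeros(d)
        out = []
        for x in seq:
            h, c = lstm_step(x, h, c, *params)
            out.append(h)
        return out
    fwd = run(xs, params_fwd)
    bwd = run(xs[::-1], params_bwd)[::-1]  # read right-to-left, realign
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```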

Word Layer Attention
Different weights are given to different words. An attention mechanism (Xu et al., 2015) is added to the word layer, and the sentence can be represented as $s_i$:
$u_{it} = \tanh(W_w h_{it} + b_w)$
$\alpha_{it} = \mathrm{softmax}(u_{it}^{\top} u_w)$
$s_i = \sum_t \alpha_{it} h_{it}$
More specifically, after feeding $h_{it}$ into a fully-connected layer, we get $u_{it}$. Then we calculate the weight $\alpha_{it}$ with a word-level context vector $u_w$. Finally, we obtain the sentence vector $s_i$ through the attention layer by computing the weighted sum of the $h_{it}$.

Sentence Layer Attention
Similarly, a sentence-level BiLSTM (Hochreiter and Schmidhuber, 1997) is used to represent each sentence with sentence-level context information, $h_i = [\overrightarrow{h}_i, \overleftarrow{h}_i]$. We then assign weights to different sentences. Taking $s_i$ as input, we obtain the representation $v$ of each tweet through an attention layer:
$u_i = \tanh(W_s h_i + b_s)$
$\alpha_i = \mathrm{softmax}(u_i^{\top} u_s)$
$v = \sum_i \alpha_i h_i$
More specifically, after feeding $h_i$ into a fully-connected layer, we get $u_i$. Then we calculate the weight $\alpha_i$ with a sentence-level context vector $u_s$. Finally, we obtain the tweet vector $v$ through the attention layer by computing the weighted sum of the $h_i$.

Subtask E
This section will introduce the neural network model for subtask E. Given a tweet, the task is to classify it as "neutral or no emotion" or as one, or more, of eleven given emotions that best represent the mental state of the tweeter.
Each tweet can be assigned a different number of labels. Since any of the eleven labels may apply, it is reasonable to consider each label independently. Our system calculates a score for each of the eleven labels for each tweet, and selects the top 3 as the final results.
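The selection step can be sketched as follows. The label scores would come from the LSTM described below; the label inventory is the eleven-emotion set of the E-c subtask, and `top3_labels` is a hypothetical helper name.

```python
import numpy as np

# The eleven E-c emotion labels of SemEval-2018 Task 1.
EMOTIONS = ["anger", "anticipation", "disgust", "fear", "joy", "love",
            "optimism", "pessimism", "sadness", "surprise", "trust"]

def top3_labels(scores):
    # Keep the three highest-scoring labels as the final multi-label prediction.
    idx = np.argsort(scores)[::-1][:3]
    return [EMOTIONS[i] for i in sorted(idx)]
```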
We also use an LSTM network for this task and obtain the classification result using softmax. The other settings of this model are quite similar to those in Section 4, except for the multi-label classification.

Experiment
In this section, we report our evaluation results in SemEval 2018 on the given dataset under the official metrics. The statistics of the dataset are shown in Table 2.
Note that no extra external resources, such as sentiment lexicons, emoticon lists, or annotated corpora, are used in the evaluation beyond the training dataset provided by the organizers. Table 3 shows the results of our UIR-Miner on all the subtasks on both the Dev set and the Test set, along with the final ranking.
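The scores reported for the intensity subtasks are Pearson correlations (see Section 7), which can be computed directly:

```python
import numpy as np

def pearson(a, b):
    # Pearson correlation between gold scores a and predicted scores b.
    a, b = np.asarray(a, float), np.asarray(b, float)
    a, b = a - a.mean(), b - b.mean()
    return float((a @ b) / np.sqrt((a @ a) * (b @ b)))
```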

Conclusion
In this paper, we present a framework for the SemEval-2018 Affect in Tweets task. After preprocessing, we first propose an ensemble method to calculate the intensity scores of emotion and sentiment.
Then an LSTM network model with a multi-layer attention mechanism is constructed for emotion and sentiment classification. Under the SemEval-2018 metrics, our runs obtained final scores of 0.636, 0.531, 0.731, 0.708, and 0.408 in terms of Pearson correlation on the 5 subtasks, respectively.