Zewen at SemEval-2018 Task 1: An Ensemble Model for Affect Prediction in Tweets

This paper presents a method for Affect in Tweets, the task of automatically determining the intensity of emotions and the intensity of sentiment of tweets. The term affect refers to emotion-related categories such as anger and fear. The intensity of an emotion must be quantified as a real-valued score in [0, 1]. We propose an ensemble system comprising four deep learning methods: CNN, Bidirectional LSTM (BLSTM), LSTM-CNN and a CNN-based Attention model (CA). Our system achieves an average Pearson correlation score of 0.682 in subtask EI-reg and 0.784 in subtask V-reg, ranking 17th among 48 systems in EI-reg and 19th among 38 systems in V-reg.


Introduction
Affect determination is a significant part of natural language processing, and affect in tweets has become a focus in recent years. Sentiment Analysis in Twitter, a SemEval task, was first proposed in 2013 and was not replaced until 2018. In SemEval 2018, the task Affect in Tweets (AIT) (Mohammad et al., 2018) was proposed; its objective is to automatically determine the intensity of emotions (E) and the intensity of sentiment (aka valence, V) of tweets. In this paper, we focus on two subtasks:
• EI-reg (emotion intensity regression): Given a tweet and an emotion E, determine the intensity of E that best represents the mental state of the tweeter, as a real-valued score between 0 (least E) and 1 (most E).
• V-reg (sentiment intensity regression): Given a tweet, determine the intensity of sentiment or valence (V) that best represents the mental state of the tweeter, as a real-valued score between 0 (most negative) and 1 (most positive).
Before 2016, most systems used Support Vector Machines (SVM), Naive Bayes, maximum entropy and linear regression (Nakov et al., 2013; Rosenthal et al., 2014, 2015). In SemEval 2014, deep learning methods started to appear, and a team using them won second place. Since 2015, more and more of the top-ranked teams have used deep learning methods, and deep learning methods including CNN and LSTM networks are now very popular (Nakov et al., 2016; Rosenthal et al., 2017).
The system described in this paper is an ensemble of four DNN methods: CNN, Bidirectional LSTM (BLSTM), LSTM-CNN and a CNN-based Attention model (CA). In each method, the words in a tweet are first mapped to word vectors. After the four models compute their intensity scores, a logistic regression combines them into the final score.
The rest of the paper is organized as follows. Section 2 describes the four methods and the ensemble method used in our system. Sections 3 and 4 give the implementation and training details of our system for subtasks EI-reg and V-reg. Section 5 presents and discusses the results of the evaluation period. Finally, Section 6 concludes this work.
2 System Description

2.1 CNN
Inspired by Kim's work on sentence classification (Kim, 2014), the CNN model used in our system is almost identical to his. As shown in Figure 1, a tweet is first fed into the embedding layer, which converts its words into word vectors, so that the tweet is mapped to a matrix $M$ of size $n \times d$. In order to reduce the number [...] c is a bias term and $f$ is a nonlinear function such as ReLU (Jarrett et al., 2009), which is used in our approach. Filters are applied with different window sizes, and for each window size $h$ the filters produce a feature matrix $c \in \mathbb{R}^{(n-h+1) \times m}$: $c = [c_1, c_2, ..., c_k, ..., c_m]$, where $m$ is the number of filters and $c_k \in \mathbb{R}^{n-h+1}$ represents the features extracted from a word sequence. In the pooling layer, we apply a max-over-time pooling operation (Collobert et al., 2011) over the feature matrix, taking the maximum of each column to preserve the most important features. These maxima are concatenated and fed into a fully-connected network (L1, L2). L2 is followed by a single sigmoid neuron that generates the affect prediction on the interval [0, 1].
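The convolution and max-over-time pooling steps can be sketched in plain Python as follows. This is only a minimal illustration, not the authors' PyTorch implementation: it assumes the standard formulation of Kim (2014), in which each filter is applied to every window of h consecutive word vectors, and the function names `conv_features` and `max_over_time`, the toy matrix and the filter weights are all hypothetical.

```python
def relu(x):
    """Rectified linear unit, the nonlinearity f named in the text."""
    return x if x > 0.0 else 0.0


def conv_features(M, filters, biases, h):
    """Apply each filter to every window of h consecutive rows of M.

    M       : n x d matrix, one word vector per row.
    filters : m filters, each an h x d list of weights (hypothetical values).
    biases  : one bias per filter.
    Returns an (n - h + 1) x m feature matrix.
    """
    n = len(M)
    features = []
    for i in range(n - h + 1):
        window = M[i:i + h]
        row = []
        for w, b in zip(filters, biases):
            total = b
            for j in range(h):
                total += sum(wv * xv for wv, xv in zip(w[j], window[j]))
            row.append(relu(total))
        features.append(row)
    return features


def max_over_time(features):
    """Keep the maximum of each column, i.e. one value per filter."""
    return [max(column) for column in zip(*features)]


# Toy example: n = 4 words, d = 2 dimensions, m = 2 filters, window size h = 2.
M = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filters = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [0.0, -1.0]],
]
biases = [0.0, 1.5]
c = conv_features(M, filters, biases, h=2)
pooled = max_over_time(c)
```

With several window sizes, the pooled vectors of each size would be concatenated before the fully-connected layers.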

Bidirectional LSTM
The LSTM architecture used in our system is a kind of modern Recurrent Neural Network (RNN). Compared with a CNN, the way an RNN works is closer to how humans read sentences: a word vector sequence x, converted from a tweet, is fed to the RNN in order.
At each time step $t$, the RNN takes its input from the current word $x_t$ and from the previous hidden state $h_{t-1}$ to calculate the hidden state $h_t$ and the output $\hat{y}_t$, which means that $\hat{y}_t$ at time $t$ is influenced by all previous input words $x_1, ..., x_{t-1}$. However, this regular RNN suffers from the exploding and vanishing gradient problem when trained with the backpropagation algorithm (Hochreiter, 1998), which makes it hard to train. Therefore, we use the Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997) to overcome this problem. In an LSTM, each ordinary node of the hidden layer is replaced by a memory cell. In the LSTM equations, the vector $h_t$ is the value of the hidden layer at time $t$, $g_t$ is the input node, $i_t$ is the input gate, $f_t$ is the forget gate, $o_t$ is the output gate and $s_t$ is the internal state, and $\odot$ denotes pointwise multiplication. Following Zaremba and Sutskever (2014), the function $\phi$ used here is the tanh function.
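For reference, one common way of writing the LSTM memory cell with exactly these symbols is given below. It is offered as a sketch rather than as the authors' own equations: the weight matrices $W$ and biases $b$ are names assumed here, and $\sigma$ denotes the logistic sigmoid.

```latex
\begin{aligned}
g_t &= \phi(W_{gx} x_t + W_{gh} h_{t-1} + b_g) \\
i_t &= \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i) \\
f_t &= \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f) \\
o_t &= \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o) \\
s_t &= g_t \odot i_t + s_{t-1} \odot f_t \\
h_t &= \phi(s_t) \odot o_t
\end{aligned}
```

The forget gate $f_t$ scales the previous internal state $s_{t-1}$, which is what lets the gradient flow across many time steps without vanishing.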
For every point in a given sequence, Graves et al. (2005) show that a bidirectional LSTM can preserve more sequential information about the points before and after it. As Figure 2 shows, we concatenate the hidden states of two separate LSTMs after they process the word sequence in opposite directions, obtaining the concatenated state $h \in \mathbb{R}^{2m}$. This state is fed to fully connected layers, and a single sigmoid neuron gives the final result.

LSTM-CNN
The LSTM-CNN architecture is a combination of the previous two models. Instead of being fed to the fully connected layers, the outputs $h_t$ of the LSTM at each time $t$ are used as the input of the CNN. Figure 3 shows the architecture.

A CNN-based Attention Model (CA)
Since the attention mechanism has achieved significant improvements in many NLP tasks, including machine translation (Bahdanau et al., 2014), caption generation (Xu et al., 2015) and text summarization (Rush et al., 2015), it has become an integral part of compelling sequence modeling and transduction models in various tasks. Motivated by the work of Du et al. (2017) on sentence classification, our CNN-based attention model resembles theirs. We first use a CNN-based network to model the attention signal in sentences. The convolution operation is the same as that described in Section 2.1. The attention signal of the original text is represented by the output of the convolutional filter. To reduce noise, multiple filters with the same window size are applied. After that, we get the corresponding attention similarity: [...] So far, we have obtained the attention signal $c_t$ and the corresponding RNN hidden state vector $h_t$. The representation of the whole sentence can be computed as $s = \frac{1}{T} \sum_{t=0}^{T-1} c_t h_t$. Then $s \in \mathbb{R}^d$ is fed into a fully-connected network (L1, L2). L2 is followed by a single sigmoid neuron that generates the affect prediction on the interval [0, 1]. The architecture of this model is shown in Figure 4.
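The sentence representation, the attention-weighted average of the hidden states, can be illustrated with a short sketch. It assumes that each attention signal c_t is a scalar and each hidden state h_t is a d-dimensional vector. The function name `sentence_representation` and the toy values are hypothetical.

```python
def sentence_representation(attention, hidden_states):
    """Compute s = (1 / T) * sum_t c_t * h_t.

    attention     : list of T scalar attention signals c_t (assumed scalar).
    hidden_states : list of T hidden state vectors h_t, each of length d.
    Returns the sentence vector s of length d.
    """
    T = len(hidden_states)
    if T == 0 or len(attention) != T:
        raise ValueError("need one attention signal per hidden state")
    d = len(hidden_states[0])
    s = [0.0] * d
    for c_t, h_t in zip(attention, hidden_states):
        for k in range(d):
            s[k] += c_t * h_t[k]
    return [value / T for value in s]


# Toy example with T = 2 time steps and d = 2 hidden units.
s = sentence_representation([0.5, 1.0], [[2.0, 4.0], [1.0, 3.0]])
```

Time steps with a larger attention signal contribute more to s, so words the CNN marks as salient dominate the sentence vector.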

Ensemble Model
According to the results of SemEval-2017 Task 4, the use of ensembles stood out clearly. Therefore, we combine several deep learning methods to obtain better predictive performance. Inspired by boosting algorithms, we use a logistic regression to improve on the accuracy of the four methods; the architecture is shown in Figure 5. To keep the model simple, it takes only the outputs of the four methods as input, not the training data.
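A logistic regression over the four model outputs can be sketched as below. This is an illustration under stated assumptions, not the authors' code: it minimizes the mean absolute error with plain full-batch (sub)gradient descent instead of the Adam optimizer used in the paper, and the toy scores, the learning rate, the step count and all function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, scores):
    """Combine the four model scores into one intensity in (0, 1)."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, scores)))


def mae(weights, bias, data):
    """Mean absolute error of the ensemble over (scores, gold) pairs."""
    return sum(abs(predict(weights, bias, x) - y) for x, y in data) / len(data)


def train(data, lr=0.1, steps=2000):
    """Full-batch subgradient descent on the MAE (stand-in for Adam)."""
    n_inputs = len(data[0][0])
    weights, bias = [0.0] * n_inputs, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * n_inputs, 0.0
        for x, y in data:
            p = predict(weights, bias, x)
            sign = (p > y) - (p < y)  # subgradient of |p - y|
            delta = sign * p * (1.0 - p) / len(data)
            for k in range(n_inputs):
                grad_w[k] += delta * x[k]
            grad_b += delta
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


# Hypothetical scores from (CNN, BLSTM, LSTM-CNN, CA) with gold intensities.
data = [
    ([0.20, 0.25, 0.15, 0.20], 0.2),
    ([0.80, 0.75, 0.85, 0.80], 0.8),
    ([0.50, 0.55, 0.45, 0.50], 0.5),
]
initial_loss = mae([0.0] * 4, 0.0, data)
weights, bias = train(data)
final_loss = mae(weights, bias, data)
```

Because the ensemble sees only four numbers per tweet, it has five parameters in total, which keeps it cheap to train on the full batch.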

Implementation
We implemented our system with PyTorch (Paszke et al., 2017) in Python 3.
Preprocessing: To clean the tweet strings, we apply a preprocessing procedure to the input tweets that removes abbreviations such as 's and 've and converts the text to lowercase.
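A minimal sketch of such a cleaning step is shown below. The text names only 's and 've as examples, so the suffix list, the regular expression and the function name `preprocess` are assumptions rather than the authors' exact rules.

```python
import re

# Assumed suffix list: only the two examples given in the text.
_SUFFIXES = re.compile(r"'(?:s|ve)\b")


def preprocess(tweet):
    """Lowercase the tweet and strip the 's and 've suffixes."""
    return _SUFFIXES.sub("", tweet.lower())


cleaned = preprocess("It's what I've SAID")
```

Note that this pattern matches only the straight ASCII apostrophe; tweets containing typographic apostrophes would need an extended pattern.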
Model Hyper-parameters: Table 1 and Table 2 show the hyper-parameters we use in our system.
Each of the four methods uses at most two fully-connected layers, and every fully-connected layer is followed by ReLU. Dropout is applied to the outputs of the pooling layers and LSTMs before they are fed to the fully-connected layers; the details are given in Table 2.

Training
The dataset used in our system is provided by the AIT task, and no external datasets are used during training. Subtasks EI-reg and V-reg are trained with the same model hyper-parameters, listed in Table 1 and Table 2. The four methods also use the same word embeddings: pre-trained 300-dimensional GloVe word vectors trained on Common Crawl. For each emotion, we train the models for 10 epochs. The network parameters are learned by minimizing the Mean Absolute Error (MAE) between the gold labels and the predictions, and the four methods are trained separately. We optimize the loss function by backpropagation, using mini-batch gradient descent with a batch size of 8 for the four deep learning models and full-batch learning for the ensemble model. All models use the Adam optimization algorithm (Kingma and Ba, 2014), with an initial learning rate of 0.001 for the four deep learning models and 0.01 for the ensemble model.
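The loss and the batching scheme described above can be written down in a few lines. This is a plain-Python sketch for illustration only: the helper names are hypothetical, and whether the authors shuffle the data between epochs is not stated, so no shuffling is shown.

```python
def mean_absolute_error(gold, predicted):
    """MAE between gold intensity labels and model predictions."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal length")
    return sum(abs(g - p) for g, p in zip(gold, predicted)) / len(gold)


def minibatches(examples, batch_size=8):
    """Yield consecutive mini-batches; the last one may be smaller."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


loss = mean_absolute_error([0.5, 0.2], [0.4, 0.4])
batch_sizes = [len(batch) for batch in minibatches(list(range(20)))]
```

For the ensemble model, full-batch learning corresponds to setting `batch_size` to the size of the training set.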

Result and Discussion
Table 3 compares the results of the four methods used in our system, the ensemble system, the SVM Unigrams Baseline provided by the AIT task, and the best-performing system, SeerNet. The metric for evaluating performance is the Pearson correlation.
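For readers who want to reproduce the metric, the Pearson correlation between gold and predicted intensities can be computed as in the sketch below. The function name `pearson` is hypothetical, and the official task scorer should be preferred for reported numbers.

```python
import math


def pearson(gold, predicted):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(gold)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_g = sum(gold) / n
    mean_p = sum(predicted) / n
    cov = sum((g - mean_g) * (p - mean_p) for g, p in zip(gold, predicted))
    var_g = sum((g - mean_g) ** 2 for g in gold)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_g == 0.0 or var_p == 0.0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_g * var_p)


r_same_trend = pearson([0.1, 0.5, 0.9], [0.2, 0.6, 1.0])
r_opposite = pearson([0.1, 0.5, 0.9], [0.9, 0.5, 0.1])
```

Because the metric is invariant to shifting and positive scaling, a system is rewarded for ranking tweets correctly by intensity even when its absolute scores are offset.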
It is notable that, compared with the individual models, our ensemble model improves by at least 2% on the EI-reg subtask and 1.1% on the V-reg subtask. However, there is a clear gap between our models and the best-performing system. The rough preprocessing method of our system is one of the reasons for the low score. Because some words in tweets are misspelled or written in a special form such as 'yaaaaay!', some information is lost in this process. We therefore added an experiment on the V-reg task to study the effect of the preprocessing method. We replace our text preprocessing with ekphrasis for tokenization, word normalization, word segmentation (for splitting hashtags) and spell correction, and keep the other parameters unchanged. As shown in Table 4, the four methods and the ensemble model all improve. However, some expressions such as dates, URLs, hashtags and emoticons are converted into special tokens such as <date>, <url>, <hashtag> and <joy>, and these tokens are not in the vocabulary of the pre-trained word vectors, so their information is still lost in the embedding process.
There is much room for improvement in our method:
1. Our system uses a single pre-trained word embedding, a choice that lacks experimental evidence. In future work, combining several kinds of word embeddings should be considered.
2. We tune the hyper-parameters by evaluating on the dev dataset. In future work, we can apply a more advanced strategy such as cross-validation.
3. For the input features, we use only the word vectors. We should experiment with more features, such as lexicons.
4. Our system uses only a simple logistic regression, yet it achieves an impressive result on the two subtasks. Finding a better ensemble model is an interesting direction for further work.

Conclusion
In this paper, we propose a model for subtasks EI-reg and V-reg of SemEval-2018 Task 1: Affect in Tweets. The submitted system is an ensemble model based on CNN, Bidirectional LSTM (BLSTM), LSTM-CNN and a CNN-based Attention model (CA). All methods are described in detail to make our work replicable. For future work, it would be valuable to improve the preprocessing of tweets and to experiment further with word embeddings, feature selection, model validation and ensemble methods.