YZU-NLP Team at SemEval-2016 Task 4: Ordinal Sentiment Classification Using a Recurrent Convolutional Network

Sentiment analysis of tweets has attracted considerable attention recently for potential use in commercial and public sector applications. Typical sentiment analysis classifies the sentiment of sentences into several discrete classes (e.g., positive and negative). The aim of Task 4 subtask C of SemEval-2016 is to classify the sentiment of tweets into an ordinal five-point scale. In this paper, we present a system that uses word embeddings and recurrent convolutional networks to complete the competition task. The word embeddings provide a continuous vector representation of words for the recurrent convolutional network to use in building sentence vectors for multi-point classification. The proposed method ranked second among eleven teams in terms of micro-averaged MAE (mean absolute error) and eighth for macro-averaged MAE.


Introduction
Sentiment analysis seeks to detect and analyze sentiment within texts. Following the rapid increase of user-generated content in social media, sentiment analysis has attracted considerable interest. Typical approaches classify the sentiment of a sentence into several discrete classes, such as positive and negative polarities, or six basic emotions: anger, happiness, fear, sadness, disgust and surprise (Ekman, 1992). Based on this representation, various techniques have been investigated, including supervised learning and lexicon-based approaches. Supervised learning approaches require training data for sentiment classification (Go et al., 2009; Yu et al., 2009; Saif et al., 2016), while lexicon-based approaches do not require training data but use a sentiment lexicon to determine the overall sentiment of a sentence (Liu, 2010; Hu et al., 2013).
A five-point scale (Nakov et al., 2016) is also a popular way to evaluate sentiment. Many companies, such as Amazon, Google, and Alibaba, use a multi-point scale to evaluate product or app reviews. Unlike typical classification approaches, ordinal classification can assign different ratings (e.g., very negative, negative, neutral, positive and very positive) according to sentiment strength (Taboada et al., 2011; Li et al., 2011; Yu et al., 2013; Wang and Ester, 2014).
Task 4 subtask C of SemEval-2016 seeks to classify the sentiment of tweets on an ordinal five-point scale. This paper presents a system that uses word embeddings (Mikolov et al., 2013) and recurrent convolutional networks to this end. The word embeddings capture both semantic and syntactic information about words and provide a continuous vector representation of them. These word vectors are then used to build sentence vectors through a recurrent convolutional neural network. For multi-point classification, we discretize the continuous sentiment intensity into five intervals of equal width.
The proposed recurrent convolutional network stacks three components: a convolutional neural network (CNN) (LeCun et al., 1990) at the bottom to reduce the dimension of the sentence matrix, followed by a long short-term memory (LSTM) (Hochreiter et al., 1997) layer to form the sentence representation, and a linear regression layer on top to fit the sentiment intensity of sentences. The details of the CNN, the LSTM and their combination are described in the following section.

Combining LSTM and CNN for Ordinal Classification
Ordinal classification of sentiment aims at classifying sentences into ordinal discrete values according to their sentiment intensity. Figure 1 shows the system architecture of the proposed CNN-LSTM model for ordinal classification. In the bottom layer, word vectors for the vocabulary are first trained on a large corpus. For each given sentence, a sentence vector is then built from the word vectors of the words in the sentence and further transformed into a matrix representation. The sentence matrix is passed sequentially through a convolutional layer and a max pooling layer for multi-point classification. Unlike a conventional LSTM model, which directly uses word embeddings as input, the proposed model uses the outputs of a single-layer CNN with max pooling.

Convolutional Neural Network
In our model, the input of the LSTM layer is the output of the CNN. CNNs have achieved state-of-the-art results in computer vision applications and have also been shown to be effective for various NLP applications (Krizhevsky et al., 2012; Kim, 2014; Ma et al., 2015). The CNN architecture used for our task is described as follows.
Let V denote the vocabulary of words, d the dimensionality of the word vectors, and $S \in \mathbb{R}^{d \times n}$ the sentence matrix built by concatenating the word vectors of the words occurring in a sentence. Suppose that the sentence T is made up of a sequence of words $[d_1, d_2, \ldots, d_n]$, where n is the length of T. Then T is represented by the matrix $S_T \in \mathbb{R}^{d \times n}$, whose j-th column is the embedding of word $d_j$. Note that for batch processing we zero-pad the sentence matrix $S_T$ so that the number of columns is constant (equal to the maximum sentence length) for all sentences in the corpus.
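The construction of the zero-padded sentence matrix can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function name `sentence_matrix`, the toy embeddings, the choice to map unknown words to a zero vector, and the truncation of over-long sentences are all assumptions made for the example.

```python
def sentence_matrix(tokens, embeddings, dim, max_len):
    """Return a dim x max_len matrix (a list of dim rows) whose j-th column is
    the embedding of the j-th token; missing columns are zero-padded.

    Assumptions for this sketch: unknown words get a zero vector and
    sentences longer than max_len are truncated.
    """
    columns = [embeddings.get(tok, [0.0] * dim) for tok in tokens[:max_len]]
    columns += [[0.0] * dim for _ in range(max_len - len(columns))]
    # Transpose the list of columns into dim rows of length max_len.
    return [[col[i] for col in columns] for i in range(dim)]


# Toy 3-dimensional embeddings (hypothetical values).
toy_embeddings = {"good": [0.5, 0.1, -0.2], "movie": [0.0, 0.3, 0.4]}
S_T = sentence_matrix(["good", "movie"], toy_embeddings, dim=3, max_len=4)
```

Here `S_T` has d = 3 rows and 4 columns, with the last two columns filled with zeros.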
We apply a narrow convolution between $S_T$ and a filter $F \in \mathbb{R}^{d \times w}$ of width w. We then add a bias term and apply a nonlinearity function to obtain a feature map. The feature maps are input into a max pooling layer to capture the most salient feature (i.e., the one with the highest value) for a given filter. The filter operation is useful for detecting n-grams, where the size of the n-gram corresponds to the filter width.
The above description uses just one filter matrix to generate one feature. In practice, the proposed convolutional layer uses multiple filters in parallel to obtain the feature vectors.
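The convolution and pooling steps described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions rather than the system's actual code: the function names, the toy matrix and filter values, the stride of 1, and the non-overlapping pooling windows are assumptions; ReLU is used as the nonlinearity and a pool length of 2 is used in the example, in line with the settings reported in the implementation details.

```python
def narrow_conv_feature_map(S, F, bias=0.0):
    """Narrow convolution of a d x n sentence matrix S with a d x w filter F,
    followed by a bias term and a ReLU nonlinearity.

    Returns a feature map with n - w + 1 values (stride 1, no padding).
    """
    d, n, w = len(S), len(S[0]), len(F[0])
    feature_map = []
    for start in range(n - w + 1):
        # Inner product between the filter and the window of w columns.
        total = sum(S[i][start + k] * F[i][k] for i in range(d) for k in range(w))
        feature_map.append(max(0.0, total + bias))
    return feature_map


def max_pool(feature_map, pool_length):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(feature_map[i:i + pool_length])
            for i in range(0, len(feature_map) - pool_length + 1, pool_length)]


# Toy sentence matrix with d = 2 and n = 5 (hypothetical values).
S = [[1.0, 2.0, 0.0, -1.0, 3.0],
     [0.0, 1.0, 1.0, 2.0, -2.0]]
# Two hypothetical filters of width w = 3, applied in parallel.
filters = [
    [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]],
    [[0.5, 0.5, 0.5], [-1.0, 0.0, 1.0]],
]
feature_maps = [narrow_conv_feature_map(S, F) for F in filters]
pooled = [max_pool(fm, pool_length=2) for fm in feature_maps]
```

Each filter yields one feature map of length n - w + 1 = 3, and pooling keeps the largest value in each window of two adjacent positions.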

Recurrent Neural Network
A recurrent neural network (RNN) is an architecture particularly suited for modelling sequence phenomena (Sak et al., 2014). At each time step t, the RNN takes the input vector $x_t$ and the previous hidden state $h_{t-1}$ and produces the next hidden state $h_t = f(W x_t + U h_{t-1} + b)$. Here W, U, b are the parameters of an affine transformation and f is an element-wise nonlinearity function. In theory, the RNN can summarize all historical information up to time t with the hidden state $h_t$. In practice, however, a vanilla RNN has difficulty learning long-term dependencies due to the vanishing gradient problem: the gradient decreases exponentially with the number of network layers, so the front layers train very slowly. Approaches have been developed to deal with the vanishing gradient problem, and certain types of RNNs (such as LSTM and GRU) are specially designed to get around it. LSTM (Hochreiter et al., 1997) addresses the problem of learning long-term dependencies by augmenting the RNN with a gating mechanism. To illustrate this, the following formulas show how an LSTM calculates a hidden state $h_t$.
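A common way to write the LSTM update is given below. It is the standard formulation and is offered as a sketch, not necessarily the exact notation of this paper: the superscripted parameter names $W^{j}, U^{j}, b^{j}$ and the candidate memory $g_t$ are assumed notation.

```latex
\begin{aligned}
i_t &= \sigma(W^{i} x_t + U^{i} h_{t-1} + b^{i}) \\
f_t &= \sigma(W^{f} x_t + U^{f} h_{t-1} + b^{f}) \\
o_t &= \sigma(W^{o} x_t + U^{o} h_{t-1} + b^{o}) \\
g_t &= \tanh(W^{g} x_t + U^{g} h_{t-1} + b^{g}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```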
Here $\sigma(\cdot)$ and $\tanh(\cdot)$ are the element-wise sigmoid and hyperbolic tangent functions, $\odot$ is the element-wise multiplication operator, and $i_t$, $f_t$, $o_t$ are called the input, forget and output gates, respectively. All the gates have the same dimension $d_s$, which is equal to the size of the hidden state, and $c_0$, $h_0$ are initialized to zero vectors at t=1. $c_t$ is the internal memory of the unit, which can be regarded as how we want to combine the previous memory and the new input. The gating mechanism allows the LSTM to model long-term dependencies. By learning the parameters, the network learns how its memory cells should behave.
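A single LSTM time step can be sketched in plain Python as follows. The sketch assumes the standard LSTM formulation with a candidate memory `g_t` (a name not used in the text); the function names, the parameter layout (a dictionary mapping each gate name to a tuple `(W, U, b)`) and the toy weight values are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, U, b, x, h):
    """Compute W x + U h + b for list-based matrices and vectors."""
    return [sum(w_k * x_k for w_k, x_k in zip(W_row, x))
            + sum(u_k * h_k for u_k, h_k in zip(U_row, h)) + b_k
            for W_row, U_row, b_k in zip(W, U, b)]


def lstm_step(params, x_t, h_prev, c_prev):
    """One LSTM update; params maps 'i', 'f', 'o', 'g' to (W, U, b)."""
    i_t = [sigmoid(v) for v in affine(*params["i"], x_t, h_prev)]    # input gate
    f_t = [sigmoid(v) for v in affine(*params["f"], x_t, h_prev)]    # forget gate
    o_t = [sigmoid(v) for v in affine(*params["o"], x_t, h_prev)]    # output gate
    g_t = [math.tanh(v) for v in affine(*params["g"], x_t, h_prev)]  # candidate memory
    # New memory: keep part of the old memory and add part of the candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def constant_matrix(rows, cols, value):
    return [[value] * cols for _ in range(rows)]


d_in, d_s = 3, 2  # toy input size and hidden size
params = {gate: (constant_matrix(d_s, d_in, 0.1),
                 constant_matrix(d_s, d_s, 0.1),
                 [0.0] * d_s)
          for gate in "ifog"}

# h_0 and c_0 start as zero vectors, as stated in the text.
h, c = [0.0] * d_s, [0.0] * d_s
for x_t in [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]:
    h, c = lstm_step(params, x_t, h, c)
```

After the loop, `h` holds the hidden state for the last time step, which a model of this kind could use as the sentence representation.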

Experiments and Evaluation
Dataset. We evaluated the proposed CNN-LSTM model by submitting the results to SemEval-2016 Task 4 subtask C. The statistics of the dataset used in this competition are summarized in Table 1. Because tweets may have been removed by their authors, we could download only part of the gold training, gold development, and gold development-test data. The distribution of sentiment labels in Table 2 shows that the data are imbalanced. Most of the data were annotated with the labels [-1, 0, 1], and only a few were annotated Very Negative (-2) or Very Positive (2).

Implementation details. As mentioned earlier, the proposed method consists of word embeddings and a recurrent convolutional network. Each part has its own parameters to optimize. For word embeddings, we used popular pre-trained word vectors from GloVe (Pennington et al., 2014). GloVe is an unsupervised learning algorithm for learning word representations. Training is performed on aggregated global word co-occurrence statistics from a large corpus, and the resulting representations exhibit interesting linear substructures in the word vector space. The GloVe authors provide pre-trained word vectors trained on 840B tokens from Common Crawl, each with 300 dimensions.
Although the pre-trained word embeddings can capture important semantic and syntactic information about words, they are not sufficient to capture sentiment behavior in texts. To further improve the word embeddings so that they capture sentiment information, we trained our recurrent convolutional network using an additional dataset from the VADER corpus (Hutto et al., 2014). It contains 4,000 tweets pulled from Twitter's public timeline, independently annotated by 20 human raters with sentiment ratings in the range [-4, 4]. We discretized the continuous human-assigned ratings in [-4, 4] to the discrete values [-2, -1, …, 2] to make them compatible with the task.

The hyper-parameters of the network were chosen based on the performance on the development-test data. We use rectified linear units (ReLU), filter windows (w) of 3 with 64 feature maps, a dropout rate (p) of 0.25, a pool length of 2, and a mini-batch size of 16. The Adagrad update rule is used to automatically tune the learning rate, and micro-averaged MAE is used as the loss function. Early stopping is used to avoid overfitting. The activation function in the top layer is a sigmoid function, which scales each sentiment intensity to the range 0 to 1. These continuous intensity scores are transformed into a five-point scale through the cut-offs [0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0] for strongly negative, negative, neutral, positive, and strongly positive, respectively.

Evaluation metrics. SemEval-2016 Task 4 subtask C published the results for all participants using both macro-averaged mean absolute error ($MAE^M$) and micro-averaged mean absolute error ($MAE^\mu$) (Nakov et al., 2015). The $MAE^M$ is defined as:

$$MAE^M(h, Te) = \frac{1}{|C|} \sum_{j=1}^{|C|} \frac{1}{|Te_j|} \sum_{X_i \in Te_j} |h(X_i) - y_i|$$

where $y_i$ denotes the true label of $X_i$, $h(X_i)$ denotes the predicted label, $C$ is the set of classes, and $Te_j$ denotes the set of test documents whose true class is $c_j$.
The $MAE^\mu$ is defined as:

$$MAE^\mu(h, Te) = \frac{1}{|Te|} \sum_{X_i \in Te} |h(X_i) - y_i|$$

Compared to the micro-averaged $MAE^\mu$, the macro-averaged $MAE^M$ is more appropriate for measuring the classification robustness of systems on imbalanced data.
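The discretization of the sigmoid output and the two evaluation measures can be sketched together in plain Python. This is an illustrative sketch, not the official scorer: the function names and the toy scores and labels are hypothetical, and the macro average is taken over the classes that actually occur among the true labels, which is an assumption of this example.

```python
from collections import defaultdict


def to_five_point(score):
    """Map a sigmoid output in [0, 1] to a label in {-2, -1, 0, 1, 2} using the
    cut-offs [0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]."""
    for label, upper in zip((-2, -1, 0, 1), (0.2, 0.4, 0.6, 0.8)):
        if score <= upper:
            return label
    return 2


def micro_mae(y_true, y_pred):
    """Mean absolute error over all test items."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_mae(y_true, y_pred):
    """Mean of the per-class mean absolute errors, grouped by true class."""
    errors_by_class = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        errors_by_class[t].append(abs(p - t))
    per_class = [sum(errs) / len(errs) for errs in errors_by_class.values()]
    return sum(per_class) / len(per_class)


# Toy model outputs and gold labels (hypothetical values).
scores = [0.05, 0.35, 0.5, 0.55, 0.7, 0.95]
y_pred = [to_five_point(s) for s in scores]
y_true = [-2, 0, 0, 0, 1, 1]

print("micro-averaged MAE:", micro_mae(y_true, y_pred))
print("macro-averaged MAE:", macro_mae(y_true, y_pred))
```

Because the macro average weights every class equally, errors on rare classes such as -2 and 2 influence it more strongly than they influence the micro average.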

Results.
A total of eleven teams participated in subtask C. Table 3 and Table 4 show the results for this subtask; Table 4 in particular reports the experimental results after the release of the test set ratings. We found that the CNN-LSTM achieved better performance on the development-test set than on the test set. Conversely, the CNN alone yielded better performance on the test set than on the development-test set.

Conclusions
This study presents a deep learning approach to classifying tweets on a five-point scale. The proposed model combines convolutional neural networks and long short-term memory networks. To better capture the sentiment aspect of words, we further tuned our model using an additional sentiment corpus. Experimental results show that the proposed method achieved good performance on the micro-averaged MAE.
Future work will focus on exploring more effective features and machine learning methods to improve classification performance for both micro- and macro-averaged MAE.