YNUDLG at SemEval-2017 Task 4: A GRU-SVM Model for Sentiment Classification and Quantification in Twitter

Sentiment analysis is one of the central issues in Natural Language Processing and has become more and more important in many fields. Typical sentiment analysis classifies the sentiment of sentences into several discrete classes (e.g.,positive or negative). In this paper we describe our deep learning system(combining GRU and SVM) to solve both two-, three- and five-tweet polarity classifications. We first trained a gated recurrent neural network using pre-trained word embeddings, then we extracted features from GRU layer and input these features into support vector machine to fulfill both the classification and quantification subtasks. The proposed approach achieved 37th, 19th, and 14rd places in subtasks A, B and C, respectively.


Introduction
Sentiment analysis (SA) is a field of knowledge which deals with the analysis of people's opinions, sentiments, evaluations, appraisals, attitudes and emotions towards particular entities (Liu, 2012). Typical approaches to sentiment analysis is to classify the sentiment of a sentence into several discrete classes such as positive and negative polarities, or six basic emotions: anger, happiness, fear, sadness, disgust and surprise (Ekman,1992). SA is widely considered to be one of the most popular and challenging, competitive and the hot research area in computational linguistics. There are many ways to tackle the sentiment classification problems, such as random forest, support vector machine (SVM), Bayes classifier. In addition, there are many cha-llenges, such as analysis of noise texts (e.g. oral language) in natural language processing tasks, despite numerous notable advances in recently years (e.g., Breck et al,. 2007;Yessenalina and Cardie, 2011;). Based on this, our way is to extract features with Gated Recurrent Unit (GRU) and classify sentences by SVM using these features.
Task 4 subtask A is to classify a tweet's sentiment as positive, negative, or neutral. Subtask B (Tweet classification according to a two-point scale) requires classifying a tweet's sentiment towards the given topic. Similar to B, subtask C is a five-point scale (Nakov et al., 2016). Unlike typical classification approaches, ordinal classification can assign different ratings (e.g., very negative, negative, neutral, positive and very positive) according to the sentiment strength (Taboada et al., 2011;Li et al., 2011;Yu et al., 2013;Wang and Ester, 2014). This paper presents a system that combine GRU and SVM to process subtasks A, B and C. Our system uses a GRU neural network with word embeddings (Mikolov et al,. 2013) that are slightly fine-tuned (Yoon Kim et al,. 2014) on each training set. The word embeddings were obtained by training GloVe (Jeffrey Pennington et al,. 2014) on 2 billion tweets that we crawled for this purpose. These word vectors are then used to build sentence vectors through a recurrent convolutional neural network.
The proposed gated recurrent neural network consists of the GRU layer and SVM classifier. The choice based on the following two reasons: (1) it is more computational efficient than Convolution Neural Network (CNN) models (Lai et al., 2015); (2) unlike CNN, it also can extract long semantic patterns without tuning the parameter when training the model. Our system architecture is composed of a word embedding layer, drop out layer, GRU layer, a hyperbolic relu layer, SVM classifier and softmax layer. By capturing features from GRU layer, we obtain training and test data features, and integrate them with given labels as inputs, so SVM classifier can train the parameters.

Embedding Layer
The first layer in the network , we let xi∈Rk be the k-dimensional word vector corresponding to the i-th word in the sentence. A sentence of length n (padded where necessary) is represented as: x1:n = x1 ⊕ x2 ⊕ . . . ⊕ xn， (1) n is the maximum length of sentences and we set it to 50. When meeting short tweets we use fixed characters padding. Each word xi is represented by embedding vectors (w1,w2...wl) where l is set to 200. For word embeddings, we use pre-trained word vectors from GloVe (Pennington et al., 2014). GloVe is an unsupervised learning technology for learning word representation. The purpose of training is to use statistical information to find similarities among words and based on co-occurrence matrix and statistical information. We use them to provide pre-trained word vectors trained on 27B tokens from Twitter and with a length of 200. Words not presented in the set of pre-trained words are initialized randomly.

GRU Layer
The main layer in our model, the input to it is the sequence of length L and each word in it having k dimension. The gated recurrent network proposed in (Bahdanauetal., 2014) is a recurrent neural network (a neural network with feedback connection, see (Atiya and Parlos, 2000)) where the activation hj of the neural unit j at time t is a linear interpolation between the previous Where t j z is the update gate that determines how much time the units update its content, j t h is the newly computed candidate state.

Dropout Layer
Dropout is a regularization technique for reducing overfitting in neural networks by preventing complex coadaptations on training data Dropout refers to randomly let the weight of some hidden layer of network does not work in model when training, those nodes does not work can temporarily thought is not part of the network but the weight of it is retained(just temporarily not update), because it may work again the next time throwing samples into it.

Relu Layer
This layer is to allow the network to make complex decision by learning non-linear classification boundaries. We used more efficient function Rectified Linear Units (ReLU).

Soft-Max Layer
The output of the GRU layer and dropout layer is passed to a fully connected softmax layer. This layer calculates the classes probability distribution: where k w and k b are the weight vector and bias of the k-th class, respectively. For subtask A and B, the difference is three neurons and two neurons used (i.e., K=3 or K=2).

SVM classifier
Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection. Given a set of training examples, each training instance is marked as belonging to one or the other of the two categories, the SVM training algorithm to create a new instance will be assigned to one of two categories of models, making it a nonprobability binary linear classifier. Sklearn is a Python library of scientific computing and it provides several clustering algorithms. In our system, we from sklearn.svm import SVC module and redefine this function and parameters. As we all known, SVM is a binary classifier, but we also need process three and five classification problem. In the multiclass case, this is extended as per Wu et al. (2004). SVC and NuSVC implement the "one-against-one" approach (Knerr et al., 1990) for multi-class classification. If n_class is the number of classes, then n_class * (n_class -1) / 2 classifiers are constructed and each one trains data from two classes. To provide a consistent interface with other classifiers, the decisi-on_function_shape option allows to aggregate the results of the "one-against-one" classifiers to a decision function of shape (n_samples, n_classes).

Data
The training and development datasets used in our experiments were all datasets from SemEval 2013-2016 that labeled. Before training, we processed the data with the follow procedures: 1). The texts were lowercased by string.strip().lower(), 2).We only retain punctuation, exclamation mark, question mark and comma, 3). Tokenize each tweet using blank in sentences, 4). Emoticons like (^.^) have been deleted, 5).Using the patterns described in Table 1

Experiments and Results
All our experiments have been developed using Keras deep learning library with Theano backend, and with CUDA enabled. And all our experiments were performed on a computer with Intel Core(TM) i3 @3.4GHz 16GB of RAM and GeForce GTX 1060 GPU. The hyper-parameters of the network are chosen based on the performance on the dev-test data. We firstly carry out our system: put data into GRU model, as we known GRU can also train and test dataset at the same time we adjust some important parameters to make our GRU model to be the newest. Then we define a Theano function, and input para-meter is the GRU network portal, output is the GRU network dense layer before softmax layer. Using the Theano function, once we throw new data, we can get features that meeting our requirements. Last we throw these features into SVM so we can get classification results as we hope. After experiment we know our system performance well on two classification question, and poor on five classification. Some factors may cause this: small training data and in five-classification dataset many twitters belong to negative, neutral and positive; our deep model GRU maybe not actually called deep due to number of layers and our manually tuning.   Table 5: Results for Subtask C "Tweet classification according to a five-point scale", English. The systems are ordered by their MAEMscore (lower is better).

Conclusion
In this paper, we presented our GRU-SVM system used for SemEval 2017 Task4 (Subtasks A, B and C). The system used a gated recurrent layer as a core layer to extract features and then feed these features into SVM classifier.

Rank
System 1