YNU-HPCC at SemEval 2017 Task 4: Using A Multi-Channel CNN-LSTM Model for Sentiment Classification

In this paper, we propose a multi-channel convolutional neural network-long short-term memory (CNN-LSTM) model that consists of two parts, a multi-channel CNN and an LSTM, to analyze the sentiment of short English messages from Twitter. Unlike a conventional CNN, the proposed model applies a multi-channel strategy that uses several filters of different lengths to extract active local n-gram features at different scales. This information is then sequentially composed using the LSTM. By combining CNN and LSTM, we can consider both local information within tweets and long-distance dependency across tweets in the classification process. Officially released results show that our system outperforms the baseline algorithm.


Introduction
Social network services (SNSs) such as Twitter, Facebook, and Weibo are used daily to express thoughts, opinions, and emotions. On Twitter, 6000 short messages (tweets) are posted by users every second¹. Twitter is therefore considered one of the most concentrated opinion-expressing venues on the Internet. Subjective analysis of this type of user-generated content has become a vital task for politics, social networking, marketing, and advertising.
The potential applications of sentiment analysis motivated SemEval 2017 Task 4, a competition comprising a series of subtasks that focus on Twitter sentiment classification. Subtask A involves message polarity classification, which requires a system to classify whether a message is of positive, negative, or neutral sentiment. Subtasks B and C involve topic-based message polarity classification, which requires a system to classify a message on two- and five-point scales toward a certain topic.
¹ http://www.internetlivestats.com/twitter-statistics/
Various approaches have been proposed to analyze the sentiment of text, and deep neural networks have achieved state-of-the-art results in recent years. Text classification methods that have proven successful include convolutional neural networks (CNN) (LeCun et al., 1990; Y. Kim, 2014; Kalchbrenner et al., 2014) and Long Short-Term Memory (LSTM) (Hochreiter et al., 1997; Tai et al., 2015). In general, a CNN applies a convolutional layer to extract active local n-gram features, but loses the order of words. By contrast, an LSTM can model texts sequentially. However, it focuses only on past information and draws conclusions from the tail part of texts, and it fails to capture the local response from temporal data.
In this paper, we propose a multi-channel CNN-LSTM model for sentiment classification. It consists of two parts: a multi-channel CNN and an LSTM. Unlike a conventional CNN model, we apply a multi-channel strategy that uses several filters of different lengths. The model is thus able to extract active n-gram features at different scales. The LSTM is then applied to compose those features sequentially. By combining CNN and LSTM, both local information within tweets and long-distance dependency across tweets can be considered in the classification process. To effectively train the proposed neural model, which has many parameters, we pretrained it using a distant supervision approach (Go et al., 2009). We applied the proposed model in our participation in SemEval 2017 Task 4 Subtasks A, B, and C (Rosenthal et al., 2017).
The remainder of this paper is organized as follows. In Section 2, we detail the architecture and multi-channel strategy of our model. Section 3 summarizes the comparative results of our proposed model against the baseline algorithm. Section 4 offers a conclusion.
Figure 1 shows the architecture of our model. The model consists of six types of layers: embedding, convolution, max-pooling, LSTM, dense, and softmax. First, a tweet is input as a series of vectors of constituent words and transformed into a feature matrix by an embedding layer. The feature matrix is then passed into three parallel CNNs having different filter lengths. The max-pooling layer takes the maximum over the results of the different CNNs, which is intended to capture the salient features, and inputs it to the LSTM layer. Then, ordinary dense and softmax layers take the outputs of the LSTM and produce the final classification result.

Embedding Layer
The embedding layer is the first layer of the model. Each tweet is regarded as a sequence of word tokens $t_1, t_2, \ldots, t_N$, where N is the number of tokens. According to statistics of the tweets collected from Twitter (Section 3.1), about 95% of tweets are shorter than 30 words. Thus, we empirically limit the maximum of N to 30. Any tweet longer than 30 tokens is truncated to 30, and any tweet shorter than 30 is padded to 30 using zero padding. Every word is mapped to a d-dimensional word vector. The output of this layer is a matrix $T \in \mathbb{R}^{N \times d}$.
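The truncation and padding step can be illustrated with a minimal sketch. The function names, the use of index 0 for padding, and padding at the end of the sequence are assumptions made for illustration; the text only states that tweets are cut or zero-padded to 30 tokens.

```python
MAX_LEN = 30  # maximum token count N stated in the text
PAD_ID = 0    # assumption: index 0 is reserved for the padding token


def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return exactly max_len ids: cut long inputs, pad short ones at the end."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


def embed(token_ids, table):
    """Look up one d-dimensional vector per id, giving an N x d matrix (list of rows)."""
    return [table[i] for i in token_ids]


# Toy embedding table with d = 2; row 0 is the all-zero padding vector.
toy_table = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]
T = embed(pad_or_truncate([1, 2]), toy_table)
```

Here `T` has 30 rows, of which only the first two are non-zero.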

CNN Layer
In each CNN layer, m filters are applied to a sliding window of width w over the matrix produced by the embedding layer. Let $F \in \mathbb{R}^{w \times d}$ denote a filter matrix and b a bias. Assuming that $T_{i:i+j}$ denotes the token vectors $t_i, t_{i+1}, \ldots, t_{i+j}$ (with $t_k = 0$ if $k > N$), the result of each filter is a vector $f \in \mathbb{R}^{N}$, whose i-th element is generated by:
$$f_i = T_{i:i+w-1} \otimes F + b \qquad (1)$$
where $\otimes$ denotes the convolution operation. Before f is passed to the next layer, a nonlinear activation function is applied. Here, we use the ReLU function (Nair and Hinton, 2010) for faster calculation. A convolving filter with window width w extracts w-gram features. By applying multiple convolving filters in this layer, we can extract active local n-gram features at different scales. To keep the output sizes of different filters identical, we apply zero padding to the token vectors before convolution.
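As an illustration of a single filter, the sketch below slides a window of width w over the token matrix, treats positions beyond the last token as zero vectors, and applies ReLU. It assumes that the convolution is the sum of element-wise products between the window and the filter, as is common in text CNNs; the names `conv_filter` and `relu` are hypothetical.

```python
def relu(x):
    """Rectified linear unit: max(0, x)."""
    return x if x > 0.0 else 0.0


def conv_filter(T, F, b):
    """Apply one filter F (w x d) with bias b to the token matrix T (N x d).

    Rows past the end of T are treated as zero vectors (t_k = 0 for k > N),
    so the output has N elements regardless of the window width w.
    """
    N, w, d = len(T), len(F), len(F[0])
    zero = [0.0] * d
    out = []
    for i in range(N):
        s = float(b)
        for j in range(w):
            row = T[i + j] if i + j < N else zero
            s += sum(row[k] * F[j][k] for k in range(d))
        out.append(relu(s))
    return out


# Toy example with N = 3 tokens, d = 2, and a bigram filter (w = 2).
T = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
F = [[1.0, 1.0], [1.0, -1.0]]
f = conv_filter(T, F, b=0.0)  # -> [0.0, 1.0, 2.0]
```

Running m such filters for each of the three window widths yields three feature maps of identical size N x m.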

Max-over Pooling Layer
In this layer, the maximum value across the different filters is taken as the most salient feature. Because a CNN layer with window width w extracts w-gram features, the maximum values of the CNN layer outputs are considered the most salient information in the target tweet. We choose max pooling rather than mean pooling because the salient feature represents the most distinguishing trait of a tweet.
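The text does not specify the axis over which the maximum is taken. One possible reading, sketched below purely as an assumption, is an element-wise maximum over the equally sized outputs of the parallel CNN channels, which preserves the sequence axis needed by the subsequent LSTM. The function name `max_over_channels` is hypothetical.

```python
def max_over_channels(feature_maps):
    """Element-wise maximum over equally shaped feature maps.

    feature_maps: list of channels, each an N x m matrix (list of N rows).
    Returns one N x m matrix, keeping the sequence axis for the LSTM.
    This is an assumed interpretation, not a confirmed detail of the system.
    """
    n_rows = len(feature_maps[0])
    n_cols = len(feature_maps[0][0])
    return [
        [max(channel[i][j] for channel in feature_maps) for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Two toy channels with N = 2 positions and m = 2 filters each.
pooled = max_over_channels([
    [[1.0, 5.0], [3.0, 0.0]],
    [[2.0, 4.0], [1.0, 7.0]],
])
```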

LSTM Layer
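The architecture of a recurrent neural network (RNN) is suitable for processing sequential data. However, a simple RNN is usually difficult to train because of the vanishing gradient problem. To address this problem, LSTM introduces a gating structure that allows for explicit memory updates and deliveries. As shown in Figure 2, LSTM calculates the hidden state $h_t$ using the following equations:
Gates:
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \quad f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \quad o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \qquad (2)$$
Input transformation:
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \qquad (3)$$
State update:
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad h_t = o_t \odot \tanh(c_t) \qquad (4)$$
where $x_t$ is the input vector; $c_t$ is the cell state vector; W, U, and b are layer parameters; $f_t$, $i_t$, and $o_t$ are gate vectors; and $\sigma$ is the sigmoid function. Note that $\odot$ denotes the Hadamard product.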
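A single LSTM time step can be sketched in plain Python as follows. The sketch assumes the standard LSTM formulation with input, forget, and output gates; the function names and the dictionary layout of the parameters are illustrative choices, not details taken from the system.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, x, U, h, b):
    """Compute W x + U h + b for list-based matrices and vectors."""
    return [
        sum(wi * xi for wi, xi in zip(W_row, x))
        + sum(ui * hi for ui, hi in zip(U_row, h))
        + bi
        for W_row, U_row, bi in zip(W, U, b)
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params maps 'i', 'f', 'o', 'c' to (W, U, b) triples."""
    pre = {k: affine(W, x_t, U, h_prev, b) for k, (W, U, b) in params.items()}
    i_t = [sigmoid(z) for z in pre["i"]]        # input gate
    f_t = [sigmoid(z) for z in pre["f"]]        # forget gate
    o_t = [sigmoid(z) for z in pre["o"]]        # output gate
    c_tilde = [math.tanh(z) for z in pre["c"]]  # input transformation
    # State update: Hadamard products are element-wise multiplications.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

Iterating `lstm_step` over the pooled feature sequence and keeping the last `h_t` gives the vector passed to the hidden layer.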

Hidden Layer
This is a fully connected layer. It multiplies results from the previous layer with a weight matrix and adds a bias vector. The ReLU activation function is also applied. The result vectors are finally input to the output layer.

Output Layer
This layer outputs the final classification result. It is a fully connected layer using softmax as an activation function. The output of this layer is a vector calculated by:
$$P(y = j \mid x) = \frac{e^{x^{\top} w_j}}{\sum_{k=1}^{K} e^{x^{\top} w_k}} \qquad (5)$$
where x is the input vector, $w_j$ is the weight vector of class j, and K is the number of classes. Thus, the final classification result $\hat{y}$ is:
$$\hat{y} = \arg\max_{j} P(y = j \mid x) \qquad (6)$$
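The softmax output and the arg max decision can be sketched as follows. The function names are hypothetical, and subtracting the maximum score before exponentiation is a standard numerical-stability step that does not change the resulting probabilities.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # stabilizes exp() without changing the result
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(x, W):
    """W holds one weight vector w_j per class; returns (arg max class, probabilities)."""
    scores = [sum(xi * wi for xi, wi in zip(x, w_j)) for w_j in W]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Toy example with K = 3 classes and a 2-dimensional input.
label, probs = predict([1.0, 2.0], [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```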

Data Preparation
We implemented a simple tokenizer to process tweets into arrays of tokens. Because we participated only in the English tasks, all characters other than English letters or punctuation are ignored. The patterns shown in Table 1 are applied to every tweet. We applied the first four patterns and lowercased all letters to match the known tokens in the GloVe (Pennington et al., 2014) pretrained word vectors.
Table 1: Example of pre-processing pattern.
Before training on the given tweets, we pretrained the model using data with distant supervision. Two external datasets were used. The first was crawled from Twitter. Using the streaming API provided by Twitter, we collected approximately 428 million tweets (all published between Nov. 2016 and Jan. 2017). Approximately one sixth of them contained only one emoji or emoticon, which makes them well suited to weak labeling.
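A weak-labeling rule of this kind might look like the sketch below. The emoticon sets and the mapping from emoticon to polarity are hypothetical placeholders in the spirit of distant supervision (Go et al., 2009); the actual emoji and emoticon lists used by the system are not given in the text.

```python
# Hypothetical marker sets; the lists actually used are not stated in the paper.
POSITIVE = {":)", ":-)", ":D"}
NEGATIVE = {":(", ":-("}


def weak_label(tweet):
    """Return a weak label if the tweet contains exactly one known marker, else None."""
    hits = [(m, tweet.count(m)) for m in POSITIVE | NEGATIVE if m in tweet]
    # Require a single marker type that occurs exactly once.
    if len(hits) != 1 or hits[0][1] != 1:
        return None
    return "positive" if hits[0][0] in POSITIVE else "negative"
```

For example, `weak_label("great day :)")` yields `"positive"`, whereas a tweet containing both `:)` and `:(` is left unlabeled.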
The second dataset was from Sentiment140, which provides 1.6 million tweets with a balanced class distribution. We used GloVe pretrained data³ to initialize the weights of the embedding layer. GloVe is a popular unsupervised machine learning algorithm for acquiring word embedding vectors. It is trained on global word co-occurrence counts and achieves state-of-the-art performance on word analogy datasets. In this competition, we used the 200-dimensional word vectors that were pretrained on two billion tweets.
The hyper-parameters were tuned on the train and dev sets using the scikit-learn (Pedregosa et al., 2012) grid search function, which iterates through all possible parameter combinations to identify the best-performing one. The best parameters are as follows. The CNN filter count is m = 200; the filter lengths of the multiple convolution channels are 1, 2, and 3; and the dimension of the hidden layer in the LSTM is 512. To prevent over-fitting, we also applied dropout (Tobergte and Curtis, 2013) after the LSTM layer and the fully connected layer at a rate of 0.5. Training also uses early stopping (Prechelt, 1998), terminating if the validation loss has not improved within the last 5 epochs.
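The early stopping criterion can be illustrated with a small, framework-independent sketch. The function name and the convention of scanning a precomputed list of validation losses are assumptions for illustration; in practice the check would run inside the training loop.

```python
def best_epoch_with_early_stopping(val_losses, patience=5):
    """Scan per-epoch validation losses and stop once `patience` epochs
    have passed without a new minimum.

    Returns (index of the best epoch, index of the epoch at which training stops).
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# The loss bottoms out at epoch 2 and does not improve for the next 5 epochs.
losses = [0.9, 0.7, 0.6, 0.65, 0.66, 0.7, 0.64, 0.61, 0.8, 0.9]
best, stopped = best_epoch_with_early_stopping(losses)  # -> (2, 7)
```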

Evaluation Metrics
We evaluated our system on Subtasks A, B, and C. Subtask A was a three-point message polarity classification. Subtasks B and C involved ordinal sentiment classification on two- and five-point scales, respectively. The metrics of Subtasks A and B were average F1-score, average recall, and accuracy. The F1-score was calculated as:
$$F_1^{p} = \frac{2 \pi^{p} \rho^{p}}{\pi^{p} + \rho^{p}}$$
where $F_1^{p}$ is the F1-score of one class (p denotes positive here as an example), and $\pi^{p}$ and $\rho^{p}$ denote precision and recall, respectively.
³ http://nlp.stanford.edu/projects/glove/
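The metrics of Subtask C were $MAE^{M}$ and $MAE^{\mu}$, which were calculated as:
$$MAE^{M}(h, Te) = \frac{1}{|C|} \sum_{j=1}^{|C|} \frac{1}{|Te_j|} \sum_{x_i \in Te_j} |h(x_i) - y_i|$$
$$MAE^{\mu}(h, Te) = \frac{1}{|Te|} \sum_{x_i \in Te} |h(x_i) - y_i|$$
where $y_i$ is the true label of item $x_i$, $h(x_i)$ is the predicted label, Te is the test set, C is the set of classes, and $Te_j$ is the set of test documents whose true class is $c_j$. A higher F1-score, recall, and accuracy, and a lower $MAE^{\mu}$ and $MAE^{M}$, indicate better classification performance.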
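The per-class F1-score and the two mean absolute error variants can be sketched as follows. The function names are hypothetical, and the macro-averaged MAE here averages over the classes that actually occur among the true labels, which is an assumption for the case where some class has no test items. Which classes enter the averaged F1-score is not stated in this section, so the sketch leaves that averaging to the caller.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1-score of one class from its precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    predicted = sum(1 for p in y_pred if p == cls)
    actual = sum(1 for t in y_true if t == cls)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mae_micro(y_true, y_pred):
    """Mean absolute error over all test items."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mae_macro(y_true, y_pred):
    """Mean absolute error computed per true class, then averaged over classes."""
    per_class = {}
    for t, p in zip(y_true, y_pred):
        per_class.setdefault(t, []).append(abs(p - t))
    return sum(sum(e) / len(e) for e in per_class.values()) / len(per_class)


# Toy ordinal labels on a five-point scale from -2 to 2.
y_true = [-2, -2, 0, 0, 0, 2]
y_pred = [-1, -2, 0, 1, 0, 0]
micro = mae_micro(y_true, y_pred)
macro = mae_macro(y_true, y_pred)
```

In this toy example the macro-averaged error is larger than the micro-averaged one, because the single item of the rare class 2 is badly misclassified and each class carries equal weight.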

Results and Discussion
To assess the advantages of our system architecture, we ran a 5-fold cross validation on different combinations of layers, keeping the embedding and hidden layers in every configuration. A single LSTM achieved 0.617 accuracy on the train and dev data. A single CNN achieved 0.606, a multi-channel CNN 0.563, and a single CNN with LSTM 0.603. Our multi-channel CNN with LSTM outperformed all other architectures with an accuracy of 0.640. Table 2 presents the detailed results of our evaluation against the baseline algorithm. It is noteworthy that our system achieved 0.647 accuracy on Subtask A, as the best score for this subtask was 0.651. The evaluation results revealed that our proposed system considerably outperforms the baseline, which we attribute to our multi-channel CNN with LSTM architecture and distant supervision training. The proposed system can effectively

Conclusion
In this paper, we described our system submissions to the SemEval 2017 Workshop Task 4, which involved sentiment analysis in Twitter. The proposed multi-channel CNN-LSTM model combines CNN and LSTM to extract both local information within tweets and long-distance dependency across tweets. A large number of tweets with distant supervision were leveraged to pretrain the model. Officially released results revealed that our system outperformed all baseline algorithms, and ranked 14th on Subtask A, 10th on Subtask B, and 8th on MAE μ of Subtask C. In the future, we will attempt to enhance the tokenizer and model architecture to achieve an improved classification system.