MI&T Lab at SemEval-2017 Task 4: An Integrated Training Method of Word Vector for Sentiment Classification

A CNN-based method for the sentiment classification task of SemEval-2017 Task 4A is presented. To address the slow training of word vectors in word2vec, a method that trains word vectors by integrating word2vec and a Convolutional Neural Network (CNN) is proposed. This training method not only speeds up word2vec training but also makes the word vectors more effective for the target task. Furthermore, our word2vec uses a full connection between the input layer and the projection layer of the Continuous Bag-of-Words (CBOW) model in order to capture more of the semantic information of the original sentence.


Introduction
In Twitter sentiment analysis, the polarity of a Twitter message is classified as positive, negative or neutral. However, the ambiguity and rhetorical devices of natural language greatly increase the difficulty of sentiment analysis (Liu, 2012). In recent years, deep learning models have shown great potential for sentiment classification (Socher et al., 2011; Poria et al., 2015; Socher et al., 2013). For short texts such as tweets, the Convolutional Neural Network (CNN) model (Kim, 2014; Dos Santos and Gatti, 2014; Chen et al., 2016) is the most widely and successfully used, and the first-ranked system in the SemEval-2016 Task 4A competition also used a CNN model (Deriu et al., 2016). Our system therefore uses a CNN model to complete the task. Task 4A of SemEval 2017 is a polarity classification task that requires participating systems to classify a given Twitter message as positive, negative or neutral (Rosenthal et al., 2017).
The system integrates word2vec and CNN to train on the labeled data, generating a word vector for each word in the data. This method improves the training speed of the word vectors. In order to preserve more of the semantic information of the original sentence, the input layer and the projection layer of the Continuous Bag-of-Words (CBOW) model in word2vec are fully connected.
System description

Word vector representation method

Word2vec can represent every word that appears in a large training corpus as a low-dimensional vector (usually 50-100 dimensions). Detailed descriptions of word2vec are given by Mikolov et al. (2013b), Rong (2014) and Mikolov et al. (2013a).
Word2vec in our system uses the Continuous Bag-of-Words (CBOW) model, whose structure is shown in Figure 1. Here w_i is a word, the sequence w_{i-c}, ..., w_{i-1}, w_{i+1}, ..., w_{i+c} is the context of w_i, and c is the window size. The word vector of w_i has length d. Traditional word2vec sums the word vectors of the 2c context words at the input layer to obtain the projection layer; Mikolov et al. (2013b) describe this method in detail, and it is exactly the one used in the Google word2vec tool released in 2013. However, in sentiment analysis, whether a negation word precedes a sentiment word influences the polarity (Liu, 2012). So, in order to preserve more emotional semantic information, the input layer and the projection layer of CBOW are fully connected in our system.
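The difference between the standard CBOW projection and the fully connected variant can be sketched as follows. The exact parameterisation is not given in the paper, so the weight matrix W below is an assumption: it maps the concatenation of the 2c context vectors to the projection layer, making the projection sensitive to word order (e.g. a preceding negation), whereas the standard sum is not.

```python
import numpy as np

# Toy sketch of the two CBOW projection variants. The dimensions and the
# weight matrix W are illustrative assumptions, not the paper's settings.
rng = np.random.default_rng(0)

d, c = 50, 2                             # vector length and window size
context = rng.normal(size=(2 * c, d))    # word vectors of the 2c context words

# Standard CBOW: order-insensitive sum of the context vectors.
proj_sum = context.sum(axis=0)           # shape (d,)

# Fully connected variant: order-sensitive linear map of the
# concatenated context onto the projection layer.
W = rng.normal(size=(2 * c * d, d))      # trainable projection weights
proj_fc = context.reshape(-1) @ W        # shape (d,)
```

Reversing the order of the context words leaves `proj_sum` unchanged but changes `proj_fc`, which is the property the full connection is meant to provide.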
Figure 1: The structure of word2vec in the system.

The procedure of integrating word2vec and CNN to train the word vectors follows four steps: (i) initialization and pre-training: initialize the word2vec and CNN parameters, and pre-train word2vec for a certain number of iterations; (ii) CNN training: feed the latest word vectors from word2vec to the CNN; (iii) word2vec training: feed the latest word vectors from the CNN back to word2vec; (iv) alternate steps (ii) and (iii) until the training phase converges. The CNN helps word2vec extract more effective text features. Experiments show that, when the data is sufficient, the model obtained by integrating CNN and word2vec performs better than using either separately.
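The four steps above can be sketched as an alternating loop over a shared word-vector matrix. The update functions and the convergence test below are placeholders assumed for illustration; real word2vec and CNN gradient updates would take their place.

```python
import numpy as np

# Sketch of the alternating training procedure (steps i-iv). `w2v_step`
# and `cnn_step` are placeholder updates standing in for the real
# word2vec and CNN back-propagation steps.
rng = np.random.default_rng(1)
vocab, d = 100, 50
E = rng.normal(scale=0.1, size=(vocab, d))   # shared word-vector matrix

def w2v_step(E):
    return E - 0.01 * E          # placeholder for a word2vec update

def cnn_step(E):
    return E - 0.01 * E          # placeholder for a CNN back-prop update

# (i) initialize and pre-train word2vec for a few iterations
for _ in range(5):
    E = w2v_step(E)

# (ii)-(iv) alternate CNN and word2vec updates until the vectors stabilise
for _ in range(100):
    E_new = cnn_step(w2v_step(E))
    if np.abs(E_new - E).max() < 1e-6:       # assumed convergence test
        break
    E = E_new
```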

Deep learning model
The system uses a CNN model to predict the sentiment polarity of a Twitter text. The CNN structure is shown in Figure 2.
Word vector sequence: Each word is represented as a d-dimensional word vector, so a sentence or Twitter text containing n words can be expressed as n d-dimensional vectors, which are concatenated into a matrix X ∈ R^{d×n} representing the sentence. Each row of X is then treated as a new vector, so d n-dimensional vectors are obtained and fed as input to the CNN.
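The construction of the input matrix X can be illustrated as follows; the tokens and the toy dimensionality (d = 4) are assumptions for the example.

```python
import numpy as np

# Building the input matrix X described above from per-word vectors.
d = 4
sentence = ["this", "movie", "is", "great"]
n = len(sentence)

rng = np.random.default_rng(2)
lookup = {w: rng.normal(size=d) for w in sentence}   # word -> d-dim vector

# Stack the n word vectors as columns: X has shape (d, n).
X = np.column_stack([lookup[w] for w in sentence])

# Each of the d rows is an n-dimensional vector fed to the CNN.
rows = [X[i] for i in range(d)]
```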
Convolutional layer: The convolution layer uses a full convolution operation. Let F_i^l ∈ R^{M_1} denote the ith feature map at the lth layer (here, a layer refers to one convolutional layer plus one pooling layer), and m_{j,i}^l ∈ R^{M_2} denote the convolutional kernel of the jth convolutional result C_j^{l+1} at the (l+1)th layer. The jth convolutional result C_j^{l+1} at the (l+1)th layer is then the sum of the convolutions between each feature map at the lth layer and the kernel m_{j,i}^l, i.e.,

C_j^{l+1} = Σ_i F_i^l * m_{j,i}^l,

where * denotes the convolution operation.

k-max pooling: After the convolution operation, a max-pooling operation is usually performed, preserving the largest value in each convolution result. The system instead uses k-max pooling, which preserves the largest k values. For example, with k = 2, k-max pooling over (1, 6, 3, 8) yields (6, 8). Here k is a parameter that must be set manually.
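The two operations above can be illustrated on a toy feature map. The values and the kernel are made up for the example; `numpy.convolve` with `mode="full"` gives the full convolution the layer description refers to.

```python
import numpy as np

# Illustrative 1-D full convolution plus k-max pooling.
def k_max_pooling(v, k):
    """Keep the k largest values of v, preserving their original order."""
    idx = np.sort(np.argsort(v)[-k:])   # indices of the k largest, in order
    return v[idx]

feature_map = np.array([1.0, 6.0, 3.0, 8.0, 2.0])
kernel = np.array([0.5, 0.5])

conv = np.convolve(feature_map, kernel, mode="full")   # full convolution
pooled = k_max_pooling(np.array([1.0, 6.0, 3.0, 8.0]), k=2)   # -> [6., 8.]
```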
Full-connection layer: The full-connection layer receives the output of the last CNN layer and fully connects it to the output layer, i.e. the operation W·x + b, where W and b are trained during the network training phase.
Output layer: The output layer applies the softmax operation and outputs the probability that the input sentence or Twitter text belongs to each class; the class with the maximum probability is the class predicted by the system.
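The last two layers amount to an affine map followed by a softmax. The toy dimensions and weights below are assumptions; the three outputs stand for the positive, negative and neutral classes.

```python
import numpy as np

# Full-connection layer (W @ x + b) followed by a softmax output layer.
def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(3)
x = rng.normal(size=8)               # output of the last CNN layer
W = rng.normal(size=(3, 8))          # 3 classes: positive, negative, neutral
b = np.zeros(3)

probs = softmax(W @ x + b)           # class probabilities, sum to 1
predicted_class = int(np.argmax(probs))
```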

Combination prediction
Because of the insufficiency of training data and the large quantity bias among the different classes' training data, the trained CNN alone does not work well. Our system therefore adds a Support Vector Machine (SVM) (Suykens, 2001) model to predict jointly with the CNN.

Experimental datasets and model parameters
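The paper does not detail how the two models' predictions are combined; averaging their class probabilities is one plausible scheme, sketched here with made-up probability vectors.

```python
import numpy as np

# Hypothetical joint prediction: average the CNN's and the SVM's class
# probabilities, then take the argmax. The numbers are illustrative only.
cnn_probs = np.array([0.2, 0.1, 0.7])   # positive, negative, neutral
svm_probs = np.array([0.4, 0.2, 0.4])

combined = (cnn_probs + svm_probs) / 2
label = ["positive", "negative", "neutral"][int(np.argmax(combined))]
```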

Experimental datasets
The datasets for the experiments are provided by SemEval 2017, and the specific datasets used are shown in Table 1.

Model parameters
The parameter settings of word2vec and CNN in our system are shown in Table 2.

Experimental results and analysis

Evaluation method
The evaluation metric is average macro recall (Rosenthal et al., 2017):

ρ^{PUN} = (ρ^P + ρ^U + ρ^N) / 3

Here, ρ^P, ρ^U and ρ^N denote the recall of the positive, neutral and negative classes, respectively.
The other two metrics are the average macro F_1 and the average macro precision:

F_1^{PUN} = (F_1^P + F_1^U + F_1^N) / 3
P^{PUN} = (P^P + P^U + P^N) / 3

Here, F_1^P, F_1^U and F_1^N denote the F_1 of the positive, neutral and negative classes, and P^P, P^U and P^N denote the precision of the positive, neutral and negative classes. Table 3 lists the average macro recall of each model on the development dataset. From Table 3, the word2vec+CNN model outperforms the SVM, and word2vec+CNN+SVM is the best of the three models, so its results on the test set were submitted.
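The three averaged metrics can be computed directly from the per-class values; the numbers below are illustrative, not results from the paper.

```python
# Average macro recall, F1 and precision over the positive (P),
# neutral (U) and negative (N) classes. Per-class values are made up.
recall = {"P": 0.70, "U": 0.60, "N": 0.50}
f1     = {"P": 0.68, "U": 0.58, "N": 0.48}
prec   = {"P": 0.66, "U": 0.56, "N": 0.46}

avg_rec  = sum(recall.values()) / 3   # rho^PUN
avg_f1   = sum(f1.values()) / 3       # F1^PUN
avg_prec = sum(prec.values()) / 3     # P^PUN
```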

Analysis of experimental results
System    SVM    CNN    SVM+CNN
ρ^{PUN}   0.589  0.601  0.653

Table 3: The results of different models on the development dataset.

Table 4 shows the details of our system's result in comparison with the three top-ranked systems. It can be seen from the table that our ρ^N is not good, but our ρ^U is better than that of the top three systems. The drop in results stems from the quantity bias among the training data of the different classes.
Deep learning models require a lot of training data. Due to the lack of Twitter texts, the word2vec training was not sufficient and did not generate effective word vector representations. Semi-supervised mechanisms will be considered to expand the amount of training data.
In the future, we can improve the system's performance in the following ways: (i) expanding the amount of training data; (ii) improving the combination scheme, e.g. combining the results of multiple CNN systems for prediction; (iii) adding more emotional semantic features.

Conclusion
This paper presents a method of training word vectors by integrating word2vec with a CNN, and uses the trained CNN to complete the Twitter sentiment analysis task. In future work, we hope to continue improving the system's performance in multiple ways, such as tuning some parameters, improving the combination of classifiers, and adding more sentiment features.

Table 4: Per-class precision (P), recall (ρ) and F_1 for the positive, negative and neutral classes, and the overall score ρ^{PUN}, for our system and the three top-ranked systems (DataStories ranked first).