TWINA at SemEval-2017 Task 4: Twitter Sentiment Analysis with Ensemble Gradient Boost Tree Classifier

This paper describes the TWINA system, with which we participated in SemEval-2017 Task 4B (Topic-Based Message Polarity Classification, two-point scale) and Task 4D (Tweet quantification, two-point scale). We implemented an ensemble-based Gradient Boost Trees classification method for both tasks. Our system ranked 13th among 15 teams on Task 4D and 23rd on Task 4B.


Introduction
Twitter, a social networking and microblogging service, has gained great success in recent years. It has attracted millions of users who disseminate up-to-date information, generating massive amounts of data every day. Users share their opinions and experiences on Twitter in short texts of at most 140 characters, called Tweets. Many applications in Natural Language Processing (NLP) and Information Retrieval (IR) suffer severely from the noise in such short texts. This paper describes the system with which we participated in Task 4 (Sentiment Analysis in Twitter) of SemEval-2017 (Rosenthal et al., 2017). The organizers defined five subtasks in Task 4; we participated in only two of them, subtasks B and D.
With our submissions, we ranked 13th among 15 participants in Task 4D and 23rd in Task 4B. For both tasks, we implemented a basic ensemble-based Gradient Boost Tree Classifier model and applied parameter optimization to improve the results.
The rest of the paper is organized as follows: Section 2 describes the datasets, Section 3 the preprocessing of the data, Section 4 the model implementation using the ensemble-based Gradient Boost Trees classification technique, Section 5 gives the results, and Section 6 gives the conclusion and future work.

Datasets
In implementing the solution for SemEval Task

Naveen Kumar Laskari, Assistant Professor of IT, BVRIT Hyderabad, naveen.laskari@gmail.com
Suresh Kumar Sanampudi, Head of the Department, IT, JNTUH College of Engineering Jagitial, sureshsanampudi@gmail.com

Pre-processing
Twitter imposes the constraint that a Tweet may not exceed 140 characters. This leads users to express themselves in unpredictable ways, which makes extracting sentiment from such Tweets very challenging. In addition to the short text, users employ various emoticons to express their opinions and feelings, and dealing with emoticons is itself a challenge. To get better results, we apply several pre-processing steps that clean the Tweets of unnecessary information. First, each Tweet is converted to lower case, and all URLs, HTML parts and hashtags are removed. Emoticons are grouped into two categories, SAD and HAPPY, and each emoticon is replaced with its category label. Table 1 shows an original Tweet alongside its pre-processed form.
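The cleaning steps described above can be sketched as follows. This is a minimal illustration rather than our exact code: the regular expressions and the emoticon lists (`HAPPY_EMOTICONS`, `SAD_EMOTICONS`) are assumptions chosen for the example, since the full emoticon inventory is not listed here.

```python
import re

# Hypothetical emoticon lists, for illustration only.
HAPPY_EMOTICONS = [":-)", ":)", ";)", "=)"]
SAD_EMOTICONS = [":-(", ":(", ":'("]

def _alternation(emoticons):
    # Escape each emoticon and try longer ones first, so ":-)" wins over ":)".
    ordered = sorted(emoticons, key=len, reverse=True)
    return re.compile("|".join(re.escape(e) for e in ordered))

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_RE = re.compile(r"<[^>]+>|&\w+;")   # tags and entities such as &amp;
HASHTAG_RE = re.compile(r"#\w+")
HAPPY_RE = _alternation(HAPPY_EMOTICONS)
SAD_RE = _alternation(SAD_EMOTICONS)

def preprocess(tweet):
    """Lower-case a Tweet, drop URLs, HTML parts and hashtags, label emoticons."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)          # URLs first, so "://" is never mistaken for an emoticon
    text = HTML_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = HAPPY_RE.sub(" HAPPY ", text)  # category labels stay upper-case
    text = SAD_RE.sub(" SAD ", text)
    return " ".join(text.split())         # collapse the extra whitespace

print(preprocess("Loving the new phone :) http://t.co/abc #happy &amp; <b>wow</b>"))
```

Removing URLs before matching emoticons is a design choice of this sketch; it keeps character sequences inside links from being read as emoticons.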

Implementation
To train and test our model, we downloaded the training, development-test and test datasets provided by the SemEval-2017 Task 4 organizers. After preprocessing the Tweets, we extracted word2vec features using gensim models. These word2vec features were used to train the Gradient Boost Tree Classifier (GBC). After training, the development-test dataset was used to validate the GBC model and the final test dataset was used to evaluate it.
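A classifier needs one fixed-length feature vector per Tweet, while word2vec yields one vector per word. How the word vectors were combined is not stated here; the sketch below assumes simple averaging, which is a common choice. The function `tweet_vector` and the toy `embeddings` dictionary are hypothetical and stand in for a trained word2vec model.

```python
def tweet_vector(tokens, embeddings, dim):
    """Average the vectors of the tokens found in `embeddings`.

    Assumption: averaging is used to pool word vectors; unknown tokens are
    skipped, and a Tweet with no known token maps to the zero vector.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

# Toy 2-dimensional embeddings, invented for the example.
embeddings = {"good": [1.0, 0.0], "phone": [0.0, 1.0]}
features = tweet_vector(["good", "phone", "xyz"], embeddings, dim=2)
print(features)
```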

Word2Vec
The word2vec model is used for learning vector representations of words, called word embeddings (Mikolov et al., 2013; Pennington et al., 2014). Word2vec is a computationally efficient predictive model for learning word embeddings. It comes in two flavors, the Continuous Bag-of-Words (CBOW) model and the Skip-Gram model. Algorithmically, these models are similar, except that CBOW predicts target words from source context words, while Skip-Gram does the inverse and predicts source context words from the target words. A notable property of these word embeddings is that they effectively capture the semantic meanings of words.
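The difference between the two flavors is easiest to see in the training examples each one derives from a sentence. The sketch below only builds those (input, output) pairs and does not train a model; the helper names and the window size are illustrative assumptions.

```python
def context_window(tokens, index, window):
    """Return up to `window` words on each side of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    # CBOW: the context words are the input, the centre word is the target.
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    # Skip-Gram: the centre word is the input, each context word is a target.
    return [(tokens[i], context)
            for i in range(len(tokens))
            for context in context_window(tokens, i, window)]

tokens = ["the", "phone", "is", "great"]
print(cbow_examples(tokens, window=1))
print(skipgram_examples(tokens, window=1))
```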

Gradient Boost Tree Classifier
Gradient Boosting is a machine learning technique for regression and classification problems. It builds an ensemble of trees one by one, and the predictions of the individual trees are then summed. Gradient Boosting involves three elements: (1) a loss function to be optimized, (2) a weak learner to make predictions, and (3) an additive model that adds weak learners to minimize the loss function. Decision trees are used as the weak learners in gradient boosting. Trees are constructed in a greedy manner by choosing the best split points. Trees are added one at a time, and existing trees in the model are not changed.
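To make the three elements concrete, the following is a deliberately small sketch of the additive procedure. It is not the model used in our system: it assumes a squared-error loss, a single numeric feature, and depth-one trees (stumps) as weak learners, and all function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Weak learner: greedily pick the threshold minimizing squared error.

    Requires at least two distinct values in xs.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def fit_gradient_boosting(xs, ys, n_estimators=100, learning_rate=0.1):
    """Additive model: each new stump fits the residuals of the current model."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        # For squared error, the negative gradient of the loss is the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))  # earlier stumps stay unchanged
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps
    )

model = fit_gradient_boosting([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 1.0, 1.0])
print(predict(model, 1.0), predict(model, 4.0))
```

With a small learning rate each stump corrects only a fraction of the remaining error, which is why a low learning rate has to be paired with a large number of trees.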
We used Scikit-learn for our model implementation. It is a free software library for machine learning in Python. Scikit-learn comes with various classification, regression and clustering techniques, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. The parameters we tuned are described below.
min_samples_leaf is the minimum number of observations (samples) required in a leaf, or terminal, node. This parameter is used to control overfitting; lower values are generally preferred for class-imbalanced problems, so we fixed it at 1.
n_estimators is the number of sequential trees to be modeled. GBC is fairly robust to a high number of trees, but it can still overfit beyond a certain point. Hence, we checked various values and fixed it at 2500.
max_depth is the maximum depth of a tree. An appropriate value has to be chosen to control overfitting, because a deeper tree allows the model to learn relations that are very specific to the training samples. We fixed it at 7.
subsample is the fraction of observations used to construct each tree. The subsample is selected by purely random sampling. A value slightly less than 1 makes the model more robust. We fixed it at 0.75.
random_state is the random number seed, which ensures that the same random numbers are generated every time. This is an important parameter: if the seed is not fixed, subsequent runs with the same parameters give different outcomes. We fixed it at 3.
learning_rate determines the impact of each tree on the final outcome; it controls the magnitude of the change in the estimates. Lower values make the model more robust, but more trees are then needed to model all the relations, which is computationally expensive. We fixed it at 0.005.
We tested the Gradient Boost Tree Classifier model with various combinations of values for the above parameters, and evaluated the accuracy of the model for every combination. The combination reported above gave comparatively the best results.
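For reference, the parameter values reported above can be collected in one place and passed to Scikit-learn's `GradientBoostingClassifier`. This is a sketch of how such a configuration might look; the names `GBC_PARAMS` and `build_classifier` are introduced here for illustration, and all other parameters are assumed to stay at the library defaults.

```python
# Parameter values as reported in the text above.
GBC_PARAMS = {
    "min_samples_leaf": 1,
    "n_estimators": 2500,
    "max_depth": 7,
    "subsample": 0.75,
    "random_state": 3,
    "learning_rate": 0.005,
}

def build_classifier(params=GBC_PARAMS):
    """Create an untrained classifier with the reported settings."""
    # Imported here so the dictionary can be inspected without scikit-learn installed.
    from sklearn.ensemble import GradientBoostingClassifier
    return GradientBoostingClassifier(**params)
```

A typical use would be `build_classifier().fit(X_train, y_train)`, where `X_train` and `y_train` are placeholders for the Tweet feature vectors and their polarity labels.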

Results
We participated in only two subtasks (Task 4B and 4D) of SemEval-2017 Task 4, and used the ensemble-based Gradient Boost Trees classification technique for both. For Task 4B, we classified the polarity of each Tweet with respect to a particular entity as either positive or negative.
For Task 4D, we assigned a probability score to each Tweet and computed the mean of the positive and negative probabilities at the entity level. The mean probability computed for an entity is taken as the final Tweet quantification score for that entity.
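The entity-level averaging for Task 4D can be sketched as below. The function `quantify` and its input format are hypothetical; the sketch assumes a binary classifier, so the negative probability of a Tweet is one minus its positive probability (for example, as returned by a classifier's `predict_proba`).

```python
from collections import defaultdict

def quantify(scored_tweets):
    """Average per-Tweet positive probabilities for each entity.

    scored_tweets: iterable of (entity, p_positive) pairs, one per Tweet.
    Returns {entity: (mean_positive, mean_negative)}.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for entity, p_positive in scored_tweets:
        totals[entity] += p_positive
        counts[entity] += 1
    result = {}
    for entity in totals:
        mean_positive = totals[entity] / counts[entity]
        # Binary case: the mean negative probability is the complement.
        result[entity] = (mean_positive, 1.0 - mean_positive)
    return result

scores = quantify([("iphone", 0.8), ("iphone", 0.6), ("brexit", 0.25)])
print(scores)
```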