CENNLP at SemEval-2018 Task 1: Constrained Vector Space Model in Affects in Tweets

This paper discusses Task 1, the “Affect in Tweets” shared task, conducted at SemEval-2018. The task comprises several subtasks that required participants to analyse different emotions and sentiments in the provided tweet data and to measure the intensity of these emotions. Our approach was to build a model on a count-based representation and to use machine learning techniques for the regression and classification subtasks. In this work, we use a simple bag-of-words technique for supervised text classification to show that, even alongside more advanced distributed representation models, a count-based model can still achieve significant accuracy. Further, by fine-tuning various parameters of the bag-of-words representation model, we acquired better scores than various other baseline models (Vinayan et al.) that participated in the shared task.


Introduction
A large portion of research in natural language processing tries to better understand and process the various kinds of information in text. The growth of social websites and blogging and the consumption of technology produce a vast amount of text data on the internet, which has opened a space to study people's feelings, reviews, and emotions from their own written language; this is called sentiment analysis. Sentiment analysis has attracted considerable interest, and a great deal of research has been done in this area. Sentiment analysis is a set of techniques, approaches, and tools for sensing and mining subjective information (such as opinions and attitudes) from language (Bravo-Marquez et al., 2014). Traditional approaches (Mohammad et al., 2013) treat the problem as polarity classification into positive, negative, and neutral classes (Bravo-Marquez et al., 2015). Recent research in sentiment analysis (Mohammad and  takes a data-driven, algorithmic viewpoint. At the same time, combining these algorithms with good linguistic knowledge can increase performance and give insight into the task. We used machine learning techniques to build the model: linear regression for the prediction (regression) tasks and random forest for the classification tasks. A mathematical system or algorithm needs some form of numeric representation to work with. The naive way of representing a word in vector form is the one-hot representation, but it is very inefficient for a large corpus. A more effective representation maps semantically similar words to nearby points, so that the representation carries useful information about the actual meaning of the word. Such representations are called word embedding models, and they are categorized into count-based and predictive word embedding models. Both kinds of embedding model capture semantic meaning at least to some degree. Here we used count-based word embedding methods to represent the input words.
More specifically, feature representation is based on the term-document matrix (TDM) and the term frequency-inverse document frequency (TF-IDF) matrix. The optimum values of the n-gram range, the depth of the classifier, and min-df are obtained by hyperparameter tuning.

Corpus
The dataset provided by the shared task was sourced from the Twitter API by querying emotion-correlated words. The tweets were annotated separately for four emotions, namely anger, joy, fear, and sadness. The data were annotated with the best-worst scaling technique (Kiritchenko and Mohammad, 2016), which gave better annotation consistency and emotion intensity scores for tweets. There were five subtasks in Task 1. For each subtask, separate training and testing data sets are given for Spanish, English, and Arabic. Subtasks 1 and 3 focused on emotion intensity and sentiment intensity respectively and were framed as regression tasks (EI-reg and V-reg). In these, emotion intensity and sentiment intensity are real-valued scores between 0 and 1, where 0 represents the least and 1 the most intensity felt by the tweeter, as inferred from the written tweet. The remaining subtasks, EI-oc, V-oc, and E-c, were multi-class classification problems: emotion intensity ordinal classification, sentiment ordinal classification, and emotion classification respectively. For subtask 2 (EI-oc), distinct training and testing datasets are provided for anger, fear, joy, and sadness. Subtask 4 (V-oc) has 7 ordinal classes, corresponding to different levels of positive and negative valence state of the tweeter.

Background

TDM
The TDM is the most basic method of representing text in NLP. In this technique, every document in a corpus is represented by the raw counts of its words over all the unique words in the entire corpus (Larson, 2010). That is, a vocabulary is created from all the words in the entire corpus, and for a single document's representation the count of each word is incremented according to its occurrences in that document only. The drawback of this method is that it creates a very sparse matrix in which only a few columns hold counts while the rest are all zeros, which leads us to the term frequency method.
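
As a minimal sketch of the count-based representation described above, the following plain-Python example builds a term-document matrix for a toy corpus. The function name `build_tdm`, the lowercase whitespace tokenizer, and the example tweets are illustrative assumptions and are not taken from the system described in this paper.

```python
from collections import Counter


def build_tdm(documents):
    """Return (vocabulary, matrix) with one row of raw word counts per document."""
    # Assumption: lowercase whitespace tokenization; real tweets need more care.
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary holds every unique word in the corpus, sorted for stable columns.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Words absent from this document get a zero, which makes the matrix sparse.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Hypothetical toy corpus of two tweets.
docs = ["so happy so glad", "so angry today"]
vocab, tdm = build_tdm(docs)
print(vocab)
print(tdm)
```

Each row has as many columns as the corpus vocabulary, so with a realistic vocabulary most entries of a row are zero.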

TF-IDF
One problem with the term-document representation is that it takes a raw count of all the words in the document, so frequently occurring words such as conjunctions and prepositions appear very often across most articles and thus add no significant importance to any individual article. On the other hand, seldom-occurring words, such as proper nouns, give an article a more individual identity. This leads to a method in which we take into account the frequency of the words over the entire corpus; this method is termed term frequency (tf).
In language processing, a collection of commonly appearing words with apparently little significance to a document is called 'stop words'; these can be removed at the pre-processing level. However, more often than not, a list of stop words is not a sophisticated way of adjusting term frequency for commonly used words. Inverse document frequency (idf) is a technique (Ramos et al., 2003) in which less weight is given to words that occur more commonly across the entire corpus (not only stop words), and vice versa for seldom-used words.
Combining the two ideas (tf-idf) yields a measure, based on the rarity of the term, of how important a word is to a document in a collection (or corpus) of documents. It can be considered a heuristic quantity. The inverse document frequency for any given term t is defined as
\[ \mathrm{idf}_t = \ln \frac{N_{\text{tot docs}}}{n_{\text{docs containing } t}} \]
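
The idf definition can be checked on a toy corpus. The sketch below is a plain-Python illustration under stated assumptions: the function names, the whitespace tokenizer, and the three example documents are hypothetical. It uses the unsmoothed natural-log form; scikit-learn's `TfidfVectorizer` applies a smoothed variant by default, so its values differ slightly.

```python
import math


def inverse_document_frequency(documents):
    """Compute idf_t = ln(N_docs / n_docs_containing_t) for every term t."""
    # Assumption: lowercase whitespace tokenization.
    tokenized = [set(doc.lower().split()) for doc in documents]
    n_docs = len(tokenized)
    vocabulary = set().union(*tokenized)
    return {
        term: math.log(n_docs / sum(term in tokens for tokens in tokenized))
        for term in vocabulary
    }


def tf_idf(document, idf):
    """Weight the raw term counts of one document by the idf of each term."""
    tokens = document.lower().split()
    return {term: tokens.count(term) * idf[term] for term in set(tokens)}


# Hypothetical toy corpus.
docs = ["the match was great", "the traffic was awful", "the great escape"]
idf = inverse_document_frequency(docs)
weights = tf_idf(docs[0], idf)
print(weights)
```

A word that occurs in every document, such as "the" here, receives an idf of ln(3/3) = 0 and so contributes nothing, while a word found in a single document receives the largest weight.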

Linear Regression
Linear regression is a commonly used supervised learning approach for prediction. The key goal is to fit a best-fit line between a dependent and an independent variable. The model is usually fitted using the least squares approach, that is, by minimizing the sum of squared errors between the actual values and the values predicted by the model. In certain cases, a regularization term is also added to the model to avoid overfitting (François and Miltsakaki, 2012).
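
To make the least-squares idea concrete, here is a minimal sketch for a single predictor in plain Python. The function name `fit_line`, the `ridge_penalty` parameter, and the data are illustrative assumptions. The penalty is an L2 (ridge) term on the slope only, which is one common form of regularization and not necessarily the one used in this work.

```python
def fit_line(xs, ys, ridge_penalty=0.0):
    """Fit y = slope * x + intercept by (optionally ridge-penalized) least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    # With ridge_penalty = 0 this is the ordinary least-squares slope;
    # a positive penalty shrinks the slope towards zero.
    slope = sxy / (sxx + ridge_penalty)
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_error(xs, ys, slope, intercept):
    """Sum of squared differences between actual and predicted values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: one feature value per tweet and a gold intensity in [0, 1].
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.3, 0.5, 0.7]
slope, intercept = fit_line(xs, ys)
shrunk_slope, _ = fit_line(xs, ys, ridge_penalty=5.0)
print(slope, intercept, sum_squared_error(xs, ys, slope, intercept))
```

The unpenalized fit has the smallest squared error on the training data; the penalized fit accepts a slightly larger training error in exchange for a smaller coefficient.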

Random Forest
Random forest (RF) is an ensemble classifier based on decision trees that averages a combination of trees, each built on a random sample of the data set. A decision tree breaks the data down into smaller subsets while simultaneously constructing a tree of decision nodes and leaf nodes. The leaf nodes represent the categories. A decision node has two or more branches leading to further decision nodes or leaves. Every tree in the RF is built on a random subset of the features present (Liaw et al., 2002) in the entire data. The RF algorithm averages the trees to produce a model with low variance: the influence of insignificant trees is cancelled out, and the remaining trees determine the output.
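
The following sketch shows what such an ensemble looks like with scikit-learn's `RandomForestClassifier`. The toy feature matrix, the labels, and the `random_state` value are assumptions made for illustration only.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: two count features per tweet and a binary class label.
X = [[0, 3], [1, 2], [3, 0], [2, 1], [0, 4], [4, 0]]
y = [0, 0, 1, 1, 0, 1]

# Each of the 100 trees is grown on a bootstrap sample of the rows and
# considers a random subset of the features at every split.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# The forest prediction averages the class probabilities of the individual trees.
predictions = forest.predict([[0, 5], [5, 0]])
print(len(forest.estimators_), predictions)
```

Because the final prediction averages many trees, a single tree grown on an unlucky sample has little influence on the output.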

Methodology
The effectiveness of the model depends on how well it extracts meaningful information from raw text. The system is built with the scikit-learn library, a Python library that is useful for classification, regression, clustering, data preparation, dimensionality reduction, etc.
3. Create a bag-of-words model, a simple numeric representation of a piece of text that is easy to classify. We count the frequency of each word in the piece of text and create a dictionary of them, which is the tokenization process in NLP; this is then passed to the CountVectorizer object in the scikit-learn package to create a set of maximum features. We use the fit_transform method to model (Ganesh et al., 2016) the bag-of-words feature vectors, which are stored in an array.
4. The same tools and methods as in step 3 are followed to create the TDM matrix.
5. Create a classifier or predictor with the help of a machine learning model. Here we used a random forest classifier consisting of one hundred trees. An RF is a set of decision trees that model all possible outcomes.
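
The steps above can be sketched as a scikit-learn pipeline with a grid search over min-df, the n-gram range, and tree depth. This is a hedged illustration, not the authors' code: the toy tweets, the labels, the grid values, the step names `features` and `forest`, and the two-fold cross-validation are all assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data; the real system used the shared-task tweets.
tweets = [
    "so happy today", "so happy and glad", "so happy with this", "really so happy",
    "so angry today", "so angry and mad", "so angry with this", "really so angry",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    # Bag-of-words counts; TfidfVectorizer could be swapped in for TF-IDF features.
    ("features", CountVectorizer()),
    # One hundred trees, as stated in the methodology.
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Assumed grid values, chosen only to keep the example small.
param_grid = {
    "features__min_df": [1, 2],
    "features__ngram_range": [(1, 1), (1, 2)],
    "forest__max_depth": [None, 5],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(tweets, labels)
print(search.best_params_)
print(search.predict(["so happy"]))
```

Putting the vectorizer inside the pipeline means the vocabulary is rebuilt on each training fold, so the cross-validated score does not see words from the held-out fold.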

Result
This group of tasks focuses on automatic detection of the intensity of emotion (EI-reg) and sentiment (V-reg) felt by the tweeter. It also poses the problem of classifying the emotions of tweets into multiple classes, as in EI-oc, V-oc, and E-c. We approached these tasks with a count-based representation model, in which every individual tweet is represented using a varied vocabulary size, and examined how this performs for the different categories of subtasks over datasets in three languages, namely English, Spanish, and Arabic. We built the model keeping in mind that an algorithm should not be narrowed down to a certain problem, that is, it should not be biased towards a particular problem. This view rests on the fact that all subtasks under Task 1 focus on understanding the affect of tweets from the same corpora. All the subtasks under Task 1 therefore follow a generic grid search model, varied over the min-df and n-gram parameters.
The EI-reg task was tuned on mean squared error and variance for all 3 languages. EI-reg gave comparatively better accuracy with the TF-IDF matrix than with the TDM, so we used TF-IDF to create the feature matrix. This regression task gave a macro-average between 32 and 44 percent. English tweets gave the lowest macro-average value (32).
V-reg is a regression task in which sentiment intensity was predicted. Spanish and English used TF-IDF, and the Arabic corpus used the term-document matrix, as the input feature matrix. These choices were found by the grid search method. Arabic and Spanish data give a Pearson correlation of about 0.58 (58%), and English data give a slightly higher result of 0.62.

Language    Pearson (all instances), Valence
English     0.622
Arabic      0.583
Spanish     0.580
Table 9: V-reg result.
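
The V-reg scores are reported as Pearson correlations between predicted and gold intensities. As a minimal sketch of how such a score is computed, assuming two equal-length lists of real values (the function name and the numbers are illustrative, not task data):

```python
import math


def pearson(predicted, gold):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_g = sum(gold) / n
    covariance = sum((p - mean_p) * (g - mean_g) for p, g in zip(predicted, gold))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_g = sum((g - mean_g) ** 2 for g in gold)
    return covariance / math.sqrt(var_p * var_g)


# Hypothetical predicted and gold valence intensities for four tweets.
predicted = [0.20, 0.55, 0.40, 0.90]
gold = [0.10, 0.60, 0.50, 0.80]
print(pearson(predicted, gold))
```

The coefficient rewards predictions that rise and fall with the gold scores, so a model can score well even if its outputs are shifted or scaled relative to the gold values.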
Subtasks 2, 4, and 5 are multi-label classification problems whose models are also built with the bag-of-words method. However, the classification done with random forest did not yield the expected results compared to the regression tasks.

Conclusion
Affect in tweets was detected using the bag-of-words representation and classical machine learning algorithms. Random forest and linear regression were used as the machine learning algorithms for the classification tasks and the regression tasks respectively; the regression tasks gave fairly good results, while the classification tasks yielded less favorable results. TF-IDF seems to give better results for English and Spanish, whereas TDM gave better results for Arabic. Emotion intensity and valence were captured by our model on the given validation data. The algorithms performed nearly the same with TF-IDF and TDM, with slightly better results when using TF-IDF.