NRU-HSE at SemEval-2016 Task 4: Comparative Analysis of Two Iterative Methods Using Quantification Library

In many areas, such as social science, politics or market research, people need to track sentiment and its changes over time. For sentiment analysis in this field, it is more important to correctly estimate the proportions of each sentiment expressed in a set of documents (the quantification task) than to accurately estimate the sentiment of a particular document (classification). Our study aims to analyze the effectiveness of two iterative quantification techniques and to compare them with baseline methods. All the techniques are evaluated on a set of synthesized data and on the SemEval-2016 Task 4 dataset. We have made the quantification methods from this paper available as an open source Python library. The results of the comparison and possible limitations of the quantification techniques are discussed.


Introduction
In many areas, such as customer-relationship management or opinion mining, people need to track changes over time and measure the proportions of documents expressing different sentiments. In these situations, the task of accurately categorizing each document is replaced by the task of providing accurate proportions of documents in each class (quantification). George Forman suggested defining the 'quantification task' as finding the best estimate for the number of cases in each class in a test set, using a training set with a substantially different class distribution (Forman, 2008).
Although quantification techniques are able to provide accurate sentiment analysis of proportions in situations of distribution drift, the question of the optimal technique for the analysis of tweets remains open. It is worth mentioning that sentiment analysis of tweets presents additional challenges to natural language processing because of the small amount of text (less than 140 characters in each document), usage of creative spelling (e.g. "happpyyy", "some1 yg bner2 tulus"), abbreviations (such as "wth" or "lol"), informal constructions ("hahahaha yava quiet so !ma I m bored av even home nw") and hashtags ("BREAKING: US GDP growth is back! #kidding"), which are a type of tagging for Twitter messages.
In our paper we take several quantification methods regarded in the literature as the best ones and evaluate them by comparing their effectiveness with one another and with baseline methods.
The paper is organized as follows. In Section 2, we first look at the notation, then we briefly overview six methods to solve the quantification problem. Section 3 describes two datasets we use in our research. Section 4 describes the results of our experiments, while Section 5 concludes the work defining open research issues for further investigation.

Quantification Methods
In this section we describe the methods used to handle changes in class distribution.
First, let us define the notation.

X: vector representation of an observation x;
C = {c_1, ..., c_n}: classes of observations, where n is the number of classes;
p_S(c): the true prior probability (aka "prevalence") of class c in the set S;
p̂_S(c_j): the estimated prevalence of c_j using the set S;
p̂_S^M(c_j): the estimate p̂_S(c_j) obtained via method M;
p(c_j|x): the a posteriori probability of assigning an observation x to the class c_j;
TRAIN, TEST: the training and test sets of observations, respectively;
S_c: the subset of a set S where each observation falls within class c;
TRAIN_CD: the class probability distribution of the training set.

The problem we study has some training set, which provides us with a set of labeled examples, TRAIN, with class distribution TRAIN_CD. At some point the distribution of the data changes to a new, but unknown class distribution, TEST_CD, and this distribution provides a set of unlabeled examples, TEST. Given this terminology, we can state our quantification problem more precisely.

Classify and Count
The first approach provides information about the proportions of documents in each class simply by classifying each document. In this case, the process starts with training the best available classifier, applying it to the test set and counting the number of documents in each class. Forman named this obvious approach Classify and Count (CC) (Forman, 2008).
The observed count P of positives from the classifier will include both true positives and false positives, P = TP + FP, as characterized by the standard 2 × 2 confusion matrix of classifier predictions:

Actual \ Predicted    P     N
P                     TP    FN
N                     FP    TN
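As a minimal sketch (our illustration with scikit-learn, not the authors' library code; function and variable names are our own), CC amounts to counting hard predictions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

def classify_and_count(clf, X_test, classes):
    """CC: estimate class prevalences by counting hard predictions."""
    preds = clf.predict(X_test)
    return np.array([np.mean(preds == c) for c in classes])

# toy usage on synthetic data
X, y = make_classification(n_samples=200, random_state=0)
clf = LinearSVC().fit(X[:100], y[:100])
prev = classify_and_count(clf, X[100:], classes=[0, 1])
```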

Adjusted Classify and Count
Adjusted Classify and Count (ACC), aka the "confusion matrix model" quantification method (Forman, 2005), consists of six steps:
1. training a binary classifier on the entire training set;
2. estimating its characteristics via many-fold cross-validation (tpr = TP/P and fpr = FP/N);
3. applying the classifier to the test set;
4. counting the number of test cases on which the classifier outputs positives;
5. estimating the true percentage of positives via Equation (1),

p̂ = (p_obs − fpr) / (tpr − fpr),   (1)

where p_obs is the observed proportion of cases predicted positive;
6. clipping the output to the feasible range [0, 1].
As mentioned by Forman, the performance of the ACC method degrades severely in the situation of a highly imbalanced training sample. If one of the classes is rare in the training set, the classifier will learn not to vote for this class because of tpr = 0%. Small denominator (tpr − fpr) in Equation (1) makes the quotient highly sensitive in the estimation of tpr or fpr, and this leads to low quantification accuracy especially at the small training sets with high class imbalance (Forman et al., 2006).
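The six ACC steps can be sketched as follows for the binary case; this is our hedged reimplementation with scikit-learn (assuming labels {0, 1}), with the adjustment (p_obs − fpr)/(tpr − fpr) written out inline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.svm import LinearSVC

def adjusted_classify_and_count(clf, X_train, y_train, X_test):
    """ACC for binary labels {0, 1}."""
    # steps 1-2: estimate tpr/fpr via cross-validation on the training set
    cv_preds = cross_val_predict(clf, X_train, y_train, cv=5)
    tpr = np.mean(cv_preds[y_train == 1] == 1)
    fpr = np.mean(cv_preds[y_train == 0] == 1)
    # steps 3-4: classify the test set and count predicted positives
    clf.fit(X_train, y_train)
    p_obs = np.mean(clf.predict(X_test) == 1)
    if tpr == fpr:                       # degenerate classifier: no adjustment
        return float(p_obs)
    # steps 5-6: adjust and clip to the feasible range [0, 1]
    return float(np.clip((p_obs - fpr) / (tpr - fpr), 0.0, 1.0))

# toy usage
X, y = make_classification(n_samples=400, random_state=0)
p_hat = adjusted_classify_and_count(LinearSVC(), X[:200], y[:200], X[200:])
```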

Probabilistic Classify and Count
The Probabilistic Classify and Count (PCC) method differs from the CC algorithm by counting the expected share of documents predicted positive, i.e. by averaging the probability of membership in class c over the observations in the TEST set:

p̂_PCC(c) = (1 / |TEST|) Σ_{x ∈ TEST} p(c|x).   (2)
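In code, the only change from CC is averaging posteriors instead of counting hard labels; this sketch (our illustration, assuming a scikit-learn classifier with predict_proba) shows the difference:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def probabilistic_classify_and_count(clf, X_test):
    """PCC: average the posterior probabilities over the test set
    instead of counting hard predictions (Equation (2))."""
    return clf.predict_proba(X_test).mean(axis=0)

# toy usage
X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X[:100], y[:100])
prev = probabilistic_classify_and_count(clf, X[100:])
```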

Probabilistic Adjusted Classify and Count
The central idea of the Probabilistic Adjusted Classify and Count (PACC) algorithm is to combine the two algorithms above, ACC and PCC: the observed proportion of positives, tpr and fpr in Equation (1) should be replaced by their expected values, i.e. by averages of the corresponding a posteriori probabilities over the respective sets of observations.
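A hedged sketch of this combination (our illustration with scikit-learn, binary labels {0, 1}): the hard counts of the ACC sketch become soft, expected counts:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def pacc(clf, X_train, y_train, X_test):
    """PACC for binary labels {0, 1}: Equation (1) with expected
    (soft) counts in place of hard ones."""
    probs = cross_val_predict(clf, X_train, y_train, cv=5,
                              method="predict_proba")[:, 1]
    tpr = probs[y_train == 1].mean()     # expected true-positive rate
    fpr = probs[y_train == 0].mean()     # expected false-positive rate
    clf.fit(X_train, y_train)
    p_obs = clf.predict_proba(X_test)[:, 1].mean()   # expected positives
    if tpr == fpr:
        return float(p_obs)
    return float(np.clip((p_obs - fpr) / (tpr - fpr), 0.0, 1.0))

# toy usage
X, y = make_classification(n_samples=400, random_state=0)
p_hat = pacc(LogisticRegression(max_iter=1000), X[:200], y[:200], X[200:])
```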

Expectation Maximization
A simple procedure to adjust the outputs of a classifier to a new a priori probability is described in the study by (Saerens et al., 2002).
Importantly, the authors suggest not only the well-known formula (4) to compute the corrected a posteriori probabilities, but also an iterative procedure to adjust the outputs of the trained classifier with respect to these new a priori probabilities, without having to refit the model, even when these probabilities are not known in advance.
To make the Expectation Maximization (EM) method clear, we specify its algorithm in Figure 1 using pseudo-code. The algorithm begins by counting the starting values for the class probability distribution, using the labels of the training set TRAIN (line 1), builds an initial classifier C_i from the TRAIN set (line 2) and classifies each item in the unlabeled TEST set (line 3), where the classify function returns the a posteriori probabilities (TEST_prob) for the specified dataset. The algorithm then iterates in lines 4-9 until the maximum number of iterations (maxIterations) is reached. In this loop, the algorithm first uses the previous a posteriori probabilities TEST_prob to estimate a new a priori probability (line 6). Then, in line 7, the a posteriori probabilities are recomputed using Equation (4). Finally, once the loop terminates, the last a posteriori probabilities are returned (line 9).
To build a classifier in the function build_clf, we use support vector machines (SVM) with linear kernel.
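The loop can be sketched as follows; this is our illustrative reimplementation of the Saerens et al. (2002) procedure, operating directly on precomputed posteriors rather than on the classifier itself (function names and the toy posteriors are our own):

```python
import numpy as np

def em_quantify(train_prior, test_posteriors, max_iterations=50):
    """Iteratively re-estimate class priors on TEST (Saerens et al., 2002).

    train_prior: (n_classes,) class prior estimated on TRAIN.
    test_posteriors: (n_items, n_classes) classifier posteriors on TEST.
    """
    adjusted = test_posteriors
    for _ in range(max_iterations):
        # new prior estimate: average of the current adjusted posteriors
        prior = adjusted.mean(axis=0)
        # rescale the original posteriors by the ratio of new to training
        # priors and renormalize each row (the formula of Equation (4))
        adjusted = test_posteriors * (prior / train_prior)
        adjusted /= adjusted.sum(axis=1, keepdims=True)
    return prior, adjusted

# toy usage: posteriors skewed toward class 1 despite an even training prior
rng = np.random.default_rng(0)
posteriors = rng.dirichlet([1.0, 3.0], size=500)
prior, _ = em_quantify(np.array([0.5, 0.5]), posteriors)
```

Note that no refitting happens inside the loop: only the priors and posteriors are updated, which is the point of the method.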

Iterative Class Distribution Estimation
Another interesting method is iterative cost-sensitive class distribution estimation (CDEIterate) described in the study by (Xue and Weiss, 2009).
The main idea of this method is to retrain a classifier at each iteration, where the iterations progressively improve the quantification accuracy of the "classify and count" method via the generated cost-sensitive classifiers.
For the CDE-based method, the final prevalence is induced from the labeled TRAIN set with the class costs COST. The COST value is computed with Equation (5), utilizing the class distribution TEST_CD calculated during the previous step; it is recalculated on each iteration. The CDEIterate algorithm is specified in Figure 2 using pseudo-code. The algorithm begins by counting the class distribution TRAIN_CD of the training labels TRAIN (line 1). Then it builds an initial classifier C_i from the TRAIN set (line 2). In a loop, the algorithm uses the previous classifier C_i to classify the unlabeled TEST set by estimating a posterior probability TEST_prob for each item in the test set (line 5). Then, in line 6, the a priori probability distribution is computed, and the cost ratio information is updated (line 7). In line 8, a new cost-sensitive classifier C_i is generated using the TRAIN set with the updated cost ratio COST. The algorithm iterates in lines 4-9 until the maximum number of iterations (maxIterations) is reached. Finally, once the loop terminates, the last a priori probability distribution of classes, TEST_CD, is returned (line 10). To build a cost-sensitive classifier in the function build_clf, we tried a few classifiers and chose a fast logistic regression classifier.
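This loop can be sketched as follows. Since Equation (5) is not reproduced in the text, the class-weight formula below (the ratio of the estimated test prevalence to the training prevalence) is our assumption, and scikit-learn's class_weight mechanism stands in for the paper's explicit cost-sensitive learner:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def cde_iterate(X_train, y_train, X_test, max_iterations=10):
    """Sketch of CDEIterate (Xue and Weiss, 2009) for binary labels {0, 1}."""
    train_cd = np.bincount(y_train, minlength=2) / len(y_train)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    test_cd = train_cd.copy()            # starting point: TEST_CD = TRAIN_CD
    for _ in range(max_iterations):
        # estimate TEST_CD via classify-and-count with the current classifier
        preds = clf.predict(X_test)
        test_cd = np.bincount(preds, minlength=2) / len(preds)
        test_cd = np.clip(test_cd, 1e-6, None)   # avoid degenerate weights
        # assumed cost update: weight classes by estimated/training prevalence
        weights = {0: test_cd[0] / train_cd[0],
                   1: test_cd[1] / train_cd[1]}
        clf = LogisticRegression(max_iter=1000,
                                 class_weight=weights).fit(X_train, y_train)
    return test_cd

# toy usage
X, y = make_classification(n_samples=400, random_state=0)
test_cd = cde_iterate(X[:200], y[:200], X[200:])
```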
We did not find any open library where the baseline quantification methods were implemented. We therefore shared all the algorithms, which we programmed in Python, on a GitHub repository. We believe that this library can help pool information on quantification.

Experiment Methodology
This section describes our experimental setup. It describes the datasets we use, the specific experiments we run and the classifier induction algorithm we employ.

Simulations on Artificial Data
We present a simple experiment that illustrates the efficiency of iterative adjustment of the a priori probabilities.
We use random sample generators from the Scikit-Learn library to build artificial datasets of controlled size and complexity (http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html). For each dataset we generate 10,000 records with 10 features. Figure 3 exemplifies two features of a dataset with two classes.
The initial prevalence of classes c_1 and c_2 was equal (p_train(c_1) = p_train(c_2) = 0.5). The total set randomly splits into two subsets: a 25% training set and a 75% test set. For the training set, the class distribution remains unchanged. For the test set, we vary the prevalence value from 0.05 to 0.95. For each prevalence value we generate a hundred different test sets; therefore, nineteen hundred replications of the following experimental design are applied.
We use the Kullback-Leibler Divergence (KLD) between the true class prevalence and the predicted class prevalence as a quality evaluation metric for quantifiers.
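One replication of this design can be sketched as follows; the resampling helper and its names are our own illustration, not the library code:

```python
import numpy as np
from sklearn.datasets import make_classification

def kld(true_prev, est_prev, eps=1e-12):
    """Kullback-Leibler divergence between two prevalence vectors."""
    t = np.clip(true_prev, eps, 1.0)
    e = np.clip(est_prev, eps, 1.0)
    return float(np.sum(t * np.log(t / e)))

def resample_to_prevalence(X, y, prevalence, n, rng):
    """Draw a test set of size n with the requested positive-class prevalence."""
    n_pos = int(prevalence * n)
    pos = rng.choice(np.flatnonzero(y == 1), n_pos)
    neg = rng.choice(np.flatnonzero(y == 0), n - n_pos)
    idx = np.concatenate([pos, neg])
    return X[idx], y[idx]

# 10,000 records with 10 features, balanced classes, 25%/75% split
X, y = make_classification(n_samples=10_000, n_features=10,
                           weights=[0.5, 0.5], random_state=0)
n_train = int(0.25 * len(y))
X_te, y_te = X[n_train:], y[n_train:]

# one shifted test set with prevalence 0.2 instead of the training 0.5
rng = np.random.default_rng(0)
X_s, y_s = resample_to_prevalence(X_te, y_te, prevalence=0.2, n=2000, rng=rng)
```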

Test Dataset
To evaluate the algorithms on real data, we participated in SemEval-2016 Task 4, called "Sentiment Analysis in Twitter". Its dataset consists of Twitter messages (aka observations) divided into several topics. Task 4 consists of five subtasks, but we only participated in subtasks D and E: tweet quantification according to a two-point scale and a five-point scale, respectively. These subtasks are evaluated independently for different topics, and the final result is counted as an average of the evaluation measure over all the topics (Nakov et al., 2016).
The organizers provide a default split of the data into training, development and development-time testing datasets. The algorithms evaluation is performed using these subsets. The training subset is used as a TRAIN set; the development and development-time testing subsets are used as a TEST set.
Since an observation x in this dataset is a message written in a natural language, we first need to transform it to the vector representation X. Based on a study by (Gao and Sebastiani, 2015), we choose the following components of the feature vector: TF-IDF for word n-grams with n varying from 1 to 4, and TF-IDF for character n-grams where n varies from 3 to 5.
The feature vector is extracted with a Scikit-Learn tool (sklearn.feature_extraction). We also perform data preprocessing: several text patterns (e.g. links, emoticons, numbers) were replaced with their substitutes. For word n-grams we apply lemmatization using WordNetLemmatizer.
It is interesting to characterize messages using the SentiWordNet library. For each token x_i in a document X we obtain its polarity value from SentiWordNet. First, we recognize the part of speech using a tagger from the NLTK library (Bird et al., 2009). Second, we get the SentiWordNet first polarity value for this token using the part of speech information.
We used polarity values to extend the vector representation of documents in two ways. First, we simply calculate the polarity score as a sum of positive minus a sum of negative polarity values and add this feature to the vector representation of a document. Second, we calculate the sum of positive polarities and the sum of negative polarities and add these two features to the vector representation of a document.
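The n-gram part of this feature set can be sketched with scikit-learn as follows; the n-gram ranges come from the description above, while all other vectorizer parameters and the example tweets are illustrative:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF over word 1-4-grams and character 3-5-grams
word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 4))
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 5))

tweets = ["BREAKING: US GDP growth is back! #kidding",
          "happpyyy today lol"]
# concatenate the two sparse feature blocks into one document matrix
X = hstack([word_vec.fit_transform(tweets),
            char_vec.fit_transform(tweets)])
```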
The metrics that we use to evaluate the classifier performance are described in (Nakov et al., 2016) and are not described here.

Experiment Results
We apply the six quantification methods described in Section 2 (CC, PCC, ACC, PACC, EM and CDEIterate) and compare them.

Synthesized Data
First, we applied the CC, PCC, ACC, PACC, EM and CDEIterate algorithms to the generated data described in Section 3.1. Synthesized data allows us to perform a comparative analysis of these quantification methods under different amounts of distribution drift.
Figure 4 demonstrates the means and standard deviations of the evaluation measure, the Kullback-Leibler Divergence (KLD); each point is obtained by averaging over the one hundred test sets generated for a given prevalence. It is obvious from Figure 4 that the CDEIterate approach shows the lowest KLD mean values when the distribution drift is relatively large. The standard deviation of the CDEIterate method also remains the smallest across all amounts of distribution drift.
On the contrary, the EM approach shows very unstable results. Sometimes the EM algorithm converges far from the real value. Its standard deviation displays the same unstable behavior.
For a more careful consideration, we show the same curves on a logarithmic scale in Figure 5. When the distribution changes from the starting value p_train(c) = 0.5 by less than 0.1, the simple methods like CC and PCC show better performance (lower KLD).

Test Data
We noticed that the CDEIterate method sometimes converges to different values if the algorithm starts iterating from a different starting point. To explore this, we add the COST_start variable to the algorithm shown in Figure 2. The first starting point is the a priori probability distribution of the training set; that is, for the starting iteration we assume TEST_CD to equal TRAIN_CD. The second starting point is a uniformly distributed TEST_CD; this case is labeled CDEIterate_U. In Section 4.1 above, these two starting points were actually the same. The CDEIterate_U approach showed the best accuracy on the testing set among all the methods on both the five-point and two-point scales.
SentiWordNet is usually regarded as an important source of information about word sentiment (Baccianella et al., 2010; Esuli and Sebastiani, 2006). In our comparison, we add the sum of positive scores and the sum of negative scores of each word as two additional features to the feature vector; only the first meaning, according to the recognized part of speech, was used. The quantification methods remain the same. We explain the observed behavior as follows: a simple algorithm cannot capture every regularity in the base features, so such additional features increase the dimensionality and, thereby, the accuracy. In a more complex case, the classifier extracts information from the base features more efficiently, and the additional information about polarity scores leads to overtraining. We can guess that, as tweets contain creative spelling and abbreviations common in Twitter (like "lol", which is not present in SentiWordNet), character n-grams carry more specific information than the polarity scores of selected, properly written words. Therefore, we exclude the SentiWordNet features from the final feature vector.

Conclusion and future work
The aim of this research was to perform a comparative analysis of state-of-the-art quantification techniques.
For tweet quantification on a five-point scale (Subtask E) and a two-point scale (Subtask D), the best performance was demonstrated by the adapted iterative method proposed by (Xue and Weiss, 2009), based on an iterative procedure with a cost-sensitive supervised learner. All the algorithms mentioned in the article are available on the GitHub repository.
In our future work, we are planning to move in two directions. First, we plan to extend the vector of features used for representation of documents. Second, we want to add more quantification methods to our open source library.