QCRI at SemEval-2016 Task 4: Probabilistic Methods for Binary and Ordinal Quantification

We describe the systems we have used for participating in Subtasks D (binary quantification) and E (ordinal quantification) of SemEval-2016 Task 4 “Sentiment Analysis in Twitter”. The binary quantification system uses a “Probabilistic Classify and Count” (PCC) approach that leverages the calibrated probabilities obtained from the output of an SVM. The ordinal quantification approach uses an ordinal tree of PCC binary quantifiers, where the tree is generated via a splitting criterion that minimizes the ordinal quantification loss.


Introduction
This document describes the systems we have used for participating in Subtasks D (binary quantification) and E (ordinal quantification) of SemEval-2016 Task 4 "Sentiment Analysis in Twitter". In the runs we have submitted no training data was used other than the officially provided ones (indeed, the only "external" data used were the sentiment lexicons mentioned in Section 2). Like a classification system, a system for performing quantification consists of two main components: (i) an algorithm for converting the objects of interest (tweets, in our case) into vectorial representations that can be interpreted both by the learning algorithm and, once it has been trained, by the quantifier itself, and (ii) an algorithm for training quantifiers from vectorial representations of training objects. Section 2 will be devoted to discussing component (i), while Sections 3 and 4 will be devoted to discussing the two learning algorithms we have deployed for the two tasks.

Features for detecting tweet sentiment
As in (Gao and Sebastiani, 2015; Gao and Sebastiani, 2016), for building vectorial representations of tweets we have followed the approach discussed in (Kiritchenko et al., 2014, Section 5.2.1), since the representations presented therein are those used in the systems that performed best at both the SemEval 2013 (Mohammad et al., 2013) and SemEval 2014 tweet sentiment classification shared tasks.
The text is preprocessed by normalizing URLs and mentions of users to the constants http://someurl and @someuser, respectively, after which tokenisation and POS tagging are performed. The binary features used (i.e., features denoting presence or absence in the tweet) include word n-grams, for n ∈ {1, 2, 3, 4}, and character n-grams, for n ∈ {3, 4, 5}, whether the last token contains an exclamation and/or a question mark, whether the last token is a positive or a negative emoticon and, for each of the 1000 word clusters produced with the CMU Twitter NLP tool, whether any token from the cluster is present. Integer-valued features include the number of all-caps tokens, the number of tokens for each POS tag, the number of hashtags, the number of negated contexts, the number of sequences of exclamation and/or question marks, and the number of elongated words (e.g., cooooool).
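The normalization step and a few of the integer-valued features above can be illustrated with a short sketch. It is not the feature extractor used in the submitted runs: whitespace splitting stands in for a real tweet tokeniser, the regular expressions are assumptions, and the names `normalize` and `count_features` are hypothetical.

```python
import re

# Assumed patterns; a real system would rely on a tweet-aware tokeniser.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
ELONGATED_RE = re.compile(r"([a-zA-Z])\1{2,}")  # same letter 3+ times in a row
PUNCT_RUN_RE = re.compile(r"[!?]{2,}")          # runs of '!' and/or '?'


def normalize(tweet):
    """Map URLs and user mentions to the constants named in the text."""
    tweet = URL_RE.sub("http://someurl", tweet)
    return MENTION_RE.sub("@someuser", tweet)


def count_features(tweet):
    """Compute a handful of the count features on a normalized tweet."""
    text = normalize(tweet)
    tokens = text.split()  # whitespace split: a stand-in for tokenisation
    return {
        "all_caps": sum(1 for t in tokens
                        if len(t) > 1 and t.isalpha() and t.isupper()),
        "hashtags": sum(1 for t in tokens if t.startswith("#")),
        "elongated": sum(1 for t in tokens if ELONGATED_RE.search(t)),
        "punct_runs": len(PUNCT_RUN_RE.findall(text)),
        "last_has_excl_or_q": int(bool(tokens)
                                  and any(ch in "!?" for ch in tokens[-1])),
    }


features = count_features("@bob this is SOOO cooooool!!! http://t.co/abc #win")
```

POS-tag counts, negated contexts, word clusters and lexicon features need external resources and are left out of the sketch.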
A key addition to the above is represented by features derived from both automatically generated and manually generated sentiment lexicons; for these features we reuse sentiment lexicons from prior work, all of which are publicly available. We omit further details concerning our vectorial representations (and, in particular, how the sentiment lexicons contribute to them), both for brevity and because these vectorial representations are not the central focus of this paper; the interested reader is invited to consult (Kiritchenko et al., 2014, Section 5.2 [...]
[...] (Bella et al., 2010), Expectation Maximization for Quantification (EMQ) (Saerens et al., 2002), SVMs optimized for KLD (SVM(KLD)), SVMs optimized for NKLD (SVM(NKLD)) (Esuli and Sebastiani, 2014), and SVMs optimized for Q (SVM(Q)) (Barranquero et al., 2015). All 8 methods are described in detail in (Gao and Sebastiani, 2016), where we test them on a ternary tweet sentiment quantification task using 11 datasets and 6 evaluation measures. The aim of (Gao and Sebastiani, 2016) was to test whether the conclusions drawn from a previous experiment, where quantification was according to topic and where texts were significantly longer than tweets, were confirmed also in a context in which quantification is according to sentiment and the items are significantly shorter.
The preliminary experiments we performed for the present work were carried out by training our models on TRAIN+DEV and testing on DEVTEST. For the first 5 methods mentioned at the beginning of this section we make use of a standard SVM with a linear kernel, in the implementation made available in the LIBSVM system (Chang and Lin, 2011). For the other 3 methods we make use of an SVM for structured output prediction, in the implementation made available in the SVM-perf system (Joachims, 2005). For all 8 methods we optimize the C parameter (which sets the tradeoff between the training error and the margin) directly on DEVTEST by performing a grid search on all values of the form $10^x$ with $x \in \{-6, \ldots, 7\}$; we leave the other parameters at their default values. The PCC, PACC, and EMQ methods require the classifier to also generate posterior probabilities; since SVMs do not natively generate posterior probabilities, for these three methods we use the -b option of LIBSVM, which converts the scores originally generated by SVMs into posterior probabilities according to the algorithm of (Wu et al., 2004).
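The grid search over C described above amounts to evaluating 14 candidate values and keeping the one with the lowest loss. The sketch below shows only that selection logic; `c_grid`, `select_c` and `toy_loss` are hypothetical names, and `toy_loss` is a placeholder for "train on TRAIN+DEV with this C and return the quantification error on DEVTEST".

```python
import math


def c_grid(low=-6, high=7):
    """Candidate values 10^x for x in {low, ..., high}."""
    return [10.0 ** x for x in range(low, high + 1)]


def select_c(loss_for_c, grid=None):
    """Return the candidate C with the smallest loss (lower is better).

    loss_for_c is a caller-supplied function; in a real run it would
    train a model with the given C and score it on held-out data.
    """
    grid = c_grid() if grid is None else grid
    return min(grid, key=loss_for_c)


def toy_loss(c):
    # Placeholder loss with its minimum at C = 100; purely illustrative.
    return abs(math.log10(c) - 2.0)


best_c = select_c(toy_loss)
```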
The results of these preliminary experiments, which are reported in Table 1, indicated PCC as the best performer. These experiments by and large confirmed the results of (Gao and Sebastiani, 2016), where PCC was the best performer for 34 of the 66 combinations of 11 datasets × 6 evaluation measures. By contrast, SVM(KLD) was not the best performer for any of the 66 combinations, although it had been the best performer in the experiments of (Esuli and Sebastiani, 2015), where it also outperformed PCC. In (Gao and Sebastiani, 2016) we conjectured that this difference may be due to the fact that in quantification by sentiment class prevalences tend to be fairly high (> .10), whereas the experiments of (Esuli and Sebastiani, 2015) mostly concerned classes with low prevalence (< .10) or very low prevalence (< .01), which tend to be the norm in classification by topic.
As a result of all this, in this work we decided to use PCC; Section 3.1 describes the PCC method in detail.

Probabilistic Classify and Count (PCC)
The PCC method, originally introduced in (Bella et al., 2010), consists in generating a classifier from $Tr$, classifying the objects in $Te$, and computing $p_{Te}(c)$ as the expected fraction of objects predicted to belong to $c$. If by $p(c|x)$ we indicate the posterior probability, i.e., the probability of membership in $c$ of test object $x$ as estimated by the classifier, and by $E[x]$ we indicate the expected value of $x$, this corresponds to computing

$$\hat{p}^{PCC}_{Te}(c) = E[p_{Te}(\hat{c})] = \frac{1}{|Te|} \sum_{x \in Te} p(c|x)$$

where $p_{Te}(\hat{c})$ is the fraction of objects in $Te$ predicted to belong to $c$, and $\hat{p}^{M}_{S}(c)$ indicates the prevalence of class $c$ in set $S$ as estimated via method $M$ (the "hat" symbol indicates estimation). The rationale of PCC is that posterior probabilities contain richer information than binary decisions, which are usually obtained from posterior probabilities by thresholding.
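The contrast between counting thresholded decisions and averaging posterior probabilities can be made concrete with a small sketch. The posterior values are made up for illustration, the function names are hypothetical, and the 0.5 threshold for plain Classify and Count is an assumption.

```python
def classify_and_count(posteriors, threshold=0.5):
    """Fraction of items whose posterior reaches the (assumed) threshold."""
    return sum(1 for p in posteriors if p >= threshold) / len(posteriors)


def probabilistic_classify_and_count(posteriors):
    """PCC: the mean posterior probability of the class over the test set."""
    return sum(posteriors) / len(posteriors)


# Illustrative posteriors p(c|x) for four test items (not real data).
posteriors = [0.9, 0.8, 0.45, 0.45]
cc_estimate = classify_and_count(posteriors)
pcc_estimate = probabilistic_classify_and_count(posteriors)
```

Here two items fall just below the threshold: counting decisions ignores them entirely, while PCC still credits their 0.45 probability mass to the class, so the two estimates differ (0.5 versus roughly 0.65).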
For our final run, we have retrained the system on TRAIN+DEV+DEVTEST, using the parameter values which had performed best on DEVTEST in the preliminary experiments. On the official test set (Nakov et al., 2016) we obtained a KLD score of 0.055, and thus ranked 5th in a set of 14 participating teams.

Subtask E: Tweet quantification according to a five-point scale
Our goal in tackling the ordinal quantification task has been to devise a new learning algorithm for ordinal quantification. We decided to aim for an algorithm that (a) leverages the information inherent in the class ordering, and (b) performs quantification according to the Probabilistic Classify and Count (PCC) method (Bella et al., 2010; see also Gao and Sebastiani, 2016, §4.2), since this has proven the best-performing method in the tweet quantification experiments of (Gao and Sebastiani, 2016). Ordinal quantification will be tackled by arranging the classes in the totally ordered set $C = \{c_1, \ldots, c_{|C|}\}$ into a binary tree. Given any $j \in \{1, \ldots, |C|-1\}$, $C_j = \{c_1, \ldots, c_j\}$ will be called a prefix of $C$, and $\overline{C}_j = \{c_{j+1}, \ldots, c_{|C|}\}$ will be called a suffix of $C$. Given any $j \in \{1, \ldots, |C|-1\}$ and a set $S$ of items labelled according to $C$, by $S_j$ we denote the set of items in $S$ whose class is in $C_j$, and by $\overline{S}_j$ we denote the set of items in $S$ whose class is in $\overline{C}_j$.
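As an illustration of the prefix/suffix notation, the sketch below enumerates the $|C|-1$ ways of cutting an ordered class list into a prefix and a suffix, and partitions a labelled set accordingly. The function names and the representation of labelled items as (item, label) pairs are assumptions made for the example.

```python
def prefix_suffix_splits(classes):
    """All (prefix, suffix) pairs of an ordered class list, for j = 1..|C|-1."""
    return [(classes[:j], classes[j:]) for j in range(1, len(classes))]


def partition_items(labelled_items, prefix):
    """Split (item, label) pairs into those labelled in the prefix and the rest."""
    prefix_set = set(prefix)
    in_prefix = [(x, y) for x, y in labelled_items if y in prefix_set]
    in_suffix = [(x, y) for x, y in labelled_items if y not in prefix_set]
    return in_prefix, in_suffix


splits = prefix_suffix_splits([1, 2, 3])
sample = [("t1", 1), ("t2", 3), ("t3", 2)]
s_prefix, s_suffix = partition_items(sample, splits[0][0])
```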

Generating a quantification tree
The algorithm for training a quantification tree is described in concise form as Algorithm 1, and goes as follows. Assume we have a training set $Tr$ and a held-out validation set $Va$ of items labelled according to $C$.
The first step (Line 3) consists in training $(|C|-1)$ binary classifiers $h_j$, for $j \in \{1, \ldots, |C|-1\}$. Each of these classifiers must discriminate between $C_j$ and $\overline{C}_j$; for training $h_j$ we will take the items in $Tr_j$ as the negative training examples and the items in $\overline{Tr}_j$ as the positive training examples. We require that these classifiers, aside from taking binary decisions (i.e., predicting if a test item is in $C_j$ or in $\overline{C}_j$), also output posterior probabilities, i.e., probabilities $p(C_j|x)$ and $p(\overline{C}_j|x) = 1 - p(C_j|x)$, where $p(c|x)$ indicates the probability of membership in $c$ of test object $x$ as estimated by the classifier [5].
The second step (Line 5) is building the ordinal quantification (binary) tree. In order to do this, among the classifiers $h_j$ we pick the one (let us assume it is $h_t$) that displays the highest quantification accuracy (Line 12) on the validation set $Va$, and we place it at the root of the binary tree. We then repeat the process recursively on the left and on the right branches of the binary tree (Lines 14 to 17), thus building a fully grown quantification tree. Quantification is performed according to the PCC method described in Section 3.1. We measure the quantification accuracy of classifier $h_j$ via Kullback-Leibler Divergence (KLD), defined as

$$KLD(p, \hat{p}) = \sum_{c \in C} p(c) \log \frac{p(c)}{\hat{p}(c)}$$

where $p$ is the true distribution and $\hat{p}$ is the distribution estimated via PCC using the posterior probabilities generated by $h_j$; lower values of KLD indicate higher quantification accuracy.
[5] If the classifier only returns confidence scores that are not probabilities (as is the case with many non-probabilistic classifiers), the former must be converted into true probabilities. If the score is a monotonically increasing function of the classifier's confidence in the fact that the object belongs to the class, the conversion may be obtained by applying a logistic function. Well-calibrated probabilities (defined as the probabilities such that the prevalence $p_S(c)$ of a class $c$ in a set $S$ is equal to $\frac{1}{|S|} \sum_{x \in S} p(c|x)$) may be obtained by using a generalized logistic function; see e.g., (Berardi et al., 2015, Section 4.4) for details.
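The recursive construction can be sketched as follows. This is an illustrative reading of the procedure, not a transcription of Algorithm 1: it assumes that the validation KLD of every classifier $h_j$ has already been computed once on $Va$ and stored in a dictionary `split_loss` keyed by $j$, it represents internal nodes as tuples `(j, left, right)` and leaves as bare class labels, and the `eps` smoothing term in `kld` is an assumption added to avoid division by zero.

```python
import math


def kld(true_prev, est_prev, eps=1e-9):
    """KLD(p, p_hat) = sum_c p(c) * log(p(c) / p_hat(c)).

    eps is an assumed smoothing floor for estimated prevalences of zero.
    """
    total = 0.0
    for p, q in zip(true_prev, est_prev):
        if p > 0:
            total += p * math.log(p / max(q, eps))
    return total


def build_tree(classes, split_loss, lo=0, hi=None):
    """Build a quantification tree over classes[lo:hi].

    split_loss[j] is the validation KLD of the classifier separating the
    first j classes from the rest (assumed precomputed). The split with
    the lowest KLD becomes the node; recursion handles both sides.
    """
    hi = len(classes) if hi is None else hi
    if hi - lo == 1:
        return classes[lo]  # a single class left: this is a leaf
    t = min(range(lo + 1, hi), key=lambda j: split_loss[j])
    return (t,
            build_tree(classes, split_loss, lo, t),
            build_tree(classes, split_loss, t, hi))


# Illustrative KLD values (not measured results) for six ordered classes.
losses = {1: 0.05, 2: 0.02, 3: 0.04, 4: 0.01, 5: 0.03}
tree = build_tree([1, 2, 3, 4, 5, 6], losses)
```

With these made-up losses the root separates classes 1 to 4 from classes 5 and 6, and each side is then split again using only the classifiers whose cut point falls inside it.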

Estimating class prevalences via an ordinal quantification tree
The algorithm for estimating class prevalences by using an ordinal quantification tree is described in concise form as Algorithm 2, and goes as follows.
Essentially, for each item $x \in Te$ and for each class $c \in C$, we compute (Line 6) the posterior probability $p(c|x)$; the estimate $\hat{p}_{Te}(c)$ is computed as the average, across all $x \in Te$, of $p(c|x)$. The posterior probability $p(c|x)$ is computed in a recursive, hierarchical way (Lines 13 to 18), i.e., as the probability that the binary classifiers that lie on the path from the root to leaf $c$ would classify item $x$ exactly in leaf $c$ (i.e., that they would route $x$ exactly to leaf $c$). This probability is computed as the product of the posterior probabilities returned by the classifiers that lie on the path from the root to leaf $c$. An example quantification tree for a set of $|C| = 6$ classes is displayed in Figure 1; for brevity, classes are represented by natural numbers, the total order defined on them is the order defined on the natural numbers, and sets of classes are represented by sequences of natural numbers. Note that, as exemplified in Figure 1, our algorithm generates trees for which (a) there is a 1-to-1 correspondence between classes and leaves of the tree, (b) leaves are ordered left to right in the same order of the classes in $C$, and (c) each internal node represents a decision between a suffix and a prefix of $C$.
Point (c) is interesting, and deserves some discussion. Indeed, internal node "1234 vs. 56" is trained by using items labelled as 1, or 2, or 3, or 4 as negative examples and items labelled as 5, or 6 as positive examples; however, by looking at Figure 1, it would seem intuitive that items labelled as 6 should not be used, since the node is root to a subtree where class 6 is not an option anymore. The reason why we do use items labelled as 6 (which is the reason the node is labelled "1234 vs. 56" and not "1234 vs. 5") is that, during the classification stage, the classifier associated with the node might be asked to classify an item whose true label is 6, and which has thus been misclassified up higher in the tree. In this case, it would be important that this item be classified as 5, since this minimizes the contribution of this item to misclassification error; and the likelihood that this happens is increased if the classifier is trained to choose between 1234 and 56, rather than between 1234 and 5. Note also that this is one aspect for which our algorithm is a true ordinal classification algorithm; if there were no order defined on the classes this policy would make no sense.

Algorithm 2 (opening lines only):
1  Function QuantifyViaHierarchicalPCC(Te, T_C);
   /* Estimates class prevalences on Te using the quantification tree */
   Input : Unlabelled set Te; Quantification tree T_C;
   Output: Estimates p̂(c) for all c ∈ C;
2  for c ∈ C do
3      p̂(c) ← 0
4  end
5  for x ∈ Te do
A second reason why our algorithm is an inherently ordinal quantification algorithm is that the groups of classes (such as 1234 and 56) between which a binary classifier needs to discriminate are groups of classes that are contiguous in the order defined on C. It is because of this contiguity that the structure of the trees we generate makes sense: if, say, classes {1, ..., 6} represent degrees of positivity of product reviews, with 1 representing most negative and 6 representing most positive, the group 56 may be taken to represent the positive reviews (to different degrees), while 1234 may be taken to represent the reviews that are not positive; a group such as, say, 256, would instead be very hard to interpret, since it is formed of non-contiguous classes that have little in common with each other.
Finally, we remark that our ordinal quantification algorithm does not depend on the fact that PCC is the chosen quantification method, and could be adapted to work with other such methods, e.g., SVM(KLD). Indeed, if SVM(KLD) is the chosen quantification method, in Algorithm 1 we only need to change the learning method we use (Line 3), and change the recursive subroutine CPost (Lines 12 to 17) in such a way that, by recursively making binary choices down the tree, it picks exactly one out of the $|C|$ leaf classes instead of computing the posterior probabilities for all of them.
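The PCC-based prevalence estimation of this section (the posterior of a leaf as the product of the posteriors along its root-to-leaf path, then averaged over the test set) can be sketched as below. The sketch is not a transcription of Algorithm 2: the tree encoding (`(j, left, right)` tuples for internal nodes, bare non-tuple labels for leaves), the callback `suffix_prob(j, x)` returning the probability that $x$ belongs to the suffix side of split $j$, and all function names are assumptions.

```python
def leaf_posteriors(tree, x, suffix_prob, mass=1.0, out=None):
    """Distribute probability mass for item x over the leaves of the tree.

    Going right multiplies the mass by p(suffix | x) of the node's
    classifier; going left multiplies it by 1 - p(suffix | x).
    Leaves are assumed to be class labels that are not tuples.
    """
    out = {} if out is None else out
    if not isinstance(tree, tuple):
        out[tree] = mass
        return out
    j, left, right = tree
    p_right = suffix_prob(j, x)
    leaf_posteriors(left, x, suffix_prob, mass * (1.0 - p_right), out)
    leaf_posteriors(right, x, suffix_prob, mass * p_right, out)
    return out


def quantify(tree, items, suffix_prob):
    """Estimate class prevalences as the mean leaf posterior over items."""
    totals = {}
    for x in items:
        for c, p in leaf_posteriors(tree, x, suffix_prob).items():
            totals[c] = totals.get(c, 0.0) + p
    return {c: v / len(items) for c, v in totals.items()}


# Toy setup: three ordered classes, root split "ab vs. c", then "a vs. b".
toy_tree = (2, (1, "a", "b"), "c")
# Each toy item directly stores the suffix probability for each split j.
toy_items = [{1: 0.5, 2: 0.2}, {1: 0.25, 2: 0.6}]
prevalences = quantify(toy_tree, toy_items, lambda j, x: x[j])
```

Because the two branch probabilities at every node sum to one, the leaf posteriors of each item sum to one, and so do the estimated prevalences.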

Our run
In our preliminary run over the DEVTEST set, our system obtained an EMD value of 0.210; to obtain this, the optimization of the C parameter (see Section 3.1) was carried out using EMD as a criterion, i.e., the parameter value that yielded the best EMD value on DEVTEST was chosen. For comparison, we also ran on DEVTEST a baseline multiclass PCC system, i.e., one which performs quantification according to the PCC method and does not take the order on the classes into account; the baseline system, after parameter optimization, obtained an EMD value of 0.222, a 5.64% deterioration with respect to our system. As a result, we decided to tackle the unlabelled set with our system as described in Sections 4.1 and 4.2. Note that, unlike for Subtask D, in Subtask E we did not have a range of other datasets to perform preliminary experiments with; as a result, the only choice that could make sense here was using the system which had performed best on DEVTEST.
On the official test set (Nakov et al., 2016) we obtained an EMD score of 0.243, ranking 1st in a set of 10 participating systems, with a high margin over the other ones (systems from rank 2 to rank 8 obtained EMD scores between 0.316 and 0.366).
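The text above does not spell out how EMD is computed. For distributions over ordered classes with unit distance between adjacent classes, a common closed form sums the absolute differences between the two cumulative distributions over the first $|C|-1$ classes; the sketch below assumes that form, and the prevalence vectors are made up for illustration.

```python
def emd(true_prev, est_prev):
    """Earth Mover's Distance between two distributions over ordered classes.

    Assumed form: sum over j = 1..|C|-1 of the absolute difference
    between the cumulative estimated and cumulative true prevalences.
    """
    total = 0.0
    cum_true = 0.0
    cum_est = 0.0
    for p, q in zip(true_prev[:-1], est_prev[:-1]):
        cum_true += p
        cum_est += q
        total += abs(cum_est - cum_true)
    return total


# Illustrative five-class distributions (not the task data).
true_dist = [0.1, 0.2, 0.3, 0.2, 0.2]
est_dist = [0.2, 0.2, 0.2, 0.2, 0.2]
score = emd(true_dist, est_dist)
```

Unlike KLD, this measure penalises an estimate more when the misplaced probability mass lands farther away in the class order, which is what makes it suitable for an ordinal task.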