ISTI-CNR at SemEval-2016 Task 4: Quantification on an Ordinal Scale

This paper describes the participation of ISTI-CNR in task 4 of SemEval 2016. Among the five subtasks, special attention has been paid to the five-point scale quantification subtask. The quantification method we propose is based on the observation that a standard document-by-document regression method usually has a bias towards assigning high-prevalence labels. Our method models this bias with a linear model, in order to compensate for it and to produce the quantification estimates.


Introduction
The participation of ISTI-CNR in task 4 of SemEval 2016 (Nakov et al., 2016) produced submissions for all five proposed subtasks.
Submissions for subtasks A and B are based on a relatively typical machine learning pipeline, with B used as the base classification tool for subtask D, which uses a quantification via classification method. Subtask C uses an ordinal regression method, based on building a data-balanced tree of binary classifiers. The regression method of subtask C has been used as the base regression tool for the implementation of the quantification method for subtask E.
We propose a novel quantification method for subtask E, tweet quantification according to a five-point scale. The method stems from the intuition of measuring and compensating the bias a regression model may have towards the labels with high prevalences.
All the code we produced for these tasks is mainly based on the scikit-learn Python library (Pedregosa et al., 2011) and is published under an open source license.
The next sections detail the data and methods adopted to produce the five submissions.

Training data
The labeled training dataset has been downloaded using the tool suggested by the organizers. Table 1 summarizes the total number of tweets available for download at the time of crawling (November 2015). The final number of tweets used to train the classifiers, or quantifiers, is 6223 for subtasks A, C, and E, and 4475 for subtasks B and D.
A small number of tweets in the dataset appeared multiple times, some with conflicting labels. In the "train" data parts, for example, 25 tweets appeared twice for subtasks A, C, and E; six of them had conflicting labels for subtask A, and 14 had conflicting labels for subtasks C and E. For subtasks B and D, the number of tweets appearing twice in the training data is 13, one of them with conflicting labels. Duplicate tweets have been reduced to a single instance, and those with conflicting labels have been excluded from the dataset and from any analysis performed in this work.

Training data for subtasks A and B has been enriched by adding 5331 positive and 5331 negative sentences extracted from movie reviews, which are part of the movie review dataset (Pang and Lee, 2005). Even though these sentences are domain-specific, they are deemed to contribute to the learning process by enriching the vocabulary of expressions used to denote positive and negative sentiments. The final training set is thus composed of 16885 examples for subtask A, and 15137 for subtask B.
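The deduplication step described above can be sketched as follows; the data layout (id, text, label tuples) is a hypothetical simplification, not the format actually used by the task organizers:

```python
from collections import defaultdict

def deduplicate(tweets):
    """Collapse duplicate tweets into a single instance and drop those
    whose duplicates carry conflicting labels.

    `tweets` is a list of (tweet_id, text, label) tuples; this layout is
    illustrative, not the task's actual data format."""
    labels_by_id = defaultdict(set)
    text_by_id = {}
    for tweet_id, text, label in tweets:
        labels_by_id[tweet_id].add(label)
        text_by_id[tweet_id] = text
    # keep only tweets whose duplicates all agree on a single label
    return [(tid, text_by_id[tid], labels_by_id[tid].pop())
            for tid in text_by_id
            if len(labels_by_id[tid]) == 1]
```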

Features
The transformation of each tweet into its vectorial representation uses a relatively simple process. The text of each tweet is tokenized and stopwords are removed. Word bigrams and trigrams, and character fourgrams, are added to the representation. Regular expressions are used to detect mentions, hashtags, URLs, and emoticons, and a metafeature for each of these special types of information is added to the representation; e.g., if a tweet has two hashtags, the 'hashtag' feature with frequency two is added to the representation of the tweet. The vectors are weighted by tf·idf. Feature selection based on χ² is used to retain only the x most informative features, with x determined for each subtask with a cross-validation on training data.
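A minimal scikit-learn sketch of this representation is given below. The regular expressions, the injection of metafeatures as pseudo-tokens, and the value of k are illustrative assumptions; the paper does not specify its exact implementation:

```python
import re
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Simplified stand-ins for the regular expressions described in the text.
MENTION = re.compile(r'@\w+')
HASHTAG = re.compile(r'#\w+')
URL = re.compile(r'https?://\S+')

def add_metafeatures(text):
    """Append one pseudo-token per mention/hashtag/URL occurrence, so a
    tweet with two hashtags gets the '_hashtag_' feature with frequency 2."""
    extra = (['_mention_'] * len(MENTION.findall(text))
             + ['_hashtag_'] * len(HASHTAG.findall(text))
             + ['_url_'] * len(URL.findall(text)))
    return text + ' ' + ' '.join(extra)

# Word unigrams/bigrams/trigrams plus character fourgrams, tf-idf weighted,
# followed by chi-squared feature selection (k tuned by cross-validation
# in the paper; 10 here is just a placeholder for the tiny example).
features = Pipeline([
    ('vec', FeatureUnion([
        ('word', TfidfVectorizer(preprocessor=add_metafeatures,
                                 ngram_range=(1, 3), stop_words='english')),
        ('char', TfidfVectorizer(analyzer='char', ngram_range=(4, 4))),
    ])),
    ('sel', SelectKBest(chi2, k=10)),
])
```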

Subtasks A and B: classification
A linear SVM has been used for both classification tasks: a simple binary classifier for subtask B, and three one-vs-all binary classifiers for subtask A. The value of the parameter C of the SVM has been determined with a cross-validation on training data.
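The tuning of C by cross-validation can be sketched with scikit-learn as follows; the grid of C values is an assumption, since the paper does not list the values explored:

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

def train_svm(X_train, y_train, cv=5):
    """Fit a linear SVM, tuning C by cross-validation on the training
    data as described in the paper. The C grid is illustrative. For
    subtask A, LinearSVC trains one-vs-rest binary classifiers natively
    when given three classes."""
    search = GridSearchCV(LinearSVC(), {'C': [0.01, 0.1, 1, 10, 100]}, cv=cv)
    search.fit(X_train, y_train)
    return search.best_estimator_
```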

Subtask C: regression
The Balanced Binary Tree for Ordinal Regression (BBTOR) method we designed for subtask C is based on building a tree of binary classifiers that recursively splits the ordinal scale at the point of maximum balance in the number of training examples assigned to the two sides of the binary classification problem. The method has been compared on training data against other regression methods, i.e., SVORIM (Chu and Keerthi, 2007), based on linear regression, and DDAG (Aiolli et al., 2009), based on binary classifiers, and produced the best performance.
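The tree-building idea can be sketched as below. This is a minimal reading of the BBTOR description, not the authors' implementation: the base learner, tie breaking, and prediction routine are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

class BBTOR:
    """Sketch of the Balanced Binary Tree for Ordinal Regression idea:
    recursively split the ordered label set at the cut point that most
    evenly balances the training examples on the two sides, training a
    binary classifier at each internal node."""

    def fit(self, X, y, labels=None):
        if labels is None:
            labels = sorted(set(y))
        if len(labels) == 1:
            self.leaf = labels[0]
            return self
        self.leaf = None
        counts = np.array([np.sum(y == c) for c in labels])
        # choose the cut minimizing the imbalance between the two sides
        cut = min(range(1, len(labels)),
                  key=lambda k: abs(counts[:k].sum() - counts[k:].sum()))
        left_labels, right_labels = labels[:cut], labels[cut:]
        side = np.isin(y, right_labels)       # binary problem at this node
        self.clf = LinearSVC().fit(X, side.astype(int))
        self.left = BBTOR().fit(X[~side], y[~side], left_labels)
        self.right = BBTOR().fit(X[side], y[side], right_labels)
        return self

    def predict_one(self, x):
        if self.leaf is not None:
            return self.leaf
        node = self.right if self.clf.predict(x.reshape(1, -1))[0] else self.left
        return node.predict_one(x)

    def predict(self, X):
        return np.array([self.predict_one(x) for x in X])
```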

Subtask D: binary quantification
Four quantification methods based on classification have been compared, following the works of Forman (2008) and Bella et al. (2010). The four methods are: classify and count (CC), in which a classifier is applied to the test documents and the prevalences are determined by counting the documents assigned to each label; adjusted classify and count (ACC), in which the output of the CC method is corrected to take into account the bias in error the classifier may have towards one of the two labels; probabilistic classify and count (PCC), in which the contribution of each document to the counting is weighted by the confidence the classifier has in the assignment; and probabilistic adjusted classify and count (PACC), which is the ACC method applied to the probabilistic model of PCC. In a cross-validation on training data, in which each topic has in turn been used as test data and the remaining ones as training data, PCC performed best and was thus used for the final submission.
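Under the standard definitions from Forman (2008) and Bella et al. (2010), the CC, PCC, and (binary) ACC estimates can be sketched as:

```python
import numpy as np

def classify_and_count(predicted_labels, labels):
    """CC: prevalence of each label = fraction of documents assigned to it."""
    n = len(predicted_labels)
    return {c: np.sum(predicted_labels == c) / n for c in labels}

def probabilistic_classify_and_count(posterior, labels):
    """PCC: each document contributes its posterior probability for each
    label instead of a hard count. `posterior` is an (n_docs, n_labels)
    array of calibrated probabilities."""
    return dict(zip(labels, posterior.mean(axis=0)))

def adjusted_classify_and_count(p_cc, tpr, fpr):
    """ACC (binary case): correct the CC estimate with the classifier's
    true/false positive rates, estimated by cross-validation on training
    data, clipping the result to a valid prevalence."""
    return min(1.0, max(0.0, (p_cc - fpr) / (tpr - fpr)))
```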
Subtask E: quantification on an ordinal scale

Two methods have been compared for subtask E. One is a simple regress and count (RC) method, in which the BBTOR method used in subtask C is applied to the documents of a topic, and the quantification values for the topic are then determined by counting the number of documents assigned to each slot of the ordinal scale.

We propose the adjusted regress and count (ARC) method, which is based on the intuition of measuring, and compensating, the typical bias of regression methods to assign documents to the slots of the ordinal scale that have higher prevalences. Let us denote the prevalences for a topic-label pair by $p_j(c_i)$, where $j$ indicates a topic in the set of topics $\{t_1, \ldots, t_n\}$ and $i$ a label in the set of ordered labels $\{c_1, \ldots, c_m\}$ that form the ordinal scale. On a given set of topics, the cumulative prevalence for each label is denoted by $P(c_i) = \sum_{j=1}^{n} p_j(c_i)$. Given a quantification method that produces estimates $\hat{p}_j(c_i)$, its cumulative prevalences are denoted by $\hat{P}(c_i) = \sum_{j=1}^{n} \hat{p}_j(c_i)$. Under the hypothesis of a linear error model, knowing the estimated prevalences and the cumulative true and estimated prevalences on a set of topics, the true prevalence for a topic can be determined as:

$$p_j(c_i) = \frac{P(c_i)}{\hat{P}(c_i)} \, \hat{p}_j(c_i)$$

Note that the model uses a different linear correction weight $w_i = P(c_i)/\hat{P}(c_i)$ for each label $c_i$. The correction weight $w_i$ cannot be determined on the test data, since $P(c_i)$ is unknown. Following the ACC method for binary quantification (Forman, 2008), which estimates its correction parameter on the training set, the $w_i$ values can likewise be approximated on the training data using cross-validation, substituting $w_i$ with $w_i^{Tr} = P^{Tr}(c_i)/\hat{P}^{Tr}(c_i)$. In this way the ARC quantification estimate can be derived from the RC estimate using the formula:

$$\hat{p}_j^{ARC}(c_i) = \frac{w_i^{Tr} \, \hat{p}_j(c_i)}{Z_j}$$

where $Z_j = \sum_{i=1}^{m} w_i^{Tr} \, \hat{p}_j(c_i)$ is a normalization factor that guarantees that the prevalences for a topic sum up to one.
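The ARC correction, with the $w_i^{Tr}$ weights estimated from cross-validated estimates on training topics, can be sketched as a few lines of NumPy; the function and argument names are our own:

```python
import numpy as np

def arc_correct(rc_estimates, true_train, est_train, eps=1e-9):
    """Adjusted Regress and Count (a sketch of the ARC correction).

    rc_estimates: (n_topics, n_labels) RC prevalence estimates on test topics.
    true_train:   (n_topics_tr, n_labels) true prevalences on training topics.
    est_train:    (n_topics_tr, n_labels) cross-validated RC estimates on the
                  same training topics.
    Returns corrected, renormalized prevalence estimates."""
    # w_i^Tr = P^Tr(c_i) / P_hat^Tr(c_i): one linear weight per label
    w = true_train.sum(axis=0) / (est_train.sum(axis=0) + eps)
    corrected = rc_estimates * w
    # Z_j renormalizes each topic's prevalences to sum to one
    return corrected / corrected.sum(axis=1, keepdims=True)
```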
The ARC method produced a considerable improvement over RC in a leave-one-topic-out validation on training data, as reported in Table 4.

Future work
The features extracted from text in these experiments are based on a traditional vector space model, in which each distinct feature is represented by a dedicated dimension of the vector space. The limited amount of training data, and the variety of topics, produce an effect of data sparsity, in which there is little overlap between features from training and test data. We plan to repeat the experiments using semantically richer features based on the use of language models, which should improve the vectorial representations by projecting features with similar semantic properties onto similar vectorial representations, thus reducing the effect of data sparsity.
The participation in subtask E resulted in a bias correction method, ARC, that performed well. ARC considerably improved on the baseline produced by the direct use of the original regression method (the one used to produce the submission for subtask C) without correction. Future work will explore the use of the bias correction method in combination with other ordinal regression methods, based either on classification or on linear regression.
A strong assumption of the ARC method is that the error on each label has a linear relation with the prevalence. This assumption can be considered to hold locally, i.e., when the variation of prevalence for a label across topics is limited, while it is harder to consider it valid when prevalences vary a lot across topics. Future work will explore the use of more complex models, e.g., fitting the differences observed between $p_j(c_i)$ and $\hat{p}_j(c_i)$ on the training set using a polynomial model, instead of a single $w_i$ weight.
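One hypothetical form of such a polynomial correction, sketched here only to make the idea concrete (this is not an implemented or evaluated method), is to fit one polynomial per label mapping estimated to true prevalences across training topics:

```python
import numpy as np

def fit_label_correction(est_prev, true_prev, degree=2):
    """Hypothetical extension of ARC: instead of a single weight w_i per
    label, fit a polynomial from estimated to true prevalence across
    training topics, one model per label.
    est_prev, true_prev: (n_topics, n_labels) arrays."""
    return [np.polyfit(est_prev[:, i], true_prev[:, i], degree)
            for i in range(est_prev.shape[1])]

def apply_label_correction(est, coeffs):
    """Apply the per-label polynomial corrections, then clip to
    non-negative values and renormalize each topic to sum to one."""
    corrected = np.column_stack(
        [np.polyval(c, est[:, i]) for i, c in enumerate(coeffs)])
    corrected = np.clip(corrected, 0, None)
    return corrected / corrected.sum(axis=1, keepdims=True)
```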