Detecting speculations, contrasts and conditionals in consumer reviews

A support vector classifier was compared to a lexicon-based approach for the task of detecting the stance categories speculation, contrast and conditional in English consumer reviews. Around 3,000 training instances were required to achieve a stable performance of an F-score of 90 for speculation. This outperformed the lexicon-based approach, for which an F-score of just above 80 was achieved. The machine learning results for the other two categories showed a lower average (an approximate F-score of 60 for contrast and 70 for conditional), as well as a larger variance, and were only slightly better than lexicon matching. Therefore, while machine learning was successful for detecting speculation, a well-curated lexicon might be a more suitable approach for detecting contrast and conditional.


Introduction
Stance taking, including attitudes, evaluations and opinions, has received a great deal of attention in the literature (Hunston and Thompson, 2000; Biber, 2006; Hunston, 2011; Fuoli, 2015), and many studies of speakers' expression of feelings have been carried out in the fields of sentiment analysis and opinion mining with pre-defined or automatically detected categories related to sentiments and opinions. At their most basic level, such analyses use categories of positive, negative or (sometimes) neutral sentiment (Täckström and McDonald, 2011; Feldman, 2013), while other types of analyses use more fine-grained categories of sentiments or attitudes, such as happiness, anger and surprise (Schulz et al., 2013). There are, however, additional aspects or types of stance taking, e.g., contrasting of different opinions (Socher et al., 2013), indications of the degree of likelihood of a conveyed message (Biber, 2006) or expression of conditional statements (Narayanan et al., 2009). Detecting such aspects is an integral part of a high-quality sentiment analysis system, as they modify the opinions expressed. In this study, the automatic detection of three such stance categories is investigated: (1) Speculation: "the possible existence of a thing [that] is claimed - neither its existence nor its non-existence is known for sure" (Vincze, 2010, p. 28).
There are previous studies on automatic detection of speculation and related stance categories. Results are, however, reported for models trained on large annotated corpora, which are expensive to obtain (Uzuner et al., 2011; Cruz et al., 2015). Here, lexicon-based methods, as well as machine learning models trained on a smaller amount of training data, are instead evaluated for the task of detecting speculation, contrast and conditional. The categories are specifically compared with regard to the following research questions: (a) Is machine learning or lexicon matching the more suitable method for detecting these three stance categories? (b) How does the number of training samples used affect the performance of trained machine learning models?
Some systems for automatic detection of speculation are modelled as text classification problems, often using support vector classifiers (SVCs) trained on word n-grams (Uzuner et al., 2011; Wei et al., 2013). Others are modelled as named entity recognition systems and use structured prediction for detecting text chunks that function as cues for speculation (Tang et al., 2010; Clark et al., 2011).
The SFU Review corpus, which consists of English consumer-generated reviews of books, movies, music, cars, computers, cookware and hotels (Taboada and Grieve, 2004), is often used for sentiment analysis. This corpus has been annotated for speculation by Konstantinova et al. (2012), according to a modification of guidelines created by Vincze et al. (2008), in which cues for speculation and negation, and their scope, were annotated. Inter-annotator agreement was measured on 10% of the corpus, resulting in an F-score and a Kappa score of 89 for the agreement on speculation cues. The same corpus has also been annotated by Taboada and Hay (2008) for Rhetorical Structure Theory categories (Taboada and Mann, 2006, pp. 426-427). A total of 36 different categories were annotated, including condition, contrast and concession. In contrast to the annotations by Konstantinova et al., these annotations were not checked for reliability. Cruz et al. (2015) trained an SVC to detect the speculation cues annotated by Konstantinova et al., and achieved an F-score of 92. Their lexicon matching approach, which was built on a list of the four most frequent speculation cues, achieved a lower F-score of 70. The SVC was clearly successful, as results slightly better than the inter-annotator agreement were achieved. Since the results were achieved by 10-fold cross-validation on the entire set of annotated data, they were, however, also expensive in terms of annotation effort. The present study, therefore, explores whether similar results can be achieved with fewer training samples. In addition, lexicon matching is explored further here, as it was performed with a very limited lexicon by Cruz et al. (2015).

Methods
A lexicon-based and a machine learning-based approach for detecting the three stance categories were compared. The SFU Review corpus annotations by Konstantinova et al. (2012) and by Taboada and Hay (2008) were used for all experiments. These annotations were performed independently and at different times, with Konstantinova et al. segmenting the corpus into sentences, while Taboada and Hay used segments, which are often shorter. The two segmentation styles were reconciled by using the sentence boundaries of the Konstantinova et al. corpus, except when the corresponding segment in the Taboada and Hay corpus was longer than this sentence boundary. In such cases, the segment annotated by Taboada and Hay was used as the sentence boundary. The speculation category in the Konstantinova et al. corpus was used for investigating speculation, and the condition category in the Taboada and Hay corpus for investigating the category conditional. Although these categories were somewhat overlapping, since condition was included in speculation, the categories were employed as defined and annotated in the previous studies. Since the two related categories contrast and concession are often conflated by annotators (Taboada and Mann, 2006), annotations of these categories in the Taboada and Hay corpus were combined, forming the merged category contrast. The speculation classification format previously used in the first of the CoNLL-2010 shared tasks (Farkas et al., 2010) and by Wei et al. (2013) was applied, that is, an entire sentence was classified as either belonging to a stance category or not.
The procedure used in CoNLL-2010 for transforming the data into this format was adopted, i.e., if either the scope of a speculation cue or a segment annotated for concession/contrast or condition was present in a sentence, the sentence was categorised as belonging to this category (or categories, when several applied). The sentence list was randomly split into two halves, used as training and evaluation data (Table 1).
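The sentence-level labelling and the random split described above can be sketched as follows. This is a minimal illustration and not the code used in the study: the function names, the input representation (a list of annotation category names per sentence) and the fixed random seed are assumptions made for the example.

```python
import random

# Mapping from source annotation categories to the three stance categories.
# Contrast and concession are merged; condition is used for conditional.
CATEGORY_MAP = {
    "speculation": "speculation",
    "contrast": "contrast",
    "concession": "contrast",
    "condition": "conditional",
}
STANCE_CATEGORIES = ("speculation", "contrast", "conditional")


def label_sentence(annotated_categories):
    """Return one binary label per stance category for a single sentence.

    `annotated_categories` is a hypothetical input: the names of all
    annotation categories whose cue scope or segment occurs in the sentence.
    """
    present = {CATEGORY_MAP[c] for c in annotated_categories if c in CATEGORY_MAP}
    return {category: category in present for category in STANCE_CATEGORIES}


def split_in_halves(sentences, seed=0):
    """Shuffle the sentences and split them into two halves."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]
```

A sentence can receive several positive labels at once, which matches the statement that a sentence may belong to more than one category.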

Machine learning-based approach (SVC)
A support vector classifier model, the LinearSVC included in scikit-learn (Pedregosa et al., 2011), was trained with bag-of-words and bag-of-bigrams as features. A χ²-based feature selection was carried out to select the n best features. Suitable values of n and the support vector machine penalty parameter C were determined by 10-fold cross-validation on the training data. The training and feature selection was carried out for different sizes of the training data, starting with 500 training samples and increasing the sample size stepwise by an additional 500, up to 5,000 samples. A separate classifier was always trained for each of the three categories, and the categories were evaluated separately.
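To make the feature extraction and selection step concrete, the sketch below builds binary unigram and bigram features and ranks them with a χ² statistic computed from a 2x2 contingency table (feature present/absent versus sentence positive/negative). It is a pure-Python illustration under stated assumptions: whitespace tokenisation, binary features, the hyphenated bigram notation and the textbook 2x2 form of the statistic are choices made here, and the statistic may differ in detail from the one computed by the library used in the study. In scikit-learn, the corresponding building blocks would be `CountVectorizer(ngram_range=(1, 2))`, `SelectKBest(chi2, k=n)` and `LinearSVC(C=...)`.

```python
from collections import Counter


def ngram_features(sentence):
    """Return the set of unigrams and bigrams in a whitespace-tokenised sentence."""
    tokens = sentence.lower().split()
    bigrams = [f"{first}-{second}" for first, second in zip(tokens, tokens[1:])]
    return set(tokens) | set(bigrams)


def chi2_scores(sentences, labels):
    """Score each feature with the chi-square statistic of its 2x2 table."""
    n = len(sentences)
    n_pos = sum(1 for y in labels if y)
    feature_sets = [ngram_features(s) for s in sentences]
    count_all = Counter(f for fs in feature_sets for f in fs)
    count_pos = Counter(f for fs, y in zip(feature_sets, labels) if y for f in fs)
    scores = {}
    for feature, total in count_all.items():
        a = count_pos[feature]      # feature present, sentence positive
        b = total - a               # feature present, sentence negative
        c = n_pos - a               # feature absent, sentence positive
        d = (n - n_pos) - b         # feature absent, sentence negative
        denominator = (a + b) * (c + d) * (a + c) * (b + d)
        scores[feature] = n * (a * d - b * c) ** 2 / denominator if denominator else 0.0
    return scores


def select_n_best(scores, n_best):
    """Return the n_best highest-scoring features (ties broken alphabetically)."""
    return sorted(scores, key=lambda f: (-scores[f], f))[:n_best]


# Toy data, invented for illustration only.
toy_sentences = ["it might work", "it works", "this might be good", "this is good"]
toy_labels = [1, 0, 1, 0]
toy_scores = chi2_scores(toy_sentences, toy_labels)
best_feature = select_n_best(toy_scores, 1)
```

In the toy data, the unigram "might" occurs in exactly the positive sentences and therefore receives the highest score, whereas "it" is evenly distributed over both classes and scores zero.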

Lexicon-based approach (Lexicon)
The lexicon-based approach used three lists of marker words/constructions, one list for each category of interest. Sentences containing constructions signalling any of the three categories were classified as belonging to that category. The lists were created by first gathering seed markers: for speculation from constructions listed by Konstantinova et al. (2012) and from a previous resource collected with the aim of detecting speculations in clinical texts (Velupillai et al., 2014), and for contrast from constructions listed by Reese et al. (2007). These seeds were then expanded with neighbours in a distributional semantics space (Gavagai, 2015) and from a traditional synonym lexicon (Oxford University Press, 2013). Finally, the expanded lists of candidates for speculation and contrast markers were manually filtered according to the suitability of included constructions as stance markers. From the list created for speculation, a subset of markers signalling conditional was selected to create the list for this category. The final lists contained 191 markers for speculation, 39 for contrast and 26 for conditional.
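The matching step itself is simple and can be sketched as below. The marker lists in the example are small toy lists invented for illustration, not the 191, 39 and 26 markers used in the study, and whole-word, case-insensitive matching is an assumption about how a marker is considered to occur in a sentence.

```python
import re


def compile_markers(markers):
    """Compile a list of marker words/constructions into one regular expression.

    Longer markers are tried first, and word boundaries prevent a marker
    such as "but" from matching inside a longer word such as "butter".
    """
    ordered = sorted(markers, key=len, reverse=True)
    alternatives = "|".join(re.escape(marker) for marker in ordered)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


def classify(sentence, lexicons):
    """Return True for every category with at least one marker in the sentence."""
    return {category: bool(pattern.search(sentence))
            for category, pattern in lexicons.items()}


# Toy marker lists (hypothetical, for illustration only).
TOY_LEXICONS = {
    "speculation": compile_markers(["might", "perhaps", "seems to"]),
    "contrast": compile_markers(["but", "although"]),
    "conditional": compile_markers(["if"]),
}

example = classify("If it rains, we might stay home.", TOY_LEXICONS)
```

Because each category is matched independently, a sentence can be assigned to several categories, as in the example sentence, which contains both a conditional and a speculation marker.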

Results
Results on the evaluation set for the two approaches (lexicon-matching and the SVC when using all training data) are shown in Table 2. Features selected when obtaining these SVC results are shown in a font size corresponding to their model weight in Figures 1 and 2, and markers found in the evaluation data when using the lexicon-based approach are shown in Figure 3.
Different training data sizes were evaluated with bootstrap resampling (Kaplan, 1999). For each data size, 50 different models were trained, each time with a new random sample from the pool of training data. Figure 4 displays all results.
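The resampling procedure can be sketched as a loop over data sizes and repetitions. The sketch makes two assumptions that are not stated in the text: samples are drawn with replacement, and the spread of the 50 scores is summarised by mean and standard deviation. The `train_and_score` argument is a hypothetical callable that trains a model on the given sample and returns its F-score on the evaluation data.

```python
import random
import statistics


def learning_curve(pool, sizes, train_and_score, n_repeats=50, seed=0):
    """Evaluate each training data size with repeated random samples.

    Returns a dict mapping each size to (mean score, standard deviation).
    """
    rng = random.Random(seed)
    results = {}
    for size in sizes:
        scores = []
        for _ in range(n_repeats):
            # Assumption: bootstrap sample drawn with replacement from the pool.
            sample = rng.choices(pool, k=size)
            scores.append(train_and_score(sample))
        results[size] = (statistics.mean(scores), statistics.stdev(scores))
    return results


# Data sizes used in the study: 500, 1,000, ..., 5,000 training samples.
DATA_SIZES = list(range(500, 5001, 500))
```

A large standard deviation for a given size would correspond to the unstable results reported for contrast and conditional, while a small one would correspond to the stable results for speculation.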

Discussion
Both approaches were clearly more successful for detecting speculation than for detecting contrast and conditional. When using the entire training data set, the SVC results for speculation were slightly higher than the human ceiling (an SVC F-score of 92, compared to an inter-annotator agreement of 89). The F-scores for contrast and conditional were, however, considerably lower (approximately 30 points lower and 20 points lower than speculation, respectively). The SVC results for the two latter categories also remain unstable for larger training data samples, but stabilise for speculation (Figure 4). The higher F-score for speculation than for contrast and conditional, as well as its higher stability, might be explained by this category being more frequent than the other two. However, there seems to be a much greater variety in the way in which speculation is expressed, as shown by the number of SVC features selected for this category and the number of markers that lead to true positives in the lexical approach, compared to what was the case for the other two categories. Lower recall was also achieved for the lexical approach for detecting speculation, despite the many stance markers used for this category. Therefore, it would seem reasonable to hypothesise that, while many training samples would be required for speculation, a smaller number of samples should be enough for the other categories. Language is, however, highly contextually adaptable, allowing the same construction to express different phenomena (Paradis, 2005; Paradis, 2015), and frequent English markers for contrast and conditional seem to be polysemous to a larger extent than speculation markers. For example, 'while' sometimes expresses contrast, although it more often has a temporal meaning (Reese et al., 2007), which results in 30 true positives and 70 false positives when it is used as a marker for contrast in the lexicon-matching approach.
Similarly, 'if' is, by far, the most frequently used marker for expressing conditional, as previously observed by Narayanan et al. (2009), and as shown here in the lexical approach, in which 98% of the true positives contained this marker. However, 'if' is also used to indicate indirect questions and as a more informal version of 'whether' (Oxford University Press, 2013), which has the potential to give rise to false positives. In the scheme used by Konstantinova et al., on the other hand, most readings of 'if' were covered by their broad definition of speculation.
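The cost of a polysemous marker can be made concrete with the counts reported for 'while' (30 true positives and 70 false positives). The sketch below computes the precision of this single marker and, as a purely hypothetical upper bound, the F-score it could reach if its recall were perfect. The balanced F-score (harmonic mean of precision and recall) is assumed, and values are on a 0-1 scale rather than the 0-100 scale used in the text.

```python
def precision(true_positives, false_positives):
    """Fraction of matched sentences that actually belong to the category."""
    return true_positives / (true_positives + false_positives)


def f_score(precision_value, recall_value):
    """Balanced F-score: harmonic mean of precision and recall."""
    total = precision_value + recall_value
    return 2 * precision_value * recall_value / total if total else 0.0


# Counts reported for 'while' as a contrast marker.
while_precision = precision(true_positives=30, false_positives=70)

# Hypothetical upper bound: recall of 1.0 is assumed, not reported.
while_f_upper_bound = f_score(while_precision, 1.0)
```

With a precision of 0.30, the marker alone could not exceed an F-score of roughly 0.46 even under the assumed perfect recall, which illustrates why curation of such markers matters for the lexicon-based approach.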
In addition, it cannot be disregarded that annotations from two different sources were used for the experiment, and that part of the differences in performance, therefore, might be attributed to differences in annotation quality. For the Konstantinova et al. corpus, there is a reliability estimate, which does not exist for the Taboada and Hay corpus. The Taboada and Hay annotation scheme might also be more difficult, as it included 36 annotation categories, and thus more error-prone.
Comparing the SVC approach and the lexicon matching, it can be concluded that the only case in which machine learning clearly outperforms lexicon matching is when the SVC for detecting speculation is trained on at least 1,500-2,000 training samples. For the categories contrast and conditional, on the other hand, it can be observed that (1) the machine learning results are unstable, and (2) only very few features, and only positive ones, are used by the models. One point of applying machine learning for text classification is to be able to create models that are complex enough to overcome weaknesses of a lexicon-matching approach, e.g., weaknesses arising from the use of polysemous expressions. Despite being trained on more than 5,000 training samples, only a few features were, however, selected as relevant for contrast and conditional. Therefore, for automatic detection, it might be more resource-efficient to focus the effort on further curation of the lexicons used, rather than on annotation of training data. The complexity of the model for speculation seems, however, to exceed what could easily be captured with lexicon matching, since more features, including negative ones, were used. This further motivates the suitability of machine learning for the task of detecting speculation.
In future work, inclusion of additional features for training models for stance detection will be attempted (e.g., syntactic features or distributional features), and the usefulness of applying the detection on extrinsic tasks, such as sentiment analysis (Narayanan et al., 2009), will be further evaluated.

Conclusion
For detecting sentences with speculation, an SVC trained on bag-of-words/bigrams performed around 10 points better than a lexicon matching approach. When using between 3,000 and 5,000 training instances, the model performance was stable at an approximate F-score of 90, which is just above the inter-annotator agreement F-score. For detecting conditional sentences and sentences including contrast, however, the results were lower (an F-score of around 60 for contrast and around 70 for conditional). On average, the F-score for the machine learning models for these two categories was a few points better than for the lexicon-based methods, but these better results were achieved by models that only used eight features (which were all positive). This, together with the fact that the machine learning models showed a large variance, indicates that a lexicon-based approach, with a well-curated lexicon, is more suitable for detecting contrast and conditional.