A Study of Suggestions in Opinionated Texts and their Automatic Detection



Introduction
Online text is becoming an increasingly popular source for acquiring public opinions towards entities like persons, products, services, brands, events, etc. The area of opinion mining focuses on exploiting this abundance of opinions, by mainly performing sentiment based summarisation of text into positive, negative, and neutral categories, using sentiment analysis methods. In addition to the online reviews and blogs, people are increasingly resorting to social networks like Twitter, Facebook etc. to instantly express their sentiments and opinions about the products and services they might be experiencing at a given time.
On a closer look, it is noticeable that opinionated text also contains information other than sentiments. This is evident from the large portions of text labelled as neutral, objective, or non-relevant in state-of-the-art sentiment analysis datasets. One such information type is suggestions. Table 1 shows instances of suggestions in sentiment analysis datasets which were built on online reviews. These suggestions may or may not carry positive or negative sentiments towards the reviewed entity. In the recent past, suggestions have gained the attention of the research community, mainly in industrial research, which led to studies focussing on suggestion detection in reviews (Ramanand et al., 2010; Brun and Hagege, 2013).
The setting up of dedicated suggestion collection forums by brand owners shows the importance of suggestions for the stakeholders. Therefore, it would be useful if suggestions could be automatically extracted from the large amount of already available opinions. For entities where suggestion collection platforms are already available and active, suggestion mining can be used to summarise posts. People often provide context in such posts, which becomes repetitive when the number of posts is large; suggestion mining methods can extract the exact sentence in the post where a suggestion is expressed.
This task has so far been presented as a binary classification of sentences, where the available opinionated text about a certain entity is split into sentences and these sentences are then classified as suggestions or non-suggestions. The previous studies were carried out in a limited scope, mainly for specific domains like reviews, focusing on one use case at a time. The path to the leaf nodes in Figure 1 summarises the scope of suggestion mining studies so far. These studies developed datasets for individual tasks and domains, and trained and evaluated classifier models on the same datasets.
We analyse manually labelled datasets from different domains, including the existing datasets and the datasets prepared by us. The ratio of suggestion to non-suggestion sentences varies across domains, and the datasets from some domains are too sparse for training statistical classifiers. We also introduce two datasets which are relatively richer in suggestions. Table 1 shows that suggestions have a similar linguistic nature across these datasets, which calls for domain-independent approaches. Therefore, as a deviation from previous studies, this work investigates the generalisation of the problem of suggestion detection, i.e. the detection of all suggestions under the root node in Figure 1.
In this work, we compare different methods of suggestion mining using all available datasets. These include manually crafted rules, Support Vector Machines (SVM) with proposed linguistic features, Long Short Term Memory (LSTM) Neural Networks, and Convolutional Neural Networks (CNN). We also compare the results from these approaches with the previous works whose datasets are available, and perform cross-domain train-test experiments. With most of the datasets, Neural Networks (NNs) outperform SVM with the proposed features. However, the overall results for out-of-domain training remain low. We also compare two different types of word embeddings to be used with the NNs for this task.

Problem Definition and Scope
As stated previously, the task of suggestion detection has been framed as binary classification of sentences into suggestion (positive class) and non-suggestion (negative class). We previously provided a fine-grained problem definition (Negi and Buitelaar, 2015) in order to prepare benchmark datasets and ensure consistency in future task definitions. We identified three parameters which define a suggestion in the context of opinion mining: the receiver of the suggestion, the textual unit of the suggestion, and the type of suggestion in terms of its explicit or implicit nature.
While the unit of suggestion remains the sentence in this work, and the type remains explicit expression, we aim to evaluate different classifier models for the detection of any suggestion in any opinionated text. The motivation lies in our observation that explicitly expressed suggestions appear in similar linguistic forms irrespective of domain, target entity, and intended receiver (Table 1). Furthermore, datasets used by the previous studies indicate that aiming at the detection of specific suggestions restricts the annotations to suggestions of a specific type, which in turn aggravates the class imbalance problem in the datasets (Table 2). It also renders these datasets unsuitable for a generic suggestion detection task, since the negative instances may also include suggestions, though not of the desired type.

Related Work
In recent years, experiments have been performed to automatically detect sentences which contain suggestions. The targeted suggestions were mainly those which suggest improvements in a commercial entity. Online reviews therefore remain the main focus; however, a limited number of works focus on other domains too.
Suggestions for product improvement: Studies like Ramanand et al. (2010) and Brun et al. (2013) employed manually crafted linguistic rules to identify suggestions for product improvement. The evaluation was performed on a small dataset (∼60 reviews). Dong et al. (2013) used certain hashtags and mined frequently appearing word-based patterns from a separate dataset of suggestions about Microsoft phones.
Suggestions for fellow customers: In one of our previous works (Negi and Buitelaar, 2015), we focussed on the detection of those suggestions in reviews which are meant for fellow customers. An example of such a suggestion in a hotel review is: "If you do end up here, be sure to specify a room at the back of the hotel." We used an SVM classifier with a set of linguistically motivated features. We also stressed the highly subjective nature of the suggestion labelling task, and therefore studied a formal definition of suggestions in the context of suggestion mining. We also formulated annotation guidelines and prepared a dataset accordingly.
Advice mining from discussion threads: Wicaksono et al. (2013) detected advice-containing sentences from travel-related discussion threads. They compared sequential classifiers based on Hidden Markov Models (HMM) and Conditional Random Fields (CRF), considering each thread as a sequence of sentences labelled as advice and non-advice. They also used some features which were dependent on the position of a sentence in its thread. This approach was therefore specific to the domain of discussion threads. Their annotations seem to consider implicit expressions of advice as advice.
Text classification using deep learning: NNs have recently been used effectively for text classification tasks like sentiment classification and semantic categorisation. LSTM (Graves, 2012) and CNN (Kim, 2014a) are the two most popular neural network architectures in this regard.
Tweet classification using deep learning: To the best of our knowledge, deep learning has only been employed for sentiment-based classification of tweets. CNN (Severyn and Moschitti, 2015) and LSTM (Wang et al., 2015) have demonstrated good performance in this regard.

Datasets
The required datasets for this task are a set of sentences obtained from opinionated texts, which are labelled as suggestion and non-suggestion, where suggestions are explicitly expressed.
Existing Datasets: Datasets from most of the previous studies on suggestions for product improvement are unavailable due to their industrial ownership. The currently available datasets are:
1) Twitter dataset about Windows phone: This dataset comprises tweets which are addressed to Microsoft. The tweets which express suggestions for product improvement are labelled as suggestions (Dong et al., 2013). Due to the short nature of tweets, suggestion detection is performed at the tweet level rather than the sentence level. The authors indicated that they labeled the explicit expressions of suggestions in the dataset.
2) Electronics and hotel reviews: A review dataset where only those sentences which convey suggestions to fellow customers are considered as suggestions (Negi and Buitelaar, 2015).
3) Travel advice dataset: Obtained from travel-related discussion forums. All the advice-containing sentences are tagged as advice (Wicaksono and Myaeng, 2013). One problem with this dataset is that statements of facts (implicit suggestions) are also tagged as advice, for example: "The temperature may reach up to 40 degrees in summer."
Introduced Datasets: In this work, we identify additional sources for suggestion datasets, and prepare labelled datasets with a larger number of explicitly expressed suggestions.
1) Suggestion forum: Posts from a customer support platform which also hosts dedicated suggestion forums for products. Though most of the forums for commercial products are closed access, we discovered two forums which are openly accessible: Feedly mobile app and Windows app studio. We collected samples of posts for these two products. Posts were then split into sentences using the sentence splitter from the Stanford CoreNLP toolkit. Two annotators were asked to label 1000 sentences, on which an inter-annotator agreement (kappa) of 0.81 was obtained. The rest of the dataset was annotated by only one annotator. Due to the annotation costs, we limited the size of the data sample; however, this dataset is easily extendable due to the availability of a much larger number of posts on these forums.
2) Tweet dataset: We also prepared a tweet dataset where tweets are a mixture of random topics, and not specific to any given entity or topic. These tweets were collected using the hashtags suggestion, advice, recommendation, and warning, which increased the chance of appearance of suggestions in this dataset. Due to the noisy nature of tweets, two annotators performed annotation on all the tweets. The inter-annotator agreement was calculated as 0.72. Only those tweets were retained for which the annotators agreed on the label.
3) Re-tagged travel advice dataset: We also re-tagged the travel advice dataset from Wicaksono et al. (2013), retaining as suggestions only those which were explicitly expressed.
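The agreement figures and the agreement-based filtering of tweets described above can be illustrated with a short sketch. It assumes Cohen's kappa for two annotators (the text only says "kappa"), and the function names `cohen_kappa` and `keep_agreed` are hypothetical, introduced here for illustration only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if p_e == 1.0:  # both annotators used a single, identical label
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def keep_agreed(items, labels_a, labels_b):
    """Retain only the items on which both annotators agree."""
    return [(x, a) for x, a, b in zip(items, labels_a, labels_b) if a == b]
```

Here `keep_agreed` mirrors the step in which only tweets with matching labels are retained; it says nothing about how disagreements were otherwise handled.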

Automatic Detection of Suggestions
Some of the conventional text classification approaches have been previously studied for this task, primarily rules and SVM classifiers. Each approach was only evaluated on the datasets prepared within the individual works. We employ these two approaches on all the available datasets, for all kinds of suggestion detection tasks. We then study the applicability of LSTM and CNN to this kind of text classification task. We evaluate all the statistical classifiers in both domain-dependent and domain-independent training. The results demonstrate that deep learning methods have an advantage over the conventional approaches for this task.

Rule based classification
This approach uses a set of manually formulated rules aggregated from previous rule-based experiments (Ramanand et al., 2010; Goldberg et al., 2009). The rules provided by Brun et al. (2013) are excluded because they depend on in-house (publicly unavailable) components. Only those rules which do not depend on any domain-specific vocabulary have been used. A given text is labeled as a suggestion if at least one of the rules is true. Part-of-speech tagging and parsing are performed using the Stanford parser. Table 3 shows the results of rule-based classification for the positive class, i.e. the suggestion class. With the available datasets, detection of negative instances is always significantly better than that of positive ones, due to class imbalance.
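The "at least one rule fires" decision can be sketched as follows. The patterns below are hypothetical surface-level examples chosen for illustration; they are not the rules of the cited works, which also rely on part-of-speech tags and parses rather than plain regular expressions.

```python
import re

# Hypothetical, illustrative patterns; NOT the rules from the cited works.
RULES = [
    re.compile(r"\b(should|ought to|needs? to|must)\b", re.IGNORECASE),
    re.compile(r"\b(i|we) (suggest|recommend|advise)\b", re.IGNORECASE),
    re.compile(r"\b(would be (nice|great|better) if|it would help if)\b", re.IGNORECASE),
    re.compile(r"\b(please|be sure to|make sure)\b", re.IGNORECASE),
]


def is_suggestion(text):
    """Label a text as a suggestion if at least one rule matches."""
    return any(rule.search(text) for rule in RULES)
```

Because a single match is enough, adding rules can only raise recall and tends to lower precision, which is consistent with the high-recall, low-precision behaviour usually expected of such classifiers.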

Statistical classifiers
SVM was used in almost all the related work either as a proposed classifier with some feature engineering, or for comparison with other classifiers.
Support Vector Machines: SVM classifiers are popularly used for text classification in the research community. We evaluate an SVM classifier with the standard n-gram features (unigrams and bigrams) and the features proposed in our previous work (Negi and Buitelaar, 2015). These features are sequential POS patterns for imperative mood, a sentence sentiment score obtained using SentiWordNet, and information about the nsubj dependency present in the sentence. We use the LibSVM implementation with the parameters specified previously in Negi and Buitelaar (2015).
No oversampling is used; instead, class weighting is applied, with the class weight ratio depending on the distribution of the negative and positive classes in the training dataset.
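One common way to derive class weights from the training distribution is inverse-frequency weighting, sketched below. The exact formula used in the experiments is not stated in the text, so this scheme and the function name `class_weights` are assumptions made for illustration.

```python
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: rarer classes get proportionally larger weights.

    Assumed scheme (not confirmed by the text): weight_c = n / (k * count_c),
    where n is the number of training examples and k the number of classes.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}
```

With 10 positive and 90 negative training sentences, the positive class receives a weight nine times that of the negative class. In LibSVM, per-class weights of this kind are passed to the trainer through its `-wi` options, which scale the penalty parameter C for class i.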

Deep Learning based classifiers:
Recent findings about the impressive performance of deep learning based models on some natural language processing tasks call for similar experiments in suggestion mining. We therefore present the first set of deep learning based experiments for this task. We experiment with two kinds of neural network architectures: LSTM and CNN. LSTM effectively captures sequential information in text, while retaining long-term dependencies. In a standard LSTM model for text classification, text can be fed to the input layer as a sequence of words, one word at a time. Figure 2 shows the architecture of LSTM neural networks for binary text classification. On the other hand, CNN is known to effectively capture local correlations of spatial or temporal structures; a general intuition is therefore that CNN might capture good n-gram features at different positions in a sentence.
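The intuition that a CNN picks up n-gram features regardless of their position can be made concrete with a minimal sketch of a convolution over word windows followed by max-over-time pooling. The ReLU activation, the default window width of 2, and the function name `conv_max_pool` are assumptions for illustration; this is not the exact network used in the experiments.

```python
import math


def conv_max_pool(sentence_vectors, filters, biases, width=2):
    """Slide each filter over every window of `width` consecutive word vectors
    and keep the largest activation per filter (max-over-time pooling)."""
    if len(sentence_vectors) < width:
        raise ValueError("sentence is shorter than the filter width")
    pooled = []
    for weights, bias in zip(filters, biases):
        best = -math.inf
        for i in range(len(sentence_vectors) - width + 1):
            # Concatenate the word vectors of the window into one flat vector.
            window = [x for vec in sentence_vectors[i:i + width] for x in vec]
            # Filter response with an assumed ReLU non-linearity.
            activation = max(0.0, sum(w * x for w, x in zip(weights, window)) + bias)
            best = max(best, activation)
        pooled.append(best)
    return pooled
```

Each filter yields a single number for the whole sentence, so the pooled vector has one entry per filter and does not depend on where in the sentence the best-matching window occurs.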

Features
Features for SVM: The feature evaluation of (Negi and Buitelaar, 2015) indicated that POS tags, certain keywords (lexical clues), POS patterns for imperative mood, and certain dependency information about the subject, can be useful features for the detection of suggestions.
Figure 2: Architecture for using LSTM as a binary text classifier
In the previous works, the feature types were manually determined. We now aim to eliminate the need for manual determination of feature types. A recently popular approach to this is to use neural networks with feature vectors based on word embeddings (Bengio et al., 2003), instead of classic count-based feature vectors.

Word embeddings for Neural Networks:
In simpler terms, word embeddings are automatically learnt vector representations for lexical units. Baroni et al. (2014) compared the word embeddings obtained through different methods by using them for different semantic tasks. Based on those comparisons, we use the pre-trained COMPOSES 7 embeddings developed by Baroni et al. (2014). These word vectors are of size 400. For experiments on Twitter datasets, we used GloVe (Pennington et al., 2014) word embeddings learnt on Twitter data 8, which have 200 dimensions. We additionally experiment with dependency-based word embeddings (Deps) 9 (Levy and Goldberg, 2014). These embeddings determine the context of a word on the basis of linguistic dependencies, instead of the window-based context used by COMPOSES. Therefore, Deps tends to perform better than COMPOSES in determining the functional similarity between words.
Additional feature for NNs: For neural network based classifiers, we also experimented with POS tags as an additional feature alongside the pre-trained word embeddings. This tends to decrease the precision and increase the recall, but results in an overall decrease of F-1 score in most of the runs. Therefore, we do not report the results of these experiments.
7 Best predict vectors on http://clic.cimec.unitn.it/composes/semantic-vectors.html
8 http://nlp.stanford.edu/projects/glove/
9 Dependency-Based on https://levyomer.wordpress.com/2014/04/25/dependencybased-word-embeddings/

Configurations
NN Configuration: Considering the class imbalance in the datasets, we employ oversampling of the minority class (positive) to adjust the class distribution of the training data. During cross validation, oversampling is applied to the training data of each fold separately, after the data has been split into folds.
LSTM: For LSTM-based classification, we use 2 hidden layers of 100 and 50 neurons respectively, and 1 softmax output layer. We also utilize L2 regularization to counter overfitting. For LSTMs, we use the softsign activation function.
CNN: We used a filter window of 2 with 40 feature maps in CNN, thus giving 40 bigram-based filters (Kim, 2014b). A subsampling layer with max pooling is used.
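Oversampling only the training portion of each fold keeps duplicated positive examples out of the held-out data. A minimal sketch of this procedure follows; balancing the classes to equal size, the random fold assignment, and all function names are assumptions for illustration, since the text does not specify the target class ratio.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def oversample_minority(examples, labels, minority=1, seed=0):
    """Duplicate random minority examples until both classes are equally large
    (equal balance is an assumed target, not stated in the text)."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == minority]
    neg = [i for i, y in enumerate(labels) if y != minority]
    if not pos or len(pos) >= len(neg):
        return list(examples), list(labels)
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    keep = list(range(len(labels))) + extra
    return [examples[i] for i in keep], [labels[i] for i in keep]


def folds_with_oversampling(examples, labels, k=5, seed=0):
    """Yield (train_x, train_y, test_x, test_y); only the training part is oversampled."""
    folds = k_fold_indices(len(examples), k, seed)
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        train_x, train_y = oversample_minority(
            [examples[i] for i in train_idx], [labels[i] for i in train_idx], seed=seed
        )
        yield train_x, train_y, [examples[i] for i in test_idx], [labels[i] for i in test_idx]
```

Splitting first and oversampling afterwards matters: if duplicates were created before the split, copies of the same positive sentence could land in both training and test data and inflate the scores.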

In-Domain and Cross-Domain Evaluation:
In the case of statistical classifiers, we perform the experiments in two sets. The first set of experiments (Tables 4 and 6) evaluates a classifier (and feature types) for the cases where labeled data is available for a specific domain, entity, or receiver-specific suggestions. In this case, evaluation is performed using 10-fold cross validation with SVM and 5-fold cross validation with NN classifiers. The second set of experiments evaluates the classifiers (and feature types) for a generic suggestion detection task, where the model can be trained on any of the available datasets. These experiments evaluate the classifier algorithms as well as the training datasets. In the case of Twitter, training is performed on the Twitter dataset, while evaluation for this cross-domain setting is performed on the Microsoft tweet dataset.
Table 5: Comparison of the performance of SVM (Negi and Buitelaar, 2015), LSTM and CNN with the best results reported in two of the related works whose datasets are available. 5-fold cross validation was used. The related works used different kinds of F1 scores.
Table 6: F-1 score for the suggestion class, using COMPOSES and Deps embeddings with LSTM and CNN. 5-fold cross validation.
Pre-processing: We also compared experiments on tweets with and without pre-processing. The pre-processing involved removing URLs and hashtags, and normalising repeated punctuation. Pre-processing tended to decrease the performance in all the experiments. Therefore, none of the experiments reported by us use pre-processing on tweets.
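The three pre-processing steps named above could be implemented roughly as follows. The regular expressions are assumptions: the text does not say whether a hashtag is removed entirely or only stripped of its `#`, nor which punctuation marks were normalised; this sketch removes the whole hashtag token and collapses repeated `!`, `?`, `.` and `,`.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")
PUNCT_REPEAT_RE = re.compile(r"([!?.,])\1+")
SPACE_RE = re.compile(r"\s+")


def preprocess_tweet(text):
    """Remove URLs and hashtags, then collapse repeated punctuation."""
    text = URL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = PUNCT_REPEAT_RE.sub(r"\1", text)  # "!!!" becomes "!"
    # Tidy up the whitespace left behind by the removals.
    return SPACE_RE.sub(" ", text).strip()
```

One plausible reason such cleaning can hurt is that hashtags and emphatic punctuation carry signal for this task, although the text does not analyse the cause.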

Results and Discussions
Tables 4 and 7 show the Precision, Recall and F-1 score for the suggestion class (positive class). In general, the rule-based classifier shows a higher recall but very low precision, leading to very low F-1 scores compared to the statistical classifiers, among which LSTM emerges as the winner in the majority of the runs. Below we summarise different observations from the results.
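For reference, the three metrics for the positive class follow the standard definitions and can be computed as in the sketch below. The function name `prf_positive` and the convention of returning 0.0 when a denominator is zero are choices made for this illustration.

```python
def prf_positive(gold, predicted, positive=1):
    """Precision, recall and F-1 for the positive (suggestion) class only."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

A classifier that over-predicts the suggestion class, as a permissive rule set does, gains recall at the cost of false positives, and the harmonic mean in F-1 penalises the resulting low precision.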

Conclusion and Future Work
In this work, we presented an insight into the problem of suggestion detection, which extracts different kinds of suggestions from opinionated text. We point to new sources of suggestion-rich datasets, and provide two additional datasets which contain a larger number of suggestions than the previous datasets. We compare various approaches for suggestion detection, including the ones used in the previous works, as well as deep learning approaches for sentence classification which have not yet been applied to this problem. Since suggestions tend to exhibit a similar linguistic nature irrespective of topic and intended receiver, there is scope for learning domain-independent models for this task. Therefore, we apply the discussed approaches in both a domain-dependent and a domain-independent setting, in order to evaluate the domain independence of the proposed models. Neural networks in general performed better in both in-domain and cross-domain evaluation. The initial results for domain-independent training are poor. In light of the findings from this work, domain transfer approaches would be an interesting direction for future work on this problem.
The results also point out the challenges and complexity of the task. Preparing datasets where suggestions are labeled at a phrase or clause level might reduce the complexities arising due to long sentences.