AKTSKI at SemEval-2016 Task 5: Aspect Based Sentiment Analysis for Consumer Reviews

This paper describes the polarity classification system designed for participation in SemEval-2016 Task 5 - ABSA. The aim is to determine the sentiment polarity expressed towards a certain aspect within a consumer review. Our system is based on supervised learning using a Support Vector Machine (SVM). We use standard features for the basic classification model. On top of this, we include rules to check the precedent polarity sequence. This approach is experimental.


Introduction
In today's consumer-focused markets, understanding the opinions expressed on online platforms and review portals is essential for businesses. Statistical or machine learning methods and Natural Language Processing are now widely applied to extract important information or patterns from opinion data. A review statement may carry a mix of sentiments towards different aspects. For example, consider: "The food and ambiance at xyz hotel were extraordinary, as expected. However, the waiters seemed rude." Here, the main entity of the review is a 'hotel'. Henceforth, we will refer to such a main entity as the global item. However, no definite overall sentiment is expressed towards the global item. Different sentiments are expressed towards the food and ambiance aspects (extraordinary: positive) and towards the aspect of service (waiters, rude: negative). Thus, it is important to approach sentiment detection as an aspect-based problem.
The SemEval-2016 Task 5 on Aspect Based Sentiment Analysis (ABSA) focuses on this problem (Pontiki et al., 2016). This task is a continuation of the SemEval-2015 ABSA task (Pontiki et al., 2015). The task was organized across different domains and languages. We participated in the Restaurant domain for the English language. The focus of our system is polarity detection and not aspect extraction. Thus, we use a dataset in which the aspects are already known.
To develop our system, we have used standard features for the basic model and also rules to check the effect of the precedent-polarity sequence pattern on the polarity to be predicted. We focus on experimenting with the sequence pattern. The system is described in Section 3. Pre-processing is described in Sub-section 3.1, the selected features are discussed in Sub-section 3.2, and the sequence pattern is discussed in Sub-section 3.3. In Section 4, we discuss the analysis and evaluation results for our system.

Related Work
Aspect-based sentiment analysis has been the subject of several interesting works. McAuley et al. (2012) employ a topic modeling paradigm to address this problem. Deep learning has also been explored in this area, for example by Wang and Liu (2015), who used a Convolutional Neural Network for aspect-based analysis of the SemEval-2015 ABSA data and reported performance comparable to the top systems of the 2015 task. Previously, the system that achieved the best performance in the Polarity Detection task of SemEval-2014 used various innovative linguistic features, publicly available sentiment lexicon corpora and automatically generated polarity lexicons. In SemEval-2015, the SENTIUE system by Saias (2015) provided remarkable results. It used a wide range of features such as Bag-of-Words, negation words, bigrams after negation, polarity inversion, polarized terms in the last 5 tokens, publicly available lexicons etc. It used MALLET with a Maximum Entropy classifier.

Classification System
Our system uses a Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel as the classifier. The scikit-learn SVM implementation has been used (Pedregosa et al., 2011). This classifier is trained using the dataset provided by the task organizers. This dataset consists of several reviews, each with a unique review ID (rID). Each review consists of several sentences. A sentence may have a single aspect or multiple aspects. Sentences under the same rID express sentiment towards the same global item. In our case, the global item is some restaurant. The data are parsed into the following format: {Review(rID) {Sentence1 aspect1:(target, category, polarity, from, to)}, {Sentence2 aspect1:(target, category, polarity, from, to)}, ..., {SentenceN aspect1:(target, category, polarity, from, to) ... aspectM:(target, category, polarity, from, to)} }.
Here, Review(rID) is just one instance out of several such reviews. (target, category, polarity, from, to) are values belonging to an aspect of a sentence. Polarity values are positive, negative or neutral. SentenceN is an example of a sentence which contains multiple aspects. The test data are also parsed in the same format except that polarity values are not provided. Henceforth, (target, category, polarity, from, to) will be referred to as (tar, cat, p, f, t) for simplicity.
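The parsed format described above could be held in memory roughly as follows. This is only a minimal sketch: the class names, the helper `polarity_sequence`, the demo review and the reading of from/to as character offsets are assumptions made for illustration, not details taken from our system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Aspect:
    tar: Optional[str]  # opinion target term; None when the target is NULL
    cat: str            # aspect category, e.g. "FOOD#QUALITY"
    p: Optional[str]    # polarity; None for unlabeled test data
    f: int              # 'from' offset of the target (assumed: character offset)
    t: int              # 'to' offset of the target (assumed: character offset)


@dataclass
class Sentence:
    text: str
    aspects: List[Aspect] = field(default_factory=list)


@dataclass
class Review:
    rid: str
    sentences: List[Sentence] = field(default_factory=list)


def polarity_sequence(review):
    """Return the polarity labels in document order, one per aspect."""
    return [a.p for s in review.sentences for a in s.aspects]


# Hypothetical review, invented for illustration only.
review = Review(
    rid="demo-1",
    sentences=[
        Sentence("The food was great.",
                 [Aspect("food", "FOOD#QUALITY", "positive", 4, 8)]),
        Sentence("But the staff was rude.",
                 [Aspect("staff", "SERVICE#GENERAL", "negative", 8, 13)]),
    ],
)
print(polarity_sequence(review))
```

Keeping the aspects in document order under one review object makes the per-rID polarity sequence directly available.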
To develop our system, we have used the NLTK package (Loper and Bird, 2002) in Python with resources such as the WordNet package, SentiWordNet, Bing Liu's opinion lexicon and the MPQA subjectivity lexicon.
Based on observations of the provided dataset, we hypothesize that only the terms related to tar affect the aspect polarity p. In the example above, only "more than usually greasy" is relevant for "pork shu mai". Thus, we first decompose any {SentenceX aspect1:(tar, cat, p, f, t) ... aspectM:(tar, cat, p, f, t)} into {SubSent1 aspect1:(tar, cat, p, f, t), ..., SubSentM aspectM:(tar, cat, p, f, t)}, where M is greater than or equal to 1 and SentenceX is any sentence with aspect values assigned to it.
For decomposition, we first use the Stanford Dependency Parser (de Marneffe et al., 2006) to obtain a dependency graph of SentenceX. Using the obtained graph, we choose the terms in SentenceX which are more closely related to the tar terms. For example, in "the staff acted like we were imposing on them and they were very rude", the underlined terms are related in the dependency graph. Here, tar is 'staff'. The SubSent for any tar can be formed using only such related terms.
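One way to picture the selection of related terms is a bounded walk over the dependency graph starting from tar. The sketch below is illustrative only: the hop limit `max_hops`, the function name `related_terms` and the hand-written edge list are assumptions, and the edges are not real parser output. The exact closeness criterion is not reproduced here.

```python
from collections import deque


def related_terms(edges, tar, max_hops=2):
    """Collect terms within max_hops of tar in an undirected dependency graph."""
    graph = {}
    for head, dep in edges:
        graph.setdefault(head, set()).add(dep)
        graph.setdefault(dep, set()).add(head)
    seen = {tar: 0}  # term -> hop distance from tar
    queue = deque([tar])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return set(seen)


# Hand-written toy edges (head, dependent); NOT real parser output.
toy_edges = [
    ("acted", "staff"),
    ("acted", "imposing"),
    ("imposing", "we"),
    ("imposing", "them"),
    ("acted", "rude"),
    ("rude", "they"),
    ("rude", "very"),
]
print(sorted(related_terms(toy_edges, "staff", max_hops=2)))
```

A smaller hop limit yields a tighter SubSent; a larger one pulls in terms that may belong to other aspects.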
SubSent formation is not straightforward when tar is NULL. We use self-generated tar values in such cases. Our intuition is that the terms expressing sentiment should be related to a noun or pronoun subject (for instance, "loud" and "rude" related to family). Thus, after eliminating all SubSent for non-NULL tar, sentiment terms in the remaining sentence are identified by looking up terms in the lexicon corpora. Then, a noun or pronoun related (in the dependency graph) to the identified terms is considered as tar. Since the global item is a restaurant, if 'food', 'drinks', 'service', 'waiter', 'price', 'staff' or 'ambiance' are present, they are preferably considered as tar. Also, 'they', 'she' and 'he' are frequently used to refer to service staff in the provided dataset. Hence, these terms are also preferred as tar.
A note on the lexicon corpora: the MPQA subjectivity lexicon is due to Wiebe et al. (2005). Bing Liu's lexicons and SentiWordNet are available as part of the NLTK package. Bing Liu's lexicons and MPQA are binary, i.e., they simply classify words or terms as positive or negative. SentiWordNet provides a range of positive and negative scores for terms.
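The preference among candidate tar values described above could be sketched as follows. The function name `choose_tar` and the tie-breaking rule (first candidate in sentence order) are assumptions for illustration. The candidates are taken as already extracted from the dependency graph.

```python
# Terms preferred as self-generated targets, as listed in the text.
PREFERRED_TAR = ["food", "drinks", "service", "waiter", "price", "staff",
                 "ambiance", "they", "she", "he"]


def choose_tar(candidates):
    """Pick a self-generated target from noun/pronoun candidates.

    candidates: nouns or pronouns already found to be related (in the
    dependency graph) to a sentiment term. A preferred term wins if present;
    otherwise the first candidate is used (assumed tie-break). Returns None
    when there are no candidates.
    """
    lowered = [c.lower() for c in candidates]
    for term in lowered:
        if term in PREFERRED_TAR:
            return term
    return lowered[0] if lowered else None


print(choose_tar(["table", "They"]))
print(choose_tar(["family"]))
```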
After decomposition, we filter out stop words selected from NLTK's stop word list. Numbers and symbols (except '!') are also filtered out using a regular expression.
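A minimal sketch of this filtering step is given below. The stop word set is a tiny stand-in for the words selected from NLTK's list, and the regular expression is an assumed one, not the expression used in our system. Negation words such as 'not' are deliberately absent from the stand-in list so that they survive for later features.

```python
import re

# Tiny stand-in stop word list; the system selects words from NLTK's list.
STOP_WORDS = {"the", "a", "an", "and", "were", "was", "is", "to", "of"}

# Keep runs of letters (with optional internal apostrophes) and '!' only,
# which drops numbers and all other symbols.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*|!")


def clean(sub_sent):
    """Lower-case, drop numbers/symbols except '!', then drop stop words."""
    tokens = TOKEN_RE.findall(sub_sent.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(clean("The 2 waiters were very rude!!"))
```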

Features
The following basic features have been used in our model:
Presence of negation terms: The scores of sentiment lexicons are modified according to negation (e.g., 'not', 'didn't', 'don't' etc.). Bing Liu and MPQA features are simply reversed (pos → neg, neg → pos). For SentiWordNet features, negation is applied in proportion to the scores. For example, a word like 'extraordinary', which has a higher positive score, is less affected by negation than a word like 'good', which has a lower positive score. We use a simple scheme for score modification: pos = pos + ... Here, pos and neg are the positive and negative lexicon scores of a term, respectively. Significant prior work on the negation problem provides several methods for shifting sentiment scores.
4. Punctuation like '!'. In the training dataset, this punctuation mostly co-occurs with positive polarity. Hence, the occurrence of the punctuation is checked.

5. Keywords associated with a specific aspect category: There are a total of 12 aspect categories (cat), such as FOOD#QUALITY, FOOD#PRICES, RESTAURANT#GENERAL, SERVICE#GENERAL etc., in the provided dataset. For a specific cat, there could be keywords which, when co-occurring with the cat, express some sentiment. For example, high and low are generic terms, but for FOOD#PRICES they can indicate a polar sentiment. We divide the dataset into 12 documents, one for each cat. Then, we identify keywords based on Term Frequency-Inverse Document Frequency (TF-IDF) scores. The frequently occurring terms are added to a keyword list. Frequency thresholds of min:0.3 & max:0.8 are used. A total of 12 keyword lists are obtained, one for each cat. Then, for each {SubSent aspect:(tar, cat, p, f, t)}, we check for the presence of keywords corresponding to cat in SubSent. If found, the keywords are used as new uni-gram features.
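The per-category keyword selection might look like the sketch below. It is written in plain Python for illustration and rests on two assumptions: the min/max thresholds are read as bounds on the fraction of documents containing a term, and a hypothetical `top_k` cut-off limits each list. The toy documents are invented and use 3 categories instead of 12.

```python
import math
from collections import Counter


def category_keywords(docs, min_df=0.3, max_df=0.8, top_k=3):
    """docs: {cat: list of tokens}. Returns {cat: top_k terms by TF-IDF}.

    Terms are kept only if the fraction of documents containing them lies
    in [min_df, max_df] (an assumed reading of the frequency thresholds).
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # document frequency: one count per document
    allowed = {t for t, n in df.items() if min_df <= n / n_docs <= max_df}

    keywords = {}
    for cat, tokens in docs.items():
        tf = Counter(t for t in tokens if t in allowed)
        scores = {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords[cat] = ranked[:top_k]
    return keywords


# Toy token lists, invented for illustration.
toy_docs = {
    "FOOD#PRICES": ["high", "price", "food", "high"],
    "FOOD#QUALITY": ["tasty", "food", "fresh"],
    "SERVICE#GENERAL": ["rude", "staff", "food", "high"],
}
kw = category_keywords(toy_docs)
print(kw)
```

In the toy data, 'food' occurs in every document and is therefore excluded by the upper threshold, which is the intended effect of bounding the frequency from above.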
These are the features used for basic classification model. In the next sub-section, we describe inclusion of sequence pattern.

Using precedent polarity sequence (experimental)
The following observations are made on the provided training dataset:
1. In the majority of cases, the sentences under the same rID exhibit similar sentiment. In other words, the polarity values $\{p_1, p_2, ..., p_N\}$ under the same rID are equal. Henceforth, we will refer to this as Flow.
2. There are sentences where the polarity values change, i.e., $p_i$ is not equal to $p_{i-1}$, under the same rID. Henceforth, we will refer to this as Trans (transition). Trans instances may be identified by explicit contrast terms present around tar. The common contrast terms found in the dataset are: 'but', 'however', 'though', 'tho', 'although', 'yet', 'except'. For instance: "The decor is right tho...but they REALLY need to clean that vent in the ceiling...its quite un-appetizing, and kills your effort to make this place look sleek and modern" (target="place" polarity="negative"; target="decor" polarity="positive"; target="vent" polarity="negative"). However, this does not imply that a contrast term is always present when Trans happens.
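The two observations can be made concrete with a short sketch that labels each step of a polarity sequence as Flow or Trans and looks for the listed contrast terms. The function names are hypothetical, and whole-word matching is an assumed detail.

```python
import re

# Contrast terms listed in the text.
CONTRAST_TERMS = {"but", "however", "though", "tho", "although", "yet", "except"}


def contrast_terms_in(text):
    """Return the contrast terms that occur as whole words in text."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w in CONTRAST_TERMS]


def flow_or_trans(polarities):
    """Label each step of a polarity sequence as 'Flow' or 'Trans'."""
    return ["Flow" if cur == prev else "Trans"
            for prev, cur in zip(polarities, polarities[1:])]


print(contrast_terms_in("The decor is right tho...but they REALLY need to clean that vent"))
print(flow_or_trans(["positive", "positive", "negative", "negative"]))
```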
Exploiting the Flow or Trans patterns can help address ambiguity. This is the main reason for including the sequence pattern. Consider the following sentence: "The manager came to the table and said we can do what we want, so we paid for what we did enjoy, the drinks and appetizers." For a classifier, the sentiment expressed towards 'manager' may be ambiguous. Our basic model classifies this as neutral, while the true polarity is negative. However, if we take the previous sentence into consideration ("The level of rudeness was preposterous"), the state of mind of the reviewer becomes clearer.
Based on this observation, we hypothesize that, under the same review (rID), the precedent polarity outcome affects the current polarity outcome, either by Flow or Trans, given certain conditions. Vanzo et al. (2014) propose a context-based model for sentiment analysis of tweets along similar lines. They use sequences of tweets to build a conversational context, hashtags to build a topical context, and also a Markovian approach.
We describe our methods to account for Flow or Trans here.
Method1: We use a new set of features instead of the basic feature set discussed in Sub-section 3.2. First, we generate the features representing conditions for Flow or Trans. We use two conditions for our model, contrast keywords and sentiment keywords, present in a SubSent. The training dataset is divided into 3 sub-sets according to polarity labels. Then, we search for sentiment terms belonging to one of the lexicon corpora, sentiment terms with negation terms (bi-grams and tri-grams) and terms belonging to our neutral word list. A new dictionary D is created with these terms. Moreover, TF-IDF based selection is performed on the 3 sub-sets (or documents). Frequency thresholds are min:0.3 & max:0.8. This ensures the inclusion of any frequent keywords which are not already members of D. Then, for a $SubSent_i$, the feature set $X_i = \{posD, negD, neutD, cont\}_i$ is generated. Here, posD: terms in a SubSent belonging to the positive section of D; negD: terms in a SubSent belonging to the negative section of D; neutD: terms in a SubSent belonging to the neutral section of D; cont: contrast terms in the SubSent. Separate sentiment classes have been used here to let the classifier learn how strongly a SubSent is inclined towards any particular sentiment type. The classifier should learn that if such an inclination is strong, then ambiguity is low, and so the effect of the previous SubSent should also be low.
The new input feature set corresponding to $SubSent_i$ is $X(i) = \{X_i, X_{i-1}, X_{i-2}\}$, plus selected features from Sub-section 3.2. For the first two SubSents under an rID, $X_{i-2}$ or both $X_{i-1}$ and $X_{i-2}$ are empty. We do not use n-gram and neutral word features because terms are now selected from D. Punctuation is ignored since its effect is minimal (Table 1). The cat-specific keywords are included because they are extracted using different document types. Lexicon scores are also included to capture sentiment strength. The same SVM-RBF classifier is then trained with X to predict polarities. For the test data, the same dictionary D is used to generate the new features.
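The stacking of the current and two precedent feature sets could be sketched as below. Encoding each of posD, negD, neutD and cont as a term count is an assumption made to keep the sketch short, as are the function names and the toy dictionary. The features retained from Sub-section 3.2 are omitted.

```python
def subsent_features(tokens, D, contrast_terms):
    """Return X_i = (posD, negD, neutD, cont) as term counts for one SubSent."""
    return (
        sum(tok in D["positive"] for tok in tokens),
        sum(tok in D["negative"] for tok in tokens),
        sum(tok in D["neutral"] for tok in tokens),
        sum(tok in contrast_terms for tok in tokens),
    )


def stacked_features(review_subsents, D, contrast_terms):
    """Build X(i) = X_i + X_{i-1} + X_{i-2} for each SubSent of one review."""
    empty = (0, 0, 0, 0)  # stands in for a missing predecessor
    per_subsent = [subsent_features(t, D, contrast_terms) for t in review_subsents]
    stacked = []
    for i, x_i in enumerate(per_subsent):
        x_prev1 = per_subsent[i - 1] if i >= 1 else empty
        x_prev2 = per_subsent[i - 2] if i >= 2 else empty
        stacked.append(x_i + x_prev1 + x_prev2)
    return stacked


# Toy dictionary and SubSents, invented for illustration.
toy_D = {"positive": {"great"}, "negative": {"rude"}, "neutral": {"table"}}
toy_review = [["food", "great"], ["but", "staff", "rude"], ["table", "ok"]]
rows = stacked_features(toy_review, toy_D, {"but"})
for row in rows:
    print(row)
```

Stacking is done per review, so the first SubSent of one rID never sees features from the previous rID.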
Method2: This method is along the lines of auto-regression. However, a polarity sequence is not a strict time series. Hence, we devise our mathematical model with the necessary considerations. A first set of predicted polarities $P_1 = \{p_{11}, p_{12}, ..., p_{1k}\}$ is obtained using SVM-RBF with all of the basic features from Sub-section 3.2. Polarities are mapped as {positive, negative, neutral} → {1, -1, 0}. The aim is to obtain the final predictions, $P_2 = \{p_{21}, p_{22}, ..., p_{2k}\}$. The feature set $X_i = \{posD_i, negD_i, neutD_i, cont_i\}$ for $SubSent_i$ of the test data is obtained using D. However, we do not predict using these features. The Flow or Trans effect is calculated directly from the $P_1$ values. For each $SubSent_i$, we define the following values:
$s^p_i$: positive vote. This is initialized to 0, then incremented by +1 for the first term found in $posD_i$ and by +0.5 for every subsequent term in $posD_i$.
$s^n_i$: negative vote. This is initialized to 0, then incremented by -1 for the first term found in $negD_i$ and by -0.5 for every subsequent term in $negD_i$.
$s^o_i$: neutral vote. This is initialized to 0.4 ($s^o_{i,min}$), then incremented by $(1 - s^o_i)/4$ for every term found in $neutD_i$, keeping the value below 1.
$c_i$: contrast vote. This is initialized to +1; $c_i = 2$ is assigned if $cont_i$ is not empty.
$w_i$: aggregate voting weight. This is calculated as ...
Since a SubSent must express some sentiment, we assume a basic neutral characteristic in each SubSent. Hence, $s^o_{i,min}$ is added. We define a function
$$g(w, p) = |w| \, (p + ||p| - 1|).$$
Then, using these parameters, we calculate a weighted effect, $\hat{p}(i)$, for polarity as
$$E_i = g(w_{i-1}, p_{1,i-1}) + \sum_{r=l}^{i-1} \frac{c_r}{c_{r-1}} \, g(w_{r-1}, p_{1,r-1}),$$
$$E_{i(avg)} = \frac{E_i}{1 + \sum_{r=l}^{i-1} \frac{c_r}{c_{r-1}}},$$
$$\hat{p}(i) = g(w_i, p_{1i}) + \frac{E_{i(avg)}}{2 c_i}.$$
The increment and assignment values have been chosen after experimenting with different values. Also, for our model, $l = i-2$ works best. The effect value $E_i$ captures the effect of precedent polarities.
The effect of a polarity value $p_{1,i-2}$ should be amplified with respect to $p_{1,i-1}$ if $p_{1,i-1}$ came by contrast, and reduced if $p_{1,i-2}$ itself came by contrast. Hence, $p_{1,i-2}$ is multiplied by $c_{i-1}/c_{i-2}$. Finally, the average effect $E_{i(avg)}$ should be reduced if the current $SubSent_i$ has explicit contrast terms. Hence the division by $2c_i$. Then, if $\hat{p}(i) < 0$, $p_{2i} = -1$; if $\hat{p}(i) > 1$, $p_{2i} = 1$; otherwise, $p_{2i} = 0$.
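The building blocks of Method2 that are fully specified in the text (the votes, the function g and the final thresholding) can be sketched as follows. The aggregate weight w_i and the averaged effect E_i(avg) are passed in as plain numbers, because their computation is not reproduced in this sketch. The function names are hypothetical.

```python
def votes(pos_terms, neg_terms, neut_terms, cont_terms):
    """Return (s_p, s_n, s_o, c) for one SubSent, using the stated increments."""
    # +1 for the first positive term, +0.5 for each subsequent one.
    s_p = (1.0 + 0.5 * (len(pos_terms) - 1)) if pos_terms else 0.0
    # -1 for the first negative term, -0.5 for each subsequent one.
    s_n = -(1.0 + 0.5 * (len(neg_terms) - 1)) if neg_terms else 0.0
    s_o = 0.4  # minimum neutral vote
    for _ in neut_terms:
        s_o += (1.0 - s_o) / 4.0  # approaches but never reaches 1
    c = 2.0 if cont_terms else 1.0
    return s_p, s_n, s_o, c


def g(w, p):
    """g(w, p) = |w| (p + ||p| - 1|), with p in {1, -1, 0}."""
    return abs(w) * (p + abs(abs(p) - 1))


def final_polarity(w_i, p1_i, e_avg, c_i):
    """Combine the current vote with the averaged precedent effect, then threshold."""
    p_hat = g(w_i, p1_i) + e_avg / (2.0 * c_i)
    if p_hat < 0:
        return -1
    if p_hat > 1:
        return 1
    return 0


print(votes(["good", "tasty", "fresh"], ["rude", "slow"], ["table"], []))
print(g(2, 1), g(-2, -1), g(0.5, 0))
print(final_polarity(1.0, -1, 0.5, 1.0))
```

Note that g maps a neutral first-pass prediction (p = 0) to +|w|, so a neutral SubSent still contributes a positive magnitude; the final thresholds at 0 and 1 then decide the class.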
These equations are tuned based on observations made on training data. More generic and robust equations need to be formed. This needs further investigation.

Analysis using training data
The analysis of our system on the training data is provided in Table 1. SVM-RBF with parameters [C=100, kernel='rbf', gamma=0.001] is used (the same for evaluation/test). The parameters are obtained using grid search. The accuracy scores are obtained using 10-fold cross-validation from scikit-learn (Pedregosa et al., 2011). N-grams obtained using the dependency relation with the aspect target are the base features. Lexicons are essential to capture sentiment types and scores. However, we found that some terms occurring in neutral sentences were not listed in the lexicon corpora. Hence, we generated our own list of neutral words. Punctuation (!) has a negligible effect on the performance. Including aspect-category keywords improves accuracy. As discussed earlier, keywords are required to include terms that express some opinion specific to a category. These are the only basic features used. On top of this, we include the polarity sequence pattern. It can be seen that Method2 provides a better result than Method1, but only by a slight margin. Method2 is not necessarily better, but we prefer using it: it theoretically permits using more than one precedent polarity in the sequence, if required, without involving complex features; only the summation series needs to be expanded as we go along a polarity sequence. Method2 is used in the final evaluation model.
Due to time constraints, we focused more on the inclusion of the polarity sequence pattern than on engineering better features or classifier ensembles.
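For reference, the classifier configuration and the 10-fold cross-validation reported above can be set up in scikit-learn roughly as follows. The data here are synthetic stand-ins generated only so that the sketch runs; they are unrelated to the task data, and the resulting scores say nothing about our system.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data: 3 polarity classes, 20 samples each, 5 features.
rng = np.random.default_rng(0)
y = np.repeat([-1, 0, 1], 20)
X = rng.normal(size=(60, 5)) + y[:, None]

# Parameters as reported in the text.
clf = SVC(C=100, kernel="rbf", gamma=0.001)

# 10-fold cross-validated accuracy.
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print(len(scores), scores.mean())

clf.fit(X, y)
predictions = clf.predict(X)
print(predictions[:3])
```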

Evaluation result
The result of the evaluation is provided in Table 2. There were 676 sentences in the evaluation (test) data and 859 instances of aspect values (tar, cat, f, t). The polarity values had to be predicted. The system predictions were evaluated against gold labels by the organizers. There were a total of 30 submissions in the polarity detection task for the Restaurant domain and English language. This included multiple submissions from single teams as well. The relative performance of our system was poor. This may be attributed to less effort invested in improving the classifier model (using ensembles, or otherwise) or in using more robust features. We also suspect that the {posD, negD, neutD, cont} features may be biased towards the training data, as the keyword dictionary D was generated on the full training dataset before evaluation. Moreover, Method2 is tuned using the training data and is expected to perform worse on unseen datasets.

Further evaluation on gold-labeled data
We carried out a further evaluation of our system after the release of the gold-labeled test data. This was aimed at checking the effect of using the sequence pattern. The results are provided in Table 3. The accuracy obtained against gold labels without using the sequence pattern was 0.668. By using Method1, the accuracy increased to 0.702. With Method2, the accuracy obtained was 0.717. These are small increments. Also, the method has obvious caveats, as mentioned above. So, the usage of the sequence pattern needs to be improved through more research.

Conclusion
We submitted an unconstrained system for sentiment polarity detection. The system was unconstrained in the sense that it used several external resources for feature generation. Apart from standard features, we experimented with the polarity sequence pattern. This approach provides a slight improvement in prediction accuracy, as checked on the evaluation data. However, for any serious purpose, this approach requires deeper investigation. Our next step would be to devise more robust feature extraction to handle polarity sequence patterns. Moreover, this approach needs to be tested on broader datasets. We will also explore using the sequence pattern with multiclass Platt Scaling (Zadrozny and Elkan, 2002) or ensemble models to check performance.