IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain Dependency and Distributional Semantics Features for Aspect Based Sentiment Analysis

This paper reports the IIT-TUDA participation in the SemEval 2016 shared Task 5 of Aspect Based Sentiment Analysis (ABSA) for sub-task 1. We describe our system, which incorporates domain dependency graph features, a distributional thesaurus and unsupervised lexicon induction from an unlabeled external corpus for aspect based sentiment analysis. Overall, we submitted 29 runs, covering 7 languages and 4 different domains. Our system is placed first in sentiment polarity classification for the English laptop domain and for Spanish and Turkish restaurant reviews, and in opinion target expression for Dutch and French in the restaurant domain, and achieves mid-range ranks for aspect category identification and opinion target extraction.


Introduction
The advent of web technologies has created an unprecedented opportunity for online users to share and explain their views and opinions. The correlation between the views expressed by users and the market strategies of organizations underscores the importance of analyzing such reviews. But, valuable as they are, user-generated review texts are unstructured and very noisy. Major research studies have adopted Natural Language Processing (NLP) and text mining techniques to better understand and process various types of information in user-generated reviews. Such efforts have come to be known as opinion mining, sentiment analysis or review mining (Pang and Lee, 2008).
Aspect level analysis performs a finer-grained sentiment analysis by addressing three subproblems: extracting aspects from the review text, identifying the entity that is referred to by the aspect, and finally classifying the opinion polarity towards the aspect (Liu, 2012). For example, a review of the "entity" laptop is likely to discuss distinct "aspects" like size, processing unit, and memory, and a single product can trigger a positive "opinion" about one feature, and a negative "opinion" about another.
In an attempt to support the efforts on Aspect Based Sentiment Analysis (ABSA), the SemEval 2016 shared Task 5 ABSA (Pontiki et al., 2016) offers the opportunity to experiment and evaluate on benchmark datasets (reviews) across various domains and languages through three subtasks. Subtask 1 covers the three sub-problems mentioned above, namely: aspect category identification (Slot 1), opinion target expression (OTE) (Slot 2) and sentiment polarity classification (Slot 3). We participated in Slot 1 and Slot 3 for English, Spanish, Dutch, French, Turkish, Russian and Arabic, for all available domains except telecoms. We also conducted experiments for Slot 2 for English, Spanish, Dutch and French. Overall, we submitted 29 runs, covering 7 languages and 4 different domains.

Method for Aspect Based Sentiment Analysis
In this section, we describe our data preprocessing and feature extraction. We also introduce an unsupervised approach for expanding the coverage of existing lexical resources based on the notion of distributional thesaurus. We use Support Vector Machine (SVM) (Cortes and Vapnik, 1995) as the baseline classifier for aspect category detection and sentiment polarity classification, and Conditional Random Fields (CRF) (Lafferty et al., 2001) for opinion target expression identification.

Preprocessing
We tokenize the data using the Stanford tokenizer, normalize all digits to 'num' and remove stop words for tf-idf computation. For opinion target expression, we run the Stanford CoreNLP suite to extract information such as lemma, Part-of-Speech (PoS) and named entity (NE) labels for English. For languages other than English, we use the universal parser for tokenization and parsing. Since we treat OTE identification as a sequence labelling problem, it is necessary to identify the boundaries of opinion targets properly. We follow the standard BIO notation, where 'B-ASP', 'I-ASP' and 'O' represent the beginning, intermediate and outside tokens of a multi-word OTE, respectively.
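The BIO labelling scheme described above can be sketched as follows. The example sentence and target span are hypothetical illustrations, not taken from the SemEval data:

```python
# Minimal sketch of BIO labelling for a multi-word opinion target
# expression (OTE): 'B-ASP' starts the target, 'I-ASP' continues it,
# and 'O' marks all other tokens.
def bio_labels(tokens, target_start, target_end):
    """Assign B-ASP/I-ASP/O labels given a target token span (inclusive)."""
    labels = []
    for i, _ in enumerate(tokens):
        if i == target_start:
            labels.append("B-ASP")
        elif target_start < i <= target_end:
            labels.append("I-ASP")
        else:
            labels.append("O")
    return labels

tokens = ["The", "spring", "rolls", "were", "great"]
# "spring rolls" (token indices 1..2) is the opinion target
print(bio_labels(tokens, 1, 2))  # ['O', 'B-ASP', 'I-ASP', 'O', 'O']
```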

Features for Aspect Category Detection
• Domain Dependency Graph: We use the aspect lists produced by the Domain Dependency Graph (DDG) of (Kohail, 2015) for each domain. The idea is to detect topics underlying a mixed-domain dataset, aggregate individual dependency relations between domain-specific content words, weigh them with tf-idf and produce a DDG by selecting the highest-ranked words and their dependency relations. Since the domains are already given, no topic modeling is required. However, since only one domain was provided for French and Spanish, we used external review datasets to compute tf-idf: movie reviews for Spanish, and book, music and DVD reviews for French. The resulting graphs were filtered so that only 'amod' (adjective modifying a noun) and 'nsubj' (nominal subject of a predicate) relations were retained. For each aspect extracted from the opinion-aspect pairs, we encode the presence or absence of this aspect as a binary feature.
• Distributional Thesaurus: A Distributional Thesaurus (DT) is an automatically computed lexical resource that ranks words according to their semantic similarity. We employ an open-source implementation of DT computation as described in (Biemann and Riedl, 2013). For the top five most significant words by tf-idf score in each aspect category (for example, 'overpriced', '$', 'pricey', 'cheap' and 'expensive' are the most significant terms in the 'food#price' category), we find the ten most similar words according to the DT. The presence or absence of these words in the review is used as a feature for aspect category identification. Examples from the distributional thesaurus are presented in Table 1.
• Tf-Idf Score: Each aspect category has some discriminative aspect terms. We extract at most the top five distinguishing words in each category based on tf-idf score. The presence or absence of each such word in the review is used as a binary feature.
• Bag of Words: This feature denotes the number of occurrences of each word in the review.
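The binary presence features above (DT expansions, top tf-idf terms) can be sketched as follows. The term list for 'food#price' is taken from the example in the text; the review sentence and function name are illustrative assumptions:

```python
# Sketch of the binary presence features used for aspect category
# detection: one 1/0 feature per category term, indicating whether
# the term occurs in the review.
def presence_features(review_tokens, category_terms):
    """Return a binary vector: 1 if the term occurs in the review, else 0."""
    tokens = set(t.lower() for t in review_tokens)
    return [1 if term in tokens else 0 for term in category_terms]

# Significant terms for the 'food#price' category, as in the text
food_price_terms = ["overpriced", "pricey", "cheap", "expensive", "$"]
review = "The food was cheap but totally overpriced for the portion".split()
print(presence_features(review, food_price_terms))  # [1, 0, 1, 0, 0]
```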

Features for Opinion Target Expression
• Word and Local Context: We use the current token, its lowercase form and the context tokens in a window of [-5..5] as features.
• Part-of-Speech (PoS) Information: We use PoS information of the current, preceding two and following two tokens as the features.
• Head Word and its PoS: We use the head word of the noun phrase and PoS information of the head word.
• Prefix and Suffix: We use prefix and suffix of length up to four characters.
• Frequent Aspect Term: We build a list of frequently occurring OTEs from the training set. An OTE is considered to be frequent if it appears at least four times in the training corpus. We define a binary feature for the presence or absence of extracted OTEs.
• Dependency Relations: For English, features are defined in line with (Toh and Wang, 2014). For other languages, the feature indicates whether the current token participates in any of the dependency relations 'nsubj', 'dep', 'amod', 'nmod' and 'dobj'.
• Character N-grams: We use all substrings up to length 5 of the current token as features.
• Orthographic Feature: This feature checks whether the current token starts with a capital letter.
• DT Features: We use the top 5 DT expansions of the current token as features.
• Expansion Score: OTEs typically have opinions expressed around them, and opinions are regularly lexicalized with words found in sentiment lexicons. For English, we calculate a sentiment score based on SentiWordNet (http://sentiwordnet.isti.cnr.it/) (Esuli and Sebastiani, 2006). For languages other than English, we use our induced lexicons. We calculate the sentiment score over a window of size 10 (the preceding 5 and following 5 tokens of the target token).
We additionally extract the following features only for English.
• Chunk information: To identify the boundaries of multi-word OTEs, we use chunk information of the current token as the features.
• Lemma: Lemmatization trims the inflectional forms and derivationally related forms of a token to a common base form.
• WordNet: We use the top 4 noun synsets of the current token from WordNet as features.
• Named Entity Information: We extract named entity information for the current token with the Stanford CoreNLP tool, and use the NER sequence labels in the BIO scheme as features.
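Several of the token-level features above can be sketched as a CRFSuite-style feature dictionary. The feature names, the reduced context window and the example sentence are illustrative assumptions, not the exact representation used in the system:

```python
# Hedged sketch of per-token feature extraction for the CRF-based OTE
# tagger: word form, lowercase form, an orthographic check, prefixes and
# suffixes up to length 4, and lowercased context words in a small window.
def token_features(tokens, i, window=2):
    tok = tokens[i]
    feats = {
        "word": tok,
        "word.lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
    }
    for n in range(1, 5):  # prefix/suffix of length 1..4
        feats[f"prefix{n}"] = tok[:n]
        feats[f"suffix{n}"] = tok[-n:]
    for off in range(-window, window + 1):  # local context window
        j = i + off
        if off != 0 and 0 <= j < len(tokens):
            feats[f"ctx[{off}]"] = tokens[j].lower()
    return feats

feats = token_features(["The", "spring", "rolls", "were", "great"], 1)
print(feats["word"], feats["prefix3"], feats["ctx[1]"])
```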

Features for Sentiment Polarity Classification
• Lexical Acquisition: We use lexical expansion to induce sentiment words based on the distributional hypothesis. We observe that for rare words, unseen instances and the limited coverage of available lexicons, distributional expansion can provide a useful backoff technique; cf. also (Govind et al., 2014).
For all languages, we construct a polarity lexicon using an external corpus and a seed sentiment lexicon. As seed lexicons, we use the English and Arabic versions (Salameh et al., 2015) of Bing Liu's lexicon (Hu and Liu, 2004) for English and Arabic respectively, the VU sentiment lexicon for French, Dutch and Spanish, the lexicon by (Panchenko, 2014) for Russian, and SentiTurkNet (Dehkharghani et al., 2015) and the NRC Emotion lexicon for Turkish. To induce a lexicon, we obtain the top 100 DT expansions of each word in the seed lexicon. Next, we accept candidate terms that a) occur in the expansions of at least 10 seed terms, and b) have a corpus frequency of more than 50 in the background corpus (English, French, Spanish, Dutch, Russian, Arabic). Finally, we compute normalized positive, negative and neutral scores for each word similar to (Kumar et al., 2015), inspired by (Hatzivassiloglou and McKeown, 1997). The core assumption is that words tend to be semantically more similar to words of the same sentiment. Hence, words appearing more often in the expansions of positive (negative/neutral) words are assigned a higher positive (negative/neutral) sentiment score. In contrast to (Kumar et al., 2015), we compute normalized positive, negative and neutral scores rather than assigning a single polarity class to each word. It should be noted that the size of the induced lexicon depends on two factors: (i) the number of words in the seed lexicon that have expansions and (ii) the pruning thresholds for accepting candidate terms. The unavailability of expansions for some seed words and stricter acceptance thresholds both reduce the size of the induced lexicon. Expansion statistics for the induced lexicons are provided in Table 2.
• Word N-gram: All unigrams and bigrams extracted from the training set are used as binary features, where 1 and 0 indicate the presence and absence of the n-gram in the review.
• Entity-Attribute Pair: We use E#A pair as a binary feature for sentiment classification.
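The lexicon induction procedure described under Lexical Acquisition can be sketched as follows. The thresholds (10 seed expansions, corpus frequency 50) follow the text; the data structures, function name and the toy demonstration with relaxed thresholds are assumptions for illustration:

```python
# Sketch of lexicon induction: accept a candidate word if it appears in
# the DT expansions of enough seed terms and is frequent enough in the
# background corpus, then assign normalized pos/neg/neu scores
# proportional to how often it co-occurs with seeds of each polarity.
from collections import Counter, defaultdict

def induce_lexicon(seed_polarity, expansions, corpus_freq,
                   min_seeds=10, min_freq=50):
    """seed_polarity: word -> 'pos'/'neg'/'neu';
    expansions: seed word -> list of DT-expanded candidate words."""
    votes = defaultdict(Counter)  # candidate -> polarity vote counts
    for seed, cands in expansions.items():
        for c in cands:
            votes[c][seed_polarity[seed]] += 1
    lexicon = {}
    for cand, counts in votes.items():
        n_seeds = sum(counts.values())
        if n_seeds >= min_seeds and corpus_freq.get(cand, 0) > min_freq:
            total = float(n_seeds)
            lexicon[cand] = {p: counts[p] / total
                             for p in ("pos", "neg", "neu")}
    return lexicon

# Toy demonstration with relaxed thresholds (real thresholds: 10 and 50)
seeds = {"good": "pos", "great": "pos", "bad": "neg"}
exp = {"good": ["nice"], "great": ["nice"], "bad": ["awful"]}
freq = {"nice": 120, "awful": 120}
print(induce_lexicon(seeds, exp, freq, min_seeds=2, min_freq=50))
```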

Datasets, Experimental Results and Discussions
For feature selection and hyperparameter tuning, we perform five-fold cross-validation on the training set. For Slot 1 and Slot 3, we use supervised classification with a Support Vector Machine (SVM). Based on cross-validation results, we set probability thresholds of 0.185, 0.13 and 0.145 for restaurants, laptops and phones, respectively, for predicting the aspect categories in a review. All E#A pairs with a predicted probability greater than the threshold are output as aspect categories. For Slot 2, we use CRFSuite with default parameters.
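The per-domain thresholding step above can be sketched as follows. The threshold value is the restaurant-domain setting from the text; the probability values and function name are made up for illustration:

```python
# Sketch of probability thresholding for multi-label aspect category
# prediction: every E#A category whose predicted probability exceeds
# the domain-specific threshold is emitted for the review.
def predict_categories(category_probs, threshold):
    """Return the sorted list of categories above the threshold."""
    return sorted(c for c, p in category_probs.items() if p > threshold)

probs = {"FOOD#QUALITY": 0.62, "FOOD#PRICES": 0.19, "SERVICE#GENERAL": 0.05}
print(predict_categories(probs, 0.185))  # ['FOOD#PRICES', 'FOOD#QUALITY']
```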
Teams were allowed to submit their systems in two modes: constrained and unconstrained. In constrained mode, participants may use only the resources and datasets provided by the organizers, whereas in unconstrained mode, participants may use any external resource. For Slot 2 and Slot 3, we only submitted unconstrained runs, while for Slot 1 we submitted both constrained and unconstrained runs, except for Russian restaurants.
Our system achieves the best results in sentiment polarity classification for reviews about English laptops, Spanish restaurants and Turkish restaurants. We score second for English restaurants. We also achieve the highest F1-score for opinion target expression for French and Dutch restaurants. Our evaluation results across all domains and languages are given in Table 3.
The results show that our system performs comparably well for sentiment polarity classification and opinion target expression. A feature ablation experiment, given in Table 4, shows the effectiveness of the induced lexicons for the Slot 3 task. We obtain a significant improvement when adding information from the induced lexicons in each language. This holds especially for languages other than English, where existing sentiment lexicons are less comprehensive. We also note that entity-attribute pairs help in resolving conflicting sentiments (for example, cheap food (positive) vs. cheap service (negative)).
We score in the middle ranks for the Slot 1 task. Distributional thesaurus based expansion of discriminative terms and aspects obtained through the domain dependency graph yields only marginal improvements. This could be attributed to confusion between very fine-grained aspect categories (for example: Restaurant#Prices, Food#Prices, Drink#Prices). After the evaluation period, we revised our feature representation to ensure that it matches the correct input format for CRF. We also added two new features: unsupervised PoS tags (Biemann, 2009) for all languages and the SentiWordNet score for English. For the current token, we use the PoS tag, distributional thesaurus, lexical expansion score, unsupervised PoS tag and SentiWordNet score of the context tokens [-2, -1, 0, 1, 2], prefixes and suffixes (up to 3 characters), and chunk information of the context tokens [-1, 0, 1]. The updated results after these modifications are shown in Table 5. If we had incorporated these changes earlier, we would have scored third for English and first for the other three languages.

Conclusions and Future Work
In this paper, we report our work on the task of Aspect Based Sentiment Analysis, which covers three slots: aspect category identification, opinion target extraction and sentiment polarity classification. By leveraging a distributional thesaurus, we expand existing domain-specific aspect lists and sentiment lexicons for different languages to reach higher coverage of sentiment words. Our system was ranked first in five out of 29 submitted runs. While performance is satisfactory for Slot 3 and Slot 2 (after correction), our setup compares unfavorably to others for Slot 1. We will continue to improve our system in future work.