Language-Agnostic Model for Aspect-Based Sentiment Analysis

In this paper, we propose a language-agnostic deep neural network architecture for aspect-based sentiment analysis. The proposed approach is based on a Bidirectional Long Short-Term Memory (Bi-LSTM) network, which is further assisted by extra hand-crafted features. We define three different architectures for the successful combination of word embeddings and hand-crafted features. We evaluate the proposed approach on six languages (i.e. English, Spanish, French, Dutch, German and Hindi) and two problems (i.e. aspect term extraction and aspect sentiment classification). Experiments show that the proposed model attains state-of-the-art performance in most of the settings.


Introduction
Sentiment analysis (Pang and Lee, 2008) is often target-centric. In aspect-based sentiment analysis (ABSA), we aim to identify the polarity of the sentiment expressed towards a feature or aspect. These features or aspects are usually explicitly mentioned in the text. A sentence may contain more than one aspect term, and the task is to assign a separate sentiment to each of them. For example, in "The food was great! But service was below par." there are two aspects ('food' and 'service'), and the sentiments expressed towards food and service are positive and negative, respectively. Such analysis offers fine-grained information to a user or an organization that seeks users' opinions towards a specific entity. For example, based on users' feedback, an individual can form a general perception of a specific attribute or aspect of a product or service and make an informed decision about the product or service under observation. Similarly, an organization can utilize the feedback to refine its product or service, or to take a decision about its business model.
Aspect-based sentiment analysis (Pontiki et al., 2014, 2016) has two subproblems at its core, i.e., aspect term identification (or opinion target extraction) and aspect sentiment classification. Given a text, the aspect term identification task aims to find the boundaries of all the aspect terms present in the text, whereas the aspect sentiment classification task classifies each of the identified aspect terms into one of the predefined sentiment classes (e.g., positive, negative, neutral etc.). A sentence may contain any number of aspect terms or no aspect term at all. The terms 'aspect term' and 'opinion target' are often used interchangeably and refer to the same span of text.

Motivation and Contribution
A survey of the literature for ABSA suggests a number of works for different languages (Brun et al., 2016; Çetin et al., 2016). Although the reported performance of these works is good, they usually struggle to handle language diversity, i.e., the systems that reported state-of-the-art performance for one language typically do not work well for other languages. The unavailability of such a generic system motivates us to build a language-agnostic model for aspect-based sentiment analysis. We propose a generic deep neural network architecture that handles the language divergence to a great extent. Our model is based on a Bidirectional Long Short-Term Memory (Bi-LSTM) network (Graves et al., 2005) that also utilizes extra hand-crafted features. We evaluate our proposed approach on four European languages (i.e., Spanish, French, Dutch & German), one Indian language (i.e., Hindi) and English. The contributions of our work are three-fold: a) we propose an efficient and generic neural network architecture that works across multiple languages; b) we utilize a small set of hand-crafted features (one each for aspect extraction and aspect classification) for training and evaluation; and c) we provide new state-of-the-art performance for two problems of ABSA across six different languages.
The rest of the paper is organized as follows: In Section 2, we present the literature survey. The proposed methodology is discussed in detail in Section 3. In Section 4, we furnish the experimental results and provide the necessary analysis. Finally, we conclude in Section 5.
For ABSA, System GTI (Alvarez-López et al., 2016) used a Support Vector Machine (SVM) and Conditional Random Field (CRF) based approach for aspect extraction and sentiment classification, respectively. They used language-dependent features like lemmas and Part-of-Speech (PoS) tags to achieve the state-of-the-art score for aspect extraction in Spanish. IIT-TUDA also used a number of hand-crafted features like character n-grams, dependency relations, prefixes and suffixes for SVM and CRF. They achieved comparable performance for Spanish, French & Dutch. System XRCE (Brun et al., 2016) used a feedback ensemble network that obtained the best performance for aspect classification on the French dataset. System TGB (Çetin et al., 2016) used a Logistic Regression based model to address aspect sentiment classification and reported the best score on the Dutch dataset. Mishra et al. (2017) used a Bi-LSTM based model, whereas Naderalvojoud et al. (2017) adopted a deep recurrent neural network model for the German dataset. Akhtar et al. developed aspect-based sentiment analysis datasets for Hindi. They employed CRF and SVM for aspect term extraction and aspect sentiment classification, respectively. For aspect-based sentiment analysis in English, Kiritchenko et al. (2014) reported the best performance in the SemEval-2014 shared task on ABSA (Pontiki et al., 2014).
There have been a few attempts at injecting hand-crafted features into a neural network architecture for enhancing the overall performance of sentiment analysis (Araque et al., 2017). One such system combined CNN representations and optimized features for learning a Support Vector Machine. Araque et al. (2017) proposed a classifier ensemble model that combines surface-level features and generic word vectors for sentiment classification. However, our work differs from these systems in the following ways: a) we perform aspect-level sentiment analysis for six different languages (belonging to different language families); b) we propose three different architectures to successfully combine the representations learned by the neural network and the hand-crafted features; c) the proposed architectures handle both aspect extraction (a sequence labelling task) and aspect sentiment classification (a classification task); and d) we achieve better performance for most of the problem/language pairs.

Proposed Method
Overall, aspect-based sentiment analysis can be thought of as a two-step process, i.e. aspect term extraction and aspect sentiment classification. Aspect term extraction is a sequence labelling task where each token of a sentence needs to be classified as either inside the boundary of an aspect term or outside. We adopt the BIO notation to mark each token as either Begin, Intermediate or Outside of an aspect term. A 'B' signifies the beginning of an aspect term and the successive 'I's signify the remaining tokens of a multi-token aspect term (e.g. spicy tuna rolls). A single-token aspect term is tagged as 'B'. For the second problem, i.e. aspect sentiment classification, we define a context window of size ±5 around each aspect term and consider all the tokens within the window as one instance. The intuition behind such an approach is that the sentiment-bearing clue words often occur close to the aspect terms. An example scenario is depicted in Table 1.

Review: Rice was good but the main attraction was spicy tuna rolls .
BIO Notation: Rice and Spicy tuna rolls
Context window (±5):
Prev5 | Prev4 | Prev3 | Prev2 | Prev1 | Aspect term | Next1 | Next2 | Next3 | Next4 | Next5
null | null | null | null | null | Rice | was | good | but | the | main
but | the | main | attraction | was | spicy tuna rolls | . | null | null | null | null
Aspect Sentiment: Positive for Rice and Positive for Spicy tuna rolls.
Table 1: An example review from the restaurant domain and its respective processing for aspect term extraction (i.e. BIO notations) and aspect sentiment classification (i.e. contextual processing).
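The processing shown in Table 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the token-index representation of aspect spans and the 'null' padding token are ours and are not taken from the original implementation.

```python
def bio_tags(tokens, aspect_spans):
    """Tag each token as B, I or O given (start, end) token spans, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in aspect_spans:
        tags[start] = "B"                 # first token of the aspect term
        for i in range(start + 1, end):
            tags[i] = "I"                 # remaining tokens of a multi-token term
    return tags


def context_window(tokens, span, size=5, pad="null"):
    """Return (previous tokens, aspect tokens, next tokens), padded to `size`."""
    start, end = span
    prev = tokens[max(0, start - size):start]
    nxt = tokens[end:end + size]
    prev = [pad] * (size - len(prev)) + prev   # pad on the left
    nxt = nxt + [pad] * (size - len(nxt))      # pad on the right
    return prev, tokens[start:end], nxt


tokens = "Rice was good but the main attraction was spicy tuna rolls .".split()
spans = [(0, 1), (8, 11)]                      # 'Rice' and 'spicy tuna rolls'
tags = bio_tags(tokens, spans)
windows = [context_window(tokens, s) for s in spans]
```

For the review of Table 1, `tags` marks 'Rice' as B and 'spicy tuna rolls' as B I I, and `windows` reproduces the two context-window rows of the table.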
Our proposed neural network architecture employs a Bi-LSTM network for learning sentence embeddings, which are then fed to a fully-connected dense layer for classification. Given a sentence, we first compute the embedding of each word and feed the embeddings into the Bi-LSTM network at successive time steps for the prediction. We refer to this architecture as A1; it makes use of word embeddings as the sole input of the network. In addition, we inject extra hand-crafted features to assist the neural architecture. We design three architectures (i.e. A2, A3 & A4 in Figure 1) for the successful combination of word embeddings and hand-crafted features. The basic difference among these three architectures is the way the features are injected into the model. A high-level view of our proposed method is depicted in Figure 1.
In A2, we concatenate the word embeddings with the hand-crafted features at the input and then feed this combined input to the network for learning. In comparison, architecture A3 learns the sentence embedding through a Bi-LSTM network on top of the word embeddings only; the sentence embedding is then merged with the hand-crafted features before being fed into the fully-connected layers for prediction. In contrast, architecture A4 utilizes two separate Bi-LSTM networks for word embeddings and hand-crafted features, respectively. Subsequently, the sequences learned by the two Bi-LSTMs are concatenated and fed into the fully-connected layers for prediction.
The choice of a separate Bi-LSTM for the hand-crafted features in architecture A4 is driven by the fact that the dimension of a word embedding is usually very high compared to that of its corresponding hand-crafted features. If the two are trained together, as in architecture A2, the low-dimensional extracted features usually get overshadowed by the high-dimensional word embeddings, which makes it nontrivial for the network to learn from the extracted features.
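The difference between the four architectures can be summarised by tracking, per token, the size of the vector that enters each Bi-LSTM and the size of the vector that reaches the dense layers. The sketch below is a bookkeeping aid only, not the authors' implementation. It uses the 300-dimensional embeddings and 100 Bi-LSTM units reported in this section, and it assumes (a) that each Bi-LSTM concatenates its forward and backward states and (b) a hypothetical feature dimension of 3.

```python
def layer_input_sizes(arch, emb_dim=300, feat_dim=3, lstm_units=100):
    """Return (input size of each Bi-LSTM, input size of the dense layers).

    feat_dim is a hypothetical value; the concatenation of forward and
    backward states (2 * lstm_units) is an assumption of this sketch.
    """
    bilstm_out = 2 * lstm_units
    if arch == "A1":   # word embeddings only
        return [emb_dim], bilstm_out
    if arch == "A2":   # features concatenated with embeddings at the input
        return [emb_dim + feat_dim], bilstm_out
    if arch == "A3":   # features merged after the Bi-LSTM
        return [emb_dim], bilstm_out + feat_dim
    if arch == "A4":   # separate Bi-LSTMs for embeddings and features
        return [emb_dim, feat_dim], 2 * bilstm_out
    raise ValueError(f"unknown architecture: {arch}")


sizes = {arch: layer_input_sizes(arch) for arch in ("A1", "A2", "A3", "A4")}
```

Under these assumptions, A2 feeds a 303-dimensional vector to a single Bi-LSTM, in which the 3 feature dimensions are easily dominated by the 300 embedding dimensions, whereas A4 gives the features a Bi-LSTM of their own before the two outputs are concatenated.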
Further, to exploit the sequence information of words in a sentence, we pass the hand-crafted features of each word through a separate Bi-LSTM layer. For example, the following sentence contains one negative word (i.e. horrible) and one negation (i.e. not) but no positive words:
"It was used to be a horrible place to eat but not any more."
A model that takes into account only the simple polar word score would associate this sentence with the negative sentiment. However, the sequence information of the phrase "not any more" dictates the positive sentiment of the sentence.
In contrast to A4, architecture A3 does not rely on the sequence information of the extracted features and allows the network to learn on its own.
We use 300-dimensional Word2Vec (Mikolov et al., 2013) word embeddings for the experiments. Each Bi-LSTM layer contains 100 neurons, while the two dense layers contain 100 and 50 neurons, respectively.

Features
As additional features, we extract the following information for each token in an instance.
- Aspect term extraction: Distributional thesaurus (DT) (Biemann and Riedl, 2013) defines the lexical expansion of a token based on similar contexts. It is usually very effective for handling unseen text. If a token in the test set never appears in the training set, it becomes a non-trivial task for the classifier to make a correct prediction. By employing the DT feature, the classifier can additionally utilize the lexical expansions of the current token for mapping to the training set, thus reducing the chance of encountering entirely unseen text. For each token, we use its top 3 DT expansions as features.
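A minimal sketch of the DT feature lookup is given below. The thesaurus entries, the function name and the 'null' padding token are hypothetical and serve only to illustrate how the top 3 expansions of each token could be attached as features; a real distributional thesaurus would be loaded from an external resource.

```python
# Hypothetical thesaurus: token -> expansions ordered by similarity.
HYPOTHETICAL_DT = {
    "pizza": ["pasta", "burger", "lasagna", "salad"],
    "waiter": ["waitress", "bartender"],
}


def dt_features(tokens, thesaurus, k=3, pad="null"):
    """Return the top-k DT expansions of every token, padded to length k."""
    features = []
    for token in tokens:
        expansions = thesaurus.get(token.lower(), [])[:k]
        features.append(expansions + [pad] * (k - len(expansions)))
    return features


features = dt_features(["The", "pizza", "waiter"], HYPOTHETICAL_DT)
```

A token that is missing from the training set but whose expansions occur there (e.g. 'lasagna' seen in training when 'pizza' was not) can then still be related to known vocabulary.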

Datasets
We evaluate our proposed approach on the benchmark datasets of the SemEval-2016 shared task on aspect-based sentiment analysis (Task 5) (Pontiki et al., 2016), which contain user reviews across multiple languages. The datasets of English, Spanish, French and Dutch consist of reviews of consumer electronics and restaurants. We also evaluate our approach on the GermEval-2017 shared task on ABSA (Wojatzki et al., 2017), which comprises reviews in the German language. The training datasets contain 2,070, 1,733, 1,711 & 19,432 reviews in Spanish, French, Dutch and German, respectively, whereas the test datasets contain 881, 696, 575 & 2,566 reviews for the respective languages. For Hindi, we employ the ABSA dataset developed by Akhtar et al. It contains a total of 4,469 aspect terms in 5,417 sentences across 12 domains. We perform 10-fold cross validation for the evaluation in this work. Table 2 lists brief statistics of the datasets for the different languages.
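The 10-fold cross validation can be sketched as follows. How the folds were actually drawn is not stated in the text, so the shuffling, the fixed seed and the function name are assumptions; only the fold count and the sentence count of the Hindi dataset come from this section.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)       # assumed: shuffled, fixed seed
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


# 5,417 sentences, as reported for the Hindi dataset.
splits = list(k_fold_indices(5417, k=10))
```

Each sentence appears in exactly one test fold, and the reported score would then be the average over the ten folds.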

Preprocessing
We extract each instance from the SemEval and the GermEval datasets, keeping only the relevant information and removing the XML tags. We use NLTK (and a shallow parser for Hindi) to tokenize each sentence of the dataset. The aspect terms can span multiple words in a sentence and hence, we use the BIO encoding scheme. In this notation, B, I and O denote the beginning, internal and outside tokens of an aspect term, respectively.
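The extraction step can be illustrated with the Python standard library. The XML layout below is a simplified assumption modelled on the SemEval-2016 release; the element and attribute names should be checked against the actual files, and whitespace splitting merely stands in for the NLTK tokenizer.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified layout; not copied from the real dataset.
SAMPLE = """<Reviews><Review rid="1"><sentences>
<sentence id="1:0"><text>Rice was good .</text>
<Opinions><Opinion target="Rice" category="FOOD#QUALITY" polarity="positive" from="0" to="4"/></Opinions>
</sentence></sentences></Review></Reviews>"""


def extract_instances(xml_string):
    """Drop the XML markup and keep text, tokens and (target, polarity) pairs."""
    root = ET.fromstring(xml_string)
    instances = []
    for sentence in root.iter("sentence"):
        text = sentence.findtext("text", default="")
        targets = [
            (opinion.get("target"), opinion.get("polarity"))
            for opinion in sentence.iter("Opinion")
            if opinion.get("target") not in (None, "NULL")  # skip implicit targets
        ]
        instances.append(
            {"text": text, "tokens": text.split(), "targets": targets}
        )
    return instances


instances = extract_instances(SAMPLE)
```

The resulting token lists and targets are the inputs from which the BIO tags and context windows are derived.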
for each language/problem pair. In the aspect extraction problem, architecture A4 yields the best F1-score for Spanish (73.0%), German (24.0%), English (64.9%) and Hindi (53.5%), whereas for French and Dutch we obtain the best F1-score with architectures A2 (67.8%) and A3 (65.7%), respectively. We observe similar trends for aspect classification, with architecture A4 performing better for Spanish (87.2% accuracy), German (87.2% F1-score), English (83.4% accuracy) and Hindi (66.9% accuracy). Similar to aspect extraction, architectures A2 and A3 report better performance for French (75.34%) and Dutch (81.9%), respectively. Among all four architectures, architecture A1 has the lowest performance across all six languages for both problems. This suggests that the hand-crafted features, when fused into the network, help the system learn better than a system trained with word embeddings only. We also perform a statistical significance test (t-test) on the obtained results and observe that the performance gain of architecture A4 is significant at the 95% confidence level for English, Spanish, German and Hindi for both problems.
Further, we compare our proposed system with the state-of-the-art systems listed in Table 4. Our proposed system shows an improvement over the existing state of the art for 9 out of 12 language/problem pairs. For aspect extraction, the system achieves an improvement of 4.5, 1.2, 8.8, 2 and 12.5 points for Spanish, French, Dutch, German and Hindi, respectively. Our system improves the score of sentiment classification for Spanish, Dutch, German and Hindi by 3.56, 4.17, 12.3 and 1 points, respectively. The improvement across the language/problem pairs suggests the generic nature of our proposed approach. A t-test also shows that the improvements of the proposed method over the state-of-the-art systems are statistically significant, with p-values < 0.05.
From Table 3, we observe that architecture A4 performs the best for four languages, i.e., Spanish, German, English and Hindi, irrespective of the problem. Similarly, architectures A2 and A3 perform best for French and Dutch, respectively. Since architecture A4 is the clear winner in 8 out of 12 language/problem pairs and also reports comparable performance in the other cases (at most 2.9 points below the best architecture reported in Table 3), we recommend it as the default choice for all the languages and problems.

Error Analysis
We perform error analysis on the predicted outputs, using automatic translations (Google) for the languages we are not proficient in. The following are a few cases where our proposed system often faces challenges.
Aspect term extraction: Aspect term extraction is quite a challenging task. The BIO notation is an effective solution for tagging an aspect term; however, it is highly skewed towards the O class, i.e., only a small percentage of tokens qualify as part of an aspect term. Despite this limitation, the BIO notation results in decent outputs with a few exceptions. In Table 5, we list a few common error patterns along with examples. Our system faces difficulties when one or more tokens of an aspect term can independently qualify as an aspect term. In the first two examples, our system misclassifies the multi-token aspect terms 'customer service' and 'atención del personal' (attention of the staff) as separate single-token aspect terms. It predicts the first token of the aspect term (i.e., 'customer' in the first example and 'atención' (attention) in the second example) as one aspect term and the last token (i.e., 'service' and 'personal' (staff)) as another aspect term. Although both tokens of the aspect term 'customer service' are identified as aspect terms, the prediction results in recall=0 and precision=0. Another error pattern involves conjunctions (e.g. 'and', 'with' etc.) in multi-token aspect terms (e.g. 'riz arborio aux truffles' (arborio rice with truffles)). In general, 'and', 'with' and other conjunctions do not qualify as part of an aspect term except inside multi-token aspect terms. However, such occurrences are not very common, and the underlying system misclassifies them as outside the aspect term, i.e., O. The error in the second example (i.e. 'atención del personal' (attention of the staff)) may also be explained by the same reason.
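The zero precision and recall in the 'customer service' case follow if aspect terms are scored by exact span match, which is what we assume in the sketch below. The function names and the example sentence are illustrative and are not taken from the evaluation scripts.

```python
def spans_from_bio(tags):
    """Convert a BIO tag sequence into a list of (start, end) spans, end exclusive."""
    spans, start = [], None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final span
        if tag != "I" and start is not None:
            spans.append((start, i))
            start = None
        if tag == "B":
            start = i
    return spans


def precision_recall_f1(gold_spans, pred_spans):
    """Exact-match precision, recall and F1 over aspect term spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Tokens: the customer service was ...
gold = spans_from_bio(["O", "B", "I", "O"])   # one term: 'customer service'
pred = spans_from_bio(["O", "B", "B", "O"])   # two terms: 'customer', 'service'
scores = precision_recall_f1(gold, pred)
```

Neither predicted span matches the gold span exactly, so `scores` is zero for precision, recall and F1 even though both tokens were recognized as aspect tokens.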
Aspect sentiment classification: For aspect sentiment classification, we observe that the two most common sources of errors across languages are the lack of polar information inside the defined context window (±5 neighbouring words) and the presence of sarcastic or metaphoric phrases in the review. We list a few error cases in Table 6. The first example is in Spanish and contains the aspect term 'calidad-precio' (quality-price). The actual sentiment towards the aspect term is positive; however, in the absence of clue words (i.e. 'restaurantes de referencia de Zaragoza' (recommended restaurants of Zaragoza)) inside the context window, our proposed system predicts its sentiment as neutral.
Predicting the sentiment of sarcastic and metaphoric text is usually challenging due to the difference between its literal meaning and its intended meaning (i.e., what is said is not meant, or vice versa). Our system also finds it non-trivial to correctly classify an aspect term in the presence of sarcastic (second example of Table 6) or metaphoric (third example) text. In the second example, the staff's unresponsive behaviour irked the writer, who had to ask for a table sarcastically. Similarly, in the third example, the writer was not amused by the quality of the lemon chicken and compared it to sticky sweet donuts as a figure of speech.

Conclusion
In this paper, we have proposed a language-agnostic deep neural network approach for solving the problems of aspect-based sentiment analysis. Our system employs a Bi-LSTM network for learning the sentence embeddings, which is assisted by a few hand-crafted features. To show its effectiveness, we evaluated the proposed approach on six languages (i.e. English, Spanish, French, Dutch, German and Hindi) and two problems (i.e. aspect term extraction and aspect sentiment classification). We also evaluated different architectures for combining sentence embeddings and hand-crafted features. Comparisons with the existing systems suggest that our proposed approach attains state-of-the-art performance for most of the language/problem pairs.

Acknowledgement
Asif Ekbal acknowledges the Young Faculty Research Fellowship (YFRF), supported by Visvesvaraya PhD scheme for Electronics and IT, Ministry of Electronics and Information Technology (MeitY), Government of India, being implemented by Digital India Corporation (formerly Media Lab Asia).