SHELLFBK: An Information Retrieval-based System For Multi-Domain Sentiment Analysis

This paper describes the SHELLFBK system that participated in SemEval 2015 Tasks 9, 10, and 11. Our system takes a supervised approach that builds on techniques from information retrieval. The algorithm populates an inverted index with pseudo-documents that encode dependency parse relationships extracted from the sentences in the training set. Each record stored in the index is annotated with the polarity and domain of the sentence it represents. When the polarity or domain of a new sentence has to be computed, the new sentence is converted into a query that is used to retrieve the most similar sentences from the training set. The retrieved instances are scored for relevance to the query, and the most relevant training instance is used to assign a polarity and domain label to the new sentence. While the results on well-formed sentences are encouraging, the performance obtained on short texts like tweets demonstrates that more work is needed in this area.


Introduction
Sentiment analysis is a natural language processing task whose aim is to classify documents according to the opinion (polarity) they express on a given subject (Pang et al., 2002). Generally speaking, sentiment analysis aims at determining the attitude of a speaker or a writer with respect to a topic, or the overall tonality of a document. This task has attracted considerable interest due to its wide range of applications. In recent years, the exponential growth of the Web as a medium for exchanging public opinions about events, facts, products, etc., has led to an extensive usage of sentiment analysis approaches, especially for marketing purposes.
Formalizing the sentiment analysis problem, a "sentiment" or "opinion" has been defined by (Liu and Zhang, 2012) as a quintuple (o_j, f_jk, so_ijkl, h_i, t_l), where o_j is a target object, f_jk is a feature of the object o_j, and so_ijkl is the sentiment value of the opinion of the opinion holder h_i on feature f_jk of object o_j at time t_l. The value of so_ijkl can be positive (denoting a state of happiness, bliss, or satisfaction), negative (denoting a state of sorrow, dejection, or disappointment), neutral (when it is not possible to denote any particular sentiment), or a more granular rating. The term h_i encodes the opinion holder, and t_l is the time when the opinion is expressed. Such an analysis may be document-based, where a positive, negative, or neutral sentiment is assigned to the entire document content; or it may be sentence-based, where individual sentences are analyzed separately and classified according to the different polarity values. In the latter case, it is often desirable to find, with high precision, the entity attributes towards which the detected sentiment is directed.
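The quintuple above can be sketched as a simple data structure. This is an illustrative sketch only; the class and field names are ours, not part of the original formalization.

```python
from dataclasses import dataclass

@dataclass
class Opinion:
    target: str     # o_j: the target object (e.g., a product)
    feature: str    # f_jk: the feature of o_j the opinion refers to
    sentiment: str  # so_ijkl: "positive", "negative", "neutral", or a rating
    holder: str     # h_i: the opinion holder
    time: str       # t_l: the time the opinion was expressed

# Example: a positive opinion on the dimensions of a decoder.
op = Opinion("decoder", "dimensions", "positive", "reviewer", "2015-06-01")
```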
In the classic sentiment analysis problem, the polarity of each term within the document is computed independently of the document's domain. However, conditioning term polarity on the domain has been found to improve performance (Blitzer et al., 2007). We illustrate the intuition behind domain-specific term polarity. Let us consider the following example concerning the adjective "small": 1. The sideboard is small and it is not able to contain a lot of stuff.
2. The small dimensions of this decoder allow to move it easily.
In the first sentence, set in the Furnishings domain, the polarity of the adjective "small" is clearly "negative" because it highlights an issue of the described item. On the other hand, in the second sentence, set in the Electronics domain, the polarity of the same adjective may be considered "positive". Unlike the approaches already discussed in the literature (and presented in Section 2), we address the multi-domain sentiment analysis problem by applying Information Retrieval (IR) techniques to represent information about the linguistic structure of sentences, taking into account both their polarity and their domain.
The rest of the paper is structured as follows. Section 2 presents a survey of work on sentiment analysis. Section 3 provides a description of the SHELLFBK system, describing how information is stored during the training phase and exploited during the test phase. Section 4 reports the system evaluation performed on Tasks 9, 10, and 11 of SemEval 2015 and, finally, Section 5 concludes the paper.

Related Work
The topic of sentiment analysis has been studied extensively in the literature (Pang and Lee, 2008; Liu and Zhang, 2012), where several techniques have been proposed and validated.
Machine learning techniques are the most common approaches used for addressing this problem, given that any existing supervised method can be applied to sentiment classification. For instance, in (Pang et al., 2002) and (Pang and Lee, 2004), the authors compared the performance of Naive Bayes, Maximum Entropy, and Support Vector Machines for sentiment analysis on different feature sets: unigrams only, bigrams, a combination of both, the incorporation of part-of-speech and position information, or adjectives only. Moreover, besides the use of standard machine learning methods, researchers have also proposed several custom techniques specifically for sentiment classification, such as an adapted score function based on the evaluation of positive or negative words in product reviews (Dave et al., 2003), as well as weighting schemata for enhancing classification accuracy (Paltoglou and Thelwall, 2010).
An obstacle to research in this direction is the need for labeled training data, whose preparation is a time-consuming activity. Therefore, in order to reduce the labeling effort, opinion words have been used in training procedures. In (Tan et al., 2008) and (Qiu et al., 2009b), the authors used opinion words to label portions of informative examples for training the classifiers. Opinion words have also been exploited for improving the accuracy of sentiment classification, as presented in (Melville et al., 2009), where a framework incorporating lexical knowledge into supervised learning was proposed. Opinion words have also been used in unsupervised learning approaches, such as the ones presented in (Taboada et al., 2011) and (Turney, 2002).
Another research direction concerns the exploitation of discourse-analysis techniques. (Somasundaran, 2010) and (Asher et al., 2008) discuss some discourse-based supervised and unsupervised approaches for opinion analysis; while in (Wang and Zhou, 2010), the authors present an approach to identify discourse relations.
The approaches presented above are applied at the document level, i.e., the polarity value is assigned to the entire document content. However, to improve the accuracy of the sentiment classification, a more fine-grained analysis of the text, i.e., the sentiment classification of single sentences, has to be performed. In the case of sentence-level sentiment classification, two different subtasks have to be addressed: (i) determining whether the sentence is subjective or objective, and (ii) if the sentence is subjective, determining whether the opinion it expresses is positive, negative, or neutral. The task of classifying a sentence as subjective or objective, called "subjectivity classification", has been widely discussed in the literature (Riloff et al., 2006; Wilson et al., 2006; Yu and Hatzivassiloglou, 2003). Once subjective sentences are identified, the same methods as for sentiment classification may be applied. For example, in (Hatzivassiloglou and Wiebe, 2000) the authors consider gradable adjectives for sentiment spotting, while in (Kim and Hovy, 2007) and (Kim et al., 2006) the authors built models to identify some specific types of opinions.
The growth of product reviews provided fertile ground for using sentiment analysis techniques in marketing activities. However, detecting the different opinions concerning the same product expressed in the same review became a challenging problem. Such a task has been addressed by introducing "aspect" extraction approaches that are able to extract, from each sentence, the aspect the opinion refers to. In the literature, many approaches have been proposed: conditional random fields (CRF) (Jakob and Gurevych, 2010; Lafferty et al., 2001), hidden Markov models (HMM) (Freitag and McCallum, 2000), sequential rule mining (Liu et al., 2005), dependency tree kernels (Wu et al., 2009), and clustering (Su et al., 2008). In (Qiu et al., 2009a; Qiu et al., 2011), a method was proposed to extract both opinion words and aspects simultaneously by exploiting syntactic relations between opinion words and aspects.
Particular attention should also be given to the application of sentiment analysis in social networks. More and more often, people use social networks to express their mood concerning their latest purchase or, in general, new products. The social network environment opened up new challenges due to the different ways people express their opinions, as described by (Barbosa and Feng, 2010) and (Bermingham and Smeaton, 2010), who mention "noisy data" as one of the biggest hurdles in analyzing social network texts.
One of the first studies on sentiment analysis on micro-blogging websites has been discussed in (Go et al., 2009), where the authors present a distant supervision-based approach for sentiment classification.
At the same time, the social dimension of the Web opens up the opportunity to combine computer science and social sciences to better recognize, interpret, and process the opinions and sentiments expressed over it. Such a multi-disciplinary approach has been called sentic computing (Cambria and Hussain, 2012b). Application domains where sentic computing has already shown its potential are the cognitive-inspired classification of images (Cambria and Hussain, 2012a), of texts in natural language, and of handwritten text (Wang et al., 2013).
Finally, an interesting recent research direction is domain adaptation, as it has been shown that sentiment classification is highly sensitive to the domain from which the training data is extracted. A classifier trained using opinionated documents from one domain often performs poorly when it is applied or tested on opinionated documents from another domain, as we demonstrated through the example presented in Section 1. The reason is that words and even language constructs used in different domains for expressing opinions can be quite different. To make matters worse, the same word may have positive connotations in one domain but negative connotations in another; therefore, domain adaptation is needed. In the literature, different approaches to multi-domain sentiment analysis have been proposed. Briefly, two main categories may be identified: (i) the transfer of learned classifiers across different domains (Yang et al., 2006; Blitzer et al., 2007; Pan et al., 2010; Bollegala et al., 2013; Xia et al., 2013; Yoshida et al., 2011), and (ii) the propagation of labels through graph structures (Ponomareva and Thelwall, 2013; Tsai et al., 2013; Tai and Kao, 2013; Huang et al., 2014). Independently of the kind of approach, works using concepts rather than terms for representing different sentiments have been proposed.

The SHELLFBK System
The proposed system implements an IR approach for inferring both the polarity of a sentence and, if requested, the domain to which the sentence belongs. The rationale behind this approach is that, by using indexes, the computation of the Retrieval Status Value (RSV) (da Costa Pereira et al., 2012) of a term or expression automatically takes into account which elements are more significant in each index with respect to those that are not important for the index content. In this section, we present the steps we carried out to implement our IR-based sentiment and theme classification system.

Indexes Construction
The proposed approach, unlike a classic IR system, does not use a single index containing all information; instead, a set of indexes is created in order to facilitate the identification of the correct polarity and domain of a sentence during the validation phase. In particular, we built the following set of indexes:
• Polarity Indexes: the positive, negative, and neutral sentences of the training set have been indexed separately.
• Domain Indexes: a different index has been built for each domain identified in the training set. This way, it is possible to store information about which terms, or expressions, are relevant for each domain.
• Mixed Indexes: given the multi-domain nature of the system, this further set of indexes provides, for each domain, information about the correlation between the domain and the polarities. This way, we are able to know whether the same term, or expression, has the same polarity in different domains.
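The three index families above can be sketched as follows. This is a minimal illustrative layout in which each training sentence is reduced to a record of field-content pairs; the actual storage backend (an inverted index) is abstracted away, and the names are ours.

```python
from collections import defaultdict

# One record list per polarity, per domain, and per (domain, polarity) pair.
# A record is the set of field-content pairs extracted from one sentence.
polarity_indexes = defaultdict(list)
domain_indexes = defaultdict(list)
mixed_indexes = defaultdict(list)

def index_sentence(record, polarity, domain):
    """Store a training sentence's record in the three index families."""
    polarity_indexes[polarity].append(record)
    domain_indexes[domain].append(record)
    mixed_indexes[(domain, polarity)].append(record)

# Index a (truncated) record for the Task 9 example sentence.
index_sentence({("G", "reflect"), ("D", "happiness")},
               polarity="positive", domain="outdoor activity")
```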
For each sentence of the training set, we exploited the Stanford NLP Library to extract the dependencies between the terms. These dependencies are then used as input for the indexing procedure.
As an example, let us consider the following sentence extracted from the training set of Task 9: "I came here to reflect my happiness by fishing." This sentence has a positive polarity and belongs to the "outdoor activity" domain. By applying the Stanford parser, the following dependencies are extracted:

nsubj(came-2, I-1)
nsubj(reflect-5, I-1)
root(ROOT-0, came-2)
advmod(came-2, here-3)
aux(reflect-5, to-4)
xcomp(came-2, reflect-5)
poss(happiness-7, my-6)
dobj(reflect-5, happiness-7)
prep_by(reflect-5, fishing-9)

Each dependency is composed of three elements: the name of the "relation" (R), the "governor" (G), which is the first term of the dependency, and the "dependent" (D), which is the second one. From each dependency, we extract the "field - content" structure shown in Table 1, using the dependency "dobj(reflect-5, happiness-7)" as an example. Such a structure is then given as input to the index.

Field Name   Content
RGD          "dobj-reflect-happiness"
RDG          "dobj-happiness-reflect"
GD           "reflect-happiness"
DG           "happiness-reflect"
G            "reflect"
D            "happiness"

Table 1: Field-content structure extracted from the dependency "dobj(reflect-5, happiness-7)".

The structure shown in Table 1 is created for each dependency extracted from the sentence, and the aggregation of all structures is stored as the final record in the index.
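The mapping from a parsed dependency to the six fields of Table 1 can be sketched as below; the function name is ours, and the word-position indices produced by the parser are assumed to be stripped beforehand.

```python
def dependency_fields(relation, governor, dependent):
    """Build the six field-content pairs of Table 1 for one dependency."""
    return {
        "RGD": f"{relation}-{governor}-{dependent}",
        "RDG": f"{relation}-{dependent}-{governor}",
        "GD":  f"{governor}-{dependent}",
        "DG":  f"{dependent}-{governor}",
        "G":   governor,
        "D":   dependent,
    }

# The example of Table 1: dobj(reflect-5, happiness-7), indices stripped.
fields = dependency_fields("dobj", "reflect", "happiness")
```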

Polarity and Domain Computation
Once the indexes are built, both the polarity and the domain of each sentence that needs to be evaluated are computed by performing a set of queries on the indexes. In our approach, we implemented a variation of the classic IR scoring formula. In the classic TF-IDF IR model (van Rijsbergen, 1979), the inverse document frequency value is used to identify the most significant documents with respect to a particular query. This value is useful when we want to identify the uniqueness of a document, with respect to a term contained in a query, among the other documents stored in the index. In our case, the scenario is different: if a term, or expression, occurs often in an index, this aspect has to be emphasized instead of being discounted. Therefore, in our scoring formula we consider, as the final score of a term or an expression, the document frequency (DF) value (i.e., the inverse of the IDF). This way, we are able to infer whether a particular term or expression is significant for a given polarity value or domain.
The queries are built with the same procedure used for creating the records stored in the indexes. For each sentence to evaluate, a set of queries, one for each dependency extracted from the sentence, is performed on the indexes, and the results are aggregated to infer both the polarity and the domain of the sentence.
As an example of how the system works, let us consider the following sentence: "I feel good and I feel healthy." For simplicity, we only consider the following two extracted dependencies:

acomp(feel-2, good-3)
acomp(feel-6, healthy-7)

From these two dependencies, we generate the following two queries:

Q1: RGD:"acomp-feel-good" OR RDG:"acomp-good-feel" OR GD:"feel-good" OR DG:"good-feel" OR G:"feel" OR D:"good"
Q2: RGD:"acomp-feel-healthy" OR RDG:"acomp-healthy-feel" OR GD:"feel-healthy" OR DG:"healthy-feel" OR G:"feel" OR D:"healthy"

To compute the polarity of the sentence, the queries are performed on the three indexes containing polarized records: positive (POS), negative (NEG), and neutral (NEU). From the computed ranks, we extract only the DF associated with each field F contained in the query:

score_I(F) = DF_I(F)

where DF_I(F) is the document frequency of the content of field F in index I. As a direct consequence, for each index I, the value representing the RSV of a sentence S is:

RSV_I(S) = Σ_{F ∈ Q(S)} DF_I(F)

where Q(S) is the set of fields contained in the queries generated from S. Finally, the polarity of the sentence S is inferred by considering the maximum RSV computed over the three indexes:

polarity(S) = argmax_{I ∈ {POS, NEG, NEU}} RSV_I(S)

In the case of domain assignment, given a set D of k domains, the domain is computed analogously over the k domain indexes:

domain(S) = argmax_{I ∈ {I_1, ..., I_k}} RSV_I(S)
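The DF-based scoring and the final argmax can be sketched as follows. This is an illustrative re-implementation under the record layout assumed earlier (each index is a list of records, each record a set of field-content pairs), not the system's actual index implementation.

```python
def df(index, field, value):
    """Document frequency: how many records in the index contain the pair."""
    return sum(1 for record in index if (field, value) in record)

def rsv(index, queries):
    """RSV of a sentence: sum of DFs over all field-content pairs of all queries."""
    return sum(df(index, f, v) for q in queries for f, v in q.items())

def classify(indexes, queries):
    """Return the label (polarity or domain) of the index with the highest RSV."""
    return max(indexes, key=lambda label: rsv(indexes[label], queries))

# Toy example: "feel good" matches two positive records and one negative field.
indexes = {
    "positive": [{("G", "feel"), ("D", "good")}, {("GD", "feel-good")}],
    "negative": [{("G", "feel"), ("D", "bad")}],
    "neutral":  [],
}
queries = [{"GD": "feel-good", "G": "feel", "D": "good"}]
label = classify(indexes, queries)  # → "positive"
```

The same `classify` call, run against the domain indexes instead of the polarity indexes, performs the domain assignment.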

Results
The SHELLFBK system participated in three SemEval 2015 tasks: 9, 10, and 11. All three tasks concern sentiment analysis, with the following differences:
• Task 9 (Russo et al., 2015): this task is based on a dataset of events annotated as instantiations of pleasant and unpleasant events previously collected in psychological research as those on which human judgments converge (Lewinsohn and Amenson, 1978; MacPhillamy and Lewinsohn, 1982). Task 9 concerns the classification of events as pleasant or unpleasant for a person writing in the first person. This task was organized around two subtasks: (A) identification of the polarity value associated with an event instance, and (B) identification of both the event instantiations and the associated polarity values.
The SHELLFBK system has been tested on both subtasks.
• Task 10 (Rosenthal et al., 2015): this task aims to identify sentiment polarities in short text messages from the Twitter microblog. It contains five subtasks: (A) expression-level, (B) message-level, (C) topic-related, (D) trend, and (E) a task on the prior polarity of terms. The SHELLFBK system has been tested only on subtask (B).
• Task 11 (Ghosh et al., 2015): this task consists of the classification of tweets containing irony and metaphor. Given a set of tweets rich in metaphor and irony, the goal is to determine whether the user has expressed a positive, negative, or neutral sentiment in each, and the degree to which this sentiment has been communicated. Unlike the other tasks, here the polarity is expressed on a fine-grained scale in the interval [-5, 5].
In the following subsections, we briefly report the performance obtained on each task.

Task 9
Table 2 reports the results obtained in Task 9. This task consisted of the identification of the polarity of a sentence written in the first person (subtask A) and the identification of both the polarity and the domain of the sentence (subtask B). Precision, recall, and F-measure have been computed. As expected, the accuracy obtained on the sole prediction of the sentence polarity is higher than the one obtained on the subtask combining the inference of both the domain and the polarity. Unfortunately, the recall values obtained on both subtasks are quite low, especially for subtask B.

Task 10
The performance obtained by the SHELLFBK system on Task 10 is reported in Table 3. For this task, the SHELLFBK system has been tested only on the message-level polarity subtask (B). Observing both the overall F-measure and the values obtained on the different portions of the dataset, the performance of the system is too low to consider it a reliable solution in contexts where short texts are taken into account.

Task 11
The results of the proposed system on Task 11 are shown in Table 4. In this task, due to the fine-grained nature of the polarity predictions, the cosine similarity and the mean squared error (MSE) with respect to the gold standard have been computed. The first result line reports the values obtained on the four figurative categories, while the second one reports the overall results. The results obtained for the "Sarcasm" and "Irony" categories are acceptable, while for the "Metaphor" and "Other" categories, both the cosine similarity and the MSE are significantly worse than for the first two. These results, together with the ones obtained on Task 10, confirm that the analysis of short texts is the first issue to address in order to improve the general quality of the system.

Conclusion
In this paper, we described the SHELLFBK system that participated in SemEval 2015 Tasks 9, 10, and 11. Our system makes use of IR techniques to classify sentences by polarity, by domain, and by the joint prediction of polarity and domain, effectively providing domain-specific sentiment analysis. The results demonstrated that, while the system obtained good performance on well-formed sentences, the method performs less well on short texts such as tweets. Therefore, future work will focus on improving the system in this direction.
In future work, we intend to explore the integration of sentiment knowledge bases (Dragoni et al., 2014) in order to move toward a more cognitive approach.