Domain Adaptation of Polarity Lexicon combining Term Frequency and Bootstrapping



Introduction
Sentiment Analysis (SA) is a discipline that combines Natural Language Processing (NLP) and data mining techniques to deal with the subjectivity in textual information. Several tasks have been studied, but perhaps the best known is Polarity Classification, which focuses on determining the semantic orientation of a document: positive, negative or neutral.
Although different approaches have been applied to the field of polarity classification, the mainstream basically consists of two major methodologies. On the one hand, the Machine Learning (ML) approach uses a collection of data to train classifiers (Pang et al., 2002). On the other hand, the approach based on Semantic Orientation (SO) does not need prior training but takes into account the orientation of words, positive or negative (Turney, 2002). In this paper we focus on semantic orientation in order to tackle one of the open issues related to polarity classification: domain adaptation.
Our main goal is to propose a method for automatically adapting a general polarity lexicon to a specific domain. Specifically, we work with the movie domain because, as several papers demonstrate, it is a very difficult domain to deal with and adapt to in SA (Turney, 2002; Taboada et al., 2009; Molina-González et al., 2015b). In addition, we focus on Spanish, since we consider that working with a language other than English is more challenging in NLP in general, and in SA in particular, due to the scarcity of resources. Different methods have been proposed for tackling the domain adaptation problem by automatically generating polarity lexicons. One of the primary studies related to SA is (Blitzer et al., 2007). They note that a particular word can carry opposing sentiments depending on the domain, so a general-purpose lexicon should be adapted to the specific domain in order to improve its effectiveness. Two main approaches to creating polarity lexicons automatically have been studied: dictionary-based and corpus-based. Dictionary-based approaches use external resources such as thesauri or dictionaries in order to enrich a set of polar terms by aggregating new subjective words (Esuli and Sebastiani, 2006; Hu and Liu, 2004; Molina-González et al., 2013). Corpus-based approaches, on the other hand, also start from a set of polar terms, but instead of using dictionaries, which are domain dependent and usually difficult to find in some languages, they try to integrate external knowledge from document collections (Hatzivassiloglou and McKeown, 1997; Turney, 2002; Molina-González et al., 2015b).
In this paper we investigate two corpus-based methods for adapting a polarity lexicon to a specific domain: domain adaptation using Term Frequency (TF), and domain adaptation using pattern matching with a BootStrapping algorithm (BS). Both methods are corpus-based and start with the same polarity lexicon, but the first one requires an annotated collection of documents, while the second one only needs a corpus in which it searches for specific patterns. Both methods clearly outperform the baseline system that uses the general polarity lexicon. Finally, we propose an approach that combines both domain adaptation methods, and the results obtained are even better than those of the systems applied individually.
Most studies usually start with a very small set of polar terms and then apply some methods to extract and append new subjective words to the original list. However, in this paper we use a very large general lexicon as our starting point. Specifically, in the two approaches presented in the paper, we have used as seed the Spanish polarity lexicon iSOL (Molina-González et al., 2013). This lexicon has been successfully applied in several studies, showing a very good performance in Spanish SA (González et al., 2015a; Cruz et al., 2014). In addition, we have carried out our experiments with the Spanish movie corpus MuchoCine (Cruz et al., 2008).
Regarding the domain adaptation approaches applied, the first one is based on the Term Frequency (TF) in an annotated corpus. This approach has already been applied in previous studies using different domains (Molina-González et al., 2015b; González et al., 2015c). However, although the results improve in all cases, we have noted that some polar terms are wrongly added and that noise is sometimes introduced into the systems. Thus, in this paper we investigate a new method that not only adds new words but also eliminates from the adapted lexicon those terms that are detected to be highly subjective. This is the second approach used in this paper, and it is based on detecting patterns in a corpus and applying a bootstrapping algorithm in order to enrich and clean the polarity lexicon. We will call this approach Bootstrapping (BS). One of the strongest points of this second method is that it does not need an annotated corpus to adapt the system. We only need a polarity lexicon and a corpus of documents. From this corpus, we extract patterns using the information in the lexicon and we apply the bootstrapping algorithm in order to append or eliminate polar terms from the original lexicon. In this paper we only look for patterns including adjectives, because according to several studies this kind of word is the best clue to subjectivity in documents (Wiegand et al., 2013).
As we will show in Section 5, the results obtained with the TF approach are very promising although the method requires an annotated corpus. On the other hand, although the results with the BS strategy also surpass the baseline system with the general opinion lexicon, the improvement is lower than we would hope. For this reason, we have decided to combine both methods in order to take advantage of them. Thus, we first apply the TF approach obtaining a new adapted polarity lexicon. This new list is used as the seed of polar terms for the BS algorithm. In this way, our method not only appends terms that are adapted to the domain but it also eliminates polar terms that can be considered highly subjective in this specific domain. The results achieved with this combined method improve on the performance of both approaches when they are applied individually.
The rest of the paper is organised as follows: The following section presents a review of the main methods of domain adaptation, focusing mainly on the corpus-based approaches and commenting on some studies that deal with Spanish. Section 3 introduces the methodology of adaptation based on term frequency and Section 4 presents domain adaptation using bootstrapping. Section 5 exhibits the resources employed and shows the results obtained. In addition, we propose the combination of both methods in order to achieve an improvement in the final system. Section 6 discusses the different results obtained and analyses the systems proposed showing their advantages and disadvantages. Finally, in Section 7 conclusions and future work are presented.

Background
In this paper we focus on Spanish domain adaptation. We follow two different corpus-based methods and then combine them. Our work is based on the papers briefly described below.

Methods using Bootstrapping (BS)
One of the first learning studies to extract linguistic patterns for determining the polarity of sentences was (Hatzivassiloglou and McKeown, 1997). They consider only adjectives and using a large corpus classify the new extracted terms as positive or negative. Turney (2002) follows the same approach but includes adverbs along with adjectives, and uses Pointwise Mutual Information and Information Retrieval to estimate the semantic orientation. Riloff et al. (2003) present a bootstrapping algorithm that learns subjective nouns from an unannotated corpus. The approach uses two high precision classifiers (HP-Subj and HP-Obj) to automatically identify subjective and objective sentences. These classifiers give very high precision but low recall. The extracted sentences are then added to the training data to learn patterns. The learned patterns are then used to automatically identify more subjective and objective sentences. The bootstrapping algorithm increases the recall of the final system while high precision is maintained.
Regarding the studies dealing with Spanish documents,  is one of the first works applying a bootstrapping algorithm to discover new subjective adjectives in a Spanish corpus, following the same approach as Hatzivassiloglou and McKeown (1997). However, they use a small set of polar seeds that they have previously shown to be domain independent . Actually, they introduce the concept of "highly subjective adjectives" as those adjectives which can be considered not only positive for one domain and negative for another ("predictable steering" in car domain vs "predictable plot" in movie domain), but which could also change their prior polarity even within the same domain ("antique car" can be positive for a classic person and negative for a modern person).
Our method using a bootstrapping algorithm is based on this idea of "highly subjective adjectives". We first identify and extract the adjectives learned by the linguistic patterns, but before being included in the original lexicon, we first check whether the adjective must be eliminated or included in a set of "highly subjective adjectives".

Methods using Term Frequency (TF)
The method based on term frequency proposed in this paper has already been applied, although the domain and corpora used are different. For example, the proposed approach in (Du et al., 2010) is based on the idea that a word must be negative (or positive) if it appears in many negative (or positive) documents, among other assumptions. The authors select three datasets of different domains, and from the relationships between them, they generate two labelled sentiment lexicons (domain independent and domain dependent) for each domain. In (Dehkharghani et al., 2012) a method is proposed for building a domain-dependent polarity classification system. The hotel and movie domains are selected by the authors. Each review is represented by a set of domain-independent features and a set of domain-dependent ones. The domain-independent features are extracted from SentiWordNet (Baccianella et al., 2010). To build the set of domain-dependent features the authors propose taking the lexicon built by Hu and Liu (2004) and choosing those positive/negative words that occur in a significant number of positive/negative reviews of the training corpus used for the experimentation.
For Spanish domain adaptation, García et al. (2012) develop a polarity classification system based on the use of a list of opinion words generated by the authors. The corpus used for the evaluation is a set of hotel reviews written in Spanish and gathered from TripAdvisor. The method consists of counting the number of positive and negative words that appear in the text. Molina-  propose a method that takes the iSOL lexicon as a base and enriches it with the most frequent words in positive/negative reviews from a Spanish corpus in the tourism domain (subset of SFU corpus 1 ). They implement an automatic method to determine the best ratio between positive and negative words in order to integrate them into the new adapted lexicon. The system is tested on a different corpus of Spanish hotel reviews composed of more than 32,000 opinions. The results obtained improve greatly on those obtained with other general-purpose lexicons. Afterward, the same approach is extended to the 8 different domains present in the Spanish SFU corpus (Molina-González et al., 2015b). In this paper we have applied the same approach as Molina- , but instead of focusing on the tourism domain we have applied the method to the movie domain. We have also carried out several experiments in order to determine the best ratio between positive and negative terms to consider polar terms.

Domain adaptation based on Bootstrapping
The bootstrapping algorithm implemented in this paper follows the approach described in ) that proposes the use of a bootstrapping method to automatically create lists of polar adjectives relevant for a domain and to detect "highly subjective adjectives" (that is, adjectives that could change their polarity even in the same domain). It is a corpus-based method that only needs a set of linguistic patterns extracted from a corpus and a seed polarity lexicon. The hypothesis of this algorithm is that there are some linguistic patterns that provide evidence of the semantic orientation of the words, and therefore this information can be iteratively used to identify the polarity of new words. The patterns selected by  are the ones corresponding in Spanish to those presented in Hatzivassiloglou and McKeown (1997), where the authors hypothesize that the conjunction "and" ("y"/"e") joins adjectives of the same orientation while the conjunction "but" ("pero"/"aunque") joins adjectives of different orientation. As seed words, they use a set of 28 positive and 7 negative adjectives that five human annotators manually labeled as domain independent . They test the effectiveness of the proposed approach over a set of 200 Spanish documents manually tagged with the polar adjectives that should be in the final polarity lexicon. Of all the adjectives labeled by the bootstrapping algorithm (67% of the total are identified), 97.6% of the positive adjectives and 71.5% of the negative adjectives are correctly tagged. The authors validate the method and obtain promising results, but they do not test the generated sentiment lexicon in the task of polarity classification; therefore, we have decided to use it in our experimentation to check how it works. The bootstrapping algorithm operates as follows. First, all the pairs of adjectives that match any of the defined patterns (adj1 "y"/"e" adj2, adj1 "pero"/"aunque" adj2) are extracted from the training corpus.
After this, the following process is iteratively repeated until there are no changes (insertion or removal) in the polarity lexicon.
For each pattern found, if any of the adjectives is in the seed polarity lexicon we proceed as follows:
• If the adjectives are joined by "y"/"e" and the polarity of one of them is unknown (that is, the adjective is not in the polarity lexicon yet), then the unknown adjective has the same semantic orientation as the other adjective and consequently it is added to the polarity lexicon with its corresponding polarity.
• If the adjectives are connected by "y"/"e" and the polarity of the two is known; if both have the same polarity all is well, but if they have opposite polarity (positive adj "y"/"e" negative adj, negative adj "y"/"e" positive adj), the two adjectives will be added to the list of highly subjective adjectives and both will be removed from the polarity lexicon.
• If the adjectives are joined by the conjunction "pero"/"aunque" and the polarity of one of them is unknown, then the unknown adjective has the opposite semantic orientation as the other adjective and consequently it is added to the polarity lexicon with its corresponding polarity.
• If the adjectives are joined by "pero"/"aunque" and the polarity of the two is known; if both have opposite polarity all is well, but if they have the same polarity (positive adj "pero"/"aunque" positive adj, negative adj "pero"/"aunque" negative adj), the two adjectives will be added to the list of highly subjective adjectives and both will be removed from the polarity lexicon.
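The four rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the function and variable names are hypothetical, pairs are assumed to arrive as (adj1, adj2, relation) triples, and the sketch assumes that an adjective already moved to the highly subjective list is never re-inserted, which guarantees that the loop terminates.

```python
from typing import Dict, Iterable, Set, Tuple

SAME = "same"          # pattern adj1 "y"/"e" adj2
OPPOSITE = "opposite"  # pattern adj1 "pero"/"aunque" adj2

Pair = Tuple[str, str, str]


def flip(polarity: str) -> str:
    """Return the opposite polarity label."""
    return "negative" if polarity == "positive" else "positive"


def bootstrap(pairs: Iterable[Pair], seed: Dict[str, str]):
    """Iterate over the extracted pairs until the lexicon stops changing."""
    pairs = list(pairs)
    lexicon = dict(seed)
    highly_subjective: Set[str] = set()
    changed = True
    while changed:
        changed = False
        for adj1, adj2, relation in pairs:
            # Assumption of this sketch: removed adjectives stay removed.
            if adj1 in highly_subjective or adj2 in highly_subjective:
                continue
            p1, p2 = lexicon.get(adj1), lexicon.get(adj2)
            if p1 is None and p2 is None:
                continue  # neither adjective is known yet
            if p1 is None or p2 is None:
                # One polarity is unknown: propagate or invert the known one.
                known = p1 if p1 is not None else p2
                unknown = adj1 if p1 is None else adj2
                lexicon[unknown] = known if relation == SAME else flip(known)
                changed = True
            else:
                consistent = (p1 == p2) if relation == SAME else (p1 != p2)
                if not consistent:
                    # Contradictory construction: remove both adjectives.
                    highly_subjective.update((adj1, adj2))
                    lexicon.pop(adj1, None)
                    lexicon.pop(adj2, None)
                    changed = True
    return lexicon, highly_subjective
```

For instance, with the seed {"bueno": positive, "limpio": positive, "sucio": negative} and the pairs ("bueno", "entretenido", same), ("entretenido", "largo", opposite) and ("limpio", "sucio", same), the sketch adds "entretenido" as positive and "largo" as negative, and moves "limpio" and "sucio" to the highly subjective list.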
We first applied this approach directly using the 35 adjectives proposed by  over the MuchoCine Corpus. However, the results were very poor and, thus, we decided to start from the iSOL lexicon as seed for the BS algorithm.
The main advantages of this approach are that it does not need a corpus previously tagged with the semantic orientation of each text, it can be applied to any domain, and it not only adds adjectives to the seed sentiment lexicon but also cleans it by removing the highly subjective adjectives. However, we also find some disadvantages. It only takes into account one part of speech, the adjective. Furthermore, it removes an adjective if it appears in a contradictory construction 2 without considering that the adjective could appear in more correct constructions than contradictory ones. For example, if a user makes a mistake and writes "clean and dirty", the algorithm will remove both adjectives from the positive and negative lists respectively. If it took into account that there are many correct constructions for the adjective "dirty" (for example, "dirty and ugly", "dirty and dusty"...), it would not remove it from the negative list. Thus, we should consider other parameters before removing a specific polar term.

Domain adaptation based on Term Frequency
In order to implement the Term Frequency approach, we have followed the same assumptions as Du et al. (2010). According to this strategy, a word must be negative (or positive) if it appears in many negative (or positive) documents. Therefore, we have implemented an automatic method to determine the groups of terms to integrate into the new adapted lexicons. The groups of words are selected using Equation 1, where f+ is the absolute frequency of a given word in positive reviews and f− is its absolute frequency in negative reviews. Thus, n is the ratio between the positive and negative frequencies of a word.
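A minimal sketch of this selection step is given below. It assumes that Equation 1 amounts to requiring f+ ≥ n·f− for a new positive word and f− ≥ n·f+ for a new negative word; the exact form of the inequality, the handling of words with zero frequency in one class, and the `min_count` parameter are assumptions of this illustration and are not taken from the paper.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def term_frequencies(reviews: Iterable[List[str]]) -> Counter:
    """Absolute frequency of each token over a collection of tokenised reviews."""
    counts: Counter = Counter()
    for tokens in reviews:
        counts.update(tokens)
    return counts


def select_polar_terms(
    positive_reviews: Iterable[List[str]],
    negative_reviews: Iterable[List[str]],
    n: float,
    min_count: int = 1,  # hypothetical guard against very rare words
) -> Tuple[Set[str], Set[str]]:
    """Return (new positive words, new negative words) for ratio n."""
    f_pos = term_frequencies(positive_reviews)
    f_neg = term_frequencies(negative_reviews)
    new_positive: Set[str] = set()
    new_negative: Set[str] = set()
    for word in set(f_pos) | set(f_neg):
        fp, fn = f_pos[word], f_neg[word]
        # Assumed reading of Equation 1: one class dominates by a factor n.
        if fp >= min_count and fp >= n * fn:
            new_positive.add(word)
        elif fn >= min_count and fn >= n * fp:
            new_negative.add(word)
    return new_positive, new_negative
```

The selected words would then be merged into the corresponding lists of the seed lexicon. Stopwords and punctuation are assumed to have been removed beforehand.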
The main advantage of this method is its simplicity and quick implementation. Nevertheless, we find several disadvantages. On the one hand, the task of finding available corpora labelled with polarity at the document level is sometimes difficult, particularly for certain domains and languages. On the other hand, the inclusion in the adapted lexicon of all types of words without any discrimination, depending only on a ratio, sometimes introduces noise that does not improve the result in polarity classification.

Experimental framework and results
We tested two corpus-based approaches and their combination for the domain adaptation of a polarity lexicon in Spanish. The list of opinion words taken as a starting point was iSOL (Molina-González et al., 2013). iSOL is a Spanish polarity lexicon generated from the automatic translation of the Bing Liu Lexicon (Hu and Liu, 2004) followed by a manual revision. It is composed of 2,509 positive and 5,626 negative words.
For the adaptation of this lexicon and for testing the TF and BS approaches we used the Spanish MuchoCine corpus (MC) (Cruz et al., 2008). This dataset consists of 3,878 movie reviews collected from the MuchoCine website. The reviews are written by web users; therefore, the sentences found in the reviews may include spelling mistakes or informal expressions and may not always be grammatically correct. The dataset contains about 2 million words, with an average of 546 words per review. The opinions in the corpus are rated on a scale from 1 to 5. A rating of 1 means that the opinion is very bad and 5 means very good. Reviews with a rating of 3 can be categorized as "neutral", which means that the user considers the movie neither bad nor good. In our experiments the neutral reviews were not taken into account; the opinions with ratings of 1 or 2 were considered negative and those with ratings of 4 or 5 positive (in total 1,351 positive and 1,274 negative reviews). 60% of these reviews (781 positive and 794 negative reviews) were employed for the domain adaptation of iSOL, and the remaining 40% (570 positive and 480 negative reviews) were used for testing the resultant lists in the task of polarity classification.
In order to apply the method based on frequency (TF), punctuation and stopwords were first removed from the documents. After this, the absolute frequency of each word in the positive and negative reviews was determined. Subsequently, several experiments were carried out to add to iSOL those words of the corpus that satisfy Equation 1, considering different ratios in order to fix the best ratio between positive and negative terms to consider polar terms.
In the case of the approach based on bootstrapping (BS), the documents were first tokenized and split into sentences, and each token was tagged with its part of speech using Freeling 3 (Carreras et al., 2004). Afterwards, the pairs of adjectives that matched any of the defined patterns were extracted using regular expressions. Finally, two experiments were performed. In the first one, the bootstrapping algorithm was applied using the opinion lexicon iSOL as seed, and in the second the opinion lexicon resulting from applying the TF method with the best ratio (eSOLMovie) was employed as seed.
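The extraction of adjective pairs can be illustrated as follows. The paper uses regular expressions over the tagged text; this sketch instead slides a three-token window over (token, tag) pairs, which is a simplification. It assumes that adjective tags start with "A", as in the EAGLES-style tags used for Spanish, and the function name and the "same"/"opposite" labels are hypothetical.

```python
from typing import Iterable, List, Tuple

SAME_CONJUNCTIONS = {"y", "e"}
OPPOSITE_CONJUNCTIONS = {"pero", "aunque"}

TaggedSentence = List[Tuple[str, str]]  # (token, part-of-speech tag)


def extract_adjective_pairs(
    sentences: Iterable[TaggedSentence],
) -> List[Tuple[str, str, str]]:
    """Find 'adj conj adj' sequences and label the relation between the adjectives."""
    pairs: List[Tuple[str, str, str]] = []
    for sentence in sentences:
        # Walk over every window of three consecutive tokens.
        for (w1, t1), (conj, _), (w2, t2) in zip(sentence, sentence[1:], sentence[2:]):
            # Assumption: adjective tags begin with "A".
            if not (t1.startswith("A") and t2.startswith("A")):
                continue
            conj = conj.lower()
            if conj in SAME_CONJUNCTIONS:
                pairs.append((w1.lower(), w2.lower(), "same"))
            elif conj in OPPOSITE_CONJUNCTIONS:
                pairs.append((w1.lower(), w2.lower(), "opposite"))
    return pairs
```

The sketch works on surface forms; matching on lemmas instead would be a natural refinement, since Spanish adjectives inflect for gender and number.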
In order to evaluate the experiments we used the traditional measures employed in text classification: precision (P), recall (R), F1 and accuracy. To calculate the polarity (p) of a review (r) with each lexicon, we take into account the total number of positive words (#positive) and the total number of negative words (#negative) within the review and compare the two counts.
As the baseline of our experimentation we took the general-purpose lexicon iSOL, in order to adapt it to the movie domain with the proposed approaches and with their combination. The polarity classification of the documents following the strategy defined previously and using iSOL yields an accuracy of 62.95%.
3 http://nlp.lsi.upc.edu/freeling/
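The counting strategy and the accuracy measure can be sketched as below. The function names are hypothetical, and the treatment of ties (a review with as many positive as negative words is labelled positive here) is an assumption of this sketch, not something stated in the text.

```python
from typing import List, Sequence, Set


def classify_review(tokens: Sequence[str], positive: Set[str], negative: Set[str]) -> str:
    """Label a review by comparing its counts of positive and negative lexicon words."""
    n_positive = sum(1 for token in tokens if token in positive)
    n_negative = sum(1 for token in tokens if token in negative)
    # Assumption: ties are resolved in favour of the positive class.
    return "positive" if n_positive >= n_negative else "negative"


def accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of reviews whose predicted label matches the gold label."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)
```

Under this scheme, swapping the general lexicon for an adapted one only changes the two word sets passed to `classify_review`; the classifier itself stays the same.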
Regarding the Term Frequency methodology, we first carried out different experiments in order to determine the best ratio to use in our final experiment combining both strategies. After testing several ratios, we determined that the best results are obtained with n=4 (Figure 1). Therefore, the lexicon obtained with this ratio (eSOLMovie_4) was taken as the seed list for the combined experiment with bootstrapping. Table 1 shows the results obtained over the MC corpus using iSOL (domain independent) and the eSOLMovie_n lexicons (adapted to the movie domain with the ratios n=3,4,5,6).
In relation to the bootstrapping strategy, 1,841 "y"/"e" patterns and 39 "pero"/"aunque" patterns were found, and the algorithm was tested using iSOL and eSOLMovie_4 as seeds. The experiment with iSOL detected 292 highly subjective adjectives, inserted 659 adjectives, removed 110 adjectives and converged in 5 iterations, achieving 63.71% accuracy in the polarity classification of the corpus. The experiment with eSOLMovie_4 detected 296 highly subjective adjectives, appended 626 adjectives, deleted 228 adjectives and converged in 4 iterations, achieving 70.19% accuracy. Table 2 shows the results achieved in the classification at the document level of the MC corpus using iSOL (domain independent) adapted to the movie domain with the BS approach, with the TF method (using the best ratio) and with the combination of both strategies (the bootstrapping algorithm over eSOLMovie_4).

Results analysis
Table 3 shows a summary of the results obtained and Table 4 presents the total number of positive and negative polar terms inserted and eliminated from the original lexicon after applying each method. As we can see, the results obtained with the TF method (eSOLMovie_n) are very promising. It improves the accuracy of the classification with respect to the general-purpose lexicon (iSOL) by 10.29%, inserting only 132 positive words and 126 negative words (Tables 3 and 4). However, the restriction of this approach is that we need a corpus previously tagged with the polarity of the documents. Moreover, this strategy only appends new words to the original lexicon and sometimes the new terms introduce noise (for example, we consider that the words "fácil" (easy) and "rápido" (fast) should not be indicators of negative opinion in the movie domain, see Table 5). Therefore, we also decided to conduct experiments with a technique based on bootstrapping that does not require an annotated corpus and that not only appends words but also removes some of them.
The application of the BS approach for the adaptation of iSOL to the movie domain also achieves an improvement in the classification (iSOL + bootstrapping) over the baseline (iSOL), although this improvement is not as great as we expected (it is only about 1.21%, see Table 3). We think that one of the reasons could be that in this approach we have only used patterns for the extraction of adjectives, while the TF method appends a word independently of its PoS. Furthermore, although removing adjectives as well as inserting them seemed promising, a look at some of the words removed (Table 5) shows adjectives that a priori should not have been eliminated (for example, the adjectives "agradable" (pleasant) and "espectacular" (spectacular) of the positive list).
Due to this fact, we decided to combine both methods in order to take advantage of them. Thereby, the list generated with the TF method (eSOLMovie_4) was used as input in the BS algorithm and the resultant list (eSOLMovie_4 + Bootstrapping) was employed for the classification of the movie reviews, achieving an improvement with respect to the baseline (iSOL) of 11.50%, in terms of accuracy (Table 3). Moreover, the results obtained with this combined method improve on the performance of both approaches (iSOL + bootstrapping and eSOLMovie_4) when applied individually.
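The improvement figures are consistent with relative gains over the baseline accuracy rather than differences in percentage points. Reading them this way is an inference from the reported accuracies (62.95% for iSOL, 63.71% for iSOL + bootstrapping, 70.19% for the combined method), and the helper below is only a hypothetical illustration of that arithmetic.

```python
def relative_improvement(baseline: float, adapted: float) -> float:
    """Relative gain of `adapted` over `baseline`, expressed as a percentage."""
    return (adapted - baseline) / baseline * 100


# Accuracies reported in the experimental section.
ISOL_BASELINE = 62.95
ISOL_BOOTSTRAPPING = 63.71
COMBINED = 70.19

print(round(relative_improvement(ISOL_BASELINE, ISOL_BOOTSTRAPPING), 2))  # 1.21
print(round(relative_improvement(ISOL_BASELINE, COMBINED), 2))            # 11.5
```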

Conclusions and future work
In this paper we have presented two corpus-based approaches for the domain adaptation of a polarity lexicon. Both methods are language independent and can be applied to any domain. One of them, the one based on term frequency (TF), needs a corpus previously tagged with the polarity of the documents, while the other, based on a bootstrapping algorithm (BS), does not require an annotated corpus: it only needs as input a set of patterns and a seed sentiment lexicon. The TF approach achieves very promising results, while the BS strategy, although it improves on the baseline system with the general-purpose lexicon, does not improve as much as we expected. For this reason we have combined both methods, in order to take advantage of the positive aspects of each of them. With this new lexicon we have achieved an improvement of 11.50% (in terms of accuracy) in the polarity classification of the movie reviews with respect to the results achieved with the general-purpose lexicon iSOL.
In future work, we plan to improve the domain adaptation approach based on bootstrapping in order to solve one of the main disadvantages detected and to strengthen its main advantage. Thus, we plan to consider a different criterion for removing an adjective from the sentiment lexicon. For example, if two adjectives appear in a contradictory construction, before removing them it could be a good idea to check the number of correct and contradictory constructions in which each of them appears, and remove them only if the number of contradictory constructions exceeds the number of correct constructions by a threshold. We also plan to check, before adding a new adjective to the sentiment lexicon, its polarity according to other important lexical resources (such as SentiWordNet, using the Multilingual Central Repository (Gonzalez-Agirre et al., 2012) to map sentiment labels to Spanish). Moreover, we will incorporate new patterns into the algorithm in order to extract polar words with other PoS (not only adjectives), such as nouns, verbs and adverbs.
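The threshold-based removal criterion could look roughly like the sketch below. It is purely illustrative: the function name is hypothetical, the pairs are assumed to be (adj1, adj2, relation) triples with relation "same" or "opposite", and "exceeds by a threshold" is read here as a difference between the two counts, which is only one possible interpretation.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def removal_candidates(
    pairs: Iterable[Tuple[str, str, str]],
    lexicon: Dict[str, str],
    threshold: int = 0,
) -> Set[str]:
    """Adjectives whose contradictory constructions outnumber the correct ones."""
    correct: Counter = Counter()
    contradictory: Counter = Counter()
    for adj1, adj2, relation in pairs:
        p1, p2 = lexicon.get(adj1), lexicon.get(adj2)
        if p1 is None or p2 is None:
            continue  # only pairs with two known polarities can be judged
        consistent = (p1 == p2) if relation == "same" else (p1 != p2)
        target = correct if consistent else contradictory
        target[adj1] += 1
        target[adj2] += 1
    # Assumed reading: remove when contradictory - correct > threshold.
    return {adj for adj in contradictory if contradictory[adj] - correct[adj] > threshold}
```

With the "clean and dirty" example discussed earlier, "dirty" would be kept as long as it also appears in enough correct constructions such as "dirty and ugly" or "dirty and dusty".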