Domain-Specific Sentiment Lexicons Induced from Labeled Documents

Sentiment analysis is an area of substantial relevance both in industry and in academia, including for instance in social studies. Although supervised learning algorithms have advanced considerably in recent years, in many settings it remains more practical to apply an unsupervised technique. Such techniques are often based on sentiment lexicons. However, existing sentiment lexicons reflect an abstract notion of polarity and do not do justice to the substantial differences in word polarity between domains. In this work, we draw on a collection of domain-specific data to induce a set of 24 domain-specific sentiment lexicons. We rely on linear models to induce initial word intensity scores, and then train deep models based on word vector representations to overcome the scarcity of the original seed data. Our analysis shows substantial differences between domains, which make domain-specific sentiment lexicons a promising form of lexical resource in downstream tasks, and the predicted lexicons indeed perform effectively on tasks such as review classification and cross-lingual word sentiment prediction.


Introduction
Sentiment analysis is among the most prominent forms of natural language processing, with applications such as social media analytics (Rosenthal et al., 2017; Wang et al., 2019; Shoeb et al., 2019), marketing and customer support (Gamon, 2004), as well as recommendation (Yang et al., 2013). Apart from machine learning-driven systems (Pang et al., 2002; Socher et al., 2013; Kalchbrenner et al., 2014, inter alia), which require supervision using labeled training data, there are also lexical resource-driven systems that exploit sentiment lexicons and can be run out-of-the-box without the need for any labeled training data. Well-known sentiment lexicons include the Hu and Liu (2004) Opinion Lexicon, SentiWordNet (Baccianella et al., 2010), LIWC (Pennebaker et al., 2001), and VADER (Hutto and Gilbert, 2014). There are numerous techniques for lexicon-driven sentiment analysis (Taboada et al., 2011), with SentiStrength (Thelwall et al., 2010) being an example of a more modern lexicon-driven sentiment analysis system. Sentiment lexicons can also be used to bootstrap domain-specific supervised sentiment analysis models (Mudinas et al., 2018).
A sentiment lexicon is a resource that, for a given word (form) $w$, provides an annotation label $l_w$ describing its overall sentiment polarity. Some lexicons merely provide labels in {positive, negative} or {positive, neutral, negative}. Others offer more informative intensity scores to account for the fact that some words are more negative or positive than others. For example, an emphatic word such as spectacular is generally considered stronger than a simple good (de Melo and Bansal, 2013). Such scores could be in the range [−1, 1], with −1 denoting the most negative sentiment polarity, whereas +1 is the most positive score.
In this paper, we consider two perennial problems with sentiment lexicons: (i) Sentiment lexicons are based on an abstract domain-independent and context-independent notion of sentiment polarity. In reality, the polarity of a word depends substantially on what one is talking about and how the word is used. For example, when talking about music, the word hot tends to be positive. When talking about a laptop, in contrast, a remark that it often becomes hot is more likely to be negative.
(ii) Sentiment lexicons are typically manually created, and thus have limited coverage. The widely used Hu and Liu (2004) Opinion Lexicon, for instance, consists of around 6,800 words. While this is by no means a small number, the lexicon is still likely to miss important signals.
To mitigate these shortcomings, we induce domain-specific sentiment lexicons using an automated data-driven approach. In our experiments, we consider a corpus of reviews from 24 different domains and first induce seed lexicons using linear predictors. Subsequently, we extend their coverage based on large-scale word vector representations with a deep neural regression model. While our lexicons do not resolve the issues of context and polysemy (these are perhaps best addressed within a full-fledged machine learning architecture), many differences in sentiment polarity for a word stem from divergent uses across different domains. Our experiments confirm that there are substantial differences between domains and that the predicted lexicons prove useful in review classification and cross-lingual word-level sentiment prediction.

Method
Our approach proceeds in two steps. First, we rely on labeled documents for a set of different domains to induce seed data for each of the domains using simple linear predictors.
This seed data already accounts for the differences between domains. However, after the first step, the coverage of the resulting seed data is limited to words occurring in the labeled corpora, which may be small. Hence, in a second step, we rely on deep neural models, exploiting vector representations of words to learn sentiment intensity scores for a much larger vocabulary.

Seed Data Induction
Our approach for seed data induction is simple. Given $n$ domain-specific document sets $D_i \subseteq \mathcal{X} \times \mathcal{Y}$ ($i = 1, \dots, n$) labeled with sentiment polarity labels in $\mathcal{Y} = \{\text{positive}, \text{negative}\}$, we learn $n$ corresponding linear binary classification models using bag-of-words features. Then, each word present in the vocabulary is assigned a series of domain-specific sentiment polarity scores, by consulting the linear coefficients for the respective word across the $n$ linear models.
Specifically, for each $D_i$, we define the set of features as $F_i = V_i \cup \{\bar{w}_j \mid w_j \in V_i\}$, where $V_i$ is the term vocabulary of $D_i$ and $\bar{w}_j$ denotes a negated version of word $w_j$. In our experiments, we lower-case all terms and simply treat occurrences of "not $w_j$" in the text as negated features, while all other word occurrences are mapped to unnegated features. Of course, one could also invoke much more sophisticated negation detection methods.
Thus, for each $D_i$, we can map the documents $x_j$ in $D_i$ to term frequency-based document vectors $\mathbf{x}_j$ in the feature space defined by $F_i$. Along with the labels $y_j \in \mathcal{Y}$ that are given in each $D_i$, we thus obtain $n$ different labeled feature vector sets $\tilde{D}_i = \{(\mathbf{x}_j, y_j) \mid (x_j, y_j) \in D_i\}$. These are invoked to train $n$ different linear models $f_i(\mathbf{x}) = \mathbf{w}_i^{\top} \mathbf{x} + b_i$. Subsequently, for any word $w_j \in V_i$, we consider its particular score in domain $i$ to be $w_{i,j}$, i.e., the linear coefficient for that word in the weight vector $\mathbf{w}_i$ obtained for the trained model $f_i$. We disregard the negated features, as their frequency tends to be too low to provide a reliable complementary signal. Rather, the main purpose of the negated features is to eliminate noise that might otherwise affect the primary word features.
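The seed induction procedure can be illustrated with a minimal sketch. It is not the original implementation: the `NOT_` feature prefix, the regular-expression tokenizer, the toy documents, and the plain subgradient-descent hinge-loss trainer (a simple stand-in for an off-the-shelf linear SVM) are all assumptions made for illustration.

```python
import re
from collections import Counter

def featurize(text):
    """Lower-case the text and map each 'not w' pair to one negated feature."""
    tokens = re.findall(r"[a-z']+", text.lower())
    feats = Counter()
    i = 0
    while i < len(tokens):
        if tokens[i] == "not" and i + 1 < len(tokens):
            feats["NOT_" + tokens[i + 1]] += 1  # negated feature (hypothetical naming)
            i += 2
        else:
            feats[tokens[i]] += 1               # unnegated term-frequency feature
            i += 1
    return feats

def train_linear_model(docs, labels, epochs=200, lr=0.1, reg=0.01):
    """Fit weights and bias by subgradient descent on the L2-regularised hinge loss.

    labels must be +1 (positive) or -1 (negative).
    """
    w, b = {}, 0.0
    data = [(featurize(d), y) for d, y in zip(docs, labels)]
    for _ in range(epochs):
        for x, y in data:
            margin = y * (sum(w.get(f, 0.0) * c for f, c in x.items()) + b)
            for f in w:                          # weight decay from the L2 penalty
                w[f] *= 1.0 - lr * reg
            if margin < 1.0:                     # example violates the margin
                for f, c in x.items():
                    w[f] = w.get(f, 0.0) + lr * y * c
                b += lr * y
    return w, b

def seed_lexicon(weights):
    """Keep the coefficients of unnegated word features only."""
    return {f: s for f, s in weights.items() if not f.startswith("NOT_")}

# Toy single-domain corpus (invented for illustration).
docs = [
    "great sound and great battery",
    "not good, terrible battery",
    "good value",
    "terrible screen, not great",
]
labels = [1, -1, 1, -1]

weights, bias = train_linear_model(docs, labels)
lexicon = seed_lexicon(weights)
```

Running one such model per domain yields one coefficient per word and domain. Because tokens are lower-cased before featurization, the upper-case `NOT_` prefix cannot collide with an ordinary word feature.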

Neural Vector-Based Expansion
The use of supervised learning based on domain-specific datasets $D_i$ to induce the seed data has two notable drawbacks: (i) The coverage of words for some domains $i$ may be low, as it is limited to words in the respective labeled training set vocabulary $V_i$.
(ii) The reliability of induced seed scores may be low if a word was infrequent in the respective domain-specific labeled corpus $D_i$.
Machine learning based on large-scale distributional semantics as reflected in word vector representations can allow us to overcome the above shortcomings and enable the sentiment scoring of millions of words. Specifically, for each domain $i$, we train a model $\phi_i(\mathbf{v}_w) \in \mathbb{R}$ to predict a real-valued domain-specific sentiment polarity score for a word $w$ based on its generic vector representation $\mathbf{v}_w$ as input. Word vectors trained on large amounts of data (Mikolov et al., 2013; Pennington et al., 2014) capture important aspects of lexical semantics. Although they are typically trained based on distributional word co-occurrence information, they have also been found to reveal sentiment signals (Rothe et al., 2016).
As the machine learning component, we consider deep neural regression networks as our prediction models $\phi_i(\mathbf{v}_w)$. The architecture is described in Table 1. In particular, we incorporate several hidden layers and add batch normalization and dropout for regularization. Additionally, we found that initializing the output layer of our model to scale the softmax scores to the sentiment score range observed in the training data proves beneficial. Further training details are given in Section 3.2.
To train these models, we rely on the automatically induced seed data from Section 2.1 as training data for each domain. However, we need to account for the second observation above, i.e., the fact that the reliability of induced seed scores may be low if a word was observed only a few times in the domain-specific corpus $D_i$. For such words, the predictors $f_i(\mathbf{x})$ (Section 2.1) may not have received sufficient signal about their polarity, whereas sentiment scores for words with sufficiently high frequency are expected to be more accurate. To address this, for a given domain $i$, the corresponding training data is defined as $T_i = \{(\mathbf{v}_{w_j}, w_{i,j}) \mid w_j \in V_i, \sum_{(x, y) \in D_i} f(x, w_j) \geq f_{\min}\}$, where $f(x, w_j)$ denotes the term frequency of word $w_j$ in document $x$ and $f_{\min}$ is a predefined minimal training corpus frequency threshold. Thus, for each domain $i$, $T_i$ serves as training data to train a deep neural regression model $\phi_i(\mathbf{v}_w)$ to predict a word $w$'s domain-specific sentiment polarity in that domain, based on $w$'s word vector $\mathbf{v}_w$.
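The frequency filter that selects the training pairs can be sketched as follows. The function names, the toy corpus, and the two-dimensional vectors are hypothetical. Dropping words without a word vector is likewise an assumption of this sketch, although a model that takes vectors as input cannot train on such words.

```python
from collections import Counter

def corpus_frequencies(tokenized_docs):
    """Sum the term frequencies f(x, w) over all documents x of one domain."""
    freq = Counter()
    for tokens in tokenized_docs:
        freq.update(tokens)
    return freq

def build_training_set(seed_scores, freq, vectors, f_min):
    """Return {word: (vector, seed score)} for words seen at least f_min times.

    Words that lack a word vector are skipped (assumption of this sketch).
    """
    return {
        w: (vectors[w], score)
        for w, score in seed_scores.items()
        if freq[w] >= f_min and w in vectors
    }

# Toy domain corpus, seed scores, and 2-d vectors (all invented).
domain_docs = [["hot", "laptop", "hot"], ["hot", "fan"], ["laptop"]]
seed_scores = {"hot": -0.6, "laptop": 0.1, "fan": -0.2}
vectors = {"hot": [0.2, -0.7], "laptop": [0.5, 0.1], "fan": [0.3, -0.2]}

freq = corpus_frequencies(domain_docs)
training_set = build_training_set(seed_scores, freq, vectors, f_min=2)
```

With `f_min=2`, the word `fan` (one occurrence) is excluded, while `hot` and `laptop` are retained as regression targets.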

Results
In the following, we report on a series of experimental results to assess the merits of our proposal. In Section 3.1, we induce seed data based on a large-scale review data set. In Section 3.2, we then proceed with our domain-specific neural expansion approach. We first evaluate it on human-labeled data, and subsequently apply it to the complete vocabulary to induce large-scale domain-specific lexicons with high coverage. Finally, in Section 3.3, we evaluate the effectiveness of these induced domain-specific lexicons on review classification and cross-lingual word-level sentiment prediction.

Seed Data Induction Experiments
As our input corpus, we considered a collection of 142.8 million English language reviews from Amazon.com for the time period spanning May 1996 to July 2014, which has been made publicly available online. The reviews are categorized with respect to an inventory of 24 different classes of products, as listed in Table 2. The ratings are given on a 5-point scale. We regarded reviews with a rating < 3 as negative, while those with a rating > 3 were deemed positive. Three-star reviews were considered neutral and disregarded for seed model training.
We then followed the approach from Section 2.1 by training 24 linear support vector machine models for binary classification, and extracting the resulting linear coefficients for word features as seed data for those words. The coverage of the resulting data is given in the "Seed (All)" and "Seed (Non-neutral)" columns of Table 2. The non-neutral counts refer to words for which the absolute score is above 0.2, i.e., negative scores < −0.2 as well as positive ones > 0.2. We observe that the large corpus gives us orders of magnitude better coverage than existing hand-crafted sentiment lexicons. Still, the coverage differs substantially by domain, and for some domains only a limited number of words obtain high-magnitude scores.
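The labeling rule and the non-neutral filter described in this section are simple enough to state as code. The following sketch uses hypothetical function names and toy scores.

```python
def rating_to_label(stars):
    """Map a 1-5 star rating to a polarity label; three-star reviews are dropped."""
    if stars < 3:
        return "negative"
    if stars > 3:
        return "positive"
    return None  # neutral, not used for seed model training

def non_neutral(lexicon, threshold=0.2):
    """Keep only words whose absolute score exceeds the threshold."""
    return {w: s for w, s in lexicon.items() if abs(s) > threshold}

# Toy seed scores (invented for illustration).
toy_scores = {"great": 0.9, "the": 0.01, "awful": -0.7, "okay": 0.2}
strong_words = non_neutral(toy_scores)
```

The comparison is strict, so a word scoring exactly 0.2 counts as neutral, in line with the "above 0.2" wording.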

Neural Vector-Based Expansion Experiments
Our subsequent experiments on the neural vector-based expansion proceeded in two major phases. First, we validated our expansion approach on a smaller dataset, such that the prediction from our system can be verified against human ground truth ratings. After establishing its accuracy, we proceeded to apply this approach on the 24 domains from Section 3.1.

Validation on VADER lexicon
Data. We started off our experiment with a domain-independent, generic sentiment prediction system such that we could draw on ground truth sentiment scores for words solicited from a group of human test subjects. In particular, we relied on the VADER lexicon (Hutto and Gilbert, 2014), a collection of 7,504 unique English words along with mean sentiment rating in [−4, 4], standard deviation, and raw human sentiment ratings from each test subject, as our pilot dataset. As word vectors, we adopted GloVe CommonCrawl embeddings (Pennington et al., 2014), and we eliminated any words in VADER that are not present in GloVe. A random split of 60%/20%/20% with equally diversified sentiment scores (illustrated in Figure 1a) was used to create train/validation/test portions.
Training. To train the model, we relied on a batch size of 32, dropout rate of 20%, and Adam optimization with an initial learning rate of 0.001, dynamic learning rate schedule (halving after 4 epochs of validation loss stagnation), and early stopping.
Results. It is important to note that the original VADER scores were obtained from ten human test subjects and there are discrepancies among these scores, which is to be expected in any such test. The highest standard deviation of the mean sentiment rating scores was found to be 2.5, while the lowest was 0. Hence, a simple prediction accuracy is not sufficient to capture the performance of any sentiment prediction model. Thus, three different evaluation methods are presented here to assess the performance of our neural regression model.
First, we evaluated the effectiveness of the proposed model in terms of the raw accuracy across different absolute error tolerances with respect to the human mean sentiment rating. For different absolute prediction error thresholds, we obtain a different percentage of correct predictions, as plotted in Figure 1b. We observe that for 76.23% of cases, the absolute prediction error falls within 0.5 of the mean sentiment rating scores and for around 91% of cases, it falls within 1.0 of the human ground truth.
Next, we considered our model as just another opinion along with the 10 original human responses. We computed the standard deviation among the human scorers and evaluated our predicted scores against it. Figure 1c shows the percentage correct when evaluating the predictions using different standard deviation multiplier thresholds. We observe that 80% of the model predictions fall within one standard deviation σ of the ground truth scores, whereas 94% of the predictions fall within two standard deviations, 2σ, of the mean sentiment rating scores.
Finally, the Pearson correlation coefficient between the predicted scores and the mean sentiment rating scores was found to be 0.903. We can conclude from these three results that our deep model succeeds at learning to recreate scores for held-out data.
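The three evaluation measures can be sketched compactly. The numbers below are invented toy values rather than VADER data. Treating "within" as an inclusive comparison is an assumption, as is reading the second measure as the distance to the human mean relative to k times the human standard deviation.

```python
import math

def accuracy_within_tolerance(pred, gold_mean, tol):
    """Fraction of predictions whose absolute error is at most tol."""
    hits = sum(1 for p, g in zip(pred, gold_mean) if abs(p - g) <= tol)
    return hits / len(gold_mean)

def accuracy_within_sigma(pred, gold_mean, gold_std, k):
    """Fraction of predictions lying within k human standard deviations of the mean."""
    hits = sum(
        1 for p, m, s in zip(pred, gold_mean, gold_std) if abs(p - m) <= k * s
    )
    return hits / len(gold_mean)

def pearson(xs, ys):
    """Pearson correlation coefficient (undefined if either input is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Toy predictions and human ratings on a [-4, 4] scale (invented values).
predicted = [3.0, -1.2, 0.4, -2.9]
human_mean = [3.1, -1.9, 0.0, -2.5]
human_std = [0.5, 0.3, 0.8, 0.3]

acc_half = accuracy_within_tolerance(predicted, human_mean, 0.5)
acc_one_sigma = accuracy_within_sigma(predicted, human_mean, human_std, 1)
acc_two_sigma = accuracy_within_sigma(predicted, human_mean, human_std, 2)
r = pearson(predicted, human_mean)
```

The tolerance-based measure treats all words alike, whereas the σ-based measure is more lenient for words on which the human raters themselves disagreed.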

Domain-Specific Sentiment Scores
Subsequently, we proceeded to apply the technique on our larger seed data set from Section 3.1, which provides domain-specific sentiment scores. Recall that this seed data was obtained from a corpus of domain-specific reviews and hence sentiment scores obtained through the previously described automated seed data induction method served as the training data for our prediction model. A separate model was trained on each domain, resulting in 24 domain-specific predictors. Hence, each word may obtain 24 different sentiment scores corresponding to the 24 domains.
In this section, we shall denote our neural model's predictions as predicted scores, while sentiment ratings from the automatic seed induction are referred to as seed scores, which here can be regarded as silver standard ground truth targets. Given that we consider the frequency of a word in the original labeled data as a factor that affects the accuracy of our seed data induction, we generated train/validation/test splits with words that have a frequency equal to or above different predefined frequency thresholds $f_{\min}$ in a given domain.
For each considered frequency threshold f min , we computed the Pearson correlation coefficients between predicted scores and seed data scores on each of the domains, and consider the average of such Pearson correlation coefficients across different domains as the overall accuracy indicator for that f min . Figure 2a plots the outcome of this experiment. In order to find an optimum training frequency threshold to filter out training data with ambiguous sentiment scores, we ran a separate additional experiment, creating a fixed dev./test set by sampling 1,000 tokens from each domain with frequency over 1,000, while generating training data with varied frequency thresholds. Again, we computed the Pearson correlation coefficients of predicted sentiment scores and ground truth seed ones for each domain and took their average as the overall score for a given frequency threshold. The corresponding results are plotted in Figure 2b. Based on the observed scores, we adopted a frequency threshold of 500 for all subsequent experiments.

Extension to Very Large Vocabulary
Finally, in this section, we describe our extension of the sentiment prediction on different domains to all tokens in the word vector vocabulary. At this point, the weights and hyperparameters of our neural regression models were all frozen. We used models trained with frequency threshold $f_{\min} = 500$ from the last section and generated 24 domain-specific sentiment scores for each word in GloVe. Table 2 compares the low coverage of the original seed data with the coverage of the predicted data. Due to the network architecture of the prediction model, it virtually always predicts a non-zero value. However, many words obtained a low score very close to 0. Hence, it is more informative to again consider the filtered higher-intensity words with absolute score above 0.2 as non-neutral. From this, we can observe that our deep prediction helps fill the gaps in domains for which we had smaller amounts of training data. It achieved this in part by exploiting semantic relatedness between new words and words for which we had known scores in our seed data, as revealed by the embeddings.
No ground truth scores are available for the large GloVe vocabulary. However, we confirmed in Section 3.2.1 that our deep model learns to predict scores that correlate strongly with human ratings (Pearson correlation of 0.903 on held-out VADER words). Figure 3 considers the Pearson correlation of the different domain-specific lexicons with the polarity scores given by the complete VADER lexicon. Any words not covered by our lexicons were assumed to have 0.0 as our polarity score. Obviously, an overly strong correlation with VADER is not desirable, as we seek domain-specific lexicons precisely for their ability to capture domain-specific polarities that differ from generic ones. For example, in the movie domain, a word such as twist typically indicates a plot twist, which is often regarded as positive. In general, however, a word such as twist does not inherently convey anything positive. Still, the fact that our predicted lexicons correlate substantially better with VADER than the initial seed data suggests that they are more reliable. This mainly stems from their better coverage.
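The effect of coverage on this comparison can be seen in a small sketch in which uncovered words default to 0.0, as described above. All words and scores are invented, and the function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient (undefined if either input is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def correlation_with_reference(lexicon, reference):
    """Correlate a lexicon with a reference lexicon over the reference vocabulary.

    Words missing from `lexicon` are assigned a polarity of 0.0.
    """
    words = sorted(reference)
    ours = [lexicon.get(w, 0.0) for w in words]
    theirs = [reference[w] for w in words]
    return pearson(ours, theirs)

# Toy generic reference ratings and two toy domain lexicons (invented values).
reference = {"good": 1.9, "bad": -2.5, "great": 3.1, "awful": -2.0, "fine": 0.8}
seed = {"good": 0.4, "bad": -0.5}                       # low coverage
predicted = {"good": 0.4, "bad": -0.5, "great": 0.7,    # full coverage
             "awful": -0.6, "fine": 0.2}

r_seed = correlation_with_reference(seed, reference)
r_predicted = correlation_with_reference(predicted, reference)
```

In this toy setting the two lexicons agree on every word they share, so the higher correlation of the expanded lexicon is due to coverage alone.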
For additional analysis, we studied the cross-correlation matrix of sentiment scores obtained from the 24 domains, illustrated in Figure 4(a) as a heat-map. We further applied classical MDS based on the cross-correlation matrix for dimensionality reduction in order to render a 2D representation of the inter-relationships among the sentiment scores from 24 domains, shown in Figure 4(b). We found that the sentiment scores across different domains reflect intuitive connections. For example, entertainment-related domains such as Digital Music, Books, CDs and Vinyl, Toys and Games, and Video Games bear clear connections in light of their similarity. Likewise, categories related to household usage such as Pet Supplies, Grocery and Gourmet Food, Tools and Home Improvement, Home and Kitchen, etc. reside in similar locations, in light of the similarity of reviews in such domains. These results, along with the high correlation of the predictions in Section 3.2.2, corroborate that our domain-specific lexicons capture human-like sentiment toward different domains.
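For reference, classical MDS can be written out as follows. This is the textbook formulation, not necessarily the exact configuration used here. In particular, the conversion of the correlation matrix $C \in \mathbb{R}^{n \times n}$ ($n = 24$) into dissimilarities via $d_{ab} = \sqrt{2(1 - C_{ab})}$ is an assumption, chosen because it is a common way to turn correlations into distances.

```latex
\begin{align*}
  D^{(2)}_{ab} &= d_{ab}^{2} = 2\,(1 - C_{ab})
      && \text{squared dissimilarities (assumed conversion)} \\
  J &= I - \tfrac{1}{n}\,\mathbf{1}\mathbf{1}^{\top}
      && \text{centering matrix} \\
  B &= -\tfrac{1}{2}\, J\, D^{(2)} J = J\, C\, J
      && \text{double-centered Gram matrix} \\
  B &= Q \Lambda Q^{\top}
      && \text{eigendecomposition, } \lambda_1 \ge \lambda_2 \ge \dots \\
  X &= Q_{2}\, \Lambda_{2}^{1/2}
      && \text{2D coordinates from the top two eigenpairs}
\end{align*}
```

The identity $B = J C J$ holds under the assumed conversion because $J \mathbf{1} = \mathbf{0}$, so the constant term of $D^{(2)}$ vanishes after centering. Each row of $X$ gives the position of one domain in the plane.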

Applications of Induced Lexicons
Finally, we assessed the performance of the induced domain-specific sentiment lexicons on downstream tasks such as review sentiment classification and cross-lingual word-level sentiment prediction.

Unsupervised Review Sentiment Classification
Here, we used our predicted domain-specific lexicons to perform sentiment classification on the IMDB movie review dataset compiled by Maas et al. (2011). The test portion of this dataset has 25,000 reviews in total, among which 12,500 are positive and 12,500 are negative.
As for the word embeddings, in this evaluation, along with GloVe, we also used fastText to obtain a second set of domain-specific lexicons for comparison. As baselines, along with the raw VADER lexicon, two further domain-independent lexicons were derived by using the VADER lexicon as seed data and invoking GloVe and fastText to expand their coverage using our neural expansion approach. For unsupervised prediction given a document $x$ in the test set, we simply compute a prediction score $\frac{1}{|x|} \sum_{i=1}^{|x|} \phi(\mathbf{v}_{x_i})$, where $|x|$ denotes the document length and $x_i$ denotes the $i$-th word in $x$. Recall that $\phi(\mathbf{v}_w)$ is the neural prediction score, given the word vector for $w$. Subsequently, we predict the polarity by setting the average of all such prediction scores in the corpus as a binary threshold. Figure 5 plots the results of this evaluation. We observe that in almost all the domains, the domain-specific lexicons (plotted as bars) outperformed the domain-independent lexicons (horizontal lines). As expected, the results are particularly strong for the domains that are closest to the movie domain.
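The unsupervised classification rule can be sketched in a few lines. The sketch assumes that the document score is the mean of the word-level scores, that the neural predictions are available as a precomputed dictionary, that unknown words score 0.0, and that ties at the threshold count as negative. The words, scores, and function names are invented.

```python
def document_score(tokens, lexicon):
    """Mean lexicon score over all tokens; unknown words contribute 0.0."""
    if not tokens:
        return 0.0
    return sum(lexicon.get(t, 0.0) for t in tokens) / len(tokens)

def classify_corpus(docs, lexicon):
    """Label each document by comparing its score with the corpus-wide mean score."""
    scores = [document_score(d, lexicon) for d in docs]
    threshold = sum(scores) / len(scores)
    return ["positive" if s > threshold else "negative" for s in scores]

# Toy movie-domain lexicon and tokenized reviews (invented values).
movie_lexicon = {"brilliant": 0.8, "twist": 0.5, "boring": -0.7, "dull": -0.6}
reviews = [
    ["a", "brilliant", "twist"],
    ["boring", "and", "dull"],
    ["brilliant", "acting"],
    ["dull", "plot"],
]

predictions = classify_corpus(reviews, movie_lexicon)
```

Using the corpus mean as the threshold avoids any labeled data and compensates for lexicons whose scores are skewed toward one polarity. It implicitly assumes a roughly balanced test set, which holds for the IMDB test portion.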

Cross-Lingual Word-Level Sentiment Prediction
Finally, we evaluated the performance of predicted domain-specific lexicons on cross-lingual word-level sentiment score prediction. For this, cross-lingually aligned fastText word vectors (Bojanowski et al., 2017) for four languages (English, Spanish, French, and Polish) were used as word embeddings. As the ground truth, we considered the mean sentiment scores of 7,504 English tokens from VADER, as well as the mean human ratings of valence for 875 Spanish words (Hinojosa et al., 2015), 1,031 French words (Monnier and Syssau, 2013), and 1,586 Polish words (Imbir, 2014). Any words from the ground truth data that are missing in the aligned fastText word vectors were eliminated. The sentiment prediction model was trained on 24 different domains separately, as described in Section 3.2.2, except that we here used the aligned word vectors for English during training. After the training stage, the same models could then be invoked to cross-lingually predict sentiment scores for words from the ground truth data sets using aligned word vectors for non-English words. Correlations between the predicted scores and ground truth datasets are plotted in Figure 6. Although the cross-lingual results did not attain the level of the monolingual English correlation, we obtained a promising degree of cross-lingual generalization. Note again that we do not desire a perfect correlation, as the domain-specific scores are expected to diverge from the generic domain-independent valence ratings.

Related Work
The traditional way of obtaining sentiment lexicons has been to build them manually, relying either on experts or invoking crowd-sourcing. A prominent example is the Hu and Liu (2004) Opinion Lexicon. There are numerous algorithms that aim to increase the coverage of an individual sentiment lexicon. Often, these start from seeds and then rely on graph-based algorithms to gather additional data, as for instance explored by Kim and Hovy (2004) and in the approach used to induce SentiWordNet (Baccianella et al., 2010). The extension can also be based on vector representations of words, as proposed in the Densifier approach (Rothe et al., 2016). Such work has shown that dense word vectors trained on large amounts of data harbour signals that are useful for sentiment analysis. Instead of a regular supervised setup, Castellucci et al. (2016) used distant supervision based on emoticons to obtain sentiment labels for entire sentences. They then trained a sentiment model on sentence vector representations sharing a common representation space with word vectors, which allowed them to apply the trained model to predict word-level scores. However, techniques such as the above mostly have not targeted domain-specific sentiment lexicons.
The SocialSent project (Hamilton et al., 2016) induced Reddit community-specific sentiment lexicons without labeled corpora. Their SentProp approach constructs a graph of words and then considers random walks emanating from a small set of seed words with known sentiment polarity. The polarity scores are based on the frequency of random walk visits and the polarity of the seed word from which those random walks started. While Reddit communities provide substantial diversity, the language used in Reddit posts differs quite substantially from the kinds of language one encounters in reviews. Kreutz and Daelemans (2018) adopted SentProp to customize an existing general-purpose sentiment lexicon for use in one specific domain.
We instead focus on inducing a number of domain-specific lexicons to obtain a lexical resource that is more suitable for typical sentiment analysis use cases. The approach by Labille et al. (2017) also starts from labeled data for consumer products. It infers word polarity scores directly based on posterior probabilities and inverse document frequencies. However, such scores are limited to words that occur in the labeled training data.
Instead, in our work, we draw on word vectors to greatly enhance the coverage of the lexicons beyond the words present in the category-specific labeled data. Our initial seed data approach is based on linear models optimized for maximum margin discrimination between the positive and negative classes, in line with the observations by Mudinas et al. (2018), who found that linear models outperformed more ...
Cross-lingual propagation of sentiment lexicons has been studied in a number of previous approaches. For example, Dong and de Melo (2018a) and Dong and de Melo (2018b) induced sentiment embeddings using translation graphs. In our experiments, we considered cross-lingual word embeddings for cross-lingual transfer.

Conclusion
In this paper, we present new domain-specific sentiment lexicons for a number of domains. We bootstrap this data from a large-scale review corpus covering 24 domains and then rely on a neural model to substantially extend its coverage. Our analysis shows that there are substantial differences between domains, which make domain-specific sentiment lexicons an important form of lexical resource in downstream tasks. Further experiments show that the predicted lexicons outperform domain-independent lexicons on unsupervised review classification and can also be used for cross-lingual word-level sentiment prediction. Our data is freely available under an open source license from http://sentimentanalysis.org.