Inducing Domain-Specific Sentiment Lexicons from Unlabeled Corpora

A word's sentiment depends on the domain in which it is used. Computational social science research thus requires sentiment lexicons that are specific to the domains being studied. We combine domain-specific word embeddings with a label propagation framework to induce accurate domain-specific sentiment lexicons using small sets of seed words. We show that our approach achieves state-of-the-art performance on inducing sentiment lexicons from domain-specific corpora and that our purely corpus-based approach outperforms methods that rely on hand-curated resources (e.g., WordNet). Using our framework, we induce and release historical sentiment lexicons for 150 years of English and community-specific sentiment lexicons for 250 online communities from the social media forum Reddit. The historical lexicons we induce show that more than 5% of sentiment-bearing (non-neutral) English words completely switched polarity during the last 150 years, and the community-specific lexicons highlight how sentiment varies drastically between different communities.


Introduction
The sentiment of the word soft varies drastically between an online community dedicated to sports and one dedicated to toy animals (Figure 1). Terrific once had a highly negative connotation; now it is essentially synonymous with good (Figure 2).
Inducing domain-specific sentiment lexicons is crucial to computational social science (CSS) research. Sentiment lexicons allow researchers to analyze key subjective properties of texts, such as user opinions and emotional attitudes (Taboada et al., 2011). However, without domain-specific lexicons, analyses can be misled by sentiment assignments that are biased towards domain-general contexts and that fail to take into account community-specific vernacular or demographic variations in language use (Hovy, 2015; Yang and Eisenstein, 2015). Experts or crowdsourced annotators can be used to construct sentiment lexicons for a specific domain, but these efforts are expensive and time-consuming (Mohammad and Turney, 2010; Fast et al., 2016). Crowdsourcing is especially problematic when the domain involves very non-standard language (e.g., historical documents or obscure social media forums), since in these cases annotators must understand the sociolinguistic context of the data.
Recent work has shown that web-scale sentiment lexicons can be automatically induced for large socially-diffuse domains, such as the internet-at-large (Velikovich et al., 2010) or all of Twitter (Tang et al., 2014). However, in cases where researchers want to analyze the sentiment of domain-specific language (such as in financial documents, historical texts, or tight-knit social media forums), it is not enough to simply use generic crowdsourced or web-scale lexicons. Generic lexicons will not only be inaccurate in specific domains, they may mislead research by introducing harmful biases (Loughran and McDonald, 2011).1 Researchers need a principled and accurate framework for inducing lexicons that are specific to their domain of study.
To meet this need, we introduce SENTPROP, a framework to learn accurate sentiment lexicons from small sets of seed words and domain-specific corpora. Unlike previous approaches, SENTPROP is designed to maintain accurate performance when using modestly-sized domain-specific corpora ($\sim 10^7$ tokens), and it provides confidence scores along with the learned lexicons, which allows researchers to quantify uncertainty in a principled manner.
The key contributions of this work are:
1. A state-of-the-art sentiment induction algorithm, combining high-quality word vector embeddings with an intuitive label propagation approach.
2. A novel bootstrap-sampling framework for inferring confidence scores with the sentiment values.
3. Two large-scale studies that reveal how sentiment depends on both social and historical context.
(a) We induce community-specific sentiment lexicons for the largest 250 "subreddit" communities on the social-media forum Reddit, revealing substantial variation in word sentiment between communities.
(b) We induce historical sentiment lexicons for 150 years of English, revealing that >5% of sentiment-bearing words switched polarity during this time.
To the best of our knowledge, this is the first work to systematically analyze the domain-dependency of sentiment at a large scale, across hundreds of years and hundreds of user-defined online communities. All of the inferred lexicons, along with code for SENTPROP and all methods evaluated, are made available in the SOCIALSENT package released with this paper.2
1 http://brandsavant.com/brandsavant/the-hidden-bias-of-social-media-sentiment-analysis
2 http://nlp.stanford.edu/projects/socialsent
Figure 2: Terrific becomes more positive over the last 150 years. Sentiment values and bootstrapped confidences were computed using SENTPROP on historical data (Section 6).

Related work
Our work builds upon a wealth of previous research on inducing sentiment lexicons, along two threads:
Corpus-based approaches use seed words and patterns in unlabeled corpora to induce domain-specific lexicons. These patterns may rely on syntactic structures (Hatzivassiloglou and McKeown, 1997; Thelen and Riloff, 2002; Widdows and Dorow, 2002; Jijkoun et al., 2010; Rooth et al., 1999), which can be domain-specific and brittle (e.g., in social media lacking usual grammatical structures). Other models rely on general co-occurrence (Turney and Littman, 2003; Riloff and Shepherd, 1997; Igo and Riloff, 2009). Often corpus-based methods exploit distant-supervision signals (e.g., review scores, emoticons) specific to certain domains (Asghar et al., 2015; Blair-Goldensohn et al., 2008; Bravo-Marquez et al., 2015; Choi and Cardie, 2009; Severyn and Moschitti, 2015; Speriosu et al., 2011; Tang et al., 2014). An effective corpus-based approach that does not require distant supervision, and which we adapt here, is to construct lexical graphs using word co-occurrences and then to perform some form of label propagation over these graphs (Huang et al., 2014; Velikovich et al., 2010). Recent work has also learned transformations of word-vector representations in order to induce sentiment lexicons (Rothe et al., 2016). Fast et al. (2016) combine word vectors with crowdsourcing to produce domain-general topic lexicons.
Dictionary-based approaches use hand-curated lexical resources (usually WordNet; Fellbaum, 1998) in order to propagate sentiment from seed labels (Esuli and Sebastiani, 2006; Hu and Liu, 2004; Kamps et al., 2004; Rao and Takamura et al., 2005; Tai and Kao, 2013). There is an implicit consensus that dictionary-based approaches will generate higher-quality lexicons, due to their use of these clean, hand-curated resources; however, they are not applicable in domains lacking such a resource (e.g., most historical texts).
Most previous work seeks to enrich or enlarge existing lexicons (San Vicente et al., 2014; Velikovich et al., 2010; Qiu et al., 2009), emphasizing recall over precision. This recall-oriented approach is motivated by the need for massive polarity lexicons in tasks like web-advertising (Velikovich et al., 2010). In contrast to these previous efforts, the goal of this work is to induce high-quality lexicons that are accurate to a specific social context.
Algorithmically, our approach is inspired by Velikovich et al. (2010). We extend this work by incorporating high-quality word vector embeddings, a new graph construction approach, an alternative label propagation algorithm, and a bootstrapping method to obtain confidences. Together these improvements, especially the high-quality word vectors, allow our corpus-based method to even outperform the state-of-the-art dictionary-based approach.

Framework
Our framework, SENTPROP, is designed to meet four key desiderata:
1. Resource-light: Accurate performance without massive corpora or hand-curated resources.
2. Interpretable: Uses small seed sets of "paradigm" words to maintain interpretability and avoid ambiguity in sentiment values.
3. Robust: Bootstrap-sampled standard deviations provide a measure of confidence.
4. Out-of-the-box: Does not rely on signals that are specific to only certain domains.
SENTPROP involves two steps: constructing a lexical graph from unlabeled corpora and propagating sentiment labels over this graph.

Constructing a lexical graph
Lexical graphs are constructed from distributional word embeddings learned on unlabeled corpora.

Distributional word embeddings
The first step in our approach is building high-quality semantic representations for words using a vector space model (VSM). We embed each word $w_i \in \mathcal{V}$ as a vector $\mathbf{w}_i$ that captures information about its co-occurrence statistics with other words (Landauer and Dumais, 1997; Turney and Pantel, 2010). This VSM approach has a long history in NLP and has been highly successful in recent applications (see Levy et al., 2015 for a survey).
When recreating known lexicons, we used a number of publicly available embeddings (Section 4).
In the cases where we learned embeddings ourselves, we employed an SVD-based method to construct the word vectors. First, we construct a matrix $M^{PPMI} \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}$ with entries given by

$$M^{PPMI}_{i,j} = \max\left\{ \log\left( \frac{\hat{p}(w_i, w_j)}{\hat{p}(w_i)\,\hat{p}(w_j)} \right),\ 0 \right\}, \tag{1}$$

where $\hat{p}$ denotes smoothed empirical probabilities of word (co-)occurrences within fixed-size sliding windows of text. $M^{PPMI}_{i,j}$ is equal to a smoothed variant of the positive pointwise mutual information between words $w_i$ and $w_j$ (Levy et al., 2015). Next, we compute $M^{PPMI} = U \Sigma V^{\top}$, the truncated singular value decomposition of $M^{PPMI}$. The vector embedding for word $w_i$ is then given by

$$\mathbf{w}_i^{SVD} = (U)_i. \tag{2}$$

Excluding the singular value weights, $\Sigma$, has been shown to dramatically improve embedding quality (Turney and Pantel, 2010; Bullinaria and Levy, 2012). Following standard practices, we learn embeddings of dimension 300. We found that this SVD method significantly outperformed word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) on the domain-specific datasets we examined. Our results echo the findings of Levy et al. (2015) that the SVD approach performs best on rare word similarity tasks.
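A minimal sketch of the PPMI-plus-SVD construction described above, assuming a dense, symmetric co-occurrence count matrix has already been collected from sliding windows. The function names, the context-smoothing exponent of 0.75, and the use of a dense full SVD are illustrative assumptions, not details taken from the SOCIALSENT code.

```python
import numpy as np

def ppmi_matrix(counts, smoothing=0.75):
    """Smoothed PPMI from a |V| x |V| co-occurrence count matrix.

    `smoothing` is an assumed context-distribution exponent; the text only
    says that the probabilities are smoothed.
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    p_joint = counts / total
    p_word = counts.sum(axis=1) / total
    # Raise context counts to a power < 1 so rare contexts are not over-weighted.
    ctx = counts.sum(axis=0) ** smoothing
    p_ctx = ctx / ctx.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_joint / np.outer(p_word, p_ctx))
    pmi[~np.isfinite(pmi)] = 0.0  # unseen pairs carry no association
    return np.maximum(pmi, 0.0)   # keep only positive PMI

def svd_embeddings(ppmi, dim):
    """Rows of the leading left singular vectors, without the Sigma weights."""
    u, _, _ = np.linalg.svd(ppmi, full_matrices=False)
    return u[:, :dim]
```

For a realistic vocabulary the PPMI matrix is large and sparse, so a sparse truncated solver would be the natural replacement for the dense call used here.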

Defining the graph edges
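Given a set of word embeddings, a weighted lexical graph is constructed by connecting each word with its nearest $k$ neighbors within the semantic space (according to cosine similarity). The weights of the edges are set as

$$E_{i,j} = \arccos\left( -\frac{\mathbf{w}_i^{\top} \mathbf{w}_j}{\|\mathbf{w}_i\| \, \|\mathbf{w}_j\|} \right). \tag{3}$$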
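The nearest-neighbor graph can be sketched as follows. Two points here are assumptions rather than statements from the text: the edge weight is taken to be the arccosine of the negated cosine similarity (so that more similar words get larger weights), and the graph is symmetrized by keeping an edge whenever either endpoint selects the other. The function name is hypothetical, and the sketch assumes no embedding row is all zeros.

```python
import numpy as np

def build_lexical_graph(embeddings, k):
    """Symmetric k-nearest-neighbour graph over word embeddings."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cos = np.clip(unit @ unit.T, -1.0, 1.0)
    np.fill_diagonal(cos, -np.inf)          # a word is never its own neighbour
    n = cos.shape[0]
    k = min(k, n - 1)
    edges = np.zeros((n, n))
    for i in range(n):
        nearest = np.argsort(-cos[i])[:k]   # the k most similar words to word i
        # Assumed weighting: arccos(-cos) grows from 0 to pi with similarity.
        edges[i, nearest] = np.arccos(-cos[i, nearest])
    return np.maximum(edges, edges.T)       # assumed symmetrization
```

The dense similarity matrix is only practical for vocabularies of a few thousand words, which matches the top-5000 vocabularies used later in the paper.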

Propagating polarities from a seed set
Once a weighted lexical graph is constructed, we propagate sentiment labels over this graph using a random walk method (Zhou et al., 2004). A word's polarity score for a seed set is proportional to the probability of a random walk from the seed set hitting that word. Let $\mathbf{p} \in \mathbb{R}^{|\mathcal{V}|}$ be a vector of word-sentiment scores constructed using seed set $S$ (e.g., ten negative words); $\mathbf{p}$ is initialized to have $\frac{1}{|\mathcal{V}|}$ in all entries. And let $E$ be the matrix of edge weights given by equation (3). First, we construct a symmetric transition matrix from $E$ by computing $T = D^{\frac{1}{2}} E D^{\frac{1}{2}}$, where $D$ is a matrix with the column sums of $E$ on the diagonal. Next, using $T$ we iteratively update $\mathbf{p}$ until numerical convergence:

$$\mathbf{p}^{(t+1)} = \beta T \mathbf{p}^{(t)} + (1 - \beta)\,\mathbf{s}, \tag{4}$$

where $\mathbf{s}$ is a vector with values set to $\frac{1}{|S|}$ in the entries corresponding to the seed set $S$ and zeros elsewhere. The $\beta$ term controls the extent to which the algorithm favors local consistency (similar labels for neighbors) vs. global consistency (correct labels on seed words), with lower $\beta$s emphasizing the latter.
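The propagation step can be illustrated with a short sketch. It reads the normalization as the symmetric form of Zhou et al. (2004), scaling each edge by the inverse square roots of its two endpoint degrees; that reading, the default value of beta, the L1 stopping rule, and the function name are all assumptions made for illustration.

```python
import numpy as np

def random_walk_scores(edges, seed_indices, beta=0.9, tol=1e-6, max_iter=1000):
    """Iterate p <- beta * T p + (1 - beta) * s until p stops changing."""
    n = edges.shape[0]
    degree = edges.sum(axis=0)
    inv_sqrt = np.zeros(n)
    inv_sqrt[degree > 0] = degree[degree > 0] ** -0.5
    # Assumed symmetric normalization: T[i, j] = E[i, j] / sqrt(d_i * d_j).
    transition = inv_sqrt[:, None] * edges * inv_sqrt[None, :]
    s = np.zeros(n)
    s[list(seed_indices)] = 1.0 / len(seed_indices)   # uniform mass on the seeds
    p = np.full(n, 1.0 / n)                           # uniform initialization
    for _ in range(max_iter):
        updated = beta * (transition @ p) + (1.0 - beta) * s
        if np.abs(updated - p).sum() < tol:
            return updated
        p = updated
    return p
```

With beta set to 0 the result is just the seed vector, and as beta approaches 1 the scores are dominated by graph structure, which mirrors the local versus global trade-off described above.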
To obtain a final polarity score for a word $w_i$, we run the walk using both positive and negative seed sets, obtaining positive ($p^P(w_i)$) and negative ($p^N(w_i)$) label scores. We then combine these values into a positive-polarity score as $\bar{p}^P(w_i) = \frac{p^P(w_i)}{p^P(w_i) + p^N(w_i)}$ and standardize the final scores to have zero mean and unit variance (within a corpus).
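As a small worked example of this combination step, the sketch below takes the two score vectors, forms the positive-polarity ratio, and z-scores it across the vocabulary. The function name is hypothetical, and the sketch assumes every word receives a strictly positive total score so that the ratio is defined.

```python
import numpy as np

def positive_polarity(pos_scores, neg_scores):
    """Ratio p_P / (p_P + p_N), standardized to zero mean and unit variance."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    ratio = pos / (pos + neg)              # assumes pos + neg > 0 for every word
    return (ratio - ratio.mean()) / ratio.std()
```

Because of the standardization, a score is only meaningful relative to the other words in the same corpus, so raw values are not directly comparable across corpora with different vocabularies.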
Many variants of this random walk approach and related label propagation techniques exist in the literature (Zhou et al., 2004; Zhu and Ghahramani, 2002; Zhu et al., 2003; Velikovich et al., 2010; San Vicente et al., 2014). We experimented with a number of these approaches and found little difference between their performance, so we present only this random walk approach here. The SOCIALSENT package contains a full suite of these methods.

Bootstrap-sampling for robustness
Propagated sentiment scores are inevitably influenced by the seed set, and it is important for researchers to know the extent to which polarity values are simply the result of corpus artifacts that are correlated with these seed words. We address this issue by using a bootstrap-sampling approach to obtain confidence regions over our sentiment scores. We bootstrap by running our propagation over B random equally-sized subsets of the positive and negative seed sets. Computing the standard deviation of the bootstrap-sampled polarity scores provides a measure of confidence and allows the researcher to evaluate the robustness of the assigned polarities. We set B = 50 and used 7 words per random subset (full seed sets are size 10; see Table 1).
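The bootstrap procedure might be sketched as below. Here `score_fn` is a hypothetical callable standing in for the full propagation pipeline (it takes a positive and a negative seed subset and returns one polarity score per word), and reporting the mean over runs as the point estimate is an assumption; the text only specifies the standard deviation as the confidence measure.

```python
import numpy as np

def bootstrap_polarities(score_fn, pos_seeds, neg_seeds,
                         n_samples=50, subset_size=7, rng=None):
    """Run `score_fn` on random seed subsets; return per-word mean and std."""
    rng = np.random.default_rng(rng)
    runs = []
    for _ in range(n_samples):
        # Draw equally sized subsets without replacement from each seed set.
        pos = rng.choice(pos_seeds, size=subset_size, replace=False)
        neg = rng.choice(neg_seeds, size=subset_size, replace=False)
        runs.append(score_fn(list(pos), list(neg)))
    runs = np.vstack(runs)
    return runs.mean(axis=0), runs.std(axis=0)
```

Words whose standard deviation is large relative to their mean score are the ones whose polarity depends heavily on which seeds were drawn, and so are candidates for manual inspection.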

Recreating known lexicons
We validate our approach by recreating known sentiment lexicons in three domains: Standard English, Twitter, and Finance. Table 1 lists the seed words used in each domain.
Standard English: To facilitate comparison with previous work, we focus on the well-known General Inquirer lexicon (Stone et al., 1966). We also use the continuous valence (i.e., polarity) scores collected by Warriner et al. (2013) in order to evaluate the fine-grained performance of our framework. We test our framework's performance using two different embeddings: off-the-shelf Google news embeddings constructed from $10^{11}$ tokens and embeddings we constructed from the 2000s decade of the Corpus of Historical American English (COHA), which contains $\sim 2 \times 10^7$ words in each decade, from 1850 to 2000 (Davies, 2010). The COHA corpus allows us to test how the algorithms deal with this smaller historical corpus, which is important since we will use the COHA corpus to infer historical sentiment lexicons (Section 6).
Finance: Previous work found that general purpose sentiment lexicons performed very poorly on financial text (Loughran and McDonald, 2011), so a finance-specific sentiment lexicon (containing binary labels) was hand-constructed for this domain (ibid.). To test against this lexicon, we constructed embeddings using a dataset of $\sim 2 \times 10^7$ tokens from financial 8K documents (Lee et al., 2014).
Twitter: Numerous works attempt to induce Twitter-specific sentiment lexicons using supervised approaches and features unique to that domain (e.g., follower graphs; Speriosu et al., 2011). Here, we emphasize that we can induce an accurate lexicon using a simple domain-independent and resource-light approach, with the implication that lexicons can easily be induced for related social media domains without resorting to complex supervised frameworks. We evaluate our approach using the test set from the 2015 SemEval task 10E competition (Rosenthal et al., 2015), and we use the embeddings constructed by Rothe et al. (2016).

Baselines and state-of-the-art comparisons
We compared SENTPROP against standard baselines and state-of-the-art approaches. The PMI baseline of Turney and Littman (2003) computes the pointwise mutual information between the seeds and the targets without using propagation. The CountVec baseline, corresponding to the method in Velikovich et al. (2010), is similar to our method but uses an alternative propagation approach and raw co-occurrence vectors instead of learned embeddings. Both these methods require raw corpora, so they function as baselines in cases where we do not use off-the-shelf embeddings. We also compare against DENSIFIER, a state-of-the-art method which learns orthogonal transformations of word vectors instead of propagating labels (Rothe et al., 2016). Lastly, on standard English we compare against a state-of-the-art WordNet-based method, which performs label propagation over a WordNet-derived graph (San Vicente et al., 2014). Several variant baselines, all of which SENTPROP outperforms, are omitted for brevity (e.g., using word-vector cosines in place of PMI in Turney and Littman (2003)'s framework). Code for replicating all these variants is available in the SOCIALSENT package.

Evaluation setup
We evaluate the approaches according to (i) their binary classification accuracy (ignoring the neutral class, as is common in previous work), (ii) ternary classification performance (positive vs. neutral vs. negative), and (iii) Kendall $\tau$ rank-correlation with continuous human-annotated polarity scores.
For all methods in the ternary-classification condition, we use the class-mass normalization method (Zhu et al., 2003) to label words as positive, neutral, or negative. This method assumes knowledge of the label distribution (i.e., how many positive/negative vs. neutral words there are) and simply assigns labels to best match this distribution.
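One simple way to realize "assign labels to best match a known distribution" is sketched below: the highest-scoring words are labeled positive and the lowest-scoring words negative, with the counts given in advance. This is a simplified, rank-based reading and not necessarily the exact procedure of Zhu et al. (2003); the function name and the integer label encoding are assumptions.

```python
import numpy as np

def label_by_known_distribution(scores, n_positive, n_negative):
    """Label the top scores +1, the bottom scores -1, and the rest 0 (neutral)."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)                 # ascending: most negative first
    labels = np.zeros(len(scores), dtype=int)
    if n_negative > 0:
        labels[order[:n_negative]] = -1
    if n_positive > 0:
        labels[order[-n_positive:]] = 1        # guarded because [-0:] selects everything
    return labels
```

The sketch assumes the two counts together do not exceed the number of words; if they did, the positive labels would overwrite some negative ones.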

Evaluation results
Tables 2a-2d summarize the performance of our framework along with baselines and other state-of-the-art approaches. Our framework significantly outperforms the baselines on all tasks, outperforms a state-of-the-art approach that uses WordNet on standard English (Table 2a), and is competitive with Sentiment140 on Twitter (Table 2b), a distantly-supervised approach that uses signals from emoticons (Mohammad and Turney, 2010). DENSIFIER also performs extremely well, outperforming SENTPROP when off-the-shelf embeddings are used (Tables 2a and 2b). However, SENTPROP significantly outperforms all other approaches when using the domain-specific embeddings (Tables 2c and 2d).
Overall, SENTPROP is competitive with the state-of-the-art across all conditions and, unlike previous approaches, it is able to maintain high accuracy even when modestly-sized domain-specific corpora are used. We found that the baseline method of Velikovich et al. (2010), which our method is closely related to, performed very poorly with these domain-specific corpora. This indicates that using high-quality word-vector embeddings can have a drastic impact on performance. However, it is worth noting that Velikovich et al. (2010)'s method was designed for high recall with massive corpora, so its poor performance in our regime is not surprising.

Inducing community-specific lexicons
As a first large-scale study, we investigate how sentiment depends on the social context in which a word is used. It is well known that there is substantial sociolinguistic variation between different communities, whether these communities are defined geographically (Trudgill, 1974) or via underlying sociocultural differences (Labov, 2006). However, no previous work has systematically investigated community-specific variation in word sentiment at a large scale. Yang and Eisenstein (2015) exploit social network structures in Twitter to infer a small number (1-10) of communities and analyze sentiment variation via a supervised framework. Our analysis extends this line of work by analyzing the sentiment across hundreds of user-defined communities using only unlabeled corpora and a small set of "paradigm" seed words (the Twitter seed words outlined in Table 1). In our study, we induced sentiment lexicons for the top-250 (by comment-count) subreddits from the social media forum Reddit. We used all the 2014 comment data to induce the lexicons, with words lower cased and comments from bots and deleted users removed. Sentiment was induced for the top-5000 non-stop words in each subreddit (again, by comment-frequency).

Examining the lexicons
Analysis of the learned lexicons reveals the extent to which sentiment can differ across communities. Figure 3 highlights some words with opposing sentiment in two communities: in r/TwoXChromosomes (r/TwoX), a community dedicated to female perspectives and gender issues, the words crazy and insane have negative polarity, which is not true in the r/sports community, and, vice-versa, words like soft are positive in r/TwoX but negative in r/sports.
To get a sense of how much sentiment differs across communities in general, we selected a random subset of 1000 community pairs and examined the correlation in their sentiment values for highly sentiment-bearing words (Figure 4). We see that the distribution is noticeably skewed, with many community pairs having highly uncorrelated sentiment values. The 1000 random pairs were selected such that each member of the pair overlapped in at least half of their top-5000 word vocabulary. We then computed the correlation between the sentiments in these community-pairs. Since sentiment is noisy and relatively uninteresting for neutral words, we compute $\tau_{25\%}$, the Kendall-$\tau$ correlation over the top-25% most sentiment-bearing words shared between the two communities.
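The top-quartile correlation between two community lexicons could be computed along the following lines. The text does not say how "most sentiment-bearing" is ranked, so the sketch assumes ranking by summed absolute score across the two communities; it also uses the simple tau-a variant of Kendall's coefficient, and both function names are hypothetical. Lexicons are represented as plain dictionaries from word to standardized score.

```python
def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            total += (prod > 0) - (prod < 0)   # +1 concordant, -1 discordant, 0 tie
    return total / (n * (n - 1) / 2)

def tau_top_fraction(lexicon_a, lexicon_b, fraction=0.25):
    """Kendall tau over the most sentiment-bearing words shared by two lexicons."""
    shared = sorted(set(lexicon_a) & set(lexicon_b))
    # Assumed ranking: largest combined absolute polarity first.
    shared.sort(key=lambda w: -(abs(lexicon_a[w]) + abs(lexicon_b[w])))
    top = shared[:max(2, int(len(shared) * fraction))]  # needs >= 2 shared words
    return kendall_tau([lexicon_a[w] for w in top], [lexicon_b[w] for w in top])
```

The quadratic pair loop is adequate for a quarter of a 5000-word vocabulary; a library implementation with tie correction would be the usual choice for larger inputs.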
Analysis of individual pairs reveals some interesting insights about sentiment and inter-community dynamics. For example, we found that the sentiment correlation between r/TwoX and r/TheRedPill ($\tau_{25\%} = 0.58$), two communities that hold conflicting views and often attack each other, was actually higher than the sentiment correlation between r/TwoX and r/sports ($\tau_{25\%} = 0.41$), two communities that are entirely unrelated. This result suggests that conflicting communities may have more similar sentiment in their language compared to communities that are entirely unrelated.

Inducing diachronic sentiment lexicons
Sentiment also depends on the historical time-period in which a word is used. To investigate this dependency, we use our framework to analyze how word polarities have shifted over the last 150 years. The phenomena of amelioration (words becoming more positive) and pejoration (words becoming more negative) are well-discussed in the linguistic literature (Traugott and Dasher, 2001); however, no comprehensive polarity lexicons exist for historical data (Cook and Stevenson, 2010). Such lexicons are crucial to the growing body of work on NLP analyses of historical text (Piotrowski, 2012) which are informing diachronic linguistics (Hamilton et al., 2016), the digital humanities (Muralidharan and Hearst, 2012), and history (Hendrickx et al., 2011).
The only previous work on automatically inducing historical sentiment lexicons is Cook and Stevenson (2010); they use the PMI method and a full modern sentiment lexicon as their seed set, which problematically assumes that all these words have not changed in sentiment. In contrast, we use a small seed set of words that were manually selected based upon having strong and stable sentiment over the last 150 years (Table 1; confirmed via historical entries in the Oxford English Dictionary).
Figure 5: (a) Lean becomes more positive. Lean underwent amelioration, becoming more similar to muscular and less similar to weak. (b) Pathetic becomes more negative. Pathetic underwent pejoration, becoming similar to weak and less similar to passionate.

Examining the lexicons
We constructed lexicons from COHA, since it was carefully constructed to be genre balanced (e.g., compared to the Google N-Grams; Pechenick et al., 2015). We built lexicons for all adjectives with counts above 100 in a given decade and also for the top-5000 non-stop words within each year. In both these cases we found that >5% of sentiment-bearing (positive/negative) words completely switched polarity during this 150-year time-period and >25% of all words changed their sentiment label (including switches to/from neutral).10 The prevalence of full polarity switches highlights the importance of historical sentiment lexicons for work on diachronic linguistics and cultural change. Figure 5a shows an example amelioration detected by this method: the word lean lost its negative connotations associated with "weakness" and instead became positively associated with concepts like "muscularity" and "fitness". Figure 5b shows an example pejoration, where pathetic, which used to be more synonymous with passionate, gained stronger negative associations with the concepts of "weakness" and "inadequacy" (Simpson et al., 1989). In both these cases, semantic similarities computed using our learned historical word vectors were used to contextualize the shifts.
10 We defined the thresholds for polar vs. neutral using the class-mass normalization method and compared scores averaged over 1850-1880 to those averaged over 1970-2000.
Some other well-known examples of sentiment changes captured by our framework include the semantic bleaching of sorry, which shifted from negative and serious ("he was in a sorry state") to uses as a neutral discourse marker ("sorry about that") and worldly, which used to have negative connotations related to materialism and religious impurity ("sinful worldly pursuits") but now is frequently used to indicate sophistication ("a cultured, worldly woman") (Simpson et al., 1989). Our hope is that the full lexicons released with this work will spur further examinations of such historical shifts in sentiment, while also facilitating CSS applications that require sentiment ratings for historical text.

Conclusion
SENTPROP allows researchers to easily induce robust and accurate sentiment lexicons that are relevant to their particular domain of study. Such lexicons are crucial to CSS research, as evidenced by our two studies showing that sentiment depends strongly on both social and historical context. The sentiment lexicons induced by SENTPROP are not perfect, which is reflected in the uncertainty associated with our bootstrap-sampled estimates. However, we believe that these user-constructed, domain-specific lexicons, which quantify uncertainty, provide a more principled foundation for CSS research compared to domain-general sentiment lexicons that contain unknown biases. In the future our method could also be integrated with supervised domain adaptation (e.g., Yang and Eisenstein, 2015) to further improve these domain-specific results.