Sentiment Lexicon Construction with Representation Learning Based on Hierarchical Sentiment Supervision

A sentiment lexicon is an important tool for identifying the sentiment polarity of words and texts. How to automatically construct sentiment lexicons has become a research topic in the field of sentiment analysis and opinion mining. Recently, there have been attempts to employ representation learning algorithms to construct sentiment lexicons from sentiment-aware word embeddings. However, these methods are normally trained under document-level sentiment supervision only. In this paper, we develop a neural architecture that trains sentiment-aware word embeddings by integrating sentiment supervision at both the document and word levels, to enhance the quality of the word embeddings as well as the sentiment lexicon. Experiments on the SemEval 2013-2016 datasets indicate that the sentiment lexicon generated by our approach achieves state-of-the-art performance in both supervised and unsupervised sentiment classification, in comparison with several strong sentiment lexicon construction methods.


Introduction
A sentiment lexicon is a set of words (or phrases), each of which is assigned a sentiment polarity score. Sentiment lexicons play an important role in many practical sentiment analysis and opinion mining tasks. There are some manually annotated universal sentiment lexicons such as General Inquirer (GI) and HowNet. However, due to ubiquitous domain diversity and the absence of domain prior knowledge, the automatic construction of domain-specific sentiment lexicons has become a challenging research topic in the field of sentiment analysis and opinion mining (Wang and Xia, 2016).
Early work employed unsupervised learning for sentiment lexicon construction. These methods normally labelled a set of seed words first, and then learned the polarity of each candidate word based on either word conjunction relations (e.g., coordination and transition in texts) (Hatzivassiloglou and McKeown, 1997) or word co-occurrence information (such as pointwise mutual information, PMI) between the candidate word and the seed words (Turney, 2002). However, the unsupervised manner showed limited effect in sentiment prediction, and its performance greatly depends on the quality of the seed words.
To fully exploit the sentiment labeling information in texts, a series of supervised learning methods was further proposed to learn sentiment lexicons. For example, Mohammad et al. (2013) proposed to construct sentiment lexicons by calculating the PMI between each word and distantly supervised sentiment labels (such as emoticons) in tweets, yielding the word's sentiment orientation (SO). The resulting lexicons obtained the best results in SemEval 2013. More advanced representation learning models were also utilized, with the aim of constructing sentiment lexicons from effective word embeddings (Tang et al., 2014a; Hamilton et al., 2016; Vo and Zhang, 2016). Traditional representation learning frameworks such as Word2Vec only capture the syntactic information in texts but ignore the sentiment relations between words. Therefore, some researchers attempted to add sentiment supervision to the network structure in order to train sentiment-aware word embeddings. For example, Tang et al. (2014a) exploited a dedicated neural architecture to integrate document-level sentiment supervision and syntactic knowledge for representation learning. The sentiment-aware word embedding is then used to construct a sentiment lexicon. Vo and Zhang (2016) proposed to learn a two-dimensional sentiment representation based on a simple neural network. The sentiment lexicons generated by their approach obtained better performance in predicting tweet sentiment labels, in comparison with the PMI-based method (Mohammad et al., 2013).
Although these supervised learning methods can to some extent exploit the sentiment labeling information in texts and learn a sentiment-aware word embedding, relying on document-level sentiment supervision alone suffers from complex linguistic phenomena such as negation, transition and comparative degree, and hence fails to capture the fine-grained sentiment information in the text. For example, in the following tweet "Four more fake people added me. Is this why people don't like Twitter? :( ", the document-level sentiment label is negative, but there is a positive word "like" in the text. In representation learning, the embeddings of the words are summed up to represent the document, so the word "like" will be falsely associated with the negative sentiment label. Such linguistic phenomena occur frequently in review texts and make sentiment-aware word representation learning less effective. To address this problem, in this paper, we propose a new representation learning framework called HSSWE, which learns sentiment-aware word embeddings based on hierarchical sentiment supervision. In HSSWE, the learning algorithm is supervised under both document-level sentiment labels and word-level sentiment annotations (e.g., labeling "like" as a positive word). By leveraging the sentiment supervision at both the document and word levels, our approach avoids the sentiment learning flaws caused by coarse-grained document-level supervision through incorporating fine-grained word-level supervision, and improves the quality of the sentiment-aware word embedding. Finally, following Tang et al. (2014a), a simple classifier is constructed to obtain the domain-specific sentiment lexicon, using the word embeddings as inputs.
The main contributions of this work are as follows: 1. To the best of our knowledge, this is the first work that learns the sentiment-aware word representation under supervision at both document and word levels.
2. Our approach supports several kinds of word-level sentiment annotations, including 1) a predefined sentiment lexicon; 2) the PMI-SO lexicon with hard sentiment annotation; 3) the PMI-SO lexicon with soft sentiment annotation. By using the PMI-SO dictionary as word-level sentiment annotation, our approach is entirely corpus-based, without any external resource.
3. Our approach obtains the state-of-the-art performance in comparison with several strong sentiment lexicon construction methods, on the benchmark SemEval 2013-2016 datasets for twitter sentiment classification.

Related Work
In general, sentiment lexicon construction methods can be classified into two categories: dictionary-based methods and corpus-based methods. Dictionary-based methods generally integrate predefined resources, such as WordNet, to construct sentiment lexicons. Hu and Liu (2004) exploited WordNet for sentiment lexicon construction. They first labelled two sets of seed words by polarity, then extended the sets by adding the synonyms of each word to the same set and its antonyms to the other. For a given new word, Kim and Hovy (2004) introduced a Naive Bayes model to predict the polarity, with the synonym set obtained from WordNet as features. Kamps et al. (2004) investigated a graph-theoretic model of WordNet's synonymy relation and measured the sentiment orientation by the distance between each candidate word and the seed words of different polarities. Heerschop et al. (2011) proposed a method to propagate the sentiment of the seed words through the semantic relations of WordNet.
Corpus-based approaches originate from the latent relation hypothesis: "Pairs of words that co-occur in similar patterns tend to have similar semantic and sentiment relations" (Turney, 2008).
The primary corpus-based method made use of PMI. Turney (2002) built a sentiment lexicon by calculating the PMI between the candidate word and seed words. The difference of the PMI scores between the positive and negative seed words is then used as the sentiment orientation (SO) of each candidate word (Turney, 2002). Many variants of PMI were proposed afterwards, for example, positive pointwise mutual information (PPMI) and second-order co-occurrence PMI (SOC-PMI). Hamilton et al. (2016) proposed to build a sentiment lexicon by a propagation method, the key of which is to build a lexical graph by calculating the PPMI between words. Instead of calculating the PMI between words, Mohammad et al. (2013) proposed to use emoticons as distant supervision and calculate the PMI between words and the distant class labels, and obtained sound performance for tweet sentiment classification.
The latest corpus-based approaches normally utilize up-to-date machine learning models (e.g., neural networks) to first learn a sentiment-aware distributed representation of words, based on which the sentiment lexicon is then constructed. There are many word representation learning methods such as NNLM (Bengio et al., 2003) and Word2Vec (Mikolov et al., 2013). However, they mainly consider the syntactic relations of words in context but ignore the sentiment information. Later work addressed this problem by incorporating sentiment information during representation learning. For example, Tang et al. (2014a) adapted a variant of the skip-gram model that can learn sentiment information based on distant supervision. Furthermore, Tang et al. (2014b) proposed a new neural network approach called SSWE to train sentiment-aware word representations. Vo and Zhang (2016) exploited a simple and fast neural network to train a 2-dimensional representation in which each dimension is explicitly associated with a sentiment polarity.
The sentiment-aware word representations in these methods were normally trained based only on document-level sentiment supervision. In contrast, the learning algorithm in our approach is supervised under both document-level and word-level sentiment supervision.

Our Approach
Our approach is comprised of three base modules: (1) Word-level sentiment learning and annotation; (2) Sentiment-aware word embedding learning; (3) Sentiment lexicon construction.
Our approach depends on document-level sentiment labels. Tweet corpora provide a cheap way to get document-level sentiment annotation, owing to distant sentiment supervision. But it should be noted that our approach is applicable to any corpus provided with document-level sentiment labels (not merely tweets).
The first module of our method aims to learn the pseudo sentiment distribution for each word and use it as word-level sentiment annotations to supervise word embedding learning.
In the second module, we learn the sentiment-aware embeddings for each word in the corpus, based on hierarchical sentiment supervision.
In the last module, we construct a sentiment lexicon by using the sentiment-aware word embeddings as the basis.

Learning Word-Level Sentiment Supervision
In addition to using a pre-defined sentiment lexicon for word-level annotations, we also propose to learn the word-level sentiment supervision, based on PMI and SO.
(1) PMI and SO
Given a corpus with document-level class labels, we first compute the PMI score between each word t and the two class labels:

PMI(t, c) = log [ p(t, c) / ( p(t) p(c) ) ],  c ∈ {+, −}

where + and − denote the positive and negative document-level class labels, respectively.
Second, we compute the SO score for each word t:

SO(t) = PMI(t, +) − PMI(t, −)

We call {t, SO(t)} the PMI-SO dictionary. The PMI-SO dictionary has been widely used as a corpus-based sentiment lexicon for sentiment classification. By contrast, in our approach it is the first step toward learning the sentiment-aware word representation. Our approach supports two kinds of word-level sentiment annotations: 1) the PMI-SO dictionary with hard sentiment annotation; 2) the PMI-SO dictionary with soft sentiment annotation.
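The PMI-SO dictionary can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes tokenized documents with binary '+'/'−' labels, counts per-class document frequencies, and applies add-one smoothing (an assumption; the paper does not specify smoothing). It also uses the fact that the corpus-size normalizers cancel in the PMI difference.

```python
import math
from collections import Counter

def pmi_so(docs, labels, smoothing=1.0):
    """Compute SO(t) = PMI(t, +) - PMI(t, -) for every word t.
    docs: list of token lists; labels: parallel list of '+' / '-'."""
    count = {'+': Counter(), '-': Counter()}
    for tokens, y in zip(docs, labels):
        count[y].update(set(tokens))      # document frequency per class
    n_pos = sum(1 for y in labels if y == '+')
    n_neg = len(labels) - n_pos
    so = {}
    for t in set(count['+']) | set(count['-']):
        fp = count['+'][t] + smoothing    # smoothed frequency in + docs
        fn = count['-'][t] + smoothing    # smoothed frequency in - docs
        # The shared normalizers cancel in PMI(t,+) - PMI(t,-), leaving:
        so[t] = math.log((fp * n_neg) / (fn * n_pos))
    return so
```

Words concentrated in positive documents receive SO(t) > 0, words concentrated in negative documents SO(t) < 0, and evenly distributed words stay near 0.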
(2) PMI-SO lexicon with hard sentiment annotation

Table 1: Notation used in our neural network.
p(c|e): the sentiment distribution of word t predicted by our model
p(c|de): the sentiment distribution of document d predicted by our model
p̂(c|t): the word-level sentiment annotation of word t with respect to class c
p̂(c|d): the document-level sentiment annotation of document d with respect to class c

"Hard sentiment annotation" indicates that [p̂(−|t), p̂(+|t)] is a two-dimensional one-hot representation, where the annotation of each word is given by its class label:

p̂(+|t) = 1, p̂(−|t) = 0 if SO(t) > 0;  p̂(+|t) = 0, p̂(−|t) = 1 otherwise.

(3) PMI-SO lexicon with soft sentiment annotation
"Soft sentiment annotation" means that the annotation is given by the probability of the two sentiment polarities, rather than by a class label. We first use the sigmoid function σ to map the SO score into the range of a probability:

p̂(+|t) = σ(SO(t)),  p̂(−|t) = 1 − p̂(+|t)

and then define [p̂(−|t), p̂(+|t)] as the PMI-SO soft sentiment distribution of the word t.
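The two annotation schemes can be illustrated with a small sketch (the helper names `hard_annotation` and `soft_annotation` are ours, not the paper's; both return the pair [p̂(−|t), p̂(+|t)] from a word's SO score):

```python
import math

def hard_annotation(so_score):
    # One-hot [p(-|t), p(+|t)] from the sign of SO(t)
    return [0.0, 1.0] if so_score > 0 else [1.0, 0.0]

def soft_annotation(so_score):
    # Sigmoid maps SO(t) into a probability for the positive class
    p_pos = 1.0 / (1.0 + math.exp(-so_score))
    return [1.0 - p_pos, p_pos]
```

A word with SO(t) = 0 thus gets the soft annotation [0.5, 0.5], whereas the hard scheme is forced to commit to one class.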

Learning Sentiment-aware Word Representation under Hierarchical Sentiment Supervision
So far we have obtained both document-level and word-level sentiment annotations. In the next step, we propose a neural network framework to learn the sentiment-aware word representation by integrating the sentiment supervision at both the word and document granularities. We call this "hierarchical sentiment supervision". The architecture of our model is shown in Figure 1. We denote the corpus as D = {d_1, d_2, ..., d_N}, where N is the size of the corpus. Suppose d_k is the k-th document in D, and t_i represents the i-th word in a document d. The parameters used in our neural network are described in Table 1.
We construct an embedding matrix C ∈ R^{V×M}, of which each row represents the embedding of a word in the vocabulary, where V is the size of the vocabulary and M is the dimension of the word embeddings. We randomly initialize each element of C from a normal distribution.
(1) Word-Level Sentiment Supervision We use the word-level sentiment annotation [p̂(−|t), p̂(+|t)] provided in Section 3.1 to supervise word representation learning at the word level.
For each word t in document d, we map it to a continuous representation e from C and feed e into our model to predict the sentiment distribution of the input word:

p(c|e) = softmax(W e + b)

The cost function is defined as the average cross entropy that measures the difference between the sentiment distribution predicted by our model and the sentiment annotations at the word level:

f_word = − (1/T) Σ_t Σ_{c∈{+,−}} p̂(c|t) log p(c|e_t)

where T is the number of words in the corpus.
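A minimal NumPy sketch of the word-level objective, assuming the predictor is a single softmax layer over the word embedding (an assumption of this sketch; the paper's exact layer sizes are not reproduced here):

```python
import numpy as np

def word_level_loss(E, W, b, annotations):
    """Average cross-entropy between the predicted distributions
    softmax(W e_t + b) and the word-level annotations p^(c|t).
    E: (T, M) embeddings of the T corpus words; W: (M, 2); b: (2,);
    annotations: (T, 2) rows of [p^(-|t), p^(+|t)]."""
    logits = E @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)     # softmax rows
    return -np.mean(np.sum(annotations * np.log(probs + 1e-12), axis=1))
```

With untrained (zero) parameters the prediction is uniform, so the loss starts at log 2 for two classes and decreases as W and the embeddings are fitted to the annotations.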
(2) Document-Level Sentiment Supervision We use the document-level sentiment annotations to supervise word representation learning at the document level.
In order to obtain a continuous representation of a document d, we simply use the average embedding of its words:

de = (1/n) Σ_{i=1}^{n} e_{t_i}

where t_i is the i-th word in d, e_{t_i} represents its embedding, and n is the number of words in d. We then feed de into our model to predict the sentiment probability of the document:

p(c|de) = softmax(W de + b)

Each word embedding in d serves as an input for predicting word-level sentiment polarities, while de is taken as input once per epoch to predict the sentiment of document d.
Similarly, the cost function is defined as the average cross entropy that measures the difference between the sentiment distribution predicted by our model and the sentiment annotation at the document level:

f_doc = − (1/N) Σ_{k=1}^{N} Σ_{c∈{+,−}} p̂(c|d_k) log p(c|de_k)

where p̂(c|d_k) is the sentiment annotation of document d_k: p̂(c|d_k) = 1 if c is the class label of d_k, and p̂(c|d_k) = 0 otherwise.
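The document-level objective mirrors the word-level one, with the mean word embedding standing in for the document (again assuming a single shared softmax layer, which this sketch does not claim matches the paper's exact architecture):

```python
import numpy as np

def doc_level_loss(doc_word_embeddings, W, b, doc_annotations):
    """Average cross-entropy between softmax(W de_k + b) and the
    document annotations p^(c|d_k), where de_k is the mean word
    embedding of document k.
    doc_word_embeddings: list of (n_k, M) arrays; W: (M, 2); b: (2,);
    doc_annotations: (N, 2) one-hot rows [p^(-|d), p^(+|d)]."""
    des = np.stack([E.mean(axis=0) for E in doc_word_embeddings])  # (N, M)
    logits = des @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return -np.mean(np.sum(doc_annotations * np.log(probs + 1e-12), axis=1))
```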

(3) Word and Document-Level Joint Learning
In order to learn the sentiment-aware word representation at both the word and document levels, we integrate the cost functions of the two levels in a weighted combination. The final cost function is defined as follows:

f = α f_word + (1 − α) f_doc

where α is a tradeoff parameter (0 ≤ α ≤ 1). The weight of f_word can be increased by choosing a larger value of α.
We train our neural model with stochastic gradient descent and use AdaGrad (Duchi et al., 2011) to update the parameters.
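A single AdaGrad step (Duchi et al., 2011) can be sketched as follows, using the learning rate of 0.3 reported in our settings; `adagrad_update` is an illustrative helper, not the authors' code:

```python
import numpy as np

def adagrad_update(theta, grad, cache, lr=0.3, eps=1e-8):
    """One AdaGrad step: each parameter's step is scaled by the square
    root of its accumulated squared gradients, so frequently updated
    parameters receive smaller effective learning rates."""
    cache = cache + grad ** 2
    theta = theta - lr * grad / (np.sqrt(cache) + eps)
    return theta, cache
```

In training, the gradient of the joint cost f with respect to the embeddings and softmax parameters would be fed through this update each mini-batch.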

From Sentiment Representation to Sentiment Lexicon
In this part, we follow the method proposed by Tang et al. (2014a) and build a classifier to convert the sentiment-aware word representation learned in Section 3.2 into a sentiment lexicon. The word representation is the input of the classifier and the word sentiment polarity is the output.
Firstly, we utilize the embeddings of the 125 positive and 109 negative seed words manually labelled by Tang et al. (2014a) as training data 1 .
Then, a traditional logistic regression classifier is trained by using the embeddings of the extended sentiment words as inputs. The sentiment score of a word is the difference between its positive and negative probabilities.
Finally, the sentiment lexicon is collected by using the classifier to predict the sentiment scores of the remaining words.
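The steps above can be sketched with a self-contained logistic regression (plain NumPy here so the sketch stays dependency-free; the seed words and training details are placeholders, not the actual 125/109 seed sets):

```python
import numpy as np

def build_lexicon(emb, seed_pos, seed_neg, vocab, lr=0.5, epochs=500):
    """Train a small logistic-regression classifier on the embeddings
    of the seed words, then score every word as p(+|w) - p(-|w)."""
    X = np.stack([emb[w] for w in seed_pos + seed_neg])
    y = np.array([1.0] * len(seed_pos) + [0.0] * len(seed_neg))
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):                     # batch gradient descent
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted p(+|word)
        g = p - y                               # gradient w.r.t. the logit
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    scores = {}
    for t in vocab:
        p_pos = 1.0 / (1.0 + np.exp(-(emb[t] @ w + b)))
        scores[t] = 2.0 * p_pos - 1.0           # p(+|t) - p(-|t)
    return scores
```

Words whose embeddings resemble the positive seeds receive scores near +1, and words resembling the negative seeds receive scores near −1.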

Datasets and Settings
We utilize the public distant-supervision corpus 2 (Go et al., 2009) to learn our lexicons. We set M, the dimension of the embeddings, to 50. The learning rate for the stochastic gradient descent optimizer is 0.3. We tune the hyper-parameter α during training.
We evaluate the sentiment lexicons in both supervised and unsupervised sentiment classification tasks, on the SemEval 2013-2016 datasets. The statistics of the evaluation datasets are shown in Table 2.
Supervised Sentiment Classification Evaluation: To evaluate the effect of the sentiment lexicon in supervised sentiment classification, we report the supervised sentiment classification performance using a set of pre-defined lexicon features. We follow Mohammad et al. (2013) and extract the lexicon features as follows:
• the count of words in the tweet whose score is greater than 0;
• the count of words in the tweet whose score is less than 0;
• the sum of the scores of all words with score greater than 0;
• the sum of the scores of all words with score less than 0;
• the maximal score greater than 0;
• the minimal score less than 0;
• the non-zero score of the last positive word in the tweet;
• the non-zero score of the last negative word in the tweet.

2 http://help.sentiment140.com/for-students
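The eight features can be computed as below (a sketch; `lexicon` is assumed to map each word to its sentiment score, with out-of-lexicon words scored 0):

```python
def lexicon_features(tokens, lexicon):
    """The eight lexicon features of Mohammad et al. (2013) for one tweet."""
    scores = [lexicon.get(t, 0.0) for t in tokens]
    pos = [v for v in scores if v > 0]
    neg = [v for v in scores if v < 0]
    return [
        len(pos),                  # count of positive-score words
        len(neg),                  # count of negative-score words
        sum(pos),                  # sum of positive scores
        sum(neg),                  # sum of negative scores
        max(pos) if pos else 0.0,  # maximal positive score
        min(neg) if neg else 0.0,  # minimal negative score
        pos[-1] if pos else 0.0,   # score of the last positive word
        neg[-1] if neg else 0.0,   # score of the last negative word
    ]
```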
We report the performance of an SVM using these lexicon features. The LIBSVM 3 toolkit is used with a linear kernel, and the penalty parameter is set to its default value. The evaluation metric is the F1 score.
Unsupervised Sentiment Classification Evaluation: For unsupervised sentiment classification, we sum up the scores of all sentiment words in the document, according to the sentiment lexicon. If the sum is greater than 0, the document will be considered as positive, otherwise negative. The unsupervised learning evaluation metric is accuracy.
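This decision rule is a one-liner over the same lexicon mapping (out-of-lexicon words again assumed to score 0):

```python
def classify(tokens, lexicon):
    """Sum the lexicon scores of all words; positive iff the sum > 0."""
    total = sum(lexicon.get(t, 0.0) for t in tokens)
    return 'positive' if total > 0 else 'negative'
```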

(External) Comparison with Public Lexicons
We compare our HSSWE method with four sentiment lexicons generated by related work proposed in recent years:
• Sentiment140 was constructed by Mohammad et al. (2013) on a tweet corpus, based on the PMI between each word and the emoticons.
• HIT was constructed by Tang et al. (2014a) with a representation learning approach.
• NN was constructed by Vo and Zhang (2016) with a neural network method.
Note that Tang et al. (2014a) and Vo and Zhang (2016) used an incomplete SemEval 2013 dataset in their papers. For fair comparison, we conduct the evaluation on the complete datasets.

Supervised Sentiment Classification: We first report the supervised sentiment classification F1 scores of the five compared methods on the SemEval 2013-2016 datasets in Table 3. It can be seen that our HSSWE method achieves the best result on all four datasets. It outperforms Sentiment140, HIT, NN and ETSL by 1.7, 2.8, 1.9, and 3.2 percentage points on average over the four datasets. The improvements are significant according to the paired t-test.
Unsupervised Sentiment Classification: We then report the unsupervised sentiment classification accuracy of the five methods on the SemEval 2013-2016 datasets in Table 4. It can be seen that HSSWE obtains the best performance on SemEval 2013-2015. On the SemEval 2016 dataset, it is slightly lower than ETSL. Across the four datasets, the average accuracy of HSSWE is 6.6, 3.1, 9.6 and 0.94 percentage points higher than Sentiment140, HIT, NN and ETSL, respectively.

(Internal) Comparison within the Model
In order to further verify the effectiveness of our method and analyze which part of our model contributes the most, we carry out an internal comparison within our model. We design the following two simplified versions of our model for comparison:
• PMI-SO denotes the PMI-SO based sentiment lexicon with soft sentiment annotation learned in Section 3.1.
• Doc-Sup denotes the neural network trained with only document-level sentiment supervision. It is equivalent to HSSWE when α = 0.
Actually, HSSWE can be viewed as a "combination" of PMI-SO and Doc-Sup. In Tables 5 and 6, we report the comparison results on supervised and unsupervised sentiment classification, respectively.
Supervised Sentiment Classification: As shown in Table 5, the two basic models, PMI-SO and Doc-Sup, show similar overall performance, each being superior on different datasets, but both are significantly lower than HSSWE. This shows that combining the supervision at both the document and word levels can indeed improve the quality of the sentiment-aware word embedding and of the subsequent sentiment lexicon.
Unsupervised Sentiment Classification: As shown in Table 6, the conclusions are similar to those in supervised sentiment classification: HSSWE achieves significantly better performance.

Word-level Sentiment Annotation: Hard vs. Soft
In Section 3.1, we introduced two kinds of word-level sentiment annotation, i.e., soft and hard sentiment annotation. We now compare the two methods. The results are reported in Tables 5 and 6.

Tuning the Parameter α
In this section, we discuss the tradeoff between the two parts of the supervision by tuning the tradeoff parameter α. When α is 0, HSSWE benefits only from the document-level sentiment supervision, and when α is 1, HSSWE benefits only from the word-level sentiment supervision. We observe that HSSWE performs better when α is in the range of [0.45, 0.55]. By integrating the two components of sentiment supervision, HSSWE shows significant superiority over models learned from either one alone.

Lexicon Analysis
In order to gain more insight into our model and observe the effectiveness of the sentiment lexicon, in Table 7 we list the positive sentiment scores of some representative words learned by the different methods. The positive scores are expected to satisfy: best > better > well. HSSWE captures such comparative sentiment strength but PMI-SO does not. We further observe that in many cases where the results of PMI-SO and Doc-Sup are inconsistent (e.g., Doc-Sup incorrectly predicts "unreasonable", "boreddddd" and "sickkk" as positive words, but PMI-SO predicts them correctly; PMI-SO incorrectly predicts "fit" but Doc-Sup predicts it correctly), HSSWE often yields the correct result. This shows the advantage of hierarchical sentiment supervision. HSSWE can also correct the sentiment prediction in cases where both PMI-SO and Doc-Sup fail (e.g., "overplayed").

Conclusion
In this paper, we proposed to construct sentiment lexicons based on a sentiment-aware word representation learning approach. In contrast to traditional methods, which normally learn based on only document-level sentiment supervision, we proposed word representation learning via hierarchical sentiment supervision, i.e., under supervision at both the word and document levels. The word-level supervision can be provided either by predefined sentiment lexicons or by the learned PMI-SO based sentiment annotation of words. A wide range of experiments was conducted on several benchmark sentiment classification datasets. The results indicate that our method is quite effective for sentiment-aware word representation, and the sentiment lexicon generated by our approach beats state-of-the-art sentiment lexicon construction approaches.