Limbic: Author-Based Sentiment Aspect Modeling Regularized with Word Embeddings and Discourse Relations

We propose Limbic, an unsupervised probabilistic model that addresses the problem of discovering aspects and sentiments and associating them with authors of opinionated texts. Limbic combines three ideas, incorporating authors, discourse relations, and word embeddings. For discourse relations, Limbic adopts a generative process regularized by a Markov Random Field. To promote words with high semantic similarity into the same topic, Limbic captures semantic regularities from word embeddings via a generalized Pólya Urn process. We demonstrate that Limbic (1) discovers aspects associated with sentiments with high lexical diversity; (2) outperforms state-of-the-art models by a substantial margin in topic cohesion and sentiment classification.


Introduction
How can we understand opinionated texts, e.g., social media postings, expressing sentiments about various entities? Three phenomena are key. First, even for similar entities, authors may differ both on aspects and sentiments about those aspects. For example, when reviewing a hotel, Alice may consider aspects such as Concierge and Room, whereas Bob may consider aspects such as Nearby and Room. Capturing similarities and differences among authors can help produce recommendations for services that are better aligned with a user's expectations (Wang et al., 2013). Second, reviews exhibit discourse structure, i.e., relations between propositions, which carries valuable information about sentiment. Third, crucial relationships between rare words are lost because each review may be short and use distinct rare words.
Word co-occurrence sparsity plagues existing approaches, which model documents as distributions over latent topics and estimate them from word co-occurrence. Since word frequency follows a power law, most words are rare and representative words of a topic rarely co-occur, especially in short opinionated texts, despite semantic proximity. For example, a reviewer would not use both spotless and immaculate to express a positive sentiment toward the cleanliness of a hotel room. Losing information about word relatedness impedes learning effectiveness, producing topics that are not semantically cohesive.
We contribute Limbic, an unsupervised probabilistic model for discovering author-based aspects and sentiments from opinionated texts that incorporates discourse-level topic modeling and semantic cohesion. (1) It associates authors and sentiment-aspect pairs by generating a mixture over sentiments and aspects for each author. (2) It captures discourse relations by applying a Markov Random Field over Sentiment Expression Units (SEUs), i.e., text elements describing sentiment-aspect pairs. (3) It promotes words with high semantic similarity into the same topic by incorporating semantic regularities from word embeddings using a generalized Pólya Urn process.
We empirically compare Limbic with state-of-the-art models using datasets from two domains. Qualitatively, Limbic discovers aspect-sentiment pairs with higher lexical diversity. Quantitatively, Limbic obtains substantial improvements in topic cohesion and sentiment classification.

Model and Inference in Limbic
We now introduce our proposed model.

Sentiment Expression Unit (SEU)
Existing topic models represent documents as bags of words or as sentences. Bag-of-words models, e.g., LDA (Blei et al., 2003), AT (Rosen-Zvi et al., 2004), JST (Lin et al., 2012), JAST (Mukherjee et al., 2014), and AATS (Poddar et al., 2017), rely on word co-occurrence at the document level, which is problematic when applied to opinionated texts. Sentence-based models, e.g., ASUM (Jo and Oh, 2011), assume that words appearing in a sentence belong to the same aspect and sentiment, which often fails to hold in real text. For instance, the TripAdvisor review sentence "Service was good and friendly, location is good and my room was spacious but oldish" exhibits three aspects, Service, Location, and Room, and two sentiments. Zhang and Singh's (2014) segmentation algorithm leverages transition cues to convert sentences into segments. Although transition cues are good indicators of sentiment change, their algorithm disregards syntactic information in sentences, which also helps reveal changes of aspects and sentiments.
We propose the concept of a sentiment expression unit (SEU). Each SEU contains a sentiment, an aspect, or both. We extract SEUs by incorporating both discourse and syntactic information. We first split sentences in reviews into snippets based on contradiction transition cues, such as but. Then we apply a grammar parser to each snippet. We extract phrases from snippets by using two syntactic patterns commonly observed in opinionated texts: (1) existential (EX) with verb (VB) and adjective (JJ) and (2) noun (NN) with verb (VB) and adjective (JJ). If a phrase matches a pattern, we identify it as an SEU. Otherwise, the phrase joins its following phrases iteratively until the combination matches a pattern. Figure 1 demonstrates the process of generating SEUs from the above hotel review sentence.
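The extraction steps above can be illustrated with a minimal sketch. It assumes the input is already tokenized and POS-tagged by an external parser, and that the parser has already grouped words into phrases. The function names, the one-word cue list, and the handling of leftover words that never match a pattern are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of SEU extraction; tags are assumed to come from an external tagger.
CONTRAST_CUES = {"but"}  # assumed cue list; the text names "but" as one example


def matches_pattern(tagged):
    """True if the phrase has (EX or NN*) together with a verb (VB*) and an adjective (JJ*)."""
    tags = [tag for _, tag in tagged]
    has_subject = any(t == "EX" or t.startswith("NN") for t in tags)
    has_verb = any(t.startswith("VB") for t in tags)
    has_adj = any(t.startswith("JJ") for t in tags)
    return has_subject and has_verb and has_adj


def split_on_cues(tagged):
    """Split a tagged sentence into snippets, starting a new snippet at each cue word."""
    snippets, current = [], []
    for word, tag in tagged:
        if word.lower() in CONTRAST_CUES and current:
            snippets.append(current)
            current = []
        current.append((word, tag))
    if current:
        snippets.append(current)
    return snippets


def merge_into_seus(phrases):
    """Join each non-matching phrase with the following phrases until a pattern matches."""
    seus, buffer = [], []
    for phrase in phrases:
        buffer = buffer + phrase
        if matches_pattern(buffer):
            seus.append(buffer)
            buffer = []
    if buffer:  # assumption: leftover words that never match stay as their own unit
        seus.append(buffer)
    return seus


# Phrases as a parser might return them for part of the example sentence (hand-tagged here).
phrases = [
    [("Service", "NN"), ("was", "VBD"), ("good", "JJ"), ("and", "CC"), ("friendly", "JJ")],
    [("location", "NN")],
    [("is", "VBZ"), ("good", "JJ")],
]
seus = merge_into_seus(phrases)
```

In this toy input, the phrase "location" does not match a pattern on its own, so it is joined with "is good" before being emitted as an SEU.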

Discourse Relation
A Markov Random Field (MRF) is a probabilistic framework for modeling statistical dependencies between variables. Limbic applies an MRF to capture the discourse relations between SEUs. Given a document containing $N$ SEUs, let $a_i$ and $s_i$ be the aspect and sentiment assignments of SEU $i$, respectively. Limbic creates an undirected edge $(s_i, s_j)$ between the sentiment assignment of each SEU and that of its preceding SEU. Letting $r$ be the discourse relation between the two SEUs, Limbic imposes a binary potential on the edge.
Limbic focuses on two discourse relations frequently observed in opinionated texts: Comparison and Expansion. Comparison highlights prominent differences between two SEUs and often signals a change of sentiment regardless of the change of aspect. For example, in SEU 1 : {The location was great} and SEU 2 : {but it was just too noisy}, we see that but indicates a sentiment difference. Other transition cues for Comparison include however, in contrast, and such.
Expansion extends the discourse and indicates a continuation of sentiment across SEUs. For example, in SEU 3 : {There are no safes here which is unfortunate} and SEU 4 : {And speaking of unfortunate, the breakfast is hardly impressive}, we see that and and unfortunate indicate the negative sentiment in SEU 3 continues toward aspect Breakfast in SEU 4 . Other transition cues for Expansion include also, moreover, and such.
Formally, $R_{r,i,j}$ asserts discourse relation $r$ between SEU $i$ and SEU $j$. For Comparison, $R_{c,i,j}$ holds if $s_i \neq s_j$, SEU $j$ contains Comparison cues, and (1) SEU $j$ contains the syntactic patterns described in Section 2.1 and $a_i \neq a_j$ or (2) SEU $j$ contains incomplete syntactic patterns and $a_i = a_j$.
For Expansion, $R_{e,i,j}$ holds if $s_i = s_j$, SEU $j$ contains Expansion cues, and (1) SEU $j$ contains the syntactic patterns and $a_i \neq a_j$ or (2) SEU $j$ contains incomplete syntactic patterns and $a_i = a_j$.
Given document $d$, the joint probability of its sentiment assignments is given by Equation 1, where $R$ is the number of discourse relation types; $\theta_d$ is the sentiment distribution of $d$; $I$ is an indicator function that returns 1 if its argument is true and 0 otherwise; and $\lambda$ controls how strongly the effects of discourse relations are reinforced.
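The effect of the discourse term can be illustrated with a small sketch. It assumes the term contributes a multiplicative factor exp(λ · k), where k counts the relations satisfied between each SEU and its preceding SEU. That form, the function names, and the omission of the aspect conditions are assumptions made for illustration only.

```python
import math


def relation_holds(relation, s_prev, s_cur):
    """Indicator for the sentiment part of a relation (aspect conditions are omitted)."""
    if relation == "comparison":
        return s_prev != s_cur  # Comparison signals a change of sentiment
    if relation == "expansion":
        return s_prev == s_cur  # Expansion signals a continuation of sentiment
    return False  # no discourse cue between the two SEUs


def discourse_potential(sentiments, relations, lam):
    """Assumed potential: exp(lam * number of satisfied relations) over consecutive SEUs.

    relations[i] is the cue-derived relation between SEU i-1 and SEU i
    (relations[0] is unused), or None when no cue is present.
    """
    satisfied = sum(
        relation_holds(relations[i], sentiments[i - 1], sentiments[i])
        for i in range(1, len(sentiments))
    )
    return math.exp(lam * satisfied)


# "The location was great" / "but it was just too noisy"
consistent = discourse_potential(["pos", "neg"], [None, "comparison"], lam=1.0)
inconsistent = discourse_potential(["pos", "pos"], [None, "comparison"], lam=1.0)
```

Under this assumed form, the assignment that respects the Comparison cue receives a larger weight than the one that ignores it, and λ controls the size of the gap.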
Take the Expansion relation, for example. During the sampling process, Equation 1 generates a large value if two SEUs share an Expansion relation and have the same sentiment, and yields a small value if the two SEUs have different sentiments. Therefore, SEUs in an Expansion relation have a high probability of being associated with the same sentiment.
Figure 2 shows Limbic's model. With Dir(·) and Mul(·) as Dirichlet and multinomial distributions, hyperparameter $\alpha$ is the Dirichlet prior of the word distribution $\phi$, $\beta$ is the Dirichlet prior of the sentiment distribution $\theta$, and $\gamma$ is the Dirichlet prior of the aspect distribution $\psi$. Given a set of reviews $D$ written by a set of authors $U$ with regard to a set of aspects $T$ and a set of sentiments $S$, the generative process in Limbic is as follows. First, for each pair of aspect $t$ and sentiment $s$, draw a word distribution $\phi_{t,s} \sim$ Dir($\alpha$). Second, for each author $a$ and each sentiment $s$, draw an aspect distribution $\psi_{s,a} \sim$ Dir($\gamma$). Third, given a review $d$ written by $a$, draw a sentiment distribution $\theta_d \sim$ Dir($\beta$), and for each SEU in $d$, (a) choose a sentiment $s$ using Equation 1; (b) given $s$, choose an aspect $t \sim$ Mul($\psi_{s,a}$); (c) given $t$ and $s$, sample word $w \sim$ Mul($\phi_{t,s}$).
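The generative process can be sketched in a few lines of standard-library Python. This is a simplified illustration under stated assumptions: sentiments are drawn from θ_d alone (the discourse term of Equation 1 is left out), priors are symmetric, and the number of words per SEU (`words_per_seu`) is a hypothetical parameter that the text does not specify.

```python
import random


def dirichlet(alpha, size, rng):
    """Draw from a symmetric Dirichlet(alpha) of the given size via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_review(author, n_seus, words_per_seu, phi, psi, beta, rng):
    """Generate (sentiment, aspect, word ids) triples for one review by the given author."""
    num_sentiments = len(psi[author])
    theta = dirichlet(beta, num_sentiments, rng)  # theta_d ~ Dir(beta)
    review = []
    for _ in range(n_seus):
        # (a) sentiment; simplification: drawn from theta_d without the discourse term
        s = rng.choices(range(num_sentiments), weights=theta)[0]
        # (b) aspect t ~ Mul(psi_{s,a})
        t = rng.choices(range(len(psi[author][s])), weights=psi[author][s])[0]
        # (c) words w ~ Mul(phi_{t,s})
        words = rng.choices(range(len(phi[t][s])), weights=phi[t][s], k=words_per_seu)
        review.append((s, t, words))
    return review


rng = random.Random(0)
S, T, W = 2, 3, 10  # toy sizes: sentiments, aspects, vocabulary
authors = ["alice", "bob"]
phi = [[dirichlet(0.05, W, rng) for _ in range(S)] for _ in range(T)]      # phi_{t,s} ~ Dir(alpha)
psi = {a: [dirichlet(50.0 / T, T, rng) for _ in range(S)] for a in authors}  # psi_{s,a} ~ Dir(gamma)
review = generate_review("alice", n_seus=4, words_per_seu=5, phi=phi, psi=psi, beta=5.0, rng=rng)
```

All words within an SEU share one sentiment and one aspect, which is what distinguishes this process from a bag-of-words model.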

Model Inference
Limbic estimates $p(s, t \mid w, a)$, the posterior distribution of the latent variables, sentiments $s$ and aspects $t$, given all words used in reviews written by author $a$. We factor the joint probability of the assignments of sentiments, aspects, and words for $a$:
$$p(s, t, w \mid a, \alpha, \beta, \gamma) = p(w \mid s, t, \alpha)\, p(t \mid s, a, \gamma)\, p(s \mid \beta). \quad (2)$$
Integrating over the word distributions $\phi_{t,s}$ gives the first term of Equation 2, in which $W$ is the size of the vocabulary; $n^{w}_{s,t}$ is the number of occurrences of word $w$ that are assigned to sentiment $s$ and aspect $t$; and $\Gamma(\cdot)$ is the Gamma function.
Next, integrating over $\Psi_a = \{\psi_i\}_{i=1}^{S}$ gives the second term of Equation 2, in which $n^{t}_{s,a}$ is the number of SEUs in author $a$'s reviews associated with sentiment $s$ and aspect $t$.
Similarly, integrating over the sentiment distributions $\theta_d$ gives the third term of Equation 2, in which $D$ is the number of reviews; $n^{s}_{d}$ is the number of times that an SEU from review $d$ is associated with sentiment $s$; $n_d$ is the number of SEUs in review $d$; and $L$ is the number of SEUs.
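The first two integrals have the standard Dirichlet-multinomial closed form. As a hedged sketch, assuming a positive per-word prior weight $\alpha_w$ (which may also depend on the sentiment when asymmetric priors are used) and a symmetric $\gamma$, they can be written as below. The notation $\alpha_w$ and the product ranges are assumptions for illustration and may differ from the original typesetting. The third term also involves the discourse potential and is not sketched here.

```latex
p(\mathbf{w} \mid \mathbf{s}, \mathbf{t}, \alpha)
  = \prod_{s=1}^{S} \prod_{t=1}^{T}
    \frac{\Gamma\!\left(\sum_{w=1}^{W} \alpha_w\right)}{\prod_{w=1}^{W} \Gamma(\alpha_w)}
    \cdot
    \frac{\prod_{w=1}^{W} \Gamma\!\left(n^{w}_{s,t} + \alpha_w\right)}
         {\Gamma\!\left(\sum_{w=1}^{W} \left(n^{w}_{s,t} + \alpha_w\right)\right)}

p(\mathbf{t} \mid \mathbf{s}, a, \gamma)
  = \prod_{s=1}^{S}
    \frac{\Gamma(T\gamma)}{\Gamma(\gamma)^{T}}
    \cdot
    \frac{\prod_{t=1}^{T} \Gamma\!\left(n^{t}_{s,a} + \gamma\right)}
         {\Gamma\!\left(\sum_{t=1}^{T} n^{t}_{s,a} + T\gamma\right)}
```

Both expressions follow from integrating a multinomial likelihood against its Dirichlet prior, so the counts enter only through Gamma functions of count plus prior.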
We obtain the conditional probability for $a$ via Gibbs sampling (Liu, 1994), where $n^{t}_{s,a}$, as in Equation 4, is the number of SEUs associated with sentiment $s$ and aspect $t$ from reviews written by author $a$; $n^{s}_{d}$ is the number of SEUs from review $d$ associated with sentiment $s$; $C_i$ is the number of words in SEU $i$; $n^{v}_{t,s}$ is the number of words $v$ assigned sentiment $s$ and aspect $t$; $n_{t,s}$ is the number of words assigned sentiment $s$ and aspect $t$ in all reviews; an index of $-i$ means we exclude SEU $i$ from the count; and $R$, $I$, and $R_{r,i,j}$ are as in Equation 1.
Equations 7, 8, and 9, respectively, approximate the probabilities of word w occurring given sentiment s and aspect t; of aspect t of an SEU occurring given sentiment s and author a; of sentiment s occurring given document d.
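As an illustration of how such probabilities are typically approximated from Gibbs counts, the sketch below uses the usual smoothed-count (posterior mean) estimator with a symmetric prior. This is an assumed standard form, not necessarily the exact expressions of Equations 7 to 9, and `smoothed_estimate` is a hypothetical helper name.

```python
def smoothed_estimate(counts, prior):
    """Posterior-mean estimate (n_k + prior) / (sum_k n_k + K * prior) for K outcomes."""
    total = sum(counts) + len(counts) * prior
    return [(n + prior) / total for n in counts]


# Toy counts: SEUs in one review assigned to each of two sentiments, with beta = 5.
theta_d = smoothed_estimate([3, 1], prior=5.0)

# Toy counts: SEUs of one author and one sentiment assigned to each of three aspects, gamma = 50/3.
psi_sa = smoothed_estimate([10, 0, 2], prior=50.0 / 3)
```

The same helper applies to word counts per sentiment-aspect pair when the word prior is symmetric; with asymmetric word priors, the per-word prior values would replace the single `prior` argument.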
Incorporating Word Embeddings. Word embedding approaches (Mikolov et al., 2013; Pennington et al., 2014) leverage local contextual information surrounding words to map the words into continuous vector representations. Word embeddings are known to effectively capture semantic and syntactic regularities among words. Based on word embeddings trained using Word2Vec on a hotel review dataset, we observe that the generated word embeddings correctly link opinionated words that are semantically correlated even though they do not co-occur frequently. For example, the three closest words of spotless are immaculate, clean, and well appointed.
A Generalized Pólya Urn Process. To promote words with high semantic similarity into the same topic, Limbic incorporates semantic regularities from word embeddings using a generalized Pólya Urn process (Mimno et al., 2011). A Pólya Urn model describes a random sampling process with reinforcement. Start with an urn containing colored balls. At each time step, randomly choose a ball from the urn, observe its color, and return it to the urn with one replicated ball of the same color. In a generalized Pólya Urn process, given a sampled ball with a color, we put back that ball along with a certain number of balls of similar colors. When applied to document generation, balls of different colors represent distinct words, and the similarity of colors represents the semantic similarity of the words. Given words $v$ and $w$ in vocabulary $W$, we compute their semantic similarity sim($v$, $w$) as the cosine similarity between their word embeddings. For word $v$, we create its similarity word set $S_v$ by adding all words $w \in W$ for which sim($v$, $w$) is higher than a predefined threshold. During sampling, if word $v$ is drawn, we reinforce each $w \in S_v$ via a predefined weight $\rho$, which controls the reinforcement of semantically similar words.
Sentiment Alignment. Widely used word embedding approaches, such as Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and fastText (Bojanowski et al., 2017), are semantically oriented and do not explicitly encode sentiment information in the generated word-vector representations. Hence, semantically related words with opposite polarity may have close vectors. For example, smell and aroma are synonyms, but smell often expresses a negative sentiment toward aspect Cleanliness whereas aroma is often positive. Simply promoting all similar words may adversely affect the generated topics. Therefore, we calculate the sentiment alignment of each word in a vocabulary based on its average cosine similarity to the words in a general sentiment word list.
In the sampling process, we promote words only if their sentiments align with sampled sentiments.
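The reinforcement step can be sketched as follows. The toy two-dimensional embeddings, the function names, the choice to exclude a word from its own similarity set, and the `aligned` callback standing in for the sentiment-alignment check are all assumptions for illustration. The threshold 0.6 and weight 0.3 are simply example values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_sets(embeddings, threshold):
    """Map each word v to S_v, the other words whose cosine similarity exceeds the threshold."""
    return {
        v: {
            w
            for w in embeddings
            if w != v and cosine(embeddings[v], embeddings[w]) > threshold
        }
        for v in embeddings
    }


def gpu_update(counts, word, sim_sets, rho, aligned=lambda w: True):
    """Add 1 for the drawn word and rho for each similar word that passes the alignment check."""
    counts[word] = counts.get(word, 0.0) + 1.0
    for w in sim_sets[word]:
        if aligned(w):
            counts[w] = counts.get(w, 0.0) + rho
    return counts


# Toy embeddings (made up for the example, not trained vectors).
embeddings = {"spotless": [1.0, 0.0], "immaculate": [0.9, 0.1], "noisy": [0.0, 1.0]}
sim_sets = similarity_sets(embeddings, threshold=0.6)
counts = gpu_update({}, "spotless", sim_sets, rho=0.3)
```

Drawing spotless therefore also raises the count of immaculate by ρ, so the two words are pushed toward the same sentiment-aspect pair even if they never co-occur.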

Evaluation
To assess Limbic's effectiveness, we prepare online review datasets from two domains. TripUser is a collection of hotel reviews from TripAdvisor. It contains 28,165 reviews posted by 202 randomly selected reviewers, each of whom contributes at least 100 hotel reviews. YelpUser is a set of restaurant reviews from the Yelp Dataset Challenge (2017). It contains 23,873 restaurant reviews posted by 144 users, each of whom contributes at least 100 reviews. Table 1 reports statistics on the datasets. We remove stop words and HTML tags, expand typical abbreviations, and mark special named entities using a rule-based algorithm (e.g., replace a monetary amount by #MONEY#) and the Stanford Named Entity Recognizer (Finkel et al., 2005). To handle negation, we employ the Stanford Dependency Parser to detect negations. For any word in a negation relation, we add the negating term as a prefix of the word, e.g., not work. Finally, we split each review into SEUs. Datasets and source code are publicly available for research purposes (Limbic, 2018).

Parameter Settings
Limbic includes three hand-tuned hyperparameters that influence its sampling via a smoothing effect on the associated multinomial distribution. It uses a short list of sentiment words shown in Table 2 as prior knowledge to set asymmetric priors.
Consider hyperparameter α, the Dirichlet prior of the word distribution. For any word in the positive list, α = 0 if the word appears in an SEU assigned a negative sentiment, and α = 5 if the word appears in an SEU assigned a positive sentiment; the converse holds for words in the negative list. For all remaining words, we set α = 0.05. Hyperparameter β, the Dirichlet prior of the sentiment distribution, is set to 5 for both sentiments. With T as the number of aspects, hyperparameter γ = 50/T is the Dirichlet prior of the aspect distribution. We set the number of sentiments, S, to two (positive and negative), although our approach generalizes to additional sentiment categories.
For each fold in cross-validation, we pretrain two sets of Word2Vec (Mikolov et al., 2013) word embeddings with 300 dimensions and a window size of five, using the training splits of TripUser (hotels) and YelpUser (restaurants). We exclude words with frequency lower than three. We set the reinforcement weight ρ to 0.3 and 0.1 for hotel and restaurant reviews, respectively, and set the similarity threshold to 0.6. For all models, we perform 1,000 Gibbs iterations with a burn-in phase of 200 and a sampling gap of 50 iterations.
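The asymmetric word prior can be expressed as a small lookup. In the sketch below, the seed words are placeholders (not the entries of Table 2), the function names are hypothetical, and the aspect prior is taken to be 50 divided by the number of aspects.

```python
# Placeholder seed lists for illustration only; they are not the paper's Table 2.
POSITIVE_SEEDS = {"good", "great"}
NEGATIVE_SEEDS = {"bad", "dirty"}


def alpha_prior(word, sentiment):
    """Word prior: 5 when a seed word matches the sentiment, 0 when it conflicts, else 0.05."""
    if word in POSITIVE_SEEDS:
        return 5.0 if sentiment == "positive" else 0.0
    if word in NEGATIVE_SEEDS:
        return 5.0 if sentiment == "negative" else 0.0
    return 0.05


def gamma_prior(num_aspects):
    """Aspect prior, assumed here to be 50 divided by the number of aspects."""
    return 50.0 / num_aspects
```

A prior of zero keeps a seed word out of topics of the opposite sentiment, while all other words share a small uniform prior.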

Sentiment Aspect Discovery
Our first experiment shows how Limbic discovers sentiment-aspect pairs. We apply Limbic and all baseline models (AT, JST, ASUM, and AATS) to TripUser and YelpUser with the number of aspects set to 30. We manually assign an aspect for each cluster of words. ASUM generates the best results among baseline models. For brevity, we show only some aspects identified by Limbic and ASUM.
Table 3 (top) shows the results on hotel reviews. We see that Limbic discovers word clusters with higher lexical diversity than ASUM. For example, for aspect Decoration, in addition to the words decor, modern, and design, Limbic discovers the words contemporary, minimalist, chic, and so on. For aspect Service, compared with ASUM, Limbic extracts an expanded list of sentiment words including competent, knowledgeable, and so on.
Limbic discovers finer and more distinctive word clusters than ASUM. For example, for aspect Cleanliness, ASUM generates a word cluster that includes negative sentiment words toward multiple entities, such as carpet and hallway. Limbic generates two distinctive word clusters for aspect Cleanliness. One cluster contains words such as smoke and reek, which describe bad odor in rooms and hallways. The other cluster contains words such as worn and stain, which describe negative sentiments toward carpets. By capturing word semantic relatedness, Limbic discovers highly diverse aspects, including those that arise rarely in reviews; for example, peaceful, relaxing, and lush appear as positive words describing aspect Environment.
Limbic yields promising results for restaurant reviews. In Table 3 (bottom), we see that Limbic yields more specific sentiment words than ASUM. Aspect Service in Limbic contains additional positive words, such as efficient, prompt, and knowledgeable. For aspect Decoration, Limbic produces sleek, ambiance, and so on. By incorporating constraints from discourse relations, Limbic yields aspects that are more sentiment coherent. For example, we see that positive aspect Portion in ASUM contains the negative word small, whereas the words in aspect Portion in Limbic are all positive.
We observe that restaurants involve more complex aspects than hotels, presumably because of the large variety of cuisines and thus, on average, less data relevant to each cuisine. Titov and McDonald's (2008b) Multi-Grain LDA (MG-LDA) model performs well for hotel reviews but discovers only a few ratable aspects from restaurant reviews, which they ascribe to the relatively few occurrences of words describing aspects for specific cuisines (e.g., Italian) and general categories (e.g., Meat), compared with the words describing major aspects, such as Service. In contrast, Limbic discovers words describing specific cuisines, such as Mexican and Seafood.

Quantitative Evaluation
Whether topics (word clusters) are semantically cohesive is an important factor in assessing topic modeling approaches. Normalized Pointwise Mutual Information (NPMI) (Lau et al., 2014) has strong correlation with human-judged topic coherence ratings and is widely used for assessing topic modeling approaches (Nguyen et al., 2015a,b; (2015) propose a topic coherence measure, W2V, based on word embeddings. For completeness, we adopt both metrics. Topics with higher NPMI and W2V scores are semantically more coherent. We compare Limbic with four baselines, AT, JST, ASUM, and AATS, using both TripUser and YelpUser, based on the top 15 words in each sentiment-aspect pair. For each number of aspects, we perform five-fold cross-validation. We perform the two-tailed exact permutation test (Good, 2005) on the improvement of Limbic over the best performing baseline. (Throughout, * and † indicate significance at 0.05 and 0.001, respectively.)
Table 4 shows average NPMI and W2V scores of each model on hotel reviews for different numbers of aspects. We observe that Limbic significantly outperforms the other models on both metrics in all settings. Limbic yields substantial improvements, with average gains over the second best models of 6.00 and 0.18 in NPMI and W2V, respectively, which validates that the incorporation of semantic regularities helps Limbic promote semantically equivalent and related words into the same aspect-sentiment pair. Of the baseline models, AT yields the lowest topic coherence. AATS outperforms AT but does not perform well when the number of aspects is small, possibly due to the undesirable mixture of words with different aspects, topics, and sentiments in individual sentences. ASUM and JST yield comparable results that are consistently better than AATS. Table 5 shows similar conclusions for restaurant reviews.
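For readers unfamiliar with NPMI, the following sketch computes it from document-level co-occurrence and averages it over all pairs of a topic's top words. The reference corpus, the use of whole documents as co-occurrence windows, and the conventions for pairs that never (or always) co-occur are assumptions. The evaluation reported here may use different settings or scaling.

```python
import math
from itertools import combinations


def npmi(w1, w2, docs):
    """NPMI of two words, with probabilities estimated as document frequencies."""
    n = len(docs)
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum(w1 in d and w2 in d for d in docs) / n
    if p12 == 0:
        return -1.0  # convention: words that never co-occur get the minimum score
    if p12 == 1:
        return 1.0  # convention: avoids 0/0 when both words occur in every document
    return math.log(p12 / (p1 * p2)) / -math.log(p12)


def topic_npmi(top_words, docs):
    """Average NPMI over all unordered pairs of a topic's top words (needs at least two words)."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b, docs) for a, b in pairs) / len(pairs)


# Toy reference corpus: each document is a set of words.
docs = [{"clean", "spotless"}, {"clean", "spotless"}, {"noisy"}, {"noisy", "street"}]
score = topic_npmi(["clean", "spotless"], docs)
```

Scores lie in [-1, 1]; a topic whose top words always appear together scores near 1, and a topic mixing unrelated words scores near -1.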

Sentiment Classification
We now evaluate Limbic for document-level sentiment classification vis-à-vis JST, ASUM, and AATS. For comparison purposes, we add a supervised baseline, Bi-LSTM, using the bidirectional LSTM model (Schuster and Paliwal, 1997). Bi-LSTM uses 100 as the hidden state size and 0.2 as both the recurrent dropout rate and the dropout rate in the last layer. For training, we run 20 epochs with a minibatch size of 1,000. We use two datasets, TripUser and YelpUser. To collect ground-truth labels, we use integer ratings (three and above as positive and the rest as negative). Note that our review datasets are imbalanced. Our results are based on five-fold cross-validation (80% of each author's reviews for training and 20% for testing) with the two-tailed exact permutation test. As our principal evaluation metrics, we adopt accuracy, the receiver operating characteristic (ROC) curve, and the area under the curve (AUC). ROC and AUC are standard metrics for evaluating classifiers on data with class imbalance (Bradley, 1997; Hoens and Chawla, 2013).
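The labeling rule and the AUC metric can be made concrete with a short sketch. The function names and toy scores are illustrative. AUC is computed here through its rank interpretation, the probability that a randomly chosen positive review is scored above a randomly chosen negative one, with ties counted as one half.

```python
def label_from_rating(rating):
    """Ground-truth label: 1 (positive) for ratings of three and above, else 0 (negative)."""
    return 1 if rating >= 3 else 0


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


ratings = [5, 4, 2, 1, 3]
labels = [label_from_rating(r) for r in ratings]
scores = [0.9, 0.4, 0.5, 0.1, 0.8]  # toy positive-class scores from some classifier
area = auc(labels, scores)
```

Unlike accuracy, this quantity does not reward a classifier for always predicting the majority class, which is why it is informative on imbalanced data.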
Tables 6 and 7 report accuracy and AUC on hotel and restaurant reviews. AATS yields high accuracy but low AUC due to a strong bias toward the majority class. Compared with AATS, JST yields higher AUC for both datasets but lower accuracy for TripUser. ASUM outperforms JST, indicating that sentences are more effective as units of sentiment analysis than bags of words. Limbic significantly outperforms ASUM in all settings. For hotel reviews, Limbic attains average gains of 4.0% and 2.3% in accuracy and AUC, respectively. For restaurant reviews, Limbic yields average gains of 5.1% and 3.0% in accuracy and AUC, respectively. In Figures 3 and 4, we compare the ROC curves of Limbic with those of the baselines. The ROC curves show how the true positive rate (TPR) (vertical axis) varies with the false positive rate (FPR) (horizontal axis) as the decision boundary moves. We see that for all FPRs, Limbic yields the highest TPRs. Its ROC curves dominate the other models' curves. The results demonstrate that, among all models, Lim-

Model Analysis
To understand the contributions of incorporating authors, discourse relations, and word embeddings, we evaluate variants of Limbic for SEU-level sentiment classification on two datasets: tSEU and tSEU(D). We create tSEU by randomly selecting 200 hotel reviews by seven authors. We manually annotate the sentiments of each SEU, obtaining 2,692 SEUs. We create tSEU(D) by selecting reviews in tSEU containing at least one Comparison or Expansion. We define three variants of Limbic (L): $L_A$ with just authors, no discourse relations or word embeddings; $L_{AD}$ with authors and discourse relations but no word embeddings; and $L_{AW}$ with authors and word embeddings but no discourse relations. Table 8 compares Limbic with $L_A$, $L_{AD}$, and $L_{AW}$. We observe that for both datasets, incorporating discourse relations improves accuracy. By incorporating word embeddings, $L_{AW}$ yields better accuracy than $L_{AD}$, showing that word embeddings add more value to Limbic than discourse relations do.

Related Work
Sentiment and aspect discovery are often based on Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Titov and McDonald's (2008a) framework discovers topics using aspect ratings provided by reviewers. JST (Lin et al., 2012) and ASUM (Jo and Oh, 2011) model a review via multinomial distributions of topics and sentiments and use them to condition the probability of generating words. Kim et al. (2013) extend ASUM by allowing its probabilistic model to discover a hierarchical structure of aspect-based sentiments. Lazaridou et al. (2013) introduce discourse transitions into the document generating process as aspect and sentiment shifters. Although the above models produce good results, they omit author information, which is an intrinsic attribute of opinionated texts.
In a document, AT conditions the probability of the topic assignment on the author of the document. Kim et al.'s (2012) topic model captures entities mentioned in documents and models the probability of generating a word as conditioned on both entity and topic. Diao and Jiang (2013) jointly model topics, events, and users on Twitter. Although these models capture the author associated with a text, they do not handle sentiments. Mukherjee et al.'s (2014) JAST model jointly considers authors, sentiments, topics, and ratings. JAST does not consider discourse relations or word semantic similarity in its generative process. Poddar et al. (2017) propose a model that jointly considers author, aspect, sentiment, and the nonrepetitive generation of aspect sequences. The model uses a Bernoulli process to capture the nonrepetitive nature of aspect sequences. This mechanism does not consider discourse relations or syntactic information.

Conclusion and Discussion
Limbic provides an unsupervised method to discover aspects and sentiments from opinionated texts. By incorporating authors as a factor, Limbic allows for reviews written by the same or similar authors to exhibit an idiosyncratic preference toward certain aspects and sentiments. It assigns aspects of SEUs by sampling author-specific aspect distributions. This makes the model more suitable for opinionated texts in which aspects and sentiments are tightly bound to authors who follow their specific criteria and preferences when writing reviews. By incorporating a Markov Random Field and word embeddings into its sampling process, Limbic imposes constraints associated with discourse relations, effectively captures word semantic relatedness, and generates word clusters with high topic cohesion and lexical diversity. In future work, we plan to extend Limbic to capture long-distance discourse relations and the influence decay of discourse relations between SEUs as their distance increases.

Acknowledgments
Thanks to Chung-Wei Hang and the anonymous reviewers for helpful comments, corrections and inspiration.