An Unsupervised Neural Attention Model for Aspect Extraction

Aspect extraction is an important and challenging task in aspect-based sentiment analysis. Existing works tend to apply variants of topic models on this task. While fairly successful, these methods usually do not produce highly coherent aspects. In this paper, we present a novel neural approach with the aim of discovering coherent aspects. The model improves coherence by exploiting the distribution of word co-occurrences through the use of neural word embeddings. Unlike topic models which typically assume independently generated words, word embedding models encourage words that appear in similar contexts to be located close to each other in the embedding space. In addition, we use an attention mechanism to de-emphasize irrelevant words during training, further improving the coherence of aspects. Experimental results on real-life datasets demonstrate that our approach discovers more meaningful and coherent aspects, and substantially outperforms baseline methods on several evaluation tasks.


Introduction
Aspect extraction is one of the key tasks in sentiment analysis. It aims to extract entity aspects on which opinions have been expressed (Hu and Liu, 2004; Liu, 2012). For example, in the sentence "The beef was tender and melted in my mouth", the aspect term is "beef". Two sub-tasks are performed in aspect extraction: (1) extracting all aspect terms (e.g., "beef") from a review corpus, and (2) clustering aspect terms with similar meaning into categories where each category represents a single aspect (e.g., clustering "beef", "pork", "pasta", and "tomato" into one aspect, food).
Previous works for aspect extraction can be categorized into three approaches: rule-based, supervised, and unsupervised. Rule-based methods usually do not group extracted aspect terms into categories. Supervised learning requires data annotation and suffers from domain adaptation problems. Unsupervised methods are adopted to avoid reliance on labeled data needed for supervised learning.
In recent years, Latent Dirichlet Allocation (LDA) (Blei et al., 2003) and its variants (Titov and McDonald, 2008; Brody and Elhadad, 2010; Zhao et al., 2010; Mukherjee and Liu, 2012) have become the dominant unsupervised approach for aspect extraction. LDA models the corpus as a mixture of topics (aspects), and topics as distributions over word types. While the mixture of aspects discovered by LDA-based models may describe a corpus fairly well, we find that the individual aspects inferred are of poor quality: aspects often consist of unrelated or loosely related concepts. This may substantially reduce users' confidence in using such automated systems. There could be two primary reasons for the poor quality. First, conventional LDA models do not directly encode word co-occurrence statistics, which are the primary source of information to preserve topic coherence (Mimno et al., 2011). They only implicitly capture such patterns by modeling word generation from the document level, assuming that each word is generated independently. Second, LDA-based models need to estimate a distribution of topics for each document. Review documents tend to be short, which makes the estimation of topic distributions more difficult.
In this work, we present a novel neural approach to tackle the weaknesses of LDA-based methods. We start with neural word embeddings that already map words that usually co-occur within the same context to nearby points in the embedding space (Mikolov et al., 2013). We then filter the word embeddings within a sentence using an attention mechanism (Bahdanau et al., 2015) and use the filtered words to construct aspect embeddings. The training process for aspect embeddings is analogous to autoencoders, where we use dimension reduction to extract the common factors among embedded sentences and reconstruct each sentence through a linear combination of aspect embeddings. The attention mechanism de-emphasizes words that are not part of any aspect, allowing the model to focus on aspect words. We call our proposed model Attention-based Aspect Extraction (ABAE).
In contrast to LDA-based models, our proposed method explicitly encodes word co-occurrence statistics into word embeddings, uses dimension reduction to extract the most important aspects in the review corpus, and uses an attention mechanism to remove irrelevant words to further improve coherence of the aspects.
We have conducted extensive experiments on large review data sets. The results show that ABAE is effective in discovering meaningful and coherent aspects. It substantially outperforms baseline methods on multiple evaluation tasks. In addition, ABAE is intuitive and structurally simple. It can also easily scale to a large amount of training data. Therefore, it is a promising alternative to LDA-based methods proposed previously.

Related Work
The problem of aspect extraction has been well studied in the past decade. Initially, methods were mainly based on manually defined rules. Hu and Liu (2004) proposed to extract different product features through finding frequent nouns and noun phrases. They also extracted opinion terms by finding the synonyms and antonyms of opinion seed words through WordNet. Following this, a number of methods have been proposed based on frequent item mining and dependency information to extract product aspects (Zhuang et al., 2006; Somasundaran and Wiebe, 2009; Qiu et al., 2011). These models heavily depend on predefined rules which work well only when the aspect terms are restricted to a small group of nouns.
Supervised learning approaches generally model aspect extraction as a standard sequence labeling problem. Jin and Ho (2009) proposed to use hidden Markov models (HMM), and conditional random fields (CRF) have also been applied, in both cases with a set of manually extracted features. More recently, different neural models (Yin et al., 2016; Wang et al., 2016) were proposed to automatically learn features for CRF-based aspect extraction. Rule-based models are usually not refined enough to categorize the extracted aspect terms. On the other hand, supervised learning requires large amounts of labeled data for training purposes.
Unsupervised approaches, especially topic models, have been proposed subsequently to avoid reliance on labeled data. Generally, the outputs of those models are word distributions or rankings for each aspect. Aspects are naturally obtained without separately performing extraction and categorization. Most existing works (Brody and Elhadad, 2010; Zhao et al., 2010; Mukherjee and Liu, 2012; Chen et al., 2014) are based on variants and extensions of LDA (Blei et al., 2003). Recently, Wang et al. (2015) proposed a restricted Boltzmann machine (RBM)-based model to simultaneously extract aspects and relevant sentiments of a given review sentence, treating aspects and sentiments as separate hidden variables in the RBM. However, the RBM-based model proposed by Wang et al. (2015) relies on a substantial amount of prior knowledge such as part-of-speech (POS) tagging and sentiment lexicons. A biterm topic model (BTM) that generates co-occurring word pairs was proposed by Yan et al. (2013). We experimentally compare ABAE and BTM on multiple tasks in this paper.
Attention models (Mnih et al., 2014) have recently gained popularity in training neural networks and have been applied to various natural language processing tasks, including machine translation (Bahdanau et al., 2015; Luong et al., 2015), sentence summarization (Rush et al., 2015), sentiment classification (Chen et al., 2016; Tang et al., 2016), and question answering (Hermann et al., 2015). Rather than using all available information, the attention mechanism aims to focus on the most pertinent information for a task. Unlike previous works, in this paper we apply attention to an unsupervised neural model. Our experimental results demonstrate its effectiveness under an unsupervised setting for aspect extraction.

Model Description
We describe the Attention-based Aspect Extraction (ABAE) model in this section. The ultimate goal is to learn a set of aspect embeddings, where each aspect can be interpreted by looking at the nearest words (representative words) in the embedding space. We begin by associating each word $w$ in our vocabulary with a feature vector $e_w \in \mathbb{R}^d$. We use word embeddings for the feature vectors, as word embeddings are designed to map words that often co-occur in a context to points that are close by in the embedding space (Mikolov et al., 2013). The feature vectors associated with the words correspond to the rows of a word embedding matrix $E \in \mathbb{R}^{V \times d}$, where $V$ is the vocabulary size. We want to learn embeddings of aspects, where aspects share the same embedding space with words. This requires an aspect embedding matrix $T \in \mathbb{R}^{K \times d}$, where $K$, the number of aspects defined, is much smaller than $V$. The aspect embeddings are used to approximate the aspect words in the vocabulary, where the aspect words are filtered through an attention mechanism.
Each input sample to ABAE is a list of indexes for words in a review sentence. Given such an input, two steps are performed as shown in Figure 1. First, we filter away non-aspect words by down-weighting them using an attention mechanism, and construct a sentence embedding $z_s$ from weighted word embeddings. Then, we try to reconstruct the sentence embedding as a linear combination of aspect embeddings from $T$. This process of dimension reduction and reconstruction, where ABAE aims to transform sentence embeddings of the filtered sentences ($z_s$) into their reconstructions ($r_s$) with the least possible amount of distortion, preserves most of the information of the aspect words in the $K$ embedded aspects. We next describe the process in detail.

Sentence Embedding with Attention Mechanism
We construct a vector representation $z_s$ for each input sentence $s$ in the first step. In general, we want the vector representation to capture the most relevant information with regard to the aspect (topic) of the sentence. We define the sentence embedding $z_s$ as the weighted summation of word embeddings $e_{w_i}$, $i = 1, \ldots, n$, corresponding to the word indexes in the sentence:
$$z_s = \sum_{i=1}^{n} a_i e_{w_i} \tag{1}$$
For each word $w_i$ in the sentence, we compute a positive weight $a_i$ which can be interpreted as the probability that $w_i$ is the right word to focus on in order to capture the main topic of the sentence. The weight $a_i$ is computed by an attention model, which is conditioned on the embedding of the word $e_{w_i}$ as well as the global context of the sentence:
$$a_i = \frac{\exp(d_i)}{\sum_{j=1}^{n} \exp(d_j)} \tag{2}$$
$$d_i = e_{w_i}^\top \cdot M \cdot y_s \tag{3}$$
$$y_s = \frac{1}{n} \sum_{i=1}^{n} e_{w_i} \tag{4}$$
where $y_s$ is simply the average of the word embeddings, which we believe captures the global context of the sentence. $M \in \mathbb{R}^{d \times d}$ is a matrix mapping between the global context embedding $y_s$ and the word embedding $e_w$, and is learned as part of the training process. We can think of the attention mechanism as a two-step process. Given a sentence, we first construct its representation by averaging all the word representations. Then the weight of a word is assigned by considering two things. First, we filter the word through the transformation $M$, which is able to capture the relevance of the word to the $K$ aspects. Then we capture the relevance of the filtered word to the sentence by taking the inner product of the filtered word with the global context $y_s$.
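As a minimal sketch of the attention step, assuming the weights are obtained by a softmax over the scores $e_{w_i}^\top M y_s$, the sentence embedding can be computed in plain Python as below. The function name `attention_sentence_embedding` and the toy vectors are illustrative assumptions, not part of the original implementation.

```python
import math


def attention_sentence_embedding(word_vecs, M):
    """Return (z_s, a) for one sentence.

    word_vecs: list of n word embeddings, each a list of d floats.
    M: d x d matrix given as a list of rows (learned in the real model).
    """
    n, d = len(word_vecs), len(word_vecs[0])
    # Global context y_s: the plain average of the word embeddings.
    y_s = [sum(e[k] for e in word_vecs) / n for k in range(d)]
    # M . y_s is shared by every word in the sentence.
    My = [sum(M[r][c] * y_s[c] for c in range(d)) for r in range(d)]
    # Score d_i = e_i^T M y_s for each word.
    scores = [sum(e[k] * My[k] for k in range(d)) for e in word_vecs]
    # Softmax over the scores (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    a = [x / total for x in exps]
    # z_s: attention-weighted sum of the word embeddings.
    z_s = [sum(a[i] * word_vecs[i][k] for i in range(n)) for k in range(d)]
    return z_s, a


# Toy input (assumed values): three 2-dimensional embeddings, M = identity.
words = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.5]]
identity = [[1.0, 0.0], [0.0, 1.0]]
z_s, a = attention_sentence_embedding(words, identity)
```

With the identity matrix standing in for $M$, words that point in the same direction as the sentence average receive the larger weights; in the trained model $M$ is learned, so the weighting reflects aspect relevance rather than raw similarity.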

Sentence Reconstruction with Aspect Embeddings
We have obtained the sentence embedding. Now we describe how to compute the reconstruction of the sentence embedding. As shown in Figure 1, the reconstruction process consists of two steps of transitions, which is similar to an autoencoder. Intuitively, we can think of the reconstruction as a linear combination of aspect embeddings from $T$:
$$r_s = T^\top \cdot p_t \tag{5}$$
where $r_s$ is the reconstructed vector representation and $p_t$ is the weight vector over the $K$ aspect embeddings, where each weight represents the probability that the input sentence belongs to the related aspect. $p_t$ can simply be obtained by reducing $z_s$ from $d$ dimensions to $K$ dimensions and then applying a softmax non-linearity that yields normalized non-negative weights:
$$p_t = \mathrm{softmax}(W \cdot z_s + b) \tag{6}$$
where $W$, the weight matrix parameter, and $b$, the bias vector, are learned as part of the training process.
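The reconstruction step can be sketched as follows, assuming $p_t$ is the softmax of an affine map of $z_s$ and $r_s$ is the $p_t$-weighted combination of the rows of $T$. The helper names and the toy parameter values are assumptions chosen for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [x / total for x in exps]


def reconstruct(z_s, W, b, T):
    """Return (r_s, p_t).

    W: K x d weight matrix, b: length-K bias,
    T: K x d aspect embedding matrix (one aspect per row).
    """
    K, d = len(T), len(T[0])
    # Reduce z_s from d to K dimensions, then normalize with softmax.
    logits = [sum(W[k][j] * z_s[j] for j in range(d)) + b[k] for k in range(K)]
    p_t = softmax(logits)
    # r_s = T^T p_t: a linear combination of the aspect embeddings.
    r_s = [sum(p_t[k] * T[k][j] for k in range(K)) for j in range(d)]
    return r_s, p_t


# Toy parameters (assumed): K = 2 aspects in a d = 3 embedding space.
T = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
b = [0.0, 0.0]
r_s, p_t = reconstruct([1.0, 0.2, 0.0], W, b, T)
```

Because $p_t$ sums to one, $r_s$ always lies in the convex hull of the aspect embeddings, which is what forces the $K$ aspect vectors to summarize the sentence embeddings.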

Training Objective
ABAE is trained to minimize the reconstruction error. We adopted the contrastive max-margin objective function used in previous work (Weston et al., 2011; Socher et al., 2014; Iyyer et al., 2016). For each input sentence, we randomly sample $m$ sentences from our training data as negative samples. We represent each negative sample as $n_i$, which is computed by averaging its word embeddings. Our objective is to make the reconstructed embedding $r_s$ similar to the target sentence embedding $z_s$ while different from those negative samples. Therefore, the unregularized objective $J$ is formulated as a hinge loss that maximizes the inner product between $r_s$ and $z_s$ and simultaneously minimizes the inner product between $r_s$ and the negative samples:
$$J(\theta) = \sum_{s \in D} \sum_{i=1}^{m} \max(0,\; 1 - r_s z_s + r_s n_i) \tag{7}$$
where $D$ represents the training data set and $\theta = \{E, T, M, W, b\}$ represents the model parameters.
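For a single sentence, the contrastive max-margin term can be sketched as a sum of hinge losses, one per negative sample, assuming a margin of 1. The function names and the toy vectors below are illustrative assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def max_margin_loss(r_s, z_s, negatives):
    """Sum over negative samples n_i of max(0, 1 - r_s.z_s + r_s.n_i)."""
    positive = dot(r_s, z_s)
    return sum(max(0.0, 1.0 - positive + dot(r_s, n_i)) for n_i in negatives)


# Toy vectors (assumed values).
r_s = [1.0, 0.0]
z_s = [2.0, 0.0]
negatives = [[-1.0, 0.0], [1.5, 0.0]]
loss = max_margin_loss(r_s, z_s, negatives)
```

In this toy case the first negative sample is already separated from the reconstruction by more than the margin and contributes nothing, while the second contributes 1 - 2 + 1.5 = 0.5.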

Regularization Term
We hope to learn vector representations of the most representative aspects for a review dataset. However, the aspect embedding matrix $T$ may suffer from redundancy problems during training. To ensure the diversity of the resulting aspect embeddings, we add a regularization term to the objective function $J$ to encourage the uniqueness of each aspect embedding:
$$U(\theta) = \left\| T_n \cdot T_n^\top - I \right\| \tag{8}$$
where $I$ is the identity matrix, and $T_n$ is $T$ with each row normalized to have length 1. Any non-diagonal element $t_{ij}$ $(i \neq j)$ in the matrix $T_n \cdot T_n^\top$ corresponds to the dot product of two different aspect embeddings. $U$ reaches its minimum value when the dot product between any two different aspect embeddings is zero. Thus the regularization term encourages orthogonality among the rows of the aspect embedding matrix $T$ and penalizes redundancy between different aspect vectors. Our final objective function $L$ is obtained by adding $J$ and $U$:
$$L(\theta) = J(\theta) + \lambda U(\theta) \tag{9}$$
where $\lambda$ is a hyperparameter that controls the weight of the regularization term.
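The orthogonality penalty can be sketched as follows. The text does not state which matrix norm is used, so this example assumes the Frobenius norm; the function name and the toy matrices are likewise assumptions.

```python
import math


def orthogonality_penalty(T):
    """Frobenius norm (an assumption) of T_n T_n^T - I, with T_n row-normalized."""
    # Normalize every aspect embedding (row of T) to unit length.
    T_n = []
    for row in T:
        norm = math.sqrt(sum(x * x for x in row))
        T_n.append([x / norm for x in row])
    K = len(T_n)
    total = 0.0
    for i in range(K):
        for j in range(K):
            # Entry (i, j) of T_n T_n^T is the dot product of aspects i and j.
            gram = sum(x * y for x, y in zip(T_n[i], T_n[j]))
            target = 1.0 if i == j else 0.0
            total += (gram - target) ** 2
    return math.sqrt(total)


# Toy aspect matrices (assumed values).
orthogonal = [[2.0, 0.0], [0.0, 3.0]]
redundant = [[1.0, 0.0], [1.0, 0.0]]
penalty_orthogonal = orthogonality_penalty(orthogonal)
penalty_redundant = orthogonality_penalty(redundant)
```

Orthogonal aspect vectors incur no penalty regardless of their length, whereas two identical aspects produce off-diagonal entries of 1 and are penalized.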

Datasets
We evaluate our method on two real-world datasets. The detailed statistics of the datasets are summarized in Table 1.
(1) Citysearch corpus: This is a restaurant review corpus widely used by previous works (Ganu et al., 2009; Brody and Elhadad, 2010; Zhao et al., 2010), which contains over 50,000 restaurant reviews from Citysearch New York. Ganu et al. (2009) also provided a subset of 3,400 sentences from the corpus with manually labeled aspects. These annotated sentences are used for evaluation of aspect identification. There are six manually defined aspect labels: Food, Staff, Ambience, Price, Anecdotes, and Miscellaneous.
(2) BeerAdvocate: This is a beer review corpus introduced by McAuley et al. (2012), containing over 1.5 million reviews. A subset of 1,000 reviews, corresponding to 9,245 sentences, is annotated with five aspect labels: Feel, Look, Smell, Taste, and Overall.

Baseline Methods
To validate the performance of ABAE, we compare it against a number of baselines:
(1) LocLDA (Brody and Elhadad, 2010): This method uses a standard implementation of LDA. In order to prevent the inference of global topics and direct the model towards rateable aspects, each sentence is treated as a separate document.
(2) k-means: We initialize the aspect matrix T by using the k-means centroids of the word embeddings. To show the power of ABAE, we compare its performance with using the k-means centroids directly.
(3) SAS (Mukherjee and Liu, 2012): This is a hybrid topic model that jointly discovers both aspects and aspect-specific opinions. This model has been shown to be competitive among topic models in discovering meaningful aspects (Mukherjee and Liu, 2012; Wang et al., 2015).
(4) BTM (Yan et al., 2013): This is a biterm topic model that is specially designed for short texts such as texts from social media and review sites. The major advantage of BTM over conventional LDA models is that it alleviates the problem of data sparsity in short documents by directly modeling the generation of unordered word-pair co-occurrences (biterms) over the corpus. It has been shown to perform better than conventional LDA models in discovering coherent topics.

Experimental Settings
Review corpora are preprocessed by removing punctuation symbols, stop words, and words appearing fewer than 10 times. For LocLDA, we use the open-source implementation GibbsLDA++, and for BTM, we use the implementation released by Yan et al. (2013). We tune the hyperparameters of all topic model baselines on a held-out set with grid search using the topic coherence metric to be introduced later in Equation 10: for LocLDA, the Dirichlet priors are α = 0.05 and β = 0.1; for SAS and BTM, α = 50/K and β = 0.1. We run 1,000 iterations of Gibbs sampling for all topic models. For the ABAE model, we initialize the word embedding matrix E with word vectors trained by word2vec with negative sampling on each dataset, setting the embedding size to 200, window size to 10, and negative sample size to 5. The parameters we use for training word embeddings are standard with no specific tuning to our data. We also initialize the aspect embedding matrix T with the centroids of clusters resulting from running k-means on word embeddings. Other parameters are initialized randomly. During the training process, we fix the word embedding matrix E and optimize other parameters using Adam (Kingma and Ba, 2014) with learning rate 0.001 for 15 epochs and batch size of 50. We set the number of negative samples per input sample m to 20, and the orthogonality penalty weight λ to 1 by tuning the hyperparameters on a held-out set with grid search. The results reported for all models are the average over 10 runs.
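The preprocessing described above can be sketched with the Python standard library. This is only an illustration under stated assumptions: the tiny stop-word set is hypothetical, lowercasing and whitespace tokenization are our own choices, and `min_count` corresponds to the frequency threshold of 10 used in the experiments.

```python
import string
from collections import Counter

# Hypothetical stop-word list for illustration; a real run would use a full list.
STOP_WORDS = {"the", "was", "and", "a", "in", "my"}


def preprocess(sentences, min_count=10):
    """Strip punctuation and stop words, then drop words seen < min_count times."""
    strip_punct = str.maketrans("", "", string.punctuation)
    tokenized = [
        [tok for tok in s.lower().translate(strip_punct).split()
         if tok not in STOP_WORDS]
        for s in sentences
    ]
    # Corpus-level word frequencies decide which words survive.
    counts = Counter(tok for sent in tokenized for tok in sent)
    return [[tok for tok in sent if counts[tok] >= min_count]
            for sent in tokenized]


# Toy corpus (assumed); a low threshold is used so that some words survive.
sample = ["The beef was tender.", "Tender beef, tender pork!"]
cleaned = preprocess(sample, min_count=2)
```

Here "pork" is removed because it occurs only once, while "beef" and "tender" are kept.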
Following Brody and Elhadad (2010) and Zhao et al. (2010), we set the number of aspects for the restaurant corpus to 14. We experimented with different numbers of aspects from 10 to 20 for the beer corpus. The results showed no major difference, so we also set it to 14. As in previous work (Brody and Elhadad, 2010; Zhao et al., 2010), we manually mapped each inferred aspect to one of the gold-standard aspects according to its top ranked representative words. In ABAE, representative words of an aspect can be found by looking at its nearest words in the embedding space using cosine as the similarity metric.
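Retrieving the representative words of an aspect amounts to a nearest-neighbor search under cosine similarity. A minimal sketch follows; the function names, the toy vocabulary, and the toy vectors are assumptions, and all vectors are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def representative_words(aspect_vec, word_vectors, top_n):
    """Return the top_n words whose embeddings are closest to aspect_vec."""
    ranked = sorted(
        word_vectors,
        key=lambda w: cosine(aspect_vec, word_vectors[w]),
        reverse=True,
    )
    return ranked[:top_n]


# Toy 2-dimensional embeddings (assumed values).
word_vectors = {
    "beef": [0.9, 0.1],
    "pasta": [0.8, 0.3],
    "waiter": [0.1, 0.9],
}
top_words = representative_words([1.0, 0.0], word_vectors, 2)
```

The ranked list returned for each aspect is what a human annotator would inspect when mapping inferred aspects to gold-standard labels.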

Evaluation and Results
We describe the evaluation tasks and report the experimental results in this section. We evaluate ABAE on two criteria:
• Is it able to find meaningful and semantically coherent aspects?
• Is it able to improve aspect identification performance on real-world review datasets?
Compared with the gold-standard labels, the inferred aspects are more fine-grained. For example, the model can distinguish main dishes from desserts, and drinks from food.

Coherence Score
In order to objectively measure the quality of aspects, we use the coherence score, a metric which has been shown to correlate well with human judgment (Mimno et al., 2011). Given an aspect $z$ and a set of top $N$ words of $z$, $S^z = \{w_1^z, \ldots, w_N^z\}$, the coherence score is calculated as follows:
$$C(z; S^z) = \sum_{n=2}^{N} \sum_{l=1}^{n-1} \log \frac{D_2(w_n^z, w_l^z) + 1}{D_1(w_l^z)} \tag{10}$$
where $D_1(w)$ is the document frequency of word $w$ and $D_2(w_1, w_2)$ is the co-document frequency of words $w_1$ and $w_2$. A higher coherence score indicates better aspect interpretability, i.e., more meaningful and semantically coherent aspects. Figure 2 shows the average coherence score of each model, which is computed as $\frac{1}{K} \sum_{k=1}^{K} C(z_k; S^{z_k})$, on both the restaurant domain and the beer domain. From the results, we make the following observations:
(1) ABAE outperforms previous models for all ranked buckets.
(2) BTM performs slightly better than LocLDA and SAS. This may be because BTM directly models the generation of biterms, while conventional LDA just implicitly captures such patterns by modeling word generation from the document level.
(3) It is interesting to note that performing k-means on the word embeddings is sufficient to perform better than all topic model baselines, including BTM. This indicates that neural word embedding is a better model for capturing co-occurrence than LDA, even compared with BTM, which specifically models the generation of co-occurring word pairs.

Table 3: Number of coherent aspects. K (number of aspects) = 14 for all models.
             k-means   LocLDA   SAS   BTM   ABAE
Restaurant   11        8        9     9     11
Beer         9         8        8     9     10
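The coherence score of Mimno et al. (2011) used in this section can be sketched as below, assuming each document is a list of tokens and that every top word occurs in at least one document (so that the document frequency in the denominator is positive). The function name and the toy corpus are illustrative assumptions.

```python
import math


def coherence_score(top_words, documents):
    """Sum over ranked word pairs of log((D2(w_n, w_l) + 1) / D1(w_l))."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(w):
        # D1: number of documents containing w.
        return sum(1 for d in doc_sets if w in d)

    def co_doc_freq(w1, w2):
        # D2: number of documents containing both w1 and w2.
        return sum(1 for d in doc_sets if w1 in d and w2 in d)

    score = 0.0
    for n in range(1, len(top_words)):      # n-th ranked word
        for l in range(n):                  # every higher-ranked word
            numerator = co_doc_freq(top_words[n], top_words[l]) + 1
            score += math.log(numerator / doc_freq(top_words[l]))
    return score


# Toy corpus (assumed): each inner list is one document.
docs = [
    ["beef", "pork", "tender"],
    ["beef", "pork"],
    ["pasta", "waiter"],
    ["beef", "pasta"],
]
coherent = coherence_score(["beef", "pork"], docs)
incoherent = coherence_score(["beef", "waiter"], docs)
```

Word pairs that frequently share documents push the score toward zero or above, while pairs that never co-occur contribute strongly negative terms, which is why a higher score indicates a more coherent aspect.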

User Evaluation
As we want to discover a set of aspects that the human user finds agreeable, it is also necessary to carry out user evaluation directly. Following the experimental setting of Chen et al. (2014), we recruited three human judges. Each aspect is labeled as coherent if the majority of judges assess that most of its top 50 terms coherently represent a product aspect. The numbers of coherent aspects discovered by each model are shown in Table 3. ABAE discovers the largest number of coherent aspects compared with other models. For a coherent aspect, each of its top terms is labeled as correct if and only if the majority of judges assess that it reflects the related aspect. We adopt precision@n (or p@n) to evaluate the results, which was also used by Mukherjee and Liu (2012) and Chen et al. (2014). Figure 3 shows the average p@n results over all coherent aspects for each domain. We can see that the user evaluation results correlate well with the coherence scores shown in Figure 2, where ABAE substantially outperforms all other models for all ranked buckets, especially for large values of n.
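The majority-vote labeling and p@n can be sketched as follows, assuming each judge gives a binary vote (1 for correct, 0 for incorrect) on every ranked term. The function names and the vote matrix are illustrative assumptions, not data from the study.

```python
def majority_vote(votes_per_term):
    """A term is correct if strictly more than half of the judges say so."""
    return [sum(votes) * 2 > len(votes) for votes in votes_per_term]


def precision_at_n(judgments, n):
    """Fraction of the top-n ranked terms that were judged correct."""
    top = judgments[:n]
    return sum(top) / len(top)


# Hypothetical votes of three judges for the four top-ranked terms of an aspect.
votes = [[1, 1, 1], [1, 0, 1], [0, 0, 1], [1, 1, 0]]
judgments = majority_vote(votes)
p_at_2 = precision_at_n(judgments, 2)
p_at_4 = precision_at_n(judgments, 4)
```

In this toy case the third term is rejected by two of the three judges, so p@2 is 1.0 while p@4 drops to 0.75.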

Aspect Identification
We evaluate the performance of sentence-level aspect identification on both domains using the annotated sentences shown in Table 1. The evaluation criterion is to judge how well the predictions match the true labels, measured by precision, recall, and $F_1$ scores. The results are shown in Table 4 and Table 5 (see footnote 4).
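Per-aspect precision, recall, and $F_1$ can be computed from gold and predicted sentence labels as in the sketch below. The function name and the label lists are illustrative assumptions; returning 0.0 for an undefined ratio is our own convention.

```python
def precision_recall_f1(gold, predicted, label):
    """Precision, recall and F1 for one aspect label."""
    true_pos = sum(1 for g, p in zip(gold, predicted)
                   if g == label and p == label)
    predicted_count = sum(1 for p in predicted if p == label)
    gold_count = sum(1 for g in gold if g == label)
    # Undefined ratios are reported as 0.0 (a convention assumed here).
    precision = true_pos / predicted_count if predicted_count else 0.0
    recall = true_pos / gold_count if gold_count else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels for five sentences.
gold = ["Food", "Food", "Staff", "Ambience", "Food"]
pred = ["Food", "Staff", "Staff", "Ambience", "Food"]
p, r, f1 = precision_recall_f1(gold, pred, "Food")
```

Here both sentences predicted as Food are correct (precision 1.0), but one of the three gold Food sentences is missed (recall 2/3), giving an $F_1$ of 0.8.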
Given a review sentence, ABAE first assigns an inferred aspect label which corresponds to the highest weight in $p_t$, calculated as shown in Equation 6. We then assign the gold-standard label to the sentence according to the mapping between inferred aspects and gold-standard labels.
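The two-stage labeling described above reduces to an argmax over the aspect weights followed by a table lookup. In the sketch below, the function name, the weight vector, and the mapping from inferred aspect indexes to gold-standard labels are all hypothetical; in the experiments that mapping is produced manually.

```python
def identify_aspect(p_t, aspect_to_gold):
    """Pick the inferred aspect with the highest weight and map it to a gold label."""
    inferred = max(range(len(p_t)), key=lambda k: p_t[k])
    return aspect_to_gold[inferred]


# Hypothetical mapping: several fine-grained inferred aspects can share a gold label.
aspect_to_gold = {0: "Food", 1: "Staff", 2: "Food"}
label = identify_aspect([0.1, 0.7, 0.2], aspect_to_gold)
```

Because several inferred aspects may map to the same gold-standard label, the fine-grained aspects learned by the model are collapsed into the coarser annotation scheme only at evaluation time.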
3 k-means assigns a sentence an inferred aspect whose embedding is the closest to the averaged word embeddings of the sentence.
4 Note that the values of P/R/F1 reported are the average over 10 runs (except some values taken from published results in Table 4 (Wang et al., 2015)).
For the restaurant domain, we follow the experimental settings of previous work (Brody and Elhadad, 2010; Zhao et al., 2010; Wang et al., 2015) to make our results comparable. To do that, (1) we only used the single-label sentences for evaluation to avoid ambiguity (about 83% of labeled sentences have a single label), and (2) we only evaluated on three major aspects, namely Food, Staff, and Ambience. The other aspects do not show clear patterns in either word usage or writing style, which makes these aspects very hard for even humans to identify. Besides the baseline models, we also compare the results with other published models, including MaxEnt-LDA (ME-LDA) (Zhao et al., 2010) and SERBM (Wang et al., 2015). SERBM has reported state-of-the-art results for aspect identification on the restaurant corpus to date. However, SERBM relies on a substantial amount of prior knowledge.
We make the following observations from Table 4: (1) ABAE outperforms all other models on $F_1$ score for aspects Staff and Ambience. (2) The $F_1$ score of ABAE for Food is worse than that of SERBM, while its precision is very high. We analyzed the errors and found that most of the sentences we failed to recognize as Food are general descriptions without specific food words appearing. For example, the true label for the sentence "The food is prepared quickly and efficiently." is Food. ABAE assigns Staff to it, as the highly focused words according to the attention mechanism are "quickly" and "efficiently", which are more related to Staff. In fact, although this sentence contains the word "food", we think it is a rather general description of service. (3) ABAE substantially outperforms k-means for this task, although both methods perform well for extracting coherent aspects as shown in Figure 2 and Figure 3. This shows the power brought by the attention mechanism, which is able to capture the main topic of a sentence by only focusing on aspect-related words.
For the beer domain, in addition to the five gold-standard aspect labels, we also combined Taste and Smell to form a single aspect, Taste+Smell. This is because these two aspects are very similar and many words can be used to describe both aspects. For example, the words spicy, bitter, fresh, sweet, etc. are top ranked representative words in both aspects, which makes it very hard even for humans to distinguish them. Since Taste and Smell are highly correlated and difficult to separate in real life, a natural way to evaluate is to treat them as a single aspect.
Figure 4: Visualization of the attention layer.
We can see from Table 5 that due to the issue described above, all models perform poorly on Taste and Smell. ABAE outperforms previous models in $F_1$ scores on all aspects except for Taste. The results demonstrate the capability of ABAE in identifying separable aspects.
Table 6: Comparison between ABAE and ABAE$^-$ on aspect identification on the restaurant domain.
Figure 4 shows the weights of words assigned by the attention model for some example sentences. As we can see, the weights learned by the model correspond very strongly with human intuition. In order to evaluate how the attention model affects the overall performance of ABAE, we conduct experiments to compare ABAE and ABAE$^-$ on aspect identification, where ABAE$^-$ denotes the model in which the attention layer is switched off and the sentence embedding is calculated by averaging its word embeddings: $z_s = \frac{1}{n} \sum_{i=1}^{n} e_{w_i}$. The results on the restaurant domain are shown in Table 6. ABAE achieves substantially higher precision and recall on all aspects compared with ABAE$^-$, which demonstrates the effectiveness of the attention mechanism.

Conclusion
We have presented ABAE, a simple yet effective neural attention model for aspect extraction. In contrast to LDA models, ABAE explicitly captures word co-occurrence patterns and overcomes the problem of data sparsity present in review corpora. Our experimental results demonstrated that ABAE not only learns substantially higher quality aspects, but also more effectively captures the aspects of reviews than previous methods. To the best of our knowledge, we are the first to propose an unsupervised neural approach for aspect extraction. ABAE is intuitive and structurally simple, and also scales up well. All these benefits make it a promising alternative to LDA-based methods in practice.