Double Embeddings and CNN-based Sequence Labeling for Aspect Extraction

One key task of fine-grained sentiment analysis of product reviews is to extract product aspects or features that users have expressed opinions on. This paper focuses on supervised aspect extraction using deep learning. Unlike other highly sophisticated supervised deep learning models, this paper proposes a novel and yet simple CNN model employing two types of pre-trained embeddings for aspect extraction: general-purpose embeddings and domain-specific embeddings. Without using any additional supervision, this model achieves surprisingly good results, outperforming state-of-the-art sophisticated existing methods. To our knowledge, this paper is the first to report such a double-embeddings-based CNN model for aspect extraction and to achieve very good results.


Introduction
Aspect extraction is an important task in sentiment analysis (Hu and Liu, 2004) and has many applications (Liu, 2012). It aims to extract opinion targets (or aspects) from opinion text. In product reviews, aspects are product attributes or features. For example, from "Its speed is incredible" in a laptop review, it aims to extract "speed". Aspect extraction has been performed using supervised (Jakob and Gurevych, 2010;Chernyshevich, 2014;Shu et al., 2017) and unsupervised approaches (Hu and Liu, 2004;Zhuang et al., 2006;Mei et al., 2007;Qiu et al., 2011;Yin et al., 2016). Recently, supervised deep learning models have achieved state-of-the-art performance (Li and Lam, 2017). Many of these models use handcrafted features, lexicons, and complicated neural network architectures (Poria et al., 2016;Wang et al., 2016;Li and Lam, 2017). Although these approaches can achieve better performance than prior work, there are two other considerations that are also important. (1) Automated feature (representation) learning is always preferred. How to achieve competitive performance without manually crafting features is an important question. (2) According to the Occam's razor principle (Blumer et al., 1987), a simple model is always preferred over a complex model. This is especially important when the model is deployed in a real-life application (e.g., a chatbot), where a complex model will slow down inference. Thus, it is important to achieve competitive performance while keeping the model as simple as possible. This paper proposes such a model.
To address the first consideration, we propose a double embeddings mechanism that is shown to be crucial for aspect extraction. The embedding layer is the very first layer, where all the information about each word is encoded. The quality of the embeddings determines how easily later layers (e.g., LSTM, CNN or attention) can decode useful information. Existing deep learning models for aspect extraction use either a pre-trained general-purpose embedding, e.g., GloVe (Pennington et al., 2014), or a general review embedding (Poria et al., 2016). However, aspect extraction is a complex task that also requires fine-grained domain embeddings. For example, in the previous example, detecting "speed" may require embeddings of both "Its" and "speed". However, the criteria for good embeddings for "Its" and "speed" can be totally different. "Its" is a general word, and the general embedding (trained from a large corpus) is likely to have a better representation for "Its". But "speed" has a very fine-grained meaning (e.g., how many instructions per second) in the laptop domain, whereas "speed" in general embeddings or general review embeddings may mean how many miles per second. So using in-domain embeddings is important even when the in-domain embedding corpus is not large. Thus, we leverage both general embeddings and domain embeddings and let the rest of the network decide which embeddings have more useful information.
To address the second consideration, we use a pure Convolutional Neural Network (CNN) (LeCun et al., 1995) model for sequence labeling. Although most existing models use LSTM (Hochreiter and Schmidhuber, 1997) as the core building block to model sequences (Liu et al., 2015;Li and Lam, 2017), we noticed that CNN is also successful in many NLP tasks (Kim, 2014;Zhang et al., 2015;Gehring et al., 2017). One major drawback of LSTM is that LSTM cells are sequentially dependent. The forward pass and backpropagation must serially go through the whole sequence, which slows down the training/testing process. One challenge of applying CNN to sequence labeling is that convolution and max-pooling operations are usually used for summarizing sequential inputs, and the outputs are not well-aligned with the inputs. We discuss the solutions in Section 3.
We call the proposed model Dual Embeddings CNN (DE-CNN). To the best of our knowledge, this is the first paper that reports a double embedding mechanism and a pure CNN-based sequence labeling model for aspect extraction.

Related Work
Sentiment analysis has been studied at document, sentence and aspect levels (Liu, 2012;Pang and Lee, 2008;Cambria and Hussain, 2012). This work focuses on the aspect level (Hu and Liu, 2004). Aspect extraction is one of its key tasks, and has been performed using both unsupervised and supervised approaches. The unsupervised approach includes methods such as frequent pattern mining (Hu and Liu, 2004;Popescu and Etzioni, 2005), syntactic rules-based extraction (Zhuang et al., 2006;Wang and Wang, 2008;Qiu et al., 2011), topic modeling (Mei et al., 2007;Titov and McDonald, 2008;Lin and He, 2009;Moghaddam and Ester, 2011), word alignment (Liu et al., 2013) and label propagation (Zhou et al., 2013).
Traditionally, the supervised approach (Jakob and Gurevych, 2010;Mitchell et al., 2013;Shu et al., 2017) uses Conditional Random Fields (CRF) (Lafferty et al., 2001). Recently, deep neural networks have been applied to learn better features for supervised aspect extraction, e.g., using LSTM (Williams and Zipser, 1989;Hochreiter and Schmidhuber, 1997;Liu et al., 2015) and attention mechanisms together with manual features (Poria et al., 2016;Wang et al., 2016). Further, (Wang et al., 2016;Li and Lam, 2017) also proposed aspect and opinion terms co-extraction via a deep network. They took advantage of gold-standard opinion terms or a sentiment lexicon for aspect extraction. The proposed approach is close to (Liu et al., 2015), where only the annotated data for aspect extraction is used. However, we will show that our approach is more effective even compared with baselines using additional supervision and/or resources.
The proposed embedding mechanism is related to cross-domain embeddings (Bollegala et al., 2015, 2017) and domain-specific embeddings (Xu et al., 2018a,b). However, we require that the domain of the domain embeddings exactly match the domain of the aspect extraction task. CNN (LeCun et al., 1995;Kim, 2014) has recently been adopted for named entity recognition (Strubell et al., 2017). CNN classifiers are also used in sentiment analysis (Poria et al., 2016;Chen et al., 2017). We adopt CNN for sequence labeling for aspect extraction because CNN is simple and parallelizable.

Model
The proposed model is depicted in Figure 1. It has 2 embedding layers, 4 CNN layers, a fully-connected layer shared across all word positions, and a softmax layer over the labeling space Y = {B, I, O} for each input position. Note that an aspect can be a phrase: B and I indicate the beginning word and a non-beginning word of an aspect phrase, respectively, and O indicates a non-aspect word.
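To make the labeling space concrete, the following is a minimal sketch of how a predicted B/I/O sequence can be turned back into aspect phrases. The function name `bio_to_spans` and the example sentence are illustrative assumptions, not part of the paper.

```python
def bio_to_spans(tokens, labels):
    """Collect aspect phrases from a B/I/O label sequence (illustrative helper)."""
    spans, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":
            # A new aspect starts; close the previous one if it is still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            # Continue the currently open aspect phrase.
            current.append(token)
        else:
            # "O", or a stray "I" without a preceding "B": close any open phrase.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


tokens = ["The", "battery", "life", "is", "great", "but", "the", "screen", "is", "dim"]
labels = ["O", "B", "I", "O", "O", "O", "O", "B", "O", "O"]
aspects = bio_to_spans(tokens, labels)
print(aspects)
```

In this sketch a stray I label with no preceding B is simply ignored; other conventions (e.g., treating it as B) are equally possible.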
Assume the input is a sequence of word indexes $x = (x_1, \ldots, x_n)$. This sequence gets its two corresponding continuous representations $x^g$ and $x^d$ via two separate embedding layers (or embedding matrices) $W^g$ and $W^d$. The first embedding matrix $W^g$ represents general embeddings pre-trained from a very large general-purpose corpus (usually hundreds of billions of tokens). The second embedding matrix $W^d$ represents domain embeddings pre-trained from a small in-domain corpus, where the scope of the domain is exactly the domain that the training/testing data belongs to. As a counter-example, if the training/testing data is in the laptop domain, then embeddings from the electronics domain are considered to be out-of-domain embeddings (e.g., the word "adapter" may represent different types of adapters in electronics rather than exactly a laptop adapter). That is, only laptop reviews are considered to be in-domain.
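The double lookup can be sketched in a few lines. The tiny matrices below are made-up placeholders (3 general dimensions and 2 domain dimensions instead of the 300 and 100 used later in the paper), and `embed` is a hypothetical helper name.

```python
# Toy embedding matrices, one row per word index (values are invented for illustration).
W_general = [[0.1, 0.2, 0.3],
             [0.4, 0.5, 0.6]]
W_domain = [[1.0, 2.0],
            [3.0, 4.0]]


def embed(word_indexes, general, domain):
    """Look up each word in both matrices and concatenate the two vectors."""
    return [general[i] + domain[i] for i in word_indexes]  # list "+" concatenates


x = [1, 0, 1]                      # a sequence of word indexes
x1 = embed(x, W_general, W_domain)
print(len(x1), len(x1[0]))         # sequence length, concatenated dimension
```

Each position keeps both views of the word side by side, so later layers can weigh general and domain features as needed.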
We do not make these two embedding layers trainable because a small number of training examples may lead to many unseen words in test data. If the embeddings are tunable, the features in the embeddings of seen words will be adjusted (e.g., forgetting useless features and infusing new features that are related to the labels of the training examples), and the CNN filters will adjust to the new features accordingly. But the embeddings of unseen words from test data still have the old features, which may be mistakenly extracted by the CNN.
Then we concatenate the two embeddings, $x^{(1)} = x^g \oplus x^d$, and feed the result into a stack of 4 CNN layers. A CNN layer has many 1D-convolution filters, and each (the $r$-th) filter has a fixed kernel size $k = 2c+1$ and performs the following convolution operation and ReLU activation:

$$x^{(l+1)}_{i,r} = \max\left(0, \left(\sum_{j=-c}^{c} w^{(l)}_{j,r}\, x^{(l)}_{i+j}\right) + b^{(l)}_{r}\right),$$

where $l$ indicates the $l$-th CNN layer, and $w^{(l)}_{j,r}$ and $b^{(l)}_{r}$ are the weights and bias of the $r$-th filter. We apply each filter to all positions $i = 1 : n$, so each filter computes the representation for the $i$-th word along with the $2c$ nearby words in its context. Note that we force the kernel size $k$ to be an odd number, set the stride to 1, and pad the left $c$ and right $c$ positions with zeros. In this way, the output of each layer is well-aligned with the original input $x$ for sequence labeling purposes. For the first ($l = 1$) CNN layer, we employ two different filter sizes. For the remaining 3 CNN layers ($l \in \{2, 3, 4\}$), we only use one filter size. We discuss the details of the hyper-parameters in the experiment section. Finally, we apply a fully-connected layer with weights shared across all positions and a softmax layer to compute the label distribution for each word. The output size of the fully-connected layer is $|Y| = 3$. We apply dropout after the embedding layer and after each ReLU activation. Note that we do not apply any max-pooling layer after the convolution layers because a sequence labeling model needs good representations for every position, and the max-pooling operation mixes the representations of different positions, which is undesirable (we show a max-pooling baseline in the next section).
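The alignment property described above (odd kernel size, stride 1, zero padding of c positions on each side) can be checked with a minimal pure-Python sketch of a single filter. The function name `conv1d_relu` and the toy numbers are assumptions made for illustration; a real implementation would use a deep learning library and many filters per layer.

```python
def conv1d_relu(x, weights, bias):
    """Apply one 1D-convolution filter followed by ReLU.

    x:       list of n input vectors (each of dimension d)
    weights: list of k weight vectors (each of dimension d), with k = 2c + 1
    bias:    scalar bias of this filter
    Returns n output values, one per input position.
    """
    k = len(weights)
    assert k % 2 == 1, "kernel size must be odd so that k = 2c + 1"
    c = k // 2
    zero = [0.0] * len(x[0])
    padded = [zero] * c + list(x) + [zero] * c    # zero-pad c positions on each side
    out = []
    for i in range(len(x)):                       # stride 1 over all positions
        total = bias
        for j in range(k):                        # padded[i + j] is x[i + j - c]
            total += sum(w * v for w, v in zip(weights[j], padded[i + j]))
        out.append(max(0.0, total))               # ReLU
    return out


x = [[1.0], [2.0], [3.0], [4.0]]                  # n = 4 positions, d = 1
weights = [[-1.0], [0.0], [1.0]]                  # k = 3, so c = 1
out = conv1d_relu(x, weights, 0.0)
print(out)                                        # [2.0, 2.0, 2.0, 0.0]
```

The output has exactly one value per input position, which is what allows a label to be predicted for every word. Taking `max(out)` instead, as a max-pooling layer over the sequence would, collapses the four positions into a single number and loses that alignment.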

Datasets
Following the experiments of a recent aspect extraction paper (Li and Lam, 2017), we conduct experiments on two benchmark datasets from SemEval challenges (Pontiki et al., 2014, 2016), as shown in Table 4.1. The first dataset is from the laptop domain on subtask 1 of SemEval-2014 Task 4. The second dataset is from the restaurant domain on subtask 1 (slot 2) of SemEval-2016 Task 5. These two datasets consist of review sentences with aspect terms labeled as spans of characters. We use NLTK to tokenize each sentence into a sequence of words.
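Because the aspect terms are annotated as character spans while the model labels tokens, the spans must be projected onto the token sequence. The sketch below illustrates one way to do this. It substitutes a simple regular-expression tokenizer for NLTK so that it stays self-contained, and the function name `spans_to_bio` and the end-exclusive offset convention are assumptions rather than the paper's exact preprocessing.

```python
import re


def spans_to_bio(sentence, aspect_spans):
    """Tokenize a sentence and label each token with B, I or O.

    aspect_spans: list of (start, end) character offsets, end exclusive (assumed convention).
    """
    tokens, labels = [], []
    # Stand-in tokenizer: runs of word characters, or single punctuation marks.
    for match in re.finditer(r"\w+|[^\w\s]", sentence):
        start, end = match.span()
        label = "O"
        for a_start, a_end in aspect_spans:
            if start >= a_start and end <= a_end:
                # The token lies inside an aspect span; the first token gets B.
                label = "B" if start == a_start else "I"
                break
        tokens.append(match.group())
        labels.append(label)
    return tokens, labels


sentence = "The battery life is great."
begin = sentence.index("battery life")
tokens, labels = spans_to_bio(sentence, [(begin, begin + len("battery life"))])
print(list(zip(tokens, labels)))
```

This simplified version assumes that span boundaries coincide with token boundaries; spans that start or end inside a token would need extra handling.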
For the general-purpose embeddings, we use the glove.840B.300d embeddings (Pennington et al., 2014), which are pre-trained from a corpus of 840 billion tokens that covers almost all web pages. These embeddings have 300 dimensions. For domain-specific embeddings, we collect a laptop review corpus and a restaurant review corpus and use fastText (Bojanowski et al., 2016) to train domain embeddings. The laptop review corpus contains all laptop reviews from the Amazon Review Dataset (He and McAuley, 2016). The restaurant review corpus is from the Yelp Review Dataset Challenge. We only use reviews from the restaurant categories that the second dataset is selected from. We set the embedding dimensions to 100 and the number of iterations to 30 (for a small embedding corpus, embeddings tend to be under-fitted), and keep the remaining hyper-parameters at the fastText defaults. We further use fastText to compose out-of-vocabulary word embeddings via subword N-gram embeddings.
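The idea behind composing out-of-vocabulary embeddings from subword N-grams can be sketched as follows. This is a simplified illustration, not fastText's implementation: fastText hashes N-grams into a fixed number of buckets, whereas the sketch uses a plain dictionary with invented vectors. The N-gram lengths 3 to 6 correspond to fastText's default `minn` and `maxn` settings for unsupervised models, and the helper names are hypothetical.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character N-grams of a word wrapped in the boundary markers '<' and '>'."""
    marked = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


def compose_oov(word, ngram_vectors, dim):
    """Average the vectors of all known N-grams of an unseen word."""
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        return [0.0] * dim          # nothing to compose from
    return [sum(v[t] for v in known) / len(known) for t in range(dim)]


# Invented 2-dimensional N-gram vectors, for illustration only.
ngram_vectors = {"<us": [1.0, 0.0], "sb>": [0.0, 1.0]}
print(char_ngrams("usb", 3, 3))     # ['<us', 'usb', 'sb>']
vector = compose_oov("usb", ngram_vectors, dim=2)
print(vector)                       # [0.5, 0.5]
```

Because unseen domain words often share N-grams with seen ones, this composition gives them a usable vector even though the embedding layers themselves are not updated during training.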

Baseline Methods
We compare DE-CNN with three groups of baselines using the standard evaluation of the datasets. The results of the first two groups are copied from (Li and Lam, 2017). The first group uses single-task approaches. CRF is conditional random fields with basic features and GloVe word embeddings (Pennington et al., 2014).
BiLSTM-CNN-CRF (Reimers and Gurevych, 2017) is the state-of-the-art from the Named Entity Recognition (NER) community. We use this baseline to demonstrate that a NER model may need further adaptation for aspect extraction.
The second group uses multi-task learning and also takes advantage of gold-standard opinion terms or a sentiment lexicon.
RNCRF (Wang et al., 2016) is a joint model with a dependency-tree-based recursive neural network and CRF for aspect and opinion terms co-extraction. Besides opinion annotations, it also uses handcrafted features.
CMLA is a multi-layer coupled-attention network that also performs aspect and opinion terms co-extraction. It uses gold-standard opinion labels in the training data.
MIN (Li and Lam, 2017) is a multi-task learning framework that has (1) two LSTMs for joint extraction of aspects and opinions, and (2) a third LSTM for discriminating sentimental and non-sentimental sentences. A sentiment lexicon and high-precision dependency rules are employed to find opinion terms.
The third group consists of variations of DE-CNN.
GloVe-CNN only uses glove.840B.300d to show that domain embeddings are important.
Domain-CNN does not use the general embeddings, to show that domain embeddings alone are not good enough because the domain corpus is too limited to train good embeddings for general words.
MaxPool-DE-CNN adds max-pooling in the last CNN layer. We use this baseline to show that the max-pooling operation used in the traditional CNN architecture is harmful to sequence labeling.
DE-OOD-CNN replaces the domain embeddings with out-of-domain embeddings to show that a large out-of-domain corpus is not a good replacement for a small in-domain corpus for domain embeddings. We use all electronics reviews as the out-of-domain corpus for the laptop domain and all Yelp reviews for the restaurant domain.
DE-Google-CNN replaces the GloVe embeddings with GoogleNews embeddings, which are pre-trained from a smaller corpus (100 billion tokens). We use this baseline to demonstrate that general embeddings pre-trained from a larger corpus perform better.
DE-CNN-CRF replaces the softmax activation with a CRF layer. We use this baseline to demonstrate that CRF may not further improve performance on the challenging task of aspect extraction.

Hyper-parameters
We hold out 150 training examples as validation data to decide the hyper-parameters. The first CNN layer has 128 filters with kernel size k = 3 (where c = 1 is the number of words in the left (or right) context) and 128 filters with kernel size k = 5 (c = 2). The remaining 3 CNN layers have 256 filters with kernel size k = 5 (c = 2) per layer. The dropout rate is 0.55 and the learning rate of the Adam optimizer (Kingma and Ba, 2014) is 0.0001 because CNN training tends to be unstable.
Regarding DE-CNN-CRF, the CRF layer can improve performance by 1-2% when the performance on the laptop data is about 75%, but it does not contribute much when that performance is above 80%. CRF is good at modeling label dependencies (e.g., label I must come after B), but many aspects are just single words, and the major types of errors (mentioned later) are not ones that CRF can solve. Note that, for practical reasons, we did not tune the hyper-parameters of DE-CNN-CRF because training the CRF layer is extremely slow.
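As a rough sanity check on the model size implied by the filter counts and kernel sizes given in this section, the following back-of-the-envelope calculation counts the trainable parameters of the convolution and fully-connected layers and the receptive field of the top CNN layer. It assumes that the outputs of the two filter groups in the first layer are concatenated into 256 channels, that every filter has one bias term, and that the frozen embedding matrices are not counted; these are assumptions of the sketch rather than figures reported in the paper.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per filter for a 1D-convolution layer."""
    return out_channels * in_channels * kernel_size + out_channels


embedding_dim = 300 + 100                        # general + domain embeddings
layer1 = conv_params(embedding_dim, 128, 3) + conv_params(embedding_dim, 128, 5)
later_layers = 3 * conv_params(256, 256, 5)      # layers 2 to 4
fully_connected = 256 * 3 + 3                    # 256 features to |Y| = 3 labels
total = layer1 + later_layers + fully_connected

# Each stride-1 layer with kernel size k widens the receptive field by k - 1 words.
receptive_field = 5 + 3 * (5 - 1)

print(layer1, later_layers, fully_connected, total)   # 409856 983808 771 1394435
print(receptive_field)                                 # 17
```

Under these assumptions the network has roughly 1.4 million trainable parameters, and the representation of each word at the top layer can draw on up to 17 surrounding words.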

Results and Analysis
One important baseline is BiLSTM-CNN-CRF, which is markedly worse than our method. We believe the reason is that this baseline leverages dependency-based embeddings (Levy and Goldberg, 2014), which could be very important for NER. NER models may require further adaptations (e.g., domain embeddings) for opinion texts.
DE-CNN has two major types of errors. One type comes from inconsistent labeling (e.g., for the restaurant data, the same aspect is sometimes labeled and sometimes not). The other major type comes from unseen aspects in test data that require the semantics of the conjunction word "and" to extract. For example, if A is an aspect and "A and B" appears, B should also be extracted but is not. We leave this to future work.

Conclusion
We propose a CNN-based aspect extraction model with a double embeddings mechanism that requires no extra supervision. Experimental results demonstrate that the proposed method outperforms state-of-the-art methods by a large margin.

Acknowledgments
This work was supported in part by NSF through grants IIS-1526499, IIS-1763325, and IIS-1407927, CNS-1626432, NSFC 61672313, and a gift from Huawei Technologies.