Recursive Neural Conditional Random Fields for Aspect-based Sentiment Analysis

In aspect-based sentiment analysis, extracting aspect terms along with the opinions being expressed from user-generated content is one of the most important subtasks. Previous studies have shown that exploiting connections between aspect and opinion terms is promising for this task. In this paper, we propose a novel joint model that integrates recursive neural networks and conditional random fields into a unified framework for explicit aspect and opinion term co-extraction. The proposed model learns high-level discriminative features and, at the same time, double-propagates information between aspect and opinion terms. Moreover, hand-crafted features can be flexibly incorporated into the proposed model to further boost its information extraction performance. Experimental results on the SemEval Challenge 2014 dataset show the superiority of our proposed model over several baseline methods as well as the winning systems of the challenge.


Introduction
Aspect-based sentiment analysis (Pang and Lee, 2008) aims to extract important information, e.g., opinion targets, opinion expressions, target categories, and opinion polarities, from user-generated content, such as microblogs and reviews. This task was first studied by Hu and Liu (2004a; 2004b), followed by (Popescu and Etzioni, 2005; Zhuang et al., 2006; Qiu et al., 2011). In aspect-based sentiment analysis, one of the goals is to extract explicit aspects of an entity from text, along with the opinions being expressed. For example, in the restaurant review "I have to say they have one of the fastest delivery times in the city.", the aspect term is delivery times, and the opinion term is fastest.
Among previous work, one approach is to accumulate aspect and opinion terms from a seed collection without label information, by utilizing syntactic rules or modification relations between them (Qiu et al., 2011; Liu et al., 2013b). In the above example, if we know that fastest is an opinion word, then delivery times can be deduced to be an aspect because fastest is its modifier. However, this approach largely relies on hand-coded rules, and is restricted to certain Part-of-Speech (POS) tags, e.g., opinion words are restricted to be adjectives. Another approach focuses on feature engineering based on predefined lexicons, syntactic analysis, etc. (Jin and Ho, 2009). A sequence labeling classifier is then built to extract aspect and opinion terms. This approach requires extensive effort for designing hand-crafted features, and only combines features linearly for classification, which ignores higher-order interactions.
To overcome the limitations of existing methods, we propose a novel model, namely Recursive Neural Conditional Random Fields (RNCRF). Specifically, RNCRF consists of two main components. The first component constructs a recursive neural network (RNN) (Socher et al., 2010) based on the dependency tree of each sentence. The goal is to learn a high-level feature representation of each word in the context of its sentence, and to make the representation learning for aspect and opinion terms interactive through the underlying dependency structure among them. The output of the RNN is then fed into a Conditional Random Field (CRF) (Lafferty et al., 2001) to learn a discriminative mapping from high-level features to labels, i.e., aspects, opinions, or others, so that context information can be well captured. Our main contributions are to use an RNN for encoding aspect-opinion relations in a high-level representation, and to present a joint optimization approach based on maximum likelihood and backpropagation to learn the RNN and CRF components simultaneously.
In this way, the label information of aspect and opinion terms can be dually propagated from parameter learning in the CRF to representation learning in the RNN. We conducted extensive experiments on the SemEval challenge 2014 (task 4) dataset (Pontiki et al., 2014) to verify the superiority of RNCRF over several baseline methods as well as the winning systems of the challenge.

Related Work
Hu and Liu (2004a) proposed to extract product aspects through association mining, and opinion terms by augmenting a seed opinion set using synonyms and antonyms in WordNet. In follow-up work, syntactic relations were further exploited for aspect/opinion extraction (Popescu and Etzioni, 2005; Wu et al., 2009; Qiu et al., 2011). For example, Qiu et al. (2011) used syntactic relations to double-propagate and augment the sets of aspects and opinions. Though the above models are unsupervised, they heavily depend on predefined rules for extraction, and are also restricted to specific types of POS tags for product aspects and opinions. Jin and Ho (2009), Jakob et al. (2010), and Ma et al. (2010) modeled the extraction problem as a sequence tagging problem, and proposed to use HMMs or CRFs to solve it. These methods rely on rich hand-crafted features, and do not explicitly consider interactions between aspect and opinion terms. Another direction is to use a word alignment model to capture opinion relations within a sentence (Liu et al., 2012; Liu et al., 2013a). This method requires sufficient data for modeling the desired relations.

Deep Learning for Sentiment Analysis
Recent studies have shown that deep learning models can automatically learn the inherent semantic and syntactic information from data and thus achieve better performance for sentiment analysis (Socher et al., 2011b; Socher et al., 2012; Socher et al., 2013; Glorot et al., 2011; Kalchbrenner et al., 2014; Kim, 2014; Le and Mikolov, 2014). These methods generally address sentence-level or phrase/word-level sentiment polarity prediction. Regarding aspect-based sentiment analysis, Irsoy et al. (2014) applied deep recurrent neural networks to opinion expression extraction. Dong et al. (2014) proposed an adaptive recurrent neural network for target-dependent sentiment classification, where targets or aspects are given as input. Tang et al. (2015) used Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) for the same task. Nevertheless, there is little work on aspect and opinion co-extraction using deep learning models.
To the best of our knowledge, the work most related to ours includes Yin et al. (2016) and an earlier model that combines a recurrent neural network with word embeddings to extract explicit aspects. However, that model simply uses a recurrent neural network on top of word embeddings, and thus heavily depends on the quality of the word embeddings. In addition, it fails to explicitly model dependency relations or compositionalities within certain syntactic structures in a sentence. Recently, Yin et al. (2016) proposed an unsupervised method to improve word embeddings using dependency path embeddings. A CRF is then trained with the embeddings independently in a pipeline. Different from their work, our model does not focus on developing a new unsupervised word embedding method, but on encoding the information of dependency paths inside the RNN to construct syntactically meaningful and discriminative hidden representations with labels. Moreover, we integrate the RNN and the CRF into a unified framework and develop a joint optimization approach, instead of training word embeddings and a CRF separately as in their work. Note that Weiss et al. (2015) combined deep learning and structured learning for language parsing, which can be learned by structured perceptron. However, they also separate neural network training from structured prediction.
RNNs have been used for many NLP tasks, such as learning phrase representations (Socher et al., 2010), sentence-level sentiment analysis (Socher et al., 2013), language parsing (Socher et al., 2011a), and question answering (Iyyer et al., 2014). The tree structures used for RNNs include the constituency tree and the dependency tree. In a constituency tree, all the words lie at leaf nodes, each internal node represents a phrase or a constituent of a sentence, and the root node represents the entire sentence (Socher et al., 2010; Socher et al., 2012; Socher et al., 2013). In a dependency tree, each node, whether terminal or nonterminal, represents a word, with dependency connections to other nodes (Iyyer et al., 2014). The resultant model is known as the dependency-tree RNN (DT-RNN). An advantage of using a dependency tree over a constituency tree is the ability to extract word-level representations that take syntactic relations and semantic robustness into account. Therefore, we adopt DT-RNN in this work.

Problem Statement
Suppose that we are given a training set of customer reviews in a specific domain, denoted by $S = \{s_1, ..., s_N\}$, where $N$ is the number of review sentences. For any $s_i \in S$, there may exist a set of aspect terms $A_i = \{a_{i1}, ..., a_{il}\}$, where each $a_{ij}$ can be a single word or a sequence of words expressing explicitly some aspect of an entity, and a set of opinion terms $O_i = \{o_{i1}, ..., o_{im}\}$, where each $o_{ir}$ can be a single word or a sequence of words expressing the subjective sentiment of the comment holder. The task is to learn a classifier to extract the set of aspect terms $A_i$ and the set of opinion terms $O_i$ from each review sentence $s_i \in S$. This task can be formulated as a sequence tagging problem by using the BIO encoding scheme. Specifically, each review sentence $s_i$ is composed of a sequence of words, and each word $w_{ip} \in s_i$ is labeled as one out of the following 5 classes: "BA" (beginning of aspect), "IA" (inside of aspect), "BO" (beginning of opinion), "IO" (inside of opinion), and "O" (others). Let $L = \{BA, IA, BO, IO, O\}$. We are also given a test set of review sentences, for which the objective is to predict a label in $L$ for each word. Note that a sequence of predictions with "BA" at the beginning followed by "IA" indicates one aspect term, and similarly for opinion terms.
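To make the tagging scheme concrete, the following minimal sketch decodes a label sequence over $L$ back into aspect and opinion terms. The function name `decode_terms` and the handling of inconsistent labels (an "IA" or "IO" without a matching beginning label is ignored) are illustrative assumptions, not part of the original system.

```python
def decode_terms(words, labels):
    """Group BA/IA and BO/IO label runs into aspect and opinion terms."""
    aspects, opinions = [], []
    current, kind = [], None  # words of the open span; kind is "A" (aspect) or "O" (opinion)

    def close():
        # Store the open span (if any) in the list matching its kind.
        if current:
            (aspects if kind == "A" else opinions).append(" ".join(current))

    for word, label in zip(words, labels):
        if label in ("BA", "BO"):              # a new term starts here
            close()
            current, kind = [word], label[1]
        elif label in ("IA", "IO") and current and kind == label[1]:
            current.append(word)               # continue the open term
        else:                                  # "O", or an "I*" label with no matching start
            close()
            current, kind = [], None
    close()
    return aspects, opinions


# The restaurant review from the Introduction, labeled by hand for illustration.
words = "I have to say they have one of the fastest delivery times in the city".split()
labels = ["O"] * 9 + ["BO", "BA", "IA", "O", "O", "O"]
aspects, opinions = decode_terms(words, labels)
```

Here `aspects` contains the single term "delivery times" and `opinions` contains "fastest", matching the example given in the Introduction.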

Recursive Neural CRFs
As described in Section 1, RNCRF consists of two main components: 1) a DT-RNN to learn a high-level representation for each word in a sentence, and 2) a CRF that takes the learned representation as input to capture the context around each word for explicit aspect and opinion term extraction. Next, we present these two components in detail.

Dependency-Tree RNNs
We begin by associating each word $w$ in our vocabulary with a feature vector $x \in \mathbb{R}^{d}$, which corresponds to a column of a word embedding matrix $W_e \in \mathbb{R}^{d \times v}$, where $v$ is the size of the vocabulary.
For each sentence, we build a DT-RNN based on the corresponding dependency parse tree with word embeddings as initialization. An example of the dependency parse tree is shown in Figure 1(a), where each edge starts from the parent and points to its dependent with a syntactic relation.
In a DT-RNN, each node $n$, including leaf nodes, internal nodes and the root node, for a specific sentence is associated with a word $w$, an input feature vector $x_w$ and a hidden vector $h_n \in \mathbb{R}^{d}$ of the same dimension as $x_w$. Each dependency relation $r$ is associated with a separate matrix $W_r \in \mathbb{R}^{d \times d}$. In addition, a common transformation matrix $W_v \in \mathbb{R}^{d \times d}$ is introduced to map the word embedding $x_w$ at node $n$ to its corresponding hidden vector $h_n$.
Figure 1: Examples of (a) a dependency tree, (b) a DT-RNN tree structure, and (c) an RNCRF structure for a review sentence.
Along a particular dependency tree, the hidden vector $h_n$ is computed from the word embedding $x_w$ at node $n$ with the transformation matrix $W_v$, and from its children's hidden vectors $h_{child(n)}$ with the corresponding relation matrices $\{W_r\}$. For instance, given the parse tree shown in Figure 1(a), we first compute the hidden vectors of the leaf nodes associated with I and the using $W_v$ as follows,

$$h_{\text{I}} = f(W_v \cdot x_{\text{I}} + b), \qquad h_{\text{the}} = f(W_v \cdot x_{\text{the}} + b),$$

where $f$ is a non-linear activation function and $b$ is a bias term. In this paper, we adopt $\tanh(\cdot)$ as the activation function. Once the hidden vectors of all the leaf nodes are generated, we can recursively generate hidden vectors for interior nodes using the corresponding relation matrix $W_r$ and the common transformation matrix $W_v$ as follows,

$$h_{\text{food}} = f(W_v \cdot x_{\text{food}} + W_{\text{DET}} \cdot h_{\text{the}} + b),$$
$$h_{\text{like}} = f(W_v \cdot x_{\text{like}} + W_{\text{DOBJ}} \cdot h_{\text{food}} + W_{\text{NSUBJ}} \cdot h_{\text{I}} + b).$$

The resultant DT-RNN is shown in Figure 1(b). In general, a hidden vector for any node $n$ associated with a word vector $x_w$ can be computed as follows,

$$h_n = f\Big(W_v \cdot x_w + b + \sum_{k \in K_n} W_{r_{nk}} \cdot h_k\Big),$$

where $K_n$ denotes the set of children of node $n$, $r_{nk}$ denotes the dependency relation between node $n$ and its child node $k$, and $h_k$ is the hidden vector of the child node $k$. The parameters of DT-RNN, $\Theta_{RNN} = \{W_v, W_r, W_e, b\}$, are learned during training.
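The bottom-up computation described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the helper names (`matvec`, `dt_rnn_hidden`), the toy dimension `d = 3`, the random initial values, and the relation names `nsubj`, `dobj` and `det` for the sentence "I like the food" are all chosen for the example, and words are used as node identifiers (a real implementation would index nodes by token position).

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def dt_rnn_hidden(node, x, children, W_v, W_r, b, h=None):
    """Compute h_n = tanh(W_v x_n + b + sum_k W_{r_nk} h_k) for node and its subtree."""
    h = {} if h is None else h
    total = [a + c for a, c in zip(matvec(W_v, x[node]), b)]
    for rel, child in children.get(node, []):
        dt_rnn_hidden(child, x, children, W_v, W_r, b, h)   # children are computed first
        total = [a + c for a, c in zip(total, matvec(W_r[rel], h[child]))]
    h[node] = [math.tanh(a) for a in total]
    return h


d = 3
rng = random.Random(0)


def rand_mat():
    return [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(d)]


# Dependency tree of "I like the food": parent -> [(relation, child), ...]
children = {"like": [("nsubj", "I"), ("dobj", "food")], "food": [("det", "the")]}
x = {w: [rng.uniform(-1, 1) for _ in range(d)] for w in ["I", "like", "the", "food"]}
W_v, b = rand_mat(), [0.0] * d
W_r = {rel: rand_mat() for rel in ("nsubj", "dobj", "det")}   # one matrix per relation

h = dt_rnn_hidden("like", x, children, W_v, W_r, b)  # "like" is the root word
```

After the call, `h` maps every word to its hidden vector; the vectors for "I" and "the" depend only on their own embeddings, while the vector for "like" also depends on those of "I" and "food".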

Integration with CRFs
CRFs are discriminative graphical models for structured prediction. In RNCRF, we feed the output of DT-RNN, i.e., the hidden representation of each word in a sentence, to a CRF. Updates of parameters for RNCRF are carried out successively from top to bottom, by propagating errors through the CRF to the hidden layers of the RNN (including word embeddings) using backpropagation through structure (BPTS) (Goller and Küchler, 1996). Formally, for each sentence $s_i$, we denote the input for the CRF by $h_i$, which is generated by DT-RNN.
Here $h_i$ is a matrix whose columns are the hidden vectors of the words in $s_i$, and the model computes a structured output $y_i \in Y$, where $Y$ is the set of possible combinations of labels in the label set $L$. The entire structure can be represented by an undirected graph $G = (V, E)$ with cliques $c \in C$. In this paper, we employ a linear-chain CRF, which has two different cliques: the unary clique (U) representing the input-output connection, and the pairwise clique (P) representing the adjacent output connection, as shown in Figure 1(c). During inference, the model aims to output $\hat{y}$ with the maximum conditional probability $p(y|h)$. (We drop the subscript $i$ here for simplicity.) The distribution is computed from the potential outputs of the cliques:

$$p(y|h) = \frac{1}{Z(h)} \prod_{c \in C} \psi_c(h, y_c),$$

where $Z(h)$ is the normalization term, and $\psi_c(h, y_c)$ is the potential of clique $c$, computed as $\psi_c(h, y_c) = \exp\langle W_c, F(h, y_c) \rangle$, where the RHS is the exponential of a linear combination of the feature vector $F(h, y_c)$ for clique $c$, and the weight vector $W_c$ is tied across all unary cliques and across all pairwise cliques, respectively. We also incorporate a context window of size $2T + 1$ when computing unary potentials. Thus, the potential of the unary clique at node $k$ can be written as

$$\psi_U(h, y_k) = \exp\Big( (W_0)_{y_k} \cdot h_k + \sum_{t=1}^{T} (W_{-t})_{y_k} \cdot h_{k-t} + \sum_{t=1}^{T} (W_{+t})_{y_k} \cdot h_{k+t} \Big),$$

where $W_0$, $W_{+t}$ and $W_{-t}$ are weight matrices of the CRF for the current position, the $t$-th position to the right, and the $t$-th position to the left within the context window, respectively. The subscript $y_k$ indicates the corresponding row in the weight matrix.
For instance, Figure 2 shows an example with window size 3. At the second position, the input features for like are composed of the hidden vectors at position 1 ($h_{\text{I}}$), position 2 ($h_{\text{like}}$) and position 3 ($h_{\text{the}}$). Therefore, the conditional distribution for the entire sequence $y$ in Figure 1(c) can be calculated as

$$p(y|h) = \frac{1}{Z(h)} \exp\Big( \sum_{k=1}^{4} (W_0)_{y_k} \cdot h_k + \sum_{k=2}^{4} (W_{-1})_{y_k} \cdot h_{k-1} + \sum_{k=1}^{3} (W_{+1})_{y_k} \cdot h_{k+1} + \sum_{k=1}^{3} V_{y_k, y_{k+1}} \Big),$$

where the first three terms in the exponential of the RHS correspond to the unary cliques, while the last term corresponds to the pairwise cliques, with the matrix $V$ representing pairwise state transition scores. To simplify the description of parameter updates, we denote the log-potential for clique $c \in \{U, P\}$ by $g_c(h, y_c) = \langle W_c, F(h, y_c) \rangle$.
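The following sketch illustrates how the windowed unary scores, the normalization term $Z(h)$ and the most probable label sequence of a linear-chain CRF can be computed. It is a plain-Python illustration under stated assumptions, not the CRFSuite-based implementation used in the experiments: the function names, the choice to skip window positions that fall outside the sentence, and the toy values for `h`, `W` and `V` (4 words, hidden size 2, 3 labels) are made up for the example.

```python
import math
from itertools import product


def unary_scores(h, W, T):
    """u[k][y] = sum over offsets t in [-T, T] of (W_t)_y . h_{k+t}; W maps offset -> matrix."""
    n, n_labels = len(h), len(W[0])
    u = [[0.0] * n_labels for _ in range(n)]
    for k in range(n):
        for t in range(-T, T + 1):
            if 0 <= k + t < n:  # positions outside the sentence contribute nothing
                for y in range(n_labels):
                    u[k][y] += sum(w * v for w, v in zip(W[t][y], h[k + t]))
    return u


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(u, V):
    """log Z(h), computed with the forward algorithm."""
    alpha = list(u[0])
    for k in range(1, len(u)):
        alpha = [u[k][y] + logsumexp([alpha[p] + V[p][y] for p in range(len(V))])
                 for y in range(len(V))]
    return logsumexp(alpha)


def log_prob(y, u, V):
    """log p(y|h) = unary scores + pairwise transition scores - log Z(h)."""
    score = sum(u[k][y[k]] for k in range(len(y)))
    score += sum(V[y[k]][y[k + 1]] for k in range(len(y) - 1))
    return score - log_partition(u, V)


def viterbi(u, V):
    """Return the label sequence with maximum p(y|h) by dynamic programming."""
    n_labels = len(V)
    score, back = list(u[0]), []
    for k in range(1, len(u)):
        prev = [max(range(n_labels), key=lambda p: score[p] + V[p][y])
                for y in range(n_labels)]
        score = [score[prev[y]] + V[prev[y]][y] + u[k][y] for y in range(n_labels)]
        back.append(prev)
    path = [max(range(n_labels), key=lambda q: score[q])]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    return path[::-1]


# Toy inputs: 4 words, hidden size 2, 3 labels, window size 3 (T = 1).
h = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2], [0.0, 0.6]]
W = {t: [[0.1 * (t + y), -0.2 * y] for y in range(3)] for t in (-1, 0, 1)}
V = [[0.0, 0.5, -0.5], [0.2, 0.0, 0.1], [-0.3, 0.4, 0.0]]
u = unary_scores(h, W, T=1)
total = sum(math.exp(log_prob(y, u, V)) for y in product(range(3), repeat=4))
best = viterbi(u, V)
```

Summing `exp(log_prob(...))` over all label sequences should give a value of 1 up to rounding, which is a convenient sanity check that the forward recursion matches the definition of $Z(h)$.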

Joint Training for RNCRF
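Through the objective of maximum likelihood, the parameters of the CRF (the unary weight matrices $\Theta_U = \{W_0, W_{+t}, W_{-t}\}$ and the pairwise weight matrix $V$) are updated first, by applying the chain rule to the log-potentials. Below is the gradient for $\Theta_U$, shown for $W_0$ (updates for $V$ are similar through the log-potential of the pairwise clique $g_P(y'_k, y'_{k+1})$):

$$\Delta (W_0)_{y'_k} = \frac{\partial(-\log p(y|h))}{\partial g_U(h, y'_k)} \cdot \frac{\partial g_U(h, y'_k)}{\partial (W_0)_{y'_k}}, \qquad \frac{\partial(-\log p(y|h))}{\partial g_U(h, y'_k)} = -\big(\mathbb{1}_{y_k = y'_k} - p(y'_k|h)\big), \qquad (4)$$

where $y'_k$ represents a possible label configuration of node $k$. The parameters of DT-RNN are updated subsequently by applying the chain rule with (4) through BPTS as follows,

$$\Delta h_{root} = \frac{\partial(-\log p(y|h))}{\partial g_U(h, y'_{root})} \cdot \frac{\partial g_U(h, y'_{root})}{\partial h_{root}},$$
$$\Delta h_{k \neq root} = \frac{\partial(-\log p(y|h))}{\partial g_U(h, y'_k)} \cdot \frac{\partial g_U(h, y'_k)}{\partial h_k} + \Delta h_{par(k)} \cdot \frac{\partial h_{par(k)}}{\partial h_k},$$
$$\Delta \Theta_{RNN} = \sum_{k} \frac{\partial(-\log p(y|h))}{\partial h_k} \cdot \frac{\partial h_k}{\partial \Theta_{RNN}}, \qquad (8)$$

where $h_{root}$ represents the hidden vector of the word pointed to by ROOT in the corresponding DT-RNN. Since this word is the topmost node in the tree, it only inherits error from the CRF output. $h_{par(k)}$ is the hidden vector of the parent node of node $k$ in DT-RNN. Hence the lower nodes receive error both from the CRF output and from the error propagated from their parent node. The parameters within DT-RNN, $\Theta_{RNN}$, are updated by applying the chain rule with respect to the updates of the hidden vectors, and aggregating among all associated nodes, as shown in (8). The overall procedure of RNCRF is summarized in Algorithm 1.

Algorithm 1: Recursive Neural CRFs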
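The error signal that the CRF passes down to the DT-RNN is the derivative of the negative log-likelihood with respect to a unary log-potential. As a short derivation sketch, using only the definitions of $p(y|h)$, $Z(h)$ and $g_c$ from this section (here $\tilde{y}$ is an auxiliary symbol for a summation variable over label sequences, and $p(y'_k|h)$ denotes the marginal probability that node $k$ takes label $y'_k$):

```latex
\begin{aligned}
-\log p(y|h) &= -\sum_{c \in C} g_c(h, y_c) + \log Z(h), \qquad
Z(h) = \sum_{\tilde{y} \in Y} \exp\Big(\sum_{c \in C} g_c(h, \tilde{y}_c)\Big), \\
\frac{\partial \log Z(h)}{\partial g_U(h, y'_k)}
  &= \sum_{\tilde{y}:\, \tilde{y}_k = y'_k} p(\tilde{y}|h) = p(y'_k|h), \\
\frac{\partial(-\log p(y|h))}{\partial g_U(h, y'_k)}
  &= -\mathbb{1}_{y_k = y'_k} + p(y'_k|h)
   = -\big(\mathbb{1}_{y_k = y'_k} - p(y'_k|h)\big).
\end{aligned}
```

In words, the error at node $k$ for label $y'_k$ is the gap between the model's marginal belief in that label and the indicator of the gold label, so it vanishes when the model assigns probability one to the correct label.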

Discussion
The best performing system (Toh and Wang, 2014) for the SemEval challenge 2014 employed CRFs with extensive hand-crafted features, including those induced from dependency trees. However, their experiments showed that the addition of the features induced from dependency relations does not improve the performance. This indicates the infeasibility or difficulty of incorporating dependency structure explicitly as input features, which motivates the design of our model to use DT-RNN to encode dependencies between words for feature learning. The most important advantage of RNCRF is the ability to learn the underlying dual propagation between aspect and opinion terms from the tree structure itself. Specifically, consider the example shown in Figure 1(c), where the aspect is food and the opinion expression is like. In the dependency tree, food depends on like with the relation DOBJ. During training, RNCRF computes the hidden vector $h_{\text{like}}$ for like, which is obtained from $h_{\text{food}}$. As a result, the prediction for like is affected by $h_{\text{food}}$. This is one-way propagation from food to like. During backpropagation, the error for like is propagated in a top-down manner to revise the representation $h_{\text{food}}$. This is the other-way propagation from like to food. Therefore, the dependency structure together with the learning approach helps to enforce the dual propagation of aspect-opinion pairs as long as a dependency relation exists between them, either directly or indirectly.

Adding Linguistic/Lexicon Features
RNCRF is an end-to-end model, where feature engineering is not necessary. However, light hand-crafted features can be flexibly incorporated into RNCRF to further boost its performance, such as features based on POS tags, name lists, or a sentiment lexicon. These features can be appended to the hidden vector of each word, but are kept fixed during training, unlike the learnable neural inputs and the CRF weights described in Section 4.3. As will be shown in the experiments, RNCRF without any hand-crafted features slightly outperforms the best performing systems that involve heavy feature engineering efforts, and RNCRF with light feature engineering achieves even better performance.

Dataset and Experimental Setup
We evaluate our model on the SemEval Challenge 2014 task 4 dataset with reviews from two domains: restaurant and laptop reviews. The detailed description of the dataset is given in Table 1. As the original dataset only includes manually annotated labels for aspect terms but not for opinion terms, we manually annotated opinion terms for each sentence ourselves to facilitate our experiments. For word vector initialization, we trained word embeddings with word2vec (Mikolov et al., 2013) on the Yelp Challenge dataset (http://www.yelp.com/dataset_challenge) for the [...]

Table 1: SemEval Challenge 2014 task 4 dataset.
Domain       Training   Test    Total
Restaurant   3,041      800     3,841
Laptop       3,045      800     3,845
Total        6,086      1,600   7,686

Empirical sensitivity studies on different dimensions of word embeddings are also conducted. Dependency trees are generated using the Stanford Dependency Parser. Regarding CRFs, we implemented a linear-chain CRF using CRFSuite (Okazaki, 2007). Because of the relatively small size of the training data and the large number of parameters, we pretrained the parameters of DT-RNN with cross-entropy error, which is a common strategy for deep learning (Erhan et al., 2009). We implemented mini-batch stochastic gradient descent (SGD) with batch size 25 and an adaptive learning rate (AdaGrad) initialized at 0.02 for the pretraining of DT-RNN, which runs for 4 epochs for the restaurant domain and 5 epochs for the laptop domain. For parameter learning of the joint model RNCRF, we implemented SGD with a decaying learning rate initialized at 0.02. We also tried varying the context window size, and used 3 for the laptop domain and 5 for the restaurant domain. All parameters are chosen by cross validation. As discussed in Section 5.1, hand-crafted features can be easily incorporated into RNCRF. We generated three types of simple features based on POS tags, name lists and a sentiment lexicon to show the further improvement obtained by incorporating these features.
Following Toh and Wang (2014), we extracted two name lists from the training data for each domain, where one includes high-frequency aspect terms, and the other includes high-probability aspect words. These two sets are used to construct two lexicon features, i.e., we built a 2D binary vector: if a word is in a set, the corresponding value is 1, otherwise 0. For POS tags, we used the Stanford POS tagger (Toutanova et al., 2003), and converted the tags to universal POS tags that have 15 different categories. We then generated 15 one-hot POS tag features. For the sentiment lexicon, we used the collection of commonly used opinion words (around 6,800) (Hu and Liu, 2004a). Similar to the name lists, we created a binary feature to indicate whether the word belongs to the opinion lexicon. We denote by RNCRF+F the proposed model with the three types of features.
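The three feature types described above can be assembled into a single fixed vector per word and appended to that word's hidden vector. The sketch below is illustrative only: the function names, the lowercasing of words before lookup, and the tiny 3-tag inventory and toy lists in the usage lines are assumptions made for the example (the experiments use 15 universal POS categories and lists extracted from the training data).

```python
def handcrafted_features(word, pos, pos_tags, frequent_terms, probable_words, opinion_lexicon):
    """Return [2 name-list bits] + [one-hot POS over pos_tags] + [1 sentiment-lexicon bit]."""
    w = word.lower()  # assumption: case-insensitive lookup
    name_list = [float(w in frequent_terms), float(w in probable_words)]
    pos_one_hot = [float(pos == tag) for tag in pos_tags]
    lexicon = [float(w in opinion_lexicon)]
    return name_list + pos_one_hot + lexicon


def augment(hidden, features):
    """Append the fixed hand-crafted features to a word's hidden vector (the CRF input)."""
    return list(hidden) + list(features)


# Toy usage with a hypothetical 3-tag inventory and hand-made lists.
pos_tags = ["NOUN", "VERB", "ADJ"]
feats = handcrafted_features(
    "fastest", "ADJ", pos_tags,
    frequent_terms={"delivery times"},
    probable_words={"delivery", "times"},
    opinion_lexicon={"fastest", "good"},
)
crf_input = augment([0.2, -0.1], feats)  # hidden vector of size 2 plus 6 feature values
```

With the full setup described in the text, each word would receive 2 + 15 + 1 = 18 additional input dimensions, which stay fixed while the hidden vector and the CRF weights are learned.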
Compared to the winning systems of SemEval Challenge 2014, RNCRF and RNCRF+F use additional labels of opinion terms for training. Therefore, to conduct fair comparison experiments with the winning systems, we implemented RNCRF-O by omitting opinion labels when training our model (the labels become "BA", "IA", "O"). Accordingly, we denote by RNCRF-O+F the RNCRF-O model with the three additional types of features.
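For readers unfamiliar with the adaptive learning rate mentioned in the setup, the standard AdaGrad update divides each parameter's step by the square root of its accumulated squared gradients. The sketch below shows one such step on a flat list of parameters; the function name, the flat-list representation and the small constant `eps` are assumptions for illustration, and only the initial rate of 0.02 is taken from the text.

```python
import math


def adagrad_step(params, grads, cache, lr=0.02, eps=1e-8):
    """One AdaGrad update, in place: cache accumulates squared gradients per parameter."""
    for i, g in enumerate(grads):
        cache[i] += g * g
        params[i] -= lr * g / (math.sqrt(cache[i]) + eps)
    return params, cache


# Toy usage: a single parameter with gradient 0.5.
params, cache = [1.0], [0.0]
adagrad_step(params, [0.5], cache)
```

On the first step the effective step size is close to the initial rate itself regardless of the gradient's magnitude; on later steps it shrinks as squared gradients accumulate in `cache`.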

Experimental Results
We compare our model with several baselines:
CRF-1: a linear-chain CRF with standard linguistic features including word string, stylistics, POS tag, context string, and context POS tags.
CRF-2: a linear-chain CRF with both standard linguistic features and dependency information, including head word and dependency relations with parent token and child tokens.
LSTM: an LSTM network built on top of word embeddings. We keep the original settings of that model but replace its word embeddings with ours (300 dimensions). We tried different hidden layer dimensions (50, 100, 150, 200) and report the best result, obtained with size 50.
LSTM+F: the above LSTM model with the same three additional types of features as RNCRF+F.
SemEval-1, SemEval-2: the top two winning systems of the SemEval challenge 2014 (task 4).
WDEmb+B+CRF: the model proposed by Yin et al. (2016).
The comparison results are shown in Table 2 for both the restaurant domain and the laptop domain. Note that we provided the same annotated dataset (both aspect labels and opinion labels are included for training) for CRF-1, CRF-2 and LSTM for fair comparison. It is clear that our proposed model RNCRF achieves superior performance compared with all the baseline models. The performance improves further by adding simple hand-crafted features, i.e., RNCRF+F, with 0.92% and 3.87% absolute improvement over the best system in the challenge for aspect extraction for the restaurant domain and the laptop domain, respectively. This shows the advantage of combining high-level continuous features and discrete hand-crafted features. Though CRFs usually show promising results in sequence tagging problems, they fail to achieve comparable performance when they lack extensive features (e.g., CRF-1). By adding dependency information explicitly in CRF-2, the result only improves slightly for aspect extraction. Alternatively, by incorporating dependency information into deep models (e.g., RNCRF), the result shows more than 7% improvement for aspect extraction and 2% for opinion extraction.
By removing the labels for opinion terms, RNCRF-O produces results inferior to those of RNCRF, because the effect of dual propagation of aspect and opinion pairs disappears in the absence of opinion labels. This verifies our previous assumption that DT-RNN can learn the interactive effects between aspects and opinions. However, the performance of RNCRF-O is still comparable to the top systems, and even better with the addition of simple linguistic features. In the original LSTM work, however, well-pretrained word embeddings were used, obtained by training with a large corpus or extensive external resources, e.g., chunking and NER. To compare that model with RNCRF, we re-implemented LSTM with the same word embedding strategy and labeling resources as ours. The results show that our model outperforms LSTM in aspect extraction by 2.90% and 4.10% for the restaurant domain and the laptop domain, respectively. We conclude that a single LSTM model fails to extract the relations between aspect and opinion terms. Even with the addition of the same linguistic features, LSTM is still inferior to RNCRF itself in terms of aspect extraction. Our result is comparable with WDEmb+B+CRF in the restaurant domain and better in the laptop domain (+3.26%). Note that WDEmb+B+CRF appends dependency context information to the CRF input, while our model encodes such information into high-level representation learning.
To test the impact of each component of RNCRF and of the hand-crafted features, we conducted experiments on different model settings:
DT-RNN+SoftMax: rather than using a CRF, a softmax classifier is used on top of DT-RNN.
CRF+word2vec: a linear-chain CRF with word embeddings only, without using DT-RNN.
RNCRF+POS/NL/Lex: the RNCRF model with the POS tag, name list, or sentiment lexicon feature.
The comparison results are shown in Table 3. Similarly, both aspect and opinion term labels are provided for training for each of the above models. Firstly, RNCRF achieves much better results compared to DT-RNN+SoftMax (+11.60% and +10.72% for aspect extraction in the restaurant domain and the laptop domain, respectively). This is because DT-RNN alone fails to fully exploit context information for sequential labeling, which the CRF provides. Secondly, RNCRF outperforms CRF+word2vec, which demonstrates the importance of DT-RNN for modeling interactions between aspects and opinions. Hence, the combination of DT-RNN and CRF inherits the advantages of both models. Moreover, by separately adding hand-crafted features, we can observe that name-list based features and the sentiment lexicon are most effective for aspect extraction and opinion extraction, respectively. This might be explained by the fact that name-list based features usually contain informative evidence for aspect terms, while the sentiment lexicon provides explicit indication about opinions.
Besides the comparison experiments, we also conducted a sensitivity test for our proposed model in terms of word vector dimensions. We tested a set of different dimensions ranging from 25 to 400, in increments of 25. The sensitivity plot is shown in Figure 3. The performance for aspect extraction varies smoothly with the vector length for both domains. For the restaurant domain, the result is stable after dimension 100, with the highest at 325. For the laptop domain, the best result is at dimension 300, but with relatively small variations. For opinion extraction, the performance reaches a good level after dimension 75 for the restaurant domain and 125 for the laptop domain. This demonstrates the stability and robustness of our model.

Conclusion
We have presented a joint model, RNCRF, that achieves state-of-the-art performance for explicit aspect and opinion term extraction on a benchmark dataset. With the help of DT-RNN, high-level features can be learned by encoding the underlying dual propagation of aspect-opinion pairs. RNCRF combines the advantages of DT-RNNs and CRFs, and thus outperforms traditional rule-based methods in terms of flexibility, because aspect terms and opinion terms are not restricted to certain observed relations and POS tags. Compared to feature engineering methods with CRFs, the proposed model saves much effort in composing features, and it is able to extract higher-level features obtained from non-linear transformations.