Fine-grained Opinion Mining with Recurrent Neural Networks and Word Embeddings

The tasks in fine-grained opinion mining can be regarded as either a token-level sequence labeling problem or as a semantic compositional task. We propose a general class of discriminative models based on recurrent neural networks (RNNs) and word embeddings that can be successfully applied to such tasks without any task-specific feature engineering effort. Our experimental results on the task of opinion target identification show that RNNs, without using any hand-crafted features, outperform feature-rich CRF-based models. Our framework is flexible, allows us to incorporate other linguistic features, and achieves results that rival the top performing systems in SemEval-2014.


Introduction
Fine-grained opinion mining involves identifying the opinion holder who expresses the opinion, detecting opinion expressions, measuring their intensity and sentiment, and identifying the target or aspect of the opinion. For example, in the sentence "John says, the hard disk is very noisy", John, the opinion holder, expresses a very negative (i.e., sentiment with intensity) opinion towards the target "hard disk" using the opinionated expression "very noisy". A number of NLP applications can benefit from fine-grained opinion mining, including opinion summarization and opinion-oriented question answering.
The tasks in fine-grained opinion mining can be regarded as either a token-level sequence labeling problem or as a semantic compositional task at the sequence (e.g., phrase) level. For example, identifying opinion holders, opinion expressions and opinion targets can be formulated as a token-level sequence tagging problem, where the task is to label each word in a sentence using the conventional BIO tagging scheme. For example, Table 1 shows a sentence tagged with the BIO scheme for opinion target (middle row) and for opinion expression (bottom row) identification tasks. On the other hand, characterizing intensity and sentiment of an opinionated expression can be regarded as a semantic compositional problem, where the task is to aggregate vector representations of tokens in a meaningful way and later use them for sentiment classification (Socher et al., 2013). Conditional random fields (CRFs) (Lafferty et al., 2001) have been quite successful for different fine-grained opinion mining tasks, e.g., opinion expression extraction (Yang and Cardie, 2012). The state-of-the-art model for opinion target extraction is also based on a CRF (Pontiki et al., 2014). However, the success of CRFs depends heavily on the use of an appropriate feature set and feature function expansion, which often requires a lot of engineering effort for each task at hand.
Table 1: An example sentence annotated with BIO labels for opinion target (TARG tags) and for opinion expression (EXPR tags) extraction.
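To make the BIO scheme concrete, the following minimal Python sketch converts token spans into B/I/O labels. The function name `spans_to_bio`, the span format (start and exclusive end token indices) and the tokenization of the example sentence are assumptions made for illustration; they are not taken from our released code.

```python
def spans_to_bio(tokens, spans, tag="TARG"):
    """Label each token with B-/I-/O given (start, end) token spans (end exclusive)."""
    labels = ["O"] * len(tokens)
    for start, end in spans:
        labels[start] = f"B-{tag}"          # first token of the span
        for i in range(start + 1, end):
            labels[i] = f"I-{tag}"          # remaining tokens inside the span
    return labels


# Hypothetical tokenization of the example sentence from the introduction.
tokens = "John says , the hard disk is very noisy".split()
target_labels = spans_to_bio(tokens, [(4, 6)], tag="TARG")   # "hard disk"
expr_labels = spans_to_bio(tokens, [(7, 9)], tag="EXPR")     # "very noisy"
```

Under this encoding, the target extraction task reduces to predicting one of three labels per token.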
An alternative approach, deep learning, automatically learns latent features as distributed vectors and has recently been shown to outperform CRFs on similar tasks. For example, Irsoy and Cardie (2014) apply deep recurrent neural networks (RNNs) to extract opinion expressions from sentences and show that RNNs outperform CRFs. Socher et al. (2013) propose recursive neural networks for a semantic compositional task to identify the sentiments of phrases and sentences hierarchically using the syntactic parse trees.
Meanwhile, recent advances in word embedding induction methods (Collobert and Weston, 2008; Mikolov et al., 2013b) have benefited researchers in two ways: (i) they have contributed to significant gains when used as extra word features in existing NLP systems (Turian et al., 2010; Lebret and Lebret, 2013), and (ii) they have enabled more effective training of RNNs by providing compact input representations of the words (Mesnil et al., 2013; Irsoy and Cardie, 2014). Motivated by the recent success of deep learning, in this paper we propose a general class of models based on RNN architecture and word embeddings that can be successfully applied to fine-grained opinion mining tasks without any task-specific feature engineering effort. We experiment with several important RNN architectures including Elman-RNN, Jordan-RNN, long short-term memory (LSTM) and their variations. We acquire pre-trained word embeddings from several external sources to give better initialization to our RNN models. The RNN models then fine-tune the word vectors during training to learn task-specific embeddings. We also present an architecture to incorporate other linguistic features into RNNs.
Our results on the task of opinion target extraction show that word embeddings improve the performance of state-of-the-art CRF models when included as additional features. They also improve RNNs when used as pre-trained word vectors, and fine-tuning them on the task gives the best results. A comparison between models demonstrates that RNNs outperform CRFs, even when they use word embeddings as the only features. Incorporating simple linguistic features into RNNs improves the performance even further. Our best results with LSTM RNN outperform the top performing system on the Laptop dataset and achieve the second best on the Restaurant dataset in SemEval-2014. We make our source code available (see footnote 1).
In the remainder of this paper, after discussing related work in Section 2, we present our RNN models in Section 3. In Section 4, we briefly describe the pre-trained word embeddings. The experiments and analysis of results are presented in Section 5. Finally, we summarize our contributions with future directions in Section 6.

Related Work
A line of previous research in fine-grained opinion mining focused on detecting opinion (subjective) expressions (e.g., Breck et al., 2007). The common approach was to formulate the problem as a sequence tagging task and use a CRF model. Later approaches extended this to jointly identify opinion holders (Choi et al., 2005), and intensity and polarity (Choi and Cardie, 2010).
Footnote 1: https://github.com/ppfliu/opinion-target
Extracting aspect terms or opinion targets has been actively investigated in the past. Typical approaches include association mining to find frequent item sets (i.e., co-occurring words) as candidate aspects (Hu and Liu, 2004), classification-based methods such as hidden Markov models (Jin et al., 2009) and CRFs (Shariaty and Moghaddam, 2011; Yang and Cardie, 2012; Yang and Cardie, 2013), as well as topic modeling techniques using the Latent Dirichlet Allocation (LDA) model and its variants (Titov and McDonald, 2008; Lin and He, 2009; Moghaddam and Ester, 2012).
Conventional RNNs (e.g., Elman type) and LSTM have been successfully applied to various sequence prediction tasks, such as language modeling (Mikolov et al., 2010; Sundermeyer et al., 2012), speech recognition (Graves and Jaitly, 2014; Sak et al., 2014) and spoken language understanding (Mesnil et al., 2013). For sentiment analysis, Socher et al. (2013) propose to use recursive neural networks to hierarchically compose semantic word vectors based on syntactic parse trees, and use the vectors to identify the sentiments of the phrases and sentences. Le and Zuidema (2015) extended recursive neural networks with LSTM to compute a parent vector in parse trees by combining information of both output and LSTM memory cells from its two children.
Most relevant to our work is the recent work of Irsoy and Cardie (2014), where they apply deep Elman-type RNN to extract opinion expressions and show that deep RNN outperforms CRF, semi-CRF and shallow RNN. They used word embeddings from Google without fine-tuning them.
Although inspired by it, our work differs from the work of Irsoy and Cardie (2014) in many ways. (i) We experiment not only with Elman-type RNNs, but also with Jordan-type RNNs and with the more advanced LSTM RNN, and demonstrate the performance of the various RNN models. (ii) We use not only Google embeddings as pre-trained word vectors, but also other embeddings including SENNA and Amazon, and show their performance. (iii) We also fine-tune the embeddings for our task, which is shown to be crucial. (iv) We present an RNN architecture to include other linguistic features and show its effectiveness. (v) Finally, we present a comprehensive experiment exploring different embedding dimensions and hidden layer sizes for all the variations of the RNNs (i.e., including features and bi-directionality).

Recurrent Neural Models
The recurrent neural models in this section compute compositional vector representations for word sequences of arbitrary length. These high-level (i.e., hidden-layer) distributed representations are then used as features to classify each token in the sentence. We first describe the common properties shared among the RNNs below, followed by the descriptions of the specific RNNs.
Each word in the vocabulary V is represented by a D-dimensional vector in the shared look-up table $L \in \mathbb{R}^{|V| \times D}$. Note that L is considered as a model parameter to be learned. We can initialize L randomly or by pre-trained word embedding vectors (see Section 4). Given an input sentence $s = (s_1, \cdots, s_T)$, we first transform it into a feature sequence by mapping each word token $s_t \in s$ to an index in L. The look-up layer then creates a context vector $x_t \in \mathbb{R}^{mD}$ covering $m - 1$ neighboring tokens for each $s_t$ by concatenating their respective vectors in L. For example, given the context size $m = 3$, the context vector $x_t$ for the word disk in Figure 1 is formed by concatenating the embeddings of hard, disk and is. This window-based approach is intended to capture short-term dependencies between neighboring words in a sentence (Collobert et al., 2011).
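The look-up step can be sketched in a few lines of NumPy. This is an illustrative sketch rather than our released implementation: the helper name `context_windows`, the use of a dedicated padding row for boundary words, the toy sizes and the token ids are all assumptions, and the window size m is assumed to be odd.

```python
import numpy as np


def context_windows(token_ids, L, m, pad_id):
    """Return a (T, m*D) matrix whose t-th row concatenates the embeddings
    of the m tokens centred on position t (m is assumed to be odd)."""
    half = m // 2
    padded = [pad_id] * half + list(token_ids) + [pad_id] * half
    rows = [L[padded[t:t + m]].reshape(-1) for t in range(len(token_ids))]
    return np.stack(rows)


rng = np.random.default_rng(0)
V, D, m = 6, 4, 3              # toy vocabulary size, embedding size, window size
L = rng.normal(size=(V, D))    # look-up table (random here; could be pre-trained)
sentence = [1, 2, 3, 4]        # hypothetical ids for "the hard disk is"
X = context_windows(sentence, L, m, pad_id=0)
```

Here row 2 of `X` is the context vector for disk, i.e., the concatenation of the rows of `L` for hard, disk and is.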
The concatenated vector is then passed through non-linear recurrent hidden layers to learn high-level compositional representations, which are in turn fed to the output layer for classification using softmax. Formally, the probability of the k-th label in the output for classification into K classes is
$$P(y_t = k \mid s, \theta) = \frac{\exp(w_k^\top h_t)}{\sum_{k'=1}^{K} \exp(w_{k'}^\top h_t)} \tag{1}$$
where $h_t = \phi(x_t)$ defines the transformations of $x_t$ through the hidden layers, $w_k$ are the weights from the last hidden layer to the output layer, and $\theta$ denotes the set of model parameters. We fit the models by minimizing the negative log likelihood (NLL) of the training data. The NLL for the sentence s can be written as
$$J(\theta) = -\sum_{t=1}^{T} \sum_{k=1}^{K} y_{tk} \log P(y_t = k \mid s, \theta) \tag{2}$$
where $y_{tk} = I(y_t = k)$ is an indicator variable to encode the gold labels, i.e., $y_{tk} = 1$ if the gold label $y_t = k$, otherwise 0. The loss function minimizes the cross-entropy between the predicted distribution and the target distribution (i.e., gold labels). The main difference between the models described below is how they compute $h_t = \phi(x_t)$.
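The softmax output layer and the sentence-level NLL can be illustrated with the following NumPy sketch. The function names, the absence of an output bias and the toy dimensions are assumptions for illustration only.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def sentence_nll(H, W, gold):
    """H: (T, d) hidden vectors h_t; W: (K, d) output weights w_k;
    gold: length-T array of gold label ids. Returns the NLL of the sentence."""
    probs = softmax(H @ W.T)                              # (T, K) label distributions
    return -np.log(probs[np.arange(len(gold)), gold]).sum()


rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))        # toy hidden states for a 5-token sentence
W = rng.normal(size=(3, 8))        # K = 3 labels, e.g., B, I and O
gold = np.array([2, 2, 0, 1, 2])
loss = sentence_nll(H, W, gold)
```

Because the gold labels are one-hot, the double sum over tokens and labels reduces to picking the log-probability of the gold label at each position, which is what the indexing in `sentence_nll` does.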

Elman-type RNN (Elman, 1990)
In an Elman-type RNN (Fig. 1a), the output of the hidden layer $h_t$ at time t is computed from a nonlinear transformation of the current input $x_t$ and the previous hidden layer output $h_{t-1}$. Formally,
$$h_t = f(U h_{t-1} + V x_t + b) \tag{3}$$
where f is a nonlinear function (e.g., sigmoid) applied to the hidden units. U and V are weight matrices between two consecutive hidden layers, and between the input and the hidden layers, respectively, and b is the bias vector. This RNN thus creates internal states by remembering the previous hidden layer, which allows it to exhibit dynamic temporal behavior. We can interpret $h_t$ as an intermediate representation summarizing the past, which is in turn used to make a final decision on the current input.
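A forward pass of the Elman recurrence can be sketched as follows. The zero initial hidden state, the function names and the toy sizes are assumptions made for this illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def elman_forward(X, U, V, b, f=sigmoid):
    """X: (T, n_in) context vectors; U: (n_h, n_h); V: (n_h, n_in); b: (n_h,).
    Returns the (T, n_h) hidden states, starting from h_0 = 0 (an assumption)."""
    h = np.zeros(U.shape[0])
    states = []
    for x_t in X:
        h = f(U @ h + V @ x_t + b)    # the previous hidden state feeds the current one
        states.append(h)
    return np.stack(states)


rng = np.random.default_rng(0)
T, n_in, n_h = 4, 12, 5
X = rng.normal(size=(T, n_in))
U = rng.uniform(-0.2, 0.2, size=(n_h, n_h))
V = rng.uniform(-0.2, 0.2, size=(n_h, n_in))
b = np.zeros(n_h)
H = elman_forward(X, U, V, b)
```

Each row of `H` would then be passed to the softmax output layer to label the corresponding token.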

Jordan-type RNN (Jordan, 1997)
Jordan-type RNNs (Fig. 1b) are similar to Elman-type RNNs except that the hidden layer $h_t$ at time t is fed from the previous output layer $y_{t-1}$ instead of the previous hidden layer $h_{t-1}$. Formally,
$$h_t = f(U y_{t-1} + V x_t + b) \tag{4}$$
where U, V, b, and f are defined as before, except that U now connects the previous output layer to the hidden layer. Both Elman-type and Jordan-type RNNs are known as simple RNNs. These types of RNNs are generally trained using stochastic gradient descent (SGD) with backpropagation through time (BPTT), where errors (i.e., gradients) are propagated back through the edges over time.
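The Jordan recurrence differs from the Elman one only in what is fed back, as the following NumPy sketch shows. We assume here that the fed-back output is the softmax distribution over labels and that it is initialized to zero; the names and sizes are likewise illustrative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def jordan_forward(X, U, V, b, W, f=sigmoid):
    """X: (T, n_in); U: (n_h, K) from the previous output to the hidden layer;
    V: (n_h, n_in); W: (K, n_h) output weights.
    Returns (T, K) label distributions, starting from y_0 = 0 (an assumption)."""
    y = np.zeros(W.shape[0])
    outputs = []
    for x_t in X:
        h = f(U @ y + V @ x_t + b)    # the previous *output* feeds the hidden layer
        y = softmax(W @ h)
        outputs.append(y)
    return np.stack(outputs)


rng = np.random.default_rng(0)
T, n_in, n_h, K = 4, 12, 5, 3
X = rng.normal(size=(T, n_in))
U = rng.uniform(-0.2, 0.2, size=(n_h, K))
V = rng.uniform(-0.2, 0.2, size=(n_h, n_in))
W = rng.uniform(-0.2, 0.2, size=(K, n_h))
b = np.zeros(n_h)
Y = jordan_forward(X, U, V, b, W)
```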
One common issue with BPTT is that as the errors get propagated, they may soon become so small or so large that they lead to undesired values in the weight matrices, causing the training to fail. This is known as the vanishing and the exploding gradients problem (Bengio et al., 1994). One simple way to overcome this issue is to use a truncated BPTT (Mikolov, 2012), restricting the backpropagation to only a few steps, such as 4 or 5. However, this solution limits the ability of the RNN to capture long-range dependencies. In the following, we describe an elegant RNN architecture to address this problem.
Figure 1: Elman-type (a), Jordan-type (b) and LSTM (c) RNNs with a context window of size 3. One memory block in the LSTM hidden layer has been enlarged.

Long Short-Term Memory RNN
Long Short-Term Memory or LSTM (Hochreiter and Schmidhuber, 1997) is specifically designed to model long-term dependencies in RNNs. The recurrent layer in a standard LSTM consists of special (hidden) units called memory blocks (Fig. 1c). A memory block is composed of four elements: (i) a memory cell c (i.e., a neuron) with a self-connection, (ii) an input gate i to control the flow of input signal into the neuron, (iii) an output gate o to control the effect of the neuron activation on other neurons, and (iv) a forget gate f to allow the neuron to adaptively reset its current state through the self-connection. The following sequence of equations describes how a layer of memory blocks is updated at every time step t:
$$i_t = \sigma(U_i h_{t-1} + V_i x_t + C_i c_{t-1} + b_i) \tag{5}$$
$$f_t = \sigma(U_f h_{t-1} + V_f x_t + C_f c_{t-1} + b_f) \tag{6}$$
$$c_t = i_t \odot g(U_c h_{t-1} + V_c x_t + b_c) + f_t \odot c_{t-1} \tag{7}$$
$$o_t = \sigma(U_o h_{t-1} + V_o x_t + C_o c_t + b_o) \tag{8}$$
$$h_t = o_t \odot h(c_t) \tag{9}$$
where $U_k$, $V_k$ and $C_k$ are the weight matrices between two consecutive hidden layers, between the input and the hidden layers, and between two consecutive cell activations, respectively, which are associated with gate k (i.e., input, output, forget and cell), and $b_k$ is the associated bias vector. The symbol $\odot$ denotes an element-wise product of two vectors. The gate function $\sigma$ is the sigmoid activation, and g and h are the cell input and cell output activations, typically a tanh. LSTMs are generally trained using truncated or full BPTT.
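One update of a layer of memory blocks can be sketched in NumPy as below. The sketch assumes a common peephole formulation in which the input and forget gates see the previous cell state and the output gate sees the updated one, with full matrices for the cell-to-gate weights; the dictionary layout of the parameters and all names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One update of a layer of memory blocks.
    `p` maps names such as 'U_i', 'V_i', 'C_i', 'b_i' to arrays."""
    i = sigmoid(p["U_i"] @ h_prev + p["V_i"] @ x_t + p["C_i"] @ c_prev + p["b_i"])
    f = sigmoid(p["U_f"] @ h_prev + p["V_f"] @ x_t + p["C_f"] @ c_prev + p["b_f"])
    g = np.tanh(p["U_c"] @ h_prev + p["V_c"] @ x_t + p["b_c"])    # cell input
    c = i * g + f * c_prev                                        # element-wise products
    o = sigmoid(p["U_o"] @ h_prev + p["V_o"] @ x_t + p["C_o"] @ c + p["b_o"])
    h = o * np.tanh(c)                                            # cell output
    return h, c


rng = np.random.default_rng(0)
n_in, n_h = 12, 5
p = {}
for k in "ifco":
    p["U_" + k] = rng.uniform(-0.2, 0.2, size=(n_h, n_h))
    p["V_" + k] = rng.uniform(-0.2, 0.2, size=(n_h, n_in))
    p["b_" + k] = np.zeros(n_h)
for k in "ifo":
    p["C_" + k] = rng.uniform(-0.2, 0.2, size=(n_h, n_h))
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_h), np.zeros(n_h), p)
```

The additive update of `c` is what lets the error signal flow through the self-connection without being repeatedly squashed by a nonlinearity.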

Bidirectionality
Notice that the RNNs defined above only get information from the past. However, information from the future could also be crucial. In our example sentence (Table 1), to correctly tag the word hard as a B-TARG, it is beneficial for the RNN to know that the next word is disk. Our window-based approach, by considering the neighboring words, already captures short-term dependencies like this from the future. However, it requires tuning to find the right window size, and it disregards long-range dependencies that go beyond the context window, which is typically of size 1 (i.e., no context) to 5 (see Section 5.2). For instance, consider the two sentences: (i) Do you know about the crunchy tuna here, it is to die for. and (ii) Do you know about the crunchy tuna here, it is imported from Norway. The phrase crunchy tuna is an aspect term in the first (subjective) sentence, but not in the second (objective) one. The RNN models described above will assign the same labels to crunchy and tuna in both sentences, since the preceding sequences and the context window (of size 1 to 5) are the same.
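To capture long-range dependencies from the future as well as from the past, we propose to use bidirectional RNNs (Schuster and Paliwal, 1997), which allow bidirectional links in the network. In an Elman-type bidirectional RNN (Fig. 2a), the forward hidden layer $\overrightarrow{h}_t$ and the backward hidden layer $\overleftarrow{h}_t$ at time t are computed as follows:
$$\overrightarrow{h}_t = f(\overrightarrow{U}\,\overrightarrow{h}_{t-1} + \overrightarrow{V} x_t + \overrightarrow{b}) \tag{10}$$
$$\overleftarrow{h}_t = f(\overleftarrow{U}\,\overleftarrow{h}_{t+1} + \overleftarrow{V} x_t + \overleftarrow{b}) \tag{11}$$
where the arrows mark the weight matrices and bias vectors of the forward and the backward hidden layers, and the concatenated vector $h_t = [\overrightarrow{h}_t, \overleftarrow{h}_t]$ is passed to the output layer. We can thus interpret $h_t$ as an intermediate representation summarizing the past and the future, which is then used to make a final decision on the current input.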
Similarly, the unidirectional LSTM RNN can be extended to bidirectional LSTM by allowing bidirectional connections in the hidden layer. This amounts to having a backward counterpart for each of the equations from 5 to 9.
Notice that the forward and the backward computations of bidirectional RNNs are done independently until they are combined in the output layer. This means that, during training, after backpropagating the errors from the output layer to the forward and to the backward hidden layers, two independent BPTT passes can be applied, one for each direction.
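The independence of the two directions is easy to see in code. In the following NumPy sketch of a bidirectional Elman-type layer, the backward pass is simply a forward pass over the reversed sentence with its own parameters; the helper names, the zero initial states and the toy sizes are assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def elman_pass(X, U, V, b):
    """Run an Elman recurrence over the rows of X in the given order."""
    h = np.zeros(U.shape[0])
    states = []
    for x_t in X:
        h = sigmoid(U @ h + V @ x_t + b)
        states.append(h)
    return np.stack(states)


def bi_elman_forward(X, fwd, bwd):
    """fwd and bwd are (U, V, b) tuples for the two directions.
    Returns the (T, 2 * n_h) concatenated hidden states."""
    H_f = elman_pass(X, *fwd)
    H_b = elman_pass(X[::-1], *bwd)[::-1]    # run right-to-left, then restore the order
    return np.concatenate([H_f, H_b], axis=1)


rng = np.random.default_rng(0)
T, n_in, n_h = 4, 12, 5
X = rng.normal(size=(T, n_in))


def init():
    return (rng.uniform(-0.2, 0.2, size=(n_h, n_h)),
            rng.uniform(-0.2, 0.2, size=(n_h, n_in)),
            np.zeros(n_h))


fwd, bwd = init(), init()
H = bi_elman_forward(X, fwd, bwd)
```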

Fine-tuning of Embeddings
In our RNN framework, we intend to avoid manual feature engineering efforts by using word embeddings as the only features. As mentioned before, we can initialize the embeddings randomly and learn them as part of the model parameters by backpropagating the errors to the look-up layer. One issue with random initialization is that it may lead the SGD to get stuck in local minima (Murphy, 2012). On the other hand, one can plug the readily available embeddings from external sources (Section 4) into the RNN model and use them as features without tuning them further for the task, as is done in any other machine learning model. However, this approach does not exploit the automatic feature learning capability of NN models, which is one of the main motivations for using them.
In our work, we use the pre-trained word embeddings to better initialize our models, and we fine-tune them for our task in training, which turns out to be quite beneficial (see Section 5.2).
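Fine-tuning amounts to treating the rows of the look-up table as ordinary parameters. The sketch below shows one plain SGD step on the embeddings of a single context window, given the gradient of the loss with respect to the context vector; the function name and the toy values are assumptions, and the gradient itself would come from BPTT in a real model.

```python
import numpy as np


def update_embeddings(L, window_ids, grad_x, lr):
    """SGD step on the look-up table for one context window.
    window_ids: the m token ids in the window;
    grad_x: gradient of the loss w.r.t. the context vector, of length m * D."""
    D = L.shape[1]
    grads = grad_x.reshape(len(window_ids), D)    # one gradient slice per window slot
    # np.add.at accumulates correctly even if an id occurs twice in the window
    np.add.at(L, window_ids, -lr * grads)
    return L


D = 4
L = np.ones((6, D))                      # toy look-up table
before = L.copy()
update_embeddings(L, [1, 2, 1], np.ones(3 * D), lr=0.1)
```

Only the rows of the words that actually appear in the window are changed; keeping the embeddings fixed simply corresponds to skipping this step.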

Incorporating other Linguistic Features
Although NNs learn word features (i.e., embeddings) automatically, we may still be interested in incorporating other linguistic features like part-of-speech (POS) tags and chunk information to guide the training and to learn a better model. However, unlike word embeddings, we want these features to be fixed during training. As shown in Figure 2b, this can be done in our RNNs by feeding these additional features directly to the output layer and learning their associated weights in training.
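Feeding fixed features to the output layer can be sketched as adding a second linear term before the softmax. The names `W_h` and `W_f`, the output bias and the layout of the toy feature vector are assumptions for illustration.

```python
import numpy as np


def output_with_features(h_t, feats_t, W_h, W_f, b):
    """Label distribution from the hidden vector h_t plus fixed binary features feats_t.
    W_h: (K, n_h) hidden-to-output weights; W_f: (K, n_f) feature-to-output weights."""
    z = W_h @ h_t + W_f @ feats_t + b
    e = np.exp(z - z.max())               # numerically stable softmax
    return e / e.sum()


rng = np.random.default_rng(0)
n_h, n_f, K = 5, 7, 3
h_t = rng.normal(size=n_h)
feats_t = np.array([1, 0, 0, 0, 1, 0, 0], dtype=float)   # hypothetical POS/chunk indicators
W_h = rng.uniform(-0.2, 0.2, size=(K, n_h))
W_f = rng.uniform(-0.2, 0.2, size=(K, n_f))
b = np.zeros(K)
probs = output_with_features(h_t, feats_t, W_h, W_f, b)
```

Since `feats_t` is an input rather than a parameter, only `W_f` is updated during training, so the features themselves stay fixed.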

Word Embeddings
Word embeddings are distributed representations of words, represented as real-valued, dense, and low-dimensional vectors. Each dimension potentially describes syntactic or semantic properties of the word. Here we briefly describe the three types of pre-trained embeddings that we use in our work.

SENNA Embeddings
Collobert et al. (2011) present a unified NN architecture for various NLP tasks (e.g., POS tagging, chunking, semantic role labeling, named entity recognition) with a window-based approach and a sentence-based approach (i.e., the input layer is a sentence). Each word in the input layer is represented by M features, each of which has an embedding vector associated with it in a lookup table. To give their network a better initialization, they learn word embeddings using a non-probabilistic language model, which was trained on English Wikipedia for about 2 months. They released their 50-dimensional word embeddings (vocabulary size 130K) under the name SENNA.

Google Embeddings
Mikolov et al. (2013a) propose two log-linear models for computing word embeddings from large corpora efficiently: (i) a bag-of-words model CBOW that predicts the current word based on the context words, and (ii) a skip-gram model that predicts surrounding words given the current word. They released their pre-trained 300-dimensional word embeddings (vocabulary size 3M) trained by the skip-gram model on part of the Google News dataset containing about 100 billion words.

Amazon Embeddings
Since we work on customer reviews, which are less formal than Wikipedia and news, we have also trained domain-specific embeddings (vocabulary size 1M) using the CBOW model of the word2vec tool (Mikolov et al., 2013b) on a large corpus of Amazon reviews. The corpus contains 34,686,770 reviews (4.7 billion words) on Amazon products from June 1995 to March 2013 (McAuley and Leskovec, 2013). For comparison with SENNA and Google, we learn word embeddings of 50 and 300 dimensions.

Experiments
In this section, we present our experimental settings and results for the task of opinion target extraction from customer reviews.

Experimental Settings
Datasets: In our experiments, we use the two review datasets provided by the SemEval-2014 Task 4 (aspect-based sentiment analysis) evaluation campaign (Pontiki et al., 2014), namely the Laptop and the Restaurant datasets. Table 2 shows some basic statistics about the datasets. The majority of aspect terms have only one word, while about one third of them have multiple words. In both datasets, some sentences have no aspect terms and some have more than one aspect term. We use the standard train:test split to compare our results with the SemEval best systems. In addition, we show a more general performance of our models on the two datasets based on 10-fold cross validation.
Evaluation Metric: The evaluation metric measures the standard precision, recall and $F_1$ score based on exact matches. This means that a candidate aspect term is considered to be correct only if it exactly matches the aspect term annotated by the human. In all our experiments, when comparing two models, we use a paired t-test on the $F_1$ scores to measure statistical significance and report the corresponding p-value.
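The exact-match metric can be sketched as follows. This is a minimal illustration rather than the official SemEval scorer: the function names are hypothetical, and treating an I tag without a preceding B tag as the start of a span is an assumption of this sketch.

```python
def bio_to_spans(labels):
    """Extract (start, end) spans (end exclusive) from a BIO label sequence."""
    spans, start = set(), None
    for i, lab in enumerate(list(labels) + ["O"]):    # the sentinel closes a final span
        if lab.startswith("B") or lab == "O":
            if start is not None:
                spans.add((start, i))
            start = i if lab.startswith("B") else None
        elif start is None:       # assumption: an I tag without a B tag opens a span
            start = i
    return spans


def exact_match_prf(gold_seqs, pred_seqs):
    """Precision, recall and F1 over spans that match the gold spans exactly."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = bio_to_spans(gold), bio_to_spans(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [["O", "B", "I", "O", "B"]]
pred = [["O", "B", "O", "O", "B"]]    # the first span is cut short, so it does not count
scores = exact_match_prf(gold, pred)
```

In this toy example, one of the two predicted spans matches a gold span exactly, so precision, recall and $F_1$ are all 0.5.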
CRF Baseline: We use a linear-chain CRF (Lafferty et al., 2001) of order 2 as our baseline, which is the state-of-the-art model for opinion target extraction (Pontiki et al., 2014). Specifically, the CRF generates (binary) feature functions of order 1 and 2; see Cuong et al. (2014) for higher order CRFs. The features used in the baseline model include the current word, its POS tag, its prefixes and suffixes of one to four characters, its position, its stylistics (e.g., case, digit, symbol, alphanumeric), and its context (i.e., the same features for the two preceding and the two following words). In addition to the hand-crafted features, we also include the three different types of word embeddings described in Section 4.
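To give a sense of the engineering involved, the following sketch builds a dictionary of binary features for one token. It covers only a subset of the templates listed above (for brevity, the context features are limited to the neighboring words and their POS tags), and the feature names and the function `token_features` are hypothetical.

```python
def token_features(tokens, pos_tags, t):
    """Binary-feature dictionary for position t (a subset of the baseline templates)."""
    word = tokens[t]
    feats = {
        "word=" + word.lower(): 1,
        "pos=" + pos_tags[t]: 1,
        "position=%d" % t: 1,
        "is_title": int(word.istitle()),
        "is_digit": int(word.isdigit()),
        "is_alnum": int(word.isalnum()),
    }
    for n in range(1, 5):                     # prefixes and suffixes of 1 to 4 characters
        if len(word) >= n:
            feats["prefix%d=%s" % (n, word[:n])] = 1
            feats["suffix%d=%s" % (n, word[-n:])] = 1
    for offset in (-2, -1, 1, 2):             # two preceding and two following words
        j = t + offset
        if 0 <= j < len(tokens):
            feats["word[%+d]=%s" % (offset, tokens[j].lower())] = 1
            feats["pos[%+d]=%s" % (offset, pos_tags[j])] = 1
    return feats


tokens = ["the", "hard", "disk", "is", "noisy"]
pos_tags = ["DT", "JJ", "NN", "VBZ", "JJ"]
feats = token_features(tokens, pos_tags, 2)    # features for "disk"
```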

RNN Settings:
We pre-processed each dataset by lowercasing all words and spelling out each digit number as DIGIT. We then built the vocabulary from the training set by marking rare words with only one occurrence as UNKNOWN, and adding a PADDING word to make context windows for boundary words.
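The pre-processing can be sketched as below. The sketch takes one possible reading of the digit normalization (a token consisting only of digits becomes DIGIT), which is an assumption, as are the function names and the choice of ids for the special words.

```python
from collections import Counter


def normalise(token):
    """Lowercase the token; an all-digit token becomes DIGIT (one possible reading)."""
    return "DIGIT" if token.isdigit() else token.lower()


def build_vocab(train_sentences):
    """Map words seen more than once in the training set to ids."""
    counts = Counter(normalise(tok) for sent in train_sentences for tok in sent)
    vocab = {"PADDING": 0, "UNKNOWN": 1}
    for tok, n in counts.items():
        if n > 1:                 # words seen only once are left to UNKNOWN
            vocab[tok] = len(vocab)
    return vocab


def encode(sentence, vocab):
    return [vocab.get(normalise(tok), vocab["UNKNOWN"]) for tok in sentence]


train = [["The", "disk", "is", "noisy"],
         ["the", "screen", "is", "great"],
         ["2", "disk", "drives"]]
vocab = build_vocab(train)
ids = encode(["The", "screen"], vocab)
```

Mapping singleton words to UNKNOWN gives the model a trained embedding to fall back on for words that are unseen at test time.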
To implement early stopping in SGD, we prepared a validation set by separating out randomly 10% of the available training data. The remaining 90% is used for training. The weights in the network were initialized by sampling from a small random uniform distribution U(−0.2, 0.2). The time step in the truncated BPTT was fixed to 6 based on the performance on the validation set; smaller values hurt the performance, while larger values showed no significant gains but increased the training time.
We use a fixed learning rate of 0.01, but we change the batch size depending on the sentence length following Mesnil et al. (2013). The net effect is a variable step size in the SGD. We run SGD for 30 epochs, calculate the $F_1$ score on the validation set after each epoch, and stop if the $F_1$ score starts to decrease. The size of the context window and the hidden layer are empirically set based on the performance on the validation set. We experimented with the window size ∈ {1, 3, 5}, and found 3 to be optimal on the validation set. The hidden layer sizes we experimented with are 50, 100, 150, and 200; we report the optimal values in Table 3 (see the $|h_l|$ and $|h_r|$ columns).
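The stopping rule can be sketched as a small driver loop. The callables `run_epoch` and `validation_f1` are hypothetical placeholders for one SGD pass over the training data and for scoring the validation set; stopping at the first decrease is a literal reading of the rule above.

```python
def train_with_early_stopping(run_epoch, validation_f1, max_epochs=30):
    """run_epoch() performs one SGD pass; validation_f1() returns the current F1.
    Both are hypothetical callables supplied by the caller."""
    best_f1, best_epoch = float("-inf"), 0
    for epoch in range(1, max_epochs + 1):
        run_epoch()
        f1 = validation_f1()
        if f1 < best_f1:          # the validation score started to decrease: stop
            break
        best_f1, best_epoch = f1, epoch
    return best_epoch, best_f1


# Toy usage with a scripted sequence of validation scores.
scores = iter([0.5, 0.6, 0.7, 0.65, 0.9])
epochs_run = []
result = train_with_early_stopping(lambda: epochs_run.append(1), lambda: next(scores))
```

In practice, one would also keep a copy of the parameters from the best epoch so that they can be restored after stopping.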

Linguistic Features in RNNs:
In addition to the neural features, we also explore the contribution of simple linguistic features in our RNN models using the architecture described in Section 3.6. Specifically, we encode four POS tag classes (noun, adjective, verb, adverb) and BIO-tagged chunk information (NP, VP, PP, ADJP, ADVP) as binary features. We feed these extra features directly to the output layer of the RNNs and learn their relative weights. Part-of-speech and phrasal information are arguably the most informative features for identifying aspect terms (i.e., aspect terms are generally noun phrases). BIO tags could be useful to find the right text spans (i.e., aspect terms are unlikely to violate phrasal boundaries).
Table 3 presents our results of aspect term extraction on the standard test set in $F_1$ scores. In Table 4, we show the results on the whole datasets based on 10-fold cross validation. RNNs in Table 4 are trained using SENNA embeddings. We perform significance tests on the 10-fold results. In the following, we highlight our main findings.

Contributions of Word Embeddings in CRF:
From the first group of results in Table 3, we can observe that even though CRF uses a handful of hand-designed features, including word embeddings still leads to sizable improvements on both datasets. The domain-specific Amazon embeddings (300 dim.) yield more general performance across the datasets, delivering the best gain of absolute 3.54% on the Laptop and the second best on the Restaurant dataset. Google embeddings give the best gain on the Restaurant dataset (absolute 3.08%). The contribution of embeddings in CRF is also validated by the 10-fold results in Table 4 (see first two rows), where SENNA embeddings yield significant improvements: absolute 1.47% on Laptop ($p < 0.03$) and absolute 1.24% on Restaurant ($p < 0.01$). This demonstrates that word embeddings offer generalizations that complement other strong features, and thus should be considered.
When we compare the results of RNNs with those of CRF in Table 3, we see that most of our RNN models outperform CRF models with the maximum absolute gains of 2.80% by LSTM-RNN+Feat. on Laptop and 1.70% by Bi-Elman-RNN+Feat. on Restaurant. What is remarkable is that RNNs without any hand-crafted features outperform feature-rich CRF models by a good margin, with absolute maximum gains of 2.23% by Elman-RNN and 1.83% by Bi-LSTM-RNN on Laptop. When we compare their general performance on the 10-folds in Table 4, we observe similar gains, maximum 5.88% on Laptop and 1.97% on Restaurant, which are significant with $p < 6 \times 10^{-6}$ on Laptop and $p < 2 \times 10^{-4}$ on Restaurant. These results demonstrate that RNNs as sequence labelers are more effective than CRFs for fine-grained opinion mining tasks. This can be attributed to the ability of RNNs to learn better features automatically and to capture long-range sequential dependencies between the output labels.
Comparison among RNN Models: A comparison among the RNN models in Table 3 shows that Elman RNN generally outperforms Jordan RNN. However, bi-directionality and LSTM do not provide clear gains over the simple Elman RNN. In fact, bi-directionality hurts the performance in most cases. This finding contrasts with the finding of Irsoy and Cardie (2014) in the opinion expression detection task, where bi-directional Elman RNNs outperform their uni-directional counterparts. However, when we analyzed the data, we found it to be unsurprising because aspect terms are generally shorter than opinion expressions. For example, the average length of an aspect term in our Restaurant dataset is 1.4, whereas the average length of an expressive subjective expression in their MPQA corpus is 3.3. Therefore, the information required to correctly identify aspect terms (e.g., hard disk) is already captured by the simple (as opposed to LSTM) unidirectional link and the context window covering the neighboring words. LSTM and bi-directionality increase the number of parameters in the RNNs, which might contribute to overfitting on this specific task. As a partial solution to this problem, we experimented with a bi-directional Elman-RNN, where both directions share the same parameters. Therefore, the number of parameters remains the same as in the uni-directional one. This modification improves the performance over the non-shared one slightly but not significantly. This calls for better modeling of the two sources of information rather than simple concatenation or sharing.

Contributions of Linguistic Features in RNNs:
Although our linguistic features are quite simple (i.e., POS tags and chunk), they give gains on both datasets when incorporated into Elman and LSTM RNNs. The maximum gains on the standard test set (Table 3) are 0.64% on Laptop and 1.96% on Restaurant for Bi-Elman, and 1.48% on Laptop and 1.58% on Restaurant for LSTM. Similar gains are also observed on the 10-folds in Table 4, where the maximum gains are 1.25% on Laptop and 1.44% on Restaurant. These gains are significant with $p < 0.004$ on Laptop and $p < 6 \times 10^{-5}$ on Restaurant. Linguistic features thus complement word embeddings in RNNs.
Importance of Fine-tuning in RNNs: Finally, in order to show the importance of fine-tuning the word embeddings in RNNs on our task, we present in Table 5 the performance of Elman and Jordan RNNs, when the embeddings are used as they are ('-tune'), and when they are fine-tuned ('+tune') on the task. The table also shows the contributions of pre-trained embeddings as compared to random initialization. Surprisingly, Amazon embeddings without fine-tuning deliver the worst performance, even lower than the Random initialization. We found that with Amazon embeddings the network gets stuck in a local minimum from the very first epoch.
Other pre-trained (untuned) embeddings improve over the Amazon and Random ones by providing better initialization. In most cases fine-tuning makes a big difference. For example, the absolute gains for fine-tuning SENNA embeddings in Elman RNN are 13.01% on Laptop and 4.11% on Restaurant. Remarkably, fine-tuning brings both Random and Amazon embeddings close to the best ones.
Comparison with SemEval-2014 Systems: When our RNN results are compared with the top performing systems in SemEval-2014 (last two rows in Table 3), we see that RNNs without using any linguistic features achieve the second best results on both Laptop and Restaurant datasets. Note that these RNNs only use word embeddings, while IHS_RD and DLIREC use complex features like dependency relations, named entities, sentiment orientation of words, word clusters and many more in their CRF models, most of which are expensive to compute; see Toh and Wang (2014) and Chernyshevich (2014). The performance of our RNNs improves when they are given access to very simple features like POS tags and chunks, and LSTM-RNN+Feat. achieves the best results on the Laptop dataset.

Conclusion and Future Direction
We presented a general class of discriminative models based on recurrent neural network (RNN) architecture and word embeddings that can be successfully applied to fine-grained opinion mining tasks without any task-specific manual feature engineering effort. We used pre-trained word embeddings from three external sources in different RNN architectures including Elman-type, Jordan-type, LSTM and their several variations.
Our results on the opinion target extraction task demonstrate that word embeddings improve the performance of both CRF and RNN models; however, fine-tuning them in RNNs on the task gives the best results. RNNs outperform CRFs, even when they use word embeddings as the only features. Incorporating simple linguistic features into RNNs improves the performance further. Our best results with LSTM RNN outperform the top performing system on the Laptop dataset and achieve the second best on the Restaurant dataset in the SemEval-2014 evaluation campaign. We made our code publicly available for research purposes.
In the future, we would like to apply our models to other fine-grained opinion mining tasks including opinion expression detection and characterizing the intensity and sentiment of the opinion expressions. We would also like to explore to what extent these tasks can be jointly modeled in an RNN-based multi-task learning framework.