Deep Multi-Task Learning for Aspect Term Extraction with Memory Interaction

We propose a novel LSTM-based deep multi-task learning framework for aspect term extraction from user review sentences. Two LSTMs equipped with extended memories and neural memory operations are designed to jointly handle the extraction of aspects and opinions via memory interactions. A sentimental sentence constraint is also added, via another LSTM, for more accurate prediction. Experimental results on two benchmark datasets demonstrate the effectiveness of our framework.


Introduction
The aspect-based sentiment analysis (ABSA) task is to identify opinions expressed towards specific entities such as laptop or attributes of entities such as price (Liu, 2012a). This task involves three subtasks: Aspect Term Extraction (ATE), Aspect Polarity Detection and Aspect Category Detection. As a fundamental subtask in ABSA, the goal of the ATE task is to identify opinionated aspect expressions. One of the most important characteristics of this task is that opinion words can provide indicative clues for aspect detection, since opinion words should co-occur with aspect words. Most publicly available datasets contain gold standard annotations for opinionated aspects, but the ground truth of the corresponding opinion words is not commonly provided. Some works tackling the ATE task ignore opinion words and focus only on aspect term modeling and learning (Jin et al., 2009; Jakob and Gurevych, 2010; Toh and Wang, 2014; Chernyshevich, 2014; Manek et al., 2017; San Vicente et al., 2015; Liu et al., 2015; Poria et al., 2016; Toh and Su, 2016; Yin et al., 2016). They fail to leverage opinion information, which is supposed to provide useful clues.
* The work described in this paper is substantially supported by a grant from the Research Grant Council of the Hong Kong Special Administrative Region, China (Project Code: 14203414). We thank Lidong Bing and Piji Li for their helpful comments on this draft and the anonymous reviewers for their valuable feedback.
Some works tackling the ATE task consider opinion information (Hu and Liu, 2004a,b; Popescu and Etzioni, 2005; Zhuang et al., 2006; Qiu et al., 2011; Liu et al., 2012b, 2013a,b, 2014) in an unsupervised or partially supervised manner. Qiu et al. (2011) proposed Double Propagation (DP) to collectively extract aspect terms and opinion words based on information propagation over a dependency graph. One drawback is that it relies heavily on the dependency parser, which is prone to errors when applied to informal online reviews. Liu et al. (2014) modeled the relation between aspects and opinions by constructing a bipartite heterogeneous graph. This approach cannot perform well without a high-quality phrase chunker and POS tagger, which reduces its flexibility. As unsupervised or partially supervised frameworks cannot take full advantage of the aspect annotations commonly found in the training data, the above methods are deficient in leveraging the data. Recently, Wang et al. (2016) considered the relation between opinion words and aspect words in a supervised model named RNCRF. However, RNCRF tends to suffer from parsing errors since the structure of the recursive network hinges on the dependency parse tree. CMLA (Wang et al., 2017a) used a multilayer neural model where each layer consists of aspect attention and opinion attention. However, CMLA merely employs a standard GRU without extended memories.
We propose MIN (Memory Interaction Network), a novel LSTM-based deep multi-task learning framework for the ATE task. Two LSTMs with extended memory are designed for handling the extraction tasks of aspects and opinions. The aspect-opinion relationship is established through neural memory interactions between aspect extraction and opinion extraction, where the global indicator score of opinion terms and the local positional relevance between aspects and opinions are considered. To ensure that aspects come from sentimental sentences, MIN employs a third LSTM for sentimental sentence classification, facilitating more accurate aspect term extraction. Experimental results on two benchmark datasets show that our framework achieves superior performance.

Overview
Let an input review sentence with $T$ word tokens and the corresponding distributed representations be $w = \{w_1, \ldots, w_T\}$ and $x = \{x_1, \ldots, x_T\}$ respectively. The ATE task is treated as a sequence labeling task with the BIO tagging scheme, and the aspect tag for the word $w_t$ is $y^A_t \in \{B, I, O\}$, where B, I and O represent the beginning of, inside and outside of an aspect span respectively. Commonly available training data contains gold annotations for aspect terms and opinionated sentences, but gold standard annotations for opinion words are usually not available.
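To make the BIO scheme concrete, the following minimal sketch converts annotated aspect spans into token-level tags. The function name `spans_to_bio`, the end-exclusive span convention and the example sentence are illustrative assumptions, not part of the datasets or of our implementation.

```python
def spans_to_bio(num_tokens, aspect_spans):
    """Convert aspect spans into BIO tags.

    aspect_spans: list of (start, end) token indices, end exclusive
    (the span convention is an assumption made for this sketch).
    """
    tags = ["O"] * num_tokens
    for start, end in aspect_spans:
        tags[start] = "B"                  # first token of the aspect term
        for i in range(start + 1, end):
            tags[i] = "I"                  # remaining tokens of the aspect term
    return tags


# Hypothetical review sentence with one two-word aspect term.
tokens = ["The", "battery", "life", "is", "great"]
tags = spans_to_bio(len(tokens), [(1, 3)])
print(list(zip(tokens, tags)))
```

Here "battery life" is tagged B, I and every other token is tagged O.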
In our multi-task learning framework, three tasks are involved: (1) aspect term extraction (ATE), (2) opinion word extraction and (3) sentimental sentence classification. We design a task-specific LSTM for each of these tasks, namely A-LSTM, O-LSTM and S-LSTM respectively. The first component of our proposed framework consists of A-LSTM and O-LSTM, where we equip the LSTMs with extended operational memories and define operations over the memories for task-level memory interactions. The second component determines whether a review sentence is sentimental. This is achieved by employing a vanilla LSTM, namely S-LSTM.

Model Description
The first component of our framework MIN is composed of A-LSTM and O-LSTM. Both LSTMs have extended memories for task-level memory interactions. A-LSTM involves a large aspect memory $H^A_t \in \mathbb{R}^{n_m \times \mathrm{dim}^A_h}$ and an opinion summary vector $m^O_t \in \mathbb{R}^{\mathrm{dim}^O_h}$, where $H^A_t$ contains $n_m$ pieces of aspect hidden states of dimension $\mathrm{dim}^A_h$. Similarly, in O-LSTM an opinion memory $H^O_t \in \mathbb{R}^{n_m \times \mathrm{dim}^O_h}$ and an aspect-specific summary vector $m^A_t \in \mathbb{R}^{\mathrm{dim}^A_h}$ are included. We use the aspect term annotations in the training data for training A-LSTM. As there is no ground truth available for opinion words in the training data, a sentiment lexicon and high-precision dependency rules are introduced to find potential opinion words. Commonly used opinion words can be found in general sentiment lexicons. To find opinion words in a sentence that are not covered by sentiment lexicons, we build a small rule set $R$ composed of dependency relations with high confidence, e.g., amod and nsubj, and determine whether $w_t$ directly depends on the gold aspect word through the dependencies in $R$. If so, $w_t$ is treated as a potential opinion word. Such opinion words are then used as training data for O-LSTM.
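The selection of potential opinion words described above can be sketched as follows. This is only an illustration under stated assumptions: the dependency edges are taken as given (produced by some external parser), an edge is accepted in either direction between the candidate and an aspect token, and the function name, toy lexicon and example sentence are hypothetical.

```python
def find_opinion_candidates(tokens, aspect_indices, dependencies, lexicon,
                            rules=("amod", "nsubj")):
    """Return sorted indices of potential opinion words.

    dependencies: list of (head_index, dependent_index, relation) triples,
    assumed to come from an external dependency parser.
    """
    aspect_set = set(aspect_indices)
    candidates = set()

    # Source 1: words found in a general sentiment lexicon.
    for i, token in enumerate(tokens):
        if i not in aspect_set and token.lower() in lexicon:
            candidates.add(i)

    # Source 2: words directly linked to a gold aspect word through a
    # high-precision relation in the rule set (either edge direction is
    # accepted here, which is an assumption of this sketch).
    for head, dependent, relation in dependencies:
        if relation not in rules:
            continue
        if head in aspect_set and dependent not in aspect_set:
            candidates.add(dependent)
        elif dependent in aspect_set and head not in aspect_set:
            candidates.add(head)

    return sorted(candidates)


# Hypothetical sentence, parse and lexicon.
tokens = ["The", "screen", "is", "crisp", "and", "great"]
dependencies = [(3, 1, "nsubj"), (1, 0, "det"), (3, 2, "cop"),
                (3, 5, "conj"), (5, 4, "cc")]
opinion_indices = find_opinion_candidates(tokens, [1], dependencies, {"great"})
print([tokens[i] for i in opinion_indices])
```

In this toy case "great" is found through the lexicon, while "crisp" is found only through the nsubj relation with the aspect word "screen".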
In the memory-enhanced A-LSTM and O-LSTM, we manually design three kinds of operations: (1) READ to select $n_m$ pieces of aspect (opinion) hidden states from the past memories and build $H^A_t$ ($H^O_t$); (2) DIGEST to distill a summary from the collected memory, where influences of opinion terms and relative positions of inputs are considered; (3) INTERACT to perform interaction between A-LSTM and O-LSTM using the task-specific summaries (i.e., $m^A_t$ and $m^O_t$). Consider the workflow of A-LSTM for aspect term extraction. Since opinion words and aspect terms should co-occur, the goal of A-LSTM participating in memory interactions is to acquire opinion summaries from O-LSTM (i.e., $m^O_t$) for better aspect prediction. First of all, MIN will READ $n_m$ pieces of opinion memories which are most related to $w_t$ from O-LSTM. Syntactic structure could be used, but syntactic parsers are not effective for processing short informal review sentences. Therefore, MIN selects memory segments temporally related to $w_t$. Precisely, the opinion memory at time step $t$ is $H^O_t = [h^O_{t-1}; \ldots; h^O_{t-n_m}]$. Since the linear context contains most of the parent nodes and the child nodes of $w_t$ on the dependency parse tree, treating the corresponding memory segments as relevant segments to $w_t$ is reasonable.
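A minimal sketch of the READ operation is given below: it gathers the hidden states of the $n_m$ preceding time steps, most recent first. Zero-based positions, zero padding at the start of the sentence and the function name `read_memory` are assumptions made for illustration.

```python
def read_memory(hidden_states, t, n_m, dim):
    """Collect the n_m hidden states preceding position t, most recent first.

    hidden_states: list of per-token hidden vectors (lists of floats).
    Positions before the start of the sentence are zero-padded, which is
    an assumption of this sketch.
    """
    memory = []
    for i in range(1, n_m + 1):
        if t - i >= 0:
            memory.append(list(hidden_states[t - i]))
        else:
            memory.append([0.0] * dim)
    return memory


# Hypothetical 2-dimensional opinion hidden states for a 3-token prefix.
opinion_states = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
H_O = read_memory(opinion_states, t=2, n_m=4, dim=2)
print(H_O)
```

The first row of the returned memory is the most immediate hidden state, followed by older states and then padding.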
Then MIN will DIGEST the collected opinion memories $H^O_t$ in the A-LSTM. As different memory segments are not of equal importance for the current decision, and the same segment in different memories (i.e., different $H^O_t$) also makes a difference, MIN leverages two kinds of weights to summarize the collected content. The first weight is the indicator score of being an opinion term, denoted as $v^I \in \mathbb{R}^{n_m}$, which measures how much opinion information the word $w_{t-i}$ ($i = 1, \ldots, n_m$) holds. We adopt the Euclidean distance between the distributed representations of $w_{t-i}$ and opinion words. Computing the distance between $x_{t-i}$ and every opinion word is expensive. Thus, we run an off-the-shelf clustering algorithm over the opinion words in the training set and then use the resulting $n_c$ centroids to estimate the indicator score $v^I_i$ of $w_{t-i}$ being an opinion word (Equation 1), based on $x_{t-i}$, the distributed representation of $w_{t-i}$, and $c_j$, the centroid vector of the $j$-th cluster. This weighting scheme ensures that $w_{t-i}$ is assigned a high score as long as $x_{t-i}$ is close to a particular centroid. The aspect decision for $w_t$ is also affected by the relative position between $w_{t-i}$ and $w_t$. Thus, MIN employs a second weight $v^P$ to explicitly model their positional relevance, and the initial weight $v^P_i$ for the $i$-th segment is calculated (Equation 2) from its position $i$ and $n_m$, the number of hidden states in $H^O_t$. With this position-aware weight, the closer the word $w_{t-i}$ is to the current input, the more the corresponding memory segment contributes to the current decision. To better capture the local positional relevance, we treat the initialized $v^P$ as learnable parameters. Combining the two weights helps to utilize each active memory segment according to its importance for prediction, and $m^O_t$, the summary of $H^O_t$, is generated (Equation 3) using the element-wise multiplication $\odot$ and the Euclidean norm $\|\cdot\|_2$ of vectors.
From Equation 3, $m^O_t$ is dominated by the memory segments associated with the words $w_{t-i}$ that obtain high combined weights.
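The DIGEST operation can be sketched as below. The concrete functional forms are assumptions chosen to match the description, since they are not spelled out in the text here: the indicator score is taken as a sum of inverse Euclidean distances to the centroids, the initial position weight decays linearly with the distance $i$, and the summary is a weighted sum of memory segments using the norm-scaled indicator scores multiplied element-wise by the position weights. All function names and toy values are hypothetical.

```python
import math


def indicator_scores(memory_inputs, centroids, eps=1e-8):
    """v^I: one score per memory slot; high when x_{t-i} is close to any centroid.

    Assumed form: sum over centroids of 1 / Euclidean distance.
    """
    scores = []
    for x in memory_inputs:
        scores.append(sum(1.0 / (math.dist(x, c) + eps) for c in centroids))
    return scores


def initial_position_weights(n_m):
    """v^P: assumed linear decay, (n_m - i + 1) / (1 + 2 + ... + n_m)."""
    total = n_m * (n_m + 1) / 2
    return [(n_m - i + 1) / total for i in range(1, n_m + 1)]


def digest(memory, v_i, v_p):
    """Summarize memory rows with weights (v^I / ||v^I||_2) * v^P."""
    norm = math.sqrt(sum(v * v for v in v_i)) or 1.0
    weights = [(a / norm) * b for a, b in zip(v_i, v_p)]
    dim = len(memory[0])
    return [sum(w * row[d] for w, row in zip(weights, memory)) for d in range(dim)]


# Toy example: two memory slots, 2-dimensional embeddings and hidden states.
memory_inputs = [[0.0, 0.0], [5.0, 5.0]]   # embeddings x_{t-1}, x_{t-2}
centroids = [[0.0, 1.0]]                   # one opinion-word centroid
H_O = [[1.0, 0.0], [0.0, 1.0]]             # opinion hidden states h_{t-1}, h_{t-2}
v_I = indicator_scores(memory_inputs, centroids)
v_P = initial_position_weights(len(H_O))
m_O = digest(H_O, v_I, v_P)
print(m_O)
```

In this toy case the first slot is both closer to the centroid and closer in position, so the summary is dominated by its hidden state.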
In the last operation, INTERACT, A-LSTM communicates with O-LSTM by acquiring $m^O_t$ from O-LSTM and incorporating the summary into the memory update. In the update process, $W^A_*$, $U^A_*$ and $b^A_*$ are the weight parameters of the A-LSTM, $\sigma$ is the sigmoid activation function and $[:]$ denotes the vector concatenation operation. $m^O_t$ can be seen as the summary of the opinion indicator in the left context of $w_t$, and $H^A_t[1]$ is the most immediate hidden memory of A-LSTM. MIN blends the opinion summary from O-LSTM with the memory from A-LSTM. The co-occurrence relation between aspects and opinion words is modeled by this "memory fusion" strategy. Since opinion words can appear on both sides of $w_t$, memory segments corresponding to the right context (i.e., "future" memory) should be included. Hence, we conduct bi-directional training for A-LSTM.
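The "memory fusion" idea can be illustrated with a single update step in which the recurrent input of the A-LSTM is the concatenation of its most immediate hidden memory and the opinion summary. The sketch below assumes standard LSTM gate equations with this concatenated recurrent input; the exact gating used in MIN may differ, and the parameter layout, helper names and random initialization are hypothetical.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def a_lstm_step(x_t, h_prev, c_prev, m_o, params):
    """One update with recurrent input [h_prev : m_o] (assumed standard gates)."""
    fused = h_prev + m_o                      # vector concatenation
    pre = {}
    for gate in "ifog":                       # input, forget, output, candidate
        wx = matvec(params["W"][gate], x_t)
        uh = matvec(params["U"][gate], fused)
        pre[gate] = [a + b + c for a, b, c in zip(wx, uh, params["b"][gate])]
    i_t = [sigmoid(z) for z in pre["i"]]
    f_t = [sigmoid(z) for z in pre["f"]]
    o_t = [sigmoid(z) for z in pre["o"]]
    g_t = [math.tanh(z) for z in pre["g"]]
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def init_params(dim_x, dim_a, dim_o, seed=0):
    """Small random parameters for illustration only."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W": {g: mat(dim_a, dim_x) for g in "ifog"},
        "U": {g: mat(dim_a, dim_a + dim_o) for g in "ifog"},
        "b": {g: [0.0] * dim_a for g in "ifog"},
    }


dim_x, dim_a, dim_o = 3, 4, 2
params = init_params(dim_x, dim_a, dim_o)
h_t, c_t = a_lstm_step([0.1, -0.2, 0.3], [0.0] * dim_a, [0.0] * dim_a,
                       [0.5, -0.5], params)
print(h_t)
```

The only change relative to a vanilla LSTM step in this sketch is the width of the recurrent weight matrices, which grow by the dimension of the opinion summary.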
The workflow of memory interaction and the update process of the internal memories in O-LSTM are kept the same as those in A-LSTM, except for the DIGEST operation. Specifically, we set $m^A_t$, the task-specific summary of A-LSTM, to $h^A_t$.
The second component of MIN is a generic LSTM called S-LSTM for discriminating sentimental sentences from non-sentimental sentences. The design and the memory update process of this component are similar to those in Jozefowicz et al. (2015). In sentences not conveying any sentimental meaning, some words like food and service tend to be misclassified as aspect terms since they are commonly used in user reviews. To avoid this kind of error, we add a constraint that an aspect term should come from a sentimental sentence. Specifically, S-LSTM learns the sentimental representation $h^S_T$ of the sentence and then feeds it into aspect prediction as a soft constraint, where $W^A_{fc}$ denotes the weight matrix of the fully-connected softmax layer.
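One way to realize the soft constraint is sketched below: the aspect hidden state of the current token is concatenated with the sentence-level sentimental representation and passed through the fully-connected softmax layer. The concatenation and the absence of a bias term are assumptions of this sketch, and the function names and toy values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_aspect_tag(h_a_t, h_s_T, W_fc):
    """Distribution over {B, I, O} from [h^A_t : h^S_T] (assumed fusion)."""
    features = h_a_t + h_s_T                  # vector concatenation
    logits = [sum(w * f for w, f in zip(row, features)) for row in W_fc]
    return softmax(logits)


# Toy values: 2-dimensional aspect state, 1-dimensional sentence state.
h_a_t = [0.2, -0.1]
h_s_T = [0.5]
W_fc = [[0.3, 0.1, 0.8],      # row for tag B
        [0.0, 0.2, 0.1],      # row for tag I
        [-0.4, 0.5, -0.6]]    # row for tag O
probs = predict_aspect_tag(h_a_t, h_s_T, W_fc)
print(dict(zip("BIO", probs)))
```

Because the sentence representation enters every token-level prediction, a sentence judged non-sentimental can shift probability mass towards O for all of its tokens.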
On the whole, our proposed MIN framework has three LSTMs and each of them is differentiable. Thus, our MIN framework can be efficiently trained with gradient descent. For A-LSTM and O-LSTM, we use the token-level cross-entropy error between the predicted distribution $P(y^{\mathcal{T}}_t | x_t)$ and the gold standard distribution $P(y^{\mathcal{T},g}_t | x_t)$ as the loss function ($\mathcal{T} \in \{A, O\}$). For S-LSTM, the sentence-level cross-entropy error is employed to calculate the corresponding loss.
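The two kinds of loss can be sketched as follows. Averaging over the tokens of a sentence and the small constant `eps` for numerical safety are assumptions of this sketch; how the three task losses are combined during training is not specified here and is left out.

```python
import math


def cross_entropy(gold_dist, pred_dist, eps=1e-12):
    """Cross-entropy between a gold and a predicted distribution."""
    return -sum(g * math.log(p + eps) for g, p in zip(gold_dist, pred_dist))


def token_level_loss(gold_dists, pred_dists):
    """Mean cross-entropy over the tokens of one sentence (A-LSTM or O-LSTM)."""
    losses = [cross_entropy(g, p) for g, p in zip(gold_dists, pred_dists)]
    return sum(losses) / len(losses)


def sentence_level_loss(gold_dist, pred_dist):
    """Cross-entropy for the sentimental / non-sentimental decision (S-LSTM)."""
    return cross_entropy(gold_dist, pred_dist)


# Toy example: two tokens with one-hot gold tags over (B, I, O).
gold = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
pred = [[0.5, 0.25, 0.25], [0.25, 0.25, 0.5]]
loss_tokens = token_level_loss(gold, pred)
loss_sentence = sentence_level_loss([1.0, 0.0], [0.5, 0.5])
print(loss_tokens, loss_sentence)
```

With one-hot gold distributions, each term reduces to the negative log-probability assigned to the gold label.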

Dataset
We conduct experiments on two benchmark datasets from the SemEval ABSA challenge (Pontiki et al., 2014, 2016), as shown in Table 1. $D_1$ (SemEval 2014) contains reviews from the laptop domain and $D_2$ (SemEval 2016) contains reviews from the restaurant domain. In these datasets, aspect terms have been labeled, and sentences containing at least one gold standard aspect are regarded as sentimental sentences. As gold standard annotations for opinion words are not provided, we select words with strong subjectivity from MPQA¹ as potential opinion words. Apart from the common opinion words in the sentiment lexicon, we also treat words that directly depend on gold standard aspect terms through high-precision dependency rules as opinion words.

Experiment Design
To evaluate the proposed MIN framework, we perform comparison with the following two groups of methods:
(1) CRF based methods:
• CRF: Conditional Random Fields with basic feature templates² and word embeddings.
• Semi-CRF: First-order semi-Markov conditional random fields (Sarawagi et al., 2004) and the feature template in Cuong et al.
For datasets in the restaurant domain, we train word embeddings of dimension 200 with word2vec (Mikolov et al., 2013) on Yelp reviews⁵. For those in the laptop domain, we use pre-trained glove.840B.300d⁶. The hyper-parameters are selected via ten-fold cross validation. The dimensions of the hidden representations are 100, 20 and 40 for A-LSTM, O-LSTM and S-LSTM respectively. The dropout rate for O-LSTM and S-LSTM is 0.4. The size of the aspect (opinion) memory $n_m$ is 4. The batch size is set to 32. For the initialization of network parameters, the initial weights are sampled from the uniform distribution (Glorot and Bengio, 2010). We employ ADAM (Kingma and Ba, 2014) as the optimizer with its default settings.
² http://sklearn-crfsuite.readthedocs.io/en/latest/
³ As we use our own implementation of LSTM, the reported results are different from those in (Liu et al., 2015).
⁴ Specifically, we list the result of RNCRF over D1 without opinion annotations for fair comparison. As no result is provided for RNCRF-no-opinion over D2, we report the corresponding performance of the full model. See their subsequent works (Wang et al., 2017a,b). Also, CMLA (Wang et al., 2017a) reports better results than RNCRF but we do not compare with it. The reason is that CMLA introduces gold standard opinion labels in the training data, while such labels are not available for our experiments.
⁵ https://www.yelp.com/dataset_challenge
⁶ https://nlp.stanford.edu/projects/glove/
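The reported settings can be gathered in a small configuration, together with a sketch of uniform initialization in the style of Glorot and Bengio (2010), where each weight is drawn from $U(-l, l)$ with $l = \sqrt{6 / (\mathrm{fan\_in} + \mathrm{fan\_out})}$. The dictionary keys, the function name and the seed are hypothetical; only the numeric settings come from the text above.

```python
import math
import random

# Settings reported in this section (key names are illustrative).
HYPER_PARAMS = {
    "hidden_dim": {"A-LSTM": 100, "O-LSTM": 20, "S-LSTM": 40},
    "dropout": {"O-LSTM": 0.4, "S-LSTM": 0.4},
    "memory_size_n_m": 4,
    "batch_size": 32,
    "embedding_dim": {"restaurant": 200, "laptop": 300},
    "optimizer": "adam",  # default optimizer settings
}


def glorot_uniform(fan_in, fan_out, rng):
    """Sample a fan_out x fan_in weight matrix from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]


rng = random.Random(0)  # arbitrary seed, for reproducibility of the sketch
# Input-to-hidden weights of A-LSTM on the restaurant data (200 -> 100).
W = glorot_uniform(HYPER_PARAMS["embedding_dim"]["restaurant"],
                   HYPER_PARAMS["hidden_dim"]["A-LSTM"], rng)
print(len(W), len(W[0]))
```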
To better reveal the capability of the proposed MIN, we train 5 models with the same group of hyper-parameters and report the average $F_1$ score over the testing set. Table 2 reports the experimental results. Compared to the best systems in the SemEval challenge, MIN achieves 3.0% and 1.1% absolute gains on $D_1$ and $D_2$ respectively. In addition, MIN outperforms WDEmb, a strong CRF-based system benefiting from several kinds of useful word embeddings, by 2.1% on $D_1$. With memory interactions and consideration of sentimental sentences, MIN improves on the performance of the vanilla bi-directional LSTM (+2.0% and +1.7% respectively). This validates the effectiveness of the manually designed memory operations and the proposed memory interaction mechanism. MIN also outperforms the state-of-the-art RNCRF on each dataset, suggesting that memory interactions can be an alternative to syntactic parsing.
To further study the impact of each element in MIN, we conduct ablation experiments. As shown in Table 3, removing bi-directionality decreases the extraction performance (-2.0% and -1.0%). The soft sentimental constraint proves to be useful, since MIN is 1.5% and 1.0% superior to the framework without S-LSTM on $D_1$ and $D_2$ respectively. O-LSTM brings the largest performance gain on $D_2$ compared with the ablated framework (i.e., MIN without O-LSTM), verifying our postulation that aspect-opinion "interaction" is more effective than only considering aspect terms. We also observe that the contribution of O-LSTM is less significant than that of bi-directionality on $D_1$ (+1.6% vs +2.0%). This is reasonable since using opinion words as adjective modifiers placed after the aspects is common in English.
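The evaluation protocol can be sketched as follows: aspect spans are recovered from BIO tags, $F_1$ is computed over spans, and the scores of several independently trained models are averaged. Exact span matching and the handling of ill-formed tag sequences are assumptions of this sketch, and the tag sequences and per-run scores shown are made-up placeholders rather than results from our experiments.

```python
from statistics import mean


def bio_to_spans(tags):
    """Extract (start, end) spans, end exclusive, from a BIO tag sequence."""
    spans, start = set(), None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.add((start, i))      # close the previous span
            start = i
        elif tag == "O":
            if start is not None:
                spans.add((start, i))
                start = None
        # "I" extends the current span; an "I" without a preceding "B"
        # is ignored in this sketch.
    if start is not None:
        spans.add((start, len(tags)))
    return spans


def span_f1(gold_sequences, pred_sequences):
    """Exact-match span-level F1 over a list of sentences."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sequences, pred_sequences):
        gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Placeholder tag sequences for two sentences.
gold = [["O", "B", "I", "O"], ["B", "O"]]
pred = [["O", "B", "I", "O"], ["O", "B"]]
f1 = span_f1(gold, pred)

# Averaging over runs; these per-run scores are placeholders only.
run_scores = [f1, f1, f1]
print(f1, mean(run_scores))
```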

Conclusions
We propose Memory Interaction Network (MIN), a multi-task learning framework, to detect aspect terms from online user reviews. Compared with previous studies, MIN has the following features:
• The co-occurrence pattern between aspects and opinions is captured via memory interactions, where the neural memory operations are designed to summarize task-level information and perform interactions.
• A novel LSTM unit with extended memories is developed for memory interactions.