Exploring Sequence-to-Sequence Learning in Aspect Term Extraction

Aspect term extraction (ATE) aims to identify all aspect terms in a sentence and is usually modeled as a sequence labeling problem. However, sequence labeling based methods cannot make full use of the overall meaning of the sentence and are limited in modeling dependencies between labels. To tackle these problems, we explore formalizing ATE as a sequence-to-sequence (Seq2Seq) learning task in which the source sequence and target sequence consist of words and labels respectively. At the same time, to adapt Seq2Seq learning to ATE, where labels correspond to words one by one, we design gated unit networks to incorporate the corresponding word representation into the decoder, and position-aware attention to attend more strongly to the adjacent words of a target word. Experimental results on two datasets show that Seq2Seq learning is effective for ATE when combined with our proposed gated unit networks and position-aware attention mechanism.


Introduction
Aspect term extraction (ATE) is a fundamental task in aspect-level sentiment analysis and aims at extracting all aspect terms present in a sentence (Hu and Liu, 2004; Pontiki et al., 2014, 2015, 2016). For example, given the restaurant review "The staff is friendly, and their cheese pizza is delicious", an ATE system should extract the aspect terms "staff" and "cheese pizza". Early works focus on detecting pre-defined aspects in a sentence (Hu and Liu, 2004; Zhuang et al., 2006; Popescu and Etzioni, 2007). Later, some works regard ATE as a sequence labeling task and utilize Hidden Markov Models (Jin et al., 2009) or Conditional Random Fields (Jin et al., 2009; Ma and Wan, 2010; Jakob and Gurevych, 2010; Liu et al., 2013) to extract all possible aspect terms. With the development of deep learning techniques, neural network based methods (Wang et al., 2016; Liu et al., 2015; Li and Lam, 2017; Xu et al., 2018) have achieved good performance on the ATE task; they still treat ATE as a sequence labeling problem and extract richer features around each word. Obviously, the overall meaning of the sentence is important for predicting the label sequence. For example, the word memory should be an aspect term in the laptop review "The memory is enough for use.", but it is not an aspect term in the sentence "The memory is sad for me.". However, sequence labeling methods are not good at grasping the overall meaning of the whole sentence because they cannot read the whole sentence in advance. In addition, neural network based sequence labeling methods are limited in modeling label dependencies because they only use a transition matrix to encourage valid label paths and discourage other paths (Collobert et al., 2011). As we know, the label of each word is conditioned on its previous label. For example, in the B-I-O tagging scheme, "O" can be followed by "B" or "O" but not by "I".
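To make the B-I-O constraint concrete, here is a minimal sketch (the helper names `is_valid_bio` and `extract_aspects` are ours, not from the paper) that checks label-sequence validity and recovers aspect spans from a tagged sentence:

```python
def is_valid_bio(labels):
    """Check the B-I-O constraint: 'I' may only follow 'B' or 'I'."""
    prev = "O"
    for lab in labels:
        if lab == "I" and prev == "O":
            return False
        prev = lab
    return True

def extract_aspects(words, labels):
    """Collect aspect terms as maximal B(I)* spans."""
    aspects, current = [], []
    for w, lab in zip(words, labels):
        if lab == "B":                      # a new span starts
            if current:
                aspects.append(" ".join(current))
            current = [w]
        elif lab == "I" and current:        # continue the current span
            current.append(w)
        else:                               # 'O' (or stray 'I') closes the span
            if current:
                aspects.append(" ".join(current))
            current = []
    if current:
        aspects.append(" ".join(current))
    return aspects

words  = "The staff is friendly , and their cheese pizza is delicious".split()
labels = ["O", "B", "O", "O", "O", "O", "O", "B", "I", "O", "O"]
print(extract_aspects(words, labels))  # ['staff', 'cheese pizza']
```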
To the best of our knowledge, no neural network based method directly utilizes the previous label to improve its performance.
Recently, sequence-to-sequence (Seq2Seq) learning has been successfully applied to many generation tasks (Cho et al., 2014b; Sutskever et al., 2014; Bahdanau et al., 2014; Nallapati et al., 2016). Seq2Seq learning encodes a source sequence into a fixed-length vector, based on which a decoder generates a target sequence. It thus has the benefit of first collecting comprehensive information from the source text and then focusing on the generation of the target sequence. We therefore propose to formalize the ATE task as a sequence-to-sequence learning problem, where the source and target sequences are the word and label sequences respectively. Our proposed method can make full use of the overall meaning of the sentence when decoding the target sequence, because the fixed-length vector stores all useful information of the sentence and is used throughout the decoding process. At the same time, Seq2Seq learning can remedy the label dependency problem because each label is conditioned on the previous label when generating the label sequence.
Though Seq2Seq learning has obvious advantages for generating a sequence, it faces the difficulty of precisely mapping each word to its corresponding label. As we know, the label of each word is highly related to its own meaning. For example, an aspect term tends to consist of words that identify a class of people, places, or things (e.g. staff, restaurant, pizza), while words that describe an action, state, or occurrence (e.g. hear, become, happen) are rarely part of an aspect term. Furthermore, our proposed method knows for which word it generates a label, and this kind of one-to-one correspondence does not exist in other Seq2Seq tasks (e.g. machine translation). To incorporate the exact meaning of each word into Seq2Seq learning, we propose gated unit networks (GUN), which contain a gated unit computed from the hidden states of the encoder and decoder. The gated unit automatically integrates information from the encoder and decoder hidden states of the current word when decoding its label.
Furthermore, the label of each word depends on its adjacent words, because the adjacent words of an aspect term tend to be articles, verbs, adjectives, and so on. In the example in the first paragraph, the adjacent words of staff (The, is, and friendly) have a positive effect on predicting its label, while the remaining words are not key factors. This shows the importance of each word's adjacent words in predicting its label. In classic Seq2Seq learning, an attention mechanism lets the decoder select important parts of the source sequence to form a context vector for decoding the current word (Bahdanau et al., 2014). However, this kind of attention mechanism cannot pay more attention to the adjacent words of a word because it does not take distance into account. To overcome this shortcoming, we introduce position-aware attention, which first computes the weight of each word with regard to the previous hidden state $s_{t-1}$. Then, the weight of word $i$ is decreased according to the distance between word $i$ and the current word $t$: the more distant the word, the lower its importance. Therefore, our position-aware attention model forces the decoder to pay more attention to the adjacent words of the current word when decoding its label.
We conduct experiments on two datasets, and the experimental results demonstrate that our proposed method achieves results comparable to existing methods.

Model
Our proposed method is based on the sequence-to-sequence learning framework, plus two supplementary components, namely position-aware attention and gated unit networks, which capture features from the current word and its adjacent words. In this section, we introduce our model in detail; its overall architecture is displayed in Figure 1.

Sequence-to-Sequence Learning
For convenience, we first define the notation used below. Let $X = [x_1, x_2, \ldots, x_n]$ denote a sentence containing $n$ words, where $x_i \in \mathbb{R}^d$ is a word embedding that can be learned by a neural language model (Bengio et al., 2003; Mikolov et al., 2013). Let $Y = [y_1, y_2, \ldots, y_n]$ denote the aspect term labels of sentence $X$, where $y_i \in \{B, I, O\}$. We refer to $X$ and $Y$ as the source and target sequence respectively.
The sequence-to-sequence learning framework is composed of two basic components: an encoder and a decoder. The encoder reads the embeddings of the source sequence and learns hidden states $H = [h_1, h_2, \ldots, h_n]$ for all words; the commonly used encoder is a Recurrent Neural Network (RNN). In our model, we use a bidirectional gated recurrent unit (Bi-GRU) (Cho et al., 2014b) to obtain the hidden states:

$$h_t = \text{Bi-GRU}(x_t, h_{t-1}) \quad (1)$$

where Bi-GRU represents the operations of a bidirectional GRU, $h_t \in \mathbb{R}^{s_e}$ is the hidden state of word $t$, and $s_e$ is the hidden state size of the encoder. The decoder is also an RNN, which generates the target sequence $Y$ based on $X$ and predicts the next label $y_t$ based on the context vector $c_t$ and all previous labels $[y_1, y_2, \ldots, y_{t-1}]$ predicted by the same decoder. Therefore, the joint probability of the target sequence is defined as:

$$P(Y \mid X) = \prod_{t=1}^{n} P(y_t \mid y_{[1:t-1]}, c_t) \quad (2)$$

where $y_{[1:t-1]} = [y_1, \ldots, y_{t-1}]$. The conditional probability of label $y_t$ is modeled by the decoder and defined as:

$$P(y_t \mid y_{[1:t-1]}, c_t) = \text{softmax}(W_o s_t + b_o) \quad (3)$$

where $W_o \in \mathbb{R}^{|V| \times s_d}$, $|V|$ is the target vocabulary size, and $s_d$ is the hidden state size of the decoder. $s_t \in \mathbb{R}^{s_d}$ is the hidden state of the decoder at time step $t$, computed as:

$$s_t = \text{GRU}(s_{t-1}, y^e_{t-1} \oplus c_t) \quad (4)$$

where GRU is a unidirectional GRU, $\oplus$ is the concatenation operation, and $y^e_{t-1}$ is the label embedding of label $y_{t-1}$. The context vector $c_t$ will be explained in the next section. Note that the initial hidden state of the decoder is the last hidden state of the encoder, so the decoder is aware of the meaning of the whole source sequence during the decoding process.
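One decoding step described above — concatenating the previous label embedding with the context vector, updating the GRU state, and applying a softmax over the three labels — can be sketched with toy dimensions and random weights (a didactic sketch under assumed shapes, not the trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

def gru_step(x, s_prev, W, U, b):
    """One GRU step (Cho et al., 2014): x is the input, s_prev the previous
    hidden state. W, U, b each stack update/reset/candidate parameters."""
    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))
    Wz, Wr, Wh = W
    Uz, Ur, Uh = U
    bz, br, bh = b
    z = sigmoid(Wz @ x + Uz @ s_prev + bz)            # update gate
    r = sigmoid(Wr @ x + Ur @ s_prev + br)            # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * s_prev) + bh)  # candidate state
    return (1 - z) * s_prev + z * h_tilde

# Toy sizes: label-embedding 4, context 6, decoder hidden s_d = 5, |V| = 3 labels.
d_in, s_d, V = 4 + 6, 5, 3
W = [rng.normal(size=(s_d, d_in)) for _ in range(3)]
U = [rng.normal(size=(s_d, s_d)) for _ in range(3)]
b = [np.zeros(s_d) for _ in range(3)]
W_out = rng.normal(size=(V, s_d))

y_prev_emb = rng.normal(size=4)   # embedding of the previous label
c_t = rng.normal(size=6)          # context vector from the attention model
s_prev = np.zeros(s_d)

s_t = gru_step(np.concatenate([y_prev_emb, c_t]), s_prev, W, U, b)
logits = W_out @ s_t
p = np.exp(logits - logits.max()); p /= p.sum()   # softmax over {B, I, O}
```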
The encoder and the decoder are jointly trained by minimizing the negative log-likelihood loss:

$$L(\theta) = -\sum_{t=1}^{n} \log P(y_t = l_t \mid y_{[1:t-1]}, c_t; \theta) \quad (5)$$

where $l_t$ is the ground-truth label of word $t$ and $\theta$ denotes the parameters of the encoder and the decoder. From Eq. (3) and (4), we can see that the previous label is taken as input when decoding the label of the current word. In contrast, existing neural network based sequence labeling methods first compute the label scores of all words simultaneously and then obtain the globally optimized label sequence (Collobert et al., 2011); therefore, they do not know the label of the previous word when computing the label scores of the current word. Our proposed model instead generates the label of the current word conditioned on the label of the previous word. This is the main difference between our proposed model and existing methods in handling label dependencies for the ATE task.
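The per-sequence loss is just the sum of per-step negative log-probabilities of the gold labels; a pure-Python sketch (label indices and probabilities below are made up for illustration):

```python
import math

def seq_nll(step_probs, gold_labels):
    """Negative log-likelihood of the gold label sequence under the decoder's
    per-step distributions: -sum_t log P(y_t = l_t | ...)."""
    return -sum(math.log(p[l]) for p, l in zip(step_probs, gold_labels))

# Per-step distributions over the label vocabulary {B: 0, I: 1, O: 2}.
probs = [
    {0: 0.7, 1: 0.1, 2: 0.2},
    {0: 0.1, 1: 0.8, 2: 0.1},
    {0: 0.05, 1: 0.05, 2: 0.9},
]
gold = [0, 1, 2]                  # gold labels: B, I, O
loss = seq_nll(probs, gold)       # = -(log 0.7 + log 0.8 + log 0.9)
```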

Position-Aware Attention
In the ATE task, the adjacent words of each word have an important effect on predicting its label, while distant words contribute less. The reason is that aspect terms are often surrounded by their modifiers. To the best of our knowledge, the widely used attention mechanisms usually ignore the influence of position when measuring the weight of each word. Therefore, we propose a Position-Aware Attention (PAA) model which decreases the weight of word $i$ according to the distance between word $i$ and word $t$. Supposing that we compute the context vector $c_t$ at position $t$, PAA first computes the weight of each word by:

$$\alpha_{t,i} = \frac{\exp(f(s_{t-1}, h_i))}{\sum_{j=1}^{n} \exp(f(s_{t-1}, h_j))} \quad (6)$$

where $f(s_{t-1}, h_i)$ is the score function which computes the weight of $h_i$ given the previous decoder hidden state $s_{t-1}$ and the corresponding distance.
The score function is defined as:

$$f(s_{t-1}, h_i) = \frac{1}{d(w_i, w_t)} \, v_s^T \tanh(W_s [s_{t-1} \oplus h_i] + b_s) \quad (7)$$

where $\frac{1}{d(w_i, w_t)}$ is the weight decay rate of word $i$, and $W_s \in \mathbb{R}^{(s_d+s_e) \times (s_d+s_e)}$, $v_s \in \mathbb{R}^{s_d+s_e}$ and $b_s \in \mathbb{R}^{s_d+s_e}$ are the weight matrix, weight vector and bias respectively; $v_s^T$ is the transpose of $v_s$. In our model, we set $d(w_i, w_t) = \log_2(2 + l)$, where $l$ is the distance between word $w_i$ and the current word $w_t$. In the example in Figure 1, when computing the context vector for rings, $d(\textit{union}, \textit{rings}) = \log_2(2 + 1)$.
Finally, the context vector $c_t$ is computed as a weighted sum of the encoder hidden states:

$$c_t = \sum_{i=1}^{n} \alpha_{t,i} h_i \quad (8)$$

We can see that PAA tunes the weight of each word according to its distance. Therefore, compared with vanilla attention, our model can pay more attention to a word's adjacent words.
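Putting the pieces together, position-aware attention scales each raw score by $1/\log_2(2 + l)$ before the softmax and then forms the context vector as a weighted sum; a small self-contained sketch (toy scores and states, function names ours):

```python
import math

def paa_weights(scores, distances):
    """Position-aware attention: each raw score f(s_{t-1}, h_i) is scaled by
    1 / d(w_i, w_t) with d = log2(2 + l), then normalized by softmax."""
    decayed = [s / math.log2(2 + l) for s, l in zip(scores, distances)]
    m = max(decayed)                              # subtract max for stability
    exps = [math.exp(v - m) for v in decayed]
    z = sum(exps)
    return [e / z for e in exps]

def context_vector(weights, hidden_states):
    """c_t = sum_i alpha_{t,i} * h_i (element-wise weighted sum)."""
    dim = len(hidden_states[0])
    return [sum(a * h[k] for a, h in zip(weights, hidden_states))
            for k in range(dim)]

# Toy example: 4 source words, current decoding position t = 1.
scores = [1.0, 1.0, 1.0, 1.0]    # identical raw scores
dists  = [1, 0, 1, 2]            # |i - t| for t = 1
alphas = paa_weights(scores, dists)
# With equal raw scores, the word at distance 0 receives the largest weight.
```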

Gated Unit Networks
When solving ATE with our proposed method, there exists a consistent one-to-one mapping between the source sequence and the target sequence. This means that the word representation can be used to help the decoder generate the word's label. For example, some kinds of words (e.g. food, place, and people) tend to be aspect terms, while other words (e.g. verbs, adjectives and adverbs) are less likely to be part of an aspect term. Therefore, we design Gated Unit Networks (GUN) to incorporate word information into our model.
The main component of GUN is a merge gate which integrates information from the encoder hidden state $h_t$ and the decoder hidden state $s_t$. To give $s_t$ and $h_t$ the same dimension $s_g$, we apply fully connected layers to $s_t$ and $h_t$ to obtain new representations $s'_t \in \mathbb{R}^{s_g}$ and $h'_t \in \mathbb{R}^{s_g}$. The merge gate is defined as:

$$g_t = \sigma(W_g h'_t + U_g s'_t + b_g) \quad (9)$$

where $\sigma$ is the sigmoid function, $W_g, U_g \in \mathbb{R}^{s_g \times s_g}$ are weight matrices and $b_g \in \mathbb{R}^{s_g}$ is a bias. The merge gate automatically controls how much information is taken from $h'_t$ and $s'_t$ when decoding the label of word $t$:

$$r_t = g_t \odot h'_t + (1 - g_t) \odot s'_t \quad (10)$$

Finally, we feed $r_t$, rather than the $s_t$ used in Eq. (3), to the softmax to obtain the label distribution of word $t$. $h'_t$ plays a more important role than $s'_t$ if $g_t$ is greater than 0.5, and vice versa. In this way, GUN makes full use of the corresponding word representation to help the decoder generate the label.
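The merge gate can be illustrated with a stripped-down sketch; for brevity we use element-wise (diagonal) weights instead of the full matrices of the paper's formulation, so this is a simplification, not the actual model:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def merge_gate(h_proj, s_proj, w_h, w_s, bias):
    """Element-wise sketch of the merge gate g = sigma(W_g h' + U_g s' + b_g);
    here the weights are diagonal, so each dimension is gated independently."""
    return [sigmoid(wh * h + ws * s + b)
            for h, s, wh, ws, b in zip(h_proj, s_proj, w_h, w_s, bias)]

def merge(h_proj, s_proj, gate):
    """r_t = g * h' + (1 - g) * s': the gate decides, per dimension, how much
    comes from the encoder state versus the decoder state."""
    return [g * h + (1 - g) * s for g, h, s in zip(gate, h_proj, s_proj)]

h_proj = [0.5, -0.2, 0.9]   # projected encoder state h'_t (toy values)
s_proj = [0.1, 0.4, -0.3]   # projected decoder state s'_t (toy values)
g = merge_gate(h_proj, s_proj, [1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
r = merge(h_proj, s_proj, g)
```

Because the gate lies in (0, 1), each dimension of the merged state is a convex combination of the corresponding encoder and decoder values.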

Experiments
In this section, we first introduce the datasets and hyper-parameters used in our experiments. Then, we describe the baselines for comparison. Finally, we compare the performance of our model with the baselines and analyze why our model works.

Dataset & Hyperparameter Setting
We conduct experiments on two widely used datasets for the ATE task (Li and Lam, 2017; Li et al., 2018; Xu et al., 2018); their statistics are shown in Table 1. All sentences are tokenized with NLTK. In our experiments, we randomly hold out 10% of the training data as validation data. We adopt the F1-measure to evaluate the performance of the baselines and our model.
In our experiments, all word embeddings are initialized with pre-trained GloVe embeddings (Pennington et al., 2014). We also use fastText (Joulin et al., 2016) to compute word vectors for out-of-vocabulary (OOV) words. The label embeddings are initialized randomly. The word and label embedding sizes are set to 300 and 50 respectively. The parameters of our model are initialized from the uniform distribution $u \sim (-0.1, 0.1)$. Both the encoder and decoder have two GRU layers with a hidden size of 300. We use Adam (Kingma and Ba, 2014) to optimize our model with a learning rate of 0.001 and momentum coefficients of 0.9 and 0.999. The batch size is set to 8. To avoid overfitting, we apply dropout to the word and label embeddings with a dropout rate of 0.5.

Baselines
To evaluate the effectiveness of our approach, we compare our model with three groups of baselines. The first group of baselines utilizes conditional random fields (CRF): • CRF trains a CRF model with basic feature templates and word embeddings (Pennington et al., 2014) for ATE.
• IHS R&D is the best system of laptop domain, and uses CRF with features extracted using named entity recognition, POS tagging, parsing, and semantic analysis (Chernyshevich, 2014).
• NLANGP utilizes CRF with word, name-list and word-cluster features to tackle the task and obtains the best results in the restaurant domain. It also uses the output of a Recurrent Neural Network (RNN) as additional features to enhance its performance (Toh and Su, 2016).
• WDEmb first learns embeddings of words and dependency paths based on the optimization objective $w_1 + r \approx w_2$, where $w_1, w_2$ are words and $r$ is the corresponding dependency path. The learned embeddings of words and dependency paths are then used as features in a CRF for ATE (Yin et al., 2016).
The second group of baselines employs neural network methods to address the ATE problem: • Bi-LSTM applies different kinds of Bi-RNN (Elman/Jordan-type RNNs) with different kinds of embeddings to the ATE task (Liu et al., 2015).
• BiLSTM-CNN-CRF is the state-of-the-art system for the named entity recognition task; it adopts a CNN and a Bi-LSTM to learn character-level and word-level features respectively, and a CRF is used to avoid illegal transitions between labels (Reimers and Gurevych, 2017).
The third group of baselines consists of joint methods for aspect term and opinion term extraction, which take advantage of opinion label information to improve their performance.
• MIN is an LSTM-based deep multi-task learning framework for ATE, opinion word extraction and sentimental sentence classification. It has two LSTMs equipped with extended memories, and neural memory operations are designed to jointly handle the extraction of aspects and opinions via memory interactions (Li and Lam, 2017).
• CMLA consists of a multi-layer attention network, where each layer contains a pair of attentions with tensor operators: one attention extracts aspect terms, while the other extracts opinion terms (Wang et al., 2017).
• RNCRF learns structural features for each word from the parse tree with Recursive Neural Networks, and the learned features are fed to a CRF to decode the label of each word (Wang et al., 2016). It also uses handcrafted features to improve its performance.
• HAST tackles ATE by exploiting two useful clues, namely opinion summary and aspect detection history (Li et al., 2018). For a fair comparison, we compare our method with GloVe-CNN, which only uses GloVe embeddings, since our model also uses only GloVe embeddings while DE-CNN uses additional domain embeddings trained on a large domain corpus.


Results Discussion
In this section, we report the performance of all models and analyze their advantages and disadvantages. The results of the baselines and our model are displayed in Table 2. From the first part, we can see that the CRF model obtains the worst performance on both datasets. Compared with the CRF model, IHS R&D and NLANGP achieve better performance because they add more handcrafted features to the CRF. This shows that useful features are key for CRF based methods. Different from the three previous approaches, WDEmb only uses word embeddings as input and performs better than the IHS R&D model. In fact, the CRF model also uses GloVe embeddings, but its results are much worse than those of WDEmb. The reason may be that the embeddings used in WDEmb are trained with parsing information, which plays an important role in the ATE task; for example, subjects and objects have a higher probability of being aspect terms than other components. We find that CRF based methods depend heavily on the quality of their features. However, effective features are hard to extract, which prevents CRF based methods from improving their results.
From the second part, we can observe that the Bi-LSTM model obtains the worst performance on both datasets among the neural network based methods. Although the Bi-LSTM model only takes embeddings as features, it achieves results comparable to the best CRF based methods. The main reason is that Bi-LSTM can learn dependencies between words, which demonstrates that neural network based methods have greater advantages than CRF based methods in solving the ATE task. Compared with Bi-LSTM, the GloVe-CNN model improves by 2.42% and 0.82% on the laptop and restaurant datasets respectively. Note that GloVe-CNN only extracts features in a fixed-size window around each word to predict its label. That is to say, adjacent words are key factors for ATE, and this important information is also incorporated into our model by PAA. The BiLSTM-CNN-CRF model takes advantage of both Bi-LSTM and CNN and achieves better performance than either system alone. This shows that Bi-LSTM and CNN can complement each other.
From the third part, we can see that MIN, CMLA, RNCRF and HAST achieve good performance on both datasets. This implies that joint learning is a promising direction for the ATE task. However, they take advantage of opinion information to improve their performance, and such opinion information is not available in many situations. Note that HAST also uses the information of previous words to predict the current label, and its authors find that previous-word information (not the predicted label of the previous word) is important for modeling label dependencies.
Finally, we can see that Seq2Seq4ATE improves performance by about 0.79% and 1.53% on the two datasets compared with HAST. In addition, Seq2Seq4ATE does not take advantage of any extra features such as handcrafted/syntactic features or opinion information. This demonstrates the effectiveness of our model.
In short, our proposed method can make use of the overall meaning of the sentence to better deal with polysemous words (e.g. memory) and remedy the label dependency problem by decoding the current word conditioned on the previous label. In addition, we propose PAA and GUN to make the Seq2Seq learning method better suit the ATE task.

Ablation Study
In this section, we study the effectiveness of the key components of our proposed model (i.e. PAA and GUN) by conducting an extensive ablation study. There are two main ablation baselines: (1) Seq2Seq4ATE-w/o-PAA removes PAA from Seq2Seq4ATE; (2) Seq2Seq4ATE-w/o-GUN removes GUN from Seq2Seq4ATE.
In addition, we also use the vanilla attention mechanism (VAM) to compute the context vector (named Seq2Seq+VAM) to verify the advantage of PAA. Table 3 reports the results of Seq2Seq4ATE and its variants. From Table 3, we first observe that both PAA and GUN are important components of our model, because removing either of them results in a heavy drop in performance on both datasets.
Secondly, we can see that Seq2Seq4ATE-w/o-GUN performs better on the laptop dataset while Seq2Seq4ATE-w/o-PAA performs better on the restaurant dataset. The reason may be that aspect terms in the laptop domain are relatively fixed words such as CPU, memory, etc., whereas aspect terms in the restaurant domain are more arbitrary, such as The Mom Kitchen, Hot Pizzeria, etc. Therefore, GUN is more important in the laptop domain because it incorporates the word representation into Seq2Seq via the merge gate, while PAA is more important in the restaurant domain because it leverages the adjacent words of each word to help predict its label.
In addition, we find that Seq2Seq4ATE with both PAA and GUN removed performs very badly on both datasets. We think the main reason is that the number of aspect term words is much smaller than the total number of words, so the model can hardly learn useful information from the data: analyzing the datasets, we find that aspect term words make up only 8.8% and 6.9% of the training data in the restaurant and laptop domains respectively.
Finally, we can see that Seq2Seq4ATE improves by about 2.92% and 2.67% on the laptop and restaurant datasets compared with Seq2Seq+VAM. These large improvements again prove that adjacent words play important roles in ATE. The reason is that the weights of distant words may be large in VAM, whereas in PAA the weights of distant words are heavily decayed by the position information while the weights of adjacent words are decayed little, because $d(w_i, w_t)$ increases with distance.

Analysis of Label Dependencies
In this section, we conduct experiments to validate the effectiveness of our proposed model in handling label dependencies. Collobert et al. (2011) have demonstrated that it is important to model label dependencies in sequence labeling tasks. To validate the effectiveness of our model in addressing this problem, we compare Seq2Seq4ATE with two models: BiLSTM and BiLSTM+CRF. BiLSTM does not take label dependencies into account, while BiLSTM+CRF uses a transition matrix (Collobert et al., 2011) to address the label dependency problem.
To evaluate the effectiveness of a model in modeling label dependencies, we propose an evaluation criterion, the Illegal Transition Rate (IT-Rate), computed as:

$$\text{IT-Rate} = \frac{\#\text{illegal transitions}}{\#\text{aspect terms}} \times 100$$

where "#illegal transitions" is the number of illegal transitions (e.g. O→I) occurring in the predicted label sequences, and "#aspect terms" is the number of aspect terms. Generally speaking, a lower IT-Rate means better modeling of label dependencies. Table 4 shows the results of the three models on the test data. First, we can observe that higher F1 is accompanied by lower IT-Rate. This once again demonstrates the importance of modeling label dependencies. Secondly, we can observe that BiLSTM+CRF decreases the IT-Rate by about 2.75% and 5.29% on the two datasets compared with the BiLSTM model. This indicates that the transition matrix is a good way to model label dependencies; however, it still does not utilize the previous label directly. The most impressive result is that the IT-Rate of Seq2Seq4ATE is 0.02% and 0.03%, which is almost negligible compared with BiLSTM and BiLSTM+CRF. The main reason is that Seq2Seq4ATE leverages the previous label $y_{t-1}$ to decode the label $y_t$ of word $t$; consequently, $y_t$ is compatible with $y_{t-1}$. This indicates the advantage of our model in handling label dependencies compared with previous methods.
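The IT-Rate metric is straightforward to compute; a sketch (we assume an 'I' at the very start of a sentence also counts as an illegal O→I transition, which the text does not state explicitly):

```python
def it_rate(pred_labels_list, n_aspect_terms):
    """Illegal Transition Rate: count O->I transitions (treating the start of a
    sentence as following 'O') across predicted sequences, per 100 aspect terms."""
    illegal = 0
    for labels in pred_labels_list:
        prev = "O"
        for lab in labels:
            if lab == "I" and prev == "O":
                illegal += 1
            prev = lab
    return illegal / n_aspect_terms * 100

# Toy predictions: the second sequence illegally starts a span with 'I'.
preds = [["O", "B", "I", "O"], ["O", "I", "O"]]
rate = it_rate(preds, n_aspect_terms=2)   # 1 illegal transition / 2 terms * 100
```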

Related Work
Aspect-based sentiment analysis (ABSA) is a subfield of sentiment analysis (Hu and Liu, 2004; Pontiki et al., 2014, 2015, 2016). In this paper, we focus only on the ATE task, and we solve it with Seq2Seq learning, which is usually used for generation tasks. Below, we review recent progress in ATE and Seq2Seq learning. Hu and Liu (2004) first propose to evaluate the sentiment of different aspects in a document, where all aspects are predefined manually. The key step is to extract all possible aspects of a document (Zhuang et al., 2006; Popescu and Etzioni, 2007; Mei et al., 2007; Titov and McDonald, 2008; He et al., 2017). However, predefined aspects may not cover all the aspects appearing in a document, so many works turn to extracting all possible aspect terms in a document. The mainstream methods for aspect term extraction include unsupervised and supervised methods. Typical unsupervised methods include bootstrapping (Wang and Wang, 2008), double propagation (Qiu et al., 2011) and others. The supervised methods include Hidden Markov Models (Jin et al., 2009), Conditional Random Fields (Jakob and Gurevych, 2010; Li et al., 2010; Yang and Cardie, 2013; Chernyshevich, 2014; Toh and Su, 2016; Yin et al., 2016; Shu et al., 2017) and other approaches (Wu et al., 2009; Ma and Wan, 2010; Liu et al., 2013). With the development of deep learning, neural network based methods such as recurrent NNs (Liu et al., 2015; Li and Lam, 2017), recursive NNs (Wang et al., 2016), convolutional NNs (Poria et al., 2016; Xu et al., 2018) and attention models (Wang et al., 2017) have achieved good performance in ATE. In addition, many works utilize multi-task learning (Yang and Cardie, 2013; Wang et al., 2016, 2017; Li et al., 2018) and other resources (Xu et al., 2018) to improve their performance.

Sequence-to-Sequence Learning
The sequence-to-sequence model is a generative model proposed by Cho et al. (2014b) and Sutskever et al. (2014), first used in the field of machine translation. In addition, Cho et al. (2014a) improve decoding with beam search. However, the vanilla Seq2Seq model performs poorly when generating long sentences, because the encoder must compress the whole sentence into a fixed-length representation. To address this problem, Bahdanau et al. (2014) introduce an attention mechanism which selects important parts of the source sentence with respect to the previous hidden state when decoding the next state. Afterwards, some studies focus on improving the attention mechanism (Luong et al., 2015). So far, Seq2Seq models and attention mechanisms have been applied to many fields such as dialogue generation (Serban et al., 2016), text summarization (Nallapati et al., 2016) and so on.
In this paper, we make a first attempt to formalize ATE as a sequence-to-sequence learning task because, compared with existing methods, it can make full use of both the meaning of the sentence and label dependencies. Furthermore, we design a position-aware attention model and gated unit networks to make the Seq2Seq model better suit this task. In general, Seq2Seq models are time-consuming in many fields because the target vocabulary is very large, but the time cost in ATE is acceptable because the target vocabulary size is only 3.

Conclusion and Future Work
In this paper, we propose a sequence-to-sequence learning based approach to the ATE task. Our proposed method can take full advantage of the meaning of the whole sentence and the previous label during the decoding process. Furthermore, we find that each word's adjacent words and its own representation are key factors for its label, and we propose PAA and GUN to incorporate these two kinds of information into our model. The experimental results demonstrate that our approach achieves comparable performance on the ATE task. In future work, we plan to apply our approach to other sequence labeling tasks, such as named entity recognition, word segmentation and so on.