Target-specified Sequence Labeling with Multi-head Self-attention for Target-oriented Opinion Words Extraction

Opinion target extraction and opinion term extraction are two fundamental tasks in Aspect Based Sentiment Analysis (ABSA). Many recent works on ABSA focus on Target-oriented Opinion Words (or Terms) Extraction (TOWE), which aims at extracting the corresponding opinion words for a given opinion target. TOWE can be further applied to Aspect-Opinion Pair Extraction (AOPE), which aims at extracting aspects (i.e., opinion targets) and opinion terms in pairs. In this paper, we propose Target-Specified sequence labeling with Multi-head Self-Attention (TSMSA) for TOWE, into which any pre-trained language model with multi-head self-attention can be integrated conveniently. As a case study, we also develop a Multi-Task structure named MT-TSMSA for AOPE by combining our TSMSA with an aspect and opinion term extraction module. Experimental results indicate that TSMSA outperforms the benchmark methods on TOWE significantly; meanwhile, the performance of MT-TSMSA is similar to or even better than that of state-of-the-art AOPE baseline models.


Introduction
Aspect-Based Sentiment Analysis (ABSA) (Pontiki et al., 2014) has attracted much attention from researchers in recent years. In ABSA, aspect (also called opinion target) extraction and opinion term extraction are two fundamental tasks. An aspect is a word or phrase in a review referring to the object towards which users show attitudes, while opinion terms are the words or phrases expressing users' attitudes (Wu et al., 2020). For example, in the sentence "The dim sum is delicious.", the phrase "dim sum" is an aspect and the word "delicious" is an opinion term. See the upper part of Table 1 for more examples. Plenty of works based on neural networks have been done in both aspect and opinion term extraction (Liu et al., 2015; Poria et al., 2016; Xu et al., 2018); moreover, some studies combine these two tasks into a multi-task structure to extract aspects and opinion terms simultaneously (Wang et al., 2016, 2017; Li and Lam, 2017; Dai and Song, 2019).

Review:
"Soooo great! The food is delicious and inexpensive, and the environment is nice. The only problem is that the soup and dessert are ordinary."

Aspect-Opinion Pairs:
food : [delicious, inexpensive] (one-to-many)
environment : [nice] (one-to-one)
soup, dessert : [ordinary] (many-to-one)

Table 1: The upper part is a restaurant review and the lower part shows the corresponding aspect-opinion pairs. Extracted aspects and opinion terms are marked in red and blue, respectively.
However, one critical deficiency of the studies mentioned above is that they ignore the relation between aspects and opinion terms, which motivated Target-oriented Opinion Words (or Terms) Extraction (TOWE) (Fan et al., 2019), the task of extracting the corresponding opinion terms of a given opinion target. Subsequently, Aspect-Opinion Pair Extraction (AOPE) (Chen et al., 2020) and Pair-wise Aspect and Opinion Terms Extraction (PAOTE) have emerged, which both aim at extracting aspects and opinion terms in pairs. AOPE and PAOTE are exactly the same task, only named differently; in the following, we use AOPE to denote this task for simplicity. AOPE can be regarded as the combination of aspect and opinion term extraction and TOWE. Since aspect extraction has been studied thoroughly and satisfactory results have been obtained, TOWE, which aims at mining the relation between aspects and opinion terms, is the key to the AOPE task. As shown in the lower part of Table 1, the relational structure of the aspect-opinion pairs within a sentence can be complicated, including one-to-one, one-to-many, and many-to-one.
The challenge of TOWE lies in accurately learning representations of the given opinion target, and only a few works focus on this task. For instance, Fan et al. (2019) propose an Inward-Outward LSTM to pass target information to the left context and the right context of the target respectively, and then combine the left, right, and global contexts to encode the sentence. Recently, SDRN (Chen et al., 2020) and SpanMlt both adopt a pre-trained language model to learn contextual representations for AOPE. In SDRN, a double-channel recurrent network and a synchronization unit are applied to extract aspects, opinion terms, and their relevancy. In SpanMlt, terms are extracted under annotated span boundaries with contextual representations, and then the relations between every two span combinations are identified. However, apart from the hyper-parameters in the pre-trained language model, these two methods introduce many other hyper-parameters (e.g., the hidden size, thresholds, and recurrent steps in SDRN, and the span length, top-k spans, and the balance factor of different tasks in SpanMlt). Some of these hyper-parameters have a significant impact on model performance.
Motivated by the previous work and to address the challenges mentioned above, we propose a Target-Specified sequence labeling method based on Multi-head Self-Attention (Vaswani et al., 2017) (TSMSA). The sentence is first processed into the format "[SEP] Aspect [SEP]" (e.g., "The [SEP] food [SEP] is delicious."), which is inspired by Soares et al. (2019), who utilized a special symbol "[SEP]" to label all entities and output their corresponding representations. Then we develop a sequence labeling model based on multi-head self-attention to identify the corresponding opinion terms. By using the special symbol and the self-attention mechanism, TSMSA is capable of capturing the information of the specific aspect. To improve the performance of our model, we apply pre-trained language models like BERT (Devlin et al., 2019), which contain a multi-head self-attention module, as the encoder. As a case study, we integrate aspect and opinion term extraction and TOWE into a Multi-Task architecture named MT-TSMSA to validate the effectiveness of our method on the AOPE task. In addition, apart from the hyper-parameters in the pre-trained language model, we only need to adjust the balance factor of different tasks in MT-TSMSA. In summary, our main contributions are as follows:
• We propose a target-specified sequence labeling method with a multi-head self-attention mechanism for TOWE, which generates target-specific context representations for different targets in the same review with the special symbol and multi-head self-attention. Pre-trained language models can be conveniently applied to improve the performance.
• For our TSMSA and MT-TSMSA, only a small number of hyper-parameters need to be adjusted when using pre-trained language models. Compared to the existing models for TOWE and AOPE, we alleviate the trade-off between model complexity and performance.
• Extensive experiments validate that our TSMSA achieves the best performance on TOWE, and MT-TSMSA performs quite competitively on AOPE.
The rest of this paper is organized as follows. Section 2 introduces the existing studies on TOWE and AOPE, respectively. Section 3 details the proposed TSMSA and MT-TSMSA. Section 4 presents our experimental results and discussions. Finally, we draw conclusions in Section 5.

Target-oriented Opinion Words Extraction
Plenty of works have been carried out for aspect extraction and opinion term extraction. Early studies can be divided into unsupervised/semi-supervised methods (Hu and Liu, 2004; Zhuang et al., 2006; Qiu et al., 2011) and supervised methods (Jakob and Gurevych, 2010; Shu et al., 2017). With the development of neural networks, deep learning methods (Liu et al., 2015; Yin et al., 2016; Poria et al., 2016; Xu et al., 2018) have made impressive progress in recent years. Several works integrate aspect extraction and opinion term extraction into a co-extraction process. Qiu et al. (2011) expand the lists of aspects and opinion terms via double propagation, a bootstrapping method. Some other works adopt the co-extraction structure in neural networks with multi-task learning (Wang et al., 2016, 2017; Li and Lam, 2017). However, the above methods ignore the relation between aspects and opinion terms, and only a few works focus on this field. Rule-based methods (Hu and Liu, 2004; Zhuang et al., 2006) select corresponding opinion terms with distance rules and syntactic rule templates based on dependency parsing trees; however, the performance of these methods heavily relies on expert knowledge, and the rules usually cover only a small number of cases. Fan et al. (2019) carry out TOWE by extracting the corresponding opinion terms for a given aspect, utilizing an Inward-Outward LSTM to generate implicit representations of aspects. Nevertheless, this approach cannot readily employ powerful pre-trained language models like BERT as the encoder. Our model aims to extract the corresponding opinion terms of the given aspect with explicit representations, and to further boost performance by employing BERT as the encoder.

Aspect-Opinion Pair Extraction
Aspect-Opinion Pair Extraction (AOPE) (Chen et al., 2020) and Pair-wise Aspect and Opinion Terms Extraction (PAOTE) both aim at extracting aspects and opinion terms in pairs. AOPE and PAOTE are essentially the same task with different names, and the task can be split into aspect extraction and TOWE. Chen et al. (2020) propose a Synchronous Double-channel Recurrent Network (SDRN) which consists of an opinion entity extraction unit, a relation detection unit, and a synchronization unit for pair extraction. The authors of SpanMlt develop a span-based multi-task learning framework in which terms are extracted under annotated span boundaries, so as to identify the relations between every two span combinations.
However, SDRN contains many hyper-parameters, and SpanMlt generates a great many candidate spans if the maximal span length is large or the sentence is long. The advantage of our methods is that only a small number of hyper-parameters need to be adjusted while similar or even better performance can be achieved.

Task Description
Given a sentence s = {w_1, w_2, ..., w_n} consisting of n words, an aspect (opinion target) a = {w_i, w_{i+1}, ..., w_{i+k}}, and an opinion term o = {w_j, w_{j+1}, ..., w_{j+m}} (a and o are substrings of s), the probability of target-oriented opinion terms is defined as p(o|s, a) in the TOWE task, and the probability of aspect-opinion pairs is defined as p(⟨a, o⟩|s) = p(a|s) × p(o|s, a) in the AOPE task. The BIO tagging scheme (Ramshaw and Marcus, 1995) and a special symbol "[SEP]" are applied to this task, where each word w_i in the sentence s is tagged as y_i ∈ {B, I, O, [SEP]}: B and I mark the beginning and inside of an opinion term, O marks other words, and [SEP] marks the special symbols that delimit the given aspect.
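For reference, the probabilities and the label set above can be restated compactly as follows; this is only a reformulation of what is already defined, with angle brackets denoting an aspect-opinion pair.

```latex
% TOWE: opinion terms conditioned on the sentence and a given aspect
p(o \mid s, a)

% AOPE: aspect-opinion pairs, factorized into aspect extraction and TOWE
p(\langle a, o \rangle \mid s) = p(a \mid s) \times p(o \mid s, a)

% Per-token labels under the BIO scheme with the special symbol [SEP]
y_i \in \{\mathrm{B}, \mathrm{I}, \mathrm{O}, \mathrm{[SEP]}\}, \qquad i = 1, \dots, n
```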

Framework
The structures of our Target-Specified sequence labeling method based on Multi-head Self-Attention (TSMSA) and the Multi-Task version (MT-TSMSA) are shown in Figure 1 (c) and (d). As aforementioned, we first use the special symbol "[SEP]" to label each aspect. Next, the multi-head self-attention mechanism is applied to capture the context representations of the specific aspect explicitly, which are then passed to a projection layer and a Conditional Random Field (CRF) (Lafferty et al., 2001) layer for sequence labeling. Furthermore, aspect and opinion term extraction (task 0) and target-oriented opinion term extraction (task 1) are combined for multi-task learning. These two tasks share the parameters of the encoder but differ in the projection and CRF layers.
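To make the preprocessing step concrete, the following minimal Python sketch shows how a sentence and a given aspect span could be converted into the target-specified input and the corresponding label sequence. The function name, the token-level input format, and the helper logic are assumptions made for illustration, not the released implementation.

```python
def build_towe_input(tokens, aspect_span, opinion_spans):
    """Wrap the given aspect with [SEP] and build BIO/[SEP] labels (illustrative sketch).

    tokens:        list of words, e.g. ["The", "food", "is", "delicious", "."]
    aspect_span:   (start, end) indices of the aspect, end exclusive, e.g. (1, 2)
    opinion_spans: list of (start, end) index pairs of its opinion terms, e.g. [(3, 4)]
    """
    a_start, a_end = aspect_span
    new_tokens, labels = [], []
    for i, tok in enumerate(tokens):
        if i == a_start:                      # open the aspect with the special symbol
            new_tokens.append("[SEP]")
            labels.append("[SEP]")
        tag = "O"                             # words outside opinion terms get O
        for o_start, o_end in opinion_spans:
            if i == o_start:
                tag = "B"                     # beginning of an opinion term
            elif o_start < i < o_end:
                tag = "I"                     # inside of an opinion term
        new_tokens.append(tok)
        labels.append(tag)
        if i == a_end - 1:                    # close the aspect with the special symbol
            new_tokens.append("[SEP]")
            labels.append("[SEP]")
    return new_tokens, labels


# "The [SEP] food [SEP] is delicious ."  ->  O [SEP] O [SEP] O B O
print(build_towe_input(["The", "food", "is", "delicious", "."], (1, 2), [(3, 4)]))
```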

Multi-Head Self-Attention
We describe the multi-head self-attention approach following Vaswani et al. (2017), with the details shown in Figure 1 (a) and (b). For each attention head, we first compute the scaled dot-product attention. The input consists of a set of queries, keys, and values, where $d_k$ is the dimension of queries and keys and $d_v$ is the dimension of values; they are packed together into matrices $Q$, $K$, and $V$, respectively. The scaled dot-product attention is calculated as follows:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V. \quad (1)$

Next, given the number of attention heads $h$, the dimension of each head's output is $d_k = d_v = d_{model}/h$. Finally, the multi-head attention is described as follows:

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}), \quad (2)$

where $Q, K, V \in \mathbb{R}^{n \times d_{model}}$ and $n$ is the sequence length. The parameter matrices of the projections are $W_i^{Q} \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^{K} \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^{V} \in \mathbb{R}^{d_{model} \times d_v}$, and $W^{O} \in \mathbb{R}^{h d_v \times d_{model}}$.
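The following numpy sketch mirrors Equations (1) and (2); the parameter shapes follow the notation above (with d_k = d_v = d_model / h), while the function names and the way the heads are batched are assumptions for illustration only.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, batched over heads."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)            # (h, n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                          # (h, n, d_v)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o):
    """X: (n, d_model); W_q, W_k: (h, d_model, d_k); W_v: (h, d_model, d_v); W_o: (h*d_v, d_model)."""
    Q = np.einsum('nd,hdk->hnk', X, W_q)                         # project queries per head
    K = np.einsum('nd,hdk->hnk', X, W_k)                         # project keys per head
    V = np.einsum('nd,hdv->hnv', X, W_v)                         # project values per head
    heads = scaled_dot_product_attention(Q, K, V)                # (h, n, d_v)
    concat = np.concatenate(list(heads), axis=-1)                # (n, h * d_v)
    return concat @ W_o                                          # (n, d_model)
```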

Target-Specified Encoder
To start with, the input vector of each word is generated by utilizing a word embedding lookup table $L_w \in \mathbb{R}^{r \times d_w}$ and a positional embedding lookup table $L_p \in \mathbb{R}^{n \times d_p}$, where $d_w$ is the dimension of word embeddings, $r$ is the vocabulary size, and $d_p$ is the dimension of positional embeddings. These embedding lookup tables map $s = \{w_1, ..., w_n\}$ to $\{e^1_w, ..., e^n_w\}$ and $\{e^1_p, ..., e^n_p\}$, respectively. For our base models (not using a pre-trained language model), $e^i_w$ is projected to a low-dimensional vector $e^i_{low}$ calculated as $e^i_{low} = \sigma(W_e e^i_w)$, where $W_e \in \mathbb{R}^{d_{low} \times d_w}$ ($d_{low} < d_w$) denotes the projection matrix and $\sigma(\cdot)$ is the activation function. The word embedding and the positional embedding of each position are then combined to form the input vectors $T = \{t_1, ..., t_n\}$. Then, the input $T$ is passed to the multi-head self-attention modules, where a feed-forward network and an add-norm network are applied in sequence to generate the context representation of each layer, $H = \{H_1, ..., H_l\}$, where $l$ is the number of multi-head self-attention layers, each layer uses $h$ attention heads, and layer normalization (Ba et al., 2016) is applied to the sequential representations. Finally, the output of the encoder is $H_l$, i.e., the representation produced by the last layer.
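Building on the attention sketch above, the layer stacking with feed-forward and add-norm sub-layers could be sketched as follows; the ReLU feed-forward and the exact sub-layer ordering are assumptions consistent with the standard Transformer encoder rather than a specification of our implementation.

```python
def layer_norm(X, eps=1e-6):
    """Layer normalization over the feature dimension (Ba et al., 2016)."""
    mean = X.mean(axis=-1, keepdims=True)
    std = X.std(axis=-1, keepdims=True)
    return (X - mean) / (std + eps)

def encoder_layer(X, attn_params, W_1, b_1, W_2, b_2):
    """One layer: multi-head self-attention, then a feed-forward network, each with add & norm."""
    A = multi_head_self_attention(X, *attn_params)    # self-attention sub-layer
    H = layer_norm(X + A)                             # add & norm
    F = np.maximum(0.0, H @ W_1 + b_1) @ W_2 + b_2    # position-wise feed-forward (ReLU assumed)
    return layer_norm(H + F)                          # add & norm -> H_i

def target_specified_encoder(T, layers):
    """Stack l layers over the input vectors T; the output of the last layer is H_l."""
    H = T
    for params in layers:                              # params = (attn_params, W_1, b_1, W_2, b_2)
        H = encoder_layer(H, *params)
    return H
```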

Decoder and Training
Given a sequential representation $H^l$ and a label sequence $Y = \{y_1, ..., y_n\}$ ($y_i \in$ {B, I, O, [SEP]} or $y_i \in$ {B-ASP, I-ASP, B-OP, I-OP, O}), we use $H^l$ to compute $p(Y|H^l)$. Greedy decoding or a CRF can be adopted in the decoding process. We choose a CRF as our decoding strategy because it captures the correlations between tokens and labels and the correlations between adjacent labels simultaneously. Given a new sentence, we use the Viterbi algorithm (Viterbi, 1967) to predict the label sequence by maximizing the conditional probability $p(Y|H^l)$ in the decoding process.
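As an illustration of the decoding step, a minimal numpy sketch of Viterbi decoding is given below; the emission matrix P and transition matrix Q follow the notation introduced in the next subsection, and the implementation details (e.g., the absence of start/stop transitions) are assumptions.

```python
import numpy as np

def viterbi_decode(P, Q):
    """Return the label sequence maximizing the CRF score.

    P: (n, k) token-label scores; Q: (k, k) transition scores, Q[prev, cur].
    """
    n, k = P.shape
    dp = np.zeros((n, k))                # best score of any path ending at (position, label)
    back = np.zeros((n, k), dtype=int)   # back-pointers
    dp[0] = P[0]
    for t in range(1, n):
        scores = dp[t - 1][:, None] + Q + P[t][None, :]   # (k_prev, k_cur)
        back[t] = scores.argmax(axis=0)
        dp[t] = scores.max(axis=0)
    path = [int(dp[-1].argmax())]
    for t in range(n - 1, 0, -1):        # follow back-pointers from the best final label
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```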

Single-Task Version
The single-task version of our approaches is TSMSA. Given a predicted label sequence $Y$ and a sequential representation $H^l$, the score function $S(H^l, Y)$ is defined as follows:

$S(H^l, Y) = \sum_{i=0}^{n} Q_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}, \qquad P = H^l W_p + b_p,$

where the matrix $Q \in \mathbb{R}^{k \times k}$ captures the relation of adjacent labels, the matrix $P \in \mathbb{R}^{n \times k}$ learns the relation of tokens and labels, and the matrices $W_p \in \mathbb{R}^{d_{model} \times k}$ and $b_p \in \mathbb{R}^{n \times k}$ indicate a projection operation from dimension $d_{model}$ to dimension $k$. Here, $k$ is the dimension of the label space. Then, the linear-chain CRF is exploited to calculate the conditional probability of the predicted sequence $Y$ as follows:

$p(Y \mid H^l) = \frac{\exp(S(H^l, Y))}{\sum_{\tilde{Y} \in Y_{all}} \exp(S(H^l, \tilde{Y}))},$

where $Y_{all}$ denotes the set of all possible label sequences. So the loss of a sentence can be calculated by the negative log-likelihood as follows:

$\mathcal{L} = -\log p(Y \mid H^l).$
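For concreteness, the negative log-likelihood above can be computed with the standard forward algorithm over the label lattice; the numpy sketch below is illustrative and omits explicit start/stop transitions, which a full implementation may include.

```python
import numpy as np
from scipy.special import logsumexp

def crf_negative_log_likelihood(P, Q, y):
    """-log p(Y | H^l) for a linear-chain CRF.

    P: (n, k) token-label scores (H^l W_p + b_p); Q: (k, k) transition scores; y: gold label ids.
    """
    n, k = P.shape
    # score of the gold label sequence
    gold = P[np.arange(n), y].sum() + sum(Q[y[i], y[i + 1]] for i in range(n - 1))
    # log partition function over all label sequences (forward algorithm)
    alpha = P[0]
    for t in range(1, n):
        alpha = logsumexp(alpha[:, None] + Q, axis=0) + P[t]
    log_Z = logsumexp(alpha)
    return log_Z - gold
```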

Multi-Task Version
By integrating aspect and opinion term extraction (task 0) and TOWE (task 1) into a multi-task architecture, we propose MT-TSMSA for AOPE. MT-TSMSA uses a sentence representation $H^l$ and a task $id \in \{0, 1\}$ to calculate the conditional probability $p(Y \mid H^l, id)$: when the task id equals 0, the task is aspect and opinion term extraction; for TOWE, the task id is 1. Some examples are shown in Figure 1 (d). To handle the different tasks, two score functions $S_0(H^l, Y^0)$ and $S_1(H^l, Y^1)$ are defined with separate parameter matrices, where $Y^0$ ($Y^0_i \in$ {B-ASP, I-ASP, B-OP, I-OP, O}) and $Y^1$ ($Y^1_i \in$ {B, I, O, [SEP]}) represent the label sequences of aspect and opinion term extraction and of TOWE, respectively. The conditional probabilities of the predicted sequences $Y^0$ and $Y^1$ are calculated as follows:

$p(Y^0 \mid H^l, id=0) = \frac{\exp(S_0(H^l, Y^0))}{\sum_{\tilde{Y} \in Y^0_{all}} \exp(S_0(H^l, \tilde{Y}))}, \qquad p(Y^1 \mid H^l, id=1) = \frac{\exp(S_1(H^l, Y^1))}{\sum_{\tilde{Y} \in Y^1_{all}} \exp(S_1(H^l, \tilde{Y}))},$

where $Y^0_{all}$ denotes the set of all possible label sequences of task 0 and $Y^1_{all}$ that of task 1. The loss of a sentence is also calculated by the negative log-likelihood:

$\mathcal{L}_0 = -\log p(Y^0 \mid H^l, id=0), \qquad \mathcal{L}_1 = -\log p(Y^1 \mid H^l, id=1).$

Given $M$ sentences $S = \{s_1, s_2, ..., s_M\}$ with task ids $id = \{id_1, ..., id_M\}$, we minimize the following loss for training:

$\mathcal{L} = \sum_{m=1}^{M} \left[ \mathbb{1}(id_m = 0)\,\mathcal{L}_0(s_m) + \lambda\, \mathbb{1}(id_m = 1)\,\mathcal{L}_1(s_m) \right],$

where $\mathbb{1}(\cdot)$ is the indicator function and $\lambda$ is the hyper-parameter used to balance these two tasks.
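A minimal sketch of how the per-sentence losses of the two tasks could be combined during training is shown below, reusing the CRF loss sketch from the previous subsection; the batch format and the placement of λ on task 1 are assumptions matching the loss written above.

```python
def multi_task_loss(batch, lam=1.0):
    """Combine CRF losses of task 0 (aspect/opinion extraction) and task 1 (TOWE).

    batch: iterable of (P, Q, y, task_id), where P and Q come from the task-specific
           projection and transition parameters; lam is the balance factor lambda.
    """
    total = 0.0
    for P, Q, y, task_id in batch:
        loss = crf_negative_log_likelihood(P, Q, y)
        total += loss if task_id == 0 else lam * loss   # lambda weights the TOWE task
    return total
```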

Inference Process
For TOWE, a sentence with a given aspect (i.e., target) is first processed into the target-specified form ("[SEP] Aspect [SEP]") with the special symbol "[SEP]" and then passed into TSMSA, whose outputs are the target-oriented opinion terms. For AOPE, MT-TSMSA generates aspect-opinion pairs through a two-stage inference process. Firstly, a sentence is passed into MT-TSMSA, where aspects are extracted by task 0. Secondly, given the extracted aspects and repeating the inference process of TOWE, MT-TSMSA outputs the target-oriented opinion terms from task 1. Accordingly, the combinations of aspects from task 0 and target-oriented opinion terms from task 1 form the aspect-opinion pairs.
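The two-stage inference for AOPE could be sketched as follows; model.predict and spans_from_labels are hypothetical names introduced only for illustration, and build_towe_input refers to the preprocessing sketch given earlier.

```python
def spans_from_labels(labels, begin, inside):
    """Collect (start, end) spans whose labels form a begin/inside chunk (end exclusive)."""
    spans, start = [], None
    for i, tag in enumerate(labels):
        if tag == begin:
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag != inside and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans

def extract_pairs(tokens, model):
    """Two-stage AOPE inference with MT-TSMSA (hypothetical model interface).

    model.predict(tokens, task_id) is assumed to return one label per token.
    """
    # Stage 1: task 0 extracts aspects (B-ASP/I-ASP chunks) from the raw sentence.
    aspect_spans = spans_from_labels(model.predict(tokens, task_id=0), "B-ASP", "I-ASP")

    pairs = []
    # Stage 2: for each extracted aspect, run task 1 (TOWE) on the target-specified input.
    for a_start, a_end in aspect_spans:
        towe_tokens, _ = build_towe_input(tokens, (a_start, a_end), opinion_spans=[])
        for o_start, o_end in spans_from_labels(model.predict(towe_tokens, task_id=1), "B", "I"):
            pairs.append((" ".join(tokens[a_start:a_end]),
                          " ".join(towe_tokens[o_start:o_end])))
    return pairs
```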

Datasets
To evaluate the performance of our model, we conduct experiments on two public datasets from the laptop and restaurant domains, which were respectively built by Fan et al. (2019) and Chen et al. (2020) based on the SemEval challenges (Pontiki et al., 2014, 2015, 2016). For the first dataset, every sentence was annotated by two people, and the conflicts were checked and eliminated manually. The second dataset was developed by extending the first one. The statistics of these benchmark datasets are shown in Table 2, from which we can observe that the second dataset includes many negative samples for AOPE (i.e., sentences that contain aspects and opinion terms but no aspect-opinion pairs). Note that these negative samples are also considered when testing our model on AOPE.

Baselines
Fan et al. (2019) have employed various baselines for TOWE, including Distance-rule (Hu and Liu, 2004), Dependency-rule (Zhuang et al., 2006), BiLSTM + Distance-rule, and TC-BiLSTM, none of which are BERT-based. For a comprehensive comparison, we develop BERT + Distance-rule and Target-fused BERT (TF-BERT) as additional baselines for this task. The former trains a sentence-level opinion term extraction model with BERT, and the target-oriented opinion term is the one nearest to each aspect. The latter utilizes the average pooling of target word embeddings to represent the target information; the word representation at each position is the addition of the word embedding and the target information, which is fed into BERT to extract target-oriented opinion terms. Several baselines have also been applied to AOPE, including HAST (Li et al., 2018) + IOG and JERE-MHS (Bekoulis et al., 2018). Besides the above methods, we also employ the following baselines:
• IOG (Fan et al., 2019) utilizes an Inward-Outward LSTM and a Global LSTM to capture the information of aspects and the global information respectively, and then combines this information for sequence labeling.
• SpanMlt is a span-based multi-task learning framework where terms are extracted with annotated span boundaries and then the relations between combinations of every two spans are identified.
• SDRN (Chen et al., 2020) utilizes BERT as the encoder and consists of an opinion entity extraction unit, a relation detection unit, and a synchronization unit for the AOPE task. In the case of TOWE, this model extracts the target-oriented opinion terms given the correct aspects.

Hyper-parameter Settings
For the TOWE task, Fan et al. (2019) utilize 300-dimensional GloVe (Pennington et al., 2014) vectors, pre-trained on 840 billion tokens of unlabeled data, to initialize the word embedding vectors in IOG, and the word embeddings are kept fixed during training. For a fair comparison, we use the same fixed word embeddings in TSMSA(Base). We randomly select 20% of the training set as the development set for adjusting all hyper-parameters.
The value of d_model is 128, and the numbers of attention heads and layers are 4 and 6, respectively. In addition, the dropout rate, learning rate, and maximal sequence length are set to 0.5, 0.001, and 100, respectively. The Adam optimizer (Kingma and Ba, 2015) is adopted to optimize our model. Pre-trained language models like BERT (Devlin et al., 2019) can be applied in our methods, and we adopt the BERT-base model, where d_model is 768 and the numbers of attention heads and layers are both 12.
Other hyper-parameters include the learning rates of BERT and the CRF, the maximal sequence length, and the number of epochs. Based on the development set, these hyper-parameters are set to 5e-5, 2e-4, 100, and 8, respectively. Unless otherwise mentioned, λ is set to 1. To be consistent with the various baselines (Fan et al., 2019; Chen et al., 2020), the term-level F1 score is used as the evaluation metric for both the TOWE and AOPE tasks. Term-level means that the boundaries of the predicted span must be the same as those of the ground truth. For the AOPE task, a prediction is correct only if the predicted aspect-opinion pair is consistent with the labeled pair.

On the TOWE task, the rule-based methods perform poorly because the rules only cover a small number of cases. By utilizing BiLSTM or BERT as the encoder to extract opinion terms, BiLSTM/BERT + Distance-rule perform much better than the other rule-based methods; however, these methods cannot deal with the one-to-many case. Secondly, TC-BiLSTM and TF-BERT extract static word embeddings for aspects and then incorporate them into the sentence representation by concatenation or addition. Nevertheless, the results of TC-BiLSTM and TF-BERT are still over 10% lower than those of IOG/TSMSA(Base) and SDRN/TSMSA(BERT), respectively. This reveals that a static word embedding is not a good representation of the aspect and that the concatenation/addition operation is not sufficient to represent the specific aspect. Finally, IOG is a state-of-the-art baseline method for TOWE, and the performance of TSMSA(Base) trained with the same word embeddings is similar to that of IOG, which indicates the effectiveness of capturing the representation of a specific aspect with the symbol "[SEP]". Furthermore, the pre-trained language model BERT can be applied to our basic method: the F1 score of TSMSA(BERT) is on average 8% higher than those of TSMSA(Base) and IOG. SDRN, which also exploits BERT as the encoder, passes the information of the aspect through a synchronization unit and utilizes supervised self-attention to capture this information. Nevertheless, it represents the specific aspect implicitly, which might have a negative impact on capturing the information of targets. On average, the performance of SDRN is 2% lower than that of TSMSA(BERT). The overall results reveal that our proposed method achieves state-of-the-art performance on TOWE.

Aspect-Opinion Pair Extraction
As mentioned above, our method can be applied to AOPE by combining TOWE with aspect and opinion term extraction. We here compare the performance of our multi-task model (i.e., MT-TSMSA) with the following competitive models: HAST + IOG, JERE-MHS, SpanMlt, and SDRN. The results are shown in Table 4. Note that the overlapping ratios of pairs in 14lap, 14res, and 15res are 78.8%, 92%, and 99.8% for (Fan et al., 2019), and 87.1%, 86.2%, and 86.4% for (Chen et al., 2020), respectively. Thus, there is a difference (within 2% mostly) between the results on these two datasets.
Table 4: F1 scores (%) of aspect-opinion pair extraction. The results with '*' are reproduced by us, and the others are those released in the SpanMlt paper and by Chen et al. (2020). Best results are marked in bold.
The performance of JERE-MHS is better than that of HAST + IOG, which indicates that the degree of error propagation in the separately trained model might be smaller than that in the jointly trained model. Moreover, SpanMlt, SDRN, and MT-TSMSA(BERT) use powerful pre-trained language models, which bring a significant improvement in performance on AOPE. We observe that SDRN and MT-TSMSA(BERT) perform better than SpanMlt, showing that selecting the top k spans from the candidate spans as pairs might miss some correct pairs. Compared to SDRN, MT-TSMSA(BERT) performs better on three datasets and nearly the same on four datasets. Overall, MT-TSMSA achieves quite competitive performance on AOPE by simply incorporating our TSMSA into a multi-task structure.

Ablation Experiments
To evaluate the impacts of different word embeddings and training strategies on our models, we conduct ablation experiments by varying these factors. The results shown in Table 5 indicate that a suitable word embedding is capable of improving the performance of our models. Firstly, the BERT embedding shows poor performance compared to GloVe; we conjecture that the BERT embedding needs to cooperate with the pre-trained encoder of BERT to perform well on TOWE. Secondly, applying the word embedding and the encoder of BERT without fine-tuning also fails to work on TOWE. The reason may be that the encoder of BERT without fine-tuning cannot capture the information of the specific aspect with the symbol "[SEP]". Furthermore, opinion terms extracted in task 0 help to identify the corresponding opinion terms in task 1, which means that the multi-task structure is able to achieve better results than the single-task structure on TOWE. Although the improvement is not significant on average, we observe that the former structure achieves more stable performance than the latter.

Convergence and Sensitivity Studies
The results of the convergence and sensitivity studies are shown in Figure 2. Figure 2 (a) reveals that our model gradually converges as the number of epochs increases; although the dropout rate is set to 0.5, it still converges smoothly. Figure 2 (b) shows the effect of the number of attention heads: with 4 attention heads, TSMSA(Base) achieves stable and good performance, and as the value increases, the performance may improve further. Figure 2 (c) shows that the best performance is achieved with 6 multi-head self-attention layers, and as the number increases further, the model may suffer from overfitting. Figure 2 (d) indicates the impact of λ, which balances the learning of the different tasks. Stable and good results are obtained when λ = 1, and better performance can be achieved when the value is set to 0.5 or 2. Compared with the other hyper-parameters, the results also indicate that λ has a relatively small impact on model performance.

Visualization of Attention
In this part, we apply an open-source tool to visualize the attention scores of TSMSA(BERT) and show two attention heads of the tenth layer in Figure 3 (a) and (b), where attention scores less than 0.1 and unimportant words are not displayed. As we can see, the words "nice" and "great" are both close to the aspect "food", but "nice" does not pay attention to this aspect. In addition, "great" and "reasonable" focus on the special symbol "[SEP]" and the specific aspect "food", as shown in Figure 3 (a). At the same time, "food" attends to "great" and "reasonable" on different attention heads, as described in Figure 3 (b). All these instances reveal that the multi-head self-attention mechanism is capable of capturing the representation of a specific aspect.

Figure 3: Visualization of the multi-head self-attention mechanism. A line indicates that a word from the bottom sentence pays close attention to the corresponding word from the top sentence.

Case Study
To further compare our MT-TSMSA(BERT) with the best-performing baseline, SDRN, we conduct a case study following Chen et al. (2020). As shown in Table 6, both SDRN and MT-TSMSA(BERT) perform well in extracting aspect-opinion pairs from complicated relations. However, in some cases such as Case 4, SDRN misses the pair of (watching videos, hot). The reason may be that the many hyper-parameters in SDRN have a great impact on its performance; for example, the threshold β in the relation synchronization mechanism of SDRN largely affects the results of the model. On the other hand, our method can extract all the pairs because it introduces fewer hyper-parameters, which leads to stable results. However, in Case 5, our method cannot extract the pair. The reason is that task 0 of MT-TSMSA(BERT) fails to extract the aspect term "log into the system". The deeper reason is that, for the aspect term extraction task, the performance of SDRN (i.e., 83.67%, 89.49%, and 74.05%) is better than that of MT-TSMSA(BERT) (i.e., 83.11%, 84.85%, and 72.69%) on the datasets from Chen et al. (2020).

Conclusions
In this paper, we propose a target-specified sequence labeling method based on multi-head self-attention (TSMSA) and a multi-task version (MT-TSMSA) to deal with TOWE and AOPE, respectively. In our methods, the encoder is capable of capturing the information of the specific aspect, which is labeled by the special symbol "[SEP]". Experimental results demonstrate that TSMSA and MT-TSMSA achieve quite competitive performance in most cases. When combining aspect and opinion term extraction with TOWE, our MT-TSMSA can slightly improve the performance compared with TSMSA. In the future, we plan to extend our approaches to sentiment classification of pairs and to explore an efficient model with a one-stage inference process to reduce the time complexity of AOPE.