Don’t Eclipse Your Arts Due to Small Discrepancies: Boundary Repositioning with a Pointer Network for Aspect Extraction

Current aspect extraction methods suffer from boundary errors. In general, these errors lead to only a minor difference between the extracted aspects and the ground truth; however, they hurt performance severely. In this paper, we propose to utilize a pointer network for repositioning the boundaries. A recycling mechanism is used, which enables the training data to be collected without manual intervention. We conduct experiments on the benchmark datasets SE14 of laptop and SE14-16 of restaurant. Experimental results show that our method achieves substantial improvements over the baseline and outperforms state-of-the-art methods.


Introduction
Aspect extraction (Hu and Liu, 2004) is a crucial task in real-world aspect-oriented sentiment analysis, where an aspect is a sequence of tokens attached to a specific sentiment word, in general serving as the target on which people express their views. For example, the token sequence "twist on pizza" is the aspect of the opinion "healthy" in 1). In this paper, we concentrate on aspect extraction without awareness of sentiment words.

1) Their twist on pizza is healthy.
   Ground-truth: twist on pizza
   Predicted: [BOUND] pizza [BOUND]

2) Buy the separate RAM memory and you will have a rocket.
   Ground-truth: RAM memory
   Predicted: [BOUND] separate RAM memory [BOUND]

What is undoubtedly true is that the existing neural aspect extraction methods (Section 5.3) have achieved remarkable success. The peak performance on the benchmark datasets, to our best knowledge, is up to 85.61% F1-score (Li et al., 2018). We suggest that further improvements can be made by fine-tuning the boundaries of the extracted aspects, because some incorrectly extracted aspects result from minor boundary errors, where the boundaries refer to the start and end positions of a token sequence. For example, reinstating the omitted words "twist on" in 1) and trimming the redundant word "separate" in 2) by changing the start positions recovers the correct aspects.
We propose to utilize a pointer network for repositioning the boundaries (Section 2). The pointer network is trained separately and is only used to post-process the aspects output by a given extractor (Section 3). Supervised learning is a prerequisite for obtaining a well-trained pointer network; however, there is so far a lack of boundary-misspecified negative examples with which to construct the training set. Instead of manually labeling negative examples, we recycle those that occur while the extractor is being trained (Section 4). Our contributions in this paper are as follows:
• By means of a pointer network, we refine the boundary-misspecified aspects.
• The separately-trained pointer network serves as a post-processor and therefore can be easily coupled with different aspect extractors.
• The use of recycling mechanism facilitates the process of constructing the training set.

Pointer Network Based Boundary Repositioning
We train a pointer network to predict the start and end positions of the correct aspect. The input to the network comprises a candidate aspect and the sentence which contains the candidate (herein called the source sentence). The candidate may be a boundary-misspecified aspect, a truly-correct aspect or another text span. The network outputs two words w^s and w^e, one of which is predicted to be the start position, the other the end:

w^s = arg max_{w_i ∈ U} P_s(w_i),  w^e = arg max_{w_i ∈ U} P_e(w_i)

where P_s(w_i) and P_e(w_i) denote the probabilities that a word serves as the start or end position, and arg max selects the most probable word. The text span which lies between the start and end positions w^s and w^e is eventually selected as the boundary-repositioned aspect. It is noteworthy that, during testing, the status (boundary-misspecified, truly-correct or other) of the candidate aspect is assumed to be unknown. This reflects the practical situation in which the status of a pre-extracted aspect is unforeseeable.
Encoding: Assume C = {w^c_1, ..., w^c_n} represents the candidate aspect, where w^c_i ∈ R^l stands for the combination of the word, position and segment embeddings of the i-th token in C. The source sentence is represented in the same way and denoted by U = {w^u_1, ..., w^u_m}. We concatenate C and U to construct the input representation:

W_{C⊕U} = {CLS, w^c_1, ..., w^c_n, SEP, w^u_1, ..., w^u_m, SEP}

where CLS denotes the embedding of a dummy variable and SEP is that of a separator (Devlin et al., 2019). In our experiments, WordPiece embeddings are used, which can be obtained from the lookup table of Wu et al. (2016). The embeddings of position, segment, separator and dummy variable are initialized randomly. We encode each element w_i in the input representation W_{C⊕U} by fine-tuning BERT (Devlin et al., 2019): h_i = BERT(w_i).

Decoding: Due to the use of the multi-head self-attention mechanism (Vaswani et al., 2017), BERT is able to perceive and more heavily weight the attentive words in the source sentence U according to the information in the candidate aspect C, and vice versa. This property allows the attention-worthy words outside C to be salvaged and, meanwhile, enables the attention-unworthy words in C to be laid aside. On the other hand, a trainable decoder tends to learn the consistency between the ground-truth aspect and the attentive words. Therefore, we suppose that the decoder is able to leave the boundaries of C unchanged if C aligns with the ground-truth aspect, and otherwise redefines the boundaries in U in terms of the attentive words.
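The concatenated input described above can be sketched in a few lines. This is a minimal illustration of the [CLS] C [SEP] U [SEP] layout with BERT-style segment ids; the function name and the embedding-free token representation are our own simplifications, not the paper's implementation:

```python
def build_input(candidate_tokens, sentence_tokens):
    """Build the [CLS] C [SEP] U [SEP] token sequence and segment ids
    (0 for the candidate side, 1 for the source-sentence side).
    Embedding lookup and position ids are omitted for brevity."""
    tokens = ["[CLS]"] + candidate_tokens + ["[SEP]"] + sentence_tokens + ["[SEP]"]
    segments = [0] * (len(candidate_tokens) + 2) + [1] * (len(sentence_tokens) + 1)
    return tokens, segments
```

For example, pairing the candidate "pizza" with the sentence of 1) yields a 13-token sequence for the candidate "twist on pizza", with the two sides distinguished purely by the segment ids.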
Following the practice in prior research (Vinyals et al., 2015), we decode the representation h_i with a linear layer and the softmax function, where W ∈ R^{2×l} and b ∈ R^2 are trainable parameters:

[P_s(w_i); P_e(w_i)] = softmax(W h_i + b)    (3)

Training: Our goal is to assign higher probabilities to the start and end positions ŵ^s and ŵ^e of all the ground-truth aspects in the training set. Therefore, we measure loss as the average negative log-likelihood over all pairs of ŵ^s and ŵ^e:

L_B = -(1/N_B) Σ [log P_s(ŵ^s) + log P_e(ŵ^e)]    (4)

where N_B is the number of ground-truth aspects. During training, we obtain the parameters W and b in equation (3) by minimizing the loss L_B.
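At inference time, boundary repositioning reduces to two arg max operations over the per-token start/end probabilities. A minimal sketch (the swap guard for the rare case where the predicted end precedes the start is our own assumption, not a detail specified here):

```python
def reposition(start_probs, end_probs):
    """Select the boundary-repositioned span by taking the arg max of the
    per-token start and end probabilities over the source sentence.
    Returns (start, end) token indices."""
    s = max(range(len(start_probs)), key=start_probs.__getitem__)
    e = max(range(len(end_probs)), key=end_probs.__getitem__)
    if s > e:  # swap guard: illustrative assumption, not part of the model
        s, e = e, s
    return s, e
```

Given a candidate such as "[BOUND] pizza [BOUND]" in 1), a well-trained network would put the start mass on "twist" and the end mass on "pizza", so the selected span recovers the full ground-truth aspect.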

BiLSTM-CRF based Pre-Extraction
We use the pointer network to post-process the pre-extracted aspects (referred to as the candidate aspects in Section 2). In our experiments, we employ a BiLSTM-CRF model to obtain the candidate aspects.
In this case, we treat aspect pre-extraction as a sequence labeling task. The BIO labeling space y = {B, I, O} (Xu et al., 2018) is specified as the output for each token in the source sentence, in which B, I and O respectively signal the beginning of an aspect, the inside of an aspect and a non-aspect word.
First of all, we represent the tokens in the source sentence using GloVe embeddings (Pennington et al., 2014). On this basis, we use a bidirectional recurrent neural network with Long Short-Term Memory (BiLSTM for short) (Liu et al., 2015) to encode each token, so as to obtain the initial hidden state vector h^lstm_i. The self-attention mechanism (Vaswani et al., 2017) is utilized to resolve long-distance dependencies, by which we obtain the attention-weighted hidden state h^att_i. We concatenate h^lstm_i and h^att_i to produce the final feature vector for the i-th token: ĥ_i = h^lstm_i ⊕ h^att_i. Conditioned on the feature vector ĥ_i emitted by the BiLSTM with attention, we estimate the emission probabilities that the i-th token serves as B, I or O. A fully-connected dense layer is used to map ĥ_i to the BIO labeling space: p_i(BIO) = f_den(ĥ_i). Over the emission probabilities of all the tokens in the source sentence, we utilize a linear-chain Conditional Random Field (CRF) to predict the optimal BIO label sequence. Eventually, the tokens labeled with B and I are taken as the aspects.
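Reading aspects off the predicted BIO sequence is a small deterministic step. A sketch (function name ours), where a span is a pair of start/end token indices:

```python
def bio_to_spans(labels):
    """Convert a BIO label sequence into aspect spans (start, end), inclusive.
    A "B" opens a span, "I" extends it, "O" closes it; a stray "I" with no
    preceding "B" is ignored."""
    spans, start = [], None
    for i, lab in enumerate(labels):
        if lab == "B":
            if start is not None:          # close the previous span
                spans.append((start, i - 1))
            start = i
        elif lab == "O" and start is not None:
            spans.append((start, i - 1))
            start = None
    if start is not None:                  # span runs to the end of sentence
        spans.append((start, len(labels) - 1))
    return spans
```

For example, the label sequence O B I O B yields two aspects: the two-token span (1, 2) and the single-token span (4, 4).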
We train the extractor by maximizing the log-likelihood of sequence labeling (Luo et al., 2019):

L_E = (1/N_E) Σ_{i=1}^{N_E} log P(y_i | ĥ_i; Ŵ, b̂)

where N_E denotes the number of tokens in the training set, Ŵ is a trainable parameter which plays the role of the transition matrix in the CRF, and b̂ is the bias.

Recycling Mechanism
The extractor can be trained on the benchmark datasets provided by the SemEval tasks (Pontiki et al., 2016). However, it is impractical to separately train the positioner because there is a lack of boundary-misspecified negative examples. To solve the problem, we recycle the negative examples occurring during the training of the extractor. We define a negative example to be a text span which partially overlaps with the ground-truth aspect. The text spans which are completely inconsistent with the ground-truth are not considered. For example, "Fresh ingrediants" in 3) is an eligible negative example, but "super tasty" is ineligible.

3) Fresh ingrediants and super tasty.
   Ground-truth: ingrediants
   Eligible: Fresh ingrediants
   Ineligible: super tasty

We maintain a table that maps each ground-truth aspect to a list of negative examples. We initialize the mapping table by taking the ground-truth aspects as entries and assigning an empty list to each of them.
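Populating the mapping table during extractor training amounts to a few lines of bookkeeping. A sketch under the partial-overlap criterion above, with spans as (start, end) token indices; the function names are illustrative:

```python
def partially_overlaps(pred, gold):
    """True if the predicted span overlaps the ground-truth span without
    matching it exactly (the eligibility criterion for negative examples)."""
    (ps, pe), (gs, ge) = pred, gold
    return ps <= ge and gs <= pe and pred != gold

def recycle(table, gold_spans, predicted_spans):
    """Append each eligible negative example to its ground-truth entry,
    skipping duplicates already recorded in the table."""
    for gold in gold_spans:
        for pred in predicted_spans:
            if partially_overlaps(pred, gold) and pred not in table[gold]:
                table[gold].append(pred)
    return table
```

In example 3), "Fresh ingrediants" overlaps the ground truth "ingrediants" without matching it and is therefore recorded, whereas the disjoint "super tasty" and any exact match are both ignored.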

Datasets
We evaluate the proposed methods on the laptop and restaurant datasets provided by the SemEval 2014-2016 aspect-based sentiment analysis tasks (SE14-16 for short) (Pontiki et al., 2014, 2015, 2016). For comparison purposes, we follow previous work in randomly selecting 20% of the official training data to form the validation set.

Hyperparameter Settings
For the aspect pre-extraction model, we initialize all word embeddings with 100-dimensional GloVe embeddings (Pennington et al., 2014). Each BiLSTM unit has 100 dimensions, and the number of hidden states in the self-attention layer is set to 200. We apply dropout to the output layer of the BiLSTM (i.e., the penultimate layer) with a dropout rate of 0.5. The learning rate for parameter updating is set to 1e-3.
For the boundary repositioning model, we employ BERT-base (Devlin et al., 2019) as the encoder, which contains 12 transformer encoding blocks. Each block holds 768 hidden units and 12 self-attention heads. During training, the maximum length of the input sequence is set to 180 and the batch size to 10. The learning rate is set to 3e-5 and the number of training epochs to 5.

Compared Models
We compare with the state-of-the-art models. Taking the learning framework as the criterion, we divide the models into two classes.

Single-task Learning: In the family of aspect-oriented single-task learning, the traditional CRF is the earliest method, based on feature engineering. Building on it, IHS RD (Chernyshevich, 2014) additionally utilizes part-of-speech and named entity features. NLANGP (Toh and Su, 2016) is another model in this family.

Multi-task Learning: For aspect-oriented multi-task learning, Li and Lam (2017) design a triple-LSTM model (MIN) to share the features generated for the extraction and classification tasks. CMLA (Wang et al., 2017) uses a multi-layer attention mechanism for the joint extraction of aspect terms and sentiment words. HAST (Li et al., 2018) strengthens the joint model using a truncated history-attention and selective transformation network. RINANTE (Dai and Song, 2019) shares features in the bottom layer of BiLSTM-CRF and uses distant supervision to expand the training data. Similar to RINANTE, our aspect pre-extraction model (Baseline) is based on BiLSTM-CRF. However, we force it to work in the single-task learning framework. More importantly, instead of distant supervision, we use a recycling mechanism to acquire local boundary-misspecified examples, and instead of retraining BiLSTM-CRF, we only reposition the boundaries of the resultant aspects.

Main Results
We show the performance difference over the test sets in Table 2. It can be observed that the single-task BiLSTM-CRF based extractor either achieves a performance comparable to some of the current state-of-the-art methods or performs worse than others. Nevertheless, refining the pre-extracted aspects by boundary repositioning yields substantial improvements and achieves the best performance. Figure 1 provides further insight into the test results. It shows that 41% of the boundary-misspecified aspects on average can be successfully salvaged. In contrast, on average only 1.7% of the correctly-extracted aspects are misjudged. Besides, few completely erroneous extraction results can be rectified.

Adaptation to BERT
In a separate experiment, we examine the adaptation performance of boundary repositioning. The original pre-extraction model is replaced by fine-tuned BERT and by a more sophisticated model. The former is coupled with a dense layer and a softmax layer; the latter is constructed by coupling fine-tuned BERT with the BiSELF-CRF network. Meanwhile, the set of negative examples recycled in the earlier experiment remains unchanged. Table 3 shows the test results. It can be observed that boundary repositioning still achieves considerable improvements in performance, which demonstrates its robust adaptation ability.

Statistical Significance
We follow Johnson (1999) in using sampling-based P-values to examine significance. Johnson (1999) suggests a P-value threshold of 0.05: a system achieves significant improvements over others only if the P-value is less than 0.05, and the improvement is otherwise insignificant. Besides, it has been shown that the smaller the P-value, the higher the significance (Dror et al., 2018). We form updated versions of BiSELF-CRF and DE-CNN by coupling them with boundary repositioning. On this basis, we compute P-values by comparing the extraction results of the two models with those of the updated versions. Table 5 shows the P-values. It can be observed that the P-values are much lower than the threshold, which demonstrates that boundary repositioning produces significant improvements.
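A sampling-based P-value of this kind can be sketched as a paired bootstrap over per-example score differences. This is a generic illustration of the technique, not the paper's exact procedure; the resample count, seed and function name are our own choices:

```python
import random

def sampling_based_p(deltas, n_resamples=10000, seed=13):
    """Sampling-based P-value: the fraction of bootstrap resamples in which
    the updated system fails to outperform the original (summed delta <= 0).
    `deltas` holds per-example score differences (updated minus original)."""
    rng = random.Random(seed)
    n = len(deltas)
    failures = 0
    for _ in range(n_resamples):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:
            failures += 1
    return failures / n_resamples
```

When the updated system wins on nearly every example, almost no resample sums to zero or below, and the P-value falls far under the 0.05 threshold, matching the pattern reported in Table 5.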
In brief, we show that boundary repositioning can be used as a reliable post-processing method for aspect extraction. The source code of boundary repositioning to reproduce the above experiments has been made publicly available. We submit the source code and instructions along with this paper.

Conclusion
Our experimental results demonstrate that boundary repositioning can be used as a simple and robust post-processing method to improve aspect extraction. Our findings also reveal that illustrative aspects in scientific literature are generally long-winded, and extracting such aspects suffers more severely from boundary errors. In the future, we will develop a syntax-based multi-scale graph convolutional network to deal with both short and long aspects.