Enhancing Aspect Term Extraction with Soft Prototypes

Aspect term extraction (ATE) aims to extract aspect terms from a review sentence that users have expressed opinions on. Existing studies mostly focus on designing neural sequence taggers to extract linguistic features at the token level. However, since aspect terms and context words usually exhibit long-tail distributions, these taggers often converge to an inferior state without enough sample exposure. In this paper, we propose to tackle this problem by correlating words with each other through soft prototypes. These prototypes, generated by a soft retrieval process, can introduce global knowledge from internal or external data and serve as supporting evidence for discovering the aspect terms. Our proposed model is a general framework that can be combined with almost all sequence taggers. Experiments on four SemEval datasets show that our model boosts the performance of three typical ATE methods by a large margin.


Introduction
Aspect term extraction (ATE) is a fundamental subtask in aspect-based sentiment analysis. Given a review sentence, ATE aims to extract all aspect terms that users have expressed opinions on. For example, from the review "The Bombay style bhelpuri is very palatable.", ATE aims to extract "bhelpuri".
ATE has been widely studied in the last twenty years. Early research was devoted to designing rule-based (Popescu and Etzioni, 2005) and feature-engineering-based (Li et al., 2010) methods. With the development of deep learning techniques, recent research mostly regards ATE as a sequence labeling task and focuses on developing various types of neural models (Liu et al., 2015; Xu et al., 2018; Ma et al., 2019) to generate a tag sequence for the review. Though achieving impressive progress, these sequence taggers still face a serious challenge: they may converge to an inferior state due to the lack of samples for tail words. As shown in Figure 1, about 80% of aspect terms and context words (i.e., non-aspect terms) appear no more than five times in the commonly used SemEval datasets. Without enough sample exposure, neural models can hardly achieve optimal performance (He et al., 2018; Chen and Qian, 2019).
To tackle this challenge, correlating samples with each other may help. For example, if we correlate the rare aspect term "bhelpuri" with a frequent one like "food", there will be far more samples relevant to "bhelpuri" than before. The problem then becomes how to build such a token-level correlation. Retrieving synonyms is an intuitive approach, but it has two limitations. Firstly, synonyms only exist for a small number of words in the vocabulary, which makes the correlations incomplete. Though we can calculate the nearest neighbors of a certain word based on pre-trained word embeddings, it is not guaranteed that they have a similar semantic meaning. Secondly, in ATE, the existence of an aspect term depends on whether there are opinions on it. That is to say, we need to build a dynamic correlation for a certain word based on its entire context rather than the word itself. Indeed, if the retrieval is conducted based on an individual token, the above two limitations always exist.
In this paper, we propose a soft retrieval method to build the token-level correlation for both aspect terms and context words. Rather than conducting a hard retrieval for individual tokens, we retrieve the tokens' counterparts according to their contexts. As shown in Figure 2, after conducting the soft retrieval, we obtain a generated sample that strictly corresponds to the input sample at every position. We name the generated sample a "soft prototype" since it is actually a simplified prototype that builds a reference point for guiding the tagging process of the input sample.
We resort to language models (LMs) to implement the soft retrieval and generate high-quality soft prototypes. As a self-supervised task, language modeling needs no extra annotations and can absorb data-specific global knowledge. Moreover, LMs tend to generate frequent outputs, which exactly meets our need to correlate a rare word in the input sample with a frequent one in the soft prototype. Specifically, we first pre-train bi-directional LMs using the given training samples of the ATE datasets. Alternatively, we can take advantage of large-scale unlabeled data like Yelp and Amazon reviews to pre-train the LMs. Then, after fixing the pre-trained LMs, we can infer each token's prototype according to its contexts for both the training and testing samples.
We regard the generated soft prototypes as the supporting evidence for tagging aspect terms, and design a simple and effective gating mechanism to fuse the knowledge embedded in both samples before sending them to a sequence tagger. The soft prototypes can be combined with almost all existing sequence taggers. To demonstrate the effectiveness of our proposed model, we conduct experiments on four SemEval datasets by adding the generated soft prototypes to three existing sequence taggers. The results prove that our soft prototypes significantly boost the performance of their original counterparts.

Related Work
Aspect Term Extraction Early research on ATE mainly involves pre-defined rules (Hu and Liu, 2004; Popescu and Etzioni, 2005; Wu et al., 2009; Qiu et al., 2011) and hand-crafted features (Li et al., 2010; Liu et al., 2012, 2013). With the development of deep learning techniques, neural methods have become the mainstream. ATE can be viewed as either a supervised or an unsupervised task. For unsupervised ATE, the commonly used neural methods are based on topic models (He et al., 2017; Liao et al., 2019). For supervised ATE, researchers focus on developing various types of neural sequence taggers (Liu et al., 2015; Wang et al., 2016; Yin et al., 2016; Wang et al., 2017; Li and Lam, 2017; Xu et al., 2018; Ma et al., 2019). A recent trend is towards unified frameworks (Luo et al., 2019; He et al., 2019; Hu et al., 2019; Chen and Qian, 2020), where the interactive relations between ATE, opinion term extraction (OTE), and aspect-level sentiment classification (ASC) are exploited to enhance the overall performance. Xu et al. (2019) post-train BERT on domain-specific data to boost its sequence labeling performance. Li et al. (2020) propose to generate additional datasets for improving the performance of ATE.
In this paper, we focus on the supervised scenario. Different from the aforementioned supervised models, we develop a novel model to enhance ATE. By automatically generating and utilizing soft prototypes, we correlate samples with each other, which greatly enhances the learning process of sequence taggers. Moreover, the decoupling of soft prototypes from taggers makes our model flexible and general, i.e., it can be combined with almost all neural sequence taggers.

Prototypes in Neural Networks
The idea of prototypes (or templates) originates from information retrieval (IR) approaches for sentence matching tasks like response generation (Ji et al., 2014; Hu et al., 2014), which aim to retrieve a related sample from the dataset as the counterpart of the input sample. More recently, several studies have shed new light on this domain by deeply fusing prototypes with neural networks. Many of them use task-dependent metrics, common metrics such as Jaccard similarity (Gu et al., 2018; Cao et al., 2018), or existing tools like Lucene (Cao et al., 2018) to retrieve prototypes, and then input the prototypes into a neural model for generating outputs. Another line of work generates the prototype (the target words related to a source word in machine translation) using a pre-trained Seq2Seq model.
The approach of generating words via LMs is inspired by a recent study (Kobayashi, 2018). However, the method in Kobayashi (2018) is developed for text classification and is not suitable for the ATE task here. Concretely, their method randomly replaces a small percentage (typically 10%) of the original training words with generated ones and then discards the original words. This operation may work well for text classification tasks, which only require sentence-level information. For token-level tasks like ATE, however, the original words are necessary for tagging each token correctly. Moreover, the small percentage of replacement implies that the generated knowledge cannot be fully incorporated into the new sample. In contrast, we generate a prototype for each word in the sentence, and then deeply fuse the original word with its corresponding prototype to make good use of their embedded knowledge for ATE.
To the best of our knowledge, we are the first to introduce the retrieval method to handle the data deficiency problem in ATE. To this end, we propose a new approach to generate and utilize soft prototypes that can build the token-level correlation for aspect terms and context words.

Methodology
In this section, we first illustrate the overall framework for enhancing ATE with soft prototypes. We then detail the generation and utilization of soft prototypes. Lastly, we describe the objective function and the training procedure.

The Overall SoftProto Framework
Aspect term extraction (ATE) aims to extract aspect terms from a review sentence that users have expressed opinions on. Given a sentence S = {w_1, w_2, ..., w_n}, we formulate ATE as a sequence labeling task that aims to predict a tag sequence Y = {y_1, y_2, ..., y_n}, i.e., learning the mapping S → Y, where y ∈ {B, I, O} denotes the beginning of, inside of, and outside of an aspect term, respectively.
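As an illustration (our own sketch, not the paper's code), the running example can be encoded with BIO tags as follows; `extract_aspects` is a hypothetical helper that decodes aspect terms from a tag sequence:

```python
# BIO tagging for the running example: only "bhelpuri" is the annotated
# aspect term, so it receives B and everything else O. A multi-word
# aspect such as "Bombay style bhelpuri" would instead be tagged B I I.
S = ["The", "Bombay", "style", "bhelpuri", "is", "very", "palatable", "."]
Y = ["O",   "O",      "O",     "B",        "O",  "O",    "O",         "O"]

def extract_aspects(tokens, tags):
    """Recover aspect terms from a BIO tag sequence."""
    aspects, cur = [], []
    for w, t in zip(tokens, tags):
        if t == "B":                    # a new aspect term starts
            if cur:
                aspects.append(" ".join(cur))
            cur = [w]
        elif t == "I" and cur:          # continuation of the current term
            cur.append(w)
        else:                           # outside: flush any open term
            if cur:
                aspects.append(" ".join(cur))
            cur = []
    if cur:
        aspects.append(" ".join(cur))
    return aspects
```

Decoding `Y` against `S` yields the single aspect term "bhelpuri".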
To incorporate soft prototypes into ATE, we slightly modify the traditional learning process. Formally, rather than directly learning the mapping from S to Y, we additionally introduce a soft prototype P for each S and learn the new mapping [S, P] → Y. Given S, the soft prototype P is automatically generated by a soft retrieval mechanism and can serve as the supporting evidence to discover the aspect terms. As shown in Figure 3, we summarize the above processes into the SoftProto framework that mainly consists of three modules:
• A prototype generator conducts the soft retrieval process and generates the corresponding soft prototype P for S.
• A gating conditioner merges S's representation and P into the fused vectors F.
• A sequence tagger predicts the tag sequence Y based on F.
Next, we will illustrate each module in detail.

Prototype Generator
To efficiently implement the soft retrieval and generate high-quality soft prototypes, we resort to language models (LMs) to build a prototype generator. Specifically, we first pre-train two LMs, where LM→ and LM← are the forward and backward language models parameterized by θ→_LM and θ←_LM, respectively. Then we infer soft prototypes based on the pre-trained LMs.
One can use either the ATE training set or other unlabeled external data like Yelp reviews to pre-train the LMs, and we will examine the effects of these two types of data in the experiments. The details of pre-training LMs and inferring soft prototypes are as follows.
Pre-training Language Models As shown in Figure 4(a), given S, the forward LM→ computes the probability of S by modeling the probability of token w_i conditioned on the history (w_1, ..., w_{i-1}):

p(S) = \prod_{i=1}^{n} p(w_i | w_1, ..., w_{i-1}; θ→_LM)    (1)

In the pre-training process, LM→ tries to maximize the log likelihood of the forward direction:

L→ = \sum_{i=1}^{n} \log p(w_i | w_1, ..., w_{i-1}; θ→_LM)    (2)

[Figure 4(a): Pre-training a forward language model.]

Similarly, the backward LM← tries to maximize the log likelihood of the backward direction:

L← = \sum_{i=1}^{n} \log p(w_i | w_{i+1}, ..., w_n; θ←_LM)    (3)

After the pre-training process converges, we fix θ→_LM and θ←_LM, and infer a soft prototype P conditioned on S, θ→_LM, and θ←_LM for each sample in the training and testing sets of ATE.
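To make the two directional objectives concrete, here is a minimal sketch (our illustration, substituting toy maximum-likelihood bigram models for the transformer LMs actually used) of scoring a sentence under a forward and a backward model:

```python
import math

# A toy review corpus; the bigram models below stand in for the paper's
# transformer LMs and are for illustration only.
corpus = [["the", "food", "is", "great"], ["the", "service", "is", "great"]]

def bigram_counts(sents, reverse=False):
    """Collect bigram counts; reverse=True builds the backward model."""
    counts = {}
    for s in sents:
        seq = list(reversed(s)) if reverse else s
        for prev, cur in zip(["<s>"] + seq, seq):
            counts.setdefault(prev, {})
            counts[prev][cur] = counts[prev].get(cur, 0) + 1
    return counts

def log_likelihood(sent, counts, reverse=False):
    """Sum of log p(w_i | context), i.e. the objective in Eq. 2 / Eq. 3."""
    seq = list(reversed(sent)) if reverse else sent
    ll = 0.0
    for prev, cur in zip(["<s>"] + seq, seq):
        total = sum(counts[prev].values())
        ll += math.log(counts[prev][cur] / total)
    return ll

fwd = bigram_counts(corpus)                 # forward LM: p(w_i | history)
bwd = bigram_counts(corpus, reverse=True)   # backward LM: p(w_i | future)
ll_f = log_likelihood(["the", "food", "is", "great"], fwd)
ll_b = log_likelihood(["the", "food", "is", "great"], bwd, reverse=True)
```

Pre-training maximizes these log likelihoods over the corpus; at inference the fixed models supply per-position output distributions.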
Generating Soft Prototypes After obtaining θ→_LM and θ←_LM, we infer the soft prototype P. We take the forward LM→ as the example. As shown in Figure 4(b), to generate the forward prototype vector p→_i for word w_i, we feed the prefix {w_1, w_2, ..., w_{i-1}} to the fixed LM→ and collect the output probability distribution over the vocabulary. To suppress noise, we do not directly select the word o_i^1 with the largest output probability. Instead, we preserve the words {o_i^1, o_i^2, ..., o_i^K} with the K-largest output probabilities, and normalize their probabilities to sum to 1 as the weighted scores {s_i^1, s_i^2, ..., s_i^K}. We call the selected words "oracle words". Then we map these words with a pre-trained embedding lookup table E and obtain their word vectors {e_i^1, e_i^2, ..., e_i^K}. Finally, we aggregate the oracle words by their weighted scores to calculate p→_i for word w_i:

p→_i = \sum_{k=1}^{K} s_i^k e_i^k    (4)

Similarly, we can calculate the backward prototype vector p←_i. To consider the context information in both directions, we use the average of p→_i and p←_i as the final prototype vector p_i for word w_i. We then regard the set of prototype vectors {p_1, p_2, ..., p_n} as the soft prototype P for the sentence S. (Note that the testing ATE samples are not used for pre-training the LMs, so there is no data leakage in this process.)
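The aggregation step can be sketched as follows. This is a minimal illustration: the 3-dimensional embedding table `E`, its vectors, and the top-K probability lists are all made up, standing in for a real pre-trained lookup table and real LM outputs:

```python
# Toy embedding lookup table E (hypothetical 3-dim vectors).
E = {
    "food": [0.9, 0.1, 0.0],
    "dish": [0.8, 0.2, 0.1],
    "meal": [0.7, 0.3, 0.2],
}

def prototype_vector(topk):
    """Aggregate the K oracle words into one directional prototype vector.

    topk: list of (word, raw LM probability) pairs holding the K-largest
    output probabilities at the current position.
    """
    total = sum(p for _, p in topk)
    scores = [(w, p / total) for w, p in topk]   # normalize to sum 1
    dim = len(next(iter(E.values())))
    vec = [0.0] * dim
    for w, s in scores:                          # weighted sum of embeddings
        for d in range(dim):
            vec[d] += s * E[w][d]
    return vec

# Forward and backward LMs each yield a top-K distribution at position i;
# the final prototype p_i is the average of the two directional vectors.
fwd = prototype_vector([("food", 0.5), ("dish", 0.3), ("meal", 0.2)])
bwd = prototype_vector([("dish", 0.6), ("food", 0.4)])
p_i = [(a + b) / 2 for a, b in zip(fwd, bwd)]
```

With these toy numbers the prototype lands near the embedding of "food", a frequent word, which is exactly the intended correlation for a rare input token.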

Gating Conditioner
To better discover the aspect terms, we need to leverage the supporting evidence embedded in the soft prototype P. Intuitively, there are two schemes to incorporate the soft prototypes into ATE: inside or outside the sequence tagger. We choose the latter because we want to decouple the soft prototypes from the sequence taggers, so that the prototypes are suitable for all types of taggers. Hence, we introduce an additional upstream module, named the gating conditioner, to fuse the soft prototype P with the original sentence S.
The soft prototype P provides two kinds of information: (1) P itself embeds data-specific knowledge that can serve as supporting evidence; (2) P also helps to refine the original representation of S. Accordingly, the gating conditioner is developed to conduct two types of operations on P. We first map S = {w_1, w_2, ..., w_n} with the pre-trained embedding lookup table E and obtain the corresponding word vectors X = {x_1, x_2, ..., x_n}. Then, we conduct two types of operations on X and P to obtain the fused vectors F:

F = σ(W(X ⊕ P) + b) ⊙ (X ⊕ P)    (5)

where σ is the Sigmoid function, W and b are trainable parameters, and ⊕ and ⊙ denote the concatenation and element-wise multiplication operations, respectively. In Eq. 5, the concatenation of P and X makes the representation more discriminative than before. Moreover, the gating mechanism helps select the important dimensions and further refines the representation. The generated fused vectors F = {f_1, f_2, ..., f_n} then act as the enhanced representation for S = {w_1, w_2, ..., w_n}.
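The fusion can be sketched in plain Python. This is our own minimal reading of the gating conditioner, not the authors' implementation: we assume a sigmoid gate over the concatenation, applied element-wise back to that concatenation, and the toy vectors and parameters are made up:

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gating_conditioner(x_i, p_i, W, b):
    """Concatenate x_i and p_i, then gate each dimension of the
    concatenation with a sigmoid so the tagger can emphasize the
    informative dimensions (one plausible reading of the fusion)."""
    c = x_i + p_i                                   # concatenation x_i ⊕ p_i
    g = [sigmoid(sum(w * v for w, v in zip(row, c)) + bj)
         for row, bj in zip(W, b)]                  # gate g = σ(Wc + b)
    return [gj * cj for gj, cj in zip(g, c)]        # f_i = g ⊙ c

d = 2                                               # toy embedding size
x_i = [0.3, -0.1]                                   # word vector (made up)
p_i = [0.5, 0.2]                                    # prototype vector (made up)
W = [[random.uniform(-0.5, 0.5) for _ in range(2 * d)] for _ in range(2 * d)]
b = [0.0] * (2 * d)
f_i = gating_conditioner(x_i, p_i, W, b)
```

Since every gate value lies in (0, 1), each fused dimension is a shrunken copy of the corresponding concatenated dimension, which is how the mechanism "selects" dimensions.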

Sequence Tagger
The sequence tagger aims to extract high-level semantic features from the low-level tokens and predict a tag sequence Y for the review S based on these features. In order to investigate the influence of soft prototypes, we need to control variables in SoftProto. Therefore, we choose three existing sequence taggers as our basic models: BiLSTM (Liu et al., 2015), DECNN (Xu et al., 2018), and Seq2Seq4ATE (Ma et al., 2019). Readers can refer to the original papers for more details or Section 4.2 for a quick glance. Please note that the only difference between an original sequence tagger and its variant enhanced by our proposed SoftProto is the representation of S. In other words, by comparing the performance of a sequence tagger and its enhanced variant, we can observe how ATE benefits from soft prototypes.
For training SoftProto, we simply compute the cross-entropy loss L:

L = - (1/n) \sum_{i=1}^{n} \sum_{j=1}^{J} ŷ_i^j \log y_i^j    (6)

where n is the length of S, J is the number of label categories, and y_i and ŷ_i are the predicted tags and ground-truth labels, respectively. We then train all parameters with back-propagation.
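As a sketch of this loss (our illustration; we assume the loss is averaged over the n positions, and the toy predictions are made up):

```python
import math

def sequence_cross_entropy(pred_probs, gold_tags, tag2idx):
    """Token-level cross-entropy: average over the n positions of
    -log p(gold tag). pred_probs[i] is the tagger's distribution over
    the J = 3 BIO labels at position i."""
    loss = 0.0
    for probs, tag in zip(pred_probs, gold_tags):
        loss -= math.log(probs[tag2idx[tag]])
    return loss / len(gold_tags)

tag2idx = {"B": 0, "I": 1, "O": 2}
# "The Bombay style bhelpuri is very palatable ." — only "bhelpuri" is
# an aspect term, so the gold sequence is all O except one B.
gold = ["O", "O", "O", "B", "O", "O", "O", "O"]
pred = [[0.1, 0.1, 0.8]] * 3 + [[0.7, 0.2, 0.1]] + [[0.05, 0.05, 0.9]] * 4
loss = sequence_cross_entropy(pred, gold, tag2idx)
```

A tagger that puts high probability on the gold tag at every position drives this loss toward zero.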

ATE Datasets
To evaluate the effectiveness of SoftProto for ATE tasks, we conduct extensive experiments on four datasets from SemEval 2014 (Pontiki et al., 2014), 2015 (Pontiki et al., 2015), and 2016 (Pontiki et al., 2016). These datasets contain review sentences from the restaurant and laptop domains with annotated aspect terms. All of them have a fixed train/test split, and we further randomly hold out 150 training samples as the validation set for tuning hyper-parameters. The statistics of the four ATE datasets are summarized in Table 1.

Details for Pre-training Language Models As mentioned in Section 3.2, we use two types of data to pre-train the LMs: (1) The ATE training sets. In this setting, we directly use the same training/validation samples of each SemEval dataset to pre-train its own LMs. Hence, there are four groups of pre-trained LMs (each including LM→ and LM←), one per dataset. We denote this setting as SoftProtoI (I for internal knowledge). (2) Unlabeled external data. In this setting, we additionally collect 100,000 training and 10,000 validation samples from the Yelp Review (Zhang et al., 2015) and Amazon Electronics (McAuley et al., 2015) datasets, respectively. LMs pre-trained on Yelp serve as the prototype generator when training and evaluating SoftProto on the {Res14, Res15, Res16} datasets, while those pre-trained on Amazon are used for the Lap14 dataset. We denote this setting as SoftProtoE (E for external knowledge). For pre-training the LMs, we adopt the Fairseq toolkit (Ott et al., 2019) and the basic transformer decoder LM architecture (Vaswani et al., 2017).

Parameter Settings The only hyper-parameter in our SoftProto is the number K of oracle words used when generating soft prototypes. We use a grid search to select K in the range [1, 10] based on the validation performance, and consequently set K = {10, 7, 10, 7} for the four datasets, respectively.
For other parameters, including the pre-trained word embedding, epoch number, optimizer selection, learning rate, and batch size, we inherit the default settings from the original papers (Liu et al., 2015; Xu et al., 2018; Ma et al., 2019). Models achieving the maximum F1-scores on the validation set are used for evaluation on the testing set. We report the averaged F1-scores over 5 runs with random initialization. We run all methods on a single 2080Ti GPU.

Compared Methods
We choose two kinds of baselines. The first is the SemEval winners of the corresponding datasets. The second is a set of pure sequence taggers: in order to discern the impact of soft prototypes on the pure ATE task, we do not choose hybrid models as the base taggers. Instead, we adapt SoftProto to three pure sequence taggers: BiLSTM (Liu et al., 2015), an RNN-based sequence tagger built on a vanilla BiLSTM architecture; DECNN (Xu et al., 2018), a CNN-based sequence tagger that uses two types of pre-trained embeddings and stacked convolutional layers to extract context features for tagging aspect terms; and Seq2Seq4ATE (Ma et al., 2019), an attention-based sequence tagger that uses a modified encoder-decoder framework to extract aspect terms. We further compare SoftProto with two simple enhancing methods, namely Synonym and Replacement. For Synonym, we substitute the top-K oracle words with the top-K nearest synonyms measured by the cosine distance of word vectors while keeping the other settings unchanged. For Replacement, we use the prototype generated by our language models, but replace the training words with the method in Kobayashi (2018). The modified samples are sent to the sequence tagger directly.

Footnote 4: https://github.com/pytorch/fairseq.
Footnote 5: In practice, we also tried a self-constructed single-layer LSTM architecture and got a similar performance in language modeling. Since the Fairseq toolkit has already integrated the transformer architecture, we directly use it for convenience.

[Table 2 caption, partial: ... while other results are the averaged scores of 5 runs with random initialization. The best scores are in bold, and the best baselines are underlined. The subscript denotes the improvement/decrease after enhancing an ATE tagger with a certain method (e.g., BiLSTM+SoftProtoE vs. BiLSTM). * denotes the statistical significance between the original methods and their enhanced counterparts at the p < 0.05 level.]
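The Synonym baseline's retrieval step can be sketched as follows (our illustration: the 2-dimensional vectors are made-up stand-ins for real pre-trained embeddings). Note how the neighbors are fixed per word, ignoring context, which is exactly the limitation discussed in the Introduction:

```python
import math

# Toy pre-trained word vectors (hypothetical values).
vecs = {
    "bhelpuri": [0.9, 0.1],
    "food":     [0.8, 0.3],
    "dish":     [0.7, 0.4],
    "laptop":   [0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k_synonyms(word, K):
    """Retrieve the K nearest neighbors of `word` by cosine similarity."""
    others = [(w, cosine(vecs[word], v)) for w, v in vecs.items() if w != word]
    others.sort(key=lambda t: -t[1])
    return [w for w, _ in others[:K]]
```

With these vectors, `top_k_synonyms("bhelpuri", 2)` returns ["food", "dish"] regardless of the sentence "bhelpuri" appears in.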

Main Results
The comparison results for all methods are shown in Table 2. Obviously, SoftProto greatly boosts all basic sequence taggers. For example, DECNN achieves the overall best performance among the baselines, while SoftProtoI and SoftProtoE further achieve {1.28%, 0.31%, 0.83%, 1.49%} and {1.80%, 1.35%, 2.09%, 2.59%} absolute gains for DECNN on the four datasets, respectively. There is even a remarkable 3.30% gain after incorporating SoftProtoE into Seq2Seq4ATE on the Res14 dataset. This strongly demonstrates the effectiveness of the proposed soft prototypes for the ATE task. By correlating samples through the soft prototypes, the training of sequence taggers can easily converge to a better state than before.

Footnote 6: We use a grid search to select the replacement probability and present the best results. Prototype tokens are generated using the LMs pre-trained on the Yelp/Amazon data.
We also find that the improvements brought by SoftProto are more remarkable on the small datasets (Res15 and Res16) than on the large ones (Res14 and Lap14). This is because there are not enough samples on small datasets to train a well-performing sequence tagger, and the discovery of aspect terms largely relies on the knowledge embedded in the soft prototypes. Moreover, SoftProtoE performs much better than SoftProtoI. The reason is that the external unlabeled data from Yelp and Amazon is much larger and more informative than the original ATE datasets. Accordingly, the pre-trained LMs in SoftProtoE contain more knowledge than those in SoftProtoI and can generate more discriminative soft prototypes.
The performances of Synonym and Replacement are far from satisfactory, and they even result in decreases in some cases. Synonym generates noisy prototypes by considering only the individual tokens, and can hardly handle unknown (UNK) words. The ineffectiveness of Replacement lies in two issues. Firstly, it simply replaces the original words with the generated ones, which incurs information loss. Secondly, the generated knowledge cannot be fully utilized due to the small percentage of replacement. These inferior results demonstrate that the two methods are not qualified for enhancing the ATE task.

Perplexities of Language Models
In this section, we present the perplexities of the language models pre-trained on different datasets. As shown in Table 3, perplexity is closely tied to dataset size: the larger the dataset, the lower the perplexity. Clearly, the LMs trained on the external Yelp/Amazon datasets have much lower perplexities than those trained on the original SemEval datasets. Among the SemEval datasets, Lap14 and Res14 have relatively more samples than Res15 and Res16, resulting in relatively lower perplexities. Moreover, the forward and backward language models show no significant differences in performance. We will release all pre-trained language models to encourage further studies on soft prototypes.

Ablation Study
Without loss of generality, we choose the two DECNN+SoftProto models and conduct an ablation study to investigate the effects of the different modules in SoftProto. We sequentially remove the forward LM, the backward LM, the concatenation operation, and the gating operation to obtain four simplified variants.
As shown in Table 4, all variants suffer a decrease in F1-score. The results demonstrate that: (1) considering both directions in language modeling can generate better soft prototypes; (2) both kinds of conditioning operations (i.e., gating and concatenation) contribute to the utilization of the soft prototypes.

Impacts of Oracle Words
In the prototype generator, the hyper-parameter K controls how many oracle words are taken into account when generating soft prototypes. To investigate the impacts of the oracle words on different datasets, we vary K in the range [1, 10] with a step of 1, and present the results of the two DECNN+SoftProto models in Figure 5. Generally, the F1-scores of DECNN show an overall upward trend as more oracle words are introduced. This is expected, since the oracle words provide the data-specific knowledge that is aggregated into the soft prototypes. Moreover, owing to the high confidence of the language models trained on the Yelp/Amazon datasets, the curves of SoftProtoE are smoother than those of SoftProtoI. The reason is that language models with high perplexities almost inevitably output noisy oracle words and bring about high variance when generating soft prototypes.

Case Study
To take a closer look, we further select six samples from the testing sets for a case study. Due to space limitations, we only present the results of the best baseline DECNN and its two variants enhanced by SoftProto in Table 5. S1∼S2 are in similar circumstances: DECNN only extracts a single word as the aspect term and

Performance on Tail Aspect Terms
To prove that SoftProto is indeed beneficial for identifying tail aspect terms, we keep the training sentences unchanged and only preserve the testing sentences containing tail aspect terms (those appearing no more than 3 times in the training sentences). We present the performance of DECNN and its two variants enhanced by SoftProto on these sentences in Table 6. Clearly, SoftProto enhances the ability of DECNN to recognize tail aspect terms by a large margin.

Prototypes Generation with BERT
Since BERT (Devlin et al., 2019) is pre-trained as a masked language model (MLM), we wonder whether it can serve as the prototype generator. Hence, we regard the generation of prototypes as a cloze test: we sequentially mask each word and collect the top-K output words of the MLM as the oracle words. We name this variant SoftProtoB. The setting of K and the usage of the oracle words remain the same as in SoftProtoI and SoftProtoE; thus the only difference among all these SoftProto variants is the way of pre-training the language models. We conduct experiments on two pre-trained BERT models, where SoftProtoB (BASE) uses the officially released BERT-Base-Uncased model, and SoftProtoB (PT) uses the model further post-trained on domain-specific data and released by Xu et al. (2019). Since both SoftProtoB and SoftProtoE make use of external data, they are fair competitors, and we list the results of these two variants in Table 7. From the results, we can see that the BERT-based models are also qualified to generate the soft prototypes. In general, SoftProtoB (BASE) generates domain-independent oracle words and achieves limited improvements over the base model, while SoftProtoB (PT) generates domain-specific oracle words and achieves a performance comparable with SoftProtoE.
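The cloze-style generation loop can be sketched as follows (our illustration; `mlm_topk` is a hypothetical callable standing in for a real masked LM such as BERT, and the toy stand-in below always returns the same ranked list):

```python
def cloze_prototypes(tokens, mlm_topk, K):
    """Mask each position in turn and take the MLM's top-K predictions
    at that position as the oracle words (sketch of SoftProtoB)."""
    oracles = []
    for i in range(len(tokens)):
        masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        oracles.append(mlm_topk(masked, i)[:K])
    return oracles

# A toy stand-in MLM that "predicts" a fixed ranked list of (word, prob)
# pairs regardless of the masked context (illustration only).
def toy_mlm(masked_tokens, pos):
    return [("food", 0.6), ("dish", 0.3), ("meal", 0.1)]

oracle_words = cloze_prototypes(["the", "bhelpuri", "is", "great"], toy_mlm, K=2)
```

A real MLM would condition on both the left and right context at once, which is why a single bidirectional model can replace the pair of unidirectional LMs here.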

Analysis on Computational Cost
Since we use pre-trained language models, the cost of generating soft prototypes is almost negligible. To demonstrate that SoftProto does not incur a high computational cost in utilizing soft prototypes, we run the three sequence taggers on the Lap14 dataset and present the number of trainable parameters and the running time per epoch of each method, before and after introducing SoftProto, in Table 8. From Table 8, we can conclude that SoftProto is a lightweight framework and does not add much cost to the original sequence taggers.

Conclusion
In this paper, we present a general SoftProto framework to enhance the ATE task. Rather than designing elaborate sequence taggers, we turn to correlating samples with each other through soft prototypes. For this purpose, we resort to language models to automatically generate soft prototypes and then design a gating conditioner to utilize them. The performance of SoftProto can be further improved by introducing large-scale external unlabeled data like Yelp and Amazon reviews. Extensive experiments on four SemEval datasets demonstrate that SoftProto greatly boosts the performance of typical ATE methods while introducing little computational cost.