Variational Semi-Supervised Aspect-Term Sentiment Analysis via Transformer

Aspect-term sentiment analysis (ATSA) is a long-standing challenge in natural language process. It requires fine-grained semantical reasoning about a target entity appeared in the text. As manual annotation over the aspects is laborious and time-consuming, the amount of labeled data is limited for supervised learning. This paper proposes a semi-supervised method for the ATSA problem by using the Variational Autoencoder based on Transformer. The model learns the latent distribution via variational inference. By disentangling the latent representation into the aspect-specific sentiment and the lexical context, our method induces the underlying sentiment prediction for the unlabeled data, which then benefits the ATSA classifier. Our method is classifier-agnostic, i.e., the classifier is an independent module and various supervised models can be integrated. Experimental results are obtained on the SemEval 2014 task 4 and show that our method is effective with different the five specific classifiers and outperforms these models by a significant margin.


Introduction
Aspect based sentiment analysis (ABSA) has two sub-tasks, namely aspect-term sentiment analysis (ATSA) and aspect-category sentiment analysis (ACSA). ACSA is to infer the sentiment polarity with regard to the predefined categories, e.g., the aspect f ood, price, ambience. On the other hand, ATSA aims at classifying the sentiment polarity of a given aspect word or phrase in the text. For example, given a review about a restaurant "the [pizza] aspect is the best if you like thin crusted pizza, however, the [service] aspect is awful.", the sentiment implications with regard to "pizza" and "service" are contrary. For the aspect * * : Equal Contribution "pizza", the sentiment polarity is "positive" while "negative" for the aspect "service". In contrast to document-level sentiment analysis, ATSA requires more fine-grained reasoning about the textual context. The task is worthy of investigation as it can obtain the attitude with regard to a specific entity which we are interested in. The task is widely applicated in analyzing the comments, such as opinion generation. Recently, many attempts (Tang et al., 2016b;Pan and Wang, 2018;Liu et al., 2018;Li et al., 2018a) focus on supervised learning and pay much attention to the interaction between the aspect and the context. However, the amount of labeled data is quite limited as the annotation about the aspects is laborious. Currently available data sets, e.g. Se-mEval, only has around 2K unique sentences and 3K sentence-aspect pairs, which is insufficient to fully exploit the power of the deep models. Fortunately, a large amount of unlabeled data is available for free and can be accessed easily from the websites. It will be of great significance if numerous unlabeled samples can be utilized to further facilitate the supervised ATSA classifier. Therefore, the semi-supervised ATSA is a promising research topic.
In ATSA, achieving the sentiment of the aspectterm is semantically complicated and it is nontrivial for a model to capture sentimental similarity of the aspects, which causes the difficulties for semi-supervised learning. In this paper, we proposed a classifier-agnostic framework which named Aspect-term Semi-supervised Variational Autoencoder (Kingma and Welling, 2014) based on Transformer (ASVAET). The variational autoencoder offers the flexibility to customize the model structure. In other words, the proposed framework is compatible with other supervised neural networks to boost their performance. Our proposed model learns the latent representation of the input data and disentangles the representations into two independent parts, i.e., the aspectterm sentiment and the representation of the lexical context. By regarding the aspect sentiment polarity of the unlabeled data as the discrete latent variable, the model implicitly induces the sentiment polarity via the variational inference. Specifically, the representation of the lexical context is extracted by the encoder and the aspect-term sentiment polarity is inferred from the specific ATSA classifier. The decoder takes these two representations as inputs and reconstructs the original sentence by two unidirectional language models. In contrast to the conventional auto-regressive models, the latent representations have their specific meanings and are obtained from the encoder and the classifier to the input examples. Therefore, it is also possible to condition the sentence generation on the sentiment and lexical information w.r.t. a certain target entity. In addition, by separating the representation of the input sentence, the classifier becomes an independent module in our framework, which endows the method with the ability to integrate different classifiers. The method is presented in detail in Sec. 3.
Experimental results are obtained on the two classical datasets from SemEval 2014 task 4 (Pontiki et al., 2014). Five recent available models are implemented as the classifier in ASVAET. Our method is able to utilize the unlabeled data and consistently improve the performance against the supervised models. Compared with other semisupervised methods, i.e., in-domain word embedding pre-training and self-training, the proposed method also demonstrates better performance. We also evaluate the effectiveness of labeled data and sharing embeddings, and show that the structure can provide the separation between lexical context and sentiment polarity in the latent space.

Related Work
Sentiment analysis is a traditional research hotspot in the NLP field (Wang and Manning, 2012). Rather than obtaining the sentimental inclination of the entire text, ATSA instead aims to extract the sentimental expression w.r.t. a target entity. With the release of online completions, abundant methods were proposed to explore the limits of current models. Tang et al. (Tang et al., 2016a) proposed to make use of bidirectional Long Short-Term Memory (LSTM) (Hochreiter and Schmid-huber, 1997) to encode the sentence from the left and right to the aspect-term. This model primarily verifies the effectiveness of deep models for ABSA Tang et al. (Tang et al., 2016b) then put forward a neural reasoning model in analogy to the memory network to perform the reasoning in many steps. There are also many other works dedicating to solve this task (Pan and Wang, 2018;Liu et al., 2018;Zhang and Liu, 2017).
Another related topic is semi-supervised learning for the text classification. Recently, Data augmentation methods (Xie et al., 2019;Berthelot et al., 2019) achieve a greate success on lowresource datasets. Moreover, A simple but efficient method is to use pre-trained modules, e.g., initializing the word embedding or bottom layers with pre-training. Word embedding technique has been wildly used in NLP models, e.g., Glove (Pennington et al., 2014) and ELMo (Peters et al., 2018). Recently, Bidirectional Encoder Representations from Transformer (BERT) (Devlin et al., 2018) replaces the embedding layer to contextdependent layer with the pre-trained bidirectional language model to capture the contextual representation. BERT is complementary to the encoder of the proposed method. To keep our framework neat, these pre-training investigations are not conducted in this paper.
VAE-based semi-supervised methods, on the other hand, are able to cooperate with various kinds of classifiers. VAE has been applied in many semi-supervised NLP tasks, ranging from text classification (Xu et al., 2017), relation extraction (Marcheggiani and Titov, 2016) to sequence tagging (Chen et al., 2018). Different from text classification where sentiment polarity is related to an entire sentence, ATSA just interested in related information of a given aspect-term. To circumvent this problem, a novel structure is proposed.

Method Description
In this section, the problem definition is provided and then the model framework is presented in detail.
The ATSA task aims to classify a data sample with input sentence x = {x 1 , ..., x n } and corresponding aspect 1 a = {a 1 , ..., a m }, where a is a subsequence of x, into a sentiment polarity y, where y ∈ {P, O, N }. P, O, N denotes "positive", "neutral", "negative". For the semisupervised ATSA, we consider the following scenario. Given a dataset consisting of labeled samples S l and unlabeled samples S u , where the , the goal is to utilize S u to improve the classification performance over the supervised model using S l only.
The architecture is depicted in Fig. 1. The method consists of three main components, i.e., the classifier, the encoder, and the decoder. The classifier can be any differentiable supervised ATSA model, which takes x and a as input, and outputs the prediction about y. The encoder transform the data into a latent space that is independent of the label y. And the decoder combines the outputs from the classifier and the encoder to reconstruct the input sentence. For the labeled data, the classifier and the autoencoder are trained with the given label y. For the unlabeled data, the y is regarded as the latent discrete variable and it is induced by maximizing the generative probability. As the classifier can be implemented by various models, the description of the classifier will be omitted. We present a autoencoder structure based on Transformer (Vaswani et al., 2017). In the following, the objective functions are clarified, followed by the model description.

Variational Inference
Using generative models is a common approach for semi-supervised learning, which tries to extract the information from the unlabeled data by modeling the data distribution. In VAE, the data distribution is modeled by optimizing the evidence lower bound (ELBO) of data log-likelihood, which leads to two objectives for labeled data and unlabeled data respectively. For the labeled data, VAE maximizes the ELBO of p(x, y|a). For the unlabeled data, it optimizes the ELBO of p(x|a), where the y is latent and integrated. Specifically, the dependency between variables is illustrated in Fig. 2. The ELBO of log p(x, y|a) can be given as follows: where z is the latent variable which represents lexical information over the sentence and D KL is the KullbackLeibler divergence.
In terms of the unlabeled data, the ELBO of log p(x|a) can be extended from Eq. 1.
where H is the entropy function and q φ (y|x, a) is the classification function.
And q φ (y|x, a) can also be trained in the supervised manner using the labeled data. Combining the above objectives, the overall objective for the entire data set is: where γ is a hyper-parameter which controls the weight of the additional classification loss.
To implement this objective, three components are required to model the q φ (y|x, a), q φ (z|x, a, y) and p θ (x|y, a, z) respectively.

Classifier
Various currently available models can be used as the classifier. For the unlabeled data, the classifier is used to predict the distribution of label y for the decoder, i.e., y ∼ q φ (y|x, a). The distribution q φ (y|x, a) will be tuned during maximizing the objective in Eq. 2. In this work, five classifiers are implemented in ASVAET and they are also used as the supervised baselines for the comparison.

Transformer Encoder
The encoder plays the role of q φ (z|x, a, y). This module attempts to extract the lexical feature that is independent of the label y when given data sample (x, a). In this way, the z and y jointly form the representative vector for the input data.
In our implementation, we use a bidirectional encoder to construct sentences embeddings. It is referred as the Transformer encoder that is actually a sub-graph of the Transformer architechture (Vaswani et al., 2017), the architecture is shown in the left part of the Fig. 2. The encoder employs residual connections around each of the multi-head attention sub-layers, followed by  Figure 1: This is the sketch of our model with bidirectional encoder and decoder. Assuming the aspect-term starts at the k-th position in x. Bottom: When using unlabeled data, the distribution of y ∼ q φ (y|x, a) is provided by the classifier. Left: The sequence is encoded by a Transformer block, which receives the summation of three embeddings, i.e., segment (used to distinguish aspect words) s xn , position p xn and word e xn . The encoding and the label y are used to parameterize the posterior q φ (z|x, a, y). Right: A sample z from the posterior q φ (z|x, a, y) and label y are passed to the generative network which estimates the probability p θ (x|y, a, z) by two unidirectional Transformer decoders. The number of aspect tokens is l a . layer normalization. To capture the aspect-term, we treat the aspect-term and its context differently by segment embeddings. To further emphasize the position of the conditional aspect, the position tag is also included for each token. The position tag indicates the distance from the token to the aspect. And then the position tag is transformed into a vector as defined in (Vaswani et al., 2017), which is added with the word embedding and segment embedding as the input of the Transformer encoder. Let g denote the output of the Transformer encoder after pooling which simply averaging the hidden states of the aspect-terms (the number of tokens is equal or greater than one) of the last layer, y is the indicator vector of the polar-ity. Then the distribution of z can be given as: The sequences are divided into two parts by using segment embedding, the encoder can be aware of the position and the content of the aspect-term a by multi-head attention operation in the Transformer encoder. The information from two sides are aggregated into the aspect-term a, and therefore the resulting z can gather the information related to the aspect.

Transformer Decoders
The decoder is also a sub-graph of Transformer architechture (Vaswani et al., 2017) which focus on reconstructing original text. The main difference from the Transformer encoder is that the Transformer decoder is unidirectional by modifying the self-attention sub-layer to prevent positions from paying attention to subsequent positions. The textual sequence is well-known to be semantically complex and it is non-trivial for a Transformer decoder to capture the high-level semantics. Here we investigate two questions. How to implement p θ (x|y, a, z) without losing the information of a and how to capture the semantic polarity by a sequential model. For the first question, denoting that x is composed of three parts (x l , a, x r ), we use two Transformer decoders to model the left and right content. For the second question, we let each token is generated conditioned on the summation of the variables z and embedding y.
One way to achieve p θ (x|y, a, z) is to separate the sequence into two parts, reversing the process in the two unidirectional decoder. For each decoder, the input state is represented by the summation of the four input i.e., the polarity indicator vector y from the classifier or the labeled dataset, the context vector z from the encoder, the input token embedding e xt and the position embedding p xt : It is equivalent to generate two sequences using two decoders. When decoding left part (or right part), the aspect will first get processed by the decoder and hence the decoder is aware of the aspect-terms. The position tag is also used in the decoder.

Datasets and Preparation
The models are evaluated on two benchmarks: Restaurant (REST) and Laptop (LAPTOP) datasets from the SemEval ATSA challenge (Pontiki et al., 2014). The REST dataset contains the reviews in the restaurant domain, while the LAPTOP dataset contains the reviews of Laptop products. The statistics of these two datasets are listed in Table  1. When processing these two datasets, we follow the same procedures as in another work (Lam et al., 2018). The dataset has a few samples that are labeled as "conflict" and these samples are removed. All tokens in the samples are lowercased without other preprocess, e.g., removing the stop words, symbols or digits. In terms of the unlabeled data, we obtained samples in the same domain for the REST and LAPTOP datasets. For the REST, the unlabeled   The NLTK sentence tokenizer is utilized to extract the sentences from the raw comments. And each sentence is regarded as a sample in our model for both REST and LAPTOP. To obtain the aspects in the unlabeled samples, an open-sourced aspect extractor 4 is pre-trained using labeled data. The resulting test F1 score is 88.42 for the REST and 80.12 for the LAPTOP. Then the unlabeled data is processed by the pre-trained aspect extractor to obtain the aspects. The sentences that have no aspect are removed. And the sentences are filtered with maximal sentence length 80. The statistic of the resulting sentences is given in Table. 2.

Model Configuration & Classifiers
In the experiments, the model is fixed with a set of universal hyper-parameters. The number of units in the encoder and the decoder is 100 and the latent variable is of size 50 and the number of layers of both Transformer blocks is 2, the number of selfattention heads is 8. The KL weight klw should be carefully tuned to prevent the model from trapping in a local optimum, where z carries no useful information. In this work, the KL weight is set to be 1e-4. In term of word embedding, the pre-trained GloVe (Pennington et al., 2014) is used as the in-  Table 3: Experimental results (%). For each classifier, we performed five experiments, i.e., the supervised classifier, the supervised classifier with pre-trained embedding using unlabeled data and our model with the classifier. The results are obtained after 5 runs, and we report the mean and the standard deviation of the test accuracy, and the Macro-averaged F1 score. Better results are in bold. denotes that the results are extracted from the original paper.
put of the encoder and the decoder 5 and the outof-vocabulary words are excluded. And it is fixed during the training. The γ is set to be 10 across the experiments. We implemented and verified four kinds of mainstream ATSA classifiers integrated into our model, i.e., TC-LSTM (Tang et al., 2016a), Mem-Net (Tang et al., 2016b), BILSTM-ATT-G (Zhang and Liu, 2017), IAN (Ma et al., 2017) and TNet (Li et al., 2018b).
• TC-LSTM: Two LSTMs are used to model the left and right context of the target separately, then the concatenation of two representations is used to predict the label.
• MemNet: It uses the attention mechanism over the word embedding over multiple rounds to aggregate the information in the sentence, the vector of the final round is used for the prediction.
• IAN: IAN adopts two LSTMs to derive the representations of the context and the target phrase interactively and the concatenation is fed to the softmax layer.
• BILSTM-ATT-G: It models left and right contexts using two attention-based LSTMs 5 http://nlp.stanford.edu/data/glove.8B.300d.zip and makes use of a special gate layer to combine these two representations. The resulting vector is used for the prediction.
• TNet-AS: Without using an attention module, TNet adopts a convolutional layer to get salient features from the transformed word representations originated from a bidirectional LSTM layer. Among current supervised models, TNet is currently one of the in-domain state-of-the-art methods and the TNet-AS is one of the two variants of TNet.
The configuration of hyper-parameters and the training settings are the same as in the original papers. Various classifiers are tested here to demonstrate the robustness of our method and show that the performance can be consistently improved for different classifiers. Table 3 shows the experimental results on the REST and LAPTOP datasets. Two evaluation metrics are used here, i.e., classification accuracy and Macro-averaged F1 score. The latter is more sensitive when the dataset is class-imbalance. In this table, the semi-supervised results are obtained with 10K unlabeled data. We didn't observe further improvement with more unlabeled data. The mean and the standard deviation are reported over 5 runs. For each classifier clf, we conducted the following experiments:

Main Results
• clf : The classifier is trained using labeled data only.
• clf (EMB): We use CBOW (Mikolov et al., 2013) to train the word embedding vectors using both labeled and unlabeled data. And the resulting vectors, instead of pre-trained GloVe vectors, are used to initialize the embedding matrix of the classifier. This is the embedding-level semi-supervised learning as the embedding layer is trained using in-domain data.
• clf (ST): The self-training (ST) method is a typical semi-supervised learning method. We performed the self-training method over each classifier. At each epoch, we select the 1K samples with the best confidence and give them pseudo labels using the prediction. Then the classifier is re-trained with the new labeled data. The procedure loops until all the unlabeled samples are labeled.
• clf (ASVAET): The proposed method that uses clf as the classifier. Note again that the classifier is an independent module in the proposed model, and the same configuration is used as in the supervised learning.
From the Table 3, the ASVAET is able to improve supervised performance consistently for all classifiers. For the MemNet, the test accuracy can be improved by about 2% by the TSSVAE, and so as the Macro-averaged F1. The TNet-AS outperforms the other three models.
Compared with the other two semi-supervised methods, the ASVAET also shows better results. The ASVAET outperforms the compared semisupervised methods evidently. The adoption of indomain pre-trained word vectors is beneficial for the performance compared with the Glove vectors.

Effect of Labeled Data
Here we investigated whether the ASVAET works with less labeled data. Without loss of general-    ity, the MemNet is used as the basic classifier. We sampled different amount of labeled data to verify the improvement by using ASVAET. The test accuracy curve w.r.t. the amount of labeled data used is shown in Fig. 3. With fewer labeled samples, the test accuracy decreases, however, the improvement becomes more evident. When using 500 labeled samples, the improvement is about 3.2%. With full 3591 labeled samples, 1.5% gain can be obtained. This illustrates that our method can improve the accuracy with limited data.

Effect of Sharing Embeddings
In previous works, the word embedding is shared among all the components. In other words, the word embedding is also tuned in learning to reconstruct the data. It is questionable whether the improvement is obtained by using VAE or multi-  Table 5: Nice sentences that are generated by controlling the sentiment polarity y using the decoder. task learning (text generation and classification). In the aforementioned experiments, the embedding layer is not shared between the classifier and autoencoder. This implementation guarantees that the improvement does not come from learning to generate. To verify if sharing embedding will benefit, we also conducted experiments with sharing embedding, as illustrated in Table. 4. The results indicate that the joint training for the embedding layer is negative for improving the performance in this task. The gradient from the autoencoder may collide with the gradients from the classifier and therefore, interferes with the optimization direction.

Analysis of the Latent Space
Transformer encodes the data into two representations, i.e., y and z. These two latent variable represented sentiment polarity and lexical context individually from the input text. We expect the y and z are fully disentangled and represent different meanings. The scatters of latent variable z (cf. Fig. 4) helps us have a better understanding. As shown in the figure, the distributions of three different polarities are very similar, which indicates that the lexical context reprensetation z is independent of the polarity y.
The generation ability of the decoder is also investigated. Several sentences are generated and selected in the Table 5. By controlling the sentiment polarity y with the same z, the decoder can generate sentences with different sentiment in a similar format. This indicates that the decoder is trained successfully to perceive the y and model the relationship between the y and x.

Conclusion
A VAE-based framework has been proposed for the ATSA task. In this work, the encoder and decoder are constructed from the Transformers. Both analytical and experimental work has been carried out to show the effectiveness of the ASVAET. The method is verified with various kinds of classifiers. For all tested classifiers, the improvement is obtained when equipped with ASVAET, which demonstrates its universality.
In this paper, the aspect-term is assumed to be known and there is an error accumulation problem when using the pre-trained aspect extractor. According to this, in future work, it is also interesting to show if it is possible to learn the aspect and sentiment polarity jointly for the unlabeled data. It will be of great importance if detailed knowledge can be extracted from the unlabeled data, which will shed light on other related tasks.