Attention and Lexicon Regularized LSTM for Aspect-based Sentiment Analysis

Abstract Attention-based deep learning systems have been demonstrated to be the state-of-the-art approach for aspect-level sentiment analysis. However, end-to-end deep neural networks lack flexibility, as one cannot easily adjust the network to fix an obvious problem, especially when more training data is not available: e.g. when it always predicts positive when seeing the word disappointed. Meanwhile, it is rarely stressed that the attention mechanism is likely to "over-focus" on particular parts of a sentence, while ignoring positions which provide key information for judging the polarity. In this paper, we describe a simple yet effective approach to leverage lexicon information so that the model becomes more flexible and robust. We also explore the effect of regularizing attention vectors to allow the network to have a broader "focus" on different parts of the sentence. The experimental results demonstrate the effectiveness of our approach.


Introduction
Sentiment analysis (also called opinion mining) has been one of the most active fields in NLP due to its important value to business and society. It is the field of study that tries to extract opinions (positive, neutral, negative) expressed in natural language. Most sentiment analysis work has been carried out at document level (Pang et al., 2002; Turney, 2002) and sentence level (Wilson et al., 2004), but as the opinion expressed by a word is highly context dependent, one word may express opposite sentiments under different circumstances. Aspect-based sentiment analysis (ABSA) was proposed to address this problem. It finds the polarity of an opinion associated with a certain aspect, such as food, ambiance, service, or price in a restaurant domain.
Although deep neural networks yield significant improvement across a variety of tasks compared to previous state-of-the-art methods, end-to-end deep learning systems lack flexibility, as one cannot easily adjust the network to fix an obvious problem: e.g. when the network always predicts positive when seeing the word disappointed, or when the network is not able to recognize the word dungeon as an indication of negative polarity. It could be even trickier in a low-resource scenario where more labeled training data is simply not available. Moreover, it is rarely stressed that the attention mechanism is likely to over-fit and force the network to "focus" too much on a particular part of a sentence, in some cases ignoring positions which provide key information for judging the polarity. In recent studies, both Niculae and Blondel (2017) and Zhang et al. (2019) proposed approaches to make the attention vector more sparse; however, this would only encourage the over-fitting effect in such a scenario.
In this paper, we describe a simple yet effective approach to merge lexicon information with an attention LSTM model for ABSA in order to leverage both the power of deep neural networks and existing linguistic resources, so that the framework becomes more flexible and robust without requiring additional labeled data. We also explore the effect of regularizing attention vectors by introducing an attention regularizer to allow the network to have a broader "focus" on different parts of the sentence.

Related works
ABSA is a fine-grained task which requires the model to produce an accurate prediction given different aspects. As it is common for one sentence to contain opposite polarities associated with different aspects at the same time, attention-based LSTM (Wang et al., 2016) was first proposed to allow the network to assign higher weights to more relevant words given different aspects. Following this idea, a number of studies have been carried out to keep improving the attention network for ABSA (Ma et al., 2017; Tay et al., 2017; Cheng et al., 2017; He et al., 2018; Zhu and Qian, 2018).
On the other hand, a lot of work has been done on leveraging existing linguistic resources such as sentiment lexicons to enhance performance; however, most of it is performed at document and sentence level. For instance, at document level, Teng et al. (2016) proposed a weighted-sum model which represents the final prediction as a weighted sum of the network prediction and the polarities provided by the lexicon. Zou et al. (2018) described a framework that assigns higher weights to opinion words found in the lexicon by transforming lexicon polarity into a sentiment degree.
At sentence level, Shin et al. (2017) used two convolutional neural networks to separately process sentence and lexicon inputs. Lei et al. (2018) described a multi-head attention network where the attention weights are jointly learned with lexicon inputs. Wu et al. (2018) proposed a new labeling strategy which breaks a sentence into clauses by punctuation to produce additional lower-level examples; inputs are then processed at different levels with linguistic information such as lexicons and POS tags, and finally merged back to perform sentence-level prediction. Meanwhile, other similar works incorporating linguistic resources for sentiment analysis have been carried out (Rouvier and Favre, 2016; Qian et al., 2017).
Regarding attention regularization, instead of using softmax or sparsemax, Niculae and Blondel (2017) proposed fusedmax as a regularized attention framework to learn the attention weights; Zhang et al. (2019) introduced L_max and entropy as regularization terms to be jointly optimized with the loss. However, both approaches share the same idea of shaping the attention weights to be sharper and more sparse so that the advantage of the attention mechanism is maximized.
In our work, different from the previously mentioned approaches, we incorporate polarities obtained from lexicons directly into the attention-based LSTM network to perform aspect-level sentiment analysis, so that the model improves in terms of robustness without requiring extra training examples. Additionally, we find that the attention vector is likely to over-fit, forcing the network to "focus" on particular words while ignoring positions that provide key information for judging the polarity, and that by adding lexical features it is possible to reduce this effect. Following this idea, we also experimented with reducing the over-fitting effect by introducing an attention regularizer. Unlike the previously mentioned ideas, we want the attention weights to be less sparse. Details of our approach are given in the following sections.

Baseline AT-LSTM
In our experiments, we replicate the AT-LSTM proposed by Wang et al. (2016) as our baseline system. Compared with a traditional LSTM network (Hochreiter and Schmidhuber, 1997), AT-LSTM is able to learn the attention vector while also taking the aspect embeddings into account. Thus the network is able to assign higher weights to the parts of a given sentence that are more relevant to a specific aspect.
Formally, given a sentence S, let {w_1, w_2, ..., w_N} be the word vectors of each word, where N is the length of the sentence; v_a ∈ R^{d_a} represents the aspect embedding, where d_a is its dimension; and let H ∈ R^{d×N} be the matrix of hidden states {h_1, h_2, ..., h_N ∈ R^d} produced by the LSTM, where d is the number of neurons of the LSTM cell. The attention vector α is computed as follows:

M = tanh([W_h H ; W_v (v_a ⊗ e_N)])
α = softmax(w^T M)
r = H α^T

where α is a vector of attention weights and r is a weighted representation of the input sentence with respect to the input aspect; W_h, W_v and w are attention parameters to be learned. v_a ⊗ e_N = [v_a, v_a, ..., v_a], that is, the operator repeatedly concatenates v_a N times. Then the final representation is obtained and fed to the output layer as below:

h* = tanh(W_p r + W_x h_N)
ŷ = softmax(W_s h* + b_s)

where h* ∈ R^d, and W_p and W_x are projection parameters to be learned during training; W_s and b_s are the weights and biases of the output layer. The prediction ŷ is then plugged into the cross-entropy loss function for training, and L2 regularization is applied:

loss = -Σ_i y_i log ŷ_i + λ‖Θ‖²  (1)

where i indexes the classes (three-way classification in our experiments); λ is the hyperparameter for L2 regularization; Θ is the regularized parameter set of the network.
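The attention computation above can be sketched in NumPy. All weight matrices below are random stand-ins for parameters that the real model learns, and the dimensions are toy values chosen only for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
N, d, d_a = 5, 4, 3                         # sentence length, LSTM size, aspect dim

H = rng.standard_normal((d, N))             # LSTM hidden states, one column per word
v_a = rng.standard_normal(d_a)              # aspect embedding

# Hypothetical attention parameters (learned in the real model)
W_h = rng.standard_normal((d, d))
W_v = rng.standard_normal((d_a, d_a))
w = rng.standard_normal(d + d_a)

# v_a (x) e_N: repeat the projected aspect embedding once per word
Va = np.tile((W_v @ v_a)[:, None], (1, N))  # (d_a, N)
M = np.tanh(np.vstack([W_h @ H, Va]))       # (d + d_a, N)
alpha = softmax(w @ M)                      # attention weights over the N words
r = H @ alpha                               # aspect-weighted sentence representation

# Final representation from r and the last hidden state h_N
W_p = rng.standard_normal((d, d))
W_x = rng.standard_normal((d, d))
h_star = np.tanh(W_p @ r + W_x @ H[:, -1])
```

Note that `alpha` is a proper distribution over positions (non-negative, summing to 1), which is what the regularizers discussed later operate on.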

Lexicon Building
Similar to Shin et al. (2017), but in a different way, we build our lexicon by merging four existing lexicons into one: MPQA, Opinion Lexicon, Opener and Vader. SentiWordNet was in the initial design but was removed from the experiments as it introduced unnecessary noise, e.g. highly is annotated as negative. Categorical labels such as negative, weakneg, neutral, both and positive are converted to the values {−1.0, −0.5, 0.0, 0.0, 1.0} respectively. For lexicons with real-valued annotations, we adopt the annotated value normalized by the maximum polarity in that lexicon. Finally, the union U of all lexicons is taken, where each word w_l ∈ U has an associated vector v_l ∈ R^n that represents the polarity given by each lexicon; n here is the number of lexicons, and missing values are filled with the average over all available lexicons. E.g. the lexical feature for the word adorable is represented by [1.0, 1.0, 1.0, 0.55], taken from MPQA (1.0), Opener (1.0), Opinion Lexicon (1.0) and Vader (0.55) respectively. For words outside U, a zero vector of dimension n is supplied.
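The merging procedure can be sketched as below, with two toy lexicons (one categorical, one real-valued) standing in for the four actual resources; the scores are made up for illustration:

```python
# Mapping for categorical labels, as described above
CAT2VAL = {"negative": -1.0, "weakneg": -0.5, "neutral": 0.0,
           "both": 0.0, "positive": 1.0}

def normalize(lex):
    """Map categorical labels to values; scale real-valued lexicons
    by the maximum absolute polarity found in that lexicon."""
    if any(isinstance(v, str) for v in lex.values()):
        return {w: CAT2VAL[v] for w, v in lex.items()}
    m = max(abs(v) for v in lex.values()) or 1.0
    return {w: v / m for w, v in lex.items()}

def merge_lexicons(lexicons):
    """Return word -> length-n polarity vector over the union of all lexicons,
    filling a lexicon's missing entry with the average of the known values."""
    norm = [normalize(l) for l in lexicons]
    merged = {}
    for w in set().union(*norm):
        known = [l[w] for l in norm if w in l]
        avg = sum(known) / len(known)
        merged[w] = [l.get(w, avg) for l in norm]
    return merged

merged = merge_lexicons([
    {"adorable": "positive", "disappointed": "negative"},  # categorical lexicon
    {"adorable": 0.55, "disappointed": -1.1},              # real-valued lexicon
])
print(merged["adorable"])   # [1.0, 0.5]
```

Out-of-vocabulary words would simply map to `[0.0] * n`, matching the zero-vector fallback described above.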

Lexicon Integration
To merge the lexical features obtained from U into the baseline, we first apply a linear transformation to the lexical features in order to preserve the original sentiment distribution while obtaining compatible dimensions for further computation. The attention vector learned as in the baseline is then applied to the transformed lexical features. Finally, all information is added together to perform the final prediction.
Formally, let V_l ∈ R^{n×N} be the lexical matrix for the sentence. V_l is first transformed linearly:

L = W_e V_l

where W_e ∈ R^{d×n} is a transformation matrix learned during training. Then, the attention vector learned from the concatenation of H and v_a ⊗ e_N is applied to L:

t = L α^T

Finally, h* is updated and passed to the output layer for prediction:

h* = tanh(W_p r + W_x h_N + W_l t)

where W_l ∈ R^{d×d} is a projection parameter like W_p and W_x. The model architecture is shown in Figure 1.
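A minimal NumPy sketch of the integration step; `W_e` is a hypothetical name for the linear transformation of the lexical features, all matrices are random stand-ins for learned parameters, and the attention weights are simply uniform here:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, n = 5, 4, 4               # sentence length, LSTM size, number of lexicons

H = rng.standard_normal((d, N))     # LSTM hidden states from the baseline
alpha = np.full(N, 1.0 / N)         # attention weights from the baseline (toy: uniform)
V_l = rng.standard_normal((n, N))   # lexical matrix: one polarity vector per word

W_e = rng.standard_normal((d, n))   # hypothetical lexical transformation matrix
L = W_e @ V_l                       # linearly transformed lexical features, (d, N)
t = L @ alpha                       # attention applied to the lexical features, (d,)

# Updated final representation: the baseline terms plus the lexical term W_l t
r = H @ alpha
W_p, W_x, W_l = (rng.standard_normal((d, d)) for _ in range(3))
h_star = np.tanh(W_p @ r + W_x @ H[:, -1] + W_l @ t)
```

The key point is that the same attention vector weights both the hidden states and the lexical features, so lexical polarity at highly attended positions contributes most to the prediction.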

Attention Regularization
As observed in both Figure 2 and Figure 3, the attention weights in ATLX seem less sparse across the sentence, while those in the baseline focus only on the final part of the sentence. It is reasonable to think that the attention vector might be over-fitting in some cases, causing the network to ignore other relevant positions, since the attention vector is learned purely from training examples. Thus we propose a simple attention regularizer to further validate our hypothesis, which consists of adding to the loss function a parameterized standard deviation or negative entropy term over the attention weights. The idea is to prevent the attention vector from placing heavy weights on only a few positions; instead, it is preferred to have higher weights on more positions. Formally, the attention-regularized loss is computed as:

loss = -Σ_i y_i log ŷ_i + λ‖Θ‖² + ε R(α)  (2)

Compared to equation (1), a second regularization term is added, where ε is the hyperparameter for the attention regularizer; R stands for the regularization term defined in (3) or (4); and α is the distribution of attention weights. Note that during implementation, the attention weights for batch padding positions are excluded from α. We experiment with two different regularizers: one uses the standard deviation of α, defined in equation (3); the other uses the negative entropy of α, defined in equation (4):

R_std(α) = std(α)  (3)
R_ent(α) = Σ_j α_j log α_j  (4)

To initialize word vectors with pretrained word embeddings, the 300-dimensional GloVe vectors trained on 840B tokens are used, as described in the original paper.
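The two regularizers can be sketched as follows; both reach their minimum at the uniform distribution, which is what pushes the attention weights to spread out when the term is added to the loss:

```python
import numpy as np

def std_regularizer(alpha):
    # Standard deviation of the attention weights: 0 when perfectly uniform
    return float(np.std(alpha))

def neg_entropy_regularizer(alpha, eps=1e-12):
    # Negative entropy sum_j alpha_j log alpha_j: minimized by the uniform distribution
    return float(np.sum(alpha * np.log(alpha + eps)))

def regularized_loss(ce_loss, l2_term, alpha, epsilon, reg=std_regularizer):
    # Cross-entropy + L2 + epsilon * R(alpha), with epsilon a hyperparameter
    return ce_loss + l2_term + epsilon * reg(alpha)

uniform = np.full(4, 0.25)
peaked = np.array([0.97, 0.01, 0.01, 0.01])

# Both regularizers penalize the peaked distribution more than the uniform one
assert std_regularizer(uniform) < std_regularizer(peaked)
assert neg_entropy_regularizer(uniform) < neg_entropy_regularizer(peaked)
```

In the actual model the term is added inside the training graph so that its gradient flows back into the attention parameters; the sketch above only illustrates the ordering the penalty induces.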

Lexicons
As shown in Table 1, we merge four existing, publicly available lexicons into one. The merged lexicon U described in section 3.2.1 is used for our experiments. After the union, the following post-processing is carried out: {bar, try, too} are removed from U since they are unreasonably annotated as negative by MPQA and Opener; {n't, not} are added to U with −1 polarity for negation.

Evaluation
Cross validation is applied to measure the performance of each model. In all experiments, the training set is randomly shuffled and split into 6 folds with a fixed random seed. According to the code released by Wang et al. (2016), a development set containing 528 examples is used, which is roughly 1/6 of the training corpus. In order to remain faithful to the original implementation, we therefore evaluate our models with 6-fold cross validation.
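The splitting scheme can be sketched as below; the corpus size of 3168 is chosen purely for illustration so that each fold contains exactly 528 examples:

```python
import numpy as np

def six_fold_splits(n_examples, n_folds=6, seed=42):
    """Shuffle indices with a fixed seed and yield (train, dev) index splits."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_examples)
    folds = np.array_split(idx, n_folds)
    for k in range(n_folds):
        dev = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train, dev

splits = list(six_fold_splits(3168))   # hypothetical corpus size: 6 folds of 528
```

Fixing the seed makes the folds identical across models, so the per-fold accuracies (and their variance) reported in Table 2 are directly comparable.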
As shown in Table 2, compared to the baseline system, ATLX not only improves in terms of accuracy, but also significantly reduces the variance of the performance across different sets. On the other hand, by adding attention regularization to the baseline system without introducing lexical features, both the standard deviation regularizer (base_std) and the negative entropy regularizer (base_ent) contribute positively, with base_ent yielding the largest improvement. By combining attention regularization and lexical features, the model is able to improve further, but the difference is too small to draw a strong conclusion.

ATLX
As described previously, the overall performance of the baseline is enhanced by leveraging lexical features independent of the training data, which makes the model more robust and flexible. In the example in Figure 2, although the baseline pays relatively high attention to the words disappointed and dungeon, it is not able to recognize them as clear indicators of negative polarity, while ATLX correctly predicts negative for both examples. On the other hand, it is worth mentioning that the computation of the attention vector α does not take the lexical features V_l into account. Although it is natural to think that adding V_l as input for computing α would be a good option, the results of ATLX* in Table 2 suggest otherwise.
Where does the improvement of ATLX come from: the lexical features themselves, or the way we introduce them to the system? We conduct a supporting experiment (base_LX) to verify this, which consists of naively concatenating each input word vector with its associated lexical vector and feeding the extended embedding to the baseline. As demonstrated in Table 2, comparing the baseline with base_LX shows that when lexical features are merged into the network without a carefully designed mechanism, the model is not able to leverage the new information; on the contrary, the overall performance decreases.

Attention Regularization
As shown in Figure 3, when comparing ATLX with the baseline, we find that although the lexicon only provides non-neutral polarity information for three words, the attention weights of ATLX are less sparse and more spread out than those of the baseline. Moreover, this effect is general: the standard deviation of the attention weight distribution over the test set in ATLX (0.0219) is significantly lower than in the baseline (0.0354).
This makes us think that the attention weights might be over-fitting in some cases, as they are learned purely from training examples. By giving too much weight to particular words in a sentence, the network may ignore other positions which could provide key information for classifying the polarity. For instance, the example in Figure 3 shows that the baseline, which predicts positive, is "focusing" on the final part of the sentence, mostly the word easy, while ignoring the bad manners coming before, which is key for judging the polarity of the sentence given the aspect service. In contrast, the same baseline model trained with attention regularized by standard deviation is able to correctly predict negative just by "focusing" a little bit more on the "bad manners" part.
However, the hard regularization by standard deviation might not be ideal, as the optimal minimum value of the regularizer implies that all words in the sentence have homogeneous weights, which is the opposite of what the attention mechanism is meant to achieve. Regarding the negative entropy regularizer, taking into account that the attention weights are the output of softmax and thus normalized to sum to 1, although the minimum value of this term would also imply homogeneous weights of 1/N, it is interesting to see that with an almost evenly distributed α, the model remains sensitive to the few positions with relatively higher weights; e.g. in Figure 3, the same sentence with entropy regularization demonstrates that although most positions are closely weighted, the model is still able to differentiate key positions, even with a weight difference of 0.01, and predict correctly.

Parameter name | Value
base_std | 1e-3
base_ent | 0.5
ATLX_std | 1e-4
ATLX_ent | 0.006

Table 3: Attention regularization parameter settings

Parameter Settings
In our experiments, apart from the newly introduced parameters for attention regularization, we follow Wang et al. (2016) and their released code.
More specifically, we set the batch size to 25; the aspect embedding dimension d_a to 300, the same as the GloVe vector dimension; the LSTM cell size d to 300; and the number of LSTM layers to 1. Dropout with 0.5 keep probability is applied to h*; the AdaGrad optimizer is used with an initial accumulator value of 1e-10; the learning rate is set to 0.01; the L2 regularization parameter λ is set to 0.001; network parameters are initialized from a random uniform distribution with min and max values of -0.01 and 0.01; all network parameters except the word embeddings are included in the L2 regularizer. The hyperparameters for attention regularization are shown in Table 3.
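For reference, the settings above can be collected into a single configuration dict; the values are as reported in the text, while the key names are our own:

```python
# Hyperparameter settings following Wang et al. (2016); key names are illustrative
CONFIG = {
    "batch_size": 25,
    "aspect_embedding_dim": 300,   # d_a, same as the GloVe vector dimension
    "lstm_size": 300,              # d
    "lstm_layers": 1,
    "dropout_keep_prob": 0.5,      # applied to h*
    "optimizer": "AdaGrad",
    "initial_accumulator_value": 1e-10,
    "learning_rate": 0.01,
    "l2_lambda": 0.001,
    "init_uniform_range": (-0.01, 0.01),
}
```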

Conclusion and Future Works
In this paper, we describe our approach of directly leveraging numerical polarity features provided by existing lexicon resources in an aspect-based sentiment analysis setting with an attention LSTM neural network. Meanwhile, we stress that the attention mechanism may over-fit on particular positions, blinding the model to other relevant positions, and we explore two regularizers to reduce this over-fitting effect. The experimental results demonstrate the effectiveness of our approach.
For future work, since the lexical features can be leveraged directly by the network to boost performance, a fine-grained lexicon that is domain and aspect specific could in principle further improve similar models. On the other hand, although the negative entropy regularizer is able to reduce the over-fitting effect, a better attention framework could be researched, so that the attention distribution is sharp and sparse yet at the same time able to "focus" on more positions.