Transformation Networks for Target-Oriented Sentiment Classification

Target-oriented sentiment classification aims at classifying sentiment polarities over individual opinion targets in a sentence. RNN with attention seems a good fit for the characteristics of this task, and indeed it achieves the state-of-the-art performance. After re-examining the drawbacks of attention mechanism and the obstacles that block CNN to perform well in this classification task, we propose a new model that achieves new state-of-the-art results on a few benchmarks. Instead of attention, our model employs a CNN layer to extract salient features from the transformed word representations originated from a bi-directional RNN layer. Between the two layers, we propose a component which first generates target-specific representations of words in the sentence, and then incorporates a mechanism for preserving the original contextual information from the RNN layer.


Introduction
Target-oriented (also mentioned as "target-level" or "aspect-level" in some works) sentiment classification aims to determine sentiment polarities over "opinion targets" that explicitly appear in the sentences (Liu, 2012). For example, in the sentence "I am pleased with the fast log on, and the long battery life", the user mentions two targets * The work was done when Xin Li was an intern at Tencent AI Lab. This project is substantially supported by a grant from the Research Grant Council of the Hong Kong Special Administrative Region, China (Project Code: 14203414). 1 Our code is open-source and available at https:// github.com/lixin4ever/TNet "log on" and "better life", and expresses positive sentiments over them. The task is usually formulated as predicting a sentiment category for a (target, sentence) pair.
Recurrent Neural Networks (RNNs) with attention mechanism, firstly proposed in machine translation (Bahdanau et al., 2014), is the most commonly-used technique for this task. For example, Wang et al. (2016); Tang et al. (2016b); ; Liu and Zhang (2017); Ma et al. (2017) and  employ attention to measure the semantic relatedness between each context word and the target, and then use the induced attention scores to aggregate contextual features for prediction. In these works, the attention weight based combination of word-level features for classification may introduce noise and downgrade the prediction accuracy. For example, in "This dish is my favorite and I always get it and never get tired of it.", these approaches tend to involve irrelevant words such as "never" and "tired" when they highlight the opinion modifier "favorite". To some extent, this drawback is rooted in the attention mechanism, as also observed in machine translation (Luong et al., 2015) and image captioning .
Another observation is that the sentiment of a target is usually determined by key phrases such as "is my favorite". By this token, Convolutional Neural Networks (CNNs)-whose capability for extracting the informative n-gram features (also called "active local features") as sentence representations has been verified in (Kim, 2014;Johnson and Zhang, 2015)-should be a suitable model for this classification problem. However, CNN likely fails in cases where a sentence expresses different sentiments over multiple targets, such as "great food but the service was dreadful!". One reason is that CNN cannot fully explore the target information as done by RNN-based meth-ods (Tang et al., 2016a). 2 Moreover, it is hard for vanilla CNN to differentiate opinion words of multiple targets. Precisely, multiple active local features holding different sentiments (e.g., "great food" and "service was dreadful") may be captured for a single target, thus it will hinder the prediction.
We propose a new architecture, named Target-Specific Transformation Networks (TNet), to solve the above issues in the task of target sentiment classification. TNet firstly encodes the context information into word embeddings and generates the contextualized word representations with LSTMs. To integrate the target information into the word representations, TNet introduces a novel Target-Specific Transformation (TST) component for generating the target-specific word representations. Contrary to the previous attention-based approaches which apply the same target representation to determine the attention scores of individual context words, TST firstly generates different representations of the target conditioned on individual context words, then it consolidates each context word with its tailor-made target representation to obtain the transformed word representation. Considering the context word "long" and the target "battery life" in the above example, TST firstly measures the associations between "long" and individual target words. Then it uses the association scores to generate the target representation conditioned on "long". After that, TST transforms the representation of "long" into its target-specific version with the new target representation. Note that "long" could also indicate a negative sentiment (say for "startup time"), and the above TST is able to differentiate them.
As the context information carried by the representations from the LSTM layer will be lost after the non-linear TST, we design a contextpreserving mechanism to contextualize the generated target-specific word representations. Such mechanism also allows deep transformation structure to learn abstract features 3 . To help the CNN feature extractor locate sentiment indicators more accurately, we adopt a proximity strategy to scale the input of convolutional layer with positional relevance between a word and the target.
2 One method could be concatenating the target representation with each word representation, but the effect as shown in (Wang et al., 2016) is limited.
3 Abstract features usually refer to the features ultimately useful for the task (Bengio et al., 2013;LeCun et al., 2015).
In summary, our contributions are as follows: • TNet adapts CNN to handle target-level sentiment classification, and its performance dominates the state-of-the-art models on benchmark datasets.
• A novel Target-Specific Transformation component is proposed to better integrate target information into the word representations.
• A context-preserving mechanism is designed to forward the context information into a deep transformation architecture, thus, the model can learn more abstract contextualized word features from deeper networks.
The architecture of the proposed Target-Specific Transformation Networks (TNet) is shown in Fig. 1. The bottom layer is a BiLSTM which transforms the input x = {x 1 , x 2 , ..., x n } ∈ R n×dimw into the contextualized word representations h (0) = {h n } ∈ R n×2dim h (i.e. hidden states of BiLSTM), where dim w and dim h denote the dimensions of the word embeddings and the hidden representations respectively. The middle part, the core part of our TNet, consists of L Context-Preserving Transformation (CPT) layers. The CPT layer incorporates the target information into the word representations via a novel Target-Specific Transformation (TST) component. CPT also contains a contextpreserving mechanism, resembling identity mapping (He et al., 2016a,b) and highway connection (Srivastava et al., 2015a,b), allows preserving the context information and learning more abstract word-level features using a deep network. The top most part is a position-aware convolutional layer which first encodes positional relevance between a word and a target, and then extracts informative features for classification.

Bi-directional LSTM Layer
As observed in Lai et al. (2015), combining contextual information with word embeddings is an effective way to represent a word in convolutionbased architectures. TNet also employs a BiL-STM to accumulate the context information for each word of the input sentence, i.e., the bottom part in Fig. 1. For simplicity and space issue, we denote the operation of an LSTM unit on x i as LSTM(x i ). Thus, the contextualized word representation h (0) i ∈ R 2dim h is obtained as follows: (1)

Context-Preserving Transformation
The above word-level representation has not considered the target information yet. Traditional attention-based approaches keep the word-level features static and aggregate them with weights as the final sentence representation. In contrast, as shown in the middle part in Fig. 1, we introduce multiple CPT layers and the detail of a single CPT is shown in Fig. 2. In each CPT layer, a tailor-made TST component that aims at better consolidating word representation and target representation is proposed. Moreover, we design a context-preserving mechanism enabling the learning of target-specific word representations in a deep neural architecture.

Target-Specific Transformation
TST component is depicted with the TST block in Fig. 2. The first task of TST is to generate the representation of the target. Previous methods (Chen  Liu and Zhang, 2017) average the embeddings of the target words as the target representation. This strategy may be inappropriate in some cases because different target words usually do not contribute equally. For example, in the target "amd turin processor", the word "processor" is more important than "amd" and "turin", because the sentiment is usually conveyed over the phrase head, i.e.,"processor", but seldom over modifiers (such as brand name "amd"). Ma et al. (2017) attempted to overcome this issue by measuring the importance score between each target word representation and the averaged sentence vector. However, it may be ineffective for sentences expressing multiple sentiments (e.g., "Air has higher resolution but the fonts are small."), because taking the average tends to neutralize different sentiments. We propose to dynamically compute the importance of target words based on each sentence word rather than the whole sentence. We first employ another BiLSTM to obtain the target word repre- Then, we dynamically associate them with each word w i in the sentence to tailor-make target representation r τ i at the time step i: where the function F measures the relatedness between the j-th target word representation h τ j and the i-th word-level representation h (l) i : Finally, the concatenation of r τ i and h (l) i is fed into a fully-connected layer to obtain the i-th targetspecific word representationh i (l) : where g( * ) is a non-linear activation function and ":" denotes vector concatenation. W τ and b τ are the weights of the layer.

Context-Preserving Mechanism
After the non-linear TST (see Eq. 5), the context information captured with contextualized representations from the BiLSTM layer will be lost since the mean and the variance of the features within the feature vector will be changed. To take advantage of the context information, which has been proved to be useful in (Lai et al., 2015), we investigate two strategies: Lossless Forwarding (LF) and Adaptive Scaling (AS), to pass the context information to each following layer, as depicted by the block "LF/AS" in Fig. 2. Accordingly, the model variants are named TNet-LF and TNet-AS.
Lossless Forwarding. This strategy preserves context information by directly feeding the features before the transformation to the next layer. Specifically, the input h (l+1) i of the (l + 1)-th CPT layer is formulated as: is the output of TST in this layer. We unfold the recursive form of Eq. 6 as follows: Here, we denoteh we can see that the output of each layer will contain the contextualized word representations (i.e., h (0) i ), thus, the context information is encoded into the transformed features. We call this strategy "Lossless Forwarding" because the contextualized representations and the transformed representations (i.e., TST(h (l) i )) are kept unchanged during the feature combination.
Adaptive Scaling. Lossless Forwarding introduces the context information by directly adding back the contextualized features to the transformed features, which raises a question: Can the weights of the input and the transformed features be adjusted dynamically? With this motivation, we propose another strategy, named "Adaptive Scaling". Similar to the gate mechanism in RNN variants (Jozefowicz et al., 2015), Adaptive Scaling introduces a gating function to control the passed proportions of the transformed features and the input features. The gate t (l) as follows: where t (l) i is the gate for the i-th input of the l-th CPT layer, and σ is the sigmoid activation function. Then we perform convex combination of h Here, denotes element-wise multiplication. The non-recursive form of this equation is as follows (for clarity, we ignore the subscripts): Thus, the context information is integrated in each upper layer and the proportions of the contextualized representations and the transformed representations are controlled by the computed gates in different transformation layers.

Convolutional Feature Extractor
Recall that the second issue that blocks CNN to perform well is that vanilla CNN may associate a target with unrelated general opinion words which are frequently used as modifiers for different targets across domains. For example, "service" in "Great food but the service is dreadful" may be associated with both "great" and "dreadful". To solve it, we adopt a proximity strategy, which is observed effective in Li and Lam, 2017). The idea is a closer opinion word is more likely to be the actual modifier of the target.  Specifically, we first calculate the position relevance v i between the i-th word and the target 4 : where k is the index of the first target word, C is a pre-specified constant, and m is the length of the target w τ . Then, we use v to help CNN locate the correct opinion w.r.t. the given target: Based on Eq. 10 and Eq. 11, the words close to the target will be highlighted and those far away will be downgraded. v is also applied on the intermediate output to introduce the position information into each CPT layer. Then we feed the weighted h (L) to the convolutional layer, i.e., the top-most layer in Fig. 1, to generate the feature map c ∈ R n−s+1 as follows: where h (L) i:i+s−1 ∈ R s·dim h is the concatenated vector ofĥ i+s−1 , and s is the kernel size. w conv ∈ R s·dim h and b conv ∈ R are learnable weights of the convolutional kernel. To capture the most informative features, we apply max pooling (Kim, 2014) and obtain the sentence representation z ∈ R n k by employing n k kernels: Finally, we pass z to a fully connected layer for sentiment prediction: where W f and b f are learnable parameters. 4 As we perform sentence padding, it is possible that the index i is larger than the actual length n of the sentence.

Experimental Setup
As shown in Table 1, we evaluate the proposed TNet on three benchmark datasets: LAPTOP and REST are from SemEval ABSA challenge (Pontiki et al., 2014), containing user reviews in laptop domain and restaurant domain respectively. We also remove a few examples having the "conflict label" as done in ; TWITTER is built by Dong et al. (2014), containing twitter posts. All tokens are lowercased without removal of stop words, symbols or digits, and sentences are zero-padded to the length of the longest sentence in the dataset. Evaluation metrics are Accuracy and Macro-Averaged F1 where the latter is more appropriate for datasets with unbalanced classes. We also conduct pairwise t-test on both Accuracy and Macro-Averaged F1 to verify if the improvements over the compared models are reliable.
TNet is compared with the following methods. • CNN-ASP: It is a CNN-based model implemented by us which directly concatenates target representation to each word embedding; • TD-LSTM (Tang et al., 2016a): It employs two LSTMs to model the left and right contexts of the target separately, then performs predictions based on concatenated context representations; • MemNet (Tang et al., 2016b): It applies attention mechanism over the word embeddings multiple times and predicts sentiments based on the top-most sentence representations; • BILSTM-ATT-G (Liu and Zhang, 2017): It models left and right contexts using two attention-based LSTMs and introduces gates to measure the importance of left context, right context, and the entire sentence for the prediction; • RAM : RAM is a multilayer architecture where each layer consists of attention-based aggregation of word features and a GRU cell to learn the sentence representation.
We run the released codes of TD-LSTM and BILSTM-ATT-G to generate results, since their papers only reported results on TWITTER. We also rerun MemNet on our datasets and evaluate it with both accuracy and Macro-Averaged F1. 5 We use pre-trained GloVe vectors (Pennington et al., 2014) to initialize the word embeddings and the dimension is 300 (i.e., dim w = 300). For out-of-vocabulary words, we randomly sample their embeddings from the uniform distribution U(−0.25, 0.25), as done in (Kim, 2014). We only use one convolutional kernel size because it was observed that CNN with single optimal kernel size is comparable with CNN having multiple kernel sizes on small datasets (Zhang and Wallace, 2017). To alleviate overfitting, we apply dropout on the input word embeddings of the LSTM and the ultimate sentence representation z. All weight matrices are initialized with the uniform distribution U(−0.01, 0.01) and the biases are initialized as zeros. The training objective is cross-entropy, and Adam (Kingma and Ba, 2015) is adopted as the optimizer by following the learning rate and the decay rates in the original paper.
The hyper-parameters of TNet-LF and TNet-AS are listed in Table 2. Specifically, all hyperparameters are tuned on 20% randomly held-out training data and the hyper-parameter collection producing the highest accuracy score is used for testing. Our model has comparable number of parameters compared to traditional LSTM-based models as we reuse parameters in the transformation layers and BiLSTM. 6 Table 3, both TNet-LF and TNet-AS consistently achieve the best performance on all datasets, which verifies the efficacy of our whole TNet model. Moreover, TNet can perform well for different kinds of user generated content, such as product reviews with relatively formal sentences in LAPTOP and REST, and tweets with more ungrammatical sentences in TWITTER. The reason is the CNN-based feature extractor arms TNet with more power to extract accurate features from ungrammatical sentences. Indeed, we can also observe that another CNN-based baseline, i.e., CNN-ASP implemented by us, also obtains good results on TWITTER.

As shown in
On the other hand, the performance of those comparison methods is mostly unstable. For the tweet in TWITTER, the competitive BILSTM-ATT-G and RAM cannot perform as effective as they do for the reviews in LAPTOP and REST, due to the fact that they are heavily rooted in LSTMs and the ungrammatical sentences hinder their ca-6 All experiments are conducted on a single NVIDIA GTX 1080. The prediction cost of a sentence is about 2 ms.  Table 3: Experimental results (%). The results with symbol" " are retrieved from the original papers, and those starred ( * ) one are from Dong et al. (2014). The marker † refers to p-value < 0.01 when comparing with BILSTM-ATT-G, while the marker ‡ refers to p-value < 0.01 when comparing with RAM.
pability in capturing the context features. Another difficulty caused by the ungrammatical sentences is that the dependency parsing might be errorprone, which will affect those methods such as AdaRNN using dependency information. From the above observations and analysis, some takeaway message for the task of target sentiment classification could be: • LSTM-based models relying on sequential information can perform well for formal sentences by capturing more useful context features; • For ungrammatical text, CNN-based models may have some advantages because CNN aims to extract the most informative n-gram features and is thus less sensitive to informal texts without strong sequential patterns.

Performance of Ablated TNet
To investigate the impact of each component such as deep transformation, context-preserving mechanism, and positional relevance, we perform comparison between the full TNet models and its ablations (the third group in Table 3). After removing the deep transformation (i.e., the techniques introduced in Section 2.2), both TNet-LF and TNet-AS are reduced to TNet w/o transformation (where position relevance is kept), and their results in both accuracy and F1 measure are incomparable with those of TNet. It shows that the integration of target information into the word-level representations is crucial for good performance.
Comparing the results of TNet and TNet w/o context (where TST and position relevance are kept), we observe that the performance of TNet w/o context drops significantly on LAPTOP and REST 7 , while on TWITTER, TNet w/o context performs very competitive (p-values with TNet-LF and TNet-AS are 0.066 and 0.053 respectively for Accuracy). Again, we could attribute this phenomenon to the ungrammatical user generated content of twitter, because the contextpreserving component becomes less important for such data. TNet w/o context performs consistently better than TNet w/o transformation, which verifies the efficacy of the target specific transformation (TST), before applying context-preserving.
As for the position information, we conduct statistical t-test between TNet-LF/AS and TNet-LF/AS w/o position together with performance comparison. All of the produced p-values are less than 0.05, suggesting that the improvements brought in by position information are significant.

CPT versus Alternatives
The next interesting question is what if we replace the transformation module (i.e., the CPT layers in Fig.1) of TNet with other commonly-used components? We investigate two alternatives: attention mechanism and fully-connected (FC) layer, resulting in three pipelines as shown in the second group of Table 3 (position relevance is kept for them).
LSTM-ATT-CNN applies attention as the alternative 8 , and it does not need the contextpreserving mechanism. It performs unexceptionally worse than the TNet variants. We are surprised that LSTM-ATT-CNN is even worse than TNet w/o transformation (a pipeline simply removing the transformation module) on TWITTER. More concretely, applying attention results in negative effect on TWITTER, which is consistent with the observation that all those attention-based state-of-the-art methods (i.e., TD-LSTM, Mem-Net, BILSTM-ATT-G, and RAM) cannot perform well on TWITTER.
LSTM-FC-CNN-LF and LSTM-FC-CNN-AS are built by applying FC layer to replace TST and keeping the context-preserving mechanism (i.e., LF and AS). Specifically, the concatenation of word representation and the averaged target vector is fed to the FC layer to obtain targetspecific features. Note that LSTM-FC-CNN-LF/AS are equivalent to TNet-LF/AS when processing single-word targets (see Eq. 3). They obtain competitive results on all datasets: comparable with or better than the state-of-the-art methods. The TNet variants can still outperform LSTM-FC-CNN-LF/AS with significant gaps, e.g., on LAPTOP and REST, the accuracy gaps between TNet-LF and LSTM-FC-CNN-LF are 0.42% (p < 0.03) and 0.38% (p < 0.04) respectively.

Impact of CPT Layer Number
As our TNet involves multiple CPT layers, we investigate the effect of the layer number L. Specifically, we conduct experiments on the held-out training data of LAPTOP and vary L from 2 to 10, increased by 2. The cases L=1 and L=15 are also included. The results are illustrated in Figure 3. We can see that both TNet-LF and TNet-AS achieve the best results when L=2. While increasing L, the performance is basically becoming worse. For large L, the performance of TNet-AS 8 We tried different attention mechanisms and report the best one here, namely, dot attention (Luong et al., 2015). generally becomes more sensitive, it is probably because AS involves extra parameters (see Eq 9) that increase the training difficulty. Table 4 shows some sample cases. The input targets are wrapped in the brackets with true labels given as subscripts. The notations P, N and O in the table represent positive, negative and neutral respectively. For each sentence, we underline the target with a particular color, and the text of its corresponding most informative n-gram feature 9 captured by TNet-AS (TNet-LF captures very similar features) is in the same color (so color printing is preferred). For example, for the target "resolution" in the first sentence, the captured feature is "Air has higher". Note that as discussed above, the CNN layer of TNet captures such features with the size-three kernels, so that the features are trigrams. Each of the last features of the second and seventh sentences contains a padding token, which is not shown.

Case Study
Our TNet variants can predict target sentiment more accurately than RAM and BILSTM-ATT-G in the transitional sentences such as the first sentence by capturing correct trigram features. For the third sentence, its second and third most informative trigrams are "100% . PAD" and "' s not", being used together with "features make up", our models can make correct predictions. Moreover, TNet can still make correct prediction when the explicit opinion is target-specific. For example, (P, P, P) (P, P, P) (P, P, P) (P, P, P) 7. The [staff] N should be a bit more friendly . P P P P Table 4: Example predictions, color printing is preferred. The input targets are wrapped in brackets with the true labels given as subscripts. indicates incorrect prediction.
"long" in the fifth sentence is negative for "startup time", while it could be positive for other targets such as "battery life" in the sixth sentence. The sentiment of target-specific opinion word is conditioned on the given target. Our TNet variants, armed with the word-level feature transformation w.r.t. the target, is capable of handling such case.
We also find that all these models cannot give correct prediction for the last sentence, a commonly used subjunctive style. In this case, the difficulty of prediction does not come from the detection of explicit opinion words but the inference based on implicit semantics, which is still quite challenging for neural network models.

Related Work
Apart from sentence level sentiment classification (Kim, 2014;Shi et al., 2018), aspect/target level sentiment classification is also an important research topic in the field of sentiment analysis. The early methods mostly adopted supervised learning approach with extensive hand-coded features (Blair-Goldensohn et al., 2008;Titov and McDonald, 2008;Jiang et al., 2011;Kiritchenko et al., 2014;Wagner et al., 2014;Vo and Zhang, 2015), and they fail to model the semantic relatedness between a target and its context which is critical for target sentiment analysis. Dong et al. (2014) incorporate the target information into the feature learning using dependency trees. As observed in previous works, the performance heavily relies on the quality of dependency parsing. Tang et al. (2016a) propose to split the context into two parts and associate target with contextual features separately. Similar to (Tang et al., 2016a), Zhang et al. (2016) develop a three-way gated neural network to model the in-teraction between the target and its surrounding contexts. Despite the advantages of jointly modeling target and context, they are not capable of capturing long-range information when some critical context information is far from the target. To overcome this limitation, researchers bring in the attention mechanism to model target-context association (Tang et al., 2016a,b;Wang et al., 2016;Liu and Zhang, 2017;Ma et al., 2017;Tay et al., 2017). Compared with these methods, our TNet avoids using attention for feature extraction so as to alleviate the attended noise.