Learning Explicit and Implicit Structures for Targeted Sentiment Analysis

Targeted sentiment analysis is the task of jointly predicting target entities and their associated sentiment information. Existing research efforts mostly regard this joint task as a sequence labeling problem, building models that can capture explicit structures in the output space. However, the importance of capturing implicit global structural information that resides in the input space is largely unexplored. In this work, we argue that both types of information (implicit and explicit structural information) are crucial for building a successful targeted sentiment analysis model. Our experimental results show that properly capturing both types of information leads to better performance than competitive existing approaches. We also conduct extensive experiments to investigate our model's effectiveness and robustness.


Introduction
Targeted sentiment analysis (TSA) is an important task useful for public opinion mining (Pang and Lee, 2008; Liu, 2010; Ortigosa et al., 2014; Smailović et al., 2013; Li and Wu, 2010). The task focuses on predicting the sentiment information towards a specific target phrase, which is usually a named entity, in a given input sentence. Currently, TSA in the literature may refer to either of two possible tasks under two different setups: 1) predicting the sentiment polarity for a given specific target phrase (Dong et al., 2014; Wang et al., 2016; Zhang et al., 2016; Xue and Li, 2018); 2) jointly predicting the targets together with the sentiment polarity assigned to each target (Mitchell et al., 2013; Zhang et al., 2015; Li and Lu, 2017; Ma et al., 2018). In this paper, we focus on the latter setup, which was originally proposed by Mitchell et al. (2013). Figure 1 presents an example sentence containing three targets. Each target is associated with a sentiment, where we use + to denote positive polarity, 0 to denote neutral polarity and − to denote negative polarity.
Existing research efforts mostly regard this task as a sequence labeling problem by assigning a tag to each word token, where the tags are typically designed in a way that captures both the target boundary and the targeted sentiment polarity information. Existing approaches (Mitchell et al., 2013; Zhang et al., 2015; Ma et al., 2018) build models based on conditional random fields (CRF) (Lafferty et al., 2001) or structural support vector machines (SSVM) (Taskar et al., 2005; Tsochantaridis et al., 2005) to explicitly model the sentiment information with structured outputs, where each targeted sentiment prediction corresponds to exactly one fixed output. While effective, such models are unable to capture certain long-distance dependencies between sentiment keywords and their targets. To remedy this issue, Li and Lu (2017) proposed their "sentiment scope" model to learn flexible output representations. For example, three text spans with their corresponding targets in bold are presented in Figure 1, where each target's sentiment is characterized by the words appearing in the corresponding text span. Their model learns from data, for each target, a latent text span used for attributing its sentiment, resulting in flexible output structures.
However, we note that there are two major limitations with the approach of Li and Lu (2017). First, their model requires a large number of hand-crafted discrete features. Second, the model relies on a strong assumption that the latent sentiment spans do not overlap with one another. For example, in Figure 1, their model will not be able to capture the interaction between the target word "OZ" in the first sentiment span and the keyword "amazing" due to the assumptions made on the explicit structures in the output space. One idea to resolve this issue is to design an alternative mechanism to capture such useful structural information that resides in the input space.
On the other hand, recent literature shows that feature learning mechanisms such as self-attention have been successful for the task of sentiment prediction when targets are given (Wang and Lu, 2018; He et al., 2018; Fan et al., 2018) (i.e., under the first setup mentioned above). Such approaches essentially attempt to learn rich implicit structural information in the input space that captures the interactions between a given target and all other word tokens within the sentence. Such implicit structures are then used to generate a sentiment summary representation towards the given target, leading to a performance boost.
However, to date, capturing rich implicit structures in the joint prediction task that we focus on (i.e., the second setup) remains largely unexplored. Unlike the first setup, in our setup the targets are not given, so we need to handle exponentially many possible combinations of targets in the joint task. This makes it challenging to design an algorithm that captures both the implicit structural information from the input space and the explicit structural information from the output space.
Motivated by these limitations and challenges, we present a novel approach that is able to efficiently and effectively capture the explicit and implicit structural information for TSA. We make the following key contributions in this work:
• We propose a model, called EI, that is able to properly integrate both explicit and implicit structural information. The model is able to learn flexible explicit structural information in the output space while being able to efficiently learn rich implicit structures by LSTM and self-attention for exponentially many possible combinations of targets in a given sentence.
• We conducted extensive experiments to validate our claim that both explicit and implicit structures are indispensable in such a task, and demonstrate the effectiveness and robustness of our model.
Figure 2: The structured output for representing entities and their sentiments with boundaries.

Approach
Our objective is to design a model to extract targets as well as their associated targeted sentiments for a given sentence in a joint manner. As we mentioned before, we believe that both explicit and implicit structures are crucial for building a successful model for TSA. Specifically, we first present an approach to learn flexible explicit structures based on latent CRF, and next present an approach to efficiently learn the rich implicit structures for exponentially many possible combinations of targets.

Explicit Structure
Motivated by Li and Lu (2017), we design an approach based on latent CRF to model flexible sentiment spans to capture better explicit structures in the output space. To do so, we first integrate target and targeted sentiment information into a label sequence by using 3 types of tags in our EI model: $B_p$, $A_p$, and $E_{\epsilon,p}$, where $p \in \{+, -, 0\}$ indicates the sentiment polarity and $\epsilon \in \{B, M, E, S\}$ denotes the BMES tagging scheme. We explain the meaning of each type of tag as follows.
• $B_p$ is used to denote that the current word is part of a sentiment span with polarity $p$, but appears before the target word or exactly as the first word of the target.
• $A_p$ is used to denote that the current word is part of a sentiment span with polarity $p$, but appears after the target word or exactly as the last word of the target.
• $E_{\epsilon,p}$ is used to denote that the current word is part of a sentiment span with polarity $p$, and is also a part of the target. The BMES sub-tag $\epsilon$ denotes the position information within the target phrase. For example, $E_{B,+}$ represents that the current word appears as the first word of a target with positive polarity.
Figure 3: An alternative structured output for the same example with different sentiment boundaries.
We illustrate how to construct the label sequence for a specific combination of sentiment spans of the given example sentence in Figure 2, where three non-overlapping sentiment spans in yellow are presented. Each such sentiment span encodes the sentiment polarity in blue for a target in bold in a pink square. At each position, we allow multiple tags in a sequence to appear such that the edge $A_p B_{p'}$ in red consistently indicates the boundary between two adjacent sentiment spans.
The first sentiment span with positive (+) polarity contains only one word, which is also the target. Such a single-word target is both the beginning and the end of the target. We use three tags $B_+$, $E_{S,+}$ and $A_+$ to encode this information.
The second sentiment span with positive (+) polarity contains a two-word target "Shin Lim". The word "and" appearing before the target takes the tag $B_+$. The words "perform amazing magic" appearing after the target take the tag $A_+$ at each position. As for the target, the word "Shin" at the beginning of the target takes tags $B_+$ and $E_{B,+}$, while the word "Lim" at the end of the target takes tags $E_{E,+}$ and $A_+$.
The third sentiment span with neutral (0) polarity contains a single-word target "AGT". Similarly, we use three tags $B_0$, $E_{S,0}$ and $A_0$ to represent this single-word target. The word "on" appearing before the target takes the tag $B_0$. The word "2018" appearing afterwards takes the tag $A_0$.
Note that if there exists a target with length larger than 2, the tag $E_{M,p}$ will be used. For example, in Figure 2, if the target phrase "Shin Lim" is replaced by "Shin Bob Lim", we will keep the tags at "Shin" and "Lim" unchanged. We assign the tag $E_{M,+}$ to the word "Bob" to indicate that "Bob" appears in the middle of the target, following the BMES tagging scheme.
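The tagging rules described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `bmes` and `build_tags`, the tuple layout of a span, and the string spelling of the tags (e.g. `ES,+`) are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical span layout: (span_start, span_end, target_start, target_end, polarity),
# with inclusive word indices and polarity in {"+", "0", "-"}.
Span = Tuple[int, int, int, int, str]


def bmes(k: int, t_start: int, t_end: int) -> str:
    """BMES sub-tag of position k inside the target [t_start, t_end]."""
    if t_start == t_end:
        return "S"
    if k == t_start:
        return "B"
    if k == t_end:
        return "E"
    return "M"


def build_tags(n: int, spans: List[Span]) -> List[List[str]]:
    """Return the list of tags at each of the n positions."""
    tags: List[List[str]] = [[] for _ in range(n)]
    for s_start, s_end, t_start, t_end, p in spans:
        for k in range(s_start, s_end + 1):
            if k < t_start:            # inside the span, before the target
                tags[k].append(f"B{p}")
            elif k > t_end:            # inside the span, after the target
                tags[k].append(f"A{p}")
            else:                      # part of the target
                if k == t_start:
                    tags[k].append(f"B{p}")
                tags[k].append(f"E{bmes(k, t_start, t_end)},{p}")
                if k == t_end:
                    tags[k].append(f"A{p}")
    return tags


words = "OZ and Shin Lim perform amazing magic on AGT 2018".split()
# The three sentiment spans of the running example.
spans = [(0, 0, 0, 0, "+"), (1, 6, 2, 3, "+"), (7, 9, 8, 8, "0")]
tags = build_tags(len(words), spans)
for word, word_tags in zip(words, tags):
    print(f"{word:8s} {' '.join(word_tags)}")
```

Replacing the second span by a three-word target, e.g. `(1, 7, 2, 4, "+")`, yields the sub-tag M at the middle word, matching the "Shin Bob Lim" case.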
Finally, we represent the label sequence by connecting adjacent tags sequentially with edges. Notice that for a given input sentence and the output targets as well as the associated targeted sentiment, there exist exponentially many possible label sequences, each specifying a different possible combination of sentiment spans. Figure 3 shows a label sequence for an alternative combination of the sentiment spans. Those label sequences representing the same input and output construct a latent variable in our model, capturing the flexible explicit structures in the output space.
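To make the growth in the number of span combinations concrete, the following sketch counts them under a simplifying assumption that is ours, not stated in the text: the sentiment spans partition the sentence, each span contains exactly one target, and the boundary between two adjacent spans may fall anywhere between the two targets. The function name `count_span_combinations` is hypothetical.

```python
from math import prod
from typing import List, Tuple


def count_span_combinations(targets: List[Tuple[int, int]]) -> int:
    """Count the ways to place span boundaries between adjacent targets.

    targets: sorted, non-overlapping (start, end) pairs with inclusive indices.
    Assumption: a gap of g words between two targets allows g + 1 boundary
    positions, and the choices for different gaps are independent.
    """
    gaps = [nxt[0] - cur[1] - 1 for cur, nxt in zip(targets, targets[1:])]
    return prod(g + 1 for g in gaps)


# Running example: "OZ", "Shin Lim" and "AGT" at positions 0, 2-3 and 8.
example = count_span_combinations([(0, 0), (2, 3), (8, 8)])
print(example)  # gaps of 1 and 4 words give 2 * 5 combinations

# m targets separated by a single word each give 2 ** (m - 1) combinations.
print(count_span_combinations([(0, 0), (2, 2), (4, 4), (6, 6)]))
```

Under this assumption the count is a product over the gaps between targets, so it grows exponentially with the number of targets, which is why the combinations are treated as a latent variable and handled by dynamic programming rather than enumeration.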
We use a log-linear formulation to parameterize our model. Specifically, the probability of predicting a possible output $y$, which is a list of targets and their associated sentiment information, given an input sentence $x$, is defined as:
$$p(y \mid x) = \frac{\sum_{h} \exp\big(s(x, y, h)\big)}{\sum_{y', h'} \exp\big(s(x, y', h')\big)}$$
where $s(x, y, h)$ is a score function defined over the sentence $x$ and the output structure $y$, together with the latent variable $h$ that provides all the possible combinations of sentiment spans for the $(x, y)$ tuple. We define $E(x, y, h)$ as the set of all the edges appearing in all the label sequences for such combinations of sentiment spans. To compute $s(x, y, h)$, we sum up the scores of each edge in $E(x, y, h)$:
$$s(x, y, h) = \sum_{e \in E(x, y, h)} \phi_x(e)$$
where $\phi_x(e)$ is a score function defined over an edge $e$ for the input $x$.
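As a small numerical illustration of a log-linear model with a latent variable, the sketch below marginalizes over latent structures by brute force. It is a toy under stated assumptions: the dictionary `scores`, which lists a few made-up values of $s(x, y, h)$ per output $y$, and the helper names are hypothetical, and a real model would compute both sums with dynamic programming instead of enumeration.

```python
import math
from typing import Dict, List


def logsumexp(values: List[float]) -> float:
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def prob(y: str, scores: Dict[str, List[float]]) -> float:
    """p(y | x): sum over latent h for y, normalized over all (y', h')."""
    log_numer = logsumexp(scores[y])
    log_denom = logsumexp([s for per_y in scores.values() for s in per_y])
    return math.exp(log_numer - log_denom)


# Made-up scores: output "y1" has two latent span combinations, "y2" has one.
scores = {"y1": [1.0, 2.0], "y2": [0.5]}
p1, p2 = prob("y1", scores), prob("y2", scores)
print(round(p1, 4), round(p2, 4))
```

An output that is supported by several reasonably scored latent structures accumulates probability mass from all of them, which is the sense in which the latent variable makes the explicit structure flexible.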
The overall model is analogous to that of a neural CRF (Peng et al., 2009;Do et al., 2010); hence the inference and decoding follow standard marginal and MAP inference procedures. For example, the prediction of y follows the Viterbi-like MAP inference procedure.
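For readers unfamiliar with Viterbi-style MAP inference, a generic linear-chain version is sketched below. It is not the decoder of EI, whose lattice allows several tags per position; the score tables `emit` and `trans` and the function name `viterbi` are assumptions chosen to keep the example minimal.

```python
from typing import List, Tuple


def viterbi(emit: List[List[float]], trans: List[List[float]]) -> Tuple[List[int], float]:
    """Return the highest-scoring tag path and its score.

    emit[k][t]  : score of tag t at position k
    trans[a][b] : score of moving from tag a to tag b
    """
    n, num_tags = len(emit), len(emit[0])
    best = list(emit[0])          # best score of any path ending in each tag
    back: List[List[int]] = []    # back-pointers for positions 1..n-1
    for k in range(1, n):
        pointers, current = [], []
        for t in range(num_tags):
            prev = max(range(num_tags), key=lambda a: best[a] + trans[a][t])
            pointers.append(prev)
            current.append(best[prev] + trans[prev][t] + emit[k][t])
        back.append(pointers)
        best = current
    last = max(range(num_tags), key=lambda t: best[t])
    path = [last]
    for pointers in reversed(back):   # follow back-pointers to the start
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


emit = [[2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
trans = [[0.0, -5.0], [-5.0, 0.0]]   # switching tags is penalized
path, score = viterbi(emit, trans)
print(path, score)
```

The same recursion with `max` replaced by a log-sum-exp gives the forward algorithm used for marginal inference during training.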

Implicit Structure
We propose a design for EI to efficiently learn rich implicit structures for exponentially many combinations of targets to predict. To do so, we explain the process to assign scores to each edge e from our neural architecture. The three yellow boxes in Figure 4 compute scores for rich implicit structures from the neural architecture consisting of LSTM and self-attention.
Given an input token sequence $x = \{x_1, x_2, \cdots, x_n\}$ of length $n$, we first compute the concatenated embedding $e_k = [w_k; c_k]$ based on the word embedding $w_k$ and the character embedding $c_k$ at position $k$.
As illustrated on the left part of Figure 4, we then use a bi-directional LSTM to encode context features and obtain hidden states $h_k = \mathrm{BiLSTM}(e_1, e_2, \cdots, e_n)$. We use two different linear layers $f_t$ and $f_s$ to compute scores for target and sentiment respectively. The linear layer $f_t$ returns a vector of length 4, with each value in the vector indicating the score of the corresponding tag under the BMES tagging scheme. The linear layer $f_s$ returns a vector of length 3, with each value representing the score of a certain polarity of +, 0, −. We assign such scores to each type of edge as follows:
$$\phi_x(E^{k}_{\epsilon,p} E^{k+1}_{\epsilon',p}) = f_t(h_k)_{\epsilon}$$
$$\phi_x(E^{k}_{\epsilon,p} A^{k}_{p}) = f_t(h_k)_{\epsilon}$$
$$\phi_x(B^{k}_{p} B^{k+1}_{p}) = f_s(h_k)_{p}$$
$$\phi_x(A^{k}_{p} A^{k+1}_{p}) = f_s(h_k)_{p}$$
$$\phi_x(A^{k}_{p} B^{k+1}_{p'}) = f_s(h_k)_{p}$$
Note that the subscripts $p$ and $\epsilon$ on the right-hand side of the above equations denote the corresponding index of the vector that $f_t$ or $f_s$ returns. We apply $f_t$ on edges $E^{k}_{\epsilon,p} E^{k+1}_{\epsilon',p}$ and $E^{k}_{\epsilon,p} A^{k}_{p}$, since words at these edges are parts of the target phrase in a sentiment span. Similarly, we apply $f_s$ on edges $B^{k}_{p} B^{k+1}_{p}$, $A^{k}_{p} A^{k+1}_{p}$ and $A^{k}_{p} B^{k+1}_{p'}$, since words at these edges contribute the sentiment information for the target in the sentiment span.
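The role of the two linear layers can be illustrated with a dependency-free sketch. Everything concrete here is an assumption for illustration: the toy two-dimensional hidden state, the weight values, and the helper names `linear` and `edge_scores`; an actual implementation would use learned PyTorch layers on the BiLSTM states.

```python
from typing import Dict, List, Tuple

BMES = ["B", "M", "E", "S"]      # index order of the vector returned by f_t
POLARITY = ["+", "0", "-"]       # index order of the vector returned by f_s

Layer = Tuple[List[List[float]], List[float]]   # (weight rows, bias)


def linear(weight: List[List[float]], bias: List[float], h: List[float]) -> List[float]:
    """Plain affine map: weight @ h + bias."""
    return [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weight, bias)]


def edge_scores(h_k: List[float], f_t: Layer, f_s: Layer) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Scores at position k: one per BMES sub-tag and one per polarity."""
    t = linear(*f_t, h_k)
    s = linear(*f_s, h_k)
    return dict(zip(BMES, t)), dict(zip(POLARITY, s))


h_k = [1.0, 2.0]                                                   # toy hidden state
f_t = ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]], [0.0, 0.0, 0.0, 0.5])
f_s = ([[1.0, -1.0], [0.0, 0.0], [-1.0, 1.0]], [0.0, 0.1, 0.0])
target_score, sentiment_score = edge_scores(h_k, f_t, f_s)
print(target_score)      # used on edges inside the target, indexed by the sub-tag
print(sentiment_score)   # used on edges outside the target, indexed by the polarity
```

Because each score depends only on the position and on the index selected from a short vector, all edge scores can be computed once per sentence before inference starts.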
As illustrated in Figure 4, we calculate $a_k$, the output of self-attention at position $k$, where $\alpha_{k,j}$ is the normalized weight score for $\beta_{k,j}$, and $\beta_{k,j}$ is the weight score calculated from the target representation at position $k$ and the contextual representation at position $j$. In addition, $W$ and $b$ as well as the attention matrix $U$ are the weights to be learned. Such a vector $a_k$ encodes the implicit structures between the word $x_k$ and each word in the remaining sentence. Motivated by the character embeddings (Lample et al., 2016), which are generated based on hidden states at the two ends of a subsequence, we encode such implicit structures for a target similarly. For any target starting at position $k_1$ and ending at position $k_2$, we could use $a_{k_1}$ and $a_{k_2}$ at the two ends to represent the implicit structures of such a target. We encode such information on the edges $B^{k_1}_{p} E^{k_1}_{\epsilon,p}$ and $E^{k_2}_{\epsilon,p} A^{k_2}_{p}$, which appear at the beginning and the end of a target phrase respectively with sentiment polarity $p$. To do so, we assign the scores $g_s(a_{k_1})_p$ and $g_s(a_{k_2})_p$ calculated from the self-attention to these two edges respectively, where $g_s$ returns a vector of length 3 with scores of three polarities.
Figure 4: Neural Architecture
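A minimal sketch of an attention step of this kind is given below. The exact scoring function is not recoverable from the text, so the additive form $\beta_{k,j} = U^\top \tanh(W[h_k; h_j] + b)$ and the choice of summing hidden states in $a_k = \sum_j \alpha_{k,j} h_j$ are assumptions made for illustration, as are the function names and the toy dimensions.

```python
import math
from typing import List, Tuple


def matvec(W: List[List[float]], v: List[float]) -> List[float]:
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def self_attention(H: List[List[float]], k: int, W: List[List[float]],
                   b: List[float], U: List[float]) -> Tuple[List[float], List[float]]:
    """Return (a_k, alphas) for position k over all positions j of the sentence."""
    betas = []
    for h_j in H:
        # Assumed additive scoring of the pair (position k, position j).
        hidden = [math.tanh(z + c) for z, c in zip(matvec(W, H[k] + h_j), b)]
        betas.append(sum(u * x for u, x in zip(U, hidden)))
    m = max(betas)                                   # stable softmax
    exps = [math.exp(x - m) for x in betas]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(H[0])
    a_k = [sum(alpha * h_j[d] for alpha, h_j in zip(alphas, H)) for d in range(dim)]
    return a_k, alphas


H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]             # toy hidden states, n = 3
W = [[0.5, -0.5, 0.25, 0.0], [0.0, 0.5, -0.25, 0.5]]  # maps [h_k; h_j] to 2 dims
b = [0.0, 0.1]
U = [1.0, -1.0]
a_0, alphas_0 = self_attention(H, 0, W, b, U)
print([round(x, 3) for x in alphas_0], [round(x, 3) for x in a_0])
```

Since $a_k$ depends only on the position $k$ and not on which target hypothesis is being scored, it can be computed once for every position and reused by every candidate target that starts or ends there.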
Note that $h_k$ and $a_k$ could be pre-computed at every position $k$ and assigned to the corresponding edges. Such an approach allows us to maintain the inference time complexity $O(Tn)$, where $T$ is the maximum number of tags at each position, which is 9 in this work, and $n$ is the number of words in the input sentence. This approach enables EI to efficiently learn rich implicit structures from LSTM and self-attention for exponentially many combinations of targets.

Experimental Setup

Data
We mainly conduct our experiments on the datasets released by Mitchell et al. (2013). They

Evaluation Metrics
Following previous work, we report the precision (P.), recall (R.) and $F_1$ scores for target recognition and targeted sentiment. Note that a correct target prediction requires the boundary of the target to be correct, and a correct targeted sentiment prediction requires both the target boundary and the sentiment polarity to be correct.
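The two exact-match criteria can be written down directly. The sketch below is an illustration rather than the official scorer: the function name `prf` and the `(start, end, polarity)` tuple layout are assumptions, and for a whole corpus each tuple would additionally need a sentence identifier so that spans from different sentences are not confused.

```python
from typing import Hashable, Iterable, Tuple


def prf(gold: Iterable[Hashable], pred: Iterable[Hashable]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over exactly matching items."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    p = tp / len(pred_set) if pred_set else 0.0
    r = tp / len(gold_set) if gold_set else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# One sentence, items are (start, end, polarity) with inclusive indices.
gold = [(0, 0, "+"), (2, 3, "+"), (8, 8, "0")]
pred = [(0, 0, "0"), (2, 3, "+"), (8, 9, "0")]

# Target recognition ignores the polarity and compares boundaries only.
target_prf = prf([g[:2] for g in gold], [q[:2] for q in pred])
# Targeted sentiment requires boundary and polarity to match.
sentiment_prf = prf(gold, pred)
print(target_prf, sentiment_prf)
```

In this example the first prediction counts for target recognition but not for targeted sentiment, and the third counts for neither because its boundary is wrong.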

Hyperparameters
We adopt pretrained embeddings from Pennington et al. (2014) and Cieliebak et al. (2017) for English data and Spanish data respectively. We use a 2-layer LSTM (for both directions) with a hidden dimension of 500 and 600 3 for English data and Spanish data respectively. The dimension of the attention weight U is 300. As for optimization, we use the Adam (Kingma and Ba, 2014) optimizer to optimize the model with batch size 1 and dropout rate 0.5. All the neural weights are initialized by Xavier (Glorot and Bengio, 2010).
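For reference, the settings reported in this section and in the training details can be collected in one place. The container below is a hypothetical convenience for readers; the class name `EIConfig`, the field names and the dictionary keys are assumptions, while the values are the ones stated in the text (embedding dimensions of 100 and 200, a maximum of 6 epochs).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EIConfig:
    word_emb_dim: int            # dimension of the pretrained word embeddings
    lstm_hidden: int             # LSTM hidden dimension
    lstm_layers: int = 2         # 2-layer LSTM in both directions
    attention_dim: int = 300     # dimension of the attention weight U
    batch_size: int = 1
    dropout: float = 0.5
    optimizer: str = "adam"
    init: str = "xavier"
    max_epochs: int = 6


CONFIGS = {
    "english": EIConfig(word_emb_dim=100, lstm_hidden=500),
    "spanish": EIConfig(word_emb_dim=200, lstm_hidden=600),
}
print(CONFIGS["spanish"])
```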

Training and Implementation
We train our model for a maximum of 6 epochs. We select the best model parameters based on the best $F_1$ score on the development data after each epoch. Note that we split 10% of data from the training data as the development data 4. The selected model is then applied to the test data for evaluation. During testing, we map words not appearing in the training data to the UNK token. Following previous work, we perform 10-fold cross validation and report the average results. Our models and variants are implemented using PyTorch (Paszke et al., 2017).
3 We use a larger LSTM hidden size for Spanish since the dimension of the Spanish word embeddings (200) is larger than the dimension of the English word embeddings (100).
4 Detailed split information is released with our code.
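The model selection procedure can be sketched as follows. The code is a schematic under assumptions: `train_one_epoch` and `evaluate_f1` are hypothetical callbacks standing in for the real training and development-set evaluation, and the random split shown here is only one possible way to hold out 10% of the training data (the authors release their own split).

```python
import random
from typing import Any, Callable, List, Sequence, Tuple


def split_train_dev(examples: Sequence[Any], dev_ratio: float = 0.1,
                    seed: int = 0) -> Tuple[List[Any], List[Any]]:
    """Hold out a fraction of the training examples as development data."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_dev = max(1, int(len(items) * dev_ratio))
    return items[n_dev:], items[:n_dev]


def select_best_epoch(train_one_epoch: Callable[[int], Any],
                      evaluate_f1: Callable[[Any], float],
                      max_epochs: int = 6) -> Tuple[int, float, Any]:
    """Train for up to max_epochs and keep the state with the best dev F1."""
    best_epoch, best_f1, best_state = 0, -1.0, None
    for epoch in range(1, max_epochs + 1):
        state = train_one_epoch(epoch)       # returns the model state after this epoch
        f1 = evaluate_f1(state)              # F1 on the development data
        if f1 > best_f1:
            best_epoch, best_f1, best_state = epoch, f1, state
    return best_epoch, best_f1, best_state


train, dev = split_train_dev(range(100))
# Stub callbacks: the "state" is just the epoch number, with made-up dev scores.
fake_dev_f1 = {1: 0.50, 2: 0.58, 3: 0.61, 4: 0.63, 5: 0.62, 6: 0.60}
best = select_best_epoch(lambda epoch: epoch, lambda state: fake_dev_f1[state])
print(len(train), len(dev), best)
```

In a 10-fold setting this selection would be repeated once per fold, and the test scores of the ten selected models averaged.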

Baselines
We consider the following baselines:
• Pipeline (Zhang et al., 2015) and Collapse (Zhang et al., 2015) are both linear-chain CRF models using discrete features and embeddings. The former predicts targets first and then calculates the targeted sentiment for each predicted target. The latter outputs a tag at each position by collapsing the target tag and sentiment tag together.
• Joint (Zhang et al., 2015) is a linear-chain SSVM model using both discrete features and embeddings. Such a model jointly produces target tags and sentiment tags.
• Bi-GRU (Ma et al., 2018) and MBi-GRU (Ma et al., 2018) are both linear-chain CRF models using word embeddings. The former uses bi-directional GRU and the latter uses multilayer bi-directional GRU.
• HBi-GRU (Ma et al., 2018) and HMBi-GRU (Ma et al., 2018) are both linear-chain CRF models using word embeddings and character embeddings. The former uses bi-directional GRU and the latter uses multilayer bi-directional GRU.
• SS (Li and Lu, 2017) and SS + emb (Li and Lu, 2017) are both based on a latent CRF model to learn flexible explicit structures. The former uses discrete features and the latter uses both discrete features and word embeddings.
• SA-CRF is a linear-chain CRF model with self-attention. Such a model concatenates the hidden state from LSTM and a vector constructed by self-attention at each position, and feeds them into CRF as features. The model attempts to capture rich implicit structures in the input space, but it makes no effort to model flexible explicit structures in the output space.
• E-I is a weaker version of EI. Such a model removes the BMES sub-tags in the E tag, causing the model to learn less explicit structural information in the output space.
• EI- is a weaker version of EI. Such a model removes the self-attention from EI, causing the model to learn less expressive implicit structures in the input space.

Main Results
The main results are presented in Table 2, where explicit structures as well as implicit structures are indicated for each model for clear comparisons.
In general, our model EI outperforms all the baselines. Specifically, it outperforms the strongest baseline EI- significantly with p < 0.01 on the English and Spanish datasets in terms of $F_1$ scores 5. Note that EI-, which models flexible explicit structures and less implicit structural information, achieves better performance than most of the baselines, indicating that flexible explicit structures contribute substantially to the performance boost.
Now let us take a closer look at the differences based on detailed comparisons. First of all, we compare our model EI with the work proposed by Zhang et al. (2015). The Pipeline model (based on CRF) as well as Joint and Collapse models (based on SSVM) in their work capture fixed explicit structures. These models rely on a multi-layer perceptron (MLP) to obtain the local context features for implicit structures, and do not put much effort into capturing better explicit and implicit structures. Our model EI (and even EI-) outperforms these models significantly.
We also compare our work with the models in Ma et al. (2018), which also capture fixed explicit structures. Such models leverage different GRUs (single-layer or multi-layer) and different input features (word embeddings and character representations) to learn better contextual features. Their best result, by HMBi-GRU, is obtained with a multi-layer GRU with word embeddings and character embeddings. As we can see, our model EI outperforms HMBi-GRU under all evaluation metrics. On the English data, EI obtains a 6.50 higher $F_1$ score on target recognition and a 2.50 higher $F_1$ score on targeted sentiment. On Spanish, EI obtains a 5.16 higher $F_1$ score on target recognition and a 0.50 higher $F_1$ score on targeted sentiment.
5 We have conducted significance tests using the bootstrap resampling method (Koehn, 2004).
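A paired bootstrap test in the spirit of Koehn (2004) can be sketched as below. This is a generic illustration, not the authors' script: the per-sentence count layout `(true positives, predicted, gold)`, the function names, the number of resamples and the toy data are all assumptions.

```python
import random
from typing import List, Tuple

Counts = Tuple[int, int, int]   # per sentence: (true positives, #predicted, #gold)


def f1_from_counts(counts: List[Counts]) -> float:
    tp = sum(c[0] for c in counts)
    n_pred = sum(c[1] for c in counts)
    n_gold = sum(c[2] for c in counts)
    if tp == 0:
        return 0.0
    p, r = tp / n_pred, tp / n_gold
    return 2 * p * r / (p + r)


def paired_bootstrap(counts_a: List[Counts], counts_b: List[Counts],
                     n_samples: int = 1000, seed: int = 0) -> float:
    """Fraction of resampled test sets on which system A does not beat system B."""
    rng = random.Random(seed)
    n = len(counts_a)
    not_better = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]   # resample sentences with replacement
        f_a = f1_from_counts([counts_a[i] for i in idx])
        f_b = f1_from_counts([counts_b[i] for i in idx])
        if f_a <= f_b:
            not_better += 1
    return not_better / n_samples


# Toy data: system A is right on all 50 sentences, system B on only half of them.
system_a = [(1, 1, 1)] * 50
system_b = [(0, 1, 1)] * 25 + [(1, 1, 1)] * 25
p_value = paired_bootstrap(system_a, system_b)
print(p_value)
```

The same sentence indices are used for both systems in every resample, which is what makes the comparison paired.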
Notably, compared with HMBi-GRU, even EI-, which captures the flexible explicit structures, achieves better performance on most of the metrics and obtains comparable results in terms of precision and $F_1$ score on Spanish. Since both the EI and EI- models attempt to capture the flexible explicit structures, the comparisons above imply the importance of modeling such flexible explicit structures in the output space.
We also compare EI with E-I. The difference between these two models is that E-I removes the BMES sub-tags. Such a model captures less explicit structural information in the output space. We can see that EI outperforms E-I. Such results show that adopting BMES sub-tags in the output space to capture explicit structural information is beneficial.
Now we compare EI with SA-CRF, which is a linear-chain CRF model with self-attention. Such a model attempts to capture rich implicit structures and fixed explicit structures. The difference between EI and SA-CRF is that our model EI captures flexible explicit structures in the output space, which model output representations as latent variables. We can see that EI outperforms SA-CRF on all the metrics. Such a comparison also implies the importance of capturing flexible explicit structures in the output space.
Table 3: Results on subjectivity as well as non-neutral sentiment analysis on the Spanish dataset. Subj(+/-,o): subjectivity for all polarities. SA(+,-): sentiment analysis for non-neutral polarities.
Next, we focus on the comparisons with SS (Li and Lu, 2017) and SS + emb (Li and Lu, 2017). These two models as well as our models all capture the flexible explicit structures. As for the difference, both SS models rely on hand-crafted discrete features to capture implicit structures, while our models EI and EI- learn better implicit structures by LSTM and self-attention. Furthermore, our models only require word embeddings and character embeddings as the input to our neural architecture to model rich implicit structures, leading to a comparatively simpler and more straightforward design. The comparison here suggests that LSTM and self-attention neural networks are able to capture better implicit structures than hand-crafted features.
Finally, we compare EI with EI-. We can see that the $F_1$ scores of targeted sentiment produced by EI for English and Spanish are 0.95 and 0.97 points higher, respectively, than those of EI-. The main difference here is that EI makes use of self-attention to capture richer implicit structures between each target phrase and all words in the complete sentence. The comparisons here indicate the importance of capturing rich implicit structures using self-attention on this task.

Robustness
Overall, all the comparisons above based on empirical results show the importance of capturing both flexible explicit structures in the output space and rich implicit structures by LSTM and self-attention in the input space.
We analyze the model robustness by assessing the performance on the targeted sentiment for targets of different lengths. For both English and Spanish, we group targets into 4 categories, namely lengths of 1, 2, 3 and ≥ 4. Figure 5 reports the $F_1$ scores of targeted sentiment for these 4 groups on Spanish. As we can see, EI outperforms all the baselines on all groups. Furthermore, following the comparisons in Zhang et al. (2015), we also measure the precision, recall and $F_1$ of subjectivity and non-neutral polarities on the Spanish dataset. Results are reported in Table 3. The subjectivity measures whether a target phrase expresses an opinion or not according to Liu (2010). Compared with the best-performing system's results reported in Zhang et al. (2015) and Li and Lu (2017), our model EI achieves higher $F_1$ scores on subjectivity and non-neutral polarities.

Error Analysis
We conducted error analysis for our main model EI. We calculate $F_1$ scores based on partial match instead of exact match. The $F_1$ scores for target partial match are 76.04 and 83.82 for English and Spanish respectively. We compare these two numbers against 63.48 and 71.17, which are the $F_1$ scores based on exact match. This comparison indicates that the boundaries of many predicted targets do not match exactly with those of the correct targets. Furthermore, we investigate the errors caused by incorrect sentiment polarities. We found that the major type of error is to incorrectly predict positive targets as neutral targets. Such errors contribute 64% and 36% of total errors for English and Spanish respectively. We believe they are mainly caused by challenging expressions in the tweet input text. Challenging expressions such as "below expectations" are very sparse in the data, which makes effective learning for such phrases difficult.
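The gap between exact and partial match can be reproduced on a toy example. The text does not define partial match, so the criterion used below, namely that a predicted span and a gold span share at least one token, is an assumption, as are the function names.

```python
from typing import List, Tuple

SpanT = Tuple[int, int]   # (start, end), inclusive word indices


def overlaps(a: SpanT, b: SpanT) -> bool:
    """True if the two spans share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def partial_prf(gold: List[SpanT], pred: List[SpanT]) -> Tuple[float, float, float]:
    """Precision, recall and F1 where any overlap counts as a match (assumed criterion)."""
    matched_pred = sum(1 for q in pred if any(overlaps(q, g) for g in gold))
    matched_gold = sum(1 for g in gold if any(overlaps(g, q) for q in pred))
    p = matched_pred / len(pred) if pred else 0.0
    r = matched_gold / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


gold = [(2, 3), (8, 8)]
pred = [(3, 3), (8, 9), (5, 5)]        # two boundary errors and one spurious target
exact_tp = len(set(gold) & set(pred))  # no prediction matches exactly
partial = partial_prf(gold, pred)
print(exact_tp, partial)
```

Here every gold target is found under the overlap criterion although none is matched exactly, which is the pattern of boundary errors the analysis above points to.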

Effect of Implicit Structures
In order to understand whether the implicit structures truly contribute to the overall performance, we compare the performance among four models: EI and EI- as well as two variants EI (i:MLP) and EI (i:Identity) (where i indicates the implicit structure). These two variants replace the implicit structure by other components:
• EI (i:MLP) replaces self-attention by a multi-layer perceptron (MLP) for implicit structures. Such a variant attempts to capture implicit structures for a target phrase towards words restricted by a window of size 3 centered at the two ends of the target phrase.
• EI (i:Identity) replaces self-attention by an identity layer 8 as implicit structure. Such a variant attempts to capture implicit structures for a target phrase towards words at the two ends of the target phrase exactly.
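The difference between the variants can be summarized by which positions each one can draw on when scoring one end of a target. The helper below is a schematic with hypothetical names, written only to make the three receptive fields explicit.

```python
from typing import List


def visible_positions(variant: str, k: int, n: int, window: int = 3) -> List[int]:
    """Positions whose representations can influence the score at target end k."""
    if variant == "identity":          # only the word at the target end itself
        return [k]
    if variant == "mlp":               # a window centered at the target end
        half = window // 2
        return [j for j in range(k - half, k + half + 1) if 0 <= j < n]
    if variant == "self-attention":    # every word in the sentence
        return list(range(n))
    raise ValueError(f"unknown variant: {variant}")


n = 10                                  # "OZ and Shin Lim perform amazing magic on AGT 2018"
k_1, k_2 = 2, 3                         # the two ends of the target "Shin Lim"
for variant in ("identity", "mlp", "self-attention"):
    print(variant, visible_positions(variant, k_1, n), visible_positions(variant, k_2, n))
```

With a window of size 3, the keyword "amazing" at position 5 is out of reach for both ends of "Shin Lim", whereas self-attention can attend to it directly.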
Overall, those variants perform worse than EI on all the metrics. When the self-attention is replaced by MLP or the identity layer for implicit structures, the performance drops considerably on both target and targeted sentiment. The two variants EI (i:MLP) and EI (i:Identity) consider only the words within a small window centered at the two ends of the target phrase, which might not be capable of capturing the desired implicit structures. The EI- model, which captures less implicit structural information, achieves worse results than EI, but obtains better results than the two variants discussed above. This comparison implies that properly capturing implicit structures as the complement of explicit structural information is essential.
8 The identity layer returns the identical input data.

Qualitative Analysis
We present an example sentence from the test data in Figure 6, where the gold targets are in bold, the predicted targets are in the pink boxes, the gold sentiment is in blue and the predicted sentiment is in red. EI makes all correct predictions for the three targets. EI- predicts correct boundaries for the three targets, and the targeted sentiment predictions are highlighted in Figure 6. As we can see, EI- incorrectly predicts the targeted sentiment on the first target as neutral (0). The first target here is far from the sentiment expression "sound good", which is not in the first sentiment span, making EI- incapable of capturing such a sentiment expression. This qualitative analysis helps us to better understand the importance of capturing implicit structures using both LSTM and self-attention.

Additional Experiments
We also conducted experiments on the multi-lingual Restaurant datasets from SemEval 2016 Task 5 (Pontiki et al., 2016), where aspect target phrases and aspect sentiments are provided. We regard each aspect target phrase as a target and assign such a target the corresponding aspect sentiment polarity in the data. Note that we remove all the instances which contain no targets in the training data. Following the main experiment, we split 10% of the training data as the development set for the selection of the best model during training.
We report the $F_1$ scores of target and targeted sentiment for English, Dutch and Russian respectively in Table 5. The results show that EI achieves the best performance. The performance of SS (Li and Lu, 2017) is much worse on Russian due to the inability of discrete features in SS to capture the complex morphology in Russian.

Related Work
We briefly survey the research efforts on two types of TSA tasks mentioned in the introduction. Note that TSA is related to aspect sentiment analysis which is to determine the sentiment polarity given a target and an aspect describing a property of related topics.
Predicting sentiment for a given target Such a task is typically solved by leveraging sentence structural information, such as syntactic trees (Dong et al., 2014), dependency trees (Wang et al., 2016) as well as surrounding context based on LSTM (Tang et al., 2016a), GRU (Zhang et al., 2016) or CNN (Xue and Li, 2018). Another line of works leverage self-attention (Liu and Zhang, 2017) or memory networks (Tang et al., 2016b) to encode rich global context information. Wang and Lu (2018) adopted the segmental attention (Kong et al., 2016) to model the important text segments to compute the targeted sentiment.  studied the issue that the different combinations of target and aspect may result in different sentiment polarity. They proposed a model to distinguish such different combinations based on memory networks to produce the representation for aspect sentiment classification.
Jointly predicting targets and their associated sentiment Such a joint task is usually regarded as a sequence labeling problem. Mitchell et al. (2013) introduced the task of open domain targeted sentiment analysis. They proposed several models based on CRF such as the pipeline model, the collapsed model as well as the joint model to predict both targets and targeted sentiment information. Their experiments showed that the collapsed model and the joint model could achieve better results, implying the benefit of joint learning on this task. Zhang et al. (2015) proposed an approach based on structured SVM (Taskar et al., 2005; Tsochantaridis et al., 2005) integrating both discrete features and neural features for this joint task. Li and Lu (2017) proposed the sentiment scope model, motivated by a linguistic phenomenon, to represent the structure information for both the targets and their associated sentiment polarities. They modelled the latent sentiment scope based on CRF with latent variables, and achieved the best performance among all the existing works. However, they did not explore the implicit structural information much, and their work mostly relied on hand-crafted discrete features. Ma et al. (2018) adopted a multi-layer GRU to learn targets and sentiments jointly by producing the target tag and the sentiment tag at each position. They introduced a constraint forcing the sentiment tag at each position to be consistent with the target tag. However, they did not explore the explicit structural information in the output space as we do in this work.

Conclusion and Future Work
In this work, we argue that properly modeling both the explicit structures in the output space and the implicit structures in the input space is crucial for building a successful targeted sentiment analysis system. Specifically, we propose a new model that captures explicit structures with latent CRF, and uses LSTM and self-attention to capture rich implicit structures in the input space efficiently. Through extensive experiments, we show that our model is able to outperform competitive baseline models significantly, thanks to its ability to properly capture both explicit and implicit structural information.
Future work includes exploring approaches to capture explicit and implicit structural information for other sentiment analysis tasks and other structured prediction problems.