bunji at SemEval-2016 Task 5: Neural and Syntactic Models of Entity-Attribute Relationship for Aspect-based Sentiment Analysis

This paper describes a sentiment analysis system developed by the bunji team in SemEval-2016 Task 5. In this task, we estimate the sentimental polarity of a given entity-attribute (E#A) pair in a sentence. Our approach is to estimate the relationship between target entities and sentimental expressions. We use two different methods to estimate the relationship. The first one is based on a neural attention model that learns relations between tokens and E#A pairs through backpropagation. The second one is based on a rule-based system that examines several verb-centric relations related to E#A pairs. We confirmed the effectiveness of the proposed methods in a target estimation task and a polarity estimation task in the restaurant domain, while our overall ranks were modest.


Introduction
Sentiment analysis is an important technology for understanding users' intentions from review texts. Such technologies are also useful for argumentation mining because it is necessary for readers to capture targets of interest and their polarities (Sato et al., 2015). Shared tasks of aspect-based sentiment analysis (ABSA) in SemEval provide a test bed for fine-grained analysis of sentiment polarities (Pontiki et al., 2014; Pontiki et al., 2015; Pontiki et al., 2016).
We participate in all four slots (Slot 1, Slot 2, Slot 1 & 2, and Slot 3) of the restaurant domain and the laptop domain in English. We focus on two types of models to capture the entity-attribute relationship, especially in Slot 2 and Slot 3. The first one is a neural network-based model. The second one is a rule-based approach. We now explain the problem settings of the slots and our approaches. The following is an example of a sentence that expresses a positive opinion on the FOOD#QUALITY aspect: "Pizza here is good."
Slot 1 is the extraction of all aspects mentioned in a sentence. In this example, the goal is to choose FOOD#QUALITY among many other aspects. We formulate the problem as a multi-label classification problem and use a neural network-based model. Slot 2 is the extraction of opinion target expressions. The expected output is "Pizza" in the above example. We use a pattern matching-based approach and focus on gathering resources such as dictionaries. For Slot 1 & 2, we simply combine the prediction results of Slot 1 and Slot 2. Slot 3 is the estimation of sentiment polarities. In this example, we estimate the polarity of the sentence with respect to the FOOD#QUALITY aspect. For Slot 3, we take two approaches. The first approach is a neural attention model (Luong et al., 2015) that considers the entity attention (FOOD) and the attribute attention (QUALITY) of each token. The second approach is a pattern matching-based model that examines the relationship between "Pizza" and "good"; this model is also used in Slot 2.
The remainder of this paper is structured as follows: In Section 2, we describe our system for Phase A. In Section 3, we explain our system for Phase B. Finally, Section 4 summarizes our work.

System Description of Phase A

Slot 1: Aspect Category
We formulate Slot 1 as a multi-label classification problem. In this problem, an entity-attribute pair is considered as a label. We use a neural model to solve this problem. The model is illustrated in Figure 1. Given a sequence of word vectors X = (x_1, x_2, ..., x_T), this model calculates a vector y, each element of which represents the probability of a label.
First, we apply Stanford CoreNLP (Manning et al., 2014) to each document to obtain word sequences. Then, we use word embeddings generated by Skip-gram with Negative Sampling (Mikolov et al., 2013) to convert words into word vectors. In Slot 1, we use 300-dimensional vectors trained on the Google News corpus (the word embeddings are available at https://code.google.com/archive/p/word2vec/).
Then, the word vector sequence X is fed to a recurrent neural network (RNN). The RNN calculates an output vector s_t for each x_t (Eq. 2), updating a hidden state h_t at position t. We use Long Short-Term Memory (LSTM) (Sak et al., 2014) and Gated Recurrent Unit (GRU) (Cho et al., 2014) as implementations of RNN units. We use a bi-directional RNN (BiRNN) (Schuster and Paliwal, 1997) in order to consider both forward context and backward context. A BiRNN consists of a forward RNN that processes tokens from head to tail and a backward RNN that processes tokens from tail to head. We concatenate the forward output and the backward output at position t into s_t. The sentence vector v is then computed as the mean of the RNN outputs s_t:

\[ v = \frac{1}{T} \sum_{t=1}^{T} s_t \]

Finally, the probabilities in y are calculated from v by using a single-layer perceptron, where W and b denote its weight matrix and bias vector, respectively. We determine that a sentence contains the i-th aspect if its output y_i is greater than a threshold θ. The threshold θ is determined by using development data that is randomly sampled from the training data.
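The pooling and thresholding steps of Slot 1 can be sketched in a few lines of plain Python. This is a minimal illustration, not the TensorFlow implementation used in our system: the RNN outputs are taken as given, the function names are hypothetical, and the use of a softmax in the output layer is an assumption, since the text only specifies a single-layer perceptron.

```python
import math


def concat_directions(forward, backward):
    """Concatenate forward and backward RNN outputs position by position.

    Both inputs are lists of vectors already aligned to token positions.
    """
    return [f + b for f, b in zip(forward, backward)]


def mean_pool(outputs):
    """Average the RNN outputs s_1..s_T into one sentence vector v."""
    length = len(outputs)
    dim = len(outputs[0])
    return [sum(s[i] for s in outputs) / length for i in range(dim)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_probabilities(v, weights, bias):
    """Single-layer perceptron y = softmax(W v + b) (softmax is an assumption)."""
    logits = [
        sum(w * x for w, x in zip(row, v)) + b_i
        for row, b_i in zip(weights, bias)
    ]
    return softmax(logits)


def detect_aspects(y, labels, theta):
    """Return every label whose probability exceeds the threshold theta."""
    return [label for label, p in zip(labels, y) if p > theta]
```

Because several aspects can pass the threshold at once, the same output vector supports multi-label prediction even though its elements sum to one.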
We convert aspect names into a format suitable for our neural model. Low-frequency aspects in the training datasets are replaced by a new aspect "OTHER". The 10 most common aspects are preserved in the restaurant domain; the 16 most common aspects are preserved in the laptop domain. "NONE" labels are assigned to sentences that do not have any labels. For a training example, the probability y_i is defined as y_i = 1/k when the target sentence has the i-th aspect and a total of k aspects; otherwise y_i = 0. We train the model by using backpropagation. The loss is calculated by using cross entropy. We use a minibatch stochastic gradient descent (SGD) algorithm together with an AdaGrad optimizer (Duchi et al., 2011). We add Dropout (Srivastava et al., 2014) layers to the input and output of the RNN. We clip the gradient norm when it exceeds 5.0 to improve the stability of training. The model parameters and θ are trained on the ABSA 2015 training dataset, and the hyperparameters are tuned on the ABSA 2015 test dataset. We use random sampling to tune the hyperparameters. The best settings are shown in Table 1. We implement our neural systems by using TensorFlow (Abadi et al., 2015).
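The label preparation and the training targets described above can be sketched as follows. The function names are hypothetical, and one detail is an assumption: when several low-frequency aspects of one sentence are mapped to "OTHER", the sketch counts "OTHER" once when computing k.

```python
import math
from collections import Counter


def frequent_aspects(training_labels, keep):
    """Return the `keep` most common aspects over all training sentences."""
    counts = Counter(a for labels in training_labels for a in labels)
    return {aspect for aspect, _ in counts.most_common(keep)}


def normalize_labels(labels, kept):
    """Map rare aspects to OTHER and unlabeled sentences to NONE."""
    if not labels:
        return ["NONE"]
    normalized = []
    for aspect in labels:
        name = aspect if aspect in kept else "OTHER"
        if name not in normalized:  # assumption: count OTHER only once
            normalized.append(name)
    return normalized


def target_distribution(labels, label_index):
    """Build the target vector with y_i = 1/k for each of the k aspects."""
    target = [0.0] * len(label_index)
    for aspect in labels:
        target[label_index[aspect]] = 1.0 / len(labels)
    return target


def clip_by_norm(gradient, max_norm=5.0):
    """Rescale the gradient when its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm > max_norm:
        return [g * max_norm / norm for g in gradient]
    return list(gradient)
```

With `keep=10` for the restaurant domain and `keep=16` for the laptop domain, the label set matches the sizes reported in the text, plus "OTHER" and "NONE".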

Slot 2: Opinion Target Expression
In Slot 2, we extract text spans corresponding to target entities. Our method proceeds as follows:
1. Create dictionaries of food names and drink names by extracting targets from the training dataset.
2. Collect food names and drink names from a knowledge base and add them to the dictionaries.
3. Apply dictionary matching to the sentences in the test dataset.
4. Extract restaurant names by using syntactic rules.
5. Check the relationship between attribute expressions and the targets extracted in steps 3 and 4.
Three key features of our method are the dictionary creation in step 2, the syntactic rules in step 4, and the estimation of the entity-attribute relationship in step 5.
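The dictionary steps of this procedure (steps 1 to 3) can be illustrated with a short sketch. It assumes case-insensitive, greedy longest-match lookup over whitespace tokens; the function names and the `max_len` limit are hypothetical and do not describe our exact matcher.

```python
def build_dictionary(training_targets, knowledge_base_labels):
    """Merge training-set targets and knowledge-base labels (lowercased)."""
    entries = {target.lower() for target in training_targets}
    entries |= {label.lower() for label in knowledge_base_labels}
    return entries


def match_targets(tokens, dictionary, max_len=4):
    """Greedy longest-match lookup; returns (start, end, text) token spans."""
    spans = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first so multi-word names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            text = " ".join(tokens[i:i + n])
            if text.lower() in dictionary:
                spans.append((i, i + n, text))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return spans
```

For the example sentence from the introduction, `match_targets("Pizza here is good".split(), dictionary)` yields the single span covering "Pizza" once "Pizza" is in the dictionary.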

Dictionary Creation
The coverage of the dictionaries is crucial for improving recall. In the training dataset, we observe various instances of FOOD entities such as bread, focaccia, and gazpacho. Therefore, we import world knowledge stored in a knowledge base. We use DBpedia as the knowledge base to expand the dictionaries. We write a SPARQL query to retrieve the labels (rdfs:label) of entities as dictionary entries. First, we prepare a list of target types. For example, we use http://dbpedia.org/ontology/Food and http://dbpedia.org/ontology/Fish as types of FOOD entities. We also prepare a list of types to be ignored, such as "dbo:Beverage". Names of DRINK entities are retrieved in the same manner as those of FOOD entities.
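A query of this kind might look like the one produced by the sketch below. The query text is an illustration under assumptions rather than the query we actually ran: the helper name `build_label_query` is hypothetical, and the restriction to English labels and the use of `FILTER NOT EXISTS` for the ignored types are choices made for this example.

```python
def build_label_query(include_types, exclude_types=(), lang="en"):
    """Build a SPARQL query that retrieves rdfs:label values of typed entities."""
    values = " ".join(f"<{t}>" for t in include_types)
    lines = [
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
        "SELECT DISTINCT ?label WHERE {",
        f"  VALUES ?type {{ {values} }}",
        "  ?entity a ?type ;",
        "          rdfs:label ?label .",
        # Assumption: only labels in one language are wanted.
        f'  FILTER (lang(?label) = "{lang}")',
    ]
    for ignored in exclude_types:
        lines.append(f"  FILTER NOT EXISTS {{ ?entity a <{ignored}> }}")
    lines.append("}")
    return "\n".join(lines)


FOOD_QUERY = build_label_query(
    ["http://dbpedia.org/ontology/Food", "http://dbpedia.org/ontology/Fish"],
    exclude_types=["http://dbpedia.org/ontology/Beverage"],
)
```

The resulting string can be sent to any SPARQL 1.1 endpoint that serves DBpedia; the returned labels are then added to the FOOD dictionary.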

Restaurant Name Extraction
We use syntactic rules to extract restaurant names. We define a set of verb-centric rules such as "A1 visited A2", where A1 is a subject and A2 is an object. In such a pattern, A2 is likely to be a restaurant name. We manually created 15 rules from the training data.
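A verb-centric rule such as "A1 visited A2" can be applied to dependency-parsed input as in the sketch below. The triple format `(head lemma, relation, dependent text)`, the relation name `dobj`, and the single rule shown are assumptions for illustration; they do not reproduce our 15 rules.

```python
# One illustrative rule: the direct object of "visit" is a candidate name.
RULES = [{"verb": "visit", "slot": "dobj"}]


def extract_restaurant_names(dependencies, rules=RULES):
    """Return dependents that fill the target slot of a rule's verb.

    `dependencies` is a list of (head_lemma, relation, dependent_text) triples.
    """
    wanted = {(rule["verb"], rule["slot"]) for rule in rules}
    return [
        dependent
        for head, relation, dependent in dependencies
        if (head, relation) in wanted
    ]
```

For a hypothetical sentence "We visited Saul" parsed into `[("visit", "nsubj", "We"), ("visit", "dobj", "Saul")]`, the function returns the object "Saul" as a candidate restaurant name.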

Entity-Attribute Relationship Estimation
We observe that the dictionary-match results contain entities that are not related to sentimental expressions, which decreases precision. Therefore, we keep only the entities that are related to sentiment expressions. We use the same method as that in Slot 3.

Table 2 shows the results of Slot 1. Our system achieved the highest recall among all of the teams in the restaurant domain, while our precision is lower than that of the baseline system. This is partly because the threshold values may be overfitted to the development sets. One possible solution is to use cross validation to estimate more reliable threshold values. Table 3 shows the results of Slot 2. Both precision and recall improve over those of the baseline system. The recall is comparable to that of the top-ranked team, while there is much room for improvement in precision. Table 4 shows the results of Slot 1 & 2. We can observe a tendency similar to that of the Slot 1 results because we simply merged the results of Slot 1 and Slot 2, and Slot 1 performs worse than Slot 2.

System Description of Phase B

Slot 3: Sentiment Polarity

Neural Approach
Our method is inspired by a deep learning method proposed by Wang and Liu (2015). They used the estimated probabilities of Slot 1 as weights of a target entity-attribute pair and then fed the weighted tokens to a convolutional neural network. Instead of using the probabilities of Slot 1, we directly calculate entity attention and attribute attention at each token by using a neural attention model (Luong et al., 2015). The model is illustrated in Figure 2. We calculate a vector y_p that represents the probabilities of the polarities (positive, negative, and neutral) from the word vectors X and from v_e and v_a, the vectors corresponding to a target entity and a target attribute. First, we calculate the RNN outputs s_t with Eq. 2, as in Slot 1. Then, attention weights for both the entity and the attribute are computed at attention layers. An entity-attention layer calculates weights ε_t at position t. At each position, a score e_t is computed to measure the relationship between s_t and v_e. We then rescale e_t to obtain the entity-attention weight ε_t. Similarly, the attribute-attention layer calculates weights α_t at position t from s_t and v_a.
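One form of the attention weights that is consistent with this description can be written as follows. This is a sketch under assumptions: the scoring function score(·,·) is left abstract (Luong et al. (2015) describe several choices), the rescaling of the scores is assumed to be a softmax over the T positions of the sentence, and the symbol a_t for the attribute score is introduced here only for illustration.

```latex
\begin{align}
e_t &= \mathrm{score}(s_t, v_e), &
\varepsilon_t &= \frac{\exp(e_t)}{\sum_{j=1}^{T} \exp(e_j)}, \\
a_t &= \mathrm{score}(s_t, v_a), &
\alpha_t &= \frac{\exp(a_t)}{\sum_{j=1}^{T} \exp(a_j)}.
\end{align}
```

Under this normalization, the weights of each attention layer are positive and sum to one over the sentence.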
Then, we calculate a sentence vector r by concatenating the sum of the RNN outputs weighted by the entity-attention weights and the sum weighted by the attribute-attention weights:

\[ r = \left( \sum_{t=1}^{T} \varepsilon_t s_t \right) \,\|\, \left( \sum_{t=1}^{T} \alpha_t s_t \right) \]

where || denotes a concatenation operator that creates a vector in R^{2d} from two vectors in R^d. Finally, we calculate y_p from r by using a single-layer perceptron.
We train the Slot 3 model by using backpropagation. We use a minibatch stochastic gradient descent (SGD) algorithm together with the ADAM optimizer (Kingma and Ba, 2015). Hyperparameters are tuned in the same way as in Slot 1. The hyperparameter settings in Slot 3 are shown in Table 6. We add Dropout (Srivastava et al., 2014) layers to the input and output of the RNN. We also apply L2 regularization to the two attention layers and the softmax layer. The attention unit size is 300. In the unconstrained setting, we use the same 300-dimensional word embeddings as in Slot 1. In the constrained setting, we use 128-dimensional vectors that are initialized from a uniform distribution. We clip the gradient norm when it exceeds 5.0 to improve the stability of training. We set the maximum token length to 40. The initial values of the entity vectors are created by averaging the word vectors in the sentences that have the target entities. The attribute vectors are initialized in the same manner as the entity vectors.

Parameter        REST (U)      REST (C)      LAPT (C)
Dropout p_k      0.9           0.9           0.9
Learning rate    1.7 × 10^-3   1.4 × 10^-3   5.8 × 10^-4
RNN state size   64            128           64
Minibatch size   16            32            16
Max epochs       12            19            6
L2 coef          1.9 × 10^-4   1.9 × 10^-4   3.3 × 10^-4
Table 6: Hyperparameter settings for Slot 3. p_k denotes the ratio of values kept in a Dropout layer.
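The computation of the sentence vector r from the two attention layers can be sketched in plain Python. This is an illustration under stated assumptions, not our TensorFlow model: a dot product is used as the relationship score (which requires v_e and v_a to have the same size as s_t), the scores are rescaled with a softmax, and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(x - peak) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(outputs, query):
    """Score each RNN output against a query vector, then normalize.

    Assumption: the score is a plain dot product.
    """
    scores = [sum(a * b for a, b in zip(s, query)) for s in outputs]
    return softmax(scores)


def weighted_sum(outputs, weights):
    """Sum the RNN outputs s_t, each scaled by its attention weight."""
    dim = len(outputs[0])
    return [sum(w * s[i] for w, s in zip(weights, outputs)) for i in range(dim)]


def sentence_vector(outputs, v_e, v_a):
    """Concatenate the entity-weighted and attribute-weighted sums into r."""
    epsilon = attention_weights(outputs, v_e)
    alpha = attention_weights(outputs, v_a)
    return weighted_sum(outputs, epsilon) + weighted_sum(outputs, alpha)
```

If each s_t has d elements, the returned r has 2d elements, matching the concatenation described in the text.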

Relation-Features Approach
This approach trains a linear classifier that uses the relations between a given entity and a given attribute as features. In the first step, we annotate all documents with the following 11 types of relations:
believe: Showing someone's belief, such as "X likes Y" and "X avoids Y"
significant: Showing X's significance, such as "X is impressive" and "X is terrible"
require: Showing a requirement, such as "X needs Y"
equivalent: Showing that X is equivalent to Y, such as "X viewed Y" and "X regarded Y"
include: Showing inclusion or possession, such as "X has Y" and "X equips Y"
contrast: Comparing X with Y, such as "Y is ... than X" and "Y is ... compared to X"
affect: Showing that X affects Y, such as "X increases Y" and "X causes Y"
state: Showing a statement, such as "X doubts that Y"
negation: Showing negations, such as "not X" and "no X"
shift: Reversing X's polarity, such as "X ban" and "X shortage"
absolutize: Fixing the polarity of X, such as "X problem" and "X risk"
These annotations were originally developed for an argument-generation system (Sato et al., 2015).
In the second step, we identify an entity expression and an attribute expression that correspond to a given entity-attribute pair. We use a simple dictionary-matching approach. In the restaurant domain, we use a given target annotation as an entity expression. In the laptop domain, we prepare a list of entities extracted from a training dataset. For both domains, we create an attribute dictionary. Entries of the dictionary are manually extracted from training datasets. Then, we assign a sentimental polarity (positive, negative, or neutral) to each entry.
In the third step, we create features for a linear classifier. Those features are generated by combining annotations to capture various relations of a target entity-attribute pair. For example, we examine whether an affect annotation is negated or not and whether a target entity is a subject of an affect annotation or an object.
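The feature generation in the third step can be illustrated as follows. The annotation format (a dictionary with `type`, `subject`, `object`, and `negated` keys) and the feature naming scheme are hypothetical and serve only to show how the role of the target entity and the negation status can be combined into one feature.

```python
def relation_features(annotations, entity):
    """Create binary features from relation annotations involving `entity`.

    Each feature combines the relation type, the role of the entity
    (subject or object), and whether the relation is negated.
    """
    features = {}
    for annotation in annotations:
        if annotation.get("subject") == entity:
            role = "subject"
        elif annotation.get("object") == entity:
            role = "object"
        else:
            continue  # the relation does not involve the target entity
        negation = "negated" if annotation.get("negated") else "plain"
        features[f"{annotation['type']}|{role}|{negation}"] = 1.0
    return features
```

A sparse dictionary of this kind can be converted into a feature vector for a linear classifier.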
Finally, we classify the sentimental polarity by using a linear classifier. We use a linear SVM in scikit-learn (Pedregosa et al., 2011) as the implementation of the classifier, with the parameters C = 0.1, loss = squared_epsilon_insensitive, and penalty = l2. Table 7 shows the results for Slot 3. For each domain, we select either the neural method or the rule-based method by comparing their scores on the ABSA15 dataset. In the restaurant domain, the proposed method improves the accuracy by 10 percentage points compared with the baseline system.

Results
We merged the sentence-wise estimations to create document-wise estimations. We gathered the polarities of each entity-attribute pair. If the results for a pair included both positive and negative polarities, we judged the pair as conflicting. We can observe a similar tendency to the results in Subtask 1.
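The merging rule can be sketched as below. Only the conflict rule is taken from the text. The handling of the remaining cases (a non-neutral polarity takes precedence over neutral) and the label string "conflict" are assumptions made for this example, and the function names are hypothetical.

```python
from collections import defaultdict


def merge_polarities(polarities):
    """Merge the sentence-level polarities of one entity-attribute pair."""
    labels = set(polarities)
    if "positive" in labels and "negative" in labels:
        return "conflict"
    # Assumption: a non-neutral polarity takes precedence over neutral.
    if "positive" in labels:
        return "positive"
    if "negative" in labels:
        return "negative"
    return "neutral"


def document_polarities(sentence_results):
    """Group (pair, polarity) results by pair and merge each group."""
    grouped = defaultdict(list)
    for pair, polarity in sentence_results:
        grouped[pair].append(polarity)
    return {pair: merge_polarities(found) for pair, found in grouped.items()}
```

For example, a review whose sentences rate FOOD#QUALITY as positive in one place and negative in another is labeled as conflicting for that pair.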

Conclusions
In this paper, we described the participation of the bunji team in SemEval-2016. We used both a neural approach and a rule-based approach to model the entity-attribute relationship. We confirmed the effectiveness of the proposed methods in a target estimation task and a polarity estimation task in the restaurant domain, while our overall ranks were modest. In future work, we plan to investigate network structures that are simple enough to be trained with a relatively small dataset. For the rule-based system, we plan to add more rules to improve precision.