Entity-Aware Dependency-Based Deep Graph Attention Network for Comparative Preference Classification

This paper studies the task of comparative preference classification (CPC). Given two entities in a sentence, our goal is to classify whether the first (or the second) entity is preferred over the other, or whether no comparison is expressed between the two entities at all. Existing works either do not learn entity-aware representations well and fail to deal with sentences involving multiple entity pairs, or use sequential modeling approaches that are unable to capture long-range dependencies between the entities. Some also use traditional machine learning approaches that do not generalize well. This paper proposes a novel Entity-aware Dependency-based Deep Graph Attention Network (ED-GAT) that employs multi-hop graph attention over a dependency-graph representation of the sentence to leverage both the semantic information from word embeddings and the syntactic information from the dependency graph. Empirical evaluation shows that the proposed model achieves state-of-the-art performance on comparative preference classification.


Introduction
Given a sentence that contains two entities of interest, the task of Comparative Preference Classification is to decide whether there is a comparison between the two entities and, if so, which entity is preferred (Jindal and Liu, 2006a; Ganapathibhotla and Liu, 2008; Liu, 2012; Panchenko et al., 2019). For example, in sentence s1 (shown in Table 1), there is a comparison between the two entities of interest, and "golf" is preferred over "baseball". This sentence contains the explicit comparative predicate "easier". The task seems straightforward but is quite challenging due to many counterexamples. For example, s2 shows that "better" may not indicate a comparison. s3, another counterexample, shows that "slower" indeed indicates a comparison, but not between "Perl" and "Python": the comparison is between "tools" and "K9".

Table 1: Example sentences.
ID | Sentence
s1 | Golf is easier to pick up than baseball.
s2 | I'm considering learning Python and more PHP if any of those would be better.
s3 | The tools based on Perl and Python is much slower under Windows than K9.

Problem statement. Given a sentence $s = (w_1, w_2, \ldots, e_1, \ldots, e_2, \ldots, w_n)$, where $e_1$ and $e_2$ are entities consisting of a single word or a phrase and $e_1$ appears before $e_2$ in the sentence, our goal is to classify the comparative preference direction between these two entities into one of three classes: {BETTER, WORSE, NONE}. BETTER (WORSE) means $e_1$ is preferred (not preferred) over $e_2$. NONE means that there is no comparative relation between $e_1$ and $e_2$.
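To make the problem statement concrete, here is how two sentences of Table 1 could be rendered as data; the field names and entity choices are hypothetical, not the benchmark dataset's actual schema.

```python
# Hypothetical CPC instances for two sentences of Table 1 (field names are
# illustrative; the actual CompSent-19 schema may differ).
instances = [
    {"sentence": "Golf is easier to pick up than baseball.",
     "e1": "Golf", "e2": "baseball",
     "label": "BETTER"},   # e1 ("Golf") is preferred over e2 ("baseball")
    {"sentence": "I'm considering learning Python and more PHP "
                 "if any of those would be better.",
     "e1": "Python", "e2": "PHP",
     "label": "NONE"},     # "better" here does not compare the two entities
]
```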
Although closely related, Comparative Preference Classification (CPC) is different from Comparative Sentence Identification (CSI), which is a 2-class classification problem that classifies a sentence as comparative or non-comparative. In previous work, Jindal and Liu (2006a) addressed CSI without considering which two entities are involved in a comparison. Tkachenko and Lauw (2015) employed dependency graph features to approach the CSI task given two entities of interest. In this entity-aware case, syntactic features are crucial. However, because the model does not use word embeddings, it generalizes poorly to the many different ways of expressing comparisons. Panchenko et al. (2019) gave the state-of-the-art result on the CPC task by using a pretrained sentence encoder to produce sentence embeddings as features for classification. However, this model is not entity-aware and does not use the dependency graph information.

For the CPC task, building a model that is entity-aware and also explicitly uses the dependency graph information is vital, for the following reasons. The dependency graph gives a clue that the two entities of interest in s2 of Table 1 are not involved in a comparison, although the sentence contains the comparative indicator "better". s3 (also refer to Figure 1) has two entity pairs, which makes an entity-aware model necessary. The pair of entities tools and K9 are far away from each other in the word sequence, but in the dependency graph they are just two hops away from each other and one hop away from the key comparative predicate "slower". For the pair of entities Perl and Python, although both are sequentially near the word "slower", the dependency graph does not indicate that they are involved in a comparison. An entity-aware model can thus avoid the mistake of taking comparative predicates not associated with the entity pair as evidence. Likewise, the dependency graph of a sentence contains important clues that benefit comparative preference classification. Methods that are not entity-aware and do not model dependency structures cannot handle cases like s2 and s3.
To address the limitations of the previous models, we propose a novel Entity-aware Dependency-based Deep Graph Attention Network (ED-GAT) for comparative preference classification. We represent a sentence by its dependency graph. This Graph Attention Network (GAT) (Veličković et al., 2018) based model naturally fuses word semantic information and dependency information within the model. By stacking several self-attention layers into a deep graph attention network, the model can effectively capture long-range dependencies, which is beneficial for identifying the comparative preference direction between two entities. We have applied this model to a real-world benchmark dataset, and the results show that incorporating the dependency graph information greatly helps this task. It outperforms strong and recent baselines, as discussed in the experiments.

Proposed Model
In this section, we first give a brief introduction to the GAT model. We then present the proposed ED-GAT model and discuss how to apply it to the CPC task.

Graph Attention Network (GAT)
The critical component of our model is the Graph Attention Network (GAT) (Veličković et al., 2018), which fuses graph-structured information and node features within the model. Its masked self-attention layers allow a node to attend to neighborhood features and to learn different attention weights for different neighboring nodes.
The node features fed into a GAT layer are $X = \{x_1, x_2, \ldots, x_n\}$, $x_i \in \mathbb{R}^F$, where $n$ is the number of nodes and $F$ is the feature size of each node. The attention mechanism of a typical GAT can be summarized by Equation (1):

$$
h_i^{\mathrm{out}} = \mathop{\Big\Vert}_{k=1}^{K} \sigma\Big( \sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} W^{k} x_j \Big), \qquad
\alpha_{ij}^{k} = \frac{\exp\big( f( a_k^{\top} [ W^{k} x_i \,\Vert\, W^{k} x_j ] ) \big)}{\sum_{u \in \mathcal{N}_i} \exp\big( f( a_k^{\top} [ W^{k} x_i \,\Vert\, W^{k} x_u ] ) \big)}
\tag{1}
$$

Here, given the node feature vectors, node $i$ attends over its 1-hop neighbors $j \in \mathcal{N}_i$. $\Vert_{k=1}^{K}$ denotes the concatenation of $K$ multi-head attention outputs, $h_i^{\mathrm{out}} \in \mathbb{R}^{F'}$ is the output of node $i$ at the current layer, $\alpha_{ij}^{k}$ is the $k$-th head's attention weight between nodes $i$ and $j$, $W^{k} \in \mathbb{R}^{(F'/K) \times F}$ is a linear transformation, $a_k \in \mathbb{R}^{2F'/K}$ is the attention weight vector, and $f(\cdot)$ is the LeakyReLU non-linearity.
Overall, the input-output behavior of a single GAT layer is summarized as $H^{\mathrm{out}} = \mathrm{GAT}(X, A; \Theta^{l})$. The input is $X \in \mathbb{R}^{n \times F}$ and the output is $H^{\mathrm{out}} \in \mathbb{R}^{n \times F'}$, where $n$ is the number of nodes, $F$ is the node feature size, $F'$ is the GAT hidden size, and $A \in \mathbb{R}^{n \times n}$ is the adjacency matrix of the graph.
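As a concrete illustration of this input-output contract, here is a minimal sketch of a single GAT layer using PyTorch Geometric's GATConv (the library named in the implementation details below); the toy graph and sizes are ours, not the paper's configuration.

```python
# A single GAT layer H_out = GAT(X, A): a toy sketch with PyTorch Geometric.
import torch
from torch_geometric.nn import GATConv
from torch_geometric.utils import dense_to_sparse

n, F, F_prime, K = 5, 300, 300, 6          # nodes, feature size F, hidden size F', heads K
X = torch.randn(n, F)                      # node features (word embeddings)
A = torch.tensor([[0, 1, 0, 0, 0],         # adjacency matrix of a small undirected graph
                  [1, 0, 1, 1, 0],
                  [0, 1, 0, 0, 0],
                  [0, 1, 0, 0, 1],
                  [0, 0, 0, 1, 0]], dtype=torch.float)
edge_index, _ = dense_to_sparse(A)         # PyG represents edges as a 2 x |E| index tensor

# Each of the K heads outputs F'/K features; concatenation restores size F'.
layer = GATConv(F, F_prime // K, heads=K, concat=True)
H_out = layer(X, edge_index)               # shape: (n, F'), i.e., torch.Size([5, 300])
```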

ED-GAT for CPC task
We use the dependency parser of Chen and Manning (2014) to convert a sentence into a dependency parse graph. Each word corresponds to a node in the graph. The node features are the word embedding vectors, denoted as $x_i \in \mathbb{R}^F$ for node $i$; the input node feature matrix is $X \in \mathbb{R}^{n \times F}$. Note that an entity is either a single word or a multi-word phrase. To treat each entity as one node, we replace the whole entity word/phrase with "EntityA" or "EntityB" before parsing. A multi-word entity embedding is obtained by averaging the embeddings of the words in the entity.

We observe that for a given node in the dependency parse graph, both its parents and its children contain useful information for the task. To make the ED-GAT model treat both parents and children as neighbors, we simplify the original directed dependency graph into an undirected graph, whose structure is encoded into an adjacency matrix $A \in \mathbb{R}^{n \times n}$. ED-GAT does not attend to all neighbors of a given node equally: the attention weights are learned automatically during training based on the neighbors' usefulness to the task, regardless of whether they are parents or children in the dependency graph. The higher the attention weight given to a neighbor, the more useful this neighbor is to the task.
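The sketch below illustrates this preprocessing, using stanza as a stand-in for the Chen and Manning (2014) parser (our substitution, not the paper's toolkit); the naive string replacement and the `build_adjacency` helper are for illustration only.

```python
# Build the undirected dependency-graph adjacency matrix for a sentence, with
# each entity collapsed into a single placeholder node before parsing.
import numpy as np
import stanza

nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse")

def build_adjacency(sentence, e1, e2):
    # Replace each entity word/phrase with one placeholder token, so the
    # entity corresponds to exactly one node in the parse graph.
    text = sentence.replace(e1, "EntityA").replace(e2, "EntityB")
    sent = nlp(text).sentences[0]
    n = len(sent.words)
    A = np.eye(n)                       # self-loops let a node keep its own features
    for word in sent.words:             # stanza uses 1-based ids; head 0 is the root
        if word.head > 0:
            i, j = word.id - 1, word.head - 1
            A[i, j] = A[j, i] = 1.0     # undirected: parents and children are neighbors
    return A

A = build_adjacency("The tools based on Perl and Python is much slower "
                    "under Windows than K9.", "tools", "K9")
```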
In a single GAT layer, a word or an entity in the graph only attends over local information from its 1-hop neighbors. To enable the model to capture long-range dependencies, we stack L layers, which allows information from L hops away to propagate to a word. Our model is thus a deep graph attention network.
As illustrated in Figure 2, the stacking architecture is represented as
$$
H^{l+1} = \mathrm{GAT}(H^{l}, A; \Theta^{l}), \quad l \geq 0, \qquad H^{0} = X W_0 + b_0 .
$$
That is, the output of GAT layer $l$, $H^{l}_{\mathrm{out}} = \mathrm{GAT}(H^{l}, A; \Theta^{l})$, is the input of layer $l + 1$, denoted by $H^{l+1}$, and $H^{0}$ is the initial input, where $W_0 \in \mathbb{R}^{F \times F'}$ and $b_0$ are the projection matrix and bias vector. For an $L$-layer ED-GAT model, the output of the final layer is $H^{L}_{\mathrm{out}} \in \mathbb{R}^{n \times F'}$. We use a mask layer to fetch from $H^{L}_{\mathrm{out}}$ the two hidden vectors corresponding to the two entities of interest: $(h_{e_1}, h_{e_2}) = \mathrm{MaskLayer}(H^{L}_{\mathrm{out}})$.
Next, we concatenate these two vectors as $v = [h_{e_1} \Vert h_{e_2}]$ and use a feed-forward layer with a softmax function to project $v$ onto the three classes for prediction. Using $h_{e_1}$ and $h_{e_2}$ makes ED-GAT entity-aware: they are the outputs of the nodes corresponding to entities $e_1$ and $e_2$, each of which attends over its neighbors' features within $L$ hops in the graph, thereby leveraging both word semantics and dependency structure information.
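The following condensed sketch wires these pieces together: the initial projection $H^0 = XW_0 + b_0$, $L$ stacked GAT layers, the mask step that fetches the two entity vectors, and the linear softmax classifier. The ELU between layers and all variable names are our assumptions for illustration; the hidden size and head count follow the implementation details below.

```python
# A sketch of the ED-GAT architecture: projection, L GAT layers over the
# dependency graph, entity masking, and a 3-way classifier head.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class EDGAT(nn.Module):
    def __init__(self, emb_size=300, hidden=300, heads=6, num_layers=8, num_classes=3):
        super().__init__()
        self.proj = nn.Linear(emb_size, hidden)               # H^0 = X W_0 + b_0
        self.layers = nn.ModuleList(
            [GATConv(hidden, hidden // heads, heads=heads) for _ in range(num_layers)]
        )
        self.classifier = nn.Linear(2 * hidden, num_classes)  # {BETTER, WORSE, NONE}

    def forward(self, X, edge_index, e1_idx, e2_idx):
        h = self.proj(X)
        for gat in self.layers:                               # H^{l+1} = GAT(H^l, A)
            h = F.elu(gat(h, edge_index))                     # inter-layer ELU: our choice
        v = torch.cat([h[e1_idx], h[e2_idx]], dim=-1)         # mask layer: v = [h_e1 || h_e2]
        return self.classifier(v)                             # softmax applied in the loss
```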
The ED-GAT model is trained by minimizing the standard cross-entropy loss over training examples.

Related Works
Many papers have been devoted to exploring comparisons in text. For the CSI task, early works include (Jindal and Liu, 2006a; Ganapathibhotla and Liu, 2008). More recently, Park and Blake (2012) employed handcrafted syntactic rules to identify comparative sentences in scientific articles. For other languages such as Korean and Chinese, related works include (Huang et al., 2008), (Yang and Ko, 2009), and (Zhang and Jin, 2012). Other works identify entities, aspects, and comparative predicates in comparative sentences, e.g., (Jindal and Liu, 2006b), (Hou and Li, 2008), (Kessler and Kuhn, 2014), (Kessler and Kuhn, 2013), and (Feldman et al., 2007). Ganapathibhotla and Liu (2008) used lexicon properties to determine the preferred entities given the output of (Jindal and Liu, 2006b), which is quite different from our task.
There are also works related to product ranking using comparisons, such as (Kurashima et al., 2008), (Zhang et al., 2013), (Tkachenko and Lauw, 2014), and (Li et al., 2011). All of these related works solve problems in comparison analysis that are very different from our CPC task.
Works in NLP that use Graph Neural Networks and dependency graph structures include (Huang and Carley, 2019) and (Guo et al., 2019), but their tasks and models are different from ours.

Dataset
We perform experiments using the benchmark CompSent-19 dataset (Panchenko et al., 2019), where each sentence has an entity pair (e1, e2) and its comparative preference label. The original dataset is split into an 80% training set and a 20% test set. During the experiment, we further hold out part of the training set for validation. Table 2 summarizes the dataset statistics (Panchenko et al., 2019).

Model Implementation Details
The Stanford Neural Network Dependency Parser (Chen and Manning, 2014) is used to build the dependency parse graph for each sentence. In our experiments, we use two kinds of pretrained word embeddings: GloVe embeddings (Pennington et al., 2014) and BERT embeddings (Devlin et al., 2019). The input to BERT is formatted as the standard BERT input, with "[CLS]" before and "[SEP]" after the sentence tokens. For this, we employ the BERT tokenizer to tokenize each word into word pieces (tokens). The output of the pretrained BERT model is a sequence of embeddings, each of size 768 and corresponding to one word piece. We average the word-piece embeddings of each original word to get the embedding for that word (node in the dependency graph). Note that the word embeddings are kept frozen and are not fine-tuned by the subsequent model structure.
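A sketch of this word-piece averaging with the Hugging Face transformers library (our choice of toolkit and checkpoint; the paper does not name either), using example sentence s1:

```python
# Frozen BERT features per word: tokenize into word pieces, run BERT once,
# then average the piece vectors that belong to each original word.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")  # checkpoint assumed
bert = BertModel.from_pretrained("bert-base-uncased").eval()        # frozen, no fine-tuning

words = ["Golf", "is", "easier", "to", "pick", "up", "than", "baseball", "."]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")  # adds [CLS]/[SEP]

with torch.no_grad():
    pieces = bert(**enc).last_hidden_state[0]        # (num_pieces, 768)

word_ids = enc.word_ids()                            # maps each piece to its source word
node_features = torch.stack([                        # (None for [CLS]/[SEP])
    pieces[[i for i, w in enumerate(word_ids) if w == k]].mean(dim=0)
    for k in range(len(words))
])                                                   # (num_words, 768): one vector per node
```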
For the ED-GAT model, we set the hidden size to 300. The node features, which are the word embeddings, are first transformed into vectors of the hidden size and then fed into the ED-GAT model. We use 6 attention heads, a training batch size of 32, the Adam optimizer (Kingma and Ba, 2014) with learning rate 5e-4, a word embedding dropout rate (Srivastava et al., 2014) of 0.3, and a GAT attention dropout rate of 0. The model is implemented with PyTorch Geometric (PyG) (Fey and Lenssen, 2019) and trained on an NVIDIA GTX 1080 Ti GPU.
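Under these settings, a minimal training loop might look as follows; `EDGAT` is the sketch from Section 2.2, and the `examples` iterator yielding one parsed instance at a time is assumed (PyG mini-batching of 32 graphs is omitted for brevity).

```python
# Training sketch wiring the stated hyperparameters: Adam (lr 5e-4), embedding
# dropout 0.3, and the standard cross-entropy loss of Section 2.2.
import torch
import torch.nn as nn

model = EDGAT(emb_size=768, hidden=300, heads=6, num_layers=8)  # static BERT features
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
emb_dropout = nn.Dropout(p=0.3)                 # dropout on the input word embeddings
criterion = nn.CrossEntropyLoss()

for X, edge_index, e1_idx, e2_idx, label in examples:   # `examples` is assumed
    logits = model(emb_dropout(X), edge_index, e1_idx, e2_idx)
    loss = criterion(logits.unsqueeze(0), label.view(1))  # one graph per step here
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```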

Compared Models
We compare models from the previous literature with several variations of our proposed model.
Majority-Class assigns the majority label in the training set to each instance in the test set.
SentEmbed, given in (Panchenko et al., 2019), obtains sentence embeddings from a pretrained sentence encoder (Conneau et al., 2017; Bowman et al., 2015). The sentence embedding is then fed to XGBoost (Chen and Guestrin, 2016) for classification. For a fair comparison, we also feed the sentence embedding into a linear layer. The two variants are denoted SentEmbed-XGBoost and SentEmbed-Linear.
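A minimal sketch of the two classifier heads (the embedding matrices `E_train`/`E_test` and labels `y_train` from the pretrained sentence encoder are assumed):

```python
# SentEmbed-XGBoost vs. SentEmbed-Linear: the same sentence embeddings are fed
# to two different heads. E_train, y_train, E_test are assumed inputs.
import torch.nn as nn
from xgboost import XGBClassifier

xgb_head = XGBClassifier()                      # SentEmbed-XGBoost
xgb_head.fit(E_train, y_train)
pred = xgb_head.predict(E_test)

linear_head = nn.Linear(E_train.shape[1], 3)    # SentEmbed-Linear: a single linear
                                                # layer trained with cross-entropy
```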
SVM-Tree, given in (Tkachenko and Lauw, 2015), uses convolution kernel methods and dependency tree features to approach the CSI task. We use the one-vs-rest technique to adapt this model to our three-class CPC task.
WordEmbed-Avg first constructs a sentence embedding by averaging the word embeddings of all words in a sentence and then feeds it to a linear classifier. GloVe-Avg and BERT-Avg are the variants that use GloVe embeddings from GloVe.840B (Pennington et al., 2014) and static BERT embeddings (Devlin et al., 2019), respectively.
BERT-FT appends a linear classification layer on the hidden state corresponding to the first token "[CLS]" of the BERT sequence output and then fine-tunes the pretrained BERT weights on our task.
ED-GAT is the model proposed in this paper (Section 2.2). We use both GloVe and BERT embeddings. We use (L) to denote model variants with different numbers of layers and a suffix to denote the type of embedding. For example, ED-GAT-GloVe(8) is the ED-GAT model using GloVe embeddings with a depth of 8 layers. We also add the LSTM-BERT baseline, which uses the sequence output of a static BERT model to train an LSTM model; the final hidden vector is used for classification.

Results and Analysis
As shown in Table 3, the state-of-the-art (SOTA) baseline is SentEmbed-XGBoost. SentEmbed-Linear performs much worse than SentEmbed-XGBoost, which shows that XGBoost classifies sentence embeddings much better than a linear layer. GloVe-Avg and BERT-Avg, which simply average word embeddings, do not perform well. The result of LSTM-BERT shows that using BERT embeddings sequentially is not suitable for our task. BERT-FT fine-tunes BERT on our task, but its performance is below SOTA. During the experiments, we also found that the performance of BERT-FT is unstable: the training process quickly overfits the pretrained BERT weights.
For the ED-GAT model, we first tried training embeddings only on this dataset by randomly initializing the word embeddings. As expected, the results were significantly poorer than those using pretrained embeddings, in part because our training data is very small (see Table 2). As the baselines all use pretrained embeddings, we report the results with pretrained word embeddings in Table 3. When employing GloVe embeddings, surprisingly, ED-GAT-GloVe(10) performs better than BERT-FT, which is based on a language model pretrained on a huge corpus. We also tried word2vec embeddings (GoogleNews-vectors-negative300, https://code.google.com/archive/p/word2vec/) for ED-GAT and obtained results very similar to those with GloVe embeddings: the micro-F1 scores with word2vec for 8, 9, and 10 layers are 83.12, 83.33, and 84.86, respectively. To be concise, we do not include these results in Table 3.
Our model also uses the static BERT embedding, which further improves the result. Using the static BERT embedding avoids overfitting: on the one hand, it incorporates the rich semantic information in the BERT pretrained weights; on the other hand, ED-GAT's ability to leverage dependency graph features greatly helps the model capture the comparison between the entities and classify the preference direction. Our ED-GAT-BERT(8) sets the new state of the art for the CPC task in both micro-F1 and all class-wise F1 scores.

Figure 3: Effects of the number of layers in ED-GAT.
Effects of Model Depth. From Figure 3, we see that increasing the number of stacked layers improves the performance of the model. For ED-GAT-GloVe, since GloVe does not contain contextual information, the GAT structure built on the dependency graph greatly improves the result; even the 2-layer model achieves a good result. ED-GAT-BERT does not show the same effect because the BERT embedding already contains rich semantic information. Still, as the number of layers increases, ED-GAT-BERT becomes more powerful, as it captures longer-range dependencies.

Conclusion
This paper proposes a novel model called ED-GAT for Comparative Preference Classification. It naturally leverages dependency graph features and word embeddings to capture the comparison and to classify the preference direction between two given entities. Experimental results show that it outperforms all strong baselines, even fine-tuned BERT pretrained on a huge corpus.
Our future work aims to improve CPC performance further. We also plan to design novel models for the related tasks of entity extraction and aspect extraction from comparative sentences. Performing all these tasks jointly in a multitask learning framework is a promising direction as well, because it can exploit the shared features and the inherent relationships among these tasks to perform each of them better.