Syntactic Dependency Representations in Neural Relation Classification

We investigate the use of different syntactic dependency representations in a neural relation classification task and compare the CoNLL, Stanford Basic and Universal Dependencies schemes. We further compare with a syntax-agnostic approach and perform an error analysis in order to gain a better understanding of the results.


Introduction
Neural advances in the field of NLP challenge long-held assumptions regarding system architectures. Classical NLP systems, in which components of increasing complexity are combined in a pipeline architecture, are being challenged by end-to-end architectures that are trained on distributed word representations to directly produce different types of analyses traditionally assigned to downstream tasks. Syntactic parsing has been viewed as a crucial component for many tasks aimed at extracting various aspects of meaning from text, but recent work challenges many of these assumptions. For the task of semantic role labeling, for instance, systems that make little or no use of syntactic information have achieved state-of-the-art results (Marcheggiani et al., 2017). For tasks where syntactic information is still viewed as useful, a variety of new methods for the incorporation of syntactic information are employed, such as recursive models over parse trees (Socher et al., 2013; Ebrahimi and Dou, 2015), tree-structured attention mechanisms (Kokkinos and Potamianos, 2017), multi-task learning (Wu et al., 2017), or the use of various types of syntactically aware input representations, such as embeddings over syntactic dependency paths (Xu et al., 2015b).
Dependency representations have by now become widely used representations for syntactic analysis, often motivated by their usefulness in downstream applications. There is currently a wide range of different types of dependency representations in use, which vary mainly in terms of choices concerning syntactic head status. Some previous studies have examined the effects of dependency representations in various downstream applications (Miyao et al., 2008; Elming et al., 2013). Most recently, the Shared Task on Extrinsic Parser Evaluation (Oepen et al., 2017) was aimed at providing better estimates of the relative utility of different types of dependency representations and syntactic parsers for downstream applications. The downstream systems in this previous work have, however, been limited to traditional (non-neural) systems, and there is still a need for a better understanding of the contribution of syntactic information in neural downstream systems.
In this paper, we examine the use of syntactic representations in a neural approach to the task of relation classification. We quantify the effect of syntax by comparing to a syntax-agnostic approach and further compare different syntactic dependency representations that are used to generate embeddings over dependency paths. Figure 1 illustrates the three different dependency representations we compare: the so-called CoNLL-style dependencies (Johansson and Nugues, 2007), which were used for the 2007, 2008, and 2009 shared tasks of the Conference on Natural Language Learning (CoNLL), the Stanford 'basic' dependencies (SB) (Marneffe et al., 2006) and the Universal Dependencies (v1.3) (UD; McDonald et al., 2013; Marneffe et al., 2014; Nivre et al., 2016). We see that the analyses differ both in terms of their choices of heads vs. dependents and the inventory of dependency types. Where CoNLL analyses tend to view functional words as heads (e.g., the auxiliary verb are), the Stanford scheme capitalizes more on content words as heads (e.g., the main verb treated). UD takes the tendency to select contentful heads one step further, analyzing the prepositional complement as a head, with the preposition itself as a dependent case marker. This is in contrast to the CoNLL and Stanford schemes, where the preposition is the head.

Dependency representations
For syntactic parsing we employ the parser described in Bohnet and Nivre (2012), a transition-based parser which performs joint PoS-tagging and parsing. We train the parser on the standard training sections 02-21 of the Wall Street Journal (WSJ) portion of the Penn Treebank (Marcus et al., 1993). The constituency-based treebank is converted to dependencies using two different conversion tools: (i) the pennconverter software (Johansson and Nugues, 2007), which produces the CoNLL dependencies, and (ii) the Stanford parser, using either the option to produce basic dependencies or its default option, which is Universal Dependencies v1.3. The parser achieves a labeled accuracy score of 91.23 when trained on the CoNLL08 representation, 91.31 for the Stanford basic model and 90.81 for the UD representation, when evaluated against the standard evaluation set (section 23) of the WSJ. We acknowledge that these results are not state-of-the-art parse results for English; however, the parser is straightforward to use and re-train with the different dependency representations. We also compare to another widely used parser, namely the pre-trained parsing model for English included in the Stanford CoreNLP toolkit, which outputs Universal Dependencies only. However, it was clearly outperformed by our version of the Bohnet and Nivre (2012) parser in the initial development experiments.

Relation extraction system
We evaluate the relative utility of different types of dependency representations on the task of semantic relation extraction and classification in scientific papers, SemEval-2018 Task 7 (Gábor et al., 2018). We make use of the system of Nooralahzadeh et al. (2018): a CNN classifier with dependency paths as input, which ranked 3rd out of 28 participants in the overall evaluation of the shared task. Here, the shortest dependency path (sdp) connecting two target entities for each relation instance is provided by the parser and is embedded in the first layer of a CNN. We extend their system by (i) implementing a syntax-agnostic approach, (ii) implementing hyper-parameter tuning for each dependency representation, and (iii) adding Universal Dependencies as an input representation. We thus train classifiers with sdps extracted from the different dependency representations discussed above and measure the effect of this information by the performance of the classifier.

Dataset and Evaluation Metrics
We use the SemEval-2018 Task 7 dataset (Gábor et al., 2018) from its Subtask 1.1. The training data contains abstracts of 350 papers from the ACL Anthology Corpus, annotated for concepts and semantic relations. Given an abstract of a scientific paper with pre-annotated domain concepts, the task is to perform relation classification. The classification Subtask 1.1 contains 1228 entity pairs that are annotated based on five asymmetric relations (USAGE, RESULT, MODEL-FEATURE, PART_WHOLE, TOPIC) and one symmetric relation (COMPARE). The relation instance along with its directionality is provided in both the training and the test data sets. The official evaluation metric is the macro-averaged F1-score over the six semantic relations; we therefore compare the impact of the different dependency representations on the macro-averaged F1-score.
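The macro-averaged F1-score can be illustrated with a short sketch. This is not the official shared-task scorer; the function name `macro_f1` and the toy gold and predicted labels are assumptions made purely for illustration.

```python
def macro_f1(gold, predicted, labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    # Every class contributes equally, regardless of its frequency.
    return sum(scores) / len(scores)


# Toy labels (hypothetical, not taken from the dataset).
gold = ["USAGE", "USAGE", "USAGE", "COMPARE"]
predicted = ["USAGE", "USAGE", "COMPARE", "COMPARE"]
score = macro_f1(gold, predicted, ["USAGE", "COMPARE"])
```

Because each relation type contributes equally to the average, errors on infrequent relations affect the score as much as errors on frequent ones.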
The training set for Subtask 1.1 is quite small, which is a challenge for end-to-end neural methods. To overcome this, we combine the provided datasets for Subtask 1.1 and Subtask 1.2 (relation classification on noisy data), which provides an additional 350 abstracts and 1248 labeled entity pairs to train our model. This yields a positive impact (+16.00% F1) on the classification task in our initial experiments.

Pre-processing
Sentence and token boundaries are detected using the Stanford CoreNLP tool. Since most of the entities are multi-word units, we replace the entities with their codes in order to obtain a precise dependency path. Our example sentence All knowledge sources are treated as feature functions, an example of the USAGE relation between the two entities knowledge sources and feature functions, is thus transformed to All P05_1057_3 are treated as P05_1057_4.
Given an encoded sentence, we find the sdp connecting two target entities for each relation instance using a syntactic parser. Based on the dependency graph output by the parser, we extract the shortest dependency path connecting two entities. The path records the direction of arc traversal using left and right arrows (i.e. ← and →) as well as the dependency relation of the traversed arcs and the predicates involved, following Xu et al. (2015a). The entity codes in the final sdp are replaced with the corresponding word tokens at the end of the pre-processing step.
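The path extraction step can be sketched as a breadth-first search over the dependency graph, treated as undirected. The following is a minimal illustration and not the system's actual code: the function name `shortest_dependency_path`, the placeholder entity codes `E1` and `E2`, and the hand-written Stanford-style arcs (including their labels) are assumptions made for the example, and the output format only approximates the arrow notation described above.

```python
from collections import deque


def shortest_dependency_path(arcs, source, target):
    """arcs: (head, label, dependent) triples. Returns a token list or None."""
    # Undirected adjacency list that remembers the arc direction.
    neighbours = {}
    for head, label, dep in arcs:
        neighbours.setdefault(head, []).append((dep, label, "down"))
        neighbours.setdefault(dep, []).append((head, label, "up"))

    # Breadth-first search from the source entity.
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for nxt, label, direction in neighbours.get(node, []):
            if nxt not in previous:
                previous[nxt] = (node, label, direction)
                queue.append(nxt)
    if target not in previous:
        return None

    # Walk back from the target to recover the traversed arcs.
    steps = []
    node = target
    while previous[node] is not None:
        prev, label, direction = previous[node]
        steps.append((label, direction, node))
        node = prev
    steps.reverse()

    path = [source]
    for label, direction, node in steps:
        # The arrow points from the head towards the dependent.
        path += ["←" if direction == "up" else "→", label, node]
    return path


# Hypothetical Stanford-basic-style arcs for the encoded example sentence.
sb_arcs = [
    ("treated", "nsubjpass", "E1"),
    ("treated", "auxpass", "are"),
    ("treated", "prep", "as"),
    ("as", "pobj", "E2"),
    ("E1", "det", "All"),
]
path = shortest_dependency_path(sb_arcs, "E1", "E2")
```

Each element of the returned list (word, arrow, or dependency label) is a separate token, which matches the way the path elements are embedded individually in the CNN input layer.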
For the sentence above, we thus extract the path:

CNN model
The system is based on a CNN architecture similar to the one used for sentence classification in Kim (2014). Figure 2 provides an overview of the proposed model. It consists of four main layers:
1) Look-up table and embedding layer: In the first step, the model takes a shortest dependency path (i.e., the words, dependency edge directions and dependency labels) between entity pairs as input and maps it into a feature vector using a look-up table operation. Each element of the dependency path (i.e., word, dependency label and arrow) is transformed into an embedding vector by looking up the embedding matrix M ∈ R^{d×V}, where d is the dimension of the CNN embedding layer and V is the size of the vocabulary. Each column in the embedding matrix can be initialized randomly or with pre-trained embeddings. The dependency labels and edge directions are always initialized randomly.
2) Convolutional layer: The next layer performs convolutions with ReLU activation over the embeddings using multiple filter sizes and extracts feature maps.
3) Max pooling layer: By applying the max operator, the most effective local features are generated from each feature map.
4) Fully connected layer: Finally, the higher-level syntactic features are fed to a fully connected softmax layer, which outputs the probability distribution over the relations.
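The four layers can be traced in a small forward pass. This is only a sketch in plain Python under several assumptions: the weights are random and untrained, there is a single filter per filter width (a real model would use many), the embedding table is stored row-wise rather than as columns of M, and all names (`forward`, `filters`, `out_weights`) as well as the example path are hypothetical.

```python
import math
import random


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(path, vocab, embeddings, filters, out_weights, out_bias):
    # 1) Look-up: one d-dimensional vector per path element.
    vectors = [embeddings[vocab[token]] for token in path]
    dim = len(embeddings[0])

    pooled = []
    for width, weights, bias in filters:
        # Zero-pad paths that are shorter than the filter width.
        padded = vectors + [[0.0] * dim] * max(0, width - len(vectors))
        # 2) Convolution with ReLU over every window of `width` elements.
        feature_map = []
        for start in range(len(padded) - width + 1):
            window = [x for vec in padded[start:start + width] for x in vec]
            activation = sum(w * x for w, x in zip(weights, window)) + bias
            feature_map.append(max(0.0, activation))
        # 3) Max pooling keeps the strongest response of each filter.
        pooled.append(max(feature_map))

    # 4) Fully connected softmax layer over the relation types.
    scores = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(out_weights, out_bias)]
    return softmax(scores)


random.seed(0)
path = ["E1", "←", "nsubjpass", "treated", "→", "prep", "as", "→", "pobj", "E2"]
vocab = {token: i for i, token in enumerate(sorted(set(path)))}
dim, n_relations = 4, 6
embeddings = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
filters = [(w, [random.uniform(-0.5, 0.5) for _ in range(w * dim)], 0.0)
           for w in (3, 4, 5)]
out_weights = [[random.uniform(-0.5, 0.5) for _ in filters]
               for _ in range(n_relations)]
out_bias = [0.0] * n_relations
probs = forward(path, vocab, embeddings, filters, out_weights, out_bias)
```

The output is a probability distribution over the six relation types; with untrained weights it carries no meaning beyond showing the shape of the computation.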

Experiments
We run all the experiments with a multi-channel setting in which the first channel is initialized with pre-trained embeddings in static mode (i.e., it is not updated during training) and the second one is initialized randomly and is fine-tuned during training (non-static mode). The macro F1-score is measured by 5-fold cross validation. To deal with the effects of class imbalance, we weight the cost by the ratio of class instances; each observation thus receives a weight depending on the class it belongs to.
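One common way to realize such class-dependent weighting is sketched below. The exact weighting formula is not given here, so the inverse-frequency scheme in `class_weights` is an assumption, as are the function names and the toy label counts.

```python
import math
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: rare classes get larger weights (assumed scheme)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def weighted_cross_entropy(probabilities, gold, weights):
    """Mean cross-entropy in which each instance is scaled by its class weight."""
    losses = [-weights[g] * math.log(p[g]) for p, g in zip(probabilities, gold)]
    return sum(losses) / len(losses)


# Toy, imbalanced label distribution (hypothetical counts).
train_labels = ["USAGE"] * 6 + ["COMPARE"] * 2
weights = class_weights(train_labels)

predictions = [{"USAGE": 0.9, "COMPARE": 0.1}, {"USAGE": 0.6, "COMPARE": 0.4}]
loss = weighted_cross_entropy(predictions, ["USAGE", "COMPARE"], weights)
```

Under this scheme a misclassified instance of an infrequent relation costs more than one of a frequent relation, which counteracts the tendency to favour the majority class.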

Effect of syntactic information
To evaluate the effects of syntactic information in general for the relation classification task, we compare the performance of the model with and without the dependency paths. In the syntax-agnostic setup, a sentence that contains the participant entities is used as input for the CNN. We keep the hyper-parameter values equal to those reported in the original work (Kim, 2014).
To provide the sdp for the syntax-aware version we compare to, we use our parser with Stanford dependencies. We find that the effect of syntactic structure varies between the different relation types. However, the sdp information has a clear positive impact on all the relation types (Table 1). This can be attributed to the fact that the context-based representations suffer from irrelevant subsequences or clauses when target entities occur far from each other or there are other target entities in the same sentence. The sdp between two entities in the dependency graph captures a condensed representation of the information required to assert a relationship between two entities (Bunescu and Mooney, 2005).

Comparison of different dependency representations
To investigate the model performance with various parser representations, we create an sdp for each training example using the different parse models and use them as input to the relation classification model. With default parameters, there is a chance that these favour one of the representations. In order to perform a fair comparison, we use Bayesian optimization (Brochu et al., 2010) to locate optimal hyper-parameters for each of the dependency representations. We construct a Bayesian optimization procedure using a Gaussian process with 100 iterations and Expected Improvement (EI) as its acquisition function. We set the objective function to maximize the macro F1-score over 5-fold cross validation on the training set. Here we investigate the impact of various system design choices with the following parameters: I) Filter region size: ∈ {3, 4, 5, 6, 7, 8, 9, 3-4, 4-5, 5-6, 6-7, 7-8, 8-9, 3-4-5, 4-5-6, 5-6-7, 6-7-8, 7- […]
[…] showing the importance of tuning for these types of comparisons. The results furthermore show that the sdps based on the Stanford Basic (SB) representation provide the best performance, followed by the CoNLL08 representation. We observe that the results for the UD representation are quite a bit lower than the two others. Table 3 presents the effect of each parser representation in the classification task, broken down by relation type. We find that the UD-based model falls behind the others on most relation types (i.e., COMPARE, MODEL-FEATURE, PART_WHOLE, TOPIC).
To explore these differences in more detail, we manually inspect the instances for which the CoNLL/SB-based models correctly predict the relation type in 5-fold trials, whereas the UD-based model has an incorrect prediction. Table 4 shows some of these examples, marking the entities and the gold class of each instance and also showing the sdp from each representation. We observe that the UD paths are generally shorter.
A striking similarity between most of the instances is the fact that one of the entities resides in a prepositional phrase. Whereas the SB and CoNLL paths explicitly represent the preposition in the path, the UD representation does not. Clearly, the difference between, for instance, the USAGE and PART_WHOLE relations may be indicated by the presence of a specific preposition (X for Y vs. X of Y). This is also interesting since this particular syntactic choice has been shown in previous work to have a negative effect on intrinsic parsing results for English (Schwartz et al., 2012).
8 The probability that each element is kept, where 1 implies that none of the nodes are dropped out.
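The effect of head choice on the path can be made concrete with a small sketch that compares the words on the path under a Stanford-basic-style and a UD-style analysis of the example sentence All knowledge sources are treated as feature functions. The head assignments are written by hand following the description of the schemes in the introduction, not taken from parser output, and the function name `path_words` and the placeholder entity codes `E1` and `E2` are assumptions.

```python
def path_words(heads, source, target):
    """heads maps each word to its head (None for the root).

    Returns the words on the tree path from source to target.
    """
    def ancestors(word):
        chain = [word]
        while heads[chain[-1]] is not None:
            chain.append(heads[chain[-1]])
        return chain

    up = ancestors(source)
    down = ancestors(target)
    # The lowest common ancestor is the first shared word on the way up.
    common = next(word for word in up if word in down)
    return up[:up.index(common) + 1] + down[:down.index(common)][::-1]


# Stanford basic: the preposition heads its complement.
sb_heads = {"treated": None, "E1": "treated", "are": "treated",
            "as": "treated", "E2": "as", "All": "E1"}
# UD: the complement is the head and the preposition is a case dependent.
ud_heads = {"treated": None, "E1": "treated", "are": "treated",
            "E2": "treated", "as": "E2", "All": "E1"}

sb_path = path_words(sb_heads, "E1", "E2")
ud_path = path_words(ud_heads, "E1", "E2")
```

In this toy case the SB path passes through the preposition as, whereas the UD path connects the verb directly to the second entity, so the preposition never reaches the classifier.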

Conclusion
This paper has examined the use of dependency representations for neural relation classification and has compared three widely used representations. We find that representation matters and that certain choices have clear consequences in downstream processing. Future work will extend the study to neural dependency parsers and other relation classification data sets.