SIRIUS-LTG-UiO at SemEval-2018 Task 7: Convolutional Neural Networks with Shortest Dependency Paths for Semantic Relation Extraction and Classification in Scientific Papers

This article presents the SIRIUS-LTG-UiO system for SemEval 2018 Task 7 on Semantic Relation Extraction and Classification in Scientific Papers. First, we extract the shortest dependency path (sdp) between two entities; we then introduce a convolutional neural network (CNN) which takes the sdp embeddings as input and performs relation classification with differing objectives for each sub-task of the shared task. This approach achieved overall F1 scores of 76.7 and 83.2 for relation classification on clean and noisy data, respectively. Furthermore, for combined relation extraction and classification on clean data, it obtained F1 scores of 37.4 and 33.6 for each phase. Our system ranks 3rd in all three sub-tasks of the shared task.


Introduction
Relation extraction and classification can be defined as follows: given a sentence in which entities are manually annotated, we aim to identify the pairs of entities that are instances of the semantic relations of interest and classify them according to a pre-defined set of relation types. A range of different approaches have been applied to this task in previous work. Conventional classification approaches have made use of contextual, lexical and syntactic features combined with richer linguistic and background knowledge such as WordNet and FrameNet (Hendrickx et al., 2010; Rink and Harabagiu, 2010).
Recently, the re-emergence of deep neural networks has provided a way to learn features and representations automatically for complex interpretation tasks. These approaches have yielded impressive results for many different NLP tasks. The use of deep neural networks for relation classification has been investigated in several recent studies (Socher et al., 2012; Lin et al., 2016; Zhou et al., 2016). Convolutional neural networks (CNNs) have been effectively applied to extract lexical and sentence-level features for relation classification (Zhang and Wang, 2015; Lee et al., 2017; Nguyen and Grishman, 2015). However, these works consider whole sentences or the context between two target entities as input to the CNN. Such representations suffer from irrelevant sub-sequences or clauses when the target entities occur far from each other or other target entities occur in the same sentence. To avoid negative effects from irrelevant chunks or clauses and to capture the relation between two entities, Xu et al. (2015a), Liu et al. (2015) and Xu et al. (2015b) employ a CNN to learn more robust and effective relation representations from the shortest dependency path (sdp) between two entities. The sdp between two entities in the dependency graph captures a condensed representation of the information required to assert a relationship between them (Bunescu and Mooney, 2005). In this work, we follow this line of research and present a system based on a CNN architecture over shortest dependency paths, combined with domain-specific word embeddings, to extract and classify semantic relations in scientific papers.

System description
In this section, we describe the various components of our system.
Text pre-processing. For each relation instance in the training data set, we identify the sentence that contains the participating entities. Sentence and token boundaries are detected using the Stanford CoreNLP tool (Manning et al., 2014). Since most of the entities are multi-word units, in order to obtain a precise dependency path between entities, we replace the entities with their entity codes. The example sentence in (1) below is thus transformed to (2).
(1) Syntax-based statistical machine translation (MT) aims at applying statistical models to structured data.
Given an encoded sentence, we find the sdp connecting the two target entities for each relation instance using a syntactic parser (see below).
For syntactic parsing we employ the parser described in Bohnet and Nivre (2012), a transition-based parser which performs joint PoS-tagging and parsing. We train the parser on the standard training sections 02-21 of the Wall Street Journal (WSJ) portion of the Penn Treebank (Marcus et al., 1993). The constituency-based treebank is converted to dependencies using two different conversion tools: (i) the pennconverter software (Johansson and Nugues, 2007), which produces the so-called CoNLL-style dependencies employed in the CoNLL08 shared task on dependency parsing (Surdeanu et al., 2008), and (ii) the Stanford parser using the option to produce basic Stanford dependencies (de Marneffe et al., 2014). The parser achieves a labeled accuracy score of 91.23 when trained on the CoNLL08 representation and 91.31 for the Stanford basic model, evaluated against the standard evaluation set (section 23) of the WSJ. We also experimented with the pre-trained parsing model for English included in the Stanford CoreNLP toolkit (Manning et al., 2014), which outputs Universal Dependencies; however, it was clearly outperformed by our version of the Bohnet and Nivre (2012) parser in initial development experiments.
Based on the dependency graphs output by the parser, we extract the shortest dependency path connecting two entities. The path records the direction of arc traversal using left and right arrows (i.e. ← and →) as well as the dependency relation of the traversed arcs and the predicates involved, following Xu et al. (2015a). The entity codes in the final sdp are replaced with the corresponding word tokens at the end of the pre-processing step.
For the sentence in (1) and the two entities statistical models and structured data we thus extract the path in (3) below.
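The path extraction described above can be sketched as a breadth-first search over the dependency graph, treated as undirected, with the arrows recording the original arc direction. This is a minimal illustration, not the system's actual implementation; the toy graph and token names below are assumptions for the running example, and ASCII `<-`/`->` stand in for the paper's ← and → arrows.

```python
from collections import deque

def shortest_dependency_path(edges, source, target):
    """Return the shortest path between two tokens in a dependency graph.

    `edges` maps a head token to a list of (dependent, relation) pairs.
    Traversal is undirected; the arrows record the direction of each
    traversed arc, in the spirit of Xu et al. (2015a).
    """
    # Build an undirected adjacency list, remembering arc direction.
    adj = {}
    for head, deps in edges.items():
        for dep, rel in deps:
            adj.setdefault(head, []).append((dep, rel, "->"))  # head-to-dependent
            adj.setdefault(dep, []).append((head, rel, "<-"))  # dependent-to-head
    # Breadth-first search guarantees a shortest path in an unweighted graph.
    queue = deque([(source, [source])])
    seen = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for nxt, rel, arrow in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [f"{arrow}{rel}{arrow}", nxt]))
    return None  # no path between the two tokens

# Toy dependency fragment for "applying statistical models to structured data"
# (heads and relations are illustrative, not actual parser output).
edges = {
    "applying": [("models", "dobj"), ("data", "nmod")],
    "models": [("statistical", "amod")],
    "data": [("structured", "amod")],
}
sdp = shortest_dependency_path(edges, "models", "data")
print(sdp)
```

On this toy graph the path goes up from "models" to the shared predicate "applying" and back down to "data", yielding the arrow-annotated sequence used as CNN input.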
Word embeddings. In our system, two different sets of pre-trained word embeddings are used for initialization. One is the 300-dimensional pre-trained embeddings provided by the NLPL repository (Fares et al., 2017), trained on English Wikipedia data with word2vec (Mikolov et al., 2013), here dubbed wiki-w2v. In addition, we train a second set of domain-specific 300-dimensional embeddings (acl-w2v) on the ACL Anthology corpus: we obtain the XML versions of 22,878 articles from the ACL Anthology, extract the raw texts, and train the embeddings using the word2vec (Mikolov et al., 2013) implementation in gensim (Řehůřek and Sojka, 2010).
Classification Model. Our system is based on a Convolutional Neural Network (CNN) architecture similar to the one used for sentence classification in Kim (2014). Figure 1 provides an overview of the proposed model. It consists of four main layers, as follows:
Look-up Table and Embedding layer: In the first step, the model takes a dependency path, as in (3), as input and transforms it into a matrix representation by looking up the pre-trained word embeddings.
Convolutional Layer: The next layer applies convolutions with ReLU activation over the embedding layer, using multiple filter sizes (filter sizes ∈ [3, 4, 5]), and extracts feature maps over the tokens.
Max pooling Layer: By applying the max operator, the most salient local features are selected from each feature map.
Fully connected Layer: Finally, the higher-level features are fed to a fully connected softmax layer, which outputs a probability distribution over the relation classes.
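The forward pass through these four layers can be sketched in plain numpy. This is a minimal, untrained illustration of the Kim (2014)-style architecture, not the system's actual code; the vocabulary size, sequence length and number of classes below are assumptions chosen for the example (the 300-d embeddings, 128 filters per window size and widths 3/4/5 follow the paper's configuration).

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_relu(x, filters):
    """Valid 1-D convolution over token positions, followed by ReLU.
    x: (seq_len, emb_dim); filters: (n_filters, width, emb_dim)."""
    n_filters, width, _ = filters.shape
    seq_len = x.shape[0]
    out = np.zeros((seq_len - width + 1, n_filters))
    for i in range(seq_len - width + 1):
        window = x[i:i + width]  # (width, emb_dim) slice of the path matrix
        out[i] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0)  # ReLU activation

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cnn_forward(token_ids, emb, filter_banks, W, b):
    """Look-up table -> multi-width convolutions -> max-over-time pooling
    -> fully connected softmax, mirroring the four layers above."""
    x = emb[token_ids]                                    # embedding look-up
    pooled = [conv1d_relu(x, f).max(axis=0) for f in filter_banks]
    features = np.concatenate(pooled)                     # one max per feature map
    return softmax(features @ W + b)                      # relation probabilities

# Toy dimensions: vocab 50, 300-d embeddings, 128 filters for each of the
# widths 3, 4 and 5, and six relation classes (illustrative).
vocab, dim, n_filters, n_classes = 50, 300, 128, 6
emb = rng.normal(size=(vocab, dim))
filter_banks = [rng.normal(scale=0.01, size=(n_filters, w, dim)) for w in (3, 4, 5)]
W = rng.normal(scale=0.01, size=(3 * n_filters, n_classes))
b = np.zeros(n_classes)

probs = cnn_forward(rng.integers(0, vocab, size=10), emb, filter_banks, W, b)
```

With random weights the output is of course uninformative, but the shapes make the pipeline concrete: an sdp of 10 tokens yields 3 × 128 pooled features and a 6-way probability distribution.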

Experiments
Dataset. For each sub-task, the training data consists of abstracts of papers from the ACL Anthology corpus with pre-annotated entities. For sub-tasks 1.1 and 2, the training dataset is the same: it contains manually annotated entities that represent domain concepts specific to Natural Language Processing (NLP). In sub-task 1.2 the entities are automatically assigned and therefore contain a fair amount of noise (verbs, irrelevant words); the terms include high-level terms (e.g. "algorithm", "paper", "method") and are not always full NPs (Gábor et al., 2018). Since the related entity pairs and the relation types are provided for the full dataset, we extend the dataset for sub-tasks 1.1 and 2 by extracting the related entities and their corresponding sdps from the sub-task 1.2 dataset. In order to train a model for sub-task 2, we also augment the dataset with NONE relation instances extracted from the corresponding dataset (see Section 2). Table 1 shows the number of instances for each relation class. As we can see, the class distribution is clearly unbalanced.

Model settings
We keep the hyperparameter values reported in the original work (Kim, 2014), i.e., 128 filters for each window size, a dropout rate of ρ = 0.5 and an l2 regularization weight of 3. To counter the effects of class imbalance, we weight the cost by the ratio of class instances: each observation receives a weight depending on the class it belongs to. The influence of minority-class observations is thereby increased through a higher weight, while that of majority-class observations is decreased. Furthermore, to guarantee that each fold in n-fold cross-validation has the same class proportions during training, evaluation and testing, we apply the stratification technique proposed by Sechidis et al. (2011). We use the validation set to detect when overfitting starts during training; using early stopping, training is then halted before convergence to avoid overfitting (Prechelt, 1998). Since the official evaluation metric is the macro-averaged F1-score, we implement early stopping (patience = 20) based on the macro-F1 score on the development set.

Model variants
We run experiments with several variants of the model, as follows:
cnn.rand: A baseline model, where all elements in the embedding layer are randomly initialized and updated during training.
cnn.wiki-w2v: The embedding layer is initialized with the pre-trained Wikipedia word embeddings and fine-tuned for the target task.
cnn.acl-w2v: The embedding layer is initialized with the pre-trained ACL Anthology word embeddings and fine-tuned for the target task.
cnn.multi.rand: There are two embedding layers, each serving as a 'channel' in the CNN architecture. Both channels are initialized randomly; only one of them is updated during training, while the other remains static.
cnn.multi.wiki-w2v: Same as before, but the channels are initialized with Wikipedia embedding vectors.
cnn.multi.acl-w2v: The two channels are initialized with ACL embedding vectors.
cnn.multi.wiki-w2v.rand: The first channel is initialized with Wikipedia embeddings in static mode, and the second is initialized randomly in non-static mode.
cnn.multi.acl-w2v.rand: Same as the previous setting, but the first channel makes use of the ACL embeddings.
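The multi-channel variants differ only in how the two embedding channels are initialized and which one is fine-tuned. A compact sketch of that initialization logic is given below; the matrices standing in for the pre-trained Wikipedia and ACL embeddings are hypothetical placeholders, and the (static, non-static) convention is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB, DIM = 100, 300  # toy vocabulary; the paper uses 300-d embeddings

def random_embeddings():
    """Fresh randomly initialized embedding matrix."""
    return rng.normal(scale=0.25, size=(VOCAB, DIM))

def init_channels(variant, wiki, acl):
    """Return (static_channel, nonstatic_channel) for a multi-channel variant.

    The first matrix is kept fixed during training; the second is fine-tuned.
    `wiki` and `acl` stand in for the pre-trained embedding matrices.
    """
    table = {
        "cnn.multi.rand":          (random_embeddings(), random_embeddings()),
        "cnn.multi.wiki-w2v":      (wiki, wiki.copy()),
        "cnn.multi.acl-w2v":       (acl, acl.copy()),
        "cnn.multi.wiki-w2v.rand": (wiki, random_embeddings()),
        "cnn.multi.acl-w2v.rand":  (acl, random_embeddings()),
    }
    return table[variant]

# Hypothetical pre-trained matrices for demonstration only.
wiki = random_embeddings()
acl = random_embeddings()
static, tuned = init_channels("cnn.multi.acl-w2v.rand", wiki, acl)
```

The single-channel variants (cnn.rand, cnn.wiki-w2v, cnn.acl-w2v) correspond to keeping only the non-static channel.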

Results
During development, we investigate the performance of different configurations: different dependency representations (CoNLL08 and Stanford basic) and model variants (see above), running 5-fold cross-validation (i.e. 3 folds for training, 1 fold for evaluation and 1 fold for testing). The experiments show that the multi-channel mode outperforms the single-channel setting only in the classification sub-tasks. The results suggest that having a significant number of instances per relation helps the model classify better. The use of pre-trained embeddings helps the model in class assignment; in particular, the domain-specific embeddings (i.e. acl-w2v) provide the higher performance gains. Table 2 presents the F1-score of the best performing model for each sub-task under 5-fold cross-validation on the training data. In the evaluation period, we re-run 5-fold cross-validation using the selected model for each sub-task; however, in this setting we use 4 folds for training and 1 fold as a development set, and apply the resulting model to the evaluation dataset. We select the 1st and 2nd best performing models on the development sets, as well as the majority vote (mv) of the 5 models, for the final submission. The final results are shown in Table 3.

Conclusion
We present a CNN model over shortest dependency paths between entity pairs for relation extraction and classification. We examine various architectures for the proposed model. The experiments demonstrate the effectiveness of domain-specific word embeddings for all sub-tasks, as well as sensitivity to the specific dependency representation employed in the input layer. Our future work includes: 1) performing error analysis for the different sub-tasks, and 2) investigating the effects of different dependency representations on relation extraction and classification.

Figure 1: Model architecture with two channels for an example shortest dependency path (CNN model from Kim (2014)).

Table 1: Number of instances for each relation in the final dataset.

Table 3: Official evaluation results of the submitted runs on the test set.