MIT at SemEval-2017 Task 10: Relation Extraction with Convolutional Neural Networks

Over 50 million scholarly articles have been published: they constitute a unique repository of knowledge. In particular, one may infer from them relations between scientific concepts. Artificial neural networks have recently been explored for relation extraction. In this work, we continue this line of work and present a system based on a convolutional neural network to extract relations. Our model ranked first in the SemEval-2017 task 10 (ScienceIE) for relation extraction in scientific articles (subtask C).


Introduction and related work
The number of articles published every year keeps increasing (Druss and Marcus, 2005; Larsen and Von Ins, 2010), and well over 50 million scholarly articles have been published so far (Jinha, 2010). While this repository of human knowledge contains invaluable information, it has become increasingly difficult to take advantage of all of it due to its sheer volume.
One challenge is that the knowledge present in scholarly articles is mostly unstructured. One approach to organizing this knowledge is to classify each sentence (Kim et al., 2011; Amini et al., 2012; Hassanzadeh et al., 2014; Dernoncourt et al., 2016). Another approach is to extract entities and the relations between them, which is the focus of the ScienceIE shared task at SemEval-2017 (Augenstein et al., 2017).
Relation extraction can be seen as a process comprising two steps that can be done jointly (Li and Ji, 2014) or separately: first, entities of interest need to be identified, then the relation among each possible set of entities has to be determined.
In this work, we concentrate on the second step (often referred to as relation extraction or classification) and on binary relations, i.e., relations between two entities. Extracted relations can be used for a variety of tasks such as question-answering systems (Ravichandran and Hovy, 2002), ontology extension (Schutz and Buitelaar, 2005), and clinical trials (Frunza and Inkpen, 2011).
In this paper, we describe the system that we submitted for the ScienceIE shared task. Our system is based on convolutional neural networks and ranked first for relation extraction (subtask C).
Existing systems for relation extraction can be classified into five categories (Zettlemoyer, 2013): systems based on hand-built patterns (Yangarber and Grishman, 1998), bootstrapping methods (Brin, 1998), unsupervised methods (Gonzalez and Turmo, 2009), distant supervision (Snow et al., 2004), and supervised methods. We focus on supervised methods, as the ScienceIE shared task provides a labeled training set.
More recently, a few studies have investigated the use of artificial neural networks for relation extraction (Socher et al., 2012; Nguyen and Grishman, 2015; Hashimoto et al., 2013). Our approach follows this line of work.

Model
Our model for relation extraction comprises three parts: preprocessing, convolutional neural network (CNN), and postprocessing.

Preprocessing
The preprocessing step takes as input each raw text (in ScienceIE, a paragraph of a scientific article) together with the locations of all entities present in the text, and outputs several examples. Each example is represented as a list of tokens, each with four features: its relative positions to the two entity mentions, its entity type, and its part-of-speech (POS) tag. Figure 1 shows an example from the ScienceIE corpus in the table on the left. Sentence and token boundaries as well as POS tags are detected using the Stanford CoreNLP toolkit (Manning et al., 2014), and every pair of entity mentions of the same type within a sentence is considered a candidate relation. We also remove any reference marks (e.g., [1,2]), which are irrelevant to the task, and ensure that the sentences are not too long by eliminating the tokens before the beginning of the first entity mention and after the end of the second entity mention.
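The candidate generation and sentence cutting described above can be sketched as follows. This is a minimal illustration, not the submitted system: the function names, the (start, end, type) span format, the "O" tag for tokens outside both mentions, the reference-mark pattern, and the exact definition of relative position (zero inside a mention) are all assumptions. POS tags, which the paper obtains from Stanford CoreNLP, are omitted.

```python
import re
from itertools import combinations

# Hypothetical pattern for reference marks such as [1,2] or [3-5].
REFERENCE_MARK = re.compile(r"\[\d+(?:\s*[,\-]\s*\d+)*\]")

def strip_references(text):
    """Remove bracketed reference marks from raw text."""
    return REFERENCE_MARK.sub("", text)

def relative_position(i, start, end):
    """Signed token distance from token i to the span [start, end)."""
    if i < start:
        return i - start
    if i >= end:
        return i - end + 1
    return 0

def make_examples(tokens, entities):
    """Build one example per pair of same-type mentions in one sentence.

    tokens:   list of token strings.
    entities: list of (start, end, type) token spans, end exclusive.
    """
    examples = []
    for (s1, e1, t1), (s2, e2, t2) in combinations(sorted(entities), 2):
        if t1 != t2:
            continue  # only same-type pairs are candidate relations
        # Cut the sentence: keep tokens from the first mention's start
        # to the second mention's end.
        rows = []
        for i in range(s1, max(e1, e2)):
            inside = s1 <= i < e1 or s2 <= i < e2
            rows.append({
                "token": tokens[i],
                "rel_pos_1": relative_position(i, s1, e1),
                "rel_pos_2": relative_position(i, s2, e2),
                "entity_type": t1 if inside else "O",
            })
        examples.append(rows)
    return examples

tokens = ["A", "dog", "is", "an", "animal", "."]
entities = [(1, 2, "Material"), (4, 5, "Material"), (0, 1, "Task")]
examples = make_examples(tokens, entities)
```

Here only the two "Material" mentions form a candidate pair, and the resulting example covers the tokens from "dog" through "animal".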

CNN architecture
Second, the CNN takes each preprocessed sentence as input and predicts the relation between the two entities. The CNN architecture, illustrated in Figure 1, consists of four main layers, similar to those used in text classification (Collobert et al., 2011; Kim, 2014; Lee and Dernoncourt, 2016; Gehrmann et al., 2017).
1. The embedding layer converts each feature (word, relative positions 1 and 2, type of entity, and POS tag) into an embedding vector via a lookup table and concatenates them.
2. The convolutional layer with ReLU activation transforms the embeddings into feature maps by sliding filters over the tokens.
3. The max pooling layer takes the most effective feature in each feature map by applying the max operator.
4. The fully connected layer with softmax activation outputs the probability of each relation.
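The four layers can be illustrated with a small forward pass in plain Python. This is a sketch under stated assumptions rather than the authors' implementation: the embedding size, filter width, number of filters, random initialization, and zero-padding of short inputs are all chosen here for illustration, and dropout and training are left out.

```python
import math
import random

def embed(example, tables):
    """Embedding layer: look up one vector per feature and concatenate."""
    return [[v for name, value in tok.items() for v in tables[name][value]]
            for tok in example]

def conv_relu(x, filters, width):
    """Convolutional layer: slide each filter over windows of `width` tokens."""
    dim = len(x[0])
    x = x + [[0.0] * dim] * max(0, width - len(x))  # zero-pad short inputs
    maps = []
    for weights, bias in filters:
        fmap = []
        for i in range(len(x) - width + 1):
            window = [v for tok in x[i:i + width] for v in tok]
            act = sum(w * v for w, v in zip(weights, window)) + bias
            fmap.append(max(0.0, act))  # ReLU
        maps.append(fmap)
    return maps

def max_pool(maps):
    """Max pooling layer: keep the largest activation of each feature map."""
    return [max(fmap) for fmap in maps]

def dense_softmax(h, weights, biases):
    """Fully connected layer followed by a numerically stable softmax."""
    logits = [sum(w * v for w, v in zip(row, h)) + b
              for row, b in zip(weights, biases)]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy input and randomly initialized parameters (all sizes are assumptions).
rng = random.Random(0)

def rand_vec(n):
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]

example = [
    {"token": "dog", "rel_pos_1": 0, "rel_pos_2": -3},
    {"token": "is", "rel_pos_1": 1, "rel_pos_2": -2},
    {"token": "an", "rel_pos_1": 2, "rel_pos_2": -1},
    {"token": "animal", "rel_pos_1": 3, "rel_pos_2": 0},
]
EMB, WIDTH, N_FILTERS = 4, 2, 5
LABELS = ["Synonym-of", "Hyponym-of", "None"]

tables = {}
for tok in example:
    for name, value in tok.items():
        tables.setdefault(name, {}).setdefault(value, rand_vec(EMB))

n_features = len(example[0])
filters = [(rand_vec(WIDTH * EMB * n_features), 0.0) for _ in range(N_FILTERS)]
fc_w = [rand_vec(N_FILTERS) for _ in LABELS]
fc_b = [0.0] * len(LABELS)

pooled = max_pool(conv_relu(embed(example, tables), filters, WIDTH))
probs = dense_softmax(pooled, fc_w, fc_b)
```

Because max pooling reduces every feature map to a single value, the input to the fully connected layer has a fixed size regardless of sentence length.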

Rule-based postprocessing
Finally, the postprocessing step uses the rules in Table 1 to correct the relations detected by the CNN, or to detect additional relations. These rules were developed from the examples in the training set, to be consistent with common sense.
Notation used in the tables: negative sampling follows Xu et al. (2015); "rel": relation; "arg": argument. "Syn", "Hypo", "Hyper", and "None" refer to the "Synonym-of", "Hyponym-of", "Hypernym-of", and "None" relations. Note that the "Hypernym-of" relation is the reverse of the "Hyponym-of" relation, introduced in addition to the relations annotated for the dataset.

Implementation
During training, the objective is to maximize the log probability of the correct relation type. The model is trained using stochastic gradient descent with minibatches of size 16, updating all parameters, i.e., token embeddings, feature embeddings, CNN filter weights, and fully connected layer weights, at each gradient descent step. For regularization, dropout is applied before the fully connected layer, and early stopping with a patience of 10 epochs is used based on the development set.
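The objective and the early stopping criterion can be sketched as follows. The helper names, the callback interface, and the choice of development-set score are hypothetical; the paper only states that the log probability of the correct relation is maximized and that the patience is 10 epochs.

```python
import math

def negative_log_likelihood(probs, gold_index):
    """Per-example loss: minimizing it maximizes the log probability
    of the correct relation type."""
    return -math.log(probs[gold_index])

def train_with_early_stopping(train_epoch, evaluate_dev,
                              patience=10, max_epochs=1000):
    """Stop once the development score has not improved for `patience` epochs.

    train_epoch:  callable that runs one epoch of minibatch SGD (hypothetical).
    evaluate_dev: callable returning a development score, higher is better.
    """
    best_score, best_epoch = float("-inf"), 0
    for epoch in range(1, max_epochs + 1):
        train_epoch()
        score = evaluate_dev()
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_score

# Toy run with a scripted sequence of development scores and patience 2.
scores = iter([0.1, 0.3, 0.2, 0.2, 0.9])
result = train_with_early_stopping(lambda: None, lambda: next(scores),
                                   patience=2)
```

In the toy run, training stops after the fourth epoch because the score has not improved for two epochs since epoch 2, so the later score of 0.9 is never reached. This is the usual trade-off of a patience-based criterion.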
The token embeddings are initialized using publicly available¹ pre-trained token embeddings, namely GloVe (Pennington et al., 2014) trained on Wikipedia and Gigaword 5 (Parker et al., 2011). The feature embeddings and the other parameters of the neural network are initialized randomly.
To deal with class imbalance, we upsampled the synonym and hyponym classes by duplicating the examples in the positive classes so that the upsampling ratio, i.e., the ratio of the number of positive examples in each class to that of the negative examples, is at least 0.5. Without upsampling, it was impossible to train the model.
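A minimal sketch of this upsampling is given below. It assumes that each example is a (features, label) pair and that every positive example is duplicated the same whole number of times; the paper does not specify how the duplicates are chosen, so this detail is an assumption.

```python
import math
from collections import defaultdict

def upsample(examples, negative_label="None", min_ratio=0.5):
    """Duplicate positive examples until each positive class has at least
    min_ratio times as many examples as the negative class."""
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    negatives = by_label[negative_label]
    target = math.ceil(min_ratio * len(negatives))
    result = list(negatives)
    for label, items in by_label.items():
        if label == negative_label:
            continue
        copies = max(1, math.ceil(target / len(items)))
        result.extend(items * copies)
    return result

# Toy data: 10 negatives, 2 synonym examples, 1 hyponym example.
data = ([(i, "None") for i in range(10)]
        + [(10, "Synonym-of"), (11, "Synonym-of")]
        + [(12, "Hyponym-of")])
balanced = upsample(data)
counts = defaultdict(int)
for _, label in balanced:
    counts[label] += 1
```

With 10 negative examples the target is 5 per positive class, so the two synonym examples are each copied three times and the single hyponym example five times.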

Dataset
We evaluate our model on the ScienceIE dataset (Augenstein et al., 2017), which consists of 500 journal articles evenly distributed among the domains of computer science, material sciences, and physics. Three types of entities are annotated: process, task, and material. The relation between each pair of entities of the same type within a sentence is annotated as either "Synonym-of", "Hyponym-of", or "None". Table 3 shows the number of examples for each relation class.
¹ http://nlp.stanford.edu/projects/glove/

Argument ordering strategies
One of the main challenges in relation extraction is the ordering of arguments in relations, as many relations are order-sensitive. For example, consider the sentence "A dog is an animal." If we set "dog" to be the first argument and "animal" the second, then the corresponding relation is "Hyponym-of"; however, if we reverse the argument order, then the "Hyponym-of" relation no longer holds. Therefore, it is crucial to ensure that 1) the CNN is provided with information about the argument order, and 2) it is able to use this information effectively. In our work, the former point is addressed by providing the CNN with the two relative position features, computed with respect to the first and the second argument of the relation respectively. To address the latter point, we experimented with four strategies for argument ordering, outlined in Table 2.
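The four strategies can be sketched as a function that turns one annotated relation into training examples. The contents of Table 2 are not reproduced here, so this is one reading of how the strategies are described in the prose of this paper. The strategy names, the representation of arguments by their text offsets, and in particular the labels assigned to reversed pairs under negative sampling are assumptions.

```python
# Label obtained when the two arguments of a relation are swapped.
REVERSE = {
    "Hyponym-of": "Hypernym-of",
    "Hypernym-of": "Hyponym-of",
    "Synonym-of": "Synonym-of",
    "None": "None",
}

def order_examples(arg1, arg2, label, strategy):
    """Return training examples (first_arg, second_arg, label).

    arg1, arg2: text offsets of the annotated first and second arguments.
    """
    if strategy == "correct":
        # Annotated order only, original relation classes only.
        return [(arg1, arg2, label)]
    if strategy == "correct+neg":
        # Assumption: the reversed pair is added as a negative example,
        # except for the order-insensitive "Synonym-of" relation.
        reversed_label = label if label == "Synonym-of" else "None"
        return [(arg1, arg2, label), (arg2, arg1, reversed_label)]
    if strategy == "fixed":
        # Order of appearance in the text, one example per relation,
        # using the reverse class when the annotated order is flipped.
        if arg1 <= arg2:
            return [(arg1, arg2, label)]
        return [(arg2, arg1, REVERSE[label])]
    if strategy == "any":
        # Both orders, with the reverse class for the swapped pair.
        return [(arg1, arg2, label), (arg2, arg1, REVERSE[label])]
    raise ValueError(f"unknown strategy: {strategy}")

# "an animal such as a dog": "dog" (offset 4) is a hyponym of "animal" (offset 1).
fixed = order_examples(4, 1, "Hyponym-of", "fixed")
both = order_examples(4, 1, "Hyponym-of", "any")
```

In the toy sentence the hyponym appears after its hypernym, so the fixed order strategy emits a single "Hypernym-of" example with the arguments in text order.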

Results and Discussion
Table 5 shows the results from experimenting with various argument ordering strategies. The correct order strategy performed the worst, and negative sampling improved over it only slightly, while the fixed order and any order strategies performed the best. The latter two strategies performed almost equally well in terms of micro-averaged F1-score. This implies that for relation extraction it may be advantageous to use both the original relation classes and their "reverse" relation classes for training, instead of using only the original relation classes with the "correct" argument ordering (with or without negative sampling). Moreover, ordering the arguments by their order of appearance in the text and training once per relation (i.e., fixed order) is as effective as training on each relation as two examples in the two possible argument orderings, one with the original relation class and the other with the reverse relation class (i.e., any order), despite the small size of the dataset.
The difference in performance between the correct order strategy and the fixed or any order strategies is more prominent for the "Hyponym-of" relation than for the "Synonym-of" relation. This is expected, since the argument ordering strategy differs only for the order-sensitive "Hyponym-of" relation. It is somewhat surprising, though, that the correct order strategy performs worse than the other strategies even for the order-insensitive "Synonym-of" relation. This may be due to the fact that the model does not see any training examples with the reversed argument ordering for the "Synonym-of" relation. In comparison, the negative sampling strategy, which learns from both the original and the reversed argument ordering for the "Synonym-of" relation, achieves performance comparable to that of the two best performing strategies.
We have also experimented with different evaluation strategies for the models trained with the any order and fixed order strategies. When the model is trained with the any order strategy, the choice of the evaluation strategy does not impact the performance. In contrast, when the model is trained with the fixed order strategy, it performs better if the same strategy is used for evaluation. This may be the reason that the model trained with the correct order strategy does not perform as well, since it has to be evaluated with a different strategy from training, namely the any order strategy, as we do not know the correct ordering of arguments for examples in the test set.
We have also tried training binary classifiers for the "Hyponym-of" and the "Synonym-of" relations separately and then merging the outputs of the best classifier for each relation. While the binary classifiers individually performed better than the multi-way classifier for each corresponding relation class, the overall performance based on the micro-averaged F1-score did not improve over the multi-way classifier after merging the outputs of the hyponym and the synonym classifiers.
Based on the results from the argument ordering strategy experiments, we submitted the model trained using the fixed order strategy, which ranked number one in the challenge. The result is shown in Table 6. To quantify the importance of various features of our model, we trained the model by adding features one by one, in the following order: word embeddings, relative positions, entity types, and POS tags. The results on the importance of the features as well as postprocessing are shown in Figure 2. Adding the relative position features improved the performance the most, while adding the entity type improved it the least.
Figure 3 quantifies the impact of two preprocessing steps, deleting brackets and cutting sentences, introduced to compensate for the small dataset size. Cutting the sentence before the first entity and after the second entity had a dramatic impact on the performance, while deleting brackets (i.e., removing the reference marks) improved the performance modestly. This implies that the text between the two entities contains most of the information about the relation between them.
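For reference, a micro-averaged F1-score pools true positives, false positives, and false negatives over all positive relation classes before computing precision and recall. The sketch below assumes one gold and one predicted label per candidate pair and treats "None" as the negative class. It is an illustration of the metric, not the official ScienceIE evaluation script.

```python
def micro_f1(gold, pred, negative_label="None"):
    """Micro-averaged F1 over all classes except the negative one."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != negative_label and p == g:
            tp += 1
        else:
            if p != negative_label:
                fp += 1  # predicted a relation that is not the gold one
            if g != negative_label:
                fn += 1  # missed (or mislabeled) a gold relation
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = ["Synonym-of", "Hyponym-of", "None", "Hyponym-of"]
pred = ["Synonym-of", "None", "Hyponym-of", "Hyponym-of"]
score = micro_f1(gold, pred)
```

Because the counts are pooled across classes, a gain on one class does not necessarily raise the pooled score once the outputs of separate classifiers are merged.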

Conclusion
In this article, we have presented an ANN-based approach to relation extraction, which ranked first in the SemEval-2017 task 10 (ScienceIE) for relation extraction in scientific articles (subtask C). We have experimented with various strategies to incorporate argument ordering for order-sensitive relations, showing that an effective strategy is to fix the argument ordering as it appears in the text by introducing reverse relations. We have also demonstrated that cutting the sentence before the first entity and after the second entity is effective for small datasets.

Figure 1: CNN architecture for relation extraction. The left table shows an example of input to the model.

Figure 2: Importance of features of CNN and postprocessing rules. w: word embeddings, rp: relative positions to the first and the second arguments, et: entity types, pos: POS tags.

Figure 3: Impact of the two preprocessing steps, deleting brackets and cutting sentences.

Table 1: Rules used for postprocessing.

Table 2: Argument ordering strategies. "w/ neg. smpl.": with negative sampling.

Table 3: Number of examples for each relation class in ScienceIE. "Dev": Development.

Table 5: Results for various ordering strategies on the development set of the ScienceIE dataset, averaged over 10 runs each. "corr. w/ n. s.": correct order with negative sampling. Hyp+Syn is obtained by merging the output of the best hyponym classifier and that of the best synonym classifier.

Table 6: Result on the test set of the ScienceIE dataset, using the official train/dev/test split.