GU IRLAB at SemEval-2018 Task 7: Tree-LSTMs for Scientific Relation Classification

SemEval 2018 Task 7 focuses on relation extraction and classification in scientific literature. In this work, we present our tree-based LSTM network for this shared task. Our approach placed 9th (of 28) for subtask 1.1 (relation classification), and 5th (of 20) for subtask 1.2 (relation classification with noisy entities). We also provide an ablation study of features included as input to the network.


Introduction
Information Extraction (IE) has applications in a variety of domains, including scientific literature. Entities and relations extracted from scientific articles could be used for a variety of tasks, including abstractive summarization, identification of articles that make similar or contrastive claims, and filtering based on article topics. While ontological resources can be leveraged for entity extraction (Gábor et al., 2016), relation extraction and classification remain challenging tasks. Relations are particularly valuable because (unlike simple entity occurrences) relations between entities capture lexical semantics. SemEval 2018 Task 7 (Semantic Relation Extraction and Classification in Scientific Papers) encourages research in relation extraction in scientific literature by providing common training and evaluation datasets (Gábor et al., 2018). In this work, we describe our approach using a tree-structured recursive neural network, and provide an analysis of its performance.
There has been considerable previous work with scientific literature due to its availability and interest to the research community. A previous shared task (SemEval 2017 Task 10) investigated the extraction of both keyphrases (entities) and relations in scientific literature (Augenstein et al., 2017). However, the relation set for this shared task was limited to just synonym and hypernym relationships. The top three approaches used for relation-only extraction included convolutional neural networks (Lee et al., 2017a), bi-directional recurrent neural networks with Long Short-Term Memory (LSTM, Hochreiter and Schmidhuber, 1997) cells (Ammar et al., 2017), and conditional random fields (Lee et al., 2017b).
There are several challenges related to scientific relation extraction. One is the extraction of the entities themselves. Luan et al. (2017) produce the best published results on the 2017 ScienceIE shared task for entity extraction using a semi-supervised approach with a bidirectional LSTM and a CRF tagger. Zheng et al. (2014) provide an unsupervised technique for linking scientific entities in the biomedical domain to an ontology.
Contribution. Our approach employs a tree-based LSTM network using a variety of syntactic features to perform relation label classification. We rank 9th (of 28) when manual entities are used for training, and 5th (of 20) when noisy entities are used for training. Furthermore, we provide an ablation analysis of the features used by our model. Code for our model and experiments is available.

Methodology
Syntactic information between entities plays an important role in relation extraction and classification (Mintz et al., 2009; MacAvaney et al., 2017). Similarly, sequential neural models, such as LSTMs, have shown promising results on scientific literature (Ammar et al., 2017). Therefore, in our approach, we leverage both syntactic structures and neural sequential models by employing a tree-based long short-term memory cell (tree-LSTM). Tree-LSTMs, originally introduced by Tai et al. (2015), have been successfully used to capture relation information in other domains (Xu et al., 2015; Miwa and Bansal, 2016). At a high level, tree-LSTMs operate very similarly to sequential models; however, rather than processing tokens sequentially, they follow syntactic dependencies. Once the model reaches the root of the tree, the output is used to compute a prediction, usually through a dense layer. We use the child-sum variant of tree-LSTM (Tai et al., 2015).
Formally, let $S_j = \{t_{1,j}, \ldots, t_{n,j}\}$ be a sentence of length $n$, and let $e_1 = \{t_i, \ldots, t_k\}$ and $e_2 = \{t_p, \ldots, t_q\}$ be two entities whose relationship we intend to classify. Let $H(e_1)$ and $H(e_2)$ be the roots of the syntactic subtrees spanning entities $e_1$ and $e_2$, respectively. Finally, let $T(e_1, e_2)$ be the syntactic subtree spanning from $H(e_1)$ to $H(e_2)$. The proposed model uses word embeddings of terms in $T(e_1, e_2)$ as inputs; the output of the tree-LSTM cell at the root of the syntactic tree is used to predict one of the six relation types ($y$) using a softmax layer. A diagram of our tree-LSTM network is shown in Figure 1.
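For reference, the child-sum cell can be written out as below, following the formulation of Tai et al. (2015). Here $x_j$ is the input vector of term $j$, $C(j)$ is the set of children of $j$ in $T(e_1, e_2)$, $\sigma$ is the logistic sigmoid, and $\odot$ is element-wise multiplication. The last line is only a sketch of the prediction step: the text states that a softmax layer is applied to the root output, and the single affine map with parameters $W^{(y)}$ and $b^{(y)}$ is an assumption made for illustration.

```latex
\begin{align}
\tilde{h}_j &= \sum_{k \in C(j)} h_k \\
i_j &= \sigma\left(W^{(i)} x_j + U^{(i)} \tilde{h}_j + b^{(i)}\right) \\
f_{jk} &= \sigma\left(W^{(f)} x_j + U^{(f)} h_k + b^{(f)}\right) \\
o_j &= \sigma\left(W^{(o)} x_j + U^{(o)} \tilde{h}_j + b^{(o)}\right) \\
u_j &= \tanh\left(W^{(u)} x_j + U^{(u)} \tilde{h}_j + b^{(u)}\right) \\
c_j &= i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k \\
h_j &= o_j \odot \tanh(c_j) \\
\hat{y} &= \operatorname{softmax}\left(W^{(y)} h_{\mathrm{root}} + b^{(y)}\right)
\end{align}
```

Because the forget gate $f_{jk}$ is computed separately for each child, the cell can discard information from individual subtrees, while the summed state $\tilde{h}_j$ makes it independent of the number and order of children.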
In order to overcome the limitation imposed by the small amount of training data available for this task, we modify the general architecture proposed by Miwa and Bansal (2016) in two crucial ways. First, rather than using the representation of entities as input, we only consider the syntactic head of each entity. This approach improves the generalizability of the model, as it prevents overfitting on very specific entities in the corpus. For example, by reducing 'Bag-of-words methods' to 'methods' and 'segment order-sensitive models' to 'models', the model is able to recognize the COMPARE relation between these two entities (see Table 1). Second, we experimented with augmenting each term representation with the following features:
• Dependency labels (DEP): we append to each term embedding the label representing the dependency between the term and its parent.
• PoS tags (POS): the part-of-speech tag for each term is appended to its embedding.
• Entity length (ENTLEN): we concatenate the number of tokens in $e_1$ and $e_2$ to the embedding representations of the heads $H(e_1)$ and $H(e_2)$. For terms that are not entity heads, the entity length feature is replaced by '0'.
• Height: the height of each term in the syntactic subtree connecting two entities.
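The head reduction and feature augmentation described above can be sketched as follows. This is a minimal illustration and not the released implementation: the function names, the parent-index encoding of the dependency tree (with -1 marking the root), the toy sentence and its parse, the label vocabularies, and the four-dimensional placeholder embedding are all assumptions made for the example.

```python
from typing import List, Sequence


def entity_head(span: range, parents: Sequence[int]) -> int:
    """Index of the token in `span` whose dependency parent lies outside the span.

    `parents[i]` is the index of token i's parent, or -1 for the sentence root.
    """
    for i in span:
        if parents[i] not in span:
            return i
    raise ValueError("span contains no head token")


def node_heights(parents: Sequence[int]) -> List[int]:
    """Height of every node in a tree (leaves have height 0)."""
    children = {i: [] for i in range(len(parents))}
    for i, p in enumerate(parents):
        if p >= 0:
            children[p].append(i)
    memo = {}

    def height(i: int) -> int:
        if i not in memo:
            memo[i] = 1 + max((height(c) for c in children[i]), default=-1)
        return memo[i]

    return [height(i) for i in range(len(parents))]


def one_hot(label: str, vocab: Sequence[str]) -> List[float]:
    """One-hot encoding of `label` over a fixed vocabulary."""
    return [1.0 if label == v else 0.0 for v in vocab]


def term_features(embedding, dep, pos, ent_len, height, dep_vocab, pos_vocab):
    """Concatenate the word embedding with DEP, POS, ENTLEN and Height features."""
    return (list(embedding) + one_hot(dep, dep_vocab) + one_hot(pos, pos_vocab)
            + [float(ent_len), float(height)])


# Hypothetical toy parse of:
# "Bag-of-words methods outperform segment order-sensitive models"
parents = [1, 2, -1, 5, 5, 2]
e1, e2 = range(0, 2), range(3, 6)
h1, h2 = entity_head(e1, parents), entity_head(e2, parents)  # 'methods', 'models'

# Subtree connecting the two heads once each entity is reduced to its head.
sub_parents = [1, -1, 1]               # methods <- outperform -> models
sub_deps = ["nsubj", "ROOT", "dobj"]
sub_pos = ["NOUN", "VERB", "NOUN"]
sub_ent_len = [len(e1), 0, len(e2)]    # 0 for terms that are not entity heads
sub_heights = node_heights(sub_parents)

DEP_VOCAB = ["ROOT", "nsubj", "dobj"]  # assumed label inventories
POS_VOCAB = ["NOUN", "VERB"]
toy_embedding = [0.0, 0.0, 0.0, 0.0]   # placeholder for a real word vector

features = [
    term_features(toy_embedding, d, p, n, h, DEP_VOCAB, POS_VOCAB)
    for d, p, n, h in zip(sub_deps, sub_pos, sub_ent_len, sub_heights)
]
```

Each row of `features` is the input vector for one node of the reduced subtree. Only the two head rows carry a non-zero entity length.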

Experimental Setup
We evaluate our system on subtasks 1.1 and 1.2. In both of these subtasks, participants are given scientific abstracts with entities and candidate relation pairs, and are asked to determine the relation label of each pair. For subtask 1.1, both the entities and relations are manually annotated. For subtask 1.2, the entities are automatically generated using the procedure described in Gábor et al. (2016). This procedure introduces noise, but represents a more realistic evaluation environment than subtask 1.1. In both cases, relations and gold labels are produced by human annotators. All abstracts are from the ACL Anthology Reference Corpus (Bird et al., 2008). We randomly select 50 texts from the training datasets for validation of our system. We provide a summary of the datasets for training, validation, and testing in Table 2. Notice how the proportions of each relation label vary considerably among the datasets.
We experiment with two sets of word embeddings: Wiki News and arXiv. The Wiki News embeddings benefit from the large amount of general language, and the arXiv embeddings capture specialized domain language. The Wiki News embeddings are pretrained using fastText with a dimension of 300 (Mikolov et al., 2018). The arXiv embeddings are trained on a corpus of text from the cs section of arXiv.org using a window of 8 (to capture adequate term context) and a dimension of 100 (Cohan et al., 2018). A third variation of the embeddings simply concatenates the Wiki News and arXiv embeddings, yielding a dimension of 400; for words that appear in only one of the two embedding sources, the available embedding is concatenated with a vector of appropriate size sampled from $\mathcal{N}(0, 10^{-8})$.
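The concatenation scheme for the third embedding variant can be sketched as below. This is an illustration under stated assumptions: the function and variable names are hypothetical, embeddings are held in plain dictionaries, and $10^{-8}$ is read as a variance (standard deviation $10^{-4}$), which the text does not specify. The text also does not say what happens to words missing from both sources, so the sketch simply returns `None` for them.

```python
import random
from typing import Dict, List, Optional

WIKI_DIM, ARXIV_DIM = 300, 100   # dimensions stated in the text
NOISE_STD = 1e-4                 # assumption: 10^-8 is a variance


def noise_vector(dim: int, rng: random.Random) -> List[float]:
    """Small Gaussian vector standing in for a missing embedding."""
    return [rng.gauss(0.0, NOISE_STD) for _ in range(dim)]


def concat_embedding(
    word: str,
    wiki: Dict[str, List[float]],
    arxiv: Dict[str, List[float]],
    rng: random.Random,
) -> Optional[List[float]]:
    """Return the 400-d concatenation [Wiki News ; arXiv] for `word`."""
    w, a = wiki.get(word), arxiv.get(word)
    if w is None and a is None:
        return None  # handling of fully unseen words is not described in the text
    if w is None:
        w = noise_vector(WIKI_DIM, rng)
    if a is None:
        a = noise_vector(ARXIV_DIM, rng)
    return list(w) + list(a)


# Toy vocabularies (hypothetical values, real vectors would be loaded from disk).
rng = random.Random(0)
wiki = {"model": [0.1] * WIKI_DIM, "the": [0.2] * WIKI_DIM}
arxiv = {"model": [0.3] * ARXIV_DIM, "lstm": [0.4] * ARXIV_DIM}

both = concat_embedding("model", wiki, arxiv, rng)
only_wiki = concat_embedding("the", wiki, arxiv, rng)
only_arxiv = concat_embedding("lstm", wiki, arxiv, rng)
unseen = concat_embedding("zzz", wiki, arxiv, rng)
```

Filling the missing half with near-zero noise keeps every input at 400 dimensions while contributing almost nothing to the cell's pre-activations.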
For our official SemEval submission, we train our model using the concatenated embeddings and one-hot encoded dependency label features. We use a hidden layer of 200 nodes, a 0.2 dropout rate, and a training batch size of 16. Syntactic trees were extracted using spaCy, and the neural model was implemented using MXNet.
The official evaluation metric is the macro-averaged F1 score of all relation labels. For additional analysis, we use the macro precision and recall, and the F1 score for each relation label.
Our results indicate that our approach is generally more tolerant of the noisy entities given in subtask 1.2 than most other approaches. Figure 2 shows the confusion matrix for the official submission for subtask 1.1. The three most frequent labels in the training data (USAGE, MODEL-FEATURE, and PART_WHOLE) are also the most frequently confused relation labels. This behavior can be partially attributed to the class imbalance.
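The evaluation quantities discussed above (per-label precision, recall, and F1, their macro average, and confusion counts) can be sketched as follows. This is an illustration only, not the official scorer: it assumes that macro F1 is the unweighted mean of the per-label F1 scores, and the function names and toy labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def per_label_prf(
    gold: Sequence[str], pred: Sequence[str], labels: Sequence[str]
) -> Dict[str, Tuple[float, float, float]]:
    """Precision, recall and F1 for each relation label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # predicted label p, but it was wrong
            fn[g] += 1   # gold label g was missed
    scores = {}
    for label in labels:
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(gold: Sequence[str], pred: Sequence[str], labels: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 (assumed definition of macro F1)."""
    scores = per_label_prf(gold, pred, labels)
    return sum(f1 for _, _, f1 in scores.values()) / len(labels)


# Toy predictions (hypothetical, not taken from the paper's results).
labels = ["USAGE", "COMPARE", "TOPIC"]
gold = ["USAGE", "USAGE", "COMPARE", "TOPIC"]
pred = ["USAGE", "COMPARE", "COMPARE", "TOPIC"]

scores = per_label_prf(gold, pred, labels)
overall = macro_f1(gold, pred, labels)
confusion = Counter(zip(gold, pred))   # (gold, predicted) -> count
```

Because every label contributes equally to the macro average, errors on rare labels weigh as heavily as errors on frequent ones, which matters given the class imbalance noted above.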

Overall
In Table 4, we examine the effects of various feature combinations on the model. Specifically, we report the macro-averaged precision, recall, and F1 scores for both subtasks 1.1 and 1.2 with various sets of features on the test set. Of the combinations we investigated, including the dependency labels, part-of-speech tags, and the token length of entities yielded the best results in terms of overall F1 score for both subtasks. The results by individual relation label are more mixed, with the overall best combination simply yielding better performance on average, not on each label individually. Interestingly, the entity height feature reduces performance, perhaps indicating that it is easy to overfit the model using this feature.
Table 5 examines the effect of the choice of word embeddings on performance. In both subtasks, concatenating the Wiki News and arXiv embeddings yields better performance than using a single type of embedding. This suggests that the two types of embeddings are useful in different cases; perhaps Wiki News better captures the general language linking the entities, whereas the arXiv embeddings capture the specialized language of the entities themselves.

Conclusion
In this work, we investigated the use of a tree-LSTM-based approach for relation classification in scientific literature. Our results at SemEval 2018 were encouraging, placing 9th (of 28) at subtask 1.1 (relation classification with manually annotated entities), and 5th (of 20) at subtask 1.2 (relation classification using automatically generated entities). Furthermore, we conducted an analysis of our system by varying the system parameters and features.

Table 1:
Example relations for each type. Entities are underlined, and all relations are from the first entity to the second entity (non-reversed).

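Table 2:
Frequency of relation labels in train, validation, and test sets. See Table 1 for relation label abbreviations. Subtask 1.1 uses manual entity labels, and subtask 1.2 uses automatic entity labels (which may be noisy). The six relation types are USAGE, MODEL-FEATURE, PART_WHOLE, COMPARE, RESULT, and TOPIC. Examples of each type of relation are given in Table 1.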

Table 4:
Feature ablation results for subtasks 1.1 and 1.2. DEP are dependency labels, POS are part-of-speech labels, EntLen is the length of the input entities, and Height is the height of the entities in the dependency tree. In both subtasks 1.1 and 1.2, the combination of dependency labels, parts of speech, and entity lengths yields the best performance in terms of overall F1 score.

Table 5:
Performance comparison for subtasks 1.1 and 1.2 when using Wiki News and arXiv embeddings. The concatenated embeddings outperform either embedding set used on its own.