SciREL at SemEval-2018 Task 7: A System for Semantic Relation Extraction and Classification

This paper describes our system, SciREL (Scientific abstract RELation extraction system), developed for SemEval 2018 Task 7: Semantic Relation Extraction and Classification in Scientific Papers. We present a feature-vector based system to extract explicit semantic relations and classify them. Our system is trained on the ACL corpus (Bird et al., 2008), which contains annotated abstracts provided by the task organizers. When an abstract with annotated entities is given as input to our system, it extracts the semantic relations through a set of defined features and classifies them into one of the six given relation categories through feature engineering and a learned model. For the best combination of features, our system SciREL obtained an F-measure of 20.03 on the official test corpus, which includes 150 abstracts, in the relation classification Subtask 1.1. In this paper, we provide an in-depth error analysis of our results to prevent duplication of research efforts in the development of future systems.


Introduction
Automatic detection and extraction of semantic relations among entities from unstructured text has received growing attention in recent years (Konstantinova, 2014; Augenstein et al., 2017; Fundel et al., 2006; Luo et al., 2016). Text mining is the process of automatically extracting knowledge from unstructured text documents; the idea is to link the extracted pieces of information together, possibly yielding new facts or hypotheses to be explored further through conventional scientific experimentation (Delen and Crossland, 2008; Fleuren and Alkema, 2015).
SemEval 2018 Task 7 (Gábor et al., 2018) aims to extract and classify semantic relations to improve access to scientific literature. The task focuses on identifying pairs of entities that are instances of one of six semantic relation types and classifying those instances accordingly. To address this challenge, we implemented a supervised machine learning approach to extract explicit semantic relations from the ACL Anthology corpus (Bird et al., 2008) for Subtask 1.1.

Methodology
In this section, we describe our relation extraction system (SciREL), which classifies semantic relations into one of the six given relation categories. The main steps of our approach can be summarized as follows. First, an abstract with annotated entities is given as input to our system; all sentences in the abstract are segmented and preprocessed, and the entity pairs are identified. Second, a set of features is defined and combined into a feature vector, which is used to train a machine learning model. This is the most crucial part of our system: the idea is to decrease the size of the effective vocabulary, which in turn increases classification accuracy by eliminating noise in the features (GuoDong et al., 2005). Relations between entities are then extracted and classified into one of the six relation types with this learned model. Each step of our approach is discussed in detail in the following subsections.

Preprocessing Steps
All sentences in the abstracts are preprocessed to normalize the text so that the input is guaranteed to be consistent and feature extraction/classification is simplified. Existing NLP techniques and tools are used for preprocessing, which is performed as follows: 1) tokenization; 2) conversion of text to lower case; 3) removal of special characters; and 4) lemmatization.
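The four steps above can be sketched as follows. This is a minimal stdlib-only illustration; the actual system uses dedicated NLP tools, and the suffix-stripping lemmatizer here is a toy stand-in for a real dictionary-based one:

```python
import re

def lemmatize(token):
    # Toy lemmatizer: strips a few common inflectional suffixes.
    # A real system would use a dictionary-based lemmatizer instead.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(sentence):
    # 1) tokenization and 2) lowercasing
    tokens = sentence.lower().split()
    # 3) removal of special characters
    tokens = [re.sub(r"[^a-z0-9]", "", t) for t in tokens]
    tokens = [t for t in tokens if t]
    # 4) lemmatization
    return [lemmatize(t) for t in tokens]
```

For example, `preprocess("Trained models!")` yields `["train", "model"]`, with punctuation stripped and inflection removed.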

Feature selection
The most challenging part of our system is feature selection and feature vector generation (Sammons et al., 2016). After preprocessing the input text, a subset of words containing the respective entity pair is selected from each sentence, a set of features is computed, and a feature vector is created by combining the computed features.
After the initial text processing, a separate set of steps is followed in which each feature is computed. Some features are extracted in two different scenarios: before removing stop words and after removing stop words. Stop words are the most common words of a language; they occur with high frequency but contribute little to the semantics of a document. Filtering out such words prevents returning vast amounts of unnecessary information.
A bigram is a sequence of two adjacent words, and the bigram frequency of word pairs between entities is calculated in some features. Collocations are words that appear successively, and the frequencies of such words appearing in the context of other words are calculated in some features; the highest value of the bigram collocations is considered during feature selection. The bag-of-words model, which represents a text as the bag of its words, ignoring grammar and word order, is used in some features to group the words of a sentence for further processing (Peng et al., 2016).
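The bigram-frequency computation, in the two before/after stop-word scenarios described above, can be sketched as follows; the stop-word list is a small illustrative stand-in for a full one:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "is", "to"}  # illustrative subset

def bigram_counts(tokens, remove_stop_words=False):
    # Optionally filter stop words before forming bigrams, mirroring
    # the before/after-removal scenarios used for some features.
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    # Count each pair of adjacent tokens.
    return Counter(zip(tokens, tokens[1:]))

tokens = "the parser is applied to the corpus".split()
counts = bigram_counts(tokens, remove_stop_words=True)
# The highest-frequency bigram can then serve as a feature value.
best_bigram, best_count = counts.most_common(1)[0]
```

With stop words removed, the token sequence collapses to `parser applied corpus`, so the surviving bigrams are `(parser, applied)` and `(applied, corpus)`.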
Part-of-speech tagging (POS tagging) is applied to the words in some features, assigning a part of speech to each word (Fundel et al., 2006). This helps in disambiguating homonyms and improves the efficiency of feature selection. Term frequency-inverse document frequency (TF-IDF) values are calculated for a set of selected words in some features to distinguish important words: words that are frequent within a document but rare across the document collection receive high scores (GuoDong et al., 2005). During feature selection, a representative set of features is computed for each entity pair. The features used in building our system are listed below; E1 refers to the first entity and E2 refers to the second entity.
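The TF-IDF weighting can be sketched as follows (a stdlib-only illustration; the system may compute it with an existing toolkit, and the small corpus here is a toy example):

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    # Term frequency: relative frequency of the term in this document.
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # Inverse document frequency: penalizes terms common across documents.
    n_docs_with_term = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / (1 + n_docs_with_term))
    return tf * idf

corpus = [
    "we parse the corpus".split(),
    "the corpus is annotated".split(),
    "we train a parser".split(),
]
score = tf_idf("parser", corpus[2], corpus)
```

Here "parser" occurs in only one of the three documents and so receives a positive score, while "the", which occurs in most documents, scores at or below zero.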

Multi-class classification
In the final step of our approach, a feature vector is generated for each sentence by incorporating the features extracted in the previous step. The generated feature vector is then used to train a classifier which classifies the relation into one of the six given categories. The following classifiers, which represent three main classification algorithms, are used to train and evaluate the data set in our approach: Decision Trees, Naive Bayes, and Support Vector Machines (SVMs). The resulting model is then used to classify the extracted semantic relations into one of the six categories below: Usage, Result, Model-feature, Part-whole, Topic, Compare.
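As an illustration of the classification step, here is a minimal multinomial Naive Bayes over integer feature vectors; the actual system relies on standard implementations of the three algorithms, and the vectors and labels below are toy examples:

```python
import math

class NaiveBayes:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, vectors, labels):
        self.classes = sorted(set(labels))
        self.priors = {}
        self.likelihoods = {}
        for c in self.classes:
            rows = [v for v, y in zip(vectors, labels) if y == c]
            self.priors[c] = len(rows) / len(vectors)
            # Per-feature counts summed over the class, with add-one smoothing.
            totals = [sum(col) + 1 for col in zip(*rows)]
            norm = sum(totals)
            self.likelihoods[c] = [t / norm for t in totals]
        return self

    def predict(self, vector):
        # Pick the class maximizing log prior plus weighted log likelihoods.
        def log_posterior(c):
            lp = math.log(self.priors[c])
            for x, p in zip(vector, self.likelihoods[c]):
                lp += x * math.log(p)
            return lp
        return max(self.classes, key=log_posterior)

# Toy integer feature vectors (e.g. counts derived from bigram, POS, and
# TF-IDF features) with hypothetical relation labels.
X = [[3, 0, 1], [4, 1, 0], [0, 3, 2], [1, 4, 3]]
y = ["USAGE", "USAGE", "RESULT", "RESULT"]
model = NaiveBayes().fit(X, y)
```

A new vector dominated by the first feature, such as `[5, 0, 1]`, is assigned to USAGE, while one dominated by the second is assigned to RESULT.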

Dataset
We evaluated our system on the dataset provided by SemEval 2018 Task 7. The dataset contains abstracts from the ACL Anthology Corpus (Bird et al., 2008) with pre-annotated entities that represent concepts. The dataset provided for the evaluation is divided into two subsets: a training set and a test set. The training set includes 350 abstracts containing 5259 entities and 1228 annotated relations between entities. The test set includes 150 abstracts containing 2246 entities and 355 annotated relations between entities. During development, the training set was split 60/40 and k-fold cross-validation was used to evaluate performance.

Results
Our system was evaluated on both the development corpus and the official test corpus; the set of features was extracted for each entity pair from the training corpus and used to compute the feature vectors. The feature set of our model included 37 features in total, yielding 2^37 possible combinations of features. We conducted an ablation study to determine the efficacy of different combinations of features when run with different classifiers and selected the feature combination that resulted in the highest performance. Consequently, it was found that the following features produce the best performance:
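The search over feature combinations can be sketched as follows; `evaluate` is a hypothetical scoring function standing in for training a classifier on a feature subset and scoring it on held-out data:

```python
from itertools import combinations

def best_feature_subset(feature_names, evaluate, max_size=None):
    # Enumerate subsets of the feature set and keep the best-scoring one.
    # With 37 features this space has 2^37 subsets, so in practice the
    # search is restricted (e.g. by subset size) or done greedily.
    best_score, best_subset = float("-inf"), ()
    max_size = max_size or len(feature_names)
    for k in range(1, max_size + 1):
        for subset in combinations(feature_names, k):
            score = evaluate(subset)
            if score > best_score:
                best_score, best_subset = score, subset
    return best_subset, best_score

# Toy scorer: rewards subsets containing "bigram" and "tfidf", with a
# small penalty per extra feature (a stand-in for a real evaluation).
toy_eval = lambda s: ("bigram" in s) + ("tfidf" in s) - 0.01 * len(s)
subset, score = best_feature_subset(["bigram", "pos", "tfidf", "bow"], toy_eval)
```

With this toy scorer the search settles on `("bigram", "tfidf")`, since adding any further feature only incurs the per-feature penalty.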

Error analysis
The performance of our system is quite low; therefore, we performed an error analysis to identify some of the mistakes in our system output and find ways to improve it. Our classification model was trained to distinguish between six semantic relations, and the confusion matrix displays the results of testing the model for further inspection. Table 2 shows the confusion matrix based on the performance of our classification model on the test corpus. We identified three main areas which affected the performance of our system: 1) feature selection; 2) vector representation; and 3) class imbalance.
Feature Selection. We compared the effects of different features and, from this analysis, found several reasons for their poor performance. First, for the lexical information, we incorporate only the word prior to each of the entities and a single bigram that exists between them. This misses information such as whether there is only a single word between the entities, and in the case where there are more than two words, we miss additional contextual information describing the relationship. Second, the syntactic information does not contain an explicit representation of what was seen between the two entities: we focused on the number of unique POS tags rather than which tags were actually present. In conclusion, we believe that our feature set does not contain enough contextual information from between the two entities.
Vector representation. Another major reason for the poor performance of our system is the way the feature vectors represent the relationship. We generated a feature vector for each entity pair over all the proposed features, which initially resulted in a feature vector with only 37 features. We then selected the set of features that gave the best performance with the model and eliminated the rest, which reduced the size of the feature vector further; we ended up with a feature vector containing only 7 features. Each feature was represented numerically; therefore, if there was more than one bigram or POS tag sequence between the entities, we were not able to incorporate it into our representation. In addition, analysis of the test instances shows that for 100 of the 355 instances, we have no contextual or syntactic information at all, due to the stop word removal for three of the features. In conclusion, we believe that this feature vector representation is too compact and does not hold sufficient contextual information to identify patterns between the relationships.
Class Imbalance. The confusion matrix in Table 2 and the per-category results in Table 3 show that USAGE, the majority class, achieves high performance compared to the other categories. In conclusion, most of the misclassified instances were assigned to the USAGE category, indicating that the machine learning algorithm was unable to identify discriminating features between the classes and defaulted to the majority class.

Conclusions
Our goal is to design a system that identifies pairs of entities that are instances of any of the given semantic relations. Our system (SciREL) is built to serve this purpose: when an input with annotated entities is fed into the model, it identifies, extracts, and classifies the semantic relations. The model selects the set of features that shows the best performance with the classifier and combines them to compute a feature vector. The classifier then classifies the instances into one of the six semantic relation types. Our system classifies the given ACL Anthology corpus with an F-measure of 20.03 on the official test corpus with the SVM classifier. Because of the low results, we provide an in-depth error analysis of our results to prevent duplication of research efforts in the development of future systems. We identified three main areas which affected the performance of our system: 1) feature selection; 2) vector representation; and 3) class imbalance. In conclusion, we believe that our feature set does not contain enough contextual information from between the two entities and that the feature vector representation is too compact to hold sufficient contextual information to discriminate between the classes.