NTNU at SemEval-2018 Task 7: Classifier Ensembling for Semantic Relation Identification and Classification in Scientific Papers

This paper presents NTNU's contribution to SemEval-2018 Task 7 on relation identification and classification. The class weights and parameters of five alternative supervised classifiers were optimized through grid search and cross-validation, and the outputs of the classifiers were combined through voting for the final prediction. A wide variety of features was explored, with the most informative identified by feature selection. The best setting achieved F1 scores of 47.4% and 66.0% in the relation classification subtasks 1.1 and 1.2. For relation identification and classification in subtask 2, it achieved F1 scores of 33.9% and 17.0%, respectively.


Introduction
Scientific papers are valuable knowledge sources, providing authentic insights into particular aspects of their research domains. With the advancement of scientific research, a massive growth in published articles has been observed. According to the American Journal Experts (AJE) scholarly publishing report, approximately 2.2 million articles were added to the literature in 2016 alone. The sheer volume of the ever-increasing literature of any scientific discipline makes it hard for humans to quickly process it and identify information of interest. There is therefore a need for efficient automatic means of accessing this reliable but unstructured knowledge repository.
Semantic relation extraction is one of the main information extraction tasks, and aims to identify pairs of arguments connected by certain predefined relation types based on a target application. The relation arguments are of different types, such as named entities (Freitas et al., 2009), nominals (Hendrickx et al., 2009), general keyphrases (Gábor et al., 2016; Augenstein et al., 2017), quantitative variables (Marsi et al., 2014), or events, and are syntactically represented by noun phrases, clauses or larger complex structures. A semantic relation may be either symmetric (undirected) or asymmetric (hierarchical).
Supervised machine learning approaches have been successfully used for identifying semantic relations encoded in texts. Broadly, three types of supervised approaches to relation extraction have been investigated: feature-based (Kambhatla, 2004; Jiang and Zhai, 2007), kernel-based (Zelenko et al., 2003), and neural network-based (Zeng et al., 2014; Miwa and Bansal, 2016).
In this work, the various relation identification and classification subtasks of SemEval 2018 Task 7 (Gábor et al., 2018) were addressed using feature-based approaches. A wide variety of features was explored, including lexical (e.g., bag-of-words, lemmata, n-grams), syntactic (e.g., part-of-speech, parsing information), semantic (e.g., dependency information, WordNet (Miller, 1995)), and other binary indicators. A χ2-based feature selection technique was used to identify informative features. The class weights and parameters of five different classifiers, Support Vector Machines (SVM), Decision Trees (DT), Random Forests (RF), Multinomial Naïve Bayes (MNB), and k-Nearest Neighbors (kNN), were optimized for each subtask through grid search and k-fold cross-validation. These classifiers were chosen as they are effective at identifying and classifying semantic relations in feature-based classification scenarios. The trained classifiers were ensembled using majority class labels (hard voting) for the final predictions. All classifier, feature selection and classifier ensembling modules were implemented with the scikit-learn (Pedregosa et al., 2011) machine learning library. The tasks and the datasets are described in Section 2, while Section 3 outlines the experimental setup, system architecture and parameter optimisation. Section 4 discusses the results of the final evaluation of SemEval 2018 Task 7, where the system achieved 47.4% and 66.0% F1 scores on the relation classification subtasks 1.1 and 1.2. In subtask 2, the system reached 33.9% and 17.0% F1 scores for relation identification and relation classification, respectively. These results are elaborated on in Section 5, before Section 6 concludes and points to future research. The task covers six relation types, of which USAGE, RESULT, MODEL-FEATURE, PART_WHOLE, and TOPIC are asymmetric, while COMPARE is the only symmetric relation. All relations are intra-sentential and there are no referring expressions.

Task and Dataset Description
The training dataset consisted of two subsets: D1, 350 abstracts of scientific papers that have been manually annotated with entity mentions and relation labels (clean data), and D2, 350 abstracts with entity mentions automatically labelled but relations labelled manually (noisy data). Subtasks 1.1 and 2 are associated with the clean dataset D1, while subtask 1.2 is associated with the noisy dataset D2. The test data consisted of 150 abstracts each for subtasks 1.1, 1.2 and 2. Table 1 shows the distribution of relation instances over the different relation types, and their forward (Fwd) and reverse (Rev) directionalities, in datasets D1 and D2. The highest number of instances is of the USAGE type in both datasets, whereas TOPIC is the least frequent relation type (1.46%) in D1, but significant (19.47%) in D2. The overall proportions of forward-directed relations are 68% in D1 and 73.63% in D2, and the directionalities of the individual relation types are similar.
The most frequent entity mention lengths are two and one word(s) in D1 and D2, respectively, with maximum lengths of 13 and 4 words.
The most frequent context length of the relation instances is two words, with maximum lengths of 31 words (RESULT(I05-3022.6, I05-3022.16)) in D1 and 24 words (USAGE(E91-1004.30, E91-1004.37)) in D2. The average numbers of entities per sentence are 3 and 6 in D1 and D2, with the highest numbers of entities being 17 and 29, respectively.
Experimental Setup

Figure 1 shows the processing pipeline common to both relation identification and classification. The processing steps are elaborated on below.
Inputs to brat annotation: The input training and test files are in XML format with the entity mentions marked.
Each entity mention has an ID with two parts, abstract ID and entity number.
For example, the entity ID H91-1045.18 denotes abstract ID H91-1045 and entity number 18. The relation labels are in a separate file with the format TOPIC(A92-1023.7,A92-1023.8,REVERSE), where the first two arguments are entity IDs and the last is the directionality of the relation. The XML and relation label files were converted into the 'brat' (Stenetorp et al., 2012) format, with the text content of each abstract kept in a text file, and the entity and relation information kept in an annotation file. Conversion to brat format helps to visualize and study the annotations of the training set and of the test set output. Also, the text content (without entity tags) is used for preprocessing.
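The ID and label conventions above can be sketched as a small parsing helper. The function names and the regular expression below are illustrative, not taken from the actual conversion code:

```python
import re

# Hypothetical helper: parse a relation label line such as
# "TOPIC(A92-1023.7,A92-1023.8,REVERSE)" into its components.
# Relation type names may contain "_" or "-" (e.g., PART_WHOLE, MODEL-FEATURE).
REL_RE = re.compile(r"^([\w-]+)\(([^,]+),([^,)]+)(?:,(REVERSE))?\)$")

def parse_relation(line):
    m = REL_RE.match(line.strip())
    if not m:
        raise ValueError("unrecognised relation line: " + line)
    rel_type, arg1, arg2, direction = m.groups()
    return {
        "type": rel_type,
        "arg1": arg1,                     # entity ID of the first argument
        "arg2": arg2,                     # entity ID of the second argument
        "reverse": direction == "REVERSE",
    }

def split_entity_id(entity_id):
    # "H91-1045.18" -> abstract ID "H91-1045", entity number 18
    abstract_id, _, number = entity_id.rpartition(".")
    return abstract_id, int(number)
```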
Text Processing: The text content of each abstract is analyzed with the Stanford CoreNLP toolkit (Manning et al., 2014) for sentence boundary detection, tokenization, lemmatization, part-of-speech (POS) tagging, and constituent and dependency parsing. Character offset-based brat entity annotations are mapped to word-level indices using the tokens' character offsets. Finally, the dependency heads of the entity mentions, of the in-between context, and of the text window representing the relation expression are identified.
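The mapping from character offsets to word-level indices can be sketched as follows, assuming each token is represented by its (start, end) character offsets as produced by the tokenizer; this is a simplified illustration, not the system's actual code:

```python
def char_span_to_token_span(token_offsets, start, end):
    """Map a character-offset span to (first, last) token indices.

    `token_offsets` is a list of (token_start, token_end) character
    offsets. Returns the indices of the first and last tokens that
    overlap the half-open character span [start, end).
    """
    covered = [i for i, (ts, te) in enumerate(token_offsets)
               if ts < end and te > start]
    if not covered:
        raise ValueError("span does not overlap any token")
    return covered[0], covered[-1]
```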
Feature Extraction: Given a sentence with more than one entity mention, all possible entity pairs are considered in left-to-right order. For each entity pair, the text span containing the two entities and their middle context is taken as the representation of the relation instance. As word features, unigrams and bigrams of the context and entity mentions (excluding articles, adjectives, cardinals, ordinals, pronouns, brackets and punctuation) are considered. Corresponding to the word features, POS, word+POS, and lemma+POS combinations are included, as well as the words and POS of entity dependency heads, context dependency heads, and their combinations.
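The word-feature construction can be sketched as below. The exact set of excluded POS tags is an assumption based on the word categories named above, not the system's actual filter list:

```python
# Assumed Penn Treebank tags for the excluded categories: articles/determiners,
# adjectives, cardinals, pronouns, brackets and punctuation.
EXCLUDED_POS = {
    "DT", "JJ", "JJR", "JJS", "CD", "PRP", "PRP$",
    "-LRB-", "-RRB-", ".", ",", ":",
}

def word_features(tagged):
    """Unigram and bigram features from a list of (word, POS) pairs,
    skipping the excluded word categories."""
    words = [w.lower() for w, pos in tagged if pos not in EXCLUDED_POS]
    feats = ["uni=" + w for w in words]
    feats += ["bi=%s_%s" % (a, b) for a, b in zip(words, words[1:])]
    return feats
```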
As the shortest dependency path between an entity pair carries major information for relation identification (Bunescu and Mooney, 2005), dependency path features are added: the distance from the left entity head to the right entity head, the words on the dependency path, and their relations to the parent node. WordNet synonyms and hyponyms of the dependency heads of entities and contexts are included, as are other binary indicators such as adjacent or overlapping entities.
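The shortest dependency path between two tokens in a (single-rooted) dependency tree can be found by walking head pointers to the lowest common ancestor; a minimal sketch, assuming `heads[i]` gives the head index of token i with -1 for the root:

```python
def shortest_dependency_path(heads, a, b):
    """Token indices on the dependency path from token a to token b.

    `heads[i]` is the head index of token i; the root token has head -1.
    Collect a's ancestor chain, then climb from b until the lowest
    common ancestor is reached.
    """
    ancestors = []
    node = a
    while node != -1:
        ancestors.append(node)
        node = heads[node]
    path_b = []
    node = b
    while node not in ancestors:
        path_b.append(node)
        node = heads[node]
    lca = node
    return ancestors[:ancestors.index(lca) + 1] + path_b[::-1]
```

The path length (number of edges) then gives the distance feature, and the tokens on the path supply the word and dependency-relation features.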
Parameter Optimization through Cross-Validation (CV): As no development data was available for model parameter tuning, 20% of the training data was held out as development data, and the remaining training data was used for parameter optimization with 5-fold cross-validation. For relation labeling in subtasks 1.1 and 1.2, the relation type is predicted over 11 classes (the five asymmetric relations in two directions each, plus the one symmetric relation). Relation instance identification in subtask 2 is a binary classification problem, and the class weights of the positive instances are optimized through CV. In the final system, the parameters are optimized on the entire training set.
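In scikit-learn, this grid search over parameters and class weights looks roughly as follows; the parameter values in the grid are placeholders, as the paper does not list the actual ranges tuned:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative grid for one of the five classifiers (SVM). The
# class_weight entries cover the positive-class weighting tuned for
# the binary identification task in subtask 2.
grid = GridSearchCV(
    SVC(),
    param_grid={
        "C": [0.1, 1, 10],
        "kernel": ["linear", "rbf"],
        "class_weight": [None, "balanced", {1: 2}, {1: 5}],
    },
    cv=5,                 # 5-fold cross-validation, as in the paper
    scoring="f1_micro",   # micro-averaged F1, matching the evaluation
)
# grid.fit(X_train, y_train); grid.best_params_ then configures the
# classifier that enters the voting ensemble.
```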

Classifiers Ensembling and Final Prediction:
The optimized parameters and class weights are applied to each classifier. For each classifier, the χ2-based SelectKBest() method selects the top k features from the input feature space, where k is determined for each classifier through cross-validation. The predictions of the classifiers are then ensembled with (majority) voting, where each participating classifier uses its own feature selection.
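Per-classifier feature selection followed by hard voting can be expressed in scikit-learn by wrapping each classifier in a Pipeline; the k values below are placeholders, since the paper determines k per classifier through cross-validation:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB

def with_selection(clf, k):
    # Each classifier gets its own chi2-based top-k feature selection.
    return Pipeline([("select", SelectKBest(chi2, k=k)), ("clf", clf)])

# Three of the five classifiers, with placeholder k values.
ensemble = VotingClassifier(
    estimators=[
        ("svm", with_selection(SVC(), k=5000)),
        ("dt",  with_selection(DecisionTreeClassifier(), k=1000)),
        ("mnb", with_selection(MultinomialNB(), k=500)),
    ],
    voting="hard",  # majority vote over predicted class labels
)
```

Note that chi2 requires non-negative feature values, which holds for the count-based n-gram features used here.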

Results
Three separate submissions were made on the test data. The first two submissions were on relation classification on clean (subtask 1.1) and noisy data (subtask 1.2). The third submission (subtask 2) consisted of relation identification followed by classification on clean data. In subtask 2, a separate system was created for relation identification, while the relation classification system of subtask 1.1 was used for the classification. Table 2 shows the performance (precision, recall and F1 score) of the individual classifiers, as well as of their combinations, in the relation classification subtask 1.1, where the scores are micro-averaged over all (11) classes. Among the individual classifiers, SVM gives the best result (56% F1 score). Voting with the top-3 classifiers (SVM, DT and MNB) gave a slightly higher F1 score of 58%. Table 3 shows the scores of the relation classification subtask on noisy training data (subtask 1.2). As an individual classifier, SVM gave the best performance with a 69% F1 score, followed by Decision Trees (65%) and Multinomial Naïve Bayes (62%). The best voting ensemble scored 73%, using the classifiers SVM, RF and MNB. Table 4 shows the results of relation identification in subtask 2; again, SVM gave the best single-classifier performance.

Discussion
The total numbers of relation instances in the clean and noisy data are almost the same (1228 and 1248, respectively). It is therefore interesting to observe that the best performance in relation classification, both at the single-classifier level and in ensemble voting, is significantly higher on noisy data (subtask 1.2) than on clean data (subtask 1.1). This behaviour is also consistent on the test data.
One explanation may be the differences in relation expressions between datasets D1 and D2. In the clean data (D1), 25.66% of the entity mentions have three or more words, with a maximum length of 13 words, whereas in the noisy data (D2) only 0.96% of the mentions have more than three words. The feature-based approach, with n-grams as the major feature source, might not be able to capture the semantics of entity mentions with very large text spans. Furthermore, the context length between entity pairs is larger in the clean data than in the noisy data. Therefore, the shortest dependency paths and context n-grams, which are the two major feature sources, generate many insignificant features. Modeling the relation instances through a neural network could be a better alternative in this scenario. Feature selection has a positive impact on prediction both in relation identification and in classification. SVM gave the best results at the single-classifier level on all subtasks, but needs a larger feature space, whereas MNB performed reasonably well while needing the smallest number of features for training.

Table 4: Performance of the positive class in relation identification on clean data (subtask 2). Ensemble-best is SVM+DT+MNB.

Conclusion
In this work, we experimented with the relation identification and classification subtasks of SemEval 2018 Task 7 using a feature-based approach. A wide variety of features was explored, including lexical, syntactic, semantic, and other binary features. Two relation classification systems were developed, on clean and noisy data, and a third system was developed to identify relations in clean data. Five classifiers were trained for each subtask, with the final predictions made through voting over the corresponding predictions of the individual classifiers. The experimental results show that the lengths of the entity mentions and of the context between a pair of entities have a significant impact on relation identification and classification.