NTNU-2 at SemEval-2017 Task 10: Identifying Synonym and Hyponym Relations among Keyphrases in Scientific Documents

This paper presents our relation extraction system for subtask C of SemEval-2017 Task 10: ScienceIE. Assuming that the keyphrases are already annotated in the input data, our work explores a wide range of linguistic features, applies various feature selection techniques, optimizes the hyperparameters and class weights, and experiments with different problem formulations (a single classification model vs individual classifiers for each keyphrase type, and a single-step classifier vs a pipeline classifier for hyponym relations). The performance of five popular classification algorithms is evaluated for each problem formulation along with feature selection. The best setting achieved an F1 score of 71.0% for synonym and 30.0% for hyponym relations on the test data.


Problem Description
Task C of ScienceIE at SemEval-2017 (Augenstein et al., 2017) concerns identifying sentence-level 'SYNONYM-OF' (or 'same-as') and 'HYPONYM-OF' ('is-a') relations among three types of keyphrases: PROCESS (PR), TASK (TA) and MATERIAL (MA) in scientific documents. The 'SYNONYM-OF' relation is symmetric, whereas the 'HYPONYM-OF' relation is directed. Hyponym relation prediction is thus associated with two ordered subtasks: (1) predicting relations between pairs of keyphrases; (2) predicting the direction of the relation. It is assumed that there are no relations between keyphrases of different types. Automatic identification of synonym/hyponym relations is useful for many NLP applications, e.g. knowledge base completion and ontology construction.

Challenges
The relation prediction task of ScienceIE is challenging and quite different from other semantic relation prediction tasks such as SemEval-2010 Task 8 (Hendrickx et al., 2009). In SemEval-2010 Task 8, there are two marked nominals in a sentence and the task is to predict whether any of nine semantic relations holds between the nominal pair. Although that task has more relations than ScienceIE (9 vs 2), ScienceIE poses different challenges. Instead of single-word nominals, the keyphrases of ScienceIE are arbitrarily large text spans referring to larger syntactico-semantic units. The top part of Table 1 shows the percentage of keyphrases longer than 10 tokens in the training (10.89%), development (8.76%) and test (6.71%) data. The difficulty with such large text spans is identifying the features that best represent the keyphrase and contribute most to the relation prediction task.
Another challenge of ScienceIE is the occurrence of multiple keyphrases in one sentence, producing a large number of possible relations among keyphrase pairs, i.e., n(n − 1)/2 for n keyphrases. As most of these are negative instances, the positive and negative classes are imbalanced.
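The growth in candidate pairs can be illustrated with a minimal sketch. It assumes that the keyphrases of one sentence are given as (identifier, type) tuples; the function name candidate_pairs, the optional same-type filter and the toy identifiers are hypothetical and only serve the illustration.

```python
from itertools import combinations

def candidate_pairs(keyphrases, same_type_only=False):
    """Enumerate unordered keyphrase pairs within one sentence.

    keyphrases: list of (identifier, type) tuples (assumed format).
    same_type_only: optionally keep only pairs whose types match.
    """
    pairs = list(combinations(keyphrases, 2))
    if same_type_only:
        pairs = [(a, b) for a, b in pairs if a[1] == b[1]]
    return pairs

# Toy sentence with n = 5 keyphrases (identifiers are made up).
sentence = [("T1", "PROCESS"), ("T2", "MATERIAL"), ("T3", "PROCESS"),
            ("T4", "TASK"), ("T5", "MATERIAL")]

n = len(sentence)
all_pairs = candidate_pairs(sentence)
typed_pairs = candidate_pairs(sentence, same_type_only=True)

assert len(all_pairs) == n * (n - 1) // 2
print(len(all_pairs), len(typed_pairs))  # 10 2
```

Even in this small example only a handful of the ten candidates could carry a relation, which is the source of the class imbalance noted above.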
A third challenge is the potentially long distance between keyphrase pairs. The middle part of Table 1 shows that 49.2%, 57.68% and 43.77% of the keyphrase pairs in the training, development and test sets, respectively, are separated by more than 19 tokens. In addition, a number of other keyphrases can occur in between a pair of related keyphrases, as shown in Table 1.
Finally, the number of synonym and hyponym relations in the training and development datasets is limited. The bottom part of Table 2 shows the frequencies of relations in the training and development datasets (ignoring inter-sentence keyphrase relations).

Approach
Inspired by the best systems at SemEval-2010 Task 8 (Rink and Harabagiu, 2010), we developed our relation extraction system in a supervised learning framework with the dependency structure of the input sentence as the major resource. The main intuition follows Bunescu and Mooney (2005), who showed that the shortest path between two entities in a dependency graph contains most of the information needed to identify the relation between them.
We previously found this intuition to be effective in causal relation extraction (Barik et al., 2017). We tried two alternative approaches.
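The shortest-path intuition can be sketched as a breadth-first search over an undirected view of the dependency graph. This is an illustrative sketch, not our implementation: the function name shortest_dependency_path is hypothetical, the edges are written by hand rather than produced by a parser, and tokens are used directly as nodes (a real system would use token indices to handle repeated words).

```python
from collections import deque

def shortest_dependency_path(edges, source, target):
    """Breadth-first search over an undirected dependency graph.

    edges: iterable of (head, dependent) pairs.
    Returns the nodes on a shortest path, or None if unreachable.
    """
    graph = {}
    for head, dep in edges:
        graph.setdefault(head, set()).add(dep)
        graph.setdefault(dep, set()).add(head)
    if source not in graph or target not in graph:
        return None
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the predecessor links.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(graph[node]):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None

# Hand-written edges for "Classifiers such as SVM are used" (assumed parse).
edges = [("used", "Classifiers"), ("used", "are"),
         ("Classifiers", "SVM"), ("SVM", "such")]

print(shortest_dependency_path(edges, "SVM", "are"))
# ['SVM', 'Classifiers', 'used', 'are']
```

The tokens and dependency labels along such a path are the kind of information that path-based features are built from.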

Approach-1: Individual vs Single Classifier
As relations only occur between keyphrases of the same type, our first experiment evaluates the performance of separate synonym and hyponym classifiers for each keyphrase type, resulting in six classification problems. The description of System-1 provides more details on the classifiers. The main challenge of developing individual classifiers for each task is the limited number of instances in the dataset. For example, there are only 11 relation instances between TASK (TA) keyphrases in the training data and only a single one in the development data. Hence, individual classifiers might not generalize well enough. An alternative approach is therefore to train one synonym classifier and one hyponym classifier for all keyphrase pairs, ignoring their types. This gives a higher number of positive training instances (249 for synonym and 414 for hyponym), as shown in Table 2. This is the approach taken with System-2. In both of these problem formulations, synonym prediction is a binary classification problem, whereas hyponym prediction is treated as a ternary classification problem (i.e., forward relation, backward relation and no relation).

Approach-2: Hyponym Relation-Direction Prediction
Since the hyponym relation is directed, another option is to predict its direction separately. Whereas in Approach-1 hyponym relations and their direction were predicted simultaneously as a three-class problem, in Approach-2 we have developed two systems, one for relation prediction and one for direction prediction, and connected them in a pipeline. System-3 thus refers to a pipelined classification of hyponym relations.
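The difference between the two formulations can be sketched as follows. The classifiers are passed in as plain callables; the label strings "forward", "backward" and "none", the stub classifiers and the feature names are all hypothetical stand-ins for trained models.

```python
def single_step_predict(pair_features, ternary_clf):
    """Approach-1 style: one classifier chooses among three labels."""
    return ternary_clf(pair_features)

def pipeline_predict(pair_features, relation_clf, direction_clf):
    """Approach-2 style: detect a relation first, then its direction.

    relation_clf(features) -> True if a hyponym relation is predicted.
    direction_clf(features) -> "forward" or "backward".
    """
    if not relation_clf(pair_features):
        return "none"
    return direction_clf(pair_features)

# Hypothetical stand-ins for trained classifiers.
def relation_stub(features):
    return features["score"] > 0.5

def direction_stub(features):
    return "forward" if features["first_is_specific"] else "backward"

related = {"score": 0.9, "first_is_specific": True}
unrelated = {"score": 0.1, "first_is_specific": True}

print(pipeline_predict(related, relation_stub, direction_stub))    # forward
print(pipeline_predict(unrelated, relation_stub, direction_stub))  # none
```

One property of the pipeline is visible in the sketch: a pair rejected by the first stage never reaches the direction classifier, so first-stage false negatives cannot be recovered later.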

Experiments
Preprocessing. Input text is linguistically analyzed with the Stanford CoreNLP library (Manning et al., 2014), which includes sentence boundary detection, tokenization, lemmatization, part-of-speech (POS) tagging and dependency parsing.
Feature Extraction. Features are extracted for every possible keyphrase pair within a sentence. The feature extraction process depends heavily on contextual information and dependency structures, specifically the shortest dependency path between two keyphrase heads and the dependency subtree connecting two keyphrases, as described by Liu et al. (2015). The major feature categories are:
Feature Selection Methods. As shown in Table 1, the keyphrase length and the in-between context length (λ) can be arbitrarily large. As a result, the feature extraction process generates a large number of features, many of which are unlikely to provide any useful information. We therefore investigated three different feature selection techniques, as shown in the bottom half of Table 3. Among these, χ²-based feature selection (X2) gave the best result.
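To make the idea of χ²-based ranking concrete, the following sketch scores binary features against a binary class with the textbook 2×2 contingency statistic and keeps the k best. This is an assumption-laden illustration: the statistic may differ from the scores computed by the library routine used in our experiments, and the function names and feature names are hypothetical.

```python
def chi2_score(feature_values, labels):
    """Chi-square statistic of a binary feature against a binary label."""
    a = b = c = d = 0
    for present, positive in zip(feature_values, labels):
        if present and positive:
            a += 1          # feature present, positive class
        elif present:
            b += 1          # feature present, negative class
        elif positive:
            c += 1          # feature absent, positive class
        else:
            d += 1          # feature absent, negative class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0          # constant feature or constant label
    return n * (a * d - b * c) ** 2 / denom

def select_top_k(feature_table, labels, k):
    """Return the names of the k highest-scoring features."""
    scores = {name: chi2_score(values, labels)
              for name, values in feature_table.items()}
    return sorted(scores, key=lambda name: (-scores[name], name))[:k]

# Toy data: 3 positive and 5 negative pairs; feature names are made up.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
features = {
    "path_has_appos": [1, 1, 1, 0, 0, 0, 0, 0],
    "noise":          [1, 0, 1, 0, 1, 0, 1, 0],
    "constant":       [1, 1, 1, 1, 1, 1, 1, 1],
}

print(chi2_score(features["path_has_appos"], labels))  # 8.0
print(select_top_k(features, labels, k=2))  # ['path_has_appos', 'noise']
```

Here k plays the role of the number of retained features, which is the quantity tuned during cross validation.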
Parameter Optimization through CV. The training instances were extracted from 350 training files and indexed by training file name, followed by preprocessing and feature extraction as described above. The class weights, the parameters of the five classifiers and k (the number of top-ranked features kept by χ²-based feature selection) were optimized for the three experimental setups (Systems 1-3) described below using five-fold cross validation with grid search, where training instances from the same training file are always placed in the same fold. Our implementation relied on the classifiers, feature selection methods and CV grid search from scikit-learn.
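The fold constraint can be sketched in a few lines. This is a simplified round-robin assignment under assumed inputs, not our actual code: the function name grouped_folds and the file names are hypothetical. scikit-learn provides GroupKFold for the same purpose.

```python
def grouped_folds(instance_files, n_folds=5):
    """Assign a fold index to every instance so that all instances
    extracted from the same file end up in the same fold.

    instance_files: one source file name per training instance.
    """
    files = sorted(set(instance_files))
    # Round-robin over files; fold sizes are only roughly balanced.
    fold_of_file = {name: i % n_folds for i, name in enumerate(files)}
    return [fold_of_file[name] for name in instance_files]

# Toy example: nine instances drawn from six files (names are made up).
instance_files = ["doc_a", "doc_a", "doc_b", "doc_c", "doc_c",
                  "doc_c", "doc_d", "doc_e", "doc_f"]

print(grouped_folds(instance_files))  # [0, 0, 1, 2, 2, 2, 3, 4, 0]
```

Keeping a file's instances together prevents pairs from the same sentence or document from appearing on both sides of a train/validation split.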
For each task, we optimized the hyperparameters of five classifiers as shown in Table 3. The performance of the best classifier was then evaluated on the development dataset. For the hyponym relation, we optimized the micro-averaged score over the forward and backward relations.
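A micro-averaged F1 over the two directed classes can be computed by pooling true positives, false positives and false negatives across both classes, with "no relation" treated as the negative class. The sketch below assumes the label strings "forward", "backward" and "none", which are hypothetical names, and it is not necessarily identical to the official task scorer.

```python
def micro_f1(gold, predicted, positive_labels=("forward", "backward")):
    """Micro-averaged F1 over the positive labels only."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if p in positive_labels:
            if p == g:
                tp += 1
            else:
                fp += 1     # wrong or spurious positive prediction
        if g in positive_labels and p != g:
            fn += 1         # gold relation missed or given wrong direction
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = ["forward", "backward", "none", "none", "forward", "none"]
pred = ["forward", "none", "none", "backward", "backward", "none"]

# tp = 1, fp = 2, fn = 2, so precision = recall = F1 = 1/3.
print(round(micro_f1(gold, pred), 3))  # 0.333
```

Note that a relation predicted with the wrong direction counts both as a false positive for the predicted class and as a false negative for the gold class.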
System-2. System-2 consists of a combination of one synonym classifier and one hyponym classifier.
System-3. Hyponym relations and their directions were predicted by separate classifiers connected in a pipeline. Parameters were therefore optimized for relation and direction prediction separately. The synonym predictions of System-3 result from the combination of the synonym classifier of 1-4 and 2, where any keyphrase pair predicted by either classifier 1-4 or classifier 2 is considered a synonym.
Table 4 shows the results of Systems 1-3 on the development data, while Table 5 shows the performance on the test data. According to Table 4, the combined performance of the individual classifiers of System-1 for synonym (SM-SP-ST) and hyponym (HM-HP-HT) is 77% and 29%, respectively, which is slightly lower than the corresponding performance of System-2. This is consistent with the performance on the test data. On the other hand, the pipeline of System-3 shows a lower score than System-1 and System-2 for the hyponym relation.

Error Analysis
We have analyzed the mistakes produced by Systems 1-3 and found the following frequent error categories:
• synonyms: The synonyms with pattern KEYPHRASE1 (KEYPHRASE2 in abbrevi-
• hyponyms with conjunctions: When a list of hyponyms is connected by conjunctions, some hyponyms are often missed.
• hyponym to synonym: In some cases hyponym patterns are quite similar to frequent synonym patterns and are therefore misclassified. For example, in the sentence fragment 'xR is the x-position of the receiving element (R)', the keyphrase 'R' is connected with 'receiving element' by a synonym relation, whereas the correct relation is hyponym.
• synonym to hyponym: In some cases a synonym relation is observed instead of a hyponym relation. For example, in 'constituent statistics (SB, SDSD, and LCS)', the keyphrases 'SDSD' and 'LCS' are correctly linked to 'constituent statistics' by a hyponym relation, but 'SB' is incorrectly linked as a synonym.

Conclusion
We have described our system for predicting synonym and hyponym relations between keyphrases within a feature-based supervised learning framework. We developed three systems for the synonym and hyponym prediction tasks. Experiments showed that, with a relatively small dataset, training a single classifier each for synonym and hyponym relations works slightly better than training separate classifiers for each keyphrase type. We also found that a pipeline of classifiers for relation and direction prediction of hyponym relations is less effective than predicting relation and direction simultaneously. As future work, we plan to investigate the performance of neural network-based relation classification approaches (specifically convolutional and recurrent neural networks).