UC3M-NII Team at SemEval-2018 Task 7: Semantic Relation Classification in Scientific Papers via Convolutional Neural Network

This paper reports our participation in SemEval-2018 Task 7 on the extraction and classification of relationships between entities in scientific papers. Our approach is based on a Convolutional Neural Network (CNN) trained on 350 abstracts with manually annotated entities and relations. Our hypothesis is that this deep learning model can extract and classify relations between entities in scientific papers at the same time. We use the Part-of-Speech tags and the distances to the target entities as part of the embedding of each word, and we blind all the entities with marker names. In addition, we use sampling techniques to overcome the class imbalance of this dataset. Our architecture obtained an F1-score of 35.4% for the relation extraction task and 18.5% for the relation classification task with a basic configuration of the one-step CNN.


Introduction
The number of scientific articles published every year is growing rapidly, which shows that we are living in an emerging knowledge era. Experts cannot cope with this explosion of information, and it is very hard to keep up to date with the state-of-the-art techniques in a given field. This arduous task could be eased by automatically identifying concepts in scientific articles and recognizing the semantic relations between them with Natural Language Processing (NLP) techniques.
The Semantic Relation Extraction and Classification in Scientific Papers task at SemEval-2018 Task 7 (Gábor et al., 2018) provides a framework for measuring the automatic annotation performance of models trained on abstracts of scientific publications. The task defines six categories of relations between concepts and proposes two subtasks: (1) the classification of the relations between two entities into the predefined categories, which is divided into two scenarios according to the data used, clean or noisy; and (2) the extraction of the relations given the entities from the clean data, which may also involve their subsequent classification.
In this paper, we describe our participation in SemEval-2018 Task 7 on the extraction of relationships between entities in scientific papers and the subsequent classification of these relations into the predefined classes with a one-step classifier. The model is based on the Convolutional Neural Network (CNN) proposed in (Kim, 2014), which was the first work to exploit this architecture for the task of sentence classification. The CNN is a robust deep-learning architecture that has exhibited good performance in other NLP tasks such as semantic clustering, sentiment analysis (Dos Santos and Gatti, 2014) and event detection (Nguyen and Grishman, 2015). For each instance, the model takes as input the real-valued vector representation of the words of the sentence, the distances of each word to the target entities and the Part-of-Speech types. Furthermore, we apply a sampling technique to alleviate the imbalance of the dataset by equalizing the number of instances across the classes.

Dataset
An annotated corpus for training and testing the participating systems was provided in SemEval-2018 Task 7. The dataset contains 350 abstracts from scientific articles for the training set and 150 for the test set.
The relation instances are divided into the following classes: USAGE, RESULT, MODEL, PART WHOLE, TOPIC and COMPARISON. All of them are asymmetrical except COMPARISON, where both entities are involved in the same bidirectional relation. A detailed description and analysis of the corpus and of the methodology used to collect and process the scientific abstracts can be found in (Gábor et al., 2018).

Pre-processing phase
The relations between scientific concepts are annotated pair by pair in the abstracts. All annotated relations lie within a single sentence; thus, we split the paragraphs of the abstracts into sentences with the NLTK tool to generate all the possible instances in the corpus.
After that, each instance was tokenized, all words were converted to lower case and special characters were removed in order to clean the sentences, following the approach described in (Kim, 2014). In addition, we used entity blinding for each relation to generalize the model: the two target entities of the relation were replaced by the entity markers "entity1" and "entity2", and the remaining entities by "entity0". Since relations can be asymmetrical, we considered both directions. In other words, for each pair of candidate entities, we generated two different instances. For the COMPARISON class, which is a bidirectional relationship, we annotated both instances with the same class label. For example, the sentence 'We suggest a method that mimics the behaviour of the oracle using a neural network or a decision tree.' is transformed into the relation instances shown in Table 1.
(entity1, entity2) | Instance after entity blinding
(oracle, neural network) | 'We suggest a method that mimics the behaviour of the entity1 using a entity2 or a entity0.'
(neural network, oracle) | 'We suggest a method that mimics the behaviour of the entity2 using a entity1 or a entity0.'
(oracle, decision tree) | 'We suggest a method that mimics the behaviour of the entity1 using a entity0 or a entity2.'
(decision tree, oracle) | 'We suggest a method that mimics the behaviour of the entity2 using a entity0 or a entity1.'
(neural network, decision tree) | 'We suggest a method that mimics the behaviour of the entity0 using a entity1 or a entity2.'
(decision tree, neural network) | 'We suggest a method that mimics the behaviour of the entity0 using a entity2 or a entity1.'
Table 1: Instances of a sentence in the corpus after applying the pre-processing phase with entity blinding.

Table 2 shows the number of instances extracted from the training set for each class. The None class represents the pairs of entities that are not related (negative instances). The number of positive instances is very low compared to the negative ones, 1323 over 19210 (around 7%), mainly because most classes are unidirectional and we annotated the reverse instance as None.
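The instance generation illustrated in Table 1 can be sketched as follows. This is a minimal illustration rather than the system's actual code: the function name `blind_instances` and the input format (token spans with an exclusive end index) are assumptions introduced here.

```python
from itertools import permutations

def blind_instances(tokens, entities):
    """Return (entity1, entity2, blinded sentence) for every ordered entity pair.

    tokens   -- the sentence as a list of tokens
    entities -- dict mapping an entity name to its (start, end) token span,
                with `end` exclusive (a hypothetical input format)
    """
    starts = {span[0]: name for name, span in entities.items()}
    instances = []
    # Ordered pairs: each unordered pair yields two instances, one per direction.
    for first, second in permutations(entities, 2):
        out, i = [], 0
        while i < len(tokens):
            if i in starts:
                name = starts[i]
                if name == first:
                    out.append("entity1")
                elif name == second:
                    out.append("entity2")
                else:
                    out.append("entity0")  # entity not involved in this pair
                i = entities[name][1]      # skip the rest of the entity span
            else:
                out.append(tokens[i])
                i += 1
        instances.append((first, second, " ".join(out)))
    return instances

sentence = ("we suggest a method that mimics the behaviour of the oracle "
            "using a neural network or a decision tree").split()
entities = {"oracle": (10, 11), "neural network": (13, 15), "decision tree": (17, 19)}
instances = blind_instances(sentence, entities)
```

Three entities produce six ordered pairs, matching the six rows of Table 1.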
We followed a sampling technique similar to the one described in (Wang et al., 2017) to equalize the number of instances per class. We randomly discarded 60% of the negative instances and duplicated the instances of each class until it reached the size of the most represented class, USAGE, with 483 instances. In this way, we try to mitigate the issues associated with the imbalanced dataset.
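A minimal sketch of this kind of resampling is shown below. It assumes that duplication is done by random sampling with replacement and that only the positive classes are oversampled; the function name `rebalance`, the seed and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def rebalance(instances, negative_label="None", drop_rate=0.6, seed=0):
    """Undersample the negative class and oversample the positive classes."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for text, label in instances:
        by_class[label].append((text, label))

    # Randomly discard a fraction `drop_rate` of the negative instances.
    negatives = by_class.pop(negative_label, [])
    rng.shuffle(negatives)
    n_drop = int(round(drop_rate * len(negatives)))
    balanced = negatives[: len(negatives) - n_drop]

    # Duplicate positive instances until every class matches the largest one.
    target = max(len(items) for items in by_class.values())
    for items in by_class.values():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend(items + extra)

    rng.shuffle(balanced)
    return balanced

# Toy data: 10 negatives and three positive classes of different sizes.
toy = ([("neg%d" % i, "None") for i in range(10)]
       + [("u%d" % i, "USAGE") for i in range(4)]
       + [("r%d" % i, "RESULT") for i in range(2)]
       + [("t0", "TOPIC")])
balanced = rebalance(toy)
counts = Counter(label for _, label in balanced)
```

In this toy run, four negatives are kept and each positive class is brought up to four instances.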

Method
In this section, we present a CNN model to detect and classify relationships between scientific concepts. Figure 1 shows the whole process from its input, which is a sentence with blinded entities, until the output, which is the classification of the instance into one of the relation types defined by the task.

Word table layer
First, we determined $n$ as the maximum sentence length in the training dataset. Sentences shorter than $n$ are padded with an auxiliary token "0". After that, we assigned a randomly initialized vector to each distinct word, thus creating a word embedding matrix $W_e \in \mathbb{R}^{|V| \times m_e}$, where $|V|$ is the vocabulary size and $m_e$ is the word embedding dimension. Finally, we obtained a matrix $x = [x_1; x_2; \ldots; x_n]$ for each instance, where the words are represented by their corresponding word embedding vectors.
In addition, we used the word position embedding described in (Zeng et al., 2014), which maps the distances of each word with respect to the two candidate entities into a real-valued vector using two position embedding matrices $W_{d1} \in \mathbb{R}^{(2n-1) \times m_d}$ and $W_{d2} \in \mathbb{R}^{(2n-1) \times m_d}$, where $m_d$ is the position embedding dimension. Moreover, we extracted the Part-of-Speech (POS) feature of each word (entities are marked as common nouns) and created a POS embedding matrix as in (Zhao et al., 2016), $W_{POS} \in \mathbb{R}^{|P| \times m_{POS}}$, where $|P|$ is the size of the POS tag vocabulary and $m_{POS}$ is the POS embedding dimension. Finally, we created an input matrix $X \in \mathbb{R}^{n \times (m_e + m_{POS} + 2m_d)}$, which is the concatenation of the word embedding, the POS embedding and the two position embeddings for each word in the instance.
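The padding and lookup steps can be sketched in a few lines. The helper names, the uniform initialization range and the toy sentences are assumptions made for illustration; the text only states that the vectors are randomly initialized.

```python
import random

PAD = "0"  # auxiliary padding token, as described in the text

def pad(tokens, n):
    """Right-pad a token list to length n."""
    return tokens + [PAD] * (n - len(tokens))

def build_embeddings(vocab, m_e, seed=0):
    """Assign a random vector of dimension m_e to each distinct word."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-1.0, 1.0) for _ in range(m_e)] for word in vocab}

sentences = [["entity1", "uses", "entity2"],
             ["we", "compare", "entity1", "with", "entity2"]]
n = max(len(s) for s in sentences)           # maximum sentence length
padded = [pad(s, n) for s in sentences]
vocab = sorted({w for s in padded for w in s})
W_e = build_embeddings(vocab, m_e=4)         # |V| vectors of dimension m_e
x = [W_e[w] for w in padded[0]]              # n vectors for the first instance
```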
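The sketch below shows why the position embedding matrices have 2n-1 rows: relative distances range from -(n-1) to n-1, and shifting them by n-1 gives valid row indices. The function names and the embedding dimensions are hypothetical values chosen for illustration, not the configuration used in the experiments.

```python
def relative_positions(tokens, marker):
    """Signed distance of every token to the first occurrence of `marker`."""
    target = tokens.index(marker)
    return [i - target for i in range(len(tokens))]

def position_index(distance, n):
    """Map a distance in [-(n-1), n-1] to a row index in [0, 2n-2]."""
    return distance + (n - 1)

tokens = ["entity1", "uses", "entity2", "0", "0"]
n = len(tokens)
d1 = relative_positions(tokens, "entity1")    # distances to the first target
d2 = relative_positions(tokens, "entity2")    # distances to the second target
rows1 = [position_index(d, n) for d in d1]    # rows looked up in W_d1
rows2 = [position_index(d, n) for d in d2]    # rows looked up in W_d2

# Width of one row of the input matrix X (dimensions are assumed values).
m_e, m_pos, m_d = 300, 10, 5
input_width = m_e + m_pos + 2 * m_d
```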

Convolutional layer
Once we obtained the input matrix, we applied the convolutional operation with a context window of size $w$ to create higher-level features. For each filter $f = [f_1; f_2; \ldots; f_w]$, we created a score vector for the whole sentence as
$$s_i = g\left(\sum_{j=1}^{w} f_j x_{i+j-1}^{T} + b\right), \quad i = 1, \ldots, n-w+1,$$
where $b$ is a bias term and $g$ is a non-linear function (such as the hyperbolic tangent or the sigmoid). This operation is repeated for each of the $m$ filters.
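As a rough illustration, the following sketch applies one filter over a small input matrix. It assumes the usual window-sum form of the score (dot products of the filter rows with consecutive input rows, plus a bias, passed through a non-linearity); the function name and the toy values are hypothetical.

```python
import math

def convolve(X, f, b=0.0, g=math.tanh):
    """Slide a filter f (w rows) over X (n rows) and return n - w + 1 scores."""
    n, w = len(X), len(f)
    scores = []
    for i in range(n - w + 1):
        total = b
        for j in range(w):
            # Dot product of filter row j with input row i + j.
            total += sum(fv * xv for fv, xv in zip(f[j], X[i + j]))
        scores.append(g(total))
    return scores

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]  # n = 4 rows
f = [[0.5, -0.5], [1.0, 1.0]]                           # window size w = 2
s = convolve(X, f)                                      # n - w + 1 = 3 scores
```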

Pooling layer
We extracted the most relevant feature of each filter using the max function, which produces a single value per filter as $z_f = \max\{s\} = \max\{s_1; s_2; \ldots; s_{n-w+1}\}$. Thus, we created a vector $z = [z_1, z_2, \ldots, z_m]$, whose dimension is the total number of filters $m$, representing the relation instance. In the end, we concatenated the output values of the different filters in this layer.
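Max pooling over the filter scores reduces to taking one maximum per filter, as in this small sketch (the score values are made up for illustration):

```python
def max_pool(score_vectors):
    """Keep the largest score of each filter, giving one value per filter."""
    return [max(scores) for scores in score_vectors]

# Scores of m = 3 filters over a sentence with n - w + 1 = 3 windows.
scores = [[0.1, 0.9, -0.3],
          [-0.5, -0.2, -0.8],
          [0.0, 0.4, 0.4]]
z = max_pool(scores)  # vector of dimension m representing the instance
```

The resulting vector has a fixed length m regardless of the sentence length.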

Softmax layer
In this layer, we applied dropout to prevent over-fitting, obtaining a reduced vector $z_d$ by randomly dropping elements of $z$. After that, we fed this vector into a fully connected softmax layer with weights $W_s \in \mathbb{R}^{m \times k}$ to compute the output prediction values for the classification as
$$o = z_d W_s + d,$$
where $d$ is a bias term and $k$ is the number of classes. At test time, the vector $z$ of a new instance is classified directly by the softmax layer without dropout.
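The following sketch illustrates dropout followed by a fully connected softmax layer. The rescaling of the kept elements (inverted dropout) is a common variant assumed here and is not confirmed by the text; all names, weights and the dropout rate are hypothetical.

```python
import math
import random

def dropout(z, rate, rng):
    """Zero each element with probability `rate` and rescale the kept ones."""
    if rate <= 0.0:
        return list(z)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in z]

def output_layer(z, W_s, d):
    """Compute o = z W_s + d for a vector z of size m and W_s of size m x k."""
    return [sum(z[i] * W_s[i][c] for i in range(len(z))) + d[c]
            for c in range(len(d))]

def softmax(o):
    """Numerically stable softmax over the output values."""
    top = max(o)
    exps = [math.exp(v - top) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

z = [0.9, -0.2, 0.4]                           # pooled vector, m = 3
W_s = [[0.1, -0.3], [0.5, 0.2], [-0.4, 0.7]]   # m x k weights, k = 2
d = [0.0, 0.1]                                 # bias term
rng = random.Random(0)

train_probs = softmax(output_layer(dropout(z, 0.5, rng), W_s, d))  # training
test_probs = softmax(output_layer(z, W_s, d))                      # test time
```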

Learning
We defined the CNN parameter set to be learned in the training phase as $\theta = (W_e, W_{POS}, W_{d1}, W_{d2}, W_s, F_m)$, where $F_m$ denotes all of the $m$ filters $f$. For this purpose, we used the conditional probability of a relation $r$ obtained by the softmax operation,
$$p(r \mid x, \theta) = \frac{\exp(o_r)}{\sum_{l=1}^{k} \exp(o_l)},$$
to minimize the cross-entropy function over all instances $(x_i, y_i)$ in the training set $T$:
$$J(\theta) = -\sum_{i=1}^{|T|} \log p(y_i \mid x_i, \theta).$$
We minimized this objective function using stochastic gradient descent over shuffled mini-batches with the Adam update rule (Kingma and Ba, 2014) to learn the parameters.
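To make the objective concrete, the sketch below computes a summed cross-entropy loss from raw output values and gold class indices. It assumes the standard negative log-likelihood form of the loss; the function name and the example values are hypothetical, and the optimization step itself is not shown.

```python
import math

def cross_entropy(outputs, labels):
    """Sum of -log p(y_i | x_i) with p given by a softmax over each output."""
    loss = 0.0
    for o, y in zip(outputs, labels):
        top = max(o)
        # log of the softmax normalizer, computed stably.
        log_norm = top + math.log(sum(math.exp(v - top) for v in o))
        loss -= o[y] - log_norm
    return loss

outputs = [[0.0, 0.0], [2.0, 0.0]]  # output values o for two instances, k = 2
labels = [0, 0]                     # gold class index of each instance
loss = cross_entropy(outputs, labels)
```

A confident correct prediction (second instance) contributes less to the loss than a uniform one (first instance).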

Results and Discussion
We set the CNN parameters for the experiments to the values described in Table 3. The number of epochs was tuned on the validation set using a stopping criterion. Our CNN system obtained an F1-score of 35.4% for the relation extraction task, in which only the detection of a relation is taken into consideration. The official results obtained for the relation classification task are shown in Table 4. Our model reaches a macro-averaged F1-score of 18.5% with a one-step classifier, which means that extraction and classification are performed at the same time. This performance was expected because we obtained similar results on a validation set created from the training set. Furthermore, we correctly predicted 147 instances with correct directionality out of 367 (i.e. 40.05% coverage).
The main problem is the high number of false positives (FP) in the majority of classes, that is, None instances classified as one of the relation classes. In some classes, such as PART WHOLE and USAGE, we also have a high number of false negatives (FN) compared to the total number of instances. We consider that the main reason is that the representations of the two directions of each relation are very similar: only the position distances and the target entity names are inverted, and the CNN cannot distinguish between them.
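For reference, the macro-averaged F1-score is the unweighted mean of the per-class F1-scores, so classes with many FP or FN pull the average down regardless of their size. The sketch below uses made-up counts, not the official results; only the coverage figure (147 out of 367) comes from the text.

```python
def f1(tp, fp, fn):
    """F1-score of one class from its true/false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(counts):
    """Unweighted mean of the per-class F1-scores."""
    return sum(f1(*c) for c in counts.values()) / len(counts)

# Hypothetical (TP, FP, FN) counts for two classes, for illustration only.
counts = {"USAGE": (50, 100, 50), "RESULT": (5, 15, 15)}
score = macro_f1(counts)

# Coverage reported in the text: correctly predicted instances over the total.
coverage = 100 * 147 / 367
```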

Conclusions and Future work
The UC3M-NII Team used a CNN model for the Relation Classification task of SemEval-2018. We balanced the dataset using sampling techniques, blinded the entities in the sentence and concatenated position and POS embeddings to the word embedding of each word to obtain a richer representation of each instance. This architecture obtained an F1-score of 35.4% and 18.5% for the relation extraction and classification tasks, respectively. As future work, we propose to use a two-step model that first extracts the relationships between two concepts and subsequently classifies them into the different semantic classes. In addition, we plan to stop labelling the reverse instances of each class as None in order to avoid having very similar representations with different labels. We plan to tackle the directionality problem with post-processing rules after the classification. Furthermore, we will train a CNN with different pre-trained word embedding models instead of using a random initialization.

Funding
This work was supported by the Research Program of the Ministry of Economy and Competitiveness, Government of Spain (DeepEMR project TIN2017-87548-C2-1-R) and the TEAM project (Erasmus Mundus Action 2, Strand 2 Programme) funded by the European Commission.