Talla at SemEval-2018 Task 7: Hybrid Loss Optimization for Relation Classification using Convolutional Neural Networks

This paper describes our approach to SemEval-2018 Task 7: given an entity-tagged text from the ACL Anthology corpus, identify and classify pairs of entities that hold one of six possible semantic relationships. Our model is a convolutional neural network that leverages pre-trained word embeddings, unlabeled ACL abstracts, and multiple window sizes to automatically learn useful features from entity-tagged sentences. We also experiment with a hybrid loss function, a combination of cross-entropy loss and ranking loss, to increase the separation between classification scores. Lastly, we include WordNet-based features to further improve the performance of our model. Our best model achieves macro-F1 scores of 74.2 and 84.8 on subtasks 1.1 and 1.2, respectively.


Introduction
Classifying the relationship between entities is an important natural language processing (NLP) task that serves as a building block for a variety of NLP applications such as knowledge base construction and question answering. SemEval-2018 Task 7 (Gábor et al., 2018) provided entity-tagged texts from the ACL Anthology corpus and asked participants to identify and classify entity pairs into one of six semantic relationships.
Our approach to this problem consisted of selecting two architectures shown to be successful in this problem domain (Zeng et al., 2014; Nguyen and Grishman, 2015) and adapting them to this particular task. We found that pre-trained word embeddings were effective for this problem, as were a combined loss function and the use of WordNet features at a later stage of our model.

System Description
Convolutional neural networks (CNNs) have been shown to significantly outperform other methods for relation classification (Zeng et al., 2014; dos Santos et al., 2015; Nguyen and Grishman, 2015). Our approach was inspired by Nguyen and Grishman (2015) and dos Santos et al. (2015), both systems being minimally dependent on explicit feature engineering. While Nguyen and Grishman (2015) relied solely on their model architecture to automatically extract useful features, we also included additional features based on part-of-speech tags and WordNet hypernyms. Following dos Santos et al. (2015), we trained our model on a hybrid objective function, a combination of cross-entropy loss and ranking loss. Finally, we also trained our model in two stages to utilize a large amount of unlabeled ACL corpus abstracts (Bird et al., 2008). We describe these stages in detail in section 4.1.
Each abstract was first tokenized into sentences. For each sentence, we then formed training examples by taking all combinations of pairs of entities annotated in the sentence. If a pair is annotated with a relation label, we labeled the sentence with this relation; otherwise, we labeled the sentence with an artificial class called OTHER. Figure 1 shows an example of a training instance in our sentence-level dataset.
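The pairing scheme above can be sketched as follows. This is a minimal illustration, not the authors' code; the entity markup, the `USAGE` label, and the function name are hypothetical stand-ins.

```python
from itertools import combinations

def build_examples(sentence, entities, relations):
    """entities: list of entity ids; relations: {(e1, e2): label}."""
    examples = []
    for e1, e2 in combinations(entities, 2):
        # Use the annotated label if the pair is related, else OTHER.
        label = relations.get((e1, e2)) or relations.get((e2, e1)) or "OTHER"
        examples.append((sentence, e1, e2, label))
    return examples

sent = "The <e1>parser</e1> uses a <e2>grammar</e2> and a <e3>lexicon</e3>."
pairs = build_examples(sent, ["e1", "e2", "e3"], {("e1", "e2"): "USAGE"})
# Three entity pairs: one annotated relation, two OTHER instances.
```

Note that a sentence with k annotated entities yields k(k−1)/2 training instances, most of which fall into the OTHER class.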
For an input sentence, each word was mapped to a word vector to form a sentence matrix. These sentence matrices were provided as inputs to the CNN. The output of this layer was then fed into a softmax layer to classify the relationship between two entities. Section 3 provides a detailed description of the underlying CNN.

Convolutional Neural Network for Relation Classification
Our model consists of a preprocessing feature generation step followed by a 2D convolutional layer with max pooling and then a fully connected layer with softmax output.

Preprocessing and Feature Generation
The input to our model was a raw sentence marked with entity positions. This raw sentence was first converted into a real-valued sentence matrix by tokenizing the sentence and then replacing each token with a corresponding word embedding. We used three different look-up tables: publicly available pre-trained word embeddings, randomly initialized word-position embeddings, and randomly initialized part-of-speech tag embeddings. Following Collobert et al. (2011), the final word embedding for each token in the sentence was a concatenation of these three embeddings.
The specific process used to generate the feature vector for a given token in a sentence is as follows. Let n be the number of tokens in a sentence x = [x_1, x_2, ..., x_n], with x_{i1} and x_{i2} being the head words of the two entities in relationship r. The relative positions of a token x_i with respect to the two entities are given by (i − i1) and (i − i2). These positions are mapped into real-valued vectors using a position-embeddings look-up table W^p. Similarly, look-up tables W^e and W^t map each word and its part-of-speech tag into real-valued vectors. For a token x_i, e_i is the word vector mapped using the W^e look-up table, u_{i1} and u_{i2} are the word-position vectors mapped using the W^p look-up table, and t_i is the part-of-speech vector mapped using the W^t look-up table. The final input vector for each token is the concatenation of these four vectors, so its dimension d is given by d = d_e + 2d_p + d_t, where d_e, d_p and d_t are the dimensions of the pre-trained word embeddings, word-position embeddings and part-of-speech tag embeddings, respectively. As a result of these look-up operations, the raw sentence is transformed into a real-valued sentence matrix of size n × d.
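The look-up-and-concatenate step can be sketched numerically as below. All dimensions, vocabulary sizes, and the position-clipping range are illustrative placeholders, not the paper's hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not the paper's actual hyperparameters).
d_e, d_p, d_t = 8, 3, 2
vocab, n_pos, max_dist = 20, 5, 10

W_e = rng.normal(size=(vocab, d_e))              # word embeddings
W_p = rng.normal(size=(2 * max_dist + 1, d_p))   # relative-position embeddings
W_t = rng.normal(size=(n_pos, d_t))              # POS-tag embeddings

def token_vector(word_id, pos_tag_id, i, i1, i2):
    """Concatenate word, two relative-position, and POS-tag embeddings."""
    u1 = W_p[np.clip(i - i1, -max_dist, max_dist) + max_dist]
    u2 = W_p[np.clip(i - i2, -max_dist, max_dist) + max_dist]
    return np.concatenate([W_e[word_id], u1, u2, W_t[pos_tag_id]])

v = token_vector(word_id=4, pos_tag_id=1, i=3, i1=0, i2=5)
# Final dimension d = d_e + 2*d_p + d_t
assert v.shape == (d_e + 2 * d_p + d_t,)
```

Stacking these vectors for all n tokens yields the n × d sentence matrix fed to the convolutional layer.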

2D Convolution with Max Pooling
We used multiple window sizes to extract features corresponding to various n-grams. Let w be a window size and n_w be the number of unique window sizes. A filter f is a weight matrix of size w × d applied over a window of w consecutive tokens. A convolution operation is performed using the sentence matrix x and filter f to produce a feature map s = [s_1, s_2, ..., s_{n−w+1}], where each element is computed as s_i = g(f · x_{i:i+w−1} + b), with b a bias term and g the ReLU (Nair and Hinton, 2010) activation function. This convolution was repeated for different filters and window sizes, and then a max-pooling strategy (Zhang and Wallace, 2017) was applied to extract only one feature (the one with the highest activation) from each feature map. That is, for each feature map s, a max function was applied to produce a single value: p_f = max(s).
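The convolution and max-pooling step can be sketched as below for a single filter; the dimensions are illustrative, and the loop-based convolution is written for clarity rather than efficiency.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, w = 7, 6, 3            # tokens, embedding dim, window size (illustrative)

X = rng.normal(size=(n, d))  # sentence matrix
f = rng.normal(size=(w, d))  # one convolution filter
b = 0.1

relu = lambda z: np.maximum(z, 0.0)

# Feature map: one activation per window of w consecutive tokens.
s = np.array([relu(np.sum(f * X[i:i + w]) + b) for i in range(n - w + 1)])

# Max pooling: keep only the highest activation from this feature map.
p_f = s.max()
```

In the full model this is repeated for every filter at every window size, producing one pooled value per feature map.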

Classification Layer
We took all the individually selected features from the max-pooling operation and concatenated them together, producing z = [p_1, p_2, ..., p_{n_w m}], where n_w is the number of unique window sizes, m is the number of filters per window size, and p_i is the pooled value for the i-th feature map. A random proportion of the input vector z was set to zero for regularization purposes, producing a dropped-out version z_d. The vector z_d was then fed into a dense layer followed by a softmax operation to produce the final classification probability for a relation class r as p(r | x) = exp(o_r) / Σ_{l=1}^{L} exp(o_l), where o = C^T z_d + b is the output of the dense layer, b is a bias term, L is the number of relation categories, and C is a weight matrix of size (n_w m × L).
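The classification layer can be sketched as follows. The sizes and the dropout rate are illustrative placeholders, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(2)
n_w, m, L = 3, 4, 6               # window sizes, filters per size, classes

z = rng.normal(size=(n_w * m,))   # concatenated pooled features
keep = rng.random(n_w * m) > 0.5  # dropout mask (illustrative rate)
z_d = z * keep                    # dropped-out feature vector

C = rng.normal(size=(n_w * m, L)) # dense-layer weights
b = np.zeros(L)                   # bias term

o = z_d @ C + b                       # dense layer output, one score per class
p = np.exp(o - o.max()); p /= p.sum() # numerically stable softmax
```

The predicted relation is then simply the class with the highest probability, `p.argmax()`.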

Additional Features
In addition to the output of the CNN layer, we explored a variety of additional features derived from the input sentences.
Part-of-speech features, pos We randomly initialized embeddings for each part-of-speech tag and used these embeddings as additional input to our network. Part-of-speech tags for raw sentences were generated using spaCy (spacy.io).
WordNet hypernym features, hyp We incorporated WordNet hypernyms using the implementation provided by Ciaramita and Altun (2006) (sourceforge.net/projects/supersensetag/).

Semantic similarity between two entities, sim We computed the cosine similarity between the word embeddings of the head words of the two entities in a relation instance.
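The sim feature is a standard cosine similarity between the two head-word embedding vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical directions give 1.0; orthogonal vectors give 0.0.
same = cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0]))
orth = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```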
REVERSE flag feature, rev We applied an indicator function on the REVERSE flag of the relationship instance.
The hyp, sim, and rev features were fed as additional inputs to the classification layer, while the pos features were provided as input to the convolution layer.

Training Methods
We evaluated three different loss functions for training our model: cross-entropy loss, ranking loss (dos Santos et al., 2015), and a weighted combination of the two, where L combined = αL ranking + (1 − α)L cross entropy with α as a weighting parameter. The combined loss function was determined to be the most effective.
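The hybrid objective can be sketched as below for a single example. This is an illustrative reimplementation, not the authors' code; the margin and scaling values (γ = 2, m⁺ = 2.5, m⁻ = 0.5) follow those reported by dos Santos et al. (2015), and the selection of the most-competitive negative class follows the same paper.

```python
import numpy as np

def ranking_loss(scores, y, gamma=2.0, m_pos=2.5, m_neg=0.5):
    """Pairwise ranking loss of dos Santos et al. (2015) for one example.
    scores: per-class scores; y: gold class index."""
    s_pos = scores[y]
    s_neg = np.delete(scores, y).max()  # most-competitive negative class
    return (np.log1p(np.exp(gamma * (m_pos - s_pos)))
            + np.log1p(np.exp(gamma * (m_neg + s_neg))))

def cross_entropy_loss(scores, y):
    p = np.exp(scores - scores.max()); p /= p.sum()
    return -np.log(p[y])

def combined_loss(scores, y, alpha=0.5):
    # L_combined = alpha * L_ranking + (1 - alpha) * L_cross_entropy
    return alpha * ranking_loss(scores, y) + (1 - alpha) * cross_entropy_loss(scores, y)

s = np.array([0.2, 3.0, -1.0])
loss = combined_loss(s, y=1, alpha=0.5)
```

Setting α = 0 recovers plain cross-entropy, and α = 1 recovers the pure ranking loss, so the weighting parameter interpolates between the two regimes.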

Two-Staged Training
To make use of unlabeled data for fine-tuning word-position and part-of-speech embeddings, we trained our model in two stages, following Severyn and Moschitti (2015): a distant training stage and a supervised training stage.

Distant Training
We first created a distantly supervised training dataset from unlabeled ACL corpus abstracts (Bird et al., 2008), based on the naive assumption that two entities have the same relationship across all aligned sentences. By aligned sentences, we mean all sentences which have exactly two entities. To create distantly supervised training data based on this assumption, we performed the following operations: i) All the sentences in the ACL corpus were indexed in an IR system; here we used Whoosh.
ii) For each relation instance in the labeled training data, the top 40 sentences which contained both the entity texts in the relation were returned from the IR system.
iii) Result sentences in which the distance (in characters) between the two entity texts was greater than 170 were removed. This threshold was derived from distance statistics of the given labeled datasets.
iv) The remaining sentences were labeled with the same relationship as the relation instance in (ii).
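Steps (ii)–(iv) can be sketched as below. This is a simplified illustration: `search` stands in for the IR system (Whoosh in our case), the character-distance check uses the first occurrence of each entity string, and the corpus and relation instance are toy placeholders.

```python
def distant_examples(relation_instances, search, max_char_dist=170, top_k=40):
    """Build distantly supervised examples from retrieved sentences."""
    examples = []
    for e1, e2, label in relation_instances:
        for sent in search(e1, e2)[:top_k]:           # step (ii): top-k retrieval
            i1, i2 = sent.find(e1), sent.find(e2)
            if i1 < 0 or i2 < 0:
                continue
            if abs(i2 - i1) > max_char_dist:           # step (iii): distance filter
                continue
            examples.append((sent, e1, e2, label))     # step (iv): propagate label
    return examples

# Toy stand-in for the IR system over a two-sentence corpus.
corpus = ["a parser needs a grammar to work", "the parser consults its grammar"]
search = lambda e1, e2: [s for s in corpus if e1 in s and e2 in s]

out = distant_examples([("parser", "grammar", "USAGE")], search)
```

The real pipeline additionally deduplicates against the test sets, as noted below.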
Following the above steps, we created distant datasets of around 1,600 and 11,000 training instances for subtasks 1.1 and 1.2, respectively. We also verified that there is no overlap between these generated distant datasets and the provided test datasets.
Once the distantly supervised training data was created, we trained our model on these datasets to fine-tune only the word-position and part-of-speech tag embeddings, while keeping the word embeddings fixed.
Supervised Training
In the second stage, we initialized our model with the fine-tuned embeddings from the distant training stage and then trained it on the provided labeled training data. In this stage we also trained the word embeddings, but froze them for the first 10 epochs to prevent large updates.

Experiments and Results
The class labels for subtasks 1.1 and 1.2 are highly imbalanced (Table 1). To compensate for this imbalance, we trained our models for subtasks 1.1 and 1.2 jointly on a combined dataset and used class-weights to weight our loss function.
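One common way to derive such class weights is inverse-frequency weighting; the sketch below uses hypothetical label counts, and the exact weighting scheme we used may differ.

```python
from collections import Counter

# Illustrative label distribution (hypothetical counts, not Table 1).
labels = ["USAGE"] * 50 + ["RESULT"] * 10 + ["COMPARE"] * 5
counts = Counter(labels)
n, k = len(labels), len(counts)

# Inverse-frequency weights: weight(c) = n / (k * count(c)).
class_weights = {c: n / (k * cnt) for c, cnt in counts.items()}
# Rarer classes receive proportionally larger loss weights.
```

These per-class weights scale each example's contribution to the loss, so errors on rare relations are penalized more heavily than errors on frequent ones.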

Resources and Hyperparameters
We chose all the hyperparameters based on model performance on our validation set. All experiments below use the hyperparameters shown in Table 2. Our final model is a soft-voting ensemble of the best models obtained using 10-fold stratified cross-validation. All models were implemented using TensorFlow. We trained our models using a stochastic gradient descent optimizer with momentum. Lastly, based on our experiments, we chose the 300-dimensional word2vec pre-trained word embeddings trained on the Google News corpus.

Table 3 shows the results of our ablation studies using different feature sets on subtasks 1.1 and 1.2. The simple similarity (sim) feature helps in subtask 1.2, while it degrades performance in subtask 1.1. Similarly, WordNet features combined with part-of-speech features boosted performance only on subtask 1.1. Fine-tuning using the two-staged training approach did not yield any performance gain in either subtask.

Evaluation
We also evaluated the effect of loss function choice; Table 4 shows the results of these experiments. While we did not provide a formal submission for subtask 2, we evaluated our approach on it using the labeled test data. Table 5 shows the results of our experiments on subtask 2 (relation extraction). While this method did not outperform the top submissions, it still demonstrated competitive results.

Conclusion
Our experiments indicate that pre-training on an unlabeled corpus did not noticeably impact performance on our evaluation set. Our plain CNN model (without any external features) has performance comparable to the competition's best submission. We also observed improved performance on subtask 1.1 when using the WordNet features as additional input to the final layer. Finally, when we combined the cross-entropy and ranking loss functions, the performance of our model improved on both subtasks 1.1 and 1.2.