Texterra at SemEval-2018 Task 7: Exploiting Syntactic Information for Relation Extraction and Classification in Scientific Papers

In this work we evaluate the applicability of entity pair models and neural network architectures to relation extraction and classification in scientific papers at SemEval-2018. We experiment with representing entity pairs through sentence tokens and through the shortest path in the dependency tree, comparing approaches based on convolutional and recurrent neural networks. With a convolutional network applied to the shortest path in the dependency tree we ranked eighth in subtask 1.1 (“clean data”) and ninth in subtask 1.2 (“noisy data”). A similar model applied to separate parts of the shortest path ranked ninth (extraction track) and seventh (classification track) in subtask 2.


Introduction
Information extraction is an important part of natural language processing. SemEval-2018 included an evaluation devoted to the extraction and classification of relations in scientific papers (Gábor et al., 2018). The task is described as follows: given abstracts of scientific articles with detected entities, the goal is to choose the correct relations for provided (source, target) entity pairs (subtask 1: relation classification) and to determine the correct relations among all entity pairs (subtask 2: relation extraction and classification). The target quality metric in classification is the macro-average of the F1-scores of all classes; in the extraction scenario the target metric is the F1-score.
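The macro-averaged F1-score mentioned above can be illustrated with a minimal sketch. This is not the official scorer: the function names are hypothetical, the labels are only illustrative, and averaging over every label seen in either the gold or the predicted list is an assumption.

```python
from collections import Counter

def f1_per_class(gold, pred):
    """Return {label: F1} for every label that occurs in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p is a false positive
            fn[g] += 1  # gold label g was missed
    scores = {}
    for label in set(gold) | set(pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 (assumption: all seen labels count)."""
    scores = f1_per_class(gold, pred)
    return sum(scores.values()) / len(scores)

# Illustrative labels only; the gold data here is made up.
gold = ["USAGE", "USAGE", "USAGE", "COMPARE"]
pred = ["USAGE", "USAGE", "COMPARE", "COMPARE"]
print(f1_per_class(gold, pred))
print(macro_f1(gold, pred))
```

Because every class contributes equally to the mean, a rare relation type influences the score as much as a frequent one.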
Our method is based on multinomial classification of entity pairs and their sentences with neural networks. We experiment with representing entity pairs through all sentence tokens and through the tokens along the shortest path between the entities in the dependency tree (Bunescu and Mooney, 2005). We employ approaches based on convolutional (CNN) (LeCun et al., 1989) and bidirectional Long Short-Term Memory (biLSTM) (Hochreiter and Schmidhuber, 1997; Tan et al., 2015) neural networks to encode sentences and dependency tree paths.
In this work we mainly focus on relation classification, so most of the analysis and experiments are carried out for this task. Slightly modified versions of the models that achieve the best results on subtask 1 are adapted to the relation extraction and classification problem.
The rest of the paper is organized as follows: in section 2 we describe some known approaches to relation extraction and classification. In section 3 our approach is presented in detail. Section 4 outlines the evaluation results of the described approach on the official SemEval-2018 task 7 test set. We wrap up with some final thoughts in section 5.

Related Work
The relation extraction and classification problem has a long history. Early approaches were based on manually constructed patterns, used to detect entities in the relation under consideration (Blaschke and Valencia, 2001). Later approaches utilized machine learning algorithms (Zelenko et al., 2003) with various hand-crafted features (GuoDong et al., 2005): syntactic labels, part-of-speech tags, morphological properties and so on. A brief overview of such methods is presented in (Bach and Badaskar, 2007). A significant part of recent approaches is based on neural networks and tries to eliminate the dependency on natural language processing tools: (Lin et al., 2016) use CNNs to extract and classify relations; (Zeng et al., 2014) adapt deep CNNs to the same task. (Socher et al., 2012) introduce recursive neural networks which capture information from phrases and sentences and apply them to the relation classification task.
In this work we try to inject extra syntactic information, obtained from natural language processing tools, into neural network based approaches.

System Description
Our method is based on multinomial classification with neural networks. The decision about which relation holds is made by analysing the sentence that contains the examined entities. Each sentence is represented as a sequence of tokens or as a dependency tree.

Modelling Tokens
Tokens are encoded with fasttext (Bojanowski et al., 2017). Each token embedding also contains binary indicators of whether the token belongs to the source entity, the target entity or the rest of the sentence.
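As a rough illustration of this token representation, the sketch below appends three binary role indicators to each word vector. The function name, the span convention (half-open token index ranges) and the toy two-dimensional vectors are assumptions made for the example; real fasttext vectors would have 100 or 300 dimensions.

```python
def encode_tokens(tokens, vectors, source_span, target_span):
    """Concatenate each word vector with three binary role indicators.

    source_span / target_span are half-open (start, end) token index ranges;
    this convention is an assumption of the sketch.
    """
    encoded = []
    for i, token in enumerate(tokens):
        in_source = source_span[0] <= i < source_span[1]
        in_target = target_span[0] <= i < target_span[1]
        # Indicators: [source entity, target entity, rest of the sentence]
        role = [float(in_source), float(in_target),
                float(not (in_source or in_target))]
        encoded.append(list(vectors[token]) + role)
    return encoded

# Toy stand-ins for fasttext vectors (hypothetical values).
vectors = {"parser": [0.1, 0.2], "uses": [0.3, 0.4], "treebank": [0.5, 0.6]}
tokens = ["parser", "uses", "treebank"]
encoded = encode_tokens(tokens, vectors, source_span=(0, 1), target_span=(2, 3))
for token, row in zip(tokens, encoded):
    print(token, row)
```

Changing only the spans yields a different encoding of the same sentence, which is what lets one sentence carry several entity pairs.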

Modelling Entity Pair
The basic way to model an entity pair is to take all tokens of the sentence containing the entities and to encode them with the described embedding (section 3.1). The binary indicators of token membership in entities allow us to distinguish several relations in a single sentence.
Another idea is to use the path from the source entity tokens to the target entity tokens in the dependency tree. In our approach the shortest path is considered: it rises from the source entity directly to the lowest common ancestor and then immediately goes down to the target entity tokens (Figure 1). Note that dependency trees are built automatically and are sometimes inconsistent with the layout of the entities, which may be represented differently in the tree.
When the shortest path in the dependency tree is used, each token embedding is extended with additional information: the fasttext embedding of the parent token, the syntactic label and a direction indicator (whether the token is on the path from the source or from the target entity to the lowest common ancestor).
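The shortest path through the lowest common ancestor can be sketched as follows. The sketch assumes each entity is reduced to a single token index and that the tree is given as a list of head indices (`None` for the root); both are simplifications for illustration, and the function names are hypothetical.

```python
def path_to_root(node, heads):
    """List of token indices from `node` up to the root, inclusive."""
    path = [node]
    while heads[path[-1]] is not None:
        path.append(heads[path[-1]])
    return path

def shortest_path(source, target, heads):
    """Return (upward part from source, lowest common ancestor, downward part to target)."""
    up = path_to_root(source, heads)
    down = path_to_root(target, heads)
    ancestors = set(up)
    # First ancestor of the target that is also an ancestor of the source.
    lca = next(n for n in down if n in ancestors)
    up_part = up[:up.index(lca)]              # source side, direction "up"
    down_part = down[:down.index(lca)][::-1]  # target side, direction "down"
    return up_part, lca, down_part

# "The parser uses a treebank": head index per token, None marks the root.
tokens = ["The", "parser", "uses", "a", "treebank"]
heads = [1, 2, None, 4, 2]
up_part, lca, down_part = shortest_path(source=1, target=4, heads=heads)
print([tokens[i] for i in up_part], tokens[lca], [tokens[i] for i in down_part])
```

The split into an upward and a downward part corresponds to the direction indicator described above.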

Neural Network Architectures
The general architecture can be described as follows: some method is used to encode the sequence of input embeddings into a vector, which is then passed through a fully-connected layer L and finally fed into a softmax to output the predicted label. We experiment with two well-known approaches to encoding sequences of input embeddings: biLSTM and CNN.

BiLSTM
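The biLSTM-based method is heavily inspired by (Yang et al., 2016). For each sequence item $w_k^{0}$ we analyse its nearest context: two items to the left ($w_k^{-2}$, $w_k^{-1}$) and two to the right ($w_k^{1}$, $w_k^{2}$). Instead of using $w_k^{0}$ directly, its "attentioned" version $\omega_k$ is used, following the attention of (Yang et al., 2016):

$$u_k^{i} = \tanh\left(W w_k^{i} + b\right), \qquad \alpha_k^{i} = \frac{\exp\left((u_k^{i})^{\top} u\right)}{\sum_{j=-2}^{2} \exp\left((u_k^{j})^{\top} u\right)}, \qquad \omega_k = \sum_{i=-2}^{2} \alpha_k^{i} w_k^{i},$$

where $w_k^{i}$ are D-dimensional embedding vectors; $b$ and $u$ are A-dimensional attention vectors; $W$ is an A × D-dimensional attention matrix.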
The computed $\omega_k$ are then fed into a biLSTM network (hidden layer size B). Its final cell output and hidden state, together with an attention vector (computed as described above, but over all biLSTM outputs), are concatenated to form the final sequence encoding vector.
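The context attention step can be sketched in plain Python. The sketch assumes the attention takes the form used by Yang et al. (2016): a tanh projection with W and b, scoring against u, a softmax over the five-item window, and a weighted sum of the window vectors. The function name is hypothetical, and the handling of items near the sequence boundaries is not covered.

```python
import math

def attend(window, W, b, u):
    """Attention-weighted sum of the D-dimensional vectors in `window`.

    W is an A x D matrix (list of rows); b and u are A-dimensional vectors.
    Assumed form: score_i = u . tanh(W w_i + b), weights = softmax(scores).
    """
    scores = []
    for w in window:
        hidden = [math.tanh(sum(W[a][d] * w[d] for d in range(len(w))) + b[a])
                  for a in range(len(b))]
        scores.append(sum(h * ua for h, ua in zip(hidden, u)))
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(window[0])
    omega = [sum(alpha * w[d] for alpha, w in zip(alphas, window))
             for d in range(dim)]
    return omega, alphas

# Window of five toy vectors (D = 2): two left neighbours, the item, two right.
window = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [1.0, 3.0]]
# With all-zero parameters (A = 3) every score is equal, so weights are uniform.
W = [[0.0, 0.0] for _ in range(3)]
b = [0.0, 0.0, 0.0]
u = [0.0, 0.0, 0.0]
omega, alphas = attend(window, W, b, u)
print(alphas, omega)
```

With trained parameters the weights would differ per neighbour, so the central item would be replaced by a context-dependent mixture rather than a plain average.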

CNN
Another method is based on CNN. All input sequences are trimmed or padded to the same length. Then F filters are applied. Each filter application yields a vector of dimensionality (sequence length − filter height + 1); a single maximum value is taken from each such vector. These values are finally concatenated to form the final sequence encoding.
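The convolution followed by max pooling can be sketched as below. This is a minimal illustration under simplifying assumptions: filters span the full embedding width, no bias or non-linearity is applied, and the input is assumed to be already trimmed or padded. The function name and the toy values are hypothetical.

```python
def conv_max_pool(sequence, filters):
    """Apply each filter over all windows and keep the maximum response per filter.

    sequence: list of token vectors; each filter: list of `height` rows whose
    width equals the embedding dimensionality.
    """
    encoding = []
    for filt in filters:
        height = len(filt)
        responses = []
        # There are len(sequence) - height + 1 window positions.
        for start in range(len(sequence) - height + 1):
            window = sequence[start:start + height]
            responses.append(sum(f * x
                                 for frow, xrow in zip(filt, window)
                                 for f, x in zip(frow, xrow)))
        encoding.append(max(responses))  # max-over-time pooling
    return encoding

# Four tokens with embedding dimensionality 2 (toy values).
sequence = [[1, 0], [0, 1], [2, 2], [0, 0]]
filters = [
    [[1, 1], [1, 1]],           # height 2: 3 window positions
    [[1, 0], [0, 1], [-1, 0]],  # height 3: 2 window positions
]
print(conv_max_pool(sequence, filters))  # → [5, 2]
```

The encoding length equals the number of filters, independent of the sequence length, which is why the result can be passed directly to a fully-connected layer.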

Separate CNNs for Shortest Path Parts
The final method is a modification of the CNN one, specifically designed for modelling entity pairs with shortest paths. Instead of merging the different parts of the path into a single sequence, we use four individual sequences by analogy with (Zeng et al., 2015): source entity tokens, tokens on the path from the source to the lowest common ancestor, tokens on the path from the lowest common ancestor to the target, and target entity tokens. Four separate CNNs are applied to them (sCNN). The outputs of all networks are merged into a single final vector.

Relation Extraction
We adapt the classification approach to the relation extraction subtask. The first idea is to apply the same model to seven-class classification (six known relations and the absence of a relation). Secondly, we try a two-step approach with successive classifiers of the same architecture: an extraction classifier detects entity pairs associated with any relation, and then another classifier assigns relation labels to the extracted pairs of entities.
For negative example generation the following strategies are examined: reflection, where reversed correct asymmetric relations (all except COMPARE) are taken as negative examples; and in-sentence, where some random portion of the entity pairs which co-occur in the same sentence is treated as negative examples.
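The two negative example strategies can be sketched as follows. The `NONE` label, the function names and the `portion` parameter are hypothetical; the sketch also assumes that in-sentence candidates are ordered pairs in textual order with no gold relation in either direction, which the description above does not specify.

```python
import random

SYMMETRIC = {"COMPARE"}  # the only symmetric relation type

def reflection_negatives(relations):
    """Reverse every asymmetric gold relation and label the reversed pair as negative."""
    return [(tgt, src, "NONE") for src, tgt, label in relations
            if label not in SYMMETRIC]

def in_sentence_negatives(sentence_entities, relations, portion, rng):
    """Sample a portion of co-occurring entity pairs that carry no gold relation."""
    related = {(s, t) for s, t, _ in relations} | {(t, s) for s, t, _ in relations}
    candidates = [(entities[i], entities[j], "NONE")
                  for entities in sentence_entities
                  for i in range(len(entities))
                  for j in range(i + 1, len(entities))
                  if (entities[i], entities[j]) not in related]
    return rng.sample(candidates, int(len(candidates) * portion))

# Toy gold relations as (source, target, label); "USAGE" is illustrative.
relations = [("E1", "E2", "USAGE"), ("E3", "E4", "COMPARE")]
sentence_entities = [["E1", "E2", "E5"], ["E3", "E4"]]

reflected = reflection_negatives(relations)
sampled = in_sentence_negatives(sentence_entities, relations,
                                portion=0.5, rng=random.Random(0))
print(reflected)  # → [('E2', 'E1', 'NONE')]
print(sampled)
```

Reflection teaches the classifier the direction of a relation, while in-sentence sampling exposes it to unrelated pairs; the two sets can be combined.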
Finally, an attempt is made to filter out excess relations (according to the guidelines, each entity may participate in at most one relation). We employ a greedy method that keeps the most confident relations according to the classifier output weights.
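One way to realize such greedy post-processing is sketched below: predictions are visited in order of decreasing confidence, and a relation is kept only if neither of its entities has been used already. The tuple layout, the function name and the scores are assumptions for illustration, as are the labels.

```python
def greedy_filter(predictions):
    """Keep the most confident relations so that no entity is used more than once.

    predictions: list of (source, target, label, score) tuples.
    """
    kept, used = [], set()
    for src, tgt, label, score in sorted(predictions,
                                         key=lambda p: p[3], reverse=True):
        if src in used or tgt in used:
            continue  # one of the entities already participates in a relation
        kept.append((src, tgt, label, score))
        used.update((src, tgt))
    return kept

# Toy classifier outputs (hypothetical labels and confidence scores).
predictions = [
    ("E2", "E3", "TOPIC", 0.8),
    ("E1", "E2", "USAGE", 0.9),
    ("E3", "E4", "COMPARE", 0.6),
]
print(greedy_filter(predictions))
```

In this toy case the pair (E2, E3) is dropped because E2 is already taken by the more confident (E1, E2), which in turn frees E3 for (E3, E4).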

Evaluation
We took part in both the relation classification and the relation extraction subtasks. All results reported in this section are obtained on the official SemEval-2018 task 7 test data developed by the organizers and released after the evaluation phase. Official scores for the corresponding submissions are specified in Tables 2 and 3 after the slash sign. The difference is explained by minor parameter variations, typically randomness in variable initialization and the number of training epochs.
Relation classification (subtask 1) has two datasets: one with manually annotated entities (subtask 1.1, "clean data") and one with automatically detected entities (subtask 1.2, "noisy data"). We decided to construct a single model, merging both datasets into one in order to increase the number of training examples and to reduce the skew in the number of sample relations across types (Table 1).
To encode tokens, fasttext (skipgram; minimum character n-gram length 1, maximum 5) is used. We build two separate models with different embedding dimensions, 100 and 300, using the English Wikipedia.
Evaluation results are presented in Table 2. The target metric is F1. The first part of a method name specifies whether all sentence tokens or the tokens from the shortest path in the dependency tree are used. The second part specifies the neural network architecture being used. We report results for the following neural network parameter values: attention size A is 400; biLSTM hidden layer size B is 1000; CNN filters F: 200 with height 3, 50 with heights 2 and 4, with the width matching the embedding dimensionality; the size of the fully-connected layer L is 1000 for biLSTM and 900 for CNN. These values were selected during experiments which are beyond the scope of this paper.
For subtask 1.1 we conclude that: context attention tends to be beneficial (the only counterexample is sentence biLSTM with fasttext size 300); larger token embeddings are typically better (the only counterexample is sentence biLSTM); syntactic information is helpful for relation classification with neural networks.
For subtask 1.2 the results are less consistent: smaller embeddings sometimes surpass larger ones; utilizing syntactic information still seems beneficial, but the results are not as convincing as in 1.1; in contrast to subtask 1.1, context attention does not tend to improve the quality of the approach. From our point of view, this behaviour on the subtask 1.2 dataset requires further investigation.
Quality evaluations for subtask 2 solutions are presented in Table 3. Target metrics are the extraction F1-score and the classification macro-averaged F1-score. When the reflection strategy for negative example generation is used, the seven-class approach performs better. When both strategies are used, the two-step approach takes the lead. Post-processing improves quality for both approaches; however, it remains rather low compared with the results of other participants.

Conclusion
In this work we studied how the use of syntactic information influences the quality of relation extraction and classification in scientific papers. According to our experiments, the approach based on the shortest path in the dependency tree yields the best results. The network architecture that delivers the best result depends on the subtask being solved.