SeeDev Binary Event Extraction using SVMs and a Rich Feature Set

This paper describes the system and results of the University of Melbourne team's participation in the SeeDev binary event extraction task of the BioNLP Shared Task 2016. This task addresses the extraction of genetic and molecular mechanisms that regulate plant seed development from the natural language text of the published literature. In our submission, we developed a system¹ using a support vector machine classifier with a linear kernel, powered by a rich set of features. Our system achieved an F1-score of 36.4%.


Introduction
One of the biggest research challenges faced by the agricultural industry is to understand the molecular network underlying the regulation of seed development. Different tissues involving complex genetics and various environmental factors are responsible for the healthy development of a seed. A large body of research literature is available containing this knowledge. The SeeDev binary relation extraction subtask of the BioNLP Shared Task 2016 (Chaix et al., 2016) focuses on extracting relations or events that involve two biological entities as expressed in full-text publication articles. The task represents an important contribution to the broader problem of biomedical relation extraction.
Similar to previous BioNLP shared tasks in 2009 and 2011 (Kim et al., 2009; Kim et al., 2011), this task focuses on molecular information extraction. The task organisers provided paragraphs from manually selected full-text publications on seed development of Arabidopsis thaliana, annotated with mentions of biological entities such as proteins and genes, and binary relations such as Exists In Genotype and Occurs In Genotype. The participants are asked to extract binary relations between entities in a given paragraph.
¹ Source: https://github.com/unimelbbionlp/BioNLPST2016/
Several approaches have been proposed to extract biological events from text (Liu et al., 2013). Broadly, these approaches can be categorized into two main groups, namely rule-based and machine learning (ML) based approaches. Rule-based approaches consist of a set of rules that are manually defined or semi-automatically inferred from the training data (Abacha and Zweigenbaum, 2011). To extract events from text, event triggers are first detected using a dictionary, and the defined rules are then applied over rich representations such as dependency parse trees to extract the event arguments. On the other hand, ML-based approaches (Miwa et al., 2010) use learning algorithms such as classifiers to extract event arguments. Further, they employ various features computed from the textual or syntactic properties of the input text.
This article explains our SeeDev binary relation extraction system in detail. We describe the rich feature set and classifier setup employed by our system, which helped it achieve the second-best F1-score of 36.4% in the shared task.

Approach
The SeeDev task involves the extraction of 22 different binary events over 16 entity types. Entity mentions within a sentence and the events between them are provided in the gold standard annotations. In the rest of the article, we refer to an event with two entity arguments simply as a binary relation.
We treat relation identification as a supervised classification problem and created 22 separate classifiers, denoted $C_1, C_2, \ldots, C_{22}$, specific to relations $r_1, r_2, \ldots, r_{22}$, respectively. This design choice was motivated by two important aspects, namely vocabulary and relation type signature, which we describe below.
Vocabulary: According to the annotation guidelines document, different relations are expressed using different vocabulary. For example, "encode" is in the vocabulary of "Transcribes Or Translates To" and "phosphorylate" is in the vocabulary of "Regulation Of Molecule Activity". We hypothesize that treating the vocabulary as a set of trigger words for its corresponding relation would be beneficial. Therefore, we built a separate classifier for each of the 22 relation types, with vocabulary as a relation-specific feature. Given an entity pair $(e_a, e_b)$, we test it with the classifier $C_i$ to detect whether the relation $r_i$ holds between $e_a$ and $e_b$.
Relation type signature: Relations are associated with entity type argument signatures, which specify the list of allowed entity types for each argument position. For example, the event "Protein Complex Composition" requires the first entity argument to be one of the four entity types {"Protein", "Protein Family", "Protein Complex", "Protein Domain"} and the second argument to be "Protein Complex". Alternatively, relation argument signatures can be used as a filter that specifies the list of invalid relations between an entity pair. We can use this knowledge to prune invalid entity pairs from the training set of classifier $C_i$. Relation type signatures overlap but are not identical. Therefore, the training set of $C_i$ is different from the training set of $C_j$, $j \neq i$.

Training
The steps involved in training the aforementioned classifiers are described below.
1. Extract all pairs of candidates $(e_a, e_b)$ that co-occur within a sentence from the training documents to form a triple $t = (e_a, e_b, \mathit{label})$. If $e_a$ and $e_b$ are known to be related by the type $r_c$ from the relation annotations, we set $\mathit{label} = r_c$. If they are not related, we set $\mathit{label} = \mathrm{NR}$, where NR is a special label denoting no relation.
2. Add the triple $t = (e_a, e_b, \mathit{label})$ to the training set of $C_i$ if $(e_a, e_b)$ satisfies the type signature for relation $r_i$, $i \in [1, 22]$.
We now have classifier-specific training sets, which are sets of triples $t = (e_a, e_b, \mathit{label})$. To train a classifier, we regard each triple as a training example whose class is $\mathit{label}$ and whose feature vector is constructed for the entity pair $(e_a, e_b)$, as explained in Section 2.4.
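The two training steps can be sketched as follows. This is a minimal illustration, not the submitted system: the data layout, the function name `build_training_sets`, the use of ordered entity pairs, and the second signature entry ("Toy Relation") are assumptions made for the example. Only the "Protein Complex Composition" signature is taken from the text.

```python
from collections import defaultdict
from itertools import permutations

NR = "NR"  # special label for "no relation"

# relation -> (allowed first-argument types, allowed second-argument types)
# The first entry follows the text; "Toy Relation" is a made-up placeholder.
SIGNATURES = {
    "Protein Complex Composition": (
        {"Protein", "Protein Family", "Protein Complex", "Protein Domain"},
        {"Protein Complex"},
    ),
    "Toy Relation": ({"Protein"}, {"Protein", "Protein Complex"}),
}


def build_training_sets(sentences, signatures=SIGNATURES):
    """Return {relation: [(e_a, e_b, label), ...]}, one training set per classifier."""
    training_sets = defaultdict(list)
    for sentence in sentences:
        entity_types = sentence["entities"]  # {entity_id: entity_type}
        gold = sentence["relations"]         # {(id_a, id_b): relation}
        # Step 1: every ordered pair of co-occurring entities becomes a triple.
        for e_a, e_b in permutations(entity_types, 2):
            label = gold.get((e_a, e_b), NR)
            # Step 2: add the triple to each classifier whose signature it satisfies.
            for relation, (first_types, second_types) in signatures.items():
                if entity_types[e_a] in first_types and entity_types[e_b] in second_types:
                    training_sets[relation].append((e_a, e_b, label))
    return dict(training_sets)


# Toy sentence with three entity mentions and one annotated relation.
sentences = [{
    "entities": {"T1": "Protein", "T2": "Protein Complex", "T3": "Protein"},
    "relations": {("T1", "T2"): "Protein Complex Composition"},
}]
training_sets = build_training_sets(sentences)
```

In this toy run, the training set for "Toy Relation" also receives the triple labelled "Protein Complex Composition", because the pair satisfies both signatures. This is the overlap that makes each classifier multiclass.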

Testing
During the test phase, we generate candidate entity pairs from sentences in the test documents. We consult the relation argument signatures to identify the list of possible relation types for each entity pair. For each such relation type $r_i$, we test the candidate with the classifier $C_i$. The entity pair $(e_a, e_b)$ is considered to have the relation type $r_i$ if the label predicted by the classifier $C_i$ is $r_i$. A consequence of this approach is that we may predict multiple relation types for a single entity pair in a sentence. This is a limitation of our system, as it is unlikely for a sentence to express multiple relationships between an entity pair.

Classifier details
The classifiers $C_i$, $i \in [1, 22]$, are trained as multiclass classifiers. Note that the training set of each classifier $C_i$ may include examples of the form $(e_a, e_b, \mathit{label})$ with $\mathit{label} = r_j$ and $j \neq i$, because $(e_a, e_b)$ satisfies the type signature for $r_i$. Therefore, at test time a classifier $C_i$ may classify an entity pair $(e_a, e_b)$ as $r_j$, $j \neq i$. However, $r_i$ is the dominant class for the classifier $C_i$, and the other relation types $r_j$ are often underrepresented during its training. Therefore, we discard predictions $r_j$ from $C_i$ when $j \neq i$. For the entity pair $(e_a, e_b)$ to be included in the final set of predicted relations with the type $r_i$, we require that the classifier $C_i$ label it as $r_i$.
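The test-time procedure with the dominant-class filter might look like the sketch below. The classifiers are stand-in callables that return a fixed label, and the names `predict_relations`, `signatures` and `classifiers` are assumptions for illustration. As before, "Toy Relation" is a made-up signature.

```python
def predict_relations(pair, entity_types, signatures, classifiers):
    """Return the relation types predicted for one candidate entity pair."""
    e_a, e_b = pair
    predicted = []
    for relation, (first_types, second_types) in signatures.items():
        # Only consult C_i when the pair satisfies the type signature of r_i.
        if entity_types[e_a] not in first_types or entity_types[e_b] not in second_types:
            continue
        label = classifiers[relation](pair)
        # Dominant-class filter: accept r_i only from C_i; drop r_j (j != i) and NR.
        if label == relation:
            predicted.append(relation)
    return predicted


signatures = {
    "Protein Complex Composition": (
        {"Protein", "Protein Family", "Protein Complex", "Protein Domain"},
        {"Protein Complex"},
    ),
    "Toy Relation": ({"Protein"}, {"Protein", "Protein Complex"}),  # hypothetical
}
# Stub classifiers: both happen to answer "Protein Complex Composition".
classifiers = {
    "Protein Complex Composition": lambda pair: "Protein Complex Composition",
    "Toy Relation": lambda pair: "Protein Complex Composition",
}
entity_types = {"T1": "Protein", "T2": "Protein Complex"}

predicted = predict_relations(("T1", "T2"), entity_types, signatures, classifiers)
reverse = predict_relations(("T2", "T1"), entity_types, signatures, classifiers)
```

Here the answer of the "Toy Relation" classifier is discarded because it names a relation other than its own, so only one relation is returned for the pair. The reversed pair satisfies no signature and yields no prediction.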
We experimented with classifiers from scikit-learn (Pedregosa et al., 2011). For each relation type, we selected either a linear-kernel SVM or a Multinomial Naive Bayes classifier, based on performance over the development data. For the final system, we combined the development dataset with the training dataset and used all of it for training. No parameter tuning was performed.

Feature Engineering
We developed a set of common lexical, syntactic and dependency parse based features. Relation specific features were also developed. For part of speech tagging and dependency parsing of the text, we used the toolset from Stanford CoreNLP (Manning et al., 2014). These features are described in detail below.
1. Stop word removal: For some relations ("Has Sequence Identical To", "Is Functionally Equivalent To", "Regulates Accumulation" and "Regulates Expression") we found that it is beneficial to remove stop words from the sentence.
2. Bag of words: All words in the sentence are included as features, prefixed with "pre", "mid" or "post" based on their location with reference to the entity mentions in the sentence.
3. Part of Speech (POS): Concatenated sequences of POS tags were extracted separately for words before, after and in the middle of the entity mentions in the sentence.
4. Entity features: Entity descriptions and entity types were extracted as features.
5. Dependency path features: We compute the shortest path between the entities in the dependency graph of the sentence and then find the neighboring nodes of the entity mentions along the shortest path. The text (lemma) and POS tags of these neighbors are included as features.
6. Trigger words: For each relation, we designate a few special terms as trigger words and flag their presence as a feature. Trigger words were mainly arrived at by examining the annotation guidelines of the task and a few representative examples.
7. Patterns: A common pattern in text documents is to specify equivalent representations using parentheses. We check whether the two entities are expressed in such a way and include this as a special feature for the relations "Is Functionally Equivalent To" and "Regulates Development Phase".
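Two of the features above, the positional bag of words (item 2) and the dependency path neighbors (item 5), can be sketched as below. This is an illustration under several assumptions: entity mentions are given as token spans, tokens inside a mention are skipped, the dependency graph is treated as undirected, and neighbors are only emitted when the path has at least one intermediate node. The sentence, lemmas, POS tags and edges are hand-written stand-ins for parser output, and all function and feature names are hypothetical.

```python
from collections import deque


def positional_bag_of_words(tokens, span_a, span_b):
    """Prefix each word with pre/mid/post relative to two entity spans.

    Spans are (start, end) token offsets, end exclusive, with span_a first.
    """
    features = set()
    for index, token in enumerate(tokens):
        if index < span_a[0]:
            features.add("pre_" + token.lower())
        elif span_a[1] <= index < span_b[0]:
            features.add("mid_" + token.lower())
        elif index >= span_b[1]:
            features.add("post_" + token.lower())
    return features


def shortest_path(edges, source, target):
    """Breadth-first search over (head, dependent) edges, ignoring direction."""
    neighbours = {}
    for head, dependent in edges:
        neighbours.setdefault(head, []).append(dependent)
        neighbours.setdefault(dependent, []).append(head)
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbours.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return []


def path_neighbour_features(path, lemmas, pos_tags):
    """Lemma and POS of the node next to each entity along the path."""
    features = set()
    if len(path) >= 3:
        for name, index in (("a", path[1]), ("b", path[-2])):
            features.add(f"dep_{name}_lemma={lemmas[index]}")
            features.add(f"dep_{name}_pos={pos_tags[index]}")
    return features


# Toy sentence; entity a is token 0 and entity b is token 2.
tokens = ["GeneA", "encodes", "ProteinB", "in", "seeds"]
lemmas = ["GeneA", "encode", "ProteinB", "in", "seed"]
pos_tags = ["NN", "VBZ", "NN", "IN", "NNS"]
edges = [(1, 0), (1, 2), (1, 4), (4, 3)]

bow = positional_bag_of_words(tokens, (0, 1), (2, 3))
path = shortest_path(edges, 0, 2)
dep = path_neighbour_features(path, lemmas, pos_tags)
```

In the toy sentence the verb "encodes" sits between the two mentions, so it appears both as a "mid" word and as the path neighbor of each entity.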

Evaluation
The SeeDev-Binary task objective is to extract all related entity pairs at the document level. The metrics are the standard Precision (P), Recall (R) and F1-score, $F_1 = \frac{2PR}{P + R}$.
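For concreteness, the three metrics can be computed over sets of predicted and gold relations as in the sketch below. The tuple layout (document, first entity, second entity, relation type) and the toy values are assumptions for illustration, not the official evaluation script.

```python
def precision_recall_f1(predicted, gold):
    """Compute P, R and F1 over two collections of relation tuples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data: four gold relations, two predictions, one of them correct.
gold = {
    ("doc1", "T1", "T2", "Exists In Genotype"),
    ("doc1", "T3", "T4", "Occurs In Genotype"),
    ("doc1", "T5", "T6", "Exists In Genotype"),
    ("doc2", "T1", "T2", "Occurs In Genotype"),
}
predicted = {
    ("doc1", "T1", "T2", "Exists In Genotype"),
    ("doc2", "T7", "T8", "Exists In Genotype"),
}
precision, recall, f1 = precision_recall_f1(predicted, gold)
```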

Dataset
The SeeDev-Binary task (Chaix et al., 2016) provides a corpus of 20 full articles on seed development of Arabidopsis thaliana that have been manually selected by domain experts. This corpus consists of a total of 7,082 entities and 3,575 binary relations and is partitioned into training, development and test datasets. Gold standard entity and relation annotations are provided for the training and development data; for the test data, only entity annotations have been released. The given set of 16 entity types is categorized into 7 different entity groups, and 22 different relation types are defined. Pre-defined event signatures constrain the types of entity arguments for each relation.

Results
In the development mode, we used the training dataset for training the relation-specific classifiers and predicted the relations over the development dataset. Finally, we trained our classifiers with the full training and development data together. With this system, the predicted relations over the test dataset were submitted to the task. Performance results over the test dataset were made available by the task organizers at the conclusion of the event. These results are detailed in Table 3.2.

Discussion
We note that the final relation extraction performance is quite low (an F1-score of 36.4%), suggesting that SeeDev-Binary event extraction is a challenging problem. Further, for many event types our system was unable to identify any relation mentions. It is not clear why our methods are not effective for these relation types, but it is likely that scarcity of training data is the problem. We observed that our system performed poorly on relation types that have fewer than 100 training samples and generally succeeded on the rest. It is likely that for these sparsely represented relation types, alternate techniques such as rule-based methods might be more successful.
We attempted a few alternate techniques and describe the findings from these approaches below.

Alternate approaches
1. Two-stage approach: We attempted building a first-stage general filter that identifies entity pairs as "related" or "not related". For this, we grouped all candidate pairs with any of the 22 given relation types into the "positive" class and the rest into the "negative" class in an SVM classifier. In the second stage, we built a multiclass classifier to refine the label of an entity pair from "related" to one of the 22 relation types. We observed poor performance for the first-stage filter and a drop in overall performance.
2. Binary classifiers: We attempted training the classifiers $C_i$, $i \in [1, 22]$, as binary classifiers by modifying the triples $(e_a, e_b, r_j)$ to $(e_a, e_b, +)$ if $j = i$ and $(e_a, e_b, -)$ if $j \neq i$. At test time, positive predictions from $C_i$ were inferred as relations $r_i$. We observed that this approach of combining many subclasses into one negative class reduced precision and hence overall performance.
3. Co-occurrence: A simple approach to relation extraction is to consider all entity pairs that occur within a sentence as related. We tried using this co-occurrence strategy for relation types for which SVM or Naive Bayes classifiers did not work effectively. We abandoned this strategy as we observed that the overall F1 score reduced over the development dataset, even as the recall at the relation level improved.
4. Kernel methods: We experimented with the shortest dependency path kernel (Bunescu and Mooney, 2005) and the subset tree kernels (Moschitti, 2006) for classification with SVMs. However, their performance was quite low (F1 score < 0.20). It is likely that small training set sizes and multiple entity pairs in most sentences affect the performance of these kernel methods.
5. Dominant class types: In our system we adopted the strategy of only accepting predictions of the dominant class type from each classifier. That is, we filter out predictions of type $r_j$ from classifier $C_i$ when $j \neq i$. This strategy proved very effective when tested over the development dataset. Without this filtering step, we found that our system gets a high recall as expected (0.896) but also too many false positives, resulting in low precision (0.027) and F-score (0.053).
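The relabelling used for the binary classifier variant (item 2) can be sketched as follows. The function name and toy triples are hypothetical, and mapping the NR label to the negative class is an assumption, since the text only describes relabelling triples that carry a relation type.

```python
def to_binary(triples, relation):
    """Relabel multiclass triples for classifier C_i: '+' for r_i, '-' otherwise."""
    return [
        (e_a, e_b, "+" if label == relation else "-")
        for e_a, e_b, label in triples
    ]


# Toy training set for the "Exists In Genotype" classifier.
triples = [
    ("T1", "T2", "Exists In Genotype"),
    ("T1", "T3", "Occurs In Genotype"),
    ("T2", "T3", "NR"),
]
binary_triples = to_binary(triples, "Exists In Genotype")
```

After relabelling, the "Occurs In Genotype" example and the NR example become indistinguishable to the classifier, which is the merging of subclasses that item 2 identifies as the cause of the drop in precision.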

Error analysis
In Table 4.2 we show the confusion matrix for 16 classifiers of our system, when evaluated over the development dataset. The remaining 6 classifiers were left out as they made no predictions; they are discussed separately in Section 4.2.1. Each entry $CM[i, j]$ of the confusion matrix is the number of test examples whose true type is $i$ and whose predicted label is $j$. From the confusion matrix we see that the primary source of errors is predicting a relation where there is none, or vice versa. Amongst the related entity pairs, the classifier for "Has Sequence Identical To" makes the most errors when the input examples are of type "Is Functionally Equivalent To". Adding more discriminative features or keywords to distinguish between these two classes is likely to improve performance. Better handling of unrelated entity pairs is likely to be achieved with more syntactic or dependency-parse-related features that specifically target the entity mentions in the sentence.
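The definition of $CM[i, j]$ corresponds to a simple count over (true, predicted) label pairs, as in this sketch. The label sequences are toy values chosen for illustration and do not reproduce Table 4.2.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Map (true type i, predicted label j) to the number of such examples."""
    return Counter(zip(true_labels, predicted_labels))


# Toy labels for four candidate entity pairs.
true_labels = [
    "Has Sequence Identical To",
    "Is Functionally Equivalent To",
    "NR",
    "Has Sequence Identical To",
]
predicted_labels = [
    "Has Sequence Identical To",
    "Has Sequence Identical To",
    "NR",
    "NR",
]
cm = confusion_matrix(true_labels, predicted_labels)
```

Off-diagonal entries involving NR correspond to the dominant error type described above: a relation predicted where there is none, or a relation that was missed.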

Unsuccessful classifiers
In Table 3.2, the F-score for some of the relation types has been recorded as not available ("NA"), as our classifiers failed to predict any relations.
Studying the confusion matrix at the classifier level confirms that the classifier did not have enough evidence to detect a relation in many cases. Also, for most of these unsuccessful relation types we observed that the primary class type is underrepresented in the training set. For example, the training set for the "Exists At Stage" classifier has three times more examples of type "Regulates Development Phase" than examples of type "Exists At Stage". Better ways of handling class imbalance may improve performance.

Conclusion
SeeDev-Binary event extraction was shown to be an important but challenging problem in the BioNLP Shared Task 2016. This task is also unusual as it calls for the extraction of multiple relation types amongst multiple entity types, often co-occurring in a single sentence. In this paper, we describe our system, which was ranked second with an F1 score of 0.364 in the official results of the task. Our solution was based on a series of supervised classifiers and a rich feature set that contributes to effective relation extraction.