LitWay, Discriminative Extraction for Different Bio-Events

Even a simple biological phenomenon may involve a complex network of molecular interactions. Scientific literature is one of the trusted resources delivering knowledge of these networks. We propose LitWay, a system for extracting semantic relations from text. LitWay uses a hybrid method that combines a rule-based method with a machine learning-based method. Tested on the SeeDev task of BioNLP-ST 2016, it achieves state-of-the-art performance with an F-score of 43.2%, ranking first among all participating teams. To further reveal the linguistic characteristics of each event type, we test the system with syntactic rules alone, with machine learning alone, and with different combinations of the two methods. We find that it is difficult for a single method to perform well on all semantic relation types because of the complexity of bio-events in the literature.


Introduction
Bio-events are the building blocks of bio-networks that depict profound biological phenomena. Automatically extracting bio-events may assist researchers facing the challenge of a growing amount of biomedical information in textual form. A bio-event carries more semantic information about biochemical reactions between entities and is therefore more informative for studying associations between bio-concepts, e.g. gene and phenotype (Li et al., 2013).
A number of methods have been proposed for the automated extraction of biomedical events, including rule-based (Cohen et al., 2009; Kilicoglu and Bergler, 2011; Bui and Sloot, 2011) and machine learning-based (Miwa et al., 2012; Hakala et al., 2013; Munkhdalai et al., 2015) methods. Bui et al. (2013) presented a rule-based method for bio-event extraction that uses a dictionary and patterns generated automatically from annotated events. TEES (Björne and Salakoski, 2013) is an SVM-based text mining system for extracting events and relations from natural language text; it obtains good performance on several tasks in BioNLP-ST 2013 (Nédellec et al., 2013). A series of methods concentrate on protein-protein interactions (PPI), a major type of biomedical event (Papanikolaou et al., 2015). Kernel-based methods are widely used for relation extraction and obtain good results by leveraging lexical and syntactic information (Airola et al., 2008; Miwa et al., 2009; Li et al., 2015b). Peng et al. (2015) proposed the Extended Dependency Graph (EDG), evaluated it with two kernels on several PPI datasets, and obtained good improvements in F-value.
We previously used a set of basic features, including word embedding, with a classifier on the BioNLP 2013 Genia dataset; the result was comparable to the state-of-the-art solution (Li et al., 2015a). The system was built with flexibility in mind and designed to tackle more types of bio-events. In this paper, we introduce LitWay, which is based on the previous infrastructure and uses a machine learning-based method in combination with syntactic rules. The system is tested on a completely different task, SeeDev of BioNLP-ST 2016. It achieves the best result among all participants with an F-score of 43.2% (recall and precision are 44.8% and 41.7%, respectively).

SeeDev Task
As a popular task in unstructured data mining of biomedical interest, BioNLP has successfully held a series of biomedical event extraction tasks. GE (Genia Event Extraction) is a classic task that dates from the beginning of BioNLP (Kim et al., 2009); it has attracted attention and led to abundant work (Kim et al., 2011). Similar to GE and the other BioNLP tasks, SeeDev (Chaix et al., 2016) is a new task proposed in BioNLP-ST 2016. It is dedicated to the extraction of events describing the genetic and molecular mechanisms involved in plant seed development. It is based on the knowledge model Gene Regulation Network for Arabidopsis (GRNA). The GRNA model defines 16 different types of entities and 22 event types that may be combined into complex events. Table 1 shows these entities. Event types are presented in the following. Figure 1 gives some examples of event relations. An entity may participate in several events at the same time or in none, such as AtEm6 promoter and lacZ. Notably, entity spans may overlap, as with Gene AtEm6 and Promoter AtEm6 promoter.

Proposed Method
LitWay's pipeline adopts a hybrid method that uses either a classifier or a rule-based method depending on the event type. Figure 2 shows its infrastructure. The pipeline consists of 5 steps: pre-processing, entity pair selection, feature extraction, classifier prediction and rule-based filters.
In BioNLP-ST 2013, the F-scores of the top three event extraction systems differ by less than 0.3% (Nédellec et al., 2013; Björne and Salakoski, 2013). Quantitative and syntactic differences among proteins and chemical entities in the scientific literature may demand different network extraction strategies to achieve better performance. In this paper, we use a flexible hybrid system to investigate a way of treating event types discriminatively. We first run pre-experiments on the development data and divide all event types into two sets: Event-Set-A and Event-Set-B. After pre-processing the raw text data, candidate entity pairs are constructed within each sentence and tested by a multiclass classifier. If the classifier predicts that a candidate pair is an event belonging to Event-Set-A, the prediction stays. Otherwise, a series of rules is used to decide on a type in Event-Set-B.
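The dispatch between the classifier and the rules described above can be sketched as follows. This is a minimal illustration, not the LitWay implementation: the function name `hybrid_label`, the callable `classify`, and the representation of the rule filters as an ordered list of `(event_type, predicate)` tuples are all assumptions made for the example.

```python
from typing import Callable, Hashable, Iterable, List, Optional, Tuple

Pair = Tuple[Hashable, Hashable]


def hybrid_label(
    pair: Pair,
    classify: Callable[[Pair], str],
    event_set_a: Iterable[str],
    rule_filters: List[Tuple[str, Callable[[Pair], bool]]],
) -> Optional[str]:
    """Label a candidate entity pair with the hybrid strategy.

    classify     : hypothetical multiclass classifier returning a label.
    event_set_a  : event types whose classifier prediction is kept.
    rule_filters : ordered (event_type, predicate) list for Event-Set-B.
    """
    predicted = classify(pair)
    if predicted in set(event_set_a):
        # The classifier prediction stays for Event-Set-A types.
        return predicted
    # Any other prediction is handed to the Event-Set-B rules.
    for event_type, rule in rule_filters:
        if rule(pair):
            return event_type
    return None  # no event for this pair
```

Passing the partition and the rules as arguments keeps the sketch independent of which event types end up in each set, which the paper decides by pre-experiments on the development data.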

Pre-processing
Pre-processing includes tokenization, sentence splitting, part-of-speech (POS) tagging, lemmatization and syntactic parsing. The Stanford CoreNLP tool (Manning et al., 2014) is used for these operations.

Entity Pair Selection
The system aims to perform semantic relation extraction as expected by the SeeDev task. In the task, each event has two arguments. We take two entities at a time as a candidate pair and predict their relation type. Table 3 presents sentence distance statistics of events on the training set; nearly 96.5% of events lie within one sentence.
Since most events occur within a sentence, we only choose entity pairs in the same sentence.
Except for three event types, Is Linked To, Has Sequence Identical To, and Is Functionally Equivalent To, whose two arguments can be reversed, event arguments are ordered. Therefore an entity pair (Entity1, Entity2) is different from the reversed pair (Entity2, Entity1), and the two are treated as separate instances.
Sentence distance    Number
0                    1571
1                    52
2                    5
Table 3: Sentence distance statistics of events on the training set.
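The pair selection above can be sketched in a few lines. The sketch assumes entities are already grouped by sentence; the names `candidate_pairs` and `same_sentence_share` are illustrative and do not come from the paper. The second function simply recomputes the share of same-sentence events from the counts in Table 3.

```python
from itertools import permutations
from typing import Dict, Hashable, List, Tuple


def candidate_pairs(
    entities_by_sentence: Dict[int, List[Hashable]],
) -> List[Tuple[int, Hashable, Hashable]]:
    """Build ordered candidate pairs from entities in the same sentence.

    Both (Entity1, Entity2) and (Entity2, Entity1) are produced, because
    most event types have ordered arguments.
    """
    pairs = []
    for sentence_id, entities in entities_by_sentence.items():
        for first, second in permutations(entities, 2):
            pairs.append((sentence_id, first, second))
    return pairs


def same_sentence_share(counts_by_distance: Dict[int, int]) -> float:
    """Fraction of events whose arguments lie in one sentence."""
    return counts_by_distance[0] / sum(counts_by_distance.values())


# Counts taken from Table 3 (training set).
share = same_sentence_share({0: 1571, 1: 52, 2: 5})
```

With the Table 3 counts, `share` is 1571 / 1628, which rounds to the 96.5% quoted in the text.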

Feature Extraction
The extracted features are summarized in a table. Word, lemma and part-of-speech (POS) are features that directly represent an entity's lexical and grammatical characteristics. The features of adjacent words are used to represent the entity's contextual characteristics. Therefore, basic features include the word, lemma and POS of entities, as well as the same information for the adjacent unigram words.
Generally speaking, the closer two entities are, the more likely they are to be related (Tikk et al., 2013). Token distance and entity distance are used here. Token distance is the number of tokens between two entities. Entity distance is the number of entities located between two entities.
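The two distance features can be illustrated with a small sketch. It assumes each entity is a `(start, end)` token span with an exclusive end index, and that an entity counts as lying between two others only when its whole span falls in the gap; both are assumptions of this example, not details given in the paper.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def token_distance(a: Span, b: Span) -> int:
    """Number of tokens strictly between two entity spans."""
    left, right = sorted([a, b])
    return max(0, right[0] - left[1])  # 0 for adjacent or overlapping spans


def entity_distance(a: Span, b: Span, entities: Iterable[Span]) -> int:
    """Number of other entities lying entirely between two entity spans."""
    left, right = sorted([a, b])
    gap_start, gap_end = left[1], right[0]
    return sum(
        1
        for e in entities
        if e not in (a, b) and e[0] >= gap_start and e[1] <= gap_end
    )
```

For example, with spans `(0, 2)` and `(5, 6)` and a third entity at `(3, 4)`, the token distance is 3 and the entity distance is 1.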
Syntactic parse tree features are important for semantic relations (Punyakanok et al., 2008; D'Souza and Ng, 2012). Tree node depth, tree path and tree path length are used in our experiments. They are obtained from the syntactic parse tree generated during pre-processing. Tree node depth is the distance between the tree node corresponding to an entity and the root node of the sentence. Tree path is the path between two entities. Tree path length is the number of intermediate nodes between two entities on their tree path.
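The three tree features can be computed from parent links alone, as the following sketch shows. It assumes the parse tree is given as a `parent` dictionary with unique node identifiers and that the path runs through the lowest common ancestor; the function names are illustrative.

```python
from typing import Dict, List


def ancestors(node: str, parent: Dict[str, str]) -> List[str]:
    """Chain from a node up to the root, including the node itself."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def node_depth(node: str, parent: Dict[str, str]) -> int:
    """Distance between a node and the root of the sentence."""
    return len(ancestors(node, parent)) - 1


def tree_path(a: str, b: str, parent: Dict[str, str]) -> List[str]:
    """Path from a to b through their lowest common ancestor."""
    up, down = ancestors(a, parent), ancestors(b, parent)
    down_set = set(down)
    lca = next(n for n in up if n in down_set)
    return up[: up.index(lca) + 1] + down[: down.index(lca)][::-1]


def tree_path_length(a: str, b: str, parent: Dict[str, str]) -> int:
    """Number of intermediate nodes on the path between a and b."""
    return max(0, len(tree_path(a, b, parent)) - 2)


# Toy tree: S -> NP1 VP ; VP -> V NP2
toy_parent = {"NP1": "S", "VP": "S", "V": "VP", "NP2": "VP"}
```

On the toy tree, the path from `NP1` to `NP2` is `NP1, S, VP, NP2`, so the path length is 2 and the depth of `NP2` is 2.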
Word embedding has been shown to represent the linguistic and semantic information of a text unit well, e.g. POS and N-gram (Mikolov et al., 2013; Tang et al., 2014). We continue to use it as a feature in our system. Specifically, the training, development and test datasets of SeeDev are used to obtain word embeddings with the word2vec tool (Mikolov et al., 2013) after sentence splitting, tokenization and lemmatization of the original text. Since the number of words in an entity varies, we use the average of all the word embeddings of an entity (Wang et al., 2015), i.e. the average word embedding. Middle lemmas include all of the lemmas between two entities. They are treated as a bag-of-words (BOW) feature, from which some keyword information may be obtained.
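The average word embedding of a multi-word entity can be sketched as below. The lookup table `vectors` stands in for a trained word2vec model, and returning a zero vector when no word of the entity is in the vocabulary is an assumption of this example; the paper does not say how such cases are handled.

```python
from typing import Dict, List, Sequence


def average_embedding(
    words: Sequence[str],
    vectors: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Average the embeddings of the words that make up an entity.

    Words missing from `vectors` are skipped; an entity with no known
    word gets a zero vector (an assumption of this sketch).
    """
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


# Toy 2-dimensional vectors, for illustration only.
toy_vectors = {"AtEm6": [1.0, 3.0], "promoter": [3.0, 5.0]}
entity_vector = average_embedding(["AtEm6", "promoter"], toy_vectors, dim=2)
```

Averaging gives every entity a vector of the same dimension regardless of how many words it contains, which is what a fixed-length feature vector requires.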

SVM Classifier Prediction
A Support Vector Machine (SVM) (Cortes and Vapnik, 1995), in its C++ implementation LibSVM (Chang and Lin, 2011), is employed for classification in LitWay. Positive event instances are retrieved from the gold annotations. Negative instances are created from all entity pairs within each sentence that have no relation.
During prediction, if the predicted result for an entity pair belongs to Event-Set-A, it is taken as the label. Otherwise, rule-based filters are applied.
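The construction of training instances can be sketched as follows. The gold annotations are assumed to be a dictionary from ordered entity pairs to event types, and `NO_RELATION` is a hypothetical label name for negative instances; neither is specified in the paper.

```python
from itertools import permutations
from typing import Dict, Hashable, List, Sequence, Tuple

NO_RELATION = "NO_RELATION"  # hypothetical label for negative instances

Pair = Tuple[Hashable, Hashable]


def build_instances(
    sentence_entities: Sequence[Hashable],
    gold: Dict[Pair, str],
) -> List[Tuple[Pair, str]]:
    """Label every ordered entity pair of one sentence.

    Pairs found in the gold annotations become positive instances;
    all remaining pairs in the sentence become negative instances.
    """
    return [
        (pair, gold.get(pair, NO_RELATION))
        for pair in permutations(sentence_entities, 2)
    ]
```

Because every unannotated pair becomes a negative instance, negatives usually outnumber positives in a sentence with several entities.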

Rule-based Filters
In Event-Set-B, different event types have different rules. We summarize all rules in Table 5. We consider the event types of Event-Set-B one by one, in order of their frequency on the training set, as shown in Table 2. Once all rules of an event type are satisfied, the label of the entity pair is determined and matching against the remaining event types stops.
There are 6 types of rules: Event arguments match, Entity structure rules, Sentence structure rules, Token distance restriction, Keywords match and Training set match. The details of these rules are as follows:
(1) Event arguments match: According to the task description, the arguments of an event are strongly typed, which means that not every entity type is possible as an event argument. Moreover, based on the statistics of the arguments of different events on the training set, we retain only those argument types that occur most often for each event type. This efficiently reduces false instances.
(2) Entity structure rules: Many entities have a complicated structure, and an entity may span over another entity. As a result, some entity structures are less likely to be event arguments. For example, an entity with a smaller span is often not an argument, because it is usually the modifier of the larger one. Meanwhile, some event types have several fixed special argument structures. We summarize 3 particular rules from the training set:
• (2a) Entity is not covered: An entity is not covered by a larger one.
• (2b) Entity does not cover: An entity does not contain smaller entities or overlap with others.
• (2c) Special entity structure: Some special entity structure rules are summarized from the dataset. They are defined for an entity pair (Entity1, Entity2) within a sentence together with another entity, Entity, in the same sentence.
(3) Sentence structure rules: If two entities form an event relation, the sentence structure shows some syntactic characteristics. We summarize 3 sentence structure rules:
• (3a) No subordinate clause: A subordinate clause is a complex sentence structure. If there is an event relation between a pair of entities, the syntactic tree path between them is often simple and direct.
• (3b) Active or passive structure match: An event argument pair (Entity1, Entity2) should have the relation structure Entity1-influences-Entity2. An entity pair can appear in two orders in a sentence: Entity1 is either to the left or to the right of Entity2. Different orders should match different sentence structure rules. If Entity1 is to the left of Entity2, their tree path should be an active structure. Otherwise it should be a passive structure.
• (3c) Special entity pair order: Some events usually have a fixed order between their two arguments, so that Entity1 is always to the left (or right) of Entity2.
(4) Token distance restriction: Closer entities are more likely to be related. This rule restricts the number of tokens between entities, so distant entity pairs are ignored.
(5) Keywords match: Some events are accompanied by keywords. We record these keywords for several event types, as shown in the detailed rules. They are useful for event identification.
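A few of these rules and the stop-at-first-match cascade can be sketched as follows. The example covers only rules (1), (2a) and (4), and it is an illustration under assumptions: the `Entity` class, the event type name `EVENT_X`, its allowed argument types and the distance threshold are all hypothetical, not values from Table 5.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Set, Tuple


@dataclass(frozen=True)
class Entity:
    type: str
    start: int  # first token index
    end: int    # one past the last token index


Pair = Tuple[Entity, Entity]
Rule = Callable[[Pair], bool]


def arguments_match(pair: Pair, allowed: Set[Tuple[str, str]]) -> bool:
    """Rule (1): the ordered argument types are allowed for the event."""
    return (pair[0].type, pair[1].type) in allowed


def is_not_covered(entity: Entity, others: Sequence[Entity]) -> bool:
    """Rule (2a): no other entity with a different span contains this one."""
    return not any(
        o.start <= entity.start and entity.end <= o.end
        and (o.start, o.end) != (entity.start, entity.end)
        for o in others
    )


def within_token_distance(pair: Pair, max_tokens: int) -> bool:
    """Rule (4): at most max_tokens tokens lie between the two entities."""
    left, right = sorted(pair, key=lambda e: (e.start, e.end))
    return max(0, right.start - left.end) <= max_tokens


def first_matching_type(
    pair: Pair, rules_by_type: List[Tuple[str, List[Rule]]]
) -> Optional[str]:
    """Try event types in order; stop at the first whose rules all hold."""
    for event_type, rules in rules_by_type:
        if all(rule(pair) for rule in rules):
            return event_type
    return None


# Overlapping spans as in the text: Gene AtEm6 inside Promoter AtEm6 promoter.
gene = Entity("Gene", 0, 1)
promoter = Entity("Promoter", 0, 2)
lacz = Entity("Gene", 4, 5)
sentence = [gene, promoter, lacz]

rules_by_type = [
    ("EVENT_X", [  # hypothetical event type and thresholds
        lambda p: arguments_match(p, {("Promoter", "Gene")}),
        lambda p: is_not_covered(p[0], sentence),
        lambda p: within_token_distance(p, 10),
    ]),
]
```

In this setup the pair `(promoter, lacz)` satisfies all three rules, while `(gene, lacz)` is rejected by the argument type rule. Ordering `rules_by_type` by training set frequency reproduces the matching order described in the text.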

Results
To investigate the impact of different strategies and their comparison with the hybrid method, we test the system solely with machine learning, syntactic rules, or different combinations of them.
We compare the proposed hybrid method with the classifier-only method on the development dataset. Table 6 shows the experimental results. All of the features are beneficial for the classifier; using all of them gives the best SVM-based result, with 31.5% F1. Tree features bring the largest improvement, a 5.7% increase in F1, with both recall and precision increased. Dist features bring only a 0.2% F1 improvement and WM features a 1.2% F1 improvement. They increase precision at the cost of recall, while Tree features mainly contribute to recall.
Comparing the hybrid method with the best SVM result in Table 6, we see an obvious advantage. The F1 of the hybrid method is over 10% higher than the best SVM result; recall improves greatly, by around 16%, and precision increases by 3.4%. This is interesting because adding rules usually increases precision rather than recall.
To verify the effect of the rule-based method on different event types, we take the best SVM result as a basis and then replace the SVM with the rule-based method for each event type in turn. Event-Set-B uses the specific rules introduced above. Event-Set-A uses some frequent rules from Event-Set-B, since it is difficult to create precise rules for minority classes. They include: (1) Event arguments match; (2) Entity1 is not covered, Entity2 is not covered; (3) No subordinate clause, active or passive structure match. Table 7 presents the results. Except for Composes Primary Structure and Composes Protein Complex, the F1 of Event-Set-B events increases when rules are used instead of the SVM. Rules are not helpful for Event-Set-A, which supports the partition into two sets.
Since Composes Primary Structure and Composes Protein Complex have better results with the SVM, we moved them into Event-Set-A after the competition and indeed obtained a slightly better overall result, as shown in the following. Table 8 presents the details of the SVM method and the hybrid method. Almost all the events of Event-Set-B have better results with the hybrid method, which demonstrates its effectiveness.
To investigate the rules used in the proposed method, we run several experiments on the development data with different rule combinations. Table 9 presents their results. All of these rules benefit the system to some degree. Event arguments match and entity structure rules have an important influence on performance: removing them results in F1 decreases of around 10% and 8%, respectively. This is understandable because almost all event types in Event-Set-B use these two rules, which makes them important to the system, especially for precision. Sentence structure rules and keywords match are also useful; using them gives an F1 improvement of around 3% to 3.5%. They improve performance by increasing the precision of the system at the cost of recall. Token distance restriction and training set match influence F1 by only 0.1% to 0.3%, as they are used in just one or two event types. Token distance restriction improves precision, while training set match improves recall.
Table 10 shows the official result of the SeeDev task (Chaix et al., 2016). LitWay achieves the best result among all participating teams with 43.2% F1, showing a significant advantage. The recall of LitWay is 44.8%, which is comparable to the highest recall of 45.8%. Its precision of 41.7% is the second highest value, lower only than 53.3%.
We present two additional experiments, carried out after the competition, in which Composes Primary Structure and Composes Protein Complex are moved into Event-Set-A. Table 11 shows the results. The result on the development data improves by 0.8% in F1, but there is no benefit on the test data.
We analyse the results on the development dataset before and after this move. Before the move, Composes Primary Structure has 10 True Positive (TP) instances among 69 predicted instances (the gold number is 15), and Composes Protein Complex has 0 TP instances among 8 predicted instances (the gold number is 0). After the move, both predicted numbers are 0, i.e. we do not make any predictions for the two event types. In this case, 10 correct events are lost, but 67 false events are successfully removed. The move therefore brings more benefit than harm.
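The figures quoted in this section can be checked with a short calculation. The sketch below uses only numbers stated in the text, namely the official precision and recall and the per-type prediction counts on the development data; the function and variable names are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Official test result: precision 41.7%, recall 44.8%.
official_f1 = f1_score(41.7, 44.8)

# Development data before the move: predicted and true positive counts.
predicted = {"Composes Primary Structure": 69, "Composes Protein Complex": 8}
true_positive = {"Composes Primary Structure": 10, "Composes Protein Complex": 0}

# Dropping all predictions of both types loses every true positive
# and removes every false positive.
lost_correct = sum(true_positive.values())
removed_false = sum(predicted[t] - true_positive[t] for t in predicted)
```

Here `official_f1` rounds to 43.2, `lost_correct` is 10 and `removed_false` is 59 + 8 = 67, matching the numbers reported above.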

Conclusion
This paper presents LitWay, a hybrid system for extracting biomedical semantic relations. It achieves the best result in the BioNLP-ST 2016 SeeDev task. It is built to be flexible, with the awareness that different bio-events have different linguistic characteristics and are difficult to tackle with a single method.
Without much feature engineering or a complex algorithm, LitWay obtains state-of-the-art performance on the official test data, with the highest F-score of 43.2%. A series of experiments using the methods and their combinations is carried out to investigate the different linguistic characteristics of different event types.
Here we extract relations within one sentence, while a number of events still span across sentences. By incorporating coreference techniques in the future, we expect to be able to interconnect
Table 9: Hybrid experiment results with different rules on the development dataset. Methods (2) to (7) each remove one rule type from method (1). Method (2) only follows the event argument match rules given by the SeeDev task (http://2016.bionlp-st.org/tasks/seedev/seedev-data-representation.) and does not filter event arguments that never or rarely occur.