Overview of the Regulatory Network of Plant Seed Development (SeeDev) Task at the BioNLP Shared Task 2016.

This paper presents the SeeDev Task of the BioNLP Shared Task 2016. The purpose of the SeeDev Task is the extraction from scientific articles of the descriptions of genetic and molecular mechanisms involved in seed development of the model plant, Arabidopsis thaliana. The SeeDev task consists in the extraction of many different event types that involve a wide range of entity types so that they accurately reflect the complexity of the biological mechanisms. The corpus is composed of paragraphs selected from the full-texts of relevant scientific articles. In this paper, we describe the organization of the SeeDev task, the corpus characteristics, and the metrics used for the evaluation of participant systems. We analyze and discuss the final results of the seven participant systems on the test set. The best F-score is 0.432, which is similar to the scores achieved in similar tasks on molecular biology.


Introduction
Since its first edition in 2009, the BioNLP Shared Task (BioNLP-ST) has organized information extraction (IE) tasks from scientific literature with a focus on molecular mechanisms, with the aim of promoting advances in IE research in the biomedical domain. The SeeDev task is the first task on event extraction about molecular biology of plants. It gives an opportunity for the BioNLP community to evaluate the reusability of methods, to characterize the peculiarities of IE for the plant biology domain and to develop dedicated approaches. For this purpose, we manually annotated a new corpus of scientific papers selected for their relevance to the topic. Participants are asked to extract text-bound events that involve biological entities provided as input. The performance of the systems is evaluated by standard measures through the comparison of their predictions to the reference annotations.

Context
Seeds are the main vectors for breeding and production of annual field crops. The accumulation of seed storage compounds (e.g. sugars, lipids, proteins) is of primary importance for food, feed and industrial uses. Seed development requires the coordinated growth of different tissues that involves complex genetics and environmental regulations (Alberts et al., 2002). A comprehensive understanding of the molecular networks that underlie the regulation of seed development remains a major scientific challenge with important potential impact on fundamental research, agriculture and industry.
The SeeDev task of BioNLP Shared Task 2016 focuses on the accumulation of reserves in the seed of the model plant, Arabidopsis thaliana (Ath), for which research on regulatory networks is the subject of a large and active international community (Santos-Mendoza et al., 2008). Most of this knowledge is spread in thousands of articles. As such, this topic constitutes an excellent primer for the development of event extraction methods. The SeeDev corpus should then be largely reusable for the study of other plants and other development phases.
Information Extraction research applied to biology mainly consists in automatic entity extraction, entity normalization and event extraction (Ananiadou et al., 2014). The extraction of regulatory networks has become one of the most popular tasks in shared tasks in recent years. The increasing complexity of the event scheme over the years is driven by the significant scientific advances in IE and the increasing need for computational models in bioinformatics and systems biology. In 2005, the objective of the Learning Language in Logic challenge (LLL'05) was the extraction of interactions between proteins and genes with the goal of reconstructing bacterial regulatory networks (Nédellec, 2005). The diversity of the biological events (molecular, physiological) and entities (genes, proteins, families, sites, environmental factors and phenotypes) has continuously increased over time together with the variety of the biological mechanisms studied. These mechanisms range from detailed networks, as in the Bacteria Interaction (Bossy et al., 2012) and Gene Regulation Network (Bossy et al., 2015) tasks, and signaling pathways, as in the GENIA task (Kim et al., 2013a), to metabolism and diseases, as in the Pathway Curation (PC) and Cancer Genetics (CG) tasks (Pyysalo et al., 2015). Their extraction from text makes increasing use of existing standards, nomenclatures and ontologies, such as Gene Ontology (e.g. GRO task (Kim et al., 2013b)) or OntoBiotope (e.g. Bacteria Biotope task (Bossy et al., 2015)), which facilitate the integration of the text mining results into larger knowledge bases and bioinformatics applications.
The SeeDev task brings a new application domain, plant development biology, with similar goals and representation as previous IE shared tasks on biological event extraction. This new application domain has required the design of a new knowledge model for the representation of the events, a manually annotated corpus and a new metric that accounts for the varying importance of the event arguments.
We refer to the SeeDev task knowledge model as Gene Regulatory Network for Arabidopsis (GRNA). GRNA meets the usual constraints of manual annotation of texts (e.g. biological relevance and computational tractability), and of automatic annotation by IE methods (e.g. learnability from training examples). We have also taken into account the expected use of GRNA for the indexing and retrieval of textual events and experimental data in a unified representation, the modeling of other plant systems, and also the integration of text knowledge with knowledge derived from experimental data.
The SeeDev corpus is composed of paragraphs from a selection of recent full-text scientific papers about molecular biology of seed development.

Task Description
The SeeDev Task consists of two subtasks: (1) SeeDev-binary on binary relation extraction and (2) SeeDev-full on full event extraction. The SeeDev-binary subtask has been conceived as a first step towards the extraction of full n-ary events, which is of interest for plant biology. Both subtasks share the same GRNA model and the same document set with different annotation sets. The two annotation sets contain binary relations and events respectively. The annotation set of SeeDev-binary has been computed from the annotation set of SeeDev-full through the application of formal transformation rules.

Knowledge Representation
The GRNA model defines 16 entity types (Figure 1) and 21 event types. The Molecule category includes molecules that are directly involved in regulation, such as Hormone that plays a critical role in plant growth, and Protein Domain and DNA regions (Box, Promoter) for the representation of physical binding events. Protein and gene families are also important entities because they are mentioned as actors of the regulations in some papers without more precision on the exact molecule. The Dynamic Process category is defined by two broad entity types, Regulatory Network and Metabolic pathway, with the purpose of keeping the complexity of the extraction task tractable. Moreover, the distinction in the SeeDev corpus between specific kinds of networks or pathways would have been difficult, if not impossible, because the authors themselves remain vague.
[Figure: matrix over entity types; recoverable row labels: Hormone, Regulatory Network, Metabolic pathway, Genotype, Tissue, Development Phase, Environmental Factor.]
The conditions in which the regulations occur represent critical information about the event context. The entity types represent spatial conditions (Tissue), temporal conditions (Development phase), the organism, which is genetically modified or not (Genotype), and the environmental factors (biotic and abiotic external conditions). The entities in the corpus are denoted by individual words or by sets of words that may be discontinuous.
The 21 GRNA event types are grouped in 6 sets, according to their biological role (Table 1). The Regulation, Function and Interaction categories are central for the description of the biological mechanisms. Where and When event types represent the context of the mechanisms, whilst Composition and Membership events allow a fine-grained representation of the relations among the biological entities. Some of the event types, e.g. Regulates Expression / Process / Molecule Activity, are very similar to those of other molecular biology IE event schemes such as the ones of GENIA (Kim et al., 2013a), Cancer Genetics (Pyysalo et al., 2015) and Arabidopsis Leaf Growth (LG) (Szakonyi et al., 2015). Other GRNA event types are specific to biological development, e.g. Regulates Development Phase / Tissue Development, or to the storage process, e.g. Regulates Accumulation. The LG model of Szakonyi et al. (2015), although dedicated to Ath, does not include plant-specific or development-specific events that could be reused in GRNA. Protein modification and metabolism in the GENIA and PC tasks, and regulation of phenotype in LG, were not relevant for the SeeDev corpus but will be addressed as a priority in future extensions of GRNA.
The first column of Table 1 displays the binary relation names of the SeeDev-binary subtask and the n-ary event names of the SeeDev-full subtask in brackets, with their definition in column two. N-ary events have two mandatory arguments and up to six optional arguments: Tissue, Developmental Stage, Organism, Genotype, Environmental Factor, and Hormone.
Furthermore, n-ary events may have a negation modality. Participants are provided with text documents, gold entity annotations, and the detailed signatures of each event, i.e. the list of allowed types per slot. Figure 2 gives, for example, the Binding event signature.
The use of a strongly typed model facilitates the event prediction because it drastically reduces the number of event candidates given the types of the arguments. Figure 3 shows the number of relation types per pair of argument types. For example, the argument pair (Arg1: Development Phase / Arg2: Protein Domain) does not accept any relation type, whereas the pair (Arg1: Protein / Arg2: Protein Family) may be involved in 8 different relation types. The formal specification of event signatures thus narrows the search space of possible events.
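The pruning effect of typed signatures can be illustrated with a minimal sketch. The signature table below is hypothetical and heavily abridged: the relation labels, the allowed types per slot and the function names are assumptions made for illustration, not the actual GRNA signatures, which are distributed with the task data.

```python
from itertools import permutations

# Hypothetical, abridged signature table:
# relation type -> (allowed Arg1 entity types, allowed Arg2 entity types).
# These entries are illustrative only and are NOT the real GRNA signatures.
SIGNATURES = {
    "Binds_To": ({"Protein", "Protein_Family"}, {"Protein", "Promoter", "Box"}),
    "Regulates_Expression": ({"Protein", "Hormone"}, {"Gene", "Protein"}),
    "Exists_In_Genotype": ({"Protein", "Gene"}, {"Genotype"}),
}


def allowed_types(arg1_type, arg2_type, signatures=SIGNATURES):
    """Return the relation types whose signature accepts this ordered type pair."""
    return sorted(
        rel
        for rel, (first, second) in signatures.items()
        if arg1_type in first and arg2_type in second
    )


def candidates(entities, signatures=SIGNATURES):
    """Enumerate (relation type, arg1 id, arg2 id) triples allowed by the signatures.

    `entities` maps an entity identifier to its entity type.
    """
    out = []
    for (id1, type1), (id2, type2) in permutations(entities.items(), 2):
        for rel in allowed_types(type1, type2, signatures):
            out.append((rel, id1, id2))
    return out


entities = {"T1": "Protein", "T2": "Gene", "T3": "Genotype"}
pruned = candidates(entities)
# Without type constraints, every relation type would be a candidate
# for every ordered pair of entities.
unconstrained = len(SIGNATURES) * len(list(permutations(entities, 2)))
print(len(pruned), "typed candidates versus", unconstrained, "untyped ones")
```

A classifier then only has to score the candidates that survive the filter, and type pairs that accept no relation are discarded before any learning takes place.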

Sub-Task 1: SeeDev Binary Relation Extraction
The goal of SeeDev-binary is the extraction of binary relations of 22 different types without modality (no negation) as described in Table 1.
The Is Linked To relation is computed from the n-ary events: it links mandatory arguments to optional arguments. Figure 4.a gives an example of SeeDev-binary annotation with 3 different relations.
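As a rough illustration of how binary relations could be derived from n-ary events, the sketch below flattens one event into a typed relation between its two mandatory arguments, plus one Is Linked To relation from each mandatory argument to each optional argument. This rule is an assumption based only on the description above; the official transformation rules may differ, and the `Event` class, the `to_binary` function and the relation labels are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """An n-ary event: a type, two mandatory arguments and optional arguments."""
    type: str
    arg1: str              # identifier of the first mandatory argument
    arg2: str              # identifier of the second mandatory argument
    optional: tuple = ()   # identifiers of the optional arguments


def to_binary(event):
    """Flatten one n-ary event into binary relations.

    Illustrative rule only (an assumption, not the official transformation).
    """
    relations = [(event.type, event.arg1, event.arg2)]
    for mandatory in (event.arg1, event.arg2):
        for opt in event.optional:
            relations.append(("Is_Linked_To", mandatory, opt))
    return relations


event = Event("Regulates_Expression", "T1", "T2", optional=("T5", "T6"))
binary = to_binary(event)
for relation in binary:
    print(relation)
```

Under this assumed rule, an event with two optional arguments yields five binary relations. The negation modality is not represented, since SeeDev-binary relations carry no modality.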

Sub-Task 2: SeeDev Full Relation Extraction
SeeDev-full aims at extracting n-ary events where the number of arguments ranges from two to eight, plus a negation modality. There are three arguments on average. There is no trigger word in the SeeDev event representation. Events relate

Corpus Description
The SeeDev corpus is a set of 86 paragraphs from 20 full-text articles, selected by plant biology experts, about seed development in Arabidopsis thaliana. Paragraphs of the same document may be distributed into different sets. The "Documents" row indicates the proportion of documents represented in the set. The SeeDev corpus is smaller than other BioNLP-ST corpora, e.g. a fifth of the Cancer Genetics corpus and a third of the GENIA corpus. The manual annotation of the SeeDev corpus required a high level of expertise that does not allow for a large corpus, as in many specific domains of Life Science. We identify small dataset processing as a challenge to overcome by information extraction tools.
Table 1 details the distribution of instances per relation type in the training, development and test sets of the SeeDev-binary task. The distribution was balanced between the three data sets so that the test set would represent approximately a third of the annotations for each group of relations. The most frequent relations belong to the Regulation category, with 48% of annotations, which corresponds to what is expected given the corpus domain. The three relations Regulates Expression, Regulates Process and Exists In Genotype, highlighted in Table 1, account for half of the total, whilst seven of the relations are relatively infrequent with 1% of the total.

Annotation Methodology
We have successively refined the annotation scheme of GRNA during the annotation process. We have defined an initial annotation scheme according to our expertise in A. thaliana seed development and in BioNLP task definition, starting from the GRN model (Bossy et al., 2015).
The scheme was improved through several iterations of manual annotations and collective discussions until it met the requirements, i.e. it allowed unambiguous, consistent, readable and detailed formal annotations. Together with the scheme, a very precise guideline document (Chaix et al., 2016) was produced that details the annotation principles for each entity and event type, and provides many examples and counter-examples.
The relevant paragraphs of the corpus were chosen by the biologists, mostly from the abstract, introduction, result and discussion sections. A team of three experts in seed development and two bioinformaticians has manually annotated the corpus following the guidelines by using the AlvisAE Annotation Editor (Papazian et al., 2012) in accordance with the final version of the scheme.

Automatic Annotation
Rigid designators of named entities, such as Gene, Protein, Tissues, and Developmental Phases, were automatically pre-annotated with the AlvisNLP pipeline using relevant Ath databases (e.g. TAIR) and customized lexicons. The goal of automatic pre-annotation was to speed up the manual annotation process. The evaluation of the automatic annotation compared to the gold standard annotation shows an F-score equal to 0.41, with a high precision (0.89) and low recall (0.26) due to a lack of relevant lexicon for most entity types.

Manual Annotation
The manual annotation was carried out in four successive phases in order to both save expert time and achieve high-quality annotation. First, a bioinformatician who is not a specialist of Ath annotated all the entities of the corpus. The evaluation of this manual annotation of the entities against the gold standard annotation yielded a high F-score of 0.93, with balanced recall and precision (0.93 and 0.95 respectively).
Then Ath experts revised the entity annotations and annotated the events of the corpus in a double-blind manner. Thanks to the manual pre-annotation of entities, they could focus on events which require more expertise. Next, the annotators together with the bioinformatician used the AlvisAE conflict resolution functionality to build a consensus. Finally, the bioinformatician carefully checked the compliance of each annotation to the guidelines to produce the gold annotation set.
To evaluate the inter-annotator agreement, we measured the F-score between the annotation set of each annotator (referred to as A and B) and the consensus annotation set (i.e. gold annotations) (Table 3). The differences between the individual annotators vary according to the event types. The recall measure of the annotations of events with arguments of Process type without regulation (Is Involved In Process) and events with Genotype arguments (Exists In Genotype, Occurs In Genotype) is lower.
Mistyping Regulates Accumulation was frequent because this event is easily confused with Regulates Molecule Activity. Annotations from annotator B are closer to the reference annotation, but the examination of the union of both annotation sets shows that annotator B missed events that were well annotated by A. The 0.724 F-score of the union of A and B annotation sets is quite high. The last step of the SeeDev corpus construction is the adjudication between the two annotators with a third person as external referee. It was an essential step to avoid event oversight.

Evaluation Procedure

Shared Task Organization
As for previous challenges, BioNLP-ST 2016 provides resources and information to the participants through the BioNLP-ST website and mailing lists. The schedule of the SeeDev task follows the usual principles of BioNLP-ST tasks and can be found on dedicated pages. We provided state-of-the-art automatic NLP analysis as supporting resources in order to speed up the development of participant systems. Nine tools were selected and applied to the training, development and test sets: a POS tagger (GENIA Tagger (Tsuruoka et al., 2005)), parsers (Stanford Parser (Manning, 2003), Enju (Miyao and Tsujii, 2008), C&C CCG Parser (Clark and Curran, 2007)), a term extractor (BioYaTeA (Golik et al., 2013)), named entity recognizers (Stanford NER (Finkel et al., 2005), LINNAEUS (Gerner et al., 2010), SR4GN (Wei et al., 2012)), and a tokenizer and sentence splitter (AlvisNLP suite (Ba and Bossy, 2016)).
Community web tools (forum, FAQ and mailing list) have been made available on the website in order to federate the community that participates in the challenge. In this way participants could interact with the task organizers and with other participants.
Furthermore, participants could evaluate their predictions through an online evaluation service. During the training phase it was restricted to the evaluation on training and development sets. The service now also allows the evaluation of predictions on the test set and will remain open. For the first time in BioNLP-ST, participants could also keep track of the performance of various experiments through the same online service. Thus, participants could follow and compare their results and competing team results. The recorded submissions were kept anonymous to other participants. The aim of this tool was to ease the interpretation of the scores and to assist participants in the development-test cycles.

Evaluation Metrics
The evaluation measures of the participant system results are computed through the comparison of predicted events against reference corpus events. In SeeDev-binary the participants had to predict relations between entities given as input. This task can be viewed as a classification task of all pairs of entities. Thus, we evaluate submissions with Recall, Precision and F-score. Submissions were ranked by F-score. We also provided alternate evaluations in order to assess the strengths of each submission: for each relation type separately, for each broad category of relations separately, and without taking into account the relation types.
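A minimal sketch of such a comparison is given below, assuming that each relation is represented as a (type, first argument, second argument) tuple and that a prediction counts as correct only when it matches a reference relation exactly. The `prf` function and its `key` parameter are hypothetical names; the official evaluation service may handle details such as duplicates differently.

```python
def prf(predicted, reference, key=lambda rel: rel):
    """Precision, recall and F1 between two collections of relations.

    Relations are (type, arg1, arg2) tuples. `key` projects each relation
    before matching, for instance to ignore the relation type.
    """
    pred = {key(rel) for rel in predicted}
    ref = {key(rel) for rel in reference}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


reference = [
    ("Regulates_Expression", "T1", "T2"),
    ("Binds_To", "T1", "T3"),
    ("Exists_In_Genotype", "T2", "T4"),
]
predicted = [
    ("Regulates_Expression", "T1", "T2"),
    ("Regulates_Process", "T1", "T3"),  # correct arguments, wrong type
]

strict = prf(predicted, reference)
# Alternate evaluation that ignores the relation type.
untyped = prf(predicted, reference, key=lambda rel: rel[1:])
print("strict:", strict)
print("untyped:", untyped)
```

Passing a `key` that maps each relation type to its broad category would give the per-category variant in the same way.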
We also designed a measure for SeeDev-full task evaluation that is permissive for optional arguments. The evaluation is detailed on the task web site and is available through the online evaluation service to the benefit of teams that will bravely tackle this task.
According to their responses to a survey, the main background domains of the participating teams are Bioinformatics, Machine Learning, Natural Language Processing and Biology. Table 4 summarizes the scores obtained by the participant systems ranked by F1-score (detailed results are available on the SeeDev site). The results of the DUTIR system are not displayed because the team experienced a last-minute hitch and ranked last. LitWay from Xidian University achieves the best F1-score (0.432), 0.068 points higher than the second team and 0.177 points higher than the lowest score at 0.255. The two top-ranked systems achieved balanced recall and precision, while the four others favored recall over precision (VERSE, LIMSI), or the reverse (UTS, ULISBOA). VERSE obtained the best recall and UTS the best precision.
Table 4: Evaluation scores of the SeeDev binary task ranked by F-score.
The best F1-scores are very similar to those achieved by participants in previous shared tasks on regulation event extraction, which are around 50% (e.g. GRN, CG, PC). This is above what could be expected given the complexity and the novelty of the task and the variability of the example distribution among the events.
As shown by Table 5, the detailed scores per relation exhibit a high variability. Some relations were difficult to predict (e.g. Regulates Tissue Development, Regulates Molecule Activity, Occurs During) while others were well-predicted (e.g. Composes Primary Structure with a maximum F1-score of 0.67).
As usual with such corpora, the analysis of the results shows that the causes are multifactorial. We hypothesize that the number of training examples, combined with the regularity of the descriptions and the constraints imposed by the event signature, are critical. For instance, the Composes Primary Structure relation has only 51 examples, but it links entities from a restricted range of types, which makes it easier to predict (0.67 best F1-score). By contrast, relations such as Regulates Expression, with a high number of examples (450 examples), inter-sentence occurrences (23) and a wide range of argument types (4 types for the first argument and 16 for the second), were poorly predicted (0.39 best F1-score).
The scores of most of the systems remain unchanged when the dataset is restricted to the relations that occur in a single sentence. The differences in the results obtained on the intra-sentence dataset are less than 1 point, except for LIMSI, which gains 0.056 points; indeed, LIMSI is the only team that attempted to predict inter-sentence relations, whereas all other participant systems predicted only intra-sentence relations. Given the proportion of inter-sentence relations in the test set (4%), the penalty for ignoring them can be considered bearable.
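Restricting a dataset to intra-sentence relations can be sketched as follows, assuming that sentence boundaries and entity positions are available as character offsets. The function names and the data layout are hypothetical, and an entity is located by the start offset of its first fragment, which is a simplification for discontinuous entities.

```python
from bisect import bisect_right


def sentence_index(offset, sentence_starts):
    """Index of the sentence that contains a character offset.

    `sentence_starts` is the sorted list of sentence start offsets,
    beginning with 0.
    """
    return bisect_right(sentence_starts, offset) - 1


def is_intra_sentence(relation, entity_spans, sentence_starts):
    """True when both arguments of a (type, arg1, arg2) relation start in the same sentence."""
    _, arg1, arg2 = relation
    first = sentence_index(entity_spans[arg1][0], sentence_starts)
    second = sentence_index(entity_spans[arg2][0], sentence_starts)
    return first == second


# Toy document with three sentences starting at offsets 0, 50 and 120.
sentence_starts = [0, 50, 120]
# Entity identifier -> (start offset, end offset); illustrative values.
entity_spans = {"T1": (4, 8), "T2": (30, 36), "T3": (60, 70)}
relations = [
    ("Binds_To", "T1", "T2"),              # both arguments in sentence 0
    ("Regulates_Expression", "T1", "T3"),  # arguments in sentences 0 and 1
]

intra = [rel for rel in relations if is_intra_sentence(rel, entity_spans, sentence_starts)]
print(intra)
```

Applying the same filter to both the reference and the predicted relations before scoring gives an intra-sentence evaluation of the kind discussed above.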
In order to assess the difficulty of predicting the correct relation type, we computed the F-scores when considering the category of the relations instead of the actual type (first line per category in bold and italic in Table 5). This did not yield a significant improvement, although some participants were able to successfully predict events in categories with high biological relevance, such as the Regulation category (LitWay F1: 0.416) and the Interaction category (UniMel F1: 0.303).

Systems Description and Result Discussion
All teams used supervised machine-learning approaches (Table 6). Five systems used support vector machines (SVM) and two systems were based on different algorithms, namely maximum entropy (MaxEnt) (LIMSI) and convolutional neural network (DUTIR).
Table 6: General methods of the participants
SVMs are widely used for information extraction tasks, because they are powerful and versatile classifiers. SVMs are kernel-based, and several existing kernels (Zelenko et al., 2003) are adapted to different object representations. For instance, dependency-path kernels (Bunescu and Mooney, 2005; Airola et al., 2008) handle candidates represented as syntactic dependency paths. Moreover, the usual feature selection methods can be handled by kernels that work on vectorial representations. MaxEnt and neural networks are also popular algorithms in information extraction tasks (McCallum et al., 2000). The most notable characteristic of the best performing system, LitWay, is that it combines supervised machine learning for the prediction of a selection of event types with hand-crafted rules for the prediction of other types.

All teams used token segmentation, sentence splitting and token normalization (stemming, lemmatization, POS-tagging). Four teams, including the three top-ranking ones, also used deep syntactic parsing, which confirms that parsing is a powerful pre-processing step for information extraction. Finally, the LitWay system also designed features based on word embeddings, which is a novelty in the BioNLP-ST.

Conclusion
We have described the SeeDev task, which we designed with the goal of promoting progress in information extraction in the field of plant development and more precisely plant regulatory networks. Two sub-tasks were proposed with increasing levels of complexity, SeeDev-binary on binary relations and SeeDev-full on events.
The lack of participation in SeeDev-full shows that the extraction of n-ary events with optional arguments remains challenging.
Seven teams from different countries participated in the SeeDev-binary task with different approaches. The results are very promising, given the novelty of the task and the complexity of the model. The best F-score, 0.432, is close to what has been previously obtained in similar IE tasks on molecular biology.
The good results achieved by hybrid methods using machine learning and hand-crafted patterns show that efficient adaptation of generic methods to the task could rely not only on machine learning, but also on alternative approaches. This observation may also be true for the extraction of n-ary events from binary relations, where rewriting rules may complement machine learning methods. This may be particularly appropriate for relatively small corpora such as SeeDev, which belongs to a domain where a trade-off has to be found between the time needed for the training corpus annotation and the time needed for the manual development of dedicated rules for the IE method.