An Overview of the Active Gene Annotation Corpus and the BioNLP OST 2019 AGAC Track Tasks

The active gene annotation corpus (AGAC) was developed to support knowledge discovery for drug repurposing. Based on the corpus, the AGAC track of the BioNLP Open Shared Tasks 2019 was organized to facilitate cross-disciplinary collaboration between the BioNLP and pharmacoinformatics communities on drug repurposing. The AGAC track consists of three subtasks: 1) named entity recognition, 2) thematic relation extraction, and 3) loss of function (LOF) / gain of function (GOF) topic classification. Five teams participated in the AGAC track, and their performance is compared and analyzed. The results revealed substantial room for improvement in the design of the task, which we analyze in terms of “imbalanced data”, “selective annotation” and “latent topic annotation”.


Introduction
Biomedical natural language processing (BioNLP) has long been recognized as an effective method to accelerate drug-related knowledge discovery (Vazquez et al., 2011). In particular, PubMed is regarded as a main source for knowledge discovery, as it stores a vast number of reports on scientific discoveries and its size keeps growing (Hunter and Cohen, 2006; Cohen et al., 2016). Various corpora use texts from PubMed. Examples include GENIA (Kim et al., 2003), CRAFT (Cohen et al., 2017), and the BioCreative task corpora (Li et al., 2016), to name just a few.
The growing interest in corpus annotation has also led to the development of public annotation platforms in the BioNLP community. An example of recent progress is PubAnnotation (Kim and Wang, 2012), which offers a versatile platform for corpus construction, annotation, and data sharing, and for offering corpora as open shared tasks (https://2019.bionlp-ost.org/tasks).
In the context of drug-related knowledge discovery, various corpora have been developed. Examples include annotated corpora for adverse drug reactions (ADR) (Roberts et al., 2017; Demner-Fushman et al., 2018; Karimi et al., 2015; Ginn et al., 2014; Gurulingappa et al., 2012), and those for drug-drug interactions (DDI) (Herrero-Zazo et al., 2013). However, as far as the authors know, there has been no corpus annotation work (except AGAC-related work) for drug repurposing. Drug repurposing (also known as drug repositioning) aims to find new indications for approved drugs, and is now recognized as an important means of investigating novel drug efficiency in the pharmaceutical industry.
This paper presents the Active Gene Annotation Corpus (AGAC) and a shared task (the AGAC track of BioNLP Open Shared Tasks 2019) based on it. The design of AGAC is highly motivated by the LOF-agonist/GOF-antagonist hypothesis proposed by Wang and Zhang (Wang and Zhang, 2013), which states: for a given disease caused by a driver gene with loss of function (LOF) or gain of function (GOF), a targeted agonist or antagonist, respectively, is a candidate drug.
The hypothesis was well supported by experiments, which encouraged large scale automatic knowledge curation.
In fact, the hypothesis embodies the idea of tracking the phenotypic information of genes, and it shares a similar motivation with phenome-wide association studies (PheWAS) (Rastegar-Mojarad et al., 2015). In PheWAS, International Classification of Diseases (ICD) codes are assigned as phenotypes to candidate single nucleotide polymorphisms (SNPs) so as to investigate the relevance between phenotypes and gene mutations.
AGAC is a corpus annotated by human experts, with the aim of capturing function changes of mutated genes in a pathogenic context. The design of the corpus and the guidelines were published in 2017 (Wang et al., 2018), and a case study of using such an annotated corpus for drug repurposing was successfully performed in 2019, unveiling potential associations of variations with a wide spectrum of human diseases (Zhou et al., 2019). Since then, the whole annotation work took 20 months, with the involvement of four annotators.
Using the corpus, the AGAC track of BioNLP Open Shared Tasks 2019 was organized, in which 5 teams participated. In this paper, both the AGAC corpus and the AGAC track are introduced, and the performance of the participants is presented. The full information of the AGAC track is available at the website, https://sites.google.com/view/bionlp-ost19-agac-track.
The AGAC corpus and shared task

Corpus preparation
We collected abstracts using the MeSH terms "Mutation/physiopathology" and "Genetic Disease". AGAC is annotated for eleven types of named entities, which are categorized into bio-concepts, regulation types, and other entities, and for two types of thematic relations between them. All the types of named entities and thematic relations are defined in the AGAC ontology (see Figure 1).
While the full description of the named entity types can be found in the AGAC guideline book (Wang et al., 2018), briefly speaking, the ontology is designed to include the entities which are relevant to genetic variations and the ensuing phenotype changes at molecular and cellular levels, with a focus on tracing the biological semantics of LOF and GOF mutations.
Since AGAC aims to annotate mutations and the subsequent bio-processes caused by the mutations, the two thematic role types, ThemeOf and CauseOf, originally introduced by the GENIA event annotation (Kim et al., 2008), are adopted to represent relations between AGAC entities. Note that here the use of the ThemeOf and CauseOf relations is slightly different from their use in linguistic analysis, in the sense that they are not confined to verbs. In AGAC, the thematic relations may be used to connect two named entities, both in noun forms. Below are the semantics of the two thematic relations:
• ThemeOf: a theme of an event (or a regulatory named entity) is the object which undergoes a change of its state due to the event.
• CauseOf: a cause of an event (or a regulatory named entity) is the object which leads the event to happen.
To help readers understand the semantics of the AGAC entities, they are mapped to corresponding MeSH terms (Lipscomb, 2000) whenever possible (see Figure 1). In addition to the annotations for named entities and relations, each abstract in AGAC is annotated with a statement of a LOF/GOF-classified gene-disease association. The statement is expressed by a triple: a gene, the type of function change (GOF or LOF), and a disease. For example, if an abstract reports an association between a mutation of SHP-2, which causes a GOF type of function change, and leukemia, the abstract is annotated with the triple, SHP-2; GOF; leukemia. Note that this is the most straightforward form of knowledge to which the LOF-agonist/GOF-antagonist hypothesis can be applied for discovering candidate chemicals for diseases, which is the primary application scenario of AGAC.
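As an illustration of how such a triple connects to the LOF-agonist/GOF-antagonist hypothesis, the following minimal sketch maps the function change of a triple to the kind of drug action the hypothesis suggests. The names `Triple`, `CANDIDATE_DRUG_ACTION`, and `candidate_drug_action` are hypothetical and are not part of AGAC or of any tool used in the track.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    """A LOF/GOF-classified gene-disease association (hypothetical container)."""
    gene: str
    function_change: str  # e.g. "LOF" or "GOF"
    disease: str


# LOF-agonist/GOF-antagonist hypothesis: a loss of function suggests an
# agonist as a candidate drug, a gain of function suggests an antagonist.
CANDIDATE_DRUG_ACTION = {"LOF": "agonist", "GOF": "antagonist"}


def candidate_drug_action(triple: Triple) -> Optional[str]:
    """Return the drug action suggested by the hypothesis, or None if the
    function change is neither LOF nor GOF."""
    return CANDIDATE_DRUG_ACTION.get(triple.function_change)


example = Triple("SHP-2", "GOF", "leukemia")
print(example.gene, example.disease, candidate_drug_action(example))
```

In a full pipeline, the suggested action would then be looked up against a drug database for the given gene; that lookup is outside the scope of this sketch.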

Statistics and characteristics of AGAC corpus
The AGAC corpus was annotated by four annotators: a main annotator and three fellow annotators.
To evaluate the quality of the annotations, inter-annotator agreement (IAA) was measured in an asymmetric way: the annotation of the main annotator was taken as the "oracle", against which the annotation of each fellow annotator was compared. The IAAs of the three fellow annotators were 0.68, 0.78 and 0.70, respectively, in F-score.
To serve as the training and test data sets of the AGAC shared task, the corpus was randomly divided into halves: 250 abstracts for each of the training and the test data sets. The basic statistics of the abstracts, sentences, and annotations are shown in Table 1.
The AGAC corpus is characterized by three features: imbalanced data, selective annotation, and latent topic annotation.
i) Imbalanced Data: The statistics in Table 1 clearly show that the entity distribution is imbalanced over the entity types.
ii) Selective Annotation: According to the AGAC guidelines (Wang et al., 2018), annotations are made only to the sentences which carry sufficient information to mine a gene-disease association with LOF/GOF specification, i.e., a sentence is annotated only if it contains specific gene, mutation, and disease mentions. In other words, the named entities appearing in a sentence are not annotated if the sentence misses any of the required entities. This later turned out to be a tricky feature, which makes the NER task based on the corpus much more complicated than typical NER tasks (see Section 5).
iii) Latent Topic Annotation: The annotation of each abstract with a LOF/GOF-classified gene-disease association may be regarded as a kind of latent topic annotation, in the sense that the LOF/GOF context of a gene-disease association may not be directly visible from the text. This feature makes the AGAC annotation unique: the annotation is geared toward knowledge discovery for drug repurposing based on the LOF-agonist/GOF-antagonist hypothesis. Note that the agonist or antagonist information of a chemical is available in various databases like DrugBank (Wishart et al., 2017) or the Therapeutic Target Database (TTD), which means that if mining of LOF/GOF-classified gene-disease associations is possible on a large scale, mining of drug candidates for diseases will also be possible on a large scale.
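The selective annotation rule in ii) can be sketched as a simple filter, under the simplifying assumption that a sentence qualifies exactly when it contains at least one gene, one mutation, and one disease mention. The labels `Gene`, `Var`, and `Disease` and the function `select_annotations` are illustrative assumptions; the actual guidelines involve expert judgment that this sketch does not capture.

```python
from typing import List, Tuple

Mention = Tuple[str, str]  # (surface text, entity type)

# Entity types assumed to be required for a sentence to be annotated.
REQUIRED_TYPES = {"Gene", "Var", "Disease"}


def select_annotations(sentences: List[List[Mention]]) -> List[List[Mention]]:
    """Keep the mentions of a sentence only if all required types occur in it;
    otherwise drop every mention of that sentence."""
    kept = []
    for mentions in sentences:
        types_present = {entity_type for _, entity_type in mentions}
        kept.append(list(mentions) if REQUIRED_TYPES <= types_present else [])
    return kept


sentences = [
    # Gene, mutation and disease all present: the sentence is annotated.
    [("SHP-2", "Gene"), ("Mutations", "Var"), ("leukemia", "Disease")],
    # No disease mention: the gene and mutation mentions are left unannotated.
    [("SHP-2", "Gene"), ("Mutations", "Var")],
]
print(select_annotations(sentences))
```

The second sentence shows why the NER task becomes harder: the same gene mention is a positive example in one sentence and must be ignored in another.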

Task Definition of AGAC Track
The AGAC track consists of three tasks: Task 1: named entity recognition, Task 2: thematic relation extraction, and Task 3: mutation-disease knowledge discovery. While participants were allowed to choose the tasks in which they would participate, due to the dependency between the tasks, it was expected that participating in all three tasks might maximize the chance of high performance: Task 2 requires the result of Task 1, and Task 3 may benefit from the results of Task 1 and 2. Below are the details of the three tasks:
Task 1. NER: To recognize named entities appearing in given texts, and to assign them their entity class, based on the AGAC ontology. Figure 2 shows an example, where four spans, "protein", "Truncating", "DNMs", and "SHROOM3", are annotated as Protein, Negative Regulation, Variation, and Gene, respectively. The participants are required to produce the result in the PubAnnotation JSON format. Note that while compound nouns are common, there are no discontinuous or overlapping spans annotated as named entities in AGAC.
Task 2. Thematic relation identification: To identify the thematic relations, ThemeOf and CauseOf, between named entities. Figure 3 shows an example, where two ThemeOf relations, Protein → Negative regulation and Gene → Variation, and one CauseOf relation, Negative regulation → Variation, are annotated.
Note that the relation annotations are added on top of the NER annotations. Note also that relations may be intra- or inter-sentential, and in AGAC, 3.98% of the relations are inter-sentential.
Task 3. Mutation-disease knowledge discovery: Figure 4 shows an example, where the PubMed abstract, 25805808, is annotated with the triple, SHROOM3; LOF; Neural tube defects. Participants are required to produce a text file where a quadruple (a PubMed ID plus a triple) takes one line. Note that while this task is syntactically independent of Task 1 and 2, it may semantically benefit from the results of the two tasks.
Figure 4: Annotation example for Task 3
For better understanding, let us pick a sentence, "Mutations in SHP-2 phosphatase that cause hyperactivation of its catalytic activity have been identified in human leukemia, particularly juvenile myelomonocytic leukemia." From a biological view, hyperactivation of catalytic activity is clearly a description of gain of function. Hence, this sentence carries clear semantic information that the gene "SHP-2", after mutation, has a GOF related to the disease "juvenile myelomonocytic leukemia". Therefore, Task 3 requires the triple from this sentence, i.e., SHP-2;GOF;juvenile myelomonocytic leukemia.
Another sentence, "Lynch syndrome (LS) caused by mutations in DNA mismatch repair genes MLH1.", describes the association between the disease "Lynch syndrome" and the gene "MLH1", but the phrase "caused by" indicates neither loss nor gain; hence the triple from this sentence should be MLH1;REG;Lynch syndrome.
In a COM example, "Here, we describe a fourth case of a human with a de novo KCNJ6 (GIRK2) mutation, who presented with clinical findings of severe hyperkinetic movement disorder and developmental delay. Heterologous expression of the mutant GIRK2 channel alone produced an aberrant basal inward current that lacked G protein activation, lost K+ selectivity and gained Ca2+ permeability.", the description "lost K+ selectivity and gained Ca2+ permeability" shows both LOF and GOF. Therefore, the function change cannot be labeled as LOF or GOF but as COM: GIRK2;COM;hyperkinetic movement disorder.
Sample data for task 1, 2, and 3
Figure 5 shows a sample text of the AGAC corpus in JSON format. The term "target" is the address of the annotated text. "sourcedb" is where the text originates from; all the texts in the AGAC corpus are from PubMed. "sourceid" is the PMID of the text. "text" contains the raw abstract.
1) "denotations" for Task 1: "denotations" contains the named entity annotations corresponding to Task 1. Each named entity annotation has an "id"; a "span", which is its position in the abstract; and an "obj", which is the named entity type it belongs to.
2) "relations" for Task 2: "relations" contains the thematic roles between the named entities, which corresponds to Task 2. Each relation contains an "id"; a "pred", which is the thematic role; and "subj" and "obj", which are the "id"s of the named entities that the relation associates. The direction of the relation is from "subj" to "obj".
Note that Task 2 requires the result of Task 1.

3) Triples for Task 3:
25805808;SHROOM3;LOF;Neural tube defects
The line shown above is the result of Task 3, which is required to be extracted from the sample text. For the result template used during evaluation, the standard format is: pmid;gene;function change;disease.
The visualization of part of this sample text is shown in Figure 5, which is presented by the annotation platform PubAnnotation.
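The fields described above can be illustrated with a small hand-built document. This is only a sketch: the sentence in `TEXT` is an invented stand-in rather than the real abstract, the label strings `NegReg`, `Var`, `Gene`, and `Protein` are assumed spellings of the entity types, the "target" field is omitted, and the helper names are hypothetical. The spans use the begin/end character offsets of the PubAnnotation JSON format.

```python
import json

# Invented stand-in sentence (NOT the real text of the PubMed abstract).
TEXT = "Truncating DNMs in SHROOM3 disrupt the protein."


def span_of(mention: str) -> dict:
    """Character span of the first occurrence of `mention` in TEXT."""
    begin = TEXT.index(mention)
    return {"begin": begin, "end": begin + len(mention)}


doc = {
    "sourcedb": "PubMed",
    "sourceid": "25805808",
    "text": TEXT,
    "denotations": [  # Task 1: named entities
        {"id": "T1", "span": span_of("Truncating"), "obj": "NegReg"},
        {"id": "T2", "span": span_of("DNMs"), "obj": "Var"},
        {"id": "T3", "span": span_of("SHROOM3"), "obj": "Gene"},
        {"id": "T4", "span": span_of("protein"), "obj": "Protein"},
    ],
    "relations": [  # Task 2: directed from "subj" to "obj"
        {"id": "R1", "pred": "ThemeOf", "subj": "T4", "obj": "T1"},
        {"id": "R2", "pred": "ThemeOf", "subj": "T3", "obj": "T2"},
        {"id": "R3", "pred": "CauseOf", "subj": "T1", "obj": "T2"},
    ],
}


def mention_text(doc: dict, denotation_id: str) -> str:
    """Surface text of the denotation with the given id."""
    d = next(d for d in doc["denotations"] if d["id"] == denotation_id)
    return doc["text"][d["span"]["begin"]:d["span"]["end"]]


def readable_relations(doc: dict) -> list:
    """Relations as (subject text, predicate, object text) tuples."""
    return [(mention_text(doc, r["subj"]), r["pred"], mention_text(doc, r["obj"]))
            for r in doc["relations"]]


def task3_line(pmid: str, gene: str, function_change: str, disease: str) -> str:
    """One Task 3 result line: pmid;gene;function change;disease."""
    return ";".join([pmid, gene, function_change, disease])


print(json.dumps(doc, indent=2))
print(readable_relations(doc))
print(task3_line(doc["sourceid"], "SHROOM3", "LOF", "Neural tube defects"))
```

Computing the spans from the text, rather than typing offsets by hand, avoids off-by-one errors, which matter under strict span matching.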

Evaluation methods
The performance of the participants was evaluated in terms of standard precision, recall, and F-score. For Task 1 and 2, the PubAnnotation Evaluator tool (see footnote 1) was used, with a parameter setting for strict span matching (soft match characters = 0 & soft match words = 0). For Task 2, for a predicted relation to be counted as a true positive, the two entities participating in the relation have to be correctly predicted, together with the type of the relation. Note that the evaluation criteria applied to Task 1 and 2 are very strict.
For Task 3, a custom evaluation tool was provided by the organizers. Unlike Task 1 and 2, for Task 3, a relaxed matching criterion was applied: a "Function-Classified Gene-Disease Association" (FCGDA) statement is counted as correct if the function classification (LOF or GOF) is correctly recognized. The motivation for using the relaxed matching criterion was that this was a fairly new type of task, which made it highly challenging, and that prediction of the LOF/GOF context was of primary interest.
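The strict criteria for Task 1 and 2 can be sketched as exact set matching. This is not the PubAnnotation Evaluator itself, only a minimal single-document illustration under the assumption that an entity is identified by its exact begin offset, end offset, and type, and a relation by its type plus its two exactly matched entities. All function names are hypothetical.

```python
from typing import Dict, Set, Tuple


def prf(gold: Set, pred: Set) -> Tuple[float, float, float]:
    """Precision, recall and F-score from exact set overlap."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


def entity_keys(doc: Dict) -> Set[tuple]:
    """An entity matches only if begin, end and type are all identical."""
    return {(d["span"]["begin"], d["span"]["end"], d["obj"])
            for d in doc.get("denotations", [])}


def relation_keys(doc: Dict) -> Set[tuple]:
    """A relation matches only if both entities and the type are identical."""
    by_id = {d["id"]: (d["span"]["begin"], d["span"]["end"], d["obj"])
             for d in doc.get("denotations", [])}
    return {(by_id[r["subj"]], r["pred"], by_id[r["obj"]])
            for r in doc.get("relations", [])
            if r["subj"] in by_id and r["obj"] in by_id}


gold = {
    "denotations": [
        {"id": "T1", "span": {"begin": 0, "end": 10}, "obj": "NegReg"},
        {"id": "T2", "span": {"begin": 11, "end": 15}, "obj": "Var"},
    ],
    "relations": [{"id": "R1", "pred": "CauseOf", "subj": "T1", "obj": "T2"}],
}
# The prediction misses the last character of the second entity.
pred = {
    "denotations": [
        {"id": "T1", "span": {"begin": 0, "end": 10}, "obj": "NegReg"},
        {"id": "T2", "span": {"begin": 11, "end": 14}, "obj": "Var"},
    ],
    "relations": [{"id": "R1", "pred": "CauseOf", "subj": "T1", "obj": "T2"}],
}
print("NER:", prf(entity_keys(gold), entity_keys(pred)))
print("RE: ", prf(relation_keys(gold), relation_keys(pred)))
```

In this example a one-character span error costs one entity and also the relation that depends on it, which illustrates how errors in Task 1 propagate to Task 2 under strict matching.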
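One possible reading of the relaxed criterion is sketched below. It assumes, as a simplification, that a predicted line counts as correct when the gold data contains a statement for the same PubMed ID with the same function change label, ignoring the gene and disease strings. The organizers' tool may implement the criterion differently; the function names and the second PubMed ID are hypothetical.

```python
from typing import Iterable, Set, Tuple


def parse_line(line: str) -> Tuple[str, str, str, str]:
    """Split a 'pmid;gene;function change;disease' line into its four fields."""
    pmid, gene, function_change, disease = line.strip().split(";", 3)
    return pmid, gene, function_change, disease


def relaxed_keys(lines: Iterable[str]) -> Set[Tuple[str, str]]:
    """Keep only (pmid, function change); gene and disease are ignored."""
    return {(pmid, fc) for pmid, _, fc, _ in map(parse_line, lines)}


def relaxed_prf(gold_lines, pred_lines) -> Tuple[float, float, float]:
    """Precision, recall and F-score under the relaxed matching assumption."""
    gold, pred = relaxed_keys(gold_lines), relaxed_keys(pred_lines)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    return precision, recall, (2 * precision * recall / denom if denom else 0.0)


gold_lines = [
    "25805808;SHROOM3;LOF;Neural tube defects",
    "00000001;SHP-2;GOF;leukemia",  # hypothetical PubMed ID
]
pred_lines = [
    "25805808;SHROOM3;LOF;spina bifida",  # disease differs, LOF is correct
    "00000001;SHP-2;LOF;leukemia",        # function change is wrong
]
print(relaxed_prf(gold_lines, pred_lines))
```

Under this reading, the first prediction is accepted despite the different disease string, while the second is rejected because the LOF/GOF label is wrong.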

Results and observations
Overall, five teams participated in the tasks of the AGAC track: three teams in both Task 1 and 2, one team only in Task 1, and one team (through a late submission) only in Task 3. The results of Task 1, 2, and 3 are presented in Table 2, 3, and 4, respectively.
Looking into the methods used by the participants, it is observed that, although the number of participants is not high, a variety of methods were used: a probabilistic sequence labeling model, e.g., CRF (Lafferty et al., 2001); a kernel-based linear classification model, e.g., SVM; modern neural network models, e.g., CNN (Lawrence et al., 1997) and Bi-LSTM (Hochreiter and Schmidhuber, 1997; Sundermeyer et al., 2012); and also joint learning. It is also observed that the use of BERT (Devlin et al., 2018), a pre-trained language representation model, was popular.
1 https://github.com/pubannotation/pubannotation_evaluator

Task 1
In Task 1, DX-HITSZ used the "JFB-NER" model, a joint learning model with parameters from a fine-tuned BioBERT. Zheng-UMASS used a hierarchical multi-task learning model for both named entity recognition and relation extraction. In this model, 12 entities were decomposed into three subtasks: (1) Var, MPA, CPA, Enzyme for part one; (2) Gene, Pathway, Protein, Disease for part two; (3) PosReg, Interaction, NegReg, Reg for part three. In addition, they used BERT embeddings, customized embeddings, and character-level embeddings to represent input sentences. Bi-LSTM encoders were then used for each of the subtasks. YaXXX-SiXXX/LMX used a Bi-LSTM CRF with linguistic features and an ensemble of the 3 best models on 3 data splits. Finally, DJDL-HZAU used traditional CRF method and combined with some

Task 2
In Task 2, Zheng-UMASS used a hierarchical multi-task learning model for both named entity recognition and relation extraction. In the relation extraction part, the model shared the same encoding layers with the named entity recognition part. DX-HITSZ used a simple fine-tuned BioBERT, referred to as "SB-RE". The F-scores they obtained were 0.35 and 0.25, respectively. Furthermore, YaXXX-SiXXX/LMX converted Task 2 into a classification problem and used a traditional support vector machine, obtaining an F-score of 0.03.

Task 3
In Task 3, Ashok-BenevolentAI also used BERT to extract "gene; function change; disease" triples. They encoded the pair of mentions and their textual context as two consecutive sequences and then used a single linear layer to classify their relation into five classes. It is noted that none of the results of Task 1 and Task 2 were jointly learned in this model.
As the task organizer, the AGAC team provided a baseline method for Task 1 and 3. We used BERT to learn the semantic structure of the sentences, and used joint learning for output sequence labeling in Task 1 and triple recognition in Task 3.

Summary
To sum up, the best performance for Task 1 was 0.6 in F-score, which was obtained by DX-HITSZ. It outperformed the reference method provided by the organizers by 0.10 in F-score. For Task 2, the best performance was 0.35, which was achieved by Zheng-UMASS. The best performances for Task 1 and 2 are quite low compared to other NER and RE tasks. We attribute this to the strict evaluation criteria and the selective annotation characteristic of the AGAC corpus, the latter of which is discussed in Section 5. For Task 3, while the reference method provided by the organizers achieved a moderate performance, 0.65 in F-score, the only participant achieved a much lower performance, 0.26. We attribute this to the fact that the team did not use the results of Task 1 and 2, which we expected to be critical for performing Task 3.

Discussion and Conclusion
In this section, the "selective annotation" and "latent topic annotation" features of AGAC are reviewed and future research directions are discussed.

Selective annotation makes NER challenging
As suggested in the previous discussion, state-of-the-art methods in the NLP community, like BERT and joint learning, were frequently tested in the AGAC track. A comprehensive investigation of the results shows both the effectiveness and the disadvantages of these methods. Unlike a normal sequence labeling task, the AGAC track requires a method to perform NER only when the sentence fits the GOF/LOF topic. Here, the "selective annotation" attribute means that only the core named entities or phrases within a sentence which carries clear function change semantics are annotated. The design of this attribute stems from the real scenario of drug knowledge discovery, where curators need to trace and extract the relevant function change information of a mutated gene from texts. Unfortunately, this attribute also makes the AGAC track a fairly challenging task.
The performance comparison in the AGAC track shows that modern NLP strategies like BERT bring the traditional sequence labeling task to its full strength. Both the first-placed team and the baseline method use BERT and a joint learning model. We conclude that a sophisticated language representation model is an effective way to handle sequence labeling in AGAC research. In addition, LOF/GOF recognition without using the results of Task 1 and 2 failed to outperform the baseline method, which makes good use of the named entities in AGAC. This hints that a joint learning model is a suitable integrated solution for NER, thematic role recognition, and LOF/GOF triple recognition.
In all, the "selective annotation" attribute makes the AGAC track more challenging than traditional sequence labeling tasks. Just as a human annotator annotates only after considering whether a sentence carries sufficient LOF or GOF semantics, a successful model should discern the full semantics in order to perform the labeling correctly. Hopefully, performance on the AGAC track will be enhanced by the design of a more intelligent learning model, which is capable of capturing both the sequence labeling and the triple information, and of adjusting its predictions accordingly.

The potential of latent topic annotation
The purpose of the AGAC track, drug repurposing, requires comprehensive cooperation between the BioNLP and bioinformatics communities and, more generally, between the NLP and biology communities. Though only one participant (through a late submission) attempted Task 3, presumably due to the domain gap between computer science and life science, cross-disciplinary cooperation is still promising, especially in the era of multi-omics data (Groen et al., 2016).
The "latent topic annotation" attribute calls for comprehensive integration of drug-related knowledge and deep cooperation in a cross-disciplinary manner. As mentioned in the introduction, the biological idea behind the AGAC design is consistent with mainstream phenotype mining strategies such as PheWAS (Rastegar-Mojarad et al., 2015). In addition, the literature also suggests that BioNLP and computational methods shed light on drug-related knowledge discovery. In our early attempt at applying AGAC (Zhou et al., 2019), PubMed-wide GOF and LOF recognition was successfully achieved by using AGAC as training data. Specifically, the AGAC corpus offers abundant semantic information for function change recognition, and helps to evaluate the GOF/LOF topic of a PubMed abstract.
All of the above facts hint that the well-formed knowledge structure in AGAC can support effective applications in function change investigation, and that a good command of the domain knowledge is the key to advancing research on drug repurposing. Hence, it is promising to develop deep cooperation between the BioNLP and bioinformatics communities based on the outcome of the AGAC track competition.

Data Availability
The AGAC corpus is developed and made available on the PubAnnotation platform, which is technically supported by the Database Center for Life Science (DBCLS), Japan.