Results of the sixth edition of the BioASQ Challenge

This paper presents the results of the sixth edition of the BioASQ challenge. The BioASQ challenge aims at the promotion of systems and methodologies through the organization of a challenge on two tasks: semantic indexing and question answering. In total, 26 teams with more than 90 systems participated in this year’s challenge. As in previous years, the best systems were able to outperform the strong baselines. This suggests that state-of-the-art systems are continuously improving, pushing the frontier of research.


Introduction
The aim of this paper is twofold. First, we aim to give an overview of the data issued during the BioASQ challenge in 2018. In addition, we aim to present the systems that participated in the challenge and evaluate their performance. To achieve these goals, we begin by giving a brief overview of the tasks, which took place from February to May 2018, and the challenge's data. Thereafter, we provide an overview of the systems that participated in the challenge. Detailed descriptions of some of the systems are given in workshop proceedings. The evaluation of the systems, which was carried out using state-of-the-art measures or manual assessment, is the last focal point of this paper, with remarks regarding the results of each task. The conclusions sum up this year's challenge.

Overview of the Tasks
The challenge comprised two tasks: (1) a large-scale semantic indexing task (Task 6a) and (2) a question answering task (Task 6b).

Large-scale semantic indexing - Task 6a
In Task 6a the goal is to classify documents from the PubMed digital library into concepts of the MeSH hierarchy. Here, new PubMed articles that are not yet annotated by MEDLINE indexers are collected and used as test sets for the evaluation of the participating systems. In contrast to previous years, articles from all journals were included in the test data sets of Task 6a. As soon as the annotations are available from the MEDLINE indexers, the performance of each system is calculated using standard flat information retrieval measures, as well as hierarchical ones. As in previous years, an on-line and large-scale scenario was provided, dividing the task into three independent batches of 5 weekly test sets each. Participants had 21 hours to provide their answers for each test set. Table 1 shows the number of articles in each test set of each batch of the challenge. A total of 13,486,072 articles, with 12.69 labels per article on average, were provided as training data to the participants.

Biomedical semantic QA - Task 6b
The goal of Task 6b was to provide a large-scale question answering challenge where the systems had to cope with all stages of a question answering task for four types of biomedical questions: yes/no, factoid, list and summary questions (Balikas et al., 2013). As in previous years, the task comprised two phases. In phase A, BioASQ released 100 questions and participants were asked to respond with relevant elements from specific resources, including relevant MEDLINE articles, relevant snippets extracted from the articles, relevant concepts and relevant RDF triples. In phase B, the released questions were enhanced with relevant articles and snippets selected manually and the participants had to respond with exact answers, as well as with summaries in natural language (dubbed ideal answers). The task was split into five independent batches and the two phases for each batch were run with a time gap of 24 hours. In each phase, the participants received 100 questions and had 24 hours to submit their answers.

Overview of Participants

Task 6a
For this task, 11 teams participated and results from 42 different systems were submitted. In the following paragraphs we describe those systems for which a description was available, stressing their key characteristics. An overview of the systems and their approaches can be seen in Table 3. One of the participating systems is implemented as a UIMA (Tanenblatt et al., 2010) text and data mining workflow, combined with a heterogeneous database architecture, where different search strategies were adopted to automatically select probable MeSH terms. More specifically, the system is based on the ZB MED Knowledge Environment (Müller et al., 2017), while also utilizing the Snowball Stemmer (Agichtein and Gravano, 2000), to find matches between MeSH terms and words in the title and abstract of each target document.
The "AttentionMeSH" systems utilize deep learning and attention mechanisms which enable the models to associate textual evidence with annotations, thus providing interpretability at the word level. Firstly, they use a bidirectional gated recurrent unit to derive word representations with contextual information (Cho et al., 2014), to represent each document. At the same time, all MeSH terms are embedded using a technique that takes into account co-occurring MeSH terms in textually similar articles. Finally, an attention matrix (Mullenbach et al., 2018) is created based on the MeSH and word representations, leading to MeSH-specific article representations. This procedure allows the model to provide local interpretations of the predicted MeSH terms in relation to words of a specific article, raising the interesting subject of how explanations of an automatic MeSH indexer could further help human annotators in this task.
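To make the idea of label-specific attention concrete, the following is a minimal sketch in plain Python. It is a generic dot-product, per-label attention and is not claimed to be the AttentionMeSH architecture; the function names, the two-dimensional vectors and the example words and labels are all hypothetical and chosen only for illustration.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def label_wise_attention(word_vectors, label_vectors):
    """For each label, attend over the words and build a label-specific document vector."""
    dim = len(word_vectors[0])
    weights, representations = [], []
    for label in label_vectors:
        # One attention distribution over the words per label.
        alpha = softmax([dot(label, w) for w in word_vectors])
        # Label-specific representation: attention-weighted sum of word vectors.
        rep = [sum(a * w[k] for a, w in zip(alpha, word_vectors)) for k in range(dim)]
        weights.append(alpha)
        representations.append(rep)
    return weights, representations

# Hypothetical toy inputs (assumed, not taken from any participating system).
words = ["insulin", "lowers", "glucose"]
word_vectors = [[2.0, 0.0], [0.1, 0.1], [0.0, 2.0]]
label_vectors = [[1.0, 0.0], [0.0, 1.0]]  # e.g. labels "Insulin" and "Glucose"
weights, representations = label_wise_attention(word_vectors, label_vectors)
```

Each row of `weights` sums to one and can be read as a per-word explanation for the corresponding label, which is the kind of word-level interpretability described above.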
Other participating systems, including the "DeepMeSH" systems (Peng et al., 2016), the systems of the "AUTH" team (Papagiannopoulou et al., 2016) and the "Iria" systems (Ribadas-Pena et al., 2015), are based on the same techniques used by their systems for the previous version of the challenge, which are summarized in Table 3 and described in the corresponding challenge overview (Nentidis et al., 2017). Similarly to the previous year, two systems developed by the National Library of Medicine (NLM) to assist the indexers in the annotation of MEDLINE articles served as baselines for the semantic indexing task of the challenge. These are the Medical Text Indexer (MTI) (Mork et al., 2014), with some enhancements introduced in (Zavorin et al., 2016), and an extension of it, incorporating features of the winning system of the first BioASQ challenge (Tsoumakas et al., 2013).

Task 6b
The question answering task was tackled by 50 different systems, developed by 15 teams. In the first phase, which concerns the retrieval of information required to answer a question, 9 teams with 27 systems participated. In the second phase, where teams are requested to submit exact and ideal answers, 10 teams with 27 different systems participated. Four of the teams participated in both phases. An overview of the technologies employed by each team can be seen in Table 4.
The "AUEB" team, which participated only in Phase A, used novel extensions of deep learning models for retrieving question-relevant documents and snippets. Firstly, they pre-trained word embeddings (Mikolov et al., 2013) on a very large collection of articles from MEDLINE/PubMed, while also implementing some pre-processing steps (stop-word removal, stemming (Krovetz, 1993), tokenization etc.). Then, for the document retrieval task they focused on the PACRR model (Hui et al., 2017) and the DRMM model (Guo et al., 2016), while for snippet retrieval they utilized the ABCNN model (Yin et al., 2015). Alongside the extensions made on these models, they also deployed a clever post-processing scheme for snippet retrieval, as well as a model for initial document retrieval based on BM25 (Robertson and Jones, 1976) for efficiency purposes.
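As an illustration of a BM25-based initial retrieval step of the kind mentioned above, the following is a minimal sketch of Okapi BM25 scoring in Python. It is not the AUEB implementation: the parameter values k1 = 1.2 and b = 0.75, the smoothed IDF variant, the function name and the toy documents are assumptions made for this example.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF that stays positive; one common variant, assumed here.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy collection and question terms.
docs = [
    "brca1 mutation breast cancer risk".split(),
    "insulin resistance in type 2 diabetes".split(),
    "breast cancer screening guidelines".split(),
]
query = "breast cancer mutation".split()
scores = bm25_scores(query, docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

In a pipeline of this kind, only the top-ranked documents from such a cheap scorer would typically be passed on to a more expensive neural re-ranker.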
Another approach based on deep learning methodologies for Phase A, focusing again on document and snippet retrieval, was proposed by the "MindLaB" team from the National University of Colombia. While for the document retrieval they use the BM25 model and ElasticSearch for efficiency, they train a Convolutional Neural Network (CNN) for snippet retrieval. As in the previous approach, they utilized a very large collection of PubMed articles to train the CNN with similarity matrices of question-answer pairs. More specifically, they deploy similar pre-processing steps (tokenization, lowercasing, skip-gram embeddings (Moen and Ananiadou, 2013)) for the question and the document texts; however, they also apply Part of Speech tagging to extract syntactical information regarding the terms. Based on the idea that not all terms are equally informative (Dong et al., 2015), they deploy a salience weighting scheme focusing on verbs, nouns and adjectives. Another interesting extension is the way final rankings of the snippets are generated based on a pseudo-relevance-feedback re-ranking step (Riezler et al., 2007).
In Phase B, the Macquarie University ("MQU") team focused on ideal answers and explored ideas of reinforcement learning on deep learning models. Extending their previous work (Molla, 2017), they implemented different models under a regression setting for finding similar sentences to a question, based on the corresponding word2vec embeddings of the question-sentence pairs. They also experimented with different ways of utilizing these embeddings, notably using bidirectional Recurrent Neural Networks with LSTM cells (Hochreiter and Schmidhuber, 1997) to equip the model with knowledge regarding the sentence position. Moreover, they also ran experiments using reinforcement learning towards the ROUGE score of the ideal answers, based on their previous work (Mollá-Aliod, 2017), but the results did not support the use of such models.
The Carnegie Mellon University team ("OAQA") also focused on ideal answer generation, building upon previous versions of the "OAQA" system (Chandu et al., 2017). They experimented with ways to improve the generated answer by extracting the most relevant non-redundant sentences from multiple documents and then re-ordering and fusing them to make the resulting text more human-readable and coherent. To this end, they tried different ordering algorithms for sentences and also made various improvements in different stages of the candidate sentence expansion, fusion and filtering procedure that was already used by their model. Among the notable additions is the use of an Integer Linear Program (ILP) module that is capable of fusing repeated content and simplifying complicated sentences, thus improving human readability.
Another system deployed by the same team focuses on answer generation using a knowledge graph and a neural learning-to-rank approach, combined with different summarization techniques. One of the novelties introduced is the creation of an ontology-based retrieval module for relevant snippets, through the relation extraction between biomedical entities found in the abstracts' texts (Abacha and Zweigenbaum, 2015). Also, different learning-to-rank approaches were explored (Qin et al., 2010; Cao et al., 2006, 2007) alongside both extractive (Allahyari et al., 2017) and abstractive (See et al., 2017) summarization techniques for the ideal answers generation.
An interesting approach comes from the "L2PS" team, who use an open-domain model (Chen et al., 2017), pre-trained on the SQuAD (Rajpurkar et al., 2016) dataset and fine-tuned to the biomedical domain. An interesting difference with other deep learning approaches is the fact that the GloVe embeddings (Pennington et al., 2014) were the best amongst the ones tried. Moreover, they raise interesting questions regarding the effects of non-normalized answers (synonyms, abbreviations, multi-word answers) in the evaluation of different systems.
The "UNCC" team participated in Phase B, deploying lexical chaining techniques (Reeve et al., 2006) for sentence similarity and ranking to extract summaries from related snippets and efficiently fuse them in an ideal answer. They take advantage of the MetaMap tool (Aronson and Lang, 2010) for biomedical entity recognition and they also present a way to extend their methodology to factoid/list question answering in Phase A as well.
"Olelo" is one of the approaches that tackles both phases of the question answering task. More specifically, in Phase A Semantic Role Labeling (SRL) approaches for QA systems were utilized. These focus on the automatic extraction of predicate-argument structures (PAS) from both questions and document text, aimed at finding semantically related PAS between associated pairs. For Phase B, the system is built on top of the SAP HANA database and uses various NLP components, such as question processing, document and passage retrieval, answer processing and multi-document summarization based on previous approaches (Schulze et al., 2016) to develop a comprehensive system that retrieves relevant information and provides both exact and ideal answers for biomedical questions.
Other systems, including the "USTB" (Jin et al., 2017) and the "LabZhu" (Peng et al., 2015) systems, employed the same techniques used by their systems for the previous version of the challenge, as summarized in Table 4 and described in the previous challenge overview (Nentidis et al., 2017). In this challenge too, the open source OAQA system proposed by Yang et al. (2016) served as baseline for phase B. The system, which achieved among the highest performances in previous versions of the challenge, remains a strong baseline for the exact answer generation task. The system is developed based on the UIMA framework. ClearNLP is employed for question and snippet parsing. MetaMap, TmTool (Wei et al., 2016), C-Value and LingPipe (Baldwin and Carpenter, 2003) are used for concept identification and UMLS Terminology Services (UTS) for concept retrieval. The final steps include identification of concept, document and snippet relevance, based on classifier components and scoring, ranking and reranking techniques.

Table 5 :
Average system ranks across the batches of Task 6a. A hyphen (-) is used whenever the system participated in fewer than 4 tests in the batch. Systems with fewer than 4 participations in all batches are omitted.

Results

Task 6a
Each of the three batches of Task 6a was evaluated independently. The classification performance of the systems was measured using flat and hierarchical evaluation measures (Balikas et al., 2013). The micro F-measure (MiF) and the Lowest Common Ancestor F-measure (LCA-F) were used to choose the winners for each batch (Kosmopoulos et al., 2013).
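For readers unfamiliar with the flat measure, the following is a minimal sketch of a micro-averaged F-measure for multi-label annotation in Python. It follows the standard definition, in which true positives, false positives and false negatives are pooled over all documents; it is not the official evaluation code, and the function name and the toy label sets are assumptions made for illustration.

```python
def micro_f_measure(gold, predicted):
    """Micro-averaged F-measure over a collection of multi-label documents."""
    tp = fp = fn = 0
    for gold_labels, pred_labels in zip(gold, predicted):
        tp += len(gold_labels & pred_labels)   # correctly assigned labels
        fp += len(pred_labels - gold_labels)   # assigned but not in the gold set
        fn += len(gold_labels - pred_labels)   # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold and predicted MeSH label sets for two articles.
gold = [{"Humans", "Neoplasms"}, {"Mice", "Insulin", "Diabetes Mellitus"}]
pred = [{"Humans", "Neoplasms", "Female"}, {"Mice", "Insulin"}]
mif = micro_f_measure(gold, pred)
```

Because counts are pooled before computing precision and recall, frequent labels dominate this measure, which is one reason a hierarchical measure such as LCA-F is reported alongside it.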
According to Demsar (2006), the appropriate way to compare multiple classification systems over multiple datasets is based on their average rank across all the datasets. On each dataset the system with the best performance gets rank 1.0, the second best rank 2.0 and so on. In case two or more systems tie, they all receive the average rank. Table 5 presents the average rank (according to MiF and LCA-F) of each system over all the test sets for the corresponding batches. Note that the average ranks are calculated for the 4 best results of each system in the batch, according to the rules of the challenge.
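The ranking procedure described above can be sketched as follows. This is a minimal illustration and not the organizers' code: the function names and the scores are hypothetical, and it assumes that "the 4 best results" means the four lowest ranks a system obtained within the batch.

```python
def ranks_with_ties(scores):
    """Rank systems on one test set (higher score is better); ties share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of systems tied with the one at position i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = (i + 1 + j + 1) / 2  # mean of the 1-based positions i+1 .. j+1
        for name in ordered[i:j + 1]:
            ranks[name] = shared
        i = j + 1
    return ranks

def average_rank(per_test_ranks, system, best_n=4):
    """Average a system's best_n ranks in a batch; None if it has fewer than best_n results."""
    own = sorted(r[system] for r in per_test_ranks if system in r)
    if len(own) < best_n:
        return None
    return sum(own[:best_n]) / best_n

# Hypothetical micro F-measure scores for three systems over the 5 test sets of a batch.
test_sets = [
    {"A": 0.68, "B": 0.66, "C": 0.66},
    {"A": 0.67, "B": 0.69, "C": 0.60},
    {"A": 0.70, "B": 0.65},
    {"A": 0.69, "B": 0.64, "C": 0.61},
    {"A": 0.66, "B": 0.67},
]
per_test_ranks = [ranks_with_ties(t) for t in test_sets]
summary = {name: average_rank(per_test_ranks, name) for name in ("A", "B", "C")}
```

In this toy batch, system C took part in only three test sets, so it receives no average rank, mirroring the hyphen used in Table 5.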
The results in Task 6a show that in all test batches and for both flat and hierarchical measures, some systems outperform the strong baselines. The "DeepMeSH" systems achieve the best performance in the first two batches, outperformed only by "xgx" systems in the third batch. More detailed results can be found in the online results page 1. Comparison of these results with corresponding system results from previous years reveals the improvement of both the baseline and the top performing systems through the years of the competition, as shown in Figure 1.

Task 6b
Phase A: For phase A and for each of the four types of annotations (documents, concepts, snippets and RDF triples), we rank the systems according to the Mean Average Precision (MAP) measure. The final ranking for each batch is calculated as the average of the individual rankings in the different categories. In Tables 6 and 7 some indicative results from batch 3 are presented. Full results are available in the online results page of Task 6b, phase A 2. These results are preliminary. The final results for Task 6b, phase A will be available after the manual assessment of the system responses.
1 http://participants-area.bioasq.org/results/6a/
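As a reminder of how MAP is computed, the following is a minimal Python sketch using the textbook definition of average precision. The exact normalization and any cutoff on the length of the returned lists used by the official evaluation may differ; the function names and the document identifiers are assumptions made for illustration.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant items."""
    hits = 0
    precision_sum = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            # Precision at the position of each relevant item found.
            precision_sum += hits / position
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Mean of the average precision over (ranked list, relevant set) pairs."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)

# Hypothetical system output for two questions.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6", "d7"], {"d6"}),
]
map_score = mean_average_precision(runs)
```

Here the first question contributes (1/1 + 2/3) / 2 and the second 1/2, so the measure rewards systems that place relevant items near the top of the list.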
Phase B: In phase B of Task 6b the systems were asked to produce exact and ideal answers. For ideal answers, the systems will eventually be ranked according to manual evaluation by the BioASQ experts (Balikas et al., 2013). Regarding exact answers 3, the systems were ranked according to accuracy, F1 score on prediction of yes answer, F1 on prediction of no and macro-averaged F1 score for the yes/no questions, mean reciprocal rank (MRR) for the factoids and mean F-measure for the list questions. Table 8 shows the results for exact answers for the fourth batch of Task 6b. The symbol (-) is used when systems don't provide exact answers for a particular type of question. The full results of phase B of Task 6b are available online 4. These results are preliminary. The final results for Task 6b, phase B will be available after the manual assessment of the system responses.
3 For summary questions, no exact answers are required.
4 http://participants-area.bioasq.org/results/6b/phaseB/
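The factoid measure can be illustrated with a minimal sketch of mean reciprocal rank in Python. This is not the official evaluation code: the function name and the example answers are hypothetical, and the case-insensitive exact string matching against a set of accepted answers is an assumption made here for simplicity.

```python
def mean_reciprocal_rank(predictions, gold_answers):
    """MRR over factoid questions: 1/position of the first correct candidate, 0 if none."""
    total = 0.0
    for candidates, gold in zip(predictions, gold_answers):
        for position, candidate in enumerate(candidates, start=1):
            # Assumed matching rule: case-insensitive exact match with any accepted answer.
            if candidate.lower() in gold:
                total += 1.0 / position
                break
    return total / len(predictions)

# Hypothetical ranked candidate answers and accepted gold answers (lowercased).
predictions = [
    ["BRCA1", "TP53"],
    ["aspirin", "ibuprofen", "warfarin"],
    ["EGFR"],
]
gold_answers = [{"brca1"}, {"warfarin", "coumadin"}, {"kras"}]
mrr = mean_reciprocal_rank(predictions, gold_answers)
```

The three questions contribute 1, 1/3 and 0 respectively. A strict matching rule like the one assumed here also shows why unnormalized answers such as synonyms or abbreviations can affect the scores.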
The results presented in Table 8 show that evaluating system performance in the yes/no questions with the macro-averaged F1 measure, introduced this year, is useful for identifying systems that achieve good performance regardless of any dataset imbalance in the yes/no classes. In batch 4, for example, two systems outperformed the strong baseline based on previous versions of the OAQA system, a difference that is not apparent when considering accuracy alone. Regarding factoid and list questions, the performance achieved by the systems indicates that there is even more room for improvement in these types of questions.
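The effect of class imbalance on accuracy versus macro-averaged F1 can be seen in a small worked example. The sketch below uses invented data (8 "yes" and 2 "no" questions) and hypothetical function names; it is not drawn from the challenge results.

```python
def f1_for_class(gold, predicted, label):
    """F1 score for one class, treating `label` as the positive class."""
    tp = sum(g == label and p == label for g, p in zip(gold, predicted))
    fp = sum(g != label and p == label for g, p in zip(gold, predicted))
    fn = sum(g == label and p != label for g, p in zip(gold, predicted))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def accuracy(gold, predicted):
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def macro_f1(gold, predicted, labels=("yes", "no")):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(gold, predicted, label) for label in labels) / len(labels)

# Invented, imbalanced gold standard: 8 "yes" questions followed by 2 "no" questions.
gold = ["yes"] * 8 + ["no"] * 2
always_yes = ["yes"] * 10                 # trivial system that always answers "yes"
mixed = ["yes"] * 6 + ["no"] * 4          # misses 2 "yes" answers, gets both "no" right

acc_trivial, acc_mixed = accuracy(gold, always_yes), accuracy(gold, mixed)
f1_trivial, f1_mixed = macro_f1(gold, always_yes), macro_f1(gold, mixed)
```

Both systems reach the same accuracy of 0.8, but the trivial system has an F1 of zero on the "no" class, so its macro-averaged F1 is much lower than that of the system that actually discriminates between the two classes.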

Conclusions
In this paper, an overview of the sixth BioASQ challenge is presented. The challenge consisted of two tasks: semantic indexing and question answering. Overall, as in previous years, the best systems were able to outperform the strong baselines provided by the organizers. This suggests that advances over the state of the art were achieved through the BioASQ challenge but also that the benchmark in itself is challenging. Moreover, a clear shift towards the use of systems that incorporate ideas based on deep learning models can be seen, with respect to previous years. Novel ideas have been tested and state-of-the-art deep learning methodologies have been adapted to biomedical question answering with great results. Consequently, we believe that the challenge is successfully pushing the research frontier in biomedical information systems. In future editions of the challenge, we aim to provide even more benchmark data derived from a community-driven acquisition process.

Figure 1 :
The micro F-measure achieved by systems across different years of the BioASQ challenge. For each test set the micro F-measure is presented for the best performing system (Top) and the MTI, as well as the average micro F-measure of all the participating systems (Avg).

Table 1 :
Statistics on test datasets for Task 6a.

Table 2 :
Statistics on the training and test datasets of Task 6b. All the numbers for the documents and snippets refer to averages.

Table 3 :
Systems and approaches for Task 6a.

Table 4 :
Systems and approaches for Task 6b. Systems for which no information was available at the time of writing are omitted.

Table 6 :
Results for snippet retrieval in batch 3 of phase A of Task 6b. Only the top-10 systems are presented.

Table 7 :
Results for document retrieval in batch 3 of phase A of Task 6b. Only the top-10 systems are presented.

Table 8 :
Results for batch 4 for exact answers in phase B of Task 6b.