Large-Scale Semantic Indexing and Question Answering in Biomedicine

In this paper we present the methods and approaches employed in our participation in the 2016 edition of the BioASQ challenge. For the semantic indexing task, we extended our successful ensemble approach of last year with additional models. The official results obtained so far demonstrate a continuing, consistent advantage of our approaches over the National Library of Medicine (NLM) baselines. For the question answering task, we extended our approach to factoid questions and also developed approaches for the document, concept and snippet retrieval sub-tasks.


Introduction
The BioASQ project (Balikas et al., 2014) aims to provide a challenge framework for researchers dealing with classification (semantic indexing) and natural language processing (question answering) tasks in the field of bio-medicine. The challenge, similar to the previous three years, is divided into two tasks: automated semantic indexing (4A) and question answering (4B).
In Task 4A participants are given a set of new, unannotated articles and are required to automatically predict the relevant MeSH terms for each of them within a given time. For each article only the abstract is provided, along with some meta-information (journal, year and title). This task is particularly difficult, as the MeSH taxonomy comprises a large number of labels (∼27,000) whose frequencies follow a distribution similar to a power law. Furthermore, the terms are subject to significant concept drift over time.
Task 4B is divided into two phases, called A and B. In phase A participants are given a set of questions and must return the 10 most relevant documents, snippets, concepts (from designated ontologies) and RDF triples. In phase B participants are given the gold standard documents and snippets and must provide exact and ideal answers.
This paper discusses the approaches we developed for this year's BioASQ challenge. In particular, Section 2 discusses our semantic indexing algorithms, Section 3 our document retrieval system, Section 4 our concept retrieval method, Section 5 our snippet retrieval approach and Section 6 discusses our question answering approach. Final considerations and conclusions are drawn in Section 7.

Task 4A: Semantic Indexing
In this section we present the methods that we used for the semantic indexing task. We first provide the pre-processing pipeline and subsequently the methods employed.

Pre-processing
For this year's participation, we used the 1,050,000 most recent documents from the BioASQ 2016 corpus, taking the first 1 million articles as a training set and the last 50 thousand as a validation set. The motivation for using the latest articles of the corpus stems from the hypothesis that chronologically more recent articles will tend to follow label distributions more similar to those of the new articles to be predicted than older articles do. Pre-processing of the articles was carried out as in previous years: the abstract and the title were concatenated, uni-grams and bi-grams were used as features, and stop-words and features with fewer than five occurrences in the corpus were removed. We used the tf-idf representation for the features. In addition, we applied zoning to the features belonging to the title and to those equal to a MeSH label, increasing the tf-idf value of features that belonged to the title by log 2 and of those equal to a label by log 1.25. The above features were used to train several multi-label learning models, described in the following section.
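The pipeline above can be illustrated with a minimal sketch in plain Python. It is not the code used in the challenge: the toy stop-word list, the smoothed idf variant, the use of the natural logarithm, and the additive reading of "increasing by log 2 / log 1.25" are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter

# Toy stop-word list (an assumption; the actual list is not given in the text).
STOP_WORDS = {"the", "of", "in", "and", "a", "is"}

def ngrams(text):
    """Return the uni-grams and bi-grams of a text after stop-word removal."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

def build_tfidf(docs, mesh_labels, min_count=5,
                title_boost=math.log(2), label_boost=math.log(1.25)):
    """docs is a list of (title, abstract) pairs.

    Returns one {feature: weight} dict per document.
    """
    title_feats = [set(ngrams(title)) for title, _ in docs]
    all_feats = [ngrams(title + " " + abstract) for title, abstract in docs]

    corpus_counts = Counter(f for feats in all_feats for f in feats)
    doc_freq = Counter(f for feats in all_feats for f in set(feats))
    n_docs = len(docs)

    vectors = []
    for feats, t_feats in zip(all_feats, title_feats):
        # Drop features with fewer than min_count occurrences in the corpus.
        tf = Counter(f for f in feats if corpus_counts[f] >= min_count)
        vec = {}
        for feat, count in tf.items():
            # Smoothed idf, one common variant (assumption).
            idf = math.log((1 + n_docs) / (1 + doc_freq[feat])) + 1
            weight = count * idf
            # Zoning: additive boosts (assumed interpretation of the text).
            if feat in t_feats:
                weight += title_boost
            if feat in mesh_labels:
                weight += label_boost
            vec[feat] = weight
        vectors.append(vec)
    return vectors
```

On a real corpus of this size a sparse-matrix implementation would be preferable; the dictionaries here only keep the example self-contained.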

Methods
Our participation in this year's contest included several multi-label classifiers (MLC) that were combined in various ensembles. As in the previous year, we used the Meta-Labeler (Tang et al., 2009), a set of Binary Relevance (BR) models with linear SVMs (both tuned and with default parameters) and a Labeled LDA variant, Prior LDA (Rubin et al., 2012). For the tuned SVM models, we used different values for the C parameter and handled class imbalance by penalizing false negative errors more heavily than false positive ones, adjusting the weight parameter accordingly (Lewis et al., 2004). This year, we additionally employed FastXML (Prabhu and Varma, 2014) and HOMER-BR (Tsoumakas et al., 2008).
All the above models were combined in an ensemble using the MULE framework (Papanikolaou et al., 2014). MULE is a multi-label ensemble method that performs classifier selection based on statistical significance. The key idea is to combine a set of multi-label classifiers so as to optimize a selected measure (for the purpose of this challenge, we are mainly interested in the micro-F measure) and to validate this combination through a statistical significance test, namely McNemar's test. This way, each label of the multi-label problem is predicted with a specific component model, the one that (a) contributes the greatest improvement to the evaluation metric of interest and (b) is validated by the statistical test to indeed produce that improvement. If (b) does not hold, in other words if the component model's improvement is not statistically significant, we predict that label with the globally optimal model.
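The selection rule described above can be sketched as follows. This is a simplified illustration and not the MULE implementation: it picks the per-label candidate by per-label F1 rather than by its contribution to the global micro-F, it uses the chi-squared form of McNemar's test with continuity correction at a 0.05 significance level, and all function names are hypothetical.

```python
def f1(truth, pred):
    """F1 score for two equally long lists of 0/1 values."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcnemar_significant(truth, pred_a, pred_b, critical=3.841):
    """McNemar's chi-squared test with continuity correction.

    3.841 is the chi-squared critical value for 1 degree of freedom
    at the 0.05 level (the level itself is an assumption here).
    """
    a_only = sum(1 for t, a, b in zip(truth, pred_a, pred_b) if a == t and b != t)
    b_only = sum(1 for t, a, b in zip(truth, pred_a, pred_b) if a != t and b == t)
    if a_only + b_only == 0:
        return False
    stat = (abs(a_only - b_only) - 1) ** 2 / (a_only + b_only)
    return stat > critical

def select_models(truth, preds):
    """truth: {label: [0/1, ...]}; preds: {model: {label: [0/1, ...]}}.

    Returns {label: name of the model chosen to predict that label}.
    """
    def micro_f(model):
        flat_truth = [v for label in truth for v in truth[label]]
        flat_pred = [v for label in truth for v in preds[model][label]]
        return f1(flat_truth, flat_pred)

    global_best = max(preds, key=micro_f)
    choice = {}
    for label in truth:
        best = max(preds, key=lambda m: f1(truth[label], preds[m][label]))
        better = best != global_best and mcnemar_significant(
            truth[label], preds[best][label], preds[global_best][label])
        # Fall back to the globally optimal model when the gain is not significant.
        choice[label] = best if better else global_best
    return choice
```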

Results
Since there are not yet sufficient official results at the time of writing (only a small part of the documents of the first batch has been annotated), in Table 1 we present the performance of the multi-label classifiers used in our ensembles, in terms of the Micro-F and Macro-F measures, on the training set (one million documents) and the validation set (fifty thousand documents) used throughout the challenge.
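For reference, the two measures can be computed as in the short sketch below. It assumes the label-averaged definition of Macro-F (the F-measure is computed per label and then averaged), which is one common definition; the function name is hypothetical.

```python
def micro_macro_f(true_sets, pred_sets):
    """true_sets and pred_sets are lists of label sets, one per document.

    Returns (micro_f, macro_f).
    """
    labels = set().union(*true_sets, *pred_sets)
    total_tp = total_fp = total_fn = 0
    per_label_f = []
    for label in labels:
        tp = sum(1 for t, p in zip(true_sets, pred_sets) if label in t and label in p)
        fp = sum(1 for t, p in zip(true_sets, pred_sets) if label not in t and label in p)
        fn = sum(1 for t, p in zip(true_sets, pred_sets) if label in t and label not in p)
        total_tp += tp
        total_fp += fp
        total_fn += fn
        denom = 2 * tp + fp + fn
        per_label_f.append(2 * tp / denom if denom else 0.0)

    # Micro-F pools the counts over all labels before computing F.
    micro_denom = 2 * total_tp + total_fp + total_fn
    micro = 2 * total_tp / micro_denom if micro_denom else 0.0
    # Macro-F averages the per-label F values, so rare labels weigh as much as frequent ones.
    macro = sum(per_label_f) / len(per_label_f) if per_label_f else 0.0
    return micro, macro
```

Because of the power-law label distribution mentioned earlier, Micro-F is dominated by frequent labels, while Macro-F is sensitive to performance on rare ones.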

Task 4B Phase A: Documents
Here we describe our document retrieval system. The system was written in Java and uses a variety of libraries: the StAX parser for reading XML files, the Stanford Parser for natural language parsing and the GSON library for writing JSON files. We built our system on the open source Indri search engine from the Lemur Project.

Pre-processing of citations
We processed the full MEDLINE database and extracted the citations that contained Title, Abstract and MeSH annotations. This yielded 14,938,869 documents.

Search Engine
We used Indri as our search engine. We normalized the text of all the processed citations and indexed them in the search engine. No stemming or stop-word filtering was applied, in order to avoid any distortion of bio-medical and other important terminology.

Question Parsing and Query
Our system processes and analyzes the input question before producing the final query. It removes any unwanted punctuation, analyzes the question with the Stanford Parser and produces a bag of words. Finally, we form our query by combining the bag of words with the query language grammar of Indri.
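As a rough illustration of this step, the sketch below turns a question into a bag of words and wraps it in Indri's `#combine` operator. Several things here are assumptions: the text does not say which Indri operator was used or which punctuation counts as unwanted, whitespace tokenization stands in for the Stanford Parser, and the function name is hypothetical. The original system was written in Java; Python is used only for brevity.

```python
import string

def build_indri_query(question):
    """Turn a natural-language question into a simple Indri bag-of-words query."""
    # Replace every punctuation character with a space (assumed definition of "unwanted").
    table = str.maketrans({ch: " " for ch in string.punctuation})
    tokens = question.translate(table).lower().split()
    # Keep each word once, preserving the order of first appearance.
    bag = list(dict.fromkeys(tokens))
    return "#combine(" + " ".join(bag) + ")"
```

For example, `build_indri_query("What is the role of BRCA1?")` yields `#combine(what is the role of brca1)`. No stop-words are removed, in line with the indexing choice described above.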

Testing
We tested our system using both the questions and the gold standard articles of the previous BioASQ challenges and of the current challenge. We experimented with Indri's wide variety of search terms and tried retrieving the top-10, top-20 and top-50 documents. The table below provides the results of our experiments when retrieving the top-10 documents.

Task 4B Phase A: Concepts
We address the phase A task of returning a list of at most 10 relevant concepts from the designated terminologies and ontologies. The list is ordered by decreasing confidence. In our approach, we use MetaMap and LingPipe to detect the biomedical concepts, and local ontology files (Disease ontology, Gene ontology, Jochem, Uniprot and MeSH) to retrieve the appropriate information. More specifically, we use RDF4J, a powerful Java framework for processing and handling RDF data, for the Disease ontology, Gene ontology, Jochem and MeSH data. This includes creating, parsing, storing, inferencing and querying over such data. Additionally, we use RDF4J's Lucene Sail, which adds full-text search over RDF literals and lets us find subject resources quickly. The Uniprot data, which are not in obo format, are used in XML format (rather than the plain text format recommended by the contest). Lucene indexing is again necessary for these data. We present our methodology step by step:
1. The first step of our methodology is to remove stopwords from the given question. We use two stopword lists: a basic list with 634 words and the Pubmed stopword list. Then, we detect keywords using MetaMap and LingPipe. We give a boosting score to the concepts that come from MetaMap/LingPipe and a smaller score to any other word that appears in the question but is not recognized by MetaMap/LingPipe as a biomedical concept.
2. Then, we expand the list of candidate concepts by exploiting the MeSH ontology, enriching the list with ExactSynonyms (up to 15 candidate concepts in total). We retain two lists of candidate concepts: a list with all possible biomedical concepts, used for search in the Disease ontology, Gene ontology, Jochem and MeSH, and a list that contains only proteins or genes, used for search in the Uniprot XML data.
3. We search for each candidate term separately, combining a search in RDF4J's Lucene Sail index, for fast detection of relevant terms, with a search in the RDF4J RDF repositories via SPARQL queries, which filters the results returned as relevant by the Lucene Sail index. More specifically, for the four ontologies we examine whether the candidate term appears in the properties label, ExactSynonym, RelSynonym, Synonym, NarrowSynonym and BroadSynonym; if it does, we add a boosting score to the Lucene score and return the corresponding URI. If the candidate term does not appear in the above properties, we just keep the Lucene score. Additionally, we exploit the properties (Positively/Negatively) Regulates in order to return the corresponding URI as well. Similarly, we search the Uniprot data, but instead of SPARQL queries we use XPath, focusing on the following XML elements: fullName, shortName, alternativeName and innName.
4. Finally, we take the 10 concepts with the highest scores.
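The scoring in steps 3 and 4 can be sketched as follows. The sketch works on already retrieved hits and does not call RDF4J or Lucene. The hit structure, the boost value, the use of an exact case-insensitive match, and the function names are illustrative assumptions; the keyword weights from step 1 are omitted because the text does not say how they are combined with the Lucene score.

```python
# Properties examined for the four RDF ontologies, as listed in step 3.
MATCH_PROPERTIES = ("label", "ExactSynonym", "RelSynonym", "Synonym",
                    "NarrowSynonym", "BroadSynonym")

def score_hit(term, hit, boost=1.0):
    """Lucene score plus a boost when the term matches one of the properties.

    hit is a dict with keys "uri", "lucene_score" and "properties",
    where "properties" maps a property name to a list of literal values.
    """
    values = [value.lower()
              for prop in MATCH_PROPERTIES
              for value in hit["properties"].get(prop, [])]
    matched = term.lower() in values  # exact match is an assumption
    return hit["lucene_score"] + (boost if matched else 0.0)

def top_concepts(term_hits, boost=1.0, k=10):
    """term_hits maps each candidate term to its list of hits.

    Returns the k (uri, score) pairs with the highest scores.
    """
    best = {}
    for term, hits in term_hits.items():
        for hit in hits:
            score = score_hit(term, hit, boost)
            # Keep the best score seen for each URI across all candidate terms.
            if score > best.get(hit["uri"], float("-inf")):
                best[hit["uri"]] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]
```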
Here, we present experimental results on two different sets of questions, both of which belong to the training set of the BioASQ contest.

Task 4B Phase A: Snippets
In order to extract the snippets relevant to a query, we exploit the knowledge given by the ontologies referred to in Section 4 (Disease ontology, Gene ontology, Jochem, Uniprot and MeSH). Briefly:
1. Detect keywords using MetaMap and LingPipe.
2. Search for synonyms of each keyword in order to perform query expansion. Suppose we have K keywords and for each one we find a few synonyms, e.g. for the i-th keyword, i = 1, ..., K, we detect N synonyms. Each synonym is denoted by $syn_{j}key_{i}$, that is, the j-th synonym ($syn_j$), j = 1, ..., N, of the i-th keyword ($key_i$). Format of the query after the expansion step: suppose K = 2, $key_1$ has N synonyms and $key_2$ has M synonyms; the query is then
$$((key_1 \text{ OR } syn_{1}key_{1} \text{ OR } syn_{2}key_{1} \text{ OR } \ldots \text{ OR } syn_{N}key_{1}) \text{ AND } (key_2 \text{ OR } syn_{1}key_{2} \text{ OR } syn_{2}key_{2} \text{ OR } \ldots \text{ OR } syn_{M}key_{2}))$$
The total number of candidate concepts (i.e. keywords together with their corresponding synonyms) is limited to 15.
3. Retrieve the top 100 relevant documents using the Lucene index. In particular, we are interested in their title, abstract and pmid.
4. Split titles/abstracts returned in step 3 into sentences.
5. Calculate the semantic similarity between each sentence and the (expanded) query using the semantic similarity measure described in (Han et al., 2013). (At this point, we are also experimenting with clustering algorithms in order to select the sentences that fall in the same cluster as the query, regarding them as the most relevant snippets.)
6. Return the 10 sentences that are most similar to our query according to the similarity measure.
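The query expansion and sentence ranking steps above can be sketched as follows. This is only an illustration under stated assumptions: the cap of 15 concepts is applied by truncating in keyword order, sentences are split with a simple regular expression, and a token-level Jaccard overlap stands in for the semantic similarity measure of Han et al. (2013), which is not reproduced here. All function names are hypothetical.

```python
import re

def build_expanded_query(synonyms, max_concepts=15):
    """synonyms maps each keyword to its list of synonyms, in order.

    Builds ((key_1 OR syn ...) AND (key_2 OR syn ...)) using at most
    max_concepts keywords and synonyms in total.
    """
    groups = []
    budget = max_concepts
    for key, syns in synonyms.items():
        if budget <= 0:
            break
        terms = ([key] + list(syns))[:budget]
        budget -= len(terms)
        groups.append("(" + " OR ".join(terms) + ")")
    return "(" + " AND ".join(groups) + ")"

def split_sentences(text):
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def jaccard(text_a, text_b):
    """Token-set overlap, a stand-in for a real semantic similarity measure."""
    tokens_a = set(re.findall(r"\w+", text_a.lower()))
    tokens_b = set(re.findall(r"\w+", text_b.lower()))
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

def top_snippets(query_terms, documents, k=10):
    """Rank all sentences of the retrieved texts against the expanded query terms."""
    query_text = " ".join(query_terms)
    sentences = [s for doc in documents for s in split_sentences(doc)]
    ranked = sorted(sentences, key=lambda s: jaccard(s, query_text), reverse=True)
    return ranked[:k]
```

For instance, `build_expanded_query({"k1": ["a", "b"], "k2": ["c"]})` produces `((k1 OR a OR b) AND (k2 OR c))`.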

Task 4B, Phase B: Exact Answers
We developed a system that extracts answers to factoid questions using a scoring mechanism. In our approach, we applied several measures that rank the candidate answers based on their relations with the question. Some of them were applied in our previous system, but we found that they were not sufficient to identify the correct answer. Thus, we extended the previous scoring mechanism to include the measures described below.
• distance: Words that appear near the LAT (lexical answer type) of the question in the snippets are more likely to be candidate answers.
• wordnet synonyms: We strongly believe that words with many synonyms in WordNet are more likely to belong to common language than to biomedicine. Thus, they are penalized according to the number of synonyms that they have.
Furthermore, in our previous work, the system selected only some of the words of an article as candidate answers, namely those produced by MetaMap parsing. Although the results of the previous system were promising on the BioASQ training set, they were quite low in the BioASQ challenge. The system's failure was caused by the lack of candidate answers. We therefore decided to expand the set of candidate answers to all words included in the snippets related to a question. Finally, the specificity measure of our previous work was changed because of its execution time. We had implemented that measure by counting the number of occurrences of a candidate answer in all PubMed documents. Instead, we now use a document retrieval system to find the documents that include the candidate answer. For each retrieved document, the candidate answer is penalized.
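To make the interplay of these measures concrete, the sketch below combines them into a single score. The text does not give the actual formula, so the functional form (a proximity reward minus linear penalties) and the three weights are purely hypothetical, as are the function names; the inputs (distance to the LAT, number of WordNet synonyms, number of retrieved documents) are assumed to be computed elsewhere.

```python
def score_candidate(distance_to_lat, n_wordnet_synonyms, n_retrieved_docs,
                    w_distance=1.0, w_synonyms=0.1, w_docs=0.01):
    """Hypothetical score for one candidate answer.

    Candidates close to the LAT are rewarded; candidates with many WordNet
    synonyms or many retrieved documents (low specificity) are penalized.
    """
    proximity = w_distance / (1 + distance_to_lat)
    penalty = w_synonyms * n_wordnet_synonyms + w_docs * n_retrieved_docs
    return proximity - penalty

def rank_candidates(candidates):
    """candidates maps a word to (distance_to_lat, n_synonyms, n_docs).

    Returns the words ordered from best to worst score.
    """
    return sorted(candidates,
                  key=lambda word: score_candidate(*candidates[word]),
                  reverse=True)
```

With these illustrative weights, a common word such as "protein" with many synonyms and many retrieved documents ranks below a specific term such as "BRCA1", even when it appears slightly closer to the LAT.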

Conclusions
In this paper we presented the participation of our team in the BioASQ challenge 2016. Building on our successful approaches in the past three challenges, we further extended our line of work to improve the performance of our systems. Additionally, our methodology for relevant concept retrieval gives quite good results, based on our evaluation on a variety of bio-medical questions provided by BioASQ's training set. Moreover, the semantic information from the ontologies could be exploited for other tasks.