Towards Medical Machine Reading Comprehension with Structural Knowledge and Plain Text


For example, among the top 10 systems on the SQuAD 2.0 leaderboard, nine models are based on ALBERT (Lan et al., 2020). 2) The most popular MRC datasets belong to the open domain: they are built from news, fiction, Wikipedia text, etc., and the answers to most questions can be derived directly from the given plain text.
Compared to open-domain MRC, medical MRC is more challenging, while having great potential to benefit clinical decision support. A popular benchmark medical MRC dataset is still lacking. Some recent works try to construct medical MRC datasets, such as PubMedQA, emrQA (Pampari et al., 2018), and HEAD-QA (Vilares and Gómez-Rodríguez, 2019). However, these datasets are either noisy (e.g., generated semi-automatically or by heuristic rules) or too small in annotated scale (Yoon et al., 2019; Yue et al., 2020). Instead, we construct a large-scale medical MRC dataset by collecting 21.7k multiple-choice problems with human-annotated answers from the National Licensed Pharmacist Examination in China. This exam is challenging even for humans: it is used to assess candidates' professional medical knowledge and skills, and according to official statistics the pass rate in 2018 was less than 14.2%. The text of the official reference books is used as the plain text for the questions. One example is illustrated in Table 1.
Though several pre-trained language models have been introduced for domain-specific MRC, BERT-based models are not as consistently dominant as they are in open-domain MRC tasks (Zhong et al., 2020; Yue et al., 2020). Another challenge is that medical questions are often harder: no single labeled paragraph contains the answer to a given question, and searching for multiple relevant snippets from large-scale text, such as entire reference books, is usually required. In many cases the answer cannot be found explicitly in the relevant snippets, and medical background knowledge is needed to derive the correct answer from them. Therefore, unlike in the open domain, simply combining a powerful pre-trained language model with plain text cannot achieve high performance on medical MRC. For example, in Table 1, the relevant snippets (the 3rd row) can only narrow the possible answers for the given question (the 1st row) down to Ribavirin and Entecavir. If the triple (entecavir, indication, chronic hepatitis B) from a medical knowledge graph is used, we can quickly obtain the correct answer, Entecavir.
Here, we propose a novel medical MRC model, KMQA, which exploits both the reference medical text and external medical knowledge. Firstly, KMQA models the interaction between the question, the options, and the snippets retrieved from the reference books with a co-attention mechanism. Secondly, a newly proposed knowledge acquisition algorithm is performed on the medical knowledge graph to obtain the triples strongly related to the questions and options. Finally, the fused representations of knowledge and question are injected into the prediction layer to determine the answer. Besides, KMQA acquires factual knowledge by learning from an intermediate relation classification task, and enhances entity representations by constructing a sub-graph from question-to-options paths. Experiments show that our unified framework yields substantial improvements on this task. Further ablation and case studies demonstrate the effectiveness of the injected knowledge. We also provide an online homepage at http://112.74.48.115:8157.

Related Work

Medical Question Answering The medical domain poses a challenge to existing approaches since its questions can be considerably harder to answer. BioASQ (Tsatsaronis et al., 2012, 2015) is one of the most significant community efforts for advancing biomedical question answering (QA) systems. SeaReader is proposed to answer questions in clinical medicine using documents extracted from medical publications. Yue et al. (2020) conduct a thorough analysis of the emrQA dataset (Pampari et al., 2018) and explore the ability of QA systems to utilize clinical domain knowledge and to generalize to unseen questions. PubMedQA is introduced with questions derived from article titles that can be answered with the respective abstracts. Recently, pre-trained models have been introduced to the medical domain (Beltagy et al., 2019; Huang et al., 2019a).
They are trained on unannotated biomedical texts such as PubMed abstracts and have proven useful in biomedical question answering. In this paper, we focus on multiple-choice problems in medical exams, which are more difficult and diverse and allow us to directly probe the capability of QA models to encode domain knowledge.

Knowledge Enhanced Methods KagNet (Lin et al., 2019) represents external knowledge as a graph and then uses graph convolution and LSTMs for inference. Ma et al. (2019) adopt the BERT-based option comparison network (OCN) for answer prediction and propose an attention mechanism that integrates knowledge from relevant triples. Lv et al. (2020) propose a GNN-based inference model over ConceptNet relations and heterogeneous graphs of Wikipedia sentences. BERT-MK (He et al., 2019) integrates fact triples from the KG, while REALM (Guu et al., 2020) augments language-model pre-training with a learned textual knowledge retriever. Unlike previous works, we incorporate external knowledge both implicitly and explicitly. Built upon pre-trained models, our work combines the strengths of both text and medical knowledge representations.

Method
The medical MRC task in this paper is a multiple-choice problem with five answer candidates. It can be formalized as follows: given the question Q and answer candidates {O_i}, the goal is to select the most plausible correct answer Ô from the candidates. KMQA utilizes textual evidence spans and incorporates Knowledge graph facts for Medical multiple-choice Question Answering. As shown in Figure 1, it consists of several modules: (a) a multi-level co-attention reader that computes context-aware representations for the question, options, and retrieved snippets, and enables rich interactions among them; (b) a knowledge acquisition module that extracts knowledge facts from the KG given the question and options; (c) an injection layer that incorporates the knowledge facts into the reader; and (d) a prediction layer that outputs the final answer. We also utilize the relational structure of question-to-options paths to further improve KMQA.

Multi-level Co-attention Reader
Given an instance, a text retrieval system is first used to select evidence spans for each question-answer pair. We take the concatenation of the question and a candidate answer as the query and keep the top-N relevant passages, which are combined into new evidence spans. Here, we use a BM25-based search indexer (Robertson and Zaragoza, 2009) with medical books as the text source.
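The retrieval step above can be sketched as follows. This is a minimal pure-Python BM25 ranker, not the actual indexer used in the paper; the k1 and b values are the common defaults, and the tokenization is assumed to be done upstream.

```python
import math
from collections import Counter

def bm25_rank(query_tokens, docs, k1=1.2, b=0.75, top_n=1):
    """Rank documents by Okapi BM25 score against a query.
    `docs` is a list of token lists; returns indices of the top-n docs."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    df = Counter()                                  # document frequencies
    for d in docs:
        for w in set(d):
            df[w] += 1

    def score(d):
        tf = Counter(d)
        s = 0.0
        for w in query_tokens:
            if w not in tf:
                continue
            idf = math.log(1 + (N - df[w] + 0.5) / (df[w] + 0.5))
            s += idf * tf[w] * (k1 + 1) / (
                tf[w] + k1 * (1 - b + b * len(d) / avgdl))
        return s

    ranked = sorted(range(N), key=lambda i: score(docs[i]), reverse=True)
    return ranked[:top_n]
```

In KMQA the query would be the concatenated question and candidate answer, and the returned top-N passages are concatenated into the evidence spans E.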
The multi-level co-attention reader represents the evidence spans E, the question Q, and the option O. We formulate the input evidence spans as E ∈ R^m, the question as Q ∈ R^n, and a candidate answer as O ∈ R^l, where m, n, and l are the maximum lengths of the evidence spans, question, and candidate answer, respectively. Given the input E, Q, and O, we apply the WordPiece tokenizer and concatenate all tokens into a new sequence ([CLS], E, [SEP], Q, #, O, [SEP]), where "[CLS]" is a special token used for classification and "[SEP]" is a delimiter. Each token is initialized with a vector by summing the corresponding token, segment, and position embeddings, and then encoded into a hidden state by a BERT-based pre-trained language model. Generally, PLMs are pre-trained on large-scale open-domain plain text, which lacks knowledge of the medical domain. Recent works show that further pre-training PLMs on intermediate tasks can significantly improve performance on the target task (Clark et al., 2019; Pruksachatkun et al., 2020). Following this observation, we incorporate knowledge from the Chinese Medical Knowledge Graph (CMeKG) (Byambasuren et al., 2019) by intermediate-task training. CMeKG is a Chinese medical knowledge graph developed with human-in-the-loop approaches over large-scale medical text data using natural language processing and text mining technology. Currently, it contains 11,076 diseases, 18,471 drugs, 14,794 symptoms, 3,546 structured knowledge descriptions of diagnostic and therapeutic technologies, and 1,566,494 instances of medical concept links, along with attributes describing medical knowledge. A triple in CMeKG consists of four parts: a head entity, a relation, a tail entity, and an attribute description. To acquire factual knowledge, we further pre-train PLMs on this dataset with a relation classification task.
This task requires a model to classify the relation label of a given entity pair based on context. Specifically, we select a subset of CMeKG with 163 distinct relations and include only triples whose relations concern the drug and disease types covered by the exam. We then discard all relations with fewer than 5,000 entity pairs, retaining 40 relations and 1,179,780 facts. After that, we concatenate the two entities with a "[SEP]" token between them as input, and apply a linear layer to the "[CLS]" vector of the last hidden layer of the PLM to perform relation classification. Next, we discard the classification layer and initialize the PLM with the remaining parameters, denoted as B. Finally, we employ B to obtain the encoded representations H_cls ∈ R^h and H_E ∈ R^{m×h}. To strengthen the information fusion from the question to the evidence spans as well as from the evidence spans to the question, we adopt a multi-level co-attention mechanism, which has been shown effective in previous models. Taking the candidate answer representation O as input, we compute three types of attention weights to capture its correlation to the question, the evidence, and both together, yielding question-attentive, evidence-attentive, and question-and-evidence-attentive representations with learnable parameters W_t and b_t. We then fuse these representations by concatenation, denoted [;], into T_O. Finally, we apply column-wise max and mean pooling on T_O and concatenate the result with H_cls, obtaining the new option representation T̃_O ∈ R^{3h}.
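The construction of the relation-classification pre-training set described above can be sketched as follows. This is a simplified sketch: the `[CLS] head [SEP] tail [SEP]` input template follows the paper's description, but the exact relation filtering of the original pipeline is an assumption.

```python
from collections import Counter

def build_rc_dataset(triples, min_pairs=5000):
    """Filter KG triples for the relation classification task.
    Keeps only relations with at least `min_pairs` entity pairs and
    formats each example as a PLM input string plus a relation label.
    `triples` is a list of (head, relation, tail) tuples."""
    counts = Counter(r for _, r, _ in triples)
    kept = {r for r, c in counts.items() if c >= min_pairs}
    return [("[CLS] %s [SEP] %s [SEP]" % (h, t), r)
            for h, r, t in triples if r in kept]
```

The resulting (input, label) pairs are what the linear classification head on top of the "[CLS]" vector is trained on before the head is discarded.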

Knowledge Acquisition
In this section, we describe the method for extracting knowledge facts from the knowledge graph in detail. Once the knowledge is determined, we can choose an appropriate integration mechanism for knowledge injection, such as an attention mechanism (Sun et al., 2018; Ma et al., 2019), pre-training tasks (He et al., 2019), or multi-task training. Given a question Q and a candidate answer O, we first identify the entities and their types in the text by entity linking, where each identified entity exactly matches a concept in the KG. We also perform soft matching with part-of-speech rules, filter out stop words, and obtain the key entities E_Q for Q according to category descriptions such as "western medicine", "symptoms", and "Chinese herbal medicine". After that, we retrieve all triples S_O whose head or tail contains an entity of O as knowledge facts for this option. For these knowledge facts, we first convert the head-relation-tail tokens into regular words with a template function g to generate a pseudo-sentence. For example, "(chronic hepatitis B, Site of disease, Liver)" is converted to "The site of disease of chronic hepatitis B is the liver". Then we re-rank the option facts for each question-answer pair with the method shown in Algorithm 1, which empirically uses the word mover's distance (Kusner et al., 2015) as the similarity function. We apply it in order to find higher-quality knowledge facts that are more relevant to the current option before feeding them into the model. The embedding function F is the mean pooling of sentence word vectors, using 200-dimensional pre-trained embeddings for Chinese words and phrases (Song et al., 2018).

Algorithm 1 Knowledge Acquisition Algorithm
Require: Question q and entities E_Q = {e}, option facts S_O = {(h, r, t)}, embedding function F, template function g
1: Translate each triple s_j = (h_j, r_j, t_j) ∈ S_O to general text p_j using g
2: if E_Q is the empty set then
3:   Calculate a knowledge-based option score for each p_j using the word mover's distance wmd(F(q), F(p_j))
4:   return the top-K option facts ranked by score in ascending order
5: end if
6: Initialize the similarity vector o ∈ R^{|S_O|} with infinities
7: Calculate the entity-to-triple score c_{i,j} of entity e_i with transformed text p_j: wmd(F(e_i), F(p_j))
8: Set the j-th element of the similarity vector: o_j = min_{i ∈ |E_Q|} c_{i,j}
9: return the top-K option facts ranked by o in ascending order
Although not perfect, the triple text found by Algorithm 1 does provide some useful information that can help the model find the correct answer.
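Algorithm 1 can be sketched in a few lines. Here `wmd_stub` stands in for the true word mover's distance (a Euclidean distance between mean-pooled sentence vectors); the real algorithm uses WMD over individual word vectors, so this simplification is an assumption for illustration.

```python
import math

def wmd_stub(vec_a, vec_b):
    """Stand-in for the word mover's distance: Euclidean distance
    between mean-pooled sentence vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))

def acquire_knowledge(question_vec, entity_vecs, option_facts,
                      embed, to_text, k=3):
    """Rank option facts per Algorithm 1.
    option_facts: list of (head, relation, tail) triples for one option.
    embed:        maps a text string to a sentence vector (F in the paper).
    to_text:      template function g turning a triple into a pseudo-sentence.
    """
    pseudo = [to_text(t) for t in option_facts]
    if not entity_vecs:
        # no key entities in the question: score against the whole question
        scores = [wmd_stub(question_vec, embed(p)) for p in pseudo]
    else:
        # otherwise: minimum distance over the question's key entities
        scores = [min(wmd_stub(e, embed(p)) for e in entity_vecs)
                  for p in pseudo]
    ranked = sorted(zip(option_facts, scores), key=lambda x: x[1])
    return [fact for fact, _ in ranked[:k]]   # top-K, ascending distance
```

The top-K facts returned here are the pseudo-sentences later fed to the knowledge injection layer.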

Knowledge Injection and Answer Prediction
We first concatenate the returned option fact texts as F, and then use B to generate an embedding of this pseudo-sentence. Let H_F ∈ R^{s×h} be the concatenation of the final hidden states, where s is the maximum length. We then adopt an attention mechanism to model the interaction between H_F and the PLM encoding of the question, H_Q, where element-wise multiplication is denoted by ∘. Specifically, H_F is linearly transformed using W_fq ∈ R^{s×h}. Then the similarity matrix M_FQ ∈ R^{s×n} is computed with standard attention, and from M_FQ we compute the question-to-knowledge attention A_FQ ∈ R^{s×h} and the knowledge-to-question attention A_QF ∈ R^{s×h}. Finally, the question-aware knowledge textual representation T_F ∈ R^{s×h} is computed with a projection W_proj ∈ R^{4h×h}, and max pooling and mean pooling are applied to T_F to generate the final knowledge representation T̃_F ∈ R^{2h}. In the output layer, we combine the textual representation T̃_O with the knowledge representation T̃_F. For each candidate answer O_i, we compute the loss with W_out ∈ R^{1×5h}: a simple feed-forward classifier takes the contextualized representation T_C as input and outputs the answer score Score(O_i | E, Q, F). The candidate with the highest score is chosen as the answer. The final loss is the cross-entropy over all training examples, where C is the number of training examples, Ô_i is the ground truth for the i-th example, and θ denotes all trainable parameters.
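The knowledge injection attention can be sketched as follows. The paper omits the equations, so the exact forms here are assumptions: the similarity matrix is a plain dot product, and the knowledge-to-question attention A_QF uses a BiDAF-style max-pooled construction broadcast back to (s, h).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def knowledge_injection(H_F, H_Q, W_proj):
    """Bidirectional attention between knowledge text and question.
    H_F: (s, h) knowledge-text states; H_Q: (n, h) question states;
    W_proj: (4h, h) projection. Returns the pooled (2h,) representation."""
    M = H_F @ H_Q.T                              # similarity matrix (s, n)
    A_fq = softmax(M, axis=1) @ H_Q              # question-to-knowledge (s, h)
    # BiDAF-style knowledge-to-question attention, broadcast to (s, h)
    a = softmax(M.max(axis=1))                   # (s,)
    A_qf = np.tile(a @ H_F, (H_F.shape[0], 1))   # (s, h)
    # fuse and project: [H_F; A_fq; H_F∘A_fq; H_F∘A_qf] W_proj -> (s, h)
    T_F = np.concatenate(
        [H_F, A_fq, H_F * A_fq, H_F * A_qf], axis=1) @ W_proj
    # max + mean pooling -> final knowledge representation (2h,)
    return np.concatenate([T_F.max(axis=0), T_F.mean(axis=0)])
```

The returned vector plays the role of T̃_F, which is concatenated with the option representation before the final scoring layer.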

Augmenting with Path Information
For the concepts in the question and options (entities that are not diseases, drugs, or symptoms are removed), we combine them in pairs and retrieve all paths between them within 3 hops to form a sub-graph for the option. For example, (chronic hepatitis B → related diseases → cirrhosis → medical treatment → entecavir) is a path for the pair (chronic hepatitis B, entecavir). Then, we apply an L-layer graph convolutional network (Kipf and Welling, 2017) to update the node representations, similar to (Lin et al., 2019). Here, we set L = 2. The vector h_i^(0) ∈ R^h for concept c_i in the sub-graph g is initialized as the average embedding of its tokens, as in §3.2. We then update the representations at the (l+1)-th layer with the GCN propagation rule, where N_i denotes the neighboring nodes of node i, σ is the ReLU activation function, and W_gcn is the weight matrix. After that, we update the i-th token representation t_i ∈ T_O with the corresponding entity vector via a sigmoid gate to obtain the new token representation t'_i.
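One GCN layer on the sub-graph can be sketched as below, using the standard propagation rule of Kipf and Welling (2017); the symmetric normalization with self-loops is that paper's formulation and is assumed here, since the source omits the equation.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph-convolution layer for sub-graph node updates.
    H: (n, h) node features; A: (n, n) adjacency matrix; W: (h, h) weights.
    Implements ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # symmetric normalization
    return np.maximum(0.0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)
```

Stacking this layer L = 2 times yields the node vectors that are gated into the corresponding token representations of T_O.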

Dataset
We use the National Licensed Pharmacist Examination in China as the source of questions. The exam comprehensively evaluates candidates' professional skills, and medical practitioners must pass it to obtain the licensed pharmacist qualification in China. Passing requires at least 60% of the total score. The pharmacy comprehensive knowledge and skills part of the exam consists of 600 multiple-choice problems over four categories. To test the generalizability of MRC models, we use the examples from this part in the previous five years (2015-2019) as the test set, excluding questions of the multiple-answer type. In addition, we collected over 24,000 problems from the Internet and exercise books. After removing duplicates and incomplete questions (e.g., those without answers), we randomly divide them into training and development sets, and remove problems similar to the test set, defined as those with a normalized edit distance of less than 0.1 to some test question. The detailed statistics of the final problem set, named NLPEC, are shown in Table 2.

We use the official exam guide book of the National Licensed Pharmacist Examination as the text source (NMPA, 2018). It has 20 chapters, covering pharmaceutical practice and medication, self-medication for common diseases, medication for organ system diseases, and more, and covers most of the necessary content of the examination. To ensure retrieval quality, we first convert it into a structured electronic version with OCR tools, then manually proofread and divide all texts into paragraphs. We also extract passages from other literature and add them to the text source, including the pharmacological effects and clinical evaluation of various drugs, explanations of drug monitoring, and descriptions of essential medicines.
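The leakage filter described above can be sketched as follows. This uses `difflib`'s similarity ratio as a stand-in for the paper's normalized edit distance (the exact metric and normalization are assumptions), dropping any training candidate within distance 0.1 of a test question.

```python
import difflib

def dedup_against_test(candidates, test_set, threshold=0.1):
    """Drop candidate questions too similar to any test question.
    Distance = 1 - SequenceMatcher ratio (a stand-in for normalized
    edit distance); keep a candidate only if it is farther than
    `threshold` from every test question."""
    kept = []
    for q in candidates:
        too_close = any(
            1.0 - difflib.SequenceMatcher(None, q, t).ratio() < threshold
            for t in test_set)
        if not too_close:
            kept.append(q)
    return kept
```

Such a filter guards against near-duplicate exam questions leaking from the crawled training pool into the held-out real-exam test set.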

Experiment Settings
We use the Google-released BERT-base model as the PLM, and also report KMQA with the pre-trained RoBERTa-large model; the pre-trained weights are the whole-word-masking versions for Chinese text (Cui et al., 2019). Our model is orthogonal to the choice of pre-trained language model. We use the AdamW optimizer (Loshchilov and Hutter, 2019) with a batch size of 32 for model training. The initial learning rate, the maximum sequence length, the learning rate warmup proportion, the gradient accumulation steps, the training epochs, the hidden size h, an additional optimizer hyperparameter, the number of evidence spans N, and the hyperparameter K are set to 3×10^-5, 512, 0.1, 8, 10, 768, 1×10^-6, 1, and 3, respectively. The hyperparameters are selected based on the best performance on the development set. Our model takes approximately 22 hours to train on 4 NVIDIA Tesla V100 GPUs. To reduce memory usage, our implementation concatenates the knowledge text and the retrieved evidence spans and then obtains separate encoding representations. For the non-PLM baselines, the word embedding dimension is 200, the hidden size is 256, and the optimizer is Adam (Kingma and Ba, 2015); we pre-trained their word embeddings on large-scale Chinese medical text.

Main Results
The comparison between our method and previous works on the multiple-choice question answering task over our dataset is shown in Table 3. The IR baseline selects answers using the ranking scores of the retrieval system, and random guess selects answers from a uniform distribution. The third to fifth rows show the results of previous state-of-the-art models. These models all employ co-matching and perform better than the two baselines: they use attention mechanisms to capture the correlations among retrieved evidence, questions, and candidate answers, and tend to choose the answer closest in semantics to the evidence. Pre-trained language models with fine-tuning achieve more than 18% improvement over the baselines. Fusing the knowledge source with text over BERT-base improves performance further, which supports our assumption that incorporating knowledge from the structured source enhances the option-level contextual understanding of BERT-base. Furthermore, our single model KMQA-RoBERTa-large, which employs the RoBERTa-large model pre-trained with whole word masking, achieves better performance on both the development and test sets and also outperforms plain RoBERTa-large; this result slightly surpasses the human passing score. These results demonstrate the effectiveness of our method. In the exam, the questions are divided into three types: type A (statement best choice), type B (best compatible choice), and type C (case summary best choice). The evaluation results are listed in Table 4. We observe that the best compatible choice type accounts for the highest proportion of questions, and model performance on it is lower than on the other two. According to the different methods required for answering, we further divide the questions into three types: conceptual knowledge, situational analysis, and logical reasoning.
Conceptual-knowledge questions account for a large share and are usually tied to specific concept knowledge, which suggests that our retrieval module also needs improvement. According to whether a question must be reasoned in a positive or negative direction, we further divide the problems into positive questions and negative questions. We find that performance on the two is similar, but positive questions account for a significantly larger proportion.

Ablation Study
To study the effect of each KMQA component, we conduct ablation experiments; the results are shown in Table 5. Without any external information, i.e., using only questions and options, the model is only 2.5% better than the retrieval baseline. Adding the information retrieved by the text retrieval model and the knowledge graph improves the model by 26.3% and 6.4%, respectively, which shows the effectiveness of external information. Further, we find that pre-training on relation classification also improves our downstream QA task. When the path information from question to options is added, accuracy improves by a further 0.8%. If we use only the snippets retrieved from the reference books with the co-attention mechanism, performance drops more substantially. We also vary the hyperparameter K, and the results show that K = 3 performs best; due to the maximum input length of the BERT model, a larger K does not bring further improvements.

Case Study
As shown in Table 6, we choose examples to visualize joint reasoning over the KG and retrieved text. In Example 1 of Table 6, we find that, limited by the retrieval process, some of the retrieved descriptions of the option's indications are not fully relevant to the question stem, and the paragraphs contain descriptions of the drug's chemical composition, which is noise for answering the question. In contrast, our model answers this question using both the KG and the textual evidence, alleviating the noise problem to some extent. Since many questions in our dataset concern diseases and drugs whose underlying meanings must be described, the medical KG is a particularly convenient resource for our study.
Noisy Evidence: In 32% of the errors, the model is misled by noisy knowledge associated with wrong answers, possibly because the context is too long and overlaps with the problem description. For example, in Example 2 of Table 6, both the correct answer and the wrong prediction could plausibly be selected from the retrieved evidence. However, a human can intuitively obtain the answer through mutual verification of the essential information in the KG and the retrieved texts.
Weak Reasoning Ability: 14% of the errors are due to the model's weak reasoning ability, such as understanding symbolic units in options. For example, in Example 3 of Table 6, the model needs to first understand the joint meaning of the options using common sense, and then eliminate the wrong answer through counterfactual reasoning over knowledge and text.
Numerical Analysis: 10% of the errors come from mathematical calculation and analysis questions. The model cannot properly handle a question like "To prepare 1000 ml of 70% ethanol with 95% ethanol and distilled water, what volume of 95% ethanol is needed?", since the answer is not directly entailed by the given paragraph; instead, it requires mathematical calculation and reasoning.
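The calculation such questions require is a simple dilution identity, C1·V1 = C2·V2, which plain span extraction cannot produce:

```python
def dilution_volume(target_ml, target_pct, stock_pct):
    """Volume of stock solution needed for a dilution (C1*V1 = C2*V2).
    target_ml: desired final volume; target_pct / stock_pct: concentrations
    as fractions (e.g. 0.70 for 70%)."""
    return target_ml * target_pct / stock_pct

# 1000 ml of 70% ethanol from 95% stock: about 736.8 ml of stock,
# topped up to 1000 ml with distilled water.
volume_needed = dilution_volume(1000, 0.70, 0.95)
```

Answering such a question thus requires recognizing the underlying formula and executing arithmetic, not retrieving a text span.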

Conclusion
In this work, we explore how to solve multiple-choice reading comprehension tasks in the medical field based on licensed pharmacist examination problems, and propose a novel model, KMQA. It explicitly combines knowledge and pre-trained models in a unified framework. Moreover, KMQA implicitly exploits factual information by learning from an intermediate task, and also transfers structural knowledge to enhance entity representations. On the real-world test set, KMQA is the only single model that outperforms the human pass line. In the future, we will explore how to apply our model to more domains and how to enhance the interpretability of the reasoning path when the model answers questions.

A Compared Methods
BiDAF (Seo et al., 2017) is a representative network for machine comprehension. It is a multi-stage hierarchical process that represents context at different levels of granularity and uses a bidirectional attention flow mechanism to obtain a query-aware context representation without early summarization.
Co-matching (Wang et al., 2018) uses an attention mechanism to match options with a context composed of the passage and the question, and outputs attention values to score the options. It was designed for single-paragraph reading comprehension with single-answer questions.
Multi-Matching (Tang et al., 2019) applies Evidence-Answer Matching and Question-Passage-Answer Matching modules to gather matching information and integrates them to obtain the option scores.
SeaReader  is proposed to answer questions in clinical medicine using knowledge extracted from publications in the medical domain. The model extracts information with question-centric attention, document-centric attention, and cross-document attention, and then uses a gated layer for denoising.
BERT achieves remarkable state-of-the-art performance across a wide range of related tasks, such as textual entailment, natural language inference, and question answering.

(Relation classification data: # knowledge facts — TRAIN: 1,129,780; DEV: 50,000; TEST: 50,000.)
RoBERTa builds on BERT's language masking strategy and modifies key hyperparameters, including changing BERT's next sentence prediction objective and training with larger batch sizes and learning rates. It achieves better results than BERT on various datasets.
ERNIE (Sun et al., 2019) is designed to learn language representation enhanced by knowledge masking strategies, which includes entity-level masking and phrase-level masking. It achieves state-of-the-art results on five Chinese natural language processing tasks.

B Relation Classification
We also present the dataset used for pre-training on the relation classification task and the performance of pre-trained models on this task. We compare several common text classification and matching models, including TextCNN (Kim, 2014), ESIM (Chen et al., 2017), and DPCNN (Johnson and Zhang, 2017). For the text classification models, the input is the concatenation of the two entity words; for ESIM, the output layer is a softmax multi-class classifier over relation labels. Through learning on the relation classification task, pre-trained models achieve improved performance on the held-out test set.

C Introduction to Exam
The detailed statistics of exams in recent years are listed in Table 8. The professional qualification for licensed pharmacists is subject to a nationally unified outline, unified propositions, and a unified organized examination system (Fang et al., 2013). The qualification exam for licensed pharmacists is held every October. The examination takes