ChiMed: A Chinese Medical Corpus for Question Answering

Question answering (QA) is a challenging task in natural language processing (NLP), especially when it is applied to specific domains. While models trained in the general domain can be adapted to a new target domain, their performance often degrades significantly due to domain mismatch. Alternatively, one can rely on a large amount of domain-specific QA data, but such data are rare, especially for the medical domain. In this study, we first collect a large-scale Chinese medical QA corpus called ChiMed; second, we annotate a small fraction of the corpus to check the quality of the answers; third, we extract two datasets from the corpus and use them for the relevancy prediction task and the adoption prediction task. Several benchmark models are applied to the datasets, producing good results for both tasks.

In the medical domain, most medical QA corpora consist of scientific articles, such as BioASQ (Tsatsaronis et al., 2012), emrQA (Pampari et al., 2018), and CliCR (Šuster and Daelemans, 2018). Although some studies were done for conversational datasets (Wang et al., 2018a,b), corpora designed for community QA are extremely rare. Meanwhile, given that many online medical service forums have emerged (e.g., MedHelp), there are increasing demands from users to search for answers for their medical concerns. One might be tempted to build QA corpora from such forums. However, in doing so, one must address a series of challenges such as how to ensure the quality of the derived corpus despite the noise in the original forum data.
In this paper, we introduce our work on building a Chinese medical QA corpus named ChiMed by crawling data from a big Chinese medical forum. In the forum, the questions are asked by web users and all the answers are provided by accredited physicians. In addition to (Q, A) pairs, the corpus contains rich information such as the title of the question, key phrases, age and gender of the user, the name and affiliation of the accredited physicians who answer the question, and so on. As a result, the corpus can be used for many NLP tasks. In this study, we focus on two tasks: relevancy prediction (whether an answer is relevant to a question) and adoption prediction (whether an answer will be adopted).
Table 1: Statistics of ChiMed with respect to the number of answers (As) per question (Q).
# of As per Q | # of Qs | % of Qs
1 | 5,517 | 11.8
2 | 39,098 | 83.7
≥3 | 2,116 | 4.5
Total | 46,731 | 100.0

The ChiMed Corpus
To benefit NLP research in the medical domain, we create a Chinese medical corpus (ChiMed). This section describes how the corpus was constructed, the main content of the corpus, and its potential usage.

Data Collection
Ask39 is a large Chinese medical forum where web users (to avoid confusion, we will call them patients) can post medical questions and receive answers provided by licensed physicians. Each question, together with its answers and other related information (e.g., the names of physicians and similar questions), is displayed on a page (aka a QA page) with a unique URL. Currently, approximately 145 thousand forum-verified physicians have joined the forum to answer questions and there are 17.6 million QA pages. We started with fifty thousand URLs from the URL pool and downloaded the pages using the selenium package. After removing duplicates and pages with no answers, 46,731 pages remain, and most of the questions (83.7%) have two answers (see Table 1).

QA Records
From each QA page, we extract the question, the answers and other related information, and together they form a QA record. Table 2 displays the main part of a QA record, which has five fields that are most relevant to this study: (1) "Department" indicates which medical department the question is directed to; (2) "Title" is a brief description of disease/symptoms (5-20 characters); (3) "Question" is a health question with a more detailed description of symptoms (at least 20 characters); (4) "Keyphrases" is a list of phrases related to the question and the answer(s); (5) the last field is a list of answers, and each answer has an Adopted flag indicating whether it has been adopted. Among the five fields, Title and Question are entered by patients; Answers are provided by physicians; Department is determined by the forum engine automatically when the question is submitted. As for the Keyphrases field and the Adopted flag, it is not clear to us whether they are created manually (if so, by whom) or generated automatically. In addition to these fields, a QA record also contains other information such as the name and affiliation of the physicians who answer the question, the patient's gender and age, etc. Table 3 shows the statistics of ChiMed in terms of QA records. On average, each QA record contains one question, 1.96 answers, and 4.48 keyphrases. Overall, 69.1% of the answers in the corpus are flagged as adopted.
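As a rough illustration of the record layout described above, the following sketch models a QA record in Python. The class and field names (`QARecord`, `Answer`, `adopted`, and so on) are hypothetical and chosen for readability; they are not the identifiers used in the corpus itself, and the example values are invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Answer:
    text: str
    adopted: bool  # the Adopted flag attached to each answer


@dataclass
class QARecord:
    department: str   # assigned automatically by the forum engine
    title: str        # brief description entered by the patient
    question: str     # detailed description entered by the patient
    keyphrases: List[str] = field(default_factory=list)
    answers: List[Answer] = field(default_factory=list)

    def num_adopted(self) -> int:
        """Count the answers whose Adopted flag is set."""
        return sum(1 for a in self.answers if a.adopted)


# Toy record with invented content, for illustration only.
record = QARecord(
    department="Dermatology",
    title="Curly hair",
    question="How should I take care of naturally curly hair?",
    keyphrases=["curly hair"],
    answers=[Answer("first answer", True), Answer("second answer", False)],
)
print(len(record.answers), record.num_adopted())
```

Keeping the Adopted flag on each answer, rather than on the record, mirrors the description above, where a record may have zero, one, or several adopted answers.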

Potential Usage of the Corpus
Given the rich content of the QA record, ChiMed can be used in many NLP tasks. For instance, one can use the corpus for text classification (to predict the medical department that a Q should be directed to), text summarization (to generate a title given a Q), keyphrase generation (to generate keyphrases given a Q and/or its As), answer ranking (to rank As for the same Q, if adopted As are indeed better than unadopted As), and question answering (retrieve/generate As given a Q).
Because the content of the corpus comes from an online forum, before we use the corpus for any NLP task, it is important to check the quality of the corpus with respect to that task. As a case study, for the rest of the paper, we will focus on three closely related tasks, all taking a question and an answer (or a set of answers) as the input: the first one determines whether the answer is relevant to the question; the second determines whether the answer will be adopted for the question (as indicated by the Adopted flag in the corpus); the third one ranks all the answers for the question if there is more than one answer. We name them the relevancy task, the adoption prediction task, and the answer ranking task, respectively. The first two are binary classification tasks, while the last one is a ranking task. In the next section, we will manually check a small fraction of the corpus to determine whether its quality is high for those tasks.

Relevancy, Answer Ranking, and Answer Adoption
Given ChiMed, it is easy to synthesize a "labeled" dataset for the relevancy task. For example, given a question, we can treat answers in the same QA record as relevant, and answers in other QA records as irrelevant. The quality of such a synthesized dataset will depend on how often answers in a QA record are truly relevant to the question in the same record. For the adoption prediction task, we can directly use the Adopted flag in the QA records. For the answer ranking task, the answers in a QA record are not ranked. However, if adopted answers are often better than unadopted answers, the former can be considered to rank higher than the latter if both answers come from the same QA record.
Among the QA records with exactly two answers, 65.46% have exactly one adopted answer and 34.46% have two adopted answers. We can use these 65.46% of QA records as a labeled dataset for the answer ranking task. However, the quality of such a dataset will depend on the correlation between the Adopted flag and the high quality of an answer.
To evaluate whether the answers are relevant to the question in the same QA record, and whether adopted answers are better than unadopted ones, we randomly sampled QA records containing exactly two answers, and picked 60 records with exactly one adopted and one unadopted answer (called Subset-60) and 40 records with both answers adopted (called Subset-40). The union of Subset-60 and Subset-40 is called Full-100, and it contains 100 questions and 200 answers (140 answers are adopted and 60 are not).

Annotating Relevancy and Answer Ranking
To determine the quality of ChiMed, we manually added two types of labels to each QA record in the Full-100 set. The first is the relevancy label, indicating whether an answer is relevant to a question (i.e., whether the answer field provides a satisfactory answer to the question). There are four possible values as shown in the top part of Table 5.

Table 5: Labels and part of the annotation guidelines for relevancy and ranking annotation.
Possible Relevancy Labels for a (Q, A) pair:
1: The A fully answers the Q
2: The A partially answers the Q
3: The A does not answer the Q
4: Cannot tell whether the A is relevant to Q
Possible Ranking Labels for one Q and two As:
1: The first A is better
2: The second A is better
3: The two As are equally good
4: Neither of As is good (fully answers the Q)
5: Cannot tell which A is better
Properties of Good As:
1: Answer more sub-questions
2: Analyze symptoms or causes of disease
3: Offer advice on treatments or examinations
4: Offer instructions for drug usage
5: Soothe patients' emotions
Properties of Bad As:
1: Answer the Q indirectly
2: The A has grammatical errors
3: Offer irrelevant information

The second type of labels ranks the two answers for a question. Sometimes, determining which answer is better can be challenging, especially when both answers are relevant. Intuitively, people tend to prefer answers that address the question directly, that are easy to understand while supported by evidence, etc. Based on such intuition, we create a set of annotation guidelines, parts of which are shown in the second half of Table 5. Because both types of annotation may require medical expertise, we include a Cannot tell label (label "4" for relevancy annotation and label "5" for ranking annotation) for non-expert annotators to mark difficult cases.

Inter-annotator Agreement on Relevancy and Answer Ranking
We hired two annotators without medical background to first annotate the Full-100 set independently and then resolve any disagreement via discussion. The results in terms of percentage and Cohen's Kappa are in Table 6. Inter-annotator agreement on the relevancy label is quite high (90.5% in percentage and 55.6% in kappa), while the agreement on the ranking label is much lower (62.0% in percentage and 43.0% in kappa). Table 7a and Table 7b show the confusion matrices of the two annotators on the relevancy annotation and ranking annotation, respectively. Out of four relevancy labels and five ranking labels, relevancy label "3" and ranking label "4" are rare as most answers in the corpus are relevant; relevancy label "4" and ranking label "5" are also rare, but they do occur as sometimes choosing the relevant/better answer requires medical expertise.

Table 6: Inter-annotator agreement for relevancy and ranking labeling on the Full-100 set in terms of percentage (%) and Cohen's Kappa (κ). I and II refer to the annotations by the two annotators before any discussion, and Agreed is the annotation after the annotators have resolved their disagreement.

Table 8: An example where one annotator thinks the two answers are equally good because they both answer the question informatively. The other annotator thinks A1 is better because it tells the patient how to take care of his/her hair in daily life, although A1 provides less analysis of the causes of the symptom. After discussion, the two annotators reach an agreement that advice on daily care is very important and thus A1 is better than A2.
A1 (English translation): There are two reasons for curly hair: one is congenital natural curly hair; the other is caused by inadvertently acquired, such as perming or dyeing hair. Congenital straightening or chemical straightening can only be temporary. In addition to the shampoo and hair care products need to be adjusted, avoid using a hot hair dryer. Be careful when combing your hair. Do not use a headband or rubber band hairpin to prevent hair strain.
A2: 自然卷是一种受遗传因素影响的发型。头发自然卷成一卷。形成的原因是由于人类基因的不同。卷发并不是一件坏事。这种自卷曲的类型是药物无法改变的。如果拉直用的是直板，离子是热的，经过熨烫，一段时间后，它就会回到原来的状态。
A2 (English translation): Natural rolls are a type of hair that is affected by genetic factors. The hair is naturally rolled into a roll. The reason for the formation is due to differences in human genes. Curly hair is not a bad thing. This type of self-curling is that the drug cannot be changed. If the way of straightening is straight, the ions are hot and after ironing, after a while, it will return to its original state.

For ranking annotation, disagreement tends to occur when the two answers are very similar. That is why the majority of disagreed annotations (22 out of 38) occur when one annotator chooses one answer to be better while the other annotator considers the two answers to be equally good (an example is given in Table 8). There are 13 examples where annotators have completely opposite annotation (e.g., one annotates "1" while the other annotates "2"), which further shows the difficulty in identifying which answer is better.
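For readers unfamiliar with the two agreement measures reported in Table 6, the sketch below computes the raw agreement rate and Cohen's kappa from two annotators' label lists. The function name and the toy labels are illustrative assumptions; they are not drawn from the Full-100 annotations.

```python
from collections import Counter
from typing import Sequence, Tuple


def agreement_and_kappa(labels_a: Sequence[int],
                        labels_b: Sequence[int]) -> Tuple[float, float]:
    """Return (observed agreement, Cohen's kappa) for two annotators."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # degenerate case: both always use one label
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Toy ranking labels for six questions (invented values).
toy_i = [1, 1, 2, 3, 1, 2]
toy_ii = [1, 2, 2, 3, 1, 1]
obs, kappa = agreement_and_kappa(toy_i, toy_ii)
print(f"agreement={obs:.3f}, kappa={kappa:.3f}")
```

Kappa discounts the agreement expected by chance, which is why it can be much lower than the raw percentage when one label dominates, as with the relevancy labels here.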

The Adopted flag in ChiMed
As is mentioned above, each answer in ChiMed has a flag indicating whether or not the answer has been adopted. While we do not know the exact meaning of the flag and whether the flag is set manually (e.g., by the staff at the forum) or automatically (e.g., according to factors such as the physicians' past performance or seniority), we would like to know whether the flag is a good indicator of relevant or better answers. Among four relevancy labels, we regard answers with label "1" or "2" as relevant answers because they fully or partially answer the question, and answers with label "3" or "4" as irrelevant answers. Table 9 shows that 98.0% of the answers in the Full-100 set are considered to be relevant, according to the Agreed relevancy annotation. In other words, approximately 98% of (Q, A) pairs in the corpus are good question-answer pairs. On the other hand, the adopted answers are not more likely to be relevant to the question than the unadopted ones. Therefore, the Adopted flag is not a good indicator of an answer's relevancy.

Table 9: The Adopted flag vs. relevancy label on the Full-100 set. Here, answers with relevancy label "1" or "2" are regarded as relevant answers.

The next question is whether adopted answers tend to be better answers than unadopted ones. If so, we can use the Adopted flag to infer ranking labels as follows: if a QA record in the Full-100 set has exactly one adopted answer, we rank that answer higher than the unadopted one in the same record; if both answers in a QA record are adopted, they are considered to be equally good. Table 10 shows such inferred labels do not correlate well with human annotation. In fact, the correlation between inferred labels and the Agreed human annotation is only 0.068, when we use the 97 QA records with ranking label "1", "2", or "3". Therefore, the Adopted flag is not a good indicator for better answers.

Table 10: (a) Agreements between the ranking labels from annotators (I, II, and Agreed) and the labels induced from the adopted flag (Adopted). The Subset-60 is the subset of the Full-100 set where each question has exactly one adopted answer and one unadopted answer (See Section 3). (b) Confusion matrix between the agreed human annotation and ranking labels induced from the adopted flag. The meanings of the five labels are explained in Table 5.

So far we have demonstrated that the Adopted flag is not a good indicator for relevant or better answers. So what does the Adopted flag really indicate? While we are waiting for responses from the Ask39 forum, there are two possibilities. One is that the flag is intended to mean something totally different from relevant or better answers. The other possibility is that the flag intends to mark relevant or better answers but their criteria for relevant or better answers are very different from ours. Table 11 shows a (Q, A) pair, where the answer is adopted. On the one hand, the answer does not directly answer the question. On the other hand, it does provide some useful information about gallstone, and one can argue that the adopted flag in the original corpus is plausible.
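The rule for inferring ranking labels from the Adopted flag, described earlier in this section, can be sketched as follows. The label values follow Table 5; the function names and the toy data are hypothetical. The simple agreement rate computed here is only one way to compare inferred and human labels, and it is not necessarily the correlation measure reported above.

```python
from typing import List, Optional


def infer_ranking_label(first_adopted: bool, second_adopted: bool) -> Optional[int]:
    """Map the two Adopted flags of a QA record to a ranking label from Table 5."""
    if first_adopted and second_adopted:
        return 3  # both adopted: treated as equally good
    if first_adopted:
        return 1  # only the first answer is adopted: first A is better
    if second_adopted:
        return 2  # only the second answer is adopted: second A is better
    return None  # neither adopted: the rule in the text does not cover this case


def agreement_rate(inferred: List[Optional[int]], human: List[int]) -> float:
    """Fraction of records where the inferred label equals the human label."""
    matches = sum(1 for i, h in zip(inferred, human) if i == h)
    return matches / len(human)


# Toy flags and human labels (invented values, not taken from Full-100).
flags = [(True, False), (False, True), (True, True), (True, False)]
inferred_labels = [infer_ranking_label(f, s) for f, s in flags]
human_labels = [1, 1, 3, 2]
rate = agreement_rate(inferred_labels, human_labels)
print(inferred_labels, rate)
```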

Two Datasets from ChiMed
As shown in Table 9, the majority of answers in ChiMed are relevant to the questions in the same QA records. To create a dataset for the relevancy task, we start with the 25,594 QA records which have exactly one adopted and one unadopted answer (see Table 4). Next, we filter out QA records whose questions or answers are too long or too short, because very short questions or answers tend to lack crucial information, whereas very long ones tend to include much redundant or irrelevant information. Specifically, we remove a QA record if it contains a question or answer whose character-based length ranks in the top 1% or bottom 1% of all questions or answers. The remaining dataset contains 24,940 QA records. We divide it into training, development, and testing sets with proportions of 80%/10%/10% and call the dataset ChiMed-QA1. Since each QA record has one adopted and one unadopted answer, we will use the dataset to train an adoption predictor.
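A minimal sketch of the length filter and the 80%/10%/10% split is given below. The record layout (a dict with `question` and `answers` keys), the way the 1% cutoffs are computed, and the random shuffling with a fixed seed are all assumptions made for illustration; the exact procedure used to build ChiMed-QA1 may differ.

```python
import random
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]  # hypothetical layout: {"question": str, "answers": List[str]}


def length_bounds(lengths: List[int], tail: float = 0.01) -> Tuple[int, int]:
    """Return (low, high) cutoffs leaving roughly `tail` of the values outside each end."""
    ordered = sorted(lengths)
    k = int(len(ordered) * tail)
    return ordered[k], ordered[len(ordered) - 1 - k]


def filter_by_length(records: List[Record], tail: float = 0.01) -> List[Record]:
    """Drop records containing a question or answer of extreme character length."""
    q_low, q_high = length_bounds([len(r["question"]) for r in records], tail)
    a_low, a_high = length_bounds([len(a) for r in records for a in r["answers"]], tail)
    kept = []
    for r in records:
        q_ok = q_low <= len(r["question"]) <= q_high
        a_ok = all(a_low <= len(a) <= a_high for a in r["answers"])
        if q_ok and a_ok:
            kept.append(r)
    return kept


def split_80_10_10(records: List[Record], seed: int = 0):
    """Shuffle and split into training, development, and testing sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * 0.8)
    n_dev = int(len(shuffled) * 0.1)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])


# Toy corpus: 200 records with steadily increasing text lengths.
toy = [{"question": "q" * (i + 20), "answers": ["a" * (i + 30), "b" * (i + 40)]}
       for i in range(200)]
kept = filter_by_length(toy)
train, dev, test = split_80_10_10(kept)
print(len(kept), len(train), len(dev), len(test))
```

Note that a record is dropped as a whole if any one of its texts is an outlier, which matches the description above of removing QA records rather than individual answers.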
For the relevancy task, we need both positive and negative examples. We start with ChiMed-QA1, and for each QA record, we keep the adopted answer as a positive instance, and replace the unadopted answer with an adopted answer from another QA record randomly selected from the same training/development/testing subset; the task is then to distinguish relevant from irrelevant answers. We call this synthesized dataset ChiMed-QA2. We will use those two datasets for the adoption prediction task and the relevancy task (see the next section). We are not able to use the corpus for the answer ranking task as we cannot infer the ranking label from the Adopted flag.
Table 12 shows the statistics of the two datasets. The first three rows are the same for the two datasets; the average length of As in ChiMed-QA2 is slightly longer than that in ChiMed-QA1 because adopted answers tend to be longer than unadopted ones.
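The construction of ChiMed-QA2 can be sketched as below. The input is assumed to be the list of (question, adopted answer) pairs from a single split; the function name, the label encoding (1 for relevant, 0 for irrelevant), and the seeded sampling are hypothetical choices for illustration.

```python
import random
from typing import List, Tuple


def build_relevancy_pairs(adopted: List[Tuple[str, str]],
                          seed: int = 0) -> List[Tuple[str, str, int]]:
    """Create one positive and one negative (Q, A, label) triple per QA record.

    `adopted` holds (question, adopted_answer) pairs from one split only, so
    negatives never cross the training/development/testing boundary.
    """
    if len(adopted) < 2:
        raise ValueError("need at least two QA records to sample a negative")
    rng = random.Random(seed)
    pairs = []
    for i, (question, answer) in enumerate(adopted):
        pairs.append((question, answer, 1))  # own adopted answer: relevant
        j = rng.randrange(len(adopted) - 1)
        if j >= i:
            j += 1  # shift past index i so a record never supplies its own negative
        pairs.append((question, adopted[j][1], 0))  # borrowed answer: irrelevant
    return pairs


toy_split = [("q0", "a0"), ("q1", "a1"), ("q2", "a2"), ("q3", "a3")]
pairs = build_relevancy_pairs(toy_split)
for question, answer, label in pairs:
    print(question, answer, label)
```

Because negatives are drawn uniformly at random, most of them will be about unrelated conditions, which tends to make the resulting task easier than one with deliberately similar distractors.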

Experiments on Two Prediction Tasks
In this section, we use ChiMed-QA1 and ChiMed-QA2 (See Table 12) to build NLP systems for the adoption prediction task and the relevancy prediction task, respectively. Both tasks are binary classification tasks with the same type of input; the only difference is the meaning of class labels (relevancy vs. adopted flag). Therefore, we build a set of NLP systems and apply them to both tasks.

Systems and Settings
We implemented both CNN- and LSTM-based systems, and applied three state-of-the-art sentence matching systems (ARC-I, DUET, and DRMM) to the two tasks. [...] of word embeddings in a question and an answer.
We run our CNN- and LSTM-based systems under four different settings: (1) A-Only, where an answer is the only input (See Figure 1); (2) A-A, where both answers are input (See Figure 2); (3) Q-A, where a question and one of its answers are input (See Figure 3); (4) Q-As, where a question and both of its answers are input (See Figure 4). ARC-I, DUET, and DRMM are run under the settings of Q-A and Q-As, because the systems require a question to be one of the inputs. The reason we apply the A-Only and A-A settings to the adoption prediction task is that it helps identify whether features from an answer itself will contribute to its adopted flag assignment without knowing its question. To compare the relevancy task and the adoption prediction task, we also apply these two settings to the former task although they are not common settings in previous studies (Lai et al., 2018).
Word segmentation has always been a challenge in Chinese NLP, especially when it is applied to a particular domain ([...] Xia, 2012, 2013). Therefore, instead of word embeddings (Song et al., 2018), we use Chinese character-based embeddings to avoid word segmentation errors. We set the embedding size to 150. We use 155 and 245 as the lengths of questions and answers, respectively. Short texts are padded with blank characters. We use 32 filters with kernel size 3 for every CNN layer and we set the LSTM hidden size to 32. We apply a pooling size of 2 to all max pooling layers. In addition, the activation function of the output layers under the A-Only and Q-A settings is sigmoid, that of the output layers under the A-A and Q-As settings is softmax, and that of all other layers is tanh.
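The character-level input preparation described above can be sketched as follows. The lengths 155 and 245 and the blank padding character come from the text; truncating texts that exceed the fixed length, the vocabulary construction, and mapping unseen characters to the padding index are assumptions added for the sake of a runnable example.

```python
from typing import Dict, Iterable, List

Q_LEN = 155  # fixed question length, in characters
A_LEN = 245  # fixed answer length, in characters
PAD = " "    # short texts are padded with blank characters


def fit_length(text: str, length: int, pad: str = PAD) -> str:
    """Truncate (an assumption) or right-pad a text to exactly `length` characters."""
    return text[:length].ljust(length, pad)


def build_vocab(texts: Iterable[str]) -> Dict[str, int]:
    """Assign an integer index to every character; index 0 is the padding blank."""
    vocab = {PAD: 0}
    for text in texts:
        for ch in text:
            vocab.setdefault(ch, len(vocab))
    return vocab


def encode(text: str, length: int, vocab: Dict[str, int]) -> List[int]:
    """Convert a text into a fixed-length list of character indices."""
    # Unseen characters fall back to the padding index (an assumption).
    return [vocab.get(ch, 0) for ch in fit_length(text, length)]


question = "蜂蜜可以治疗咳嗽吗？"
answer = "建议在医生的指导下用药物治疗。"
vocab = build_vocab([question, answer])
q_ids = encode(question, Q_LEN, vocab)
a_ids = encode(answer, A_LEN, vocab)
print(len(q_ids), len(a_ids))
```

Working at the character level means no segmenter is needed: each Chinese character is looked up directly, and the resulting index lists can be fed to an embedding layer of size 150.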
In addition, noting that the two answers for the same question have opposite labels in both tasks, we evaluate all systems in terms of (Q, A) pair prediction accuracy with and without conflict resolution (CR), with which the model resolves conflicts when either two relevant/adopted answers or two irrelevant/unadopted answers are predicted. Because the activation function of the output layers under the A-A and Q-As settings is softmax and because there are always two answers for each question, systems under these two settings never generate conflicting predictions. We do not apply MAP (Mean Average Precision) (Lai et al., 2018) to the tasks because the number of candidate answers of each question in the datasets is limited to 2. Table 13 shows the experimental results of running the five predictors on the testing set under four different settings. There are a few observations.
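The text does not spell out how conflicts are resolved, so the sketch below shows one plausible reading: when a model assigns the same label to both answers of a question, the answer with the higher score is taken as the positive one. The function name, the 0.5 threshold, and the tie-breaking rule are assumptions, not a description of the actual evaluation code.

```python
from typing import Tuple


def resolve_conflict(score_1: float, score_2: float,
                     threshold: float = 0.5) -> Tuple[bool, bool]:
    """Return final (answer 1 positive, answer 2 positive) predictions.

    Exactly one of the two answers is positive in both datasets, so two equal
    predictions signal a conflict, resolved here in favor of the higher score.
    """
    pred_1 = score_1 >= threshold
    pred_2 = score_2 >= threshold
    if pred_1 != pred_2:
        return pred_1, pred_2  # no conflict: keep the raw predictions
    first_wins = score_1 >= score_2  # ties go to the first answer (an assumption)
    return first_wins, not first_wins


print(resolve_conflict(0.9, 0.7))  # both predicted positive
print(resolve_conflict(0.2, 0.4))  # both predicted negative
print(resolve_conflict(0.8, 0.1))  # already consistent
```

Under this reading, CR can only be applied to settings that score each answer independently (A-Only and Q-A), which is consistent with the observation that the softmax-based A-A and Q-As settings never produce conflicts.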

Experimental Results
First, for the relevancy task, by design only half of the (Q, A) pairs in ChiMed-QA2 come from the same QA records. When Q is not given as part of the input (Systems 1-4), it is impossible for the predictors to determine whether an answer is relevant; therefore, the system performances are no better than random guesses. In contrast, for the adoption prediction task, by design all the (Q, A) pairs in ChiMed-QA1 come from the same QA records, and according to Table 9 we also know that about 98% of the answers, regardless of whether they are adopted or not, are relevant. Therefore, the absence of Qs in Systems 1-4 does not affect system performance much.
Second, when both Q and A are present (Systems 5-9), the accuracy of relevancy prediction is higher than that of adoption prediction, because the former is an easier task (at least for humans). The only exception is ARC-I (System 7), whose results on relevancy are close to random guessing (50.34% and 50.60%) while its result on adoption is comparable with other systems. This is due to the way that ARC-I matches questions and answers. Because embeddings of a question and an answer are directly concatenated in ARC-I, Q-A similarity is not fully captured, leading to low performance on relevancy. In contrast, adoption prediction does not rely much on Q-A similarity (as explained above).
Third, for the relevancy task, systems that capture more features of Q-A similarity tend to have better results. For example, under the Q-A setting, DUET (System 8) outperforms CNN, LSTM and ARC-I (Systems 5-7) because DUET has an additional model of exact phrase matching between questions and answers. DRMM (System 9) performs better than DUET (System 8) because DRMM uses word embeddings instead of exact phrases when matching pairs of phrases between a question and an answer. In contrast, the performances of the five systems on the adoption task are very similar.
In addition, except for the relevancy task evaluated with CR, the contrast between Systems 10-14 and Systems 5-9 indicates that comparing two As helps predictors in both tasks: intuitively, knowing both answers helps decide which one is relevant or adopted. In contrast, for the relevancy task evaluated with CR, the same comparison indicates that comparing two As may hurt the relevancy predictors (Systems 5, 7, and 8), because relevancy is really a relation between Q and A, and its prediction might be affected by the presence of other As.
Finally, all the systems under the A-Only and Q-A settings (Systems 1-2 and 5-9) benefit from CR. It is also worth noting that previous studies (Lai et al., 2018) much more commonly run models under the Q-A setting and evaluate them without CR. Under this setting, the highest performance achieved is 93.60% (System 9). The score is not as high as we expected, and there is still room for improvement.

Error Analysis for Relevancy Prediction
We go through the errors of System 9 in the relevancy prediction task without CR and find three main types of errors. Note that we artificially build ChiMed-QA2 for the relevancy prediction task by keeping the adopted answer a of a question q and replacing the unadopted answer of q with an adopted answer a' from another question q'. We therefore regard a as a relevant answer of q and a' as an irrelevant answer of q (See Section 3.4).
The first type of error is that the answer a is actually irrelevant to the question q. In other words, the gold standard is wrong; system 9 does make a correct prediction. This is not surprising as there are around 2% irrelevant answers in the dataset according to our annotation (See Table 9). Second, the system fails to capture the relationship between a disease and a corresponding treatment. E.g., a patient describes his/her symptoms and asks for treatment. The doctor offers a drug directly without analyzing the symptoms and causes of disease. In that case, the overlap between the question and the answer is relatively low. The system therefore cannot predict the answer to be relevant without the help of a knowledge base.
Finally, it is quite common that a patient describes his/her symptoms at the beginning of the question q and asks something else at the end (e.g., whether drug X will help with his/her illness). In this case, if q' (the original question of the irrelevant answer a') describes similar symptoms, the system may fail to capture what exactly q wants to ask and therefore mistakes a' for a relevant answer. Table 14 gives an error of this type where q and q' describe similar diseases but in fact expect totally different answers.
Given the three types of errors, we find that the latter two are relatively challenging. They therefore require further exploration of how to model (Q, A) pairs in the relevancy prediction task. In addition, because current irrelevant answers are randomly sampled from the entire dataset, the current dataset does not include many challenging examples. This makes the relevancy prediction task appear easier than it could be. For future work, we plan to balance the easy and hard instances in the dataset by adding more challenging examples to ChiMed-QA2.

Table 14: An example where System 9 mistakes the irrelevant answer a' for a relevant answer. Both questions q and q' are talking about cold and cough, but they are totally different because q is asking whether honey is helpful for cough while q' is looking for treatment.
q: 我上周感冒咳嗽，现在感冒好了，但咳嗽更加厉害了。蜂蜜可以治疗咳嗽吗？
q (English translation): I had a cold and cough last week. Now, the cold has gone, but the cough is even worse. Can honey treat cough?
q': 我是支气管扩张患者，最近感冒病情加重。支气管扩张病人感冒怎么治疗？
q' (English translation): I am a patient with bronchiectasis. I have recently become worse with a cold. How to treat a cold for a bronchiectasis patient?
a': 正常的情况下，支气管病人如果感冒，就应该立即到医院就医，并在医生的指导下用药物治疗。如果耽误治疗的话病情会加重，而且会出现一些并发症。
a' (English translation): Normally, if a bronchial patient has a cold, he should go to the hospital immediately and take medication under the guidance of a doctor. If the treatment is delayed, the condition will worsen and complications will occur.

Conclusion and Future Work
In this paper, we present ChiMed, a Chinese medical QA corpus collected from an online medical forum. Our annotation on a small fraction of the corpus shows that the corpus is of high quality as approximately 98% of the answers successfully address the questions raised by the forum users. To demonstrate the usage of the corpus, we extract two datasets and use them for two prediction tasks. A few benchmark systems yield good performance on both tasks.
For future work, we are collecting data to expand the corpus and plan to add more challenging samples to the datasets. In addition, we plan to use ChiMed for other NLP tasks such as automatic answer generation, keyphrase generation, summarization, and question classification. We also plan to explore various methods of adding more annotations (e.g., answer ranking) to the corpus.