How You Ask Matters: The Effect of Paraphrastic Questions to BERT Performance on a Clinical SQuAD Dataset

Reading comprehension style question-answering (QA) based on patient-specific documents represents a growing area in clinical NLP with plentiful applications. Bidirectional Encoder Representations from Transformers (BERT) and its derivatives lead the state-of-the-art accuracy on the task, but most evaluation has treated the data as a pre-mixture without systematically looking into the potential effect of imperfect train/test questions. The current study seeks to address this gap by experimenting with full versus partial train/test data consisting of paraphrastic questions. Our key findings include 1) training with all pooled question variants yielded best accuracy, 2) the accuracy varied widely, from 0.74 to 0.80, when trained with each single question variant, and 3) questions of similar lexical/syntactic structure tended to induce identical answers. The results suggest that how you ask questions matters in BERT-based QA, especially at the training stage.


Introduction
In clinical NLP, there has been strong interest in developing question-answering (QA) systems, e.g., AskHERMES (Cao et al., 2011), MiPACQ (Cairns et al., 2011), and MEANS (Abacha and Zweigenbaum, 2015). One specific type of clinical QA aims to locate a suitable answer within a given document (a.k.a. reading comprehension), which is helpful for answering patient-specific questions based on information mentioned in clinical notes. Recently, BERT (Devlin et al., 2018) and its derivatives have achieved impressive success in this task for general English, represented by SQuAD (Rajpurkar et al., 2018), and have shown promising results for clinical text (Wen et al., 2020; Soni and Roberts, 2020).
However, an under-explored question in BERT-assisted QA is how the system behaves if the input question is asked in a different (paraphrastic) way. Most existing experiments have assumed that the train and test data belong to a closed space with pre-assembled syntactic and lexical diversity (i.e., paraphrastic questions) representing everything the users could ever ask, and have evaluated on train and test sets that are equally diverse, in a symmetric manner. In practice, there are at least two possible scenarios concerning the potential effect of a differently asked test question: 1) the system has been trained with, or has seen, the question construct; 2) the system has never seen the question construct during training. Here by "construct" we refer to paraphrases like: "why is the patient prescribed medication-X?" versus "why does the patient take medication-X?". More examples are in Table 1.
Ideally, a BERT QA model should provide a consistent answer as long as the user asks a semantically equivalent (paraphrastic) question. This is important in a production system because a user should not be required to ask only questions conforming to "template" constructs. Therefore, in this study we set out to understand how such paraphrastic perturbation in asking would affect a BERT-based QA model. We used a dataset that contained a finite set of question constructs, but purposefully designed experiments that used limited constructs in training and/or testing to simulate the asymmetric perturbations of interest, for example, training on only one question construct and testing on the other constructs (i.e., unseen ways of asking).
Our major findings can be summarized as follows:
1. Models trained with all pooled constructs still gave the best accuracy.
2. When training was limited to a single construct, certain constructs gave overall higher accuracy across all test constructs. Accuracy also varied depending on the test construct, but the effect was not as strong as that of the choice of training construct.
3. Certain test question constructs tended to induce identical answers, as revealed via a clustering analysis.
Pampari et al. (2018) created the emrQA corpus by template-based semantic extraction from the i2b2 NLP challenge datasets (i2b2 2019). emrQA includes more than 400,000 QA pairs and has served as a valuable resource in clinical QA research (Wen et al., 2020; Soni and Roberts, 2020). Most of the previous studies reported strong performance by BERT, especially when it was pretrained with domain-specific text, e.g., Clinical BERT (Alsentzer et al., 2019), which was trained with about 2 million clinical notes from the MIMIC-III database (Johnson et al., 2016). For compatibility with BERT-based reading comprehension QA, the SQuAD format is commonly adopted. Task-wise, SQuAD 2.0 (Rajpurkar et al., 2018) also introduced unanswerable questions that require QA systems to know when not to answer if no suitable evidence is present in the text.
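For readers unfamiliar with the format, the following minimal sketch builds one answerable and one unanswerable entry in the SQuAD 2.0 JSON layout. The context sentence, question wording, title, and ids are invented for illustration and are not taken from emrQA or the i2b2 data.

```python
import json

# Hypothetical clinical snippet (NOT from emrQA or i2b2).
context = "The patient was started on lisinopril for hypertension."
answer_text = "hypertension"

paragraph = {
    "context": context,
    "qas": [
        {   # Answerable question: the answer is a character span of the context.
            "id": "example-0",
            "question": "Why is the patient prescribed lisinopril?",
            "is_impossible": False,
            "answers": [
                {"text": answer_text, "answer_start": context.index(answer_text)}
            ],
        },
        {   # Unanswerable question: no supporting evidence in the context.
            "id": "example-1",
            "question": "Why is the patient prescribed metformin?",
            "is_impossible": True,
            "answers": [],
        },
    ],
}

dataset = {"version": "v2.0", "data": [{"title": "note-0", "paragraphs": [paragraph]}]}
serialized = json.dumps(dataset, indent=2)
```

The `is_impossible` flag is what distinguishes the unanswerable entries that SQuAD 2.0 added.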

Related work
Besides the relevant backgrounds above, there have been NLP studies that reported the effect of paraphrastic questions in QA system performance. Buck et al. (2017) and Dong et al. (2017) developed approaches to paraphrasing questions for optimal answer accuracy in retrieval-based QA, where candidate answers were searched and ranked from a large set of documents. The closest work for reading comprehension QA we identified was by Gan and Ng (2019), which investigated the effect of question variants on a general English SQuAD dataset. They demonstrated that unseen paraphrastic test questions hurt the accuracy of deep learning QA models, and proposed a countermeasure by pre-augmenting the training data with machine-generated paraphrastic questions.

Research questions
We designed our experiments around the following three research questions:
• How does the accuracy change when the model is trained with a pool of multiple question constructs versus with only a single construct?
• How does the accuracy vary across different training question constructs and across different test question constructs?
• Do some of the test question constructs tend to elicit similar answers from a trained model?

Dataset
We used emrQA as the base dataset and selected only the "why"-questions in this study due to our application research interest. Within the why-QA subset, we further considered three levels of QAs, from broad to specific:
All: all the why-QAs (see Appendix A).
Med: why-QAs about medication.
Q0~Q8: the 9 individual question constructs in the Med set, as elaborated in Table 1.
All the QAs were prepared in the SQuAD 2.0 format. The train/dev/test splits are detailed in Table 2, which also breaks down the answerable (HasAns) versus unanswerable (NoAns) QA counts. The dev partition was used to set the optimal threshold of "do not answer" before processing the final held-out test partition. Note that the numbers in column Qi were made identical across Q0~Q8, so there should be no bias from any construct being over-represented.
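The role of the dev partition can be illustrated with a small sketch of threshold selection. This is an assumption-laden simplification, not the procedure used in the study: each dev item is assumed to carry a hypothetical `na_score` (higher means the model leans toward not answering), a gold `answerable` flag, and the `f1` its predicted span would earn if the model did answer.

```python
def best_no_answer_threshold(dev_items):
    """Pick the threshold t maximizing mean dev score when the model
    abstains whenever na_score > t. Field names are illustrative."""
    scores = sorted({item["na_score"] for item in dev_items})
    # A candidate below the minimum means "always abstain";
    # the maximum observed score means "never abstain".
    candidates = [scores[0] - 1.0] + scores

    def mean_score(t):
        total = 0.0
        for item in dev_items:
            if item["na_score"] > t:
                # Abstaining is correct only for unanswerable questions.
                total += 0.0 if item["answerable"] else 1.0
            else:
                # Answering an unanswerable question earns nothing.
                total += item["f1"] if item["answerable"] else 0.0
        return total / len(dev_items)

    best_t = max(candidates, key=mean_score)
    return best_t, mean_score(best_t)


# Toy dev set with made-up numbers.
dev = [
    {"na_score": 0.1, "answerable": True, "f1": 1.0},
    {"na_score": 0.2, "answerable": True, "f1": 0.8},
    {"na_score": 0.6, "answerable": True, "f1": 0.5},
    {"na_score": 0.7, "answerable": False, "f1": 0.0},
    {"na_score": 0.9, "answerable": False, "f1": 0.0},
]
threshold, dev_score = best_no_answer_threshold(dev)
```

The chosen threshold is then frozen and applied unchanged to the held-out test partition.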

Training and evaluation
All of the models started from the pre-trained Clinical BERT, followed by a modest fine-tuning with 1,833 general English why-QAs from the SQuAD 2.0 corpus. On top of that, the experiments involved three parts ( , test on each Qi test set. This is basically a crossover between Q0~Q8. Each fine-tuning (or simply "training") was done with 10 epochs, batch_train_size=32, learning_rate=3e-5, and max_seq_length=128. The jobs were run on a Tesla V100 with compute capability 7.0 and 18 GB of memory.
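As a rough sketch of what a crossover between Q0~Q8 amounts to, the snippet below enumerates every train/test pairing of the nine constructs together with the reported hyperparameters. The dictionary keys are illustrative names, not the flags of any particular training script.

```python
from itertools import product

constructs = [f"Q{i}" for i in range(9)]  # Q0 .. Q8

# Values as reported in the text; key names are assumptions.
hparams = {
    "epochs": 10,
    "train_batch_size": 32,
    "learning_rate": 3e-5,
    "max_seq_length": 128,
}

# Every (train construct, test construct) combination.
crossover = [
    {"train": train_q, "test": test_q, **hparams}
    for train_q, test_q in product(constructs, constructs)
]

# Pairings where the test questions use the same construct as training.
matched = [run for run in crossover if run["train"] == run["test"]]
print(len(crossover), len(matched))
```

Only nine distinct models need to be trained for this grid; each is then evaluated on all nine test sets.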
The official SQuAD 2.0 evaluation script was used, and we primarily report accuracy as the F1-weighted overlap between the gold and system answers. As a semi-qualitative assessment of question similarity (in terms of the triggered model behavior), we computed the number of agreed answers (case-insensitive, with articles removed) between each pair of Qi test sets in experiment B above and performed hierarchical clustering to group the 9 question constructs.
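The sketch below re-implements, in simplified form, the SQuAD-style answer normalization and token-level F1, plus a helper that counts agreed answers between two aligned prediction lists. It follows the normalization order commonly used for SQuAD (lowercase, strip punctuation, drop articles, collapse whitespace); stripping punctuation in the agreement count is an assumption, since the text only mentions case and articles. `count_agreements` is a hypothetical helper name.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(gold, pred):
    """Token-overlap F1 between a gold answer and a predicted answer."""
    gold_tokens = normalize_answer(gold).split()
    pred_tokens = normalize_answer(pred).split()
    if not gold_tokens or not pred_tokens:
        # Both empty counts as agreement on "no answer".
        return float(gold_tokens == pred_tokens)
    overlap = sum((Counter(gold_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def count_agreements(answers_a, answers_b):
    """Number of positions where two aligned answer lists agree after normalization."""
    return sum(
        normalize_answer(a) == normalize_answer(b)
        for a, b in zip(answers_a, answers_b)
    )
```

The pairwise agreement counts produced this way can serve as the similarity input to a clustering step.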

Pooled training made stronger model
The model accuracies are reported in Figure 1, where Figure 1b specifically shows precision (positive predictive value, or PPV) on those HasAns QAs that were indeed answered by each model. It can be seen that the All model (blue line at the top) outperformed the Med model (orange line) and every individual Qi model, suggesting that training with additional non-medication why-QA entries still benefited the accuracy. The benefit is more apparent in PPV (Figure 1b) and fluctuates mildly across the test constructs Q0~Q8 (X-axis). In comparison to the individual Qi models, the pooled Med model also exhibits a clear advantage but with varying margins (elaborated in 4.2).
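To make the PPV restriction concrete, here is a minimal sketch under stated assumptions: each record holds a gold `gold_has_answer` flag, the model's `prediction` string (empty when it abstains), and a per-item `score` in [0, 1] such as token F1. These field names are hypothetical, and the exact scoring used for Figure 1b may differ.

```python
def ppv_on_answered(records):
    """Mean score over gold-answerable items that the model actually answered.
    Returns None when no such item exists."""
    answered = [
        r for r in records
        if r["gold_has_answer"] and r["prediction"].strip()
    ]
    if not answered:
        return None
    return sum(r["score"] for r in answered) / len(answered)


# Toy records with made-up values.
records = [
    {"gold_has_answer": True, "prediction": "hypertension", "score": 1.0},
    {"gold_has_answer": True, "prediction": "pain", "score": 0.5},
    {"gold_has_answer": True, "prediction": "", "score": 0.0},      # abstained: excluded
    {"gold_has_answer": False, "prediction": "fever", "score": 0.0},  # NoAns: excluded
]
ppv = ppv_on_answered(records)
```

Because abstentions are excluded, a model that often refrains from answering can rank differently on this measure than on overall accuracy.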

Accuracy varied depending on the question construct
The accuracy appears to be strongly affected by which specific question construct was used for training. For example, Q7 (cyan line in Figure 1) exhibits about a 4% drop compared to Med (orange), while Q1 (red) has a gap of 10% or more below Med. Manual inspection of 569 disagreements between Q1 and Q7 did show that the Q1 model frequently refrained from answering (141/569=25%) or gave irrelevant answers (286/569=50%). In addition, such question-dependent behavior changes again when we look at PPV specifically. For example, in Figure 1a the Q4 model (pink) performs comparably to the Q7 model, but in Figure 1b its relative rank drops to the middle tier, indicating that Q4 gave many incorrect answers.
Within each line (a trained model), the variation in accuracy across different test questions (up to ~2%) does not appear as drastic as that observed across models. However, one puzzling observation is that the peak accuracy within each line of Q0~Q8 is usually not where the train and test questions align (e.g., train on Q0, test on Q0).

Some questions were more likely to obtain same answers
The hierarchical dendrogram for clustering the question constructs is shown in Figure 2. When trained on the pool of Med constructs (i.e., the orange line in Figure 1a), some of the test question constructs turned out to yield closer answers than others. Specifically, Q0 and Q1 form a cluster (green in Figure 2), Q4 largely stands alone, and the others form another subtree (red) enclosing further sub-clusters. Some intuitive explanations can be derived by inspecting the lexical/syntactic contents of the questions. For example, the only difference between Q0 and Q1 is an additional "originally" in Q1. Likewise, a single switch of tense between "was" and "is" appears to account for the two tight clusters (Q2 and Q3) and (Q5 and Q6).
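A dendrogram of this kind can be derived from pairwise agreement counts. The sketch below is a plain-Python illustration using average linkage and distance = 1 - agreement rate; both choices are assumptions, as the linkage and distance actually used are not stated. The construct names `Qa`..`Qd` and the counts are invented and do not reproduce the study's results.

```python
def average_linkage(names, dist):
    """Agglomerative clustering; returns the merges in the order performed
    as (left members, right members, average distance)."""
    clusters = [[i] for i in range(len(names))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                d = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((
            sorted(names[i] for i in clusters[a]),
            sorted(names[i] for i in clusters[b]),
            d,
        ))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges


# Made-up agreement counts out of 10 test questions per construct.
names = ["Qa", "Qb", "Qc", "Qd"]
n_questions = 10
agree = {
    ("Qa", "Qb"): 9, ("Qa", "Qc"): 5, ("Qa", "Qd"): 4,
    ("Qb", "Qc"): 5, ("Qb", "Qd"): 4, ("Qc", "Qd"): 8,
}


def distance(i, j):
    if i == j:
        return 0.0
    key = (names[i], names[j]) if (names[i], names[j]) in agree else (names[j], names[i])
    return 1.0 - agree[key] / n_questions


dist = [[distance(i, j) for j in range(len(names))] for i in range(len(names))]
merges = average_linkage(names, dist)
```

In this toy input the two most-agreeing pairs merge first and are joined last, which is the same shape of reasoning used to read tight pairs off the dendrogram.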

Recap the rationale
Many would think it a trivial fact that different questions lead to varying answers. However, we believe it is worth a break-down analysis beyond the monolithic thinking of just "the more the better". No matter how well planned, one can always legitimately ask "what if" the training questions were not exhaustive and some real user threw in unexpected questions. Therefore, this study was meant to expose such behavior of a BERT-based QA model by putting it under the stress of partial train/test questions.

Featured findings, raised questions
Our findings did validate some trivial knowledge, such as that more robust models are gained by pooling diverse training data and that similar questions tend to elicit similar answers. On the other hand, we have findings to highlight: 1) a model's accuracy is determined strongly by what questions it is trained on and not as much by what test questions it is asked to answer, 2) there appear to be "better ways to ask", especially in training, that would yield generally higher accuracy. That said, our findings somewhat circle back to corroborating the common strategy that focuses on enriching the training diversity to achieve robust performance; in addition, the "good" questions need adequate presence.
The micro-level, quality-oriented observations could not be revealed without diving into those question-specific comparisons. For fundamental computational linguistics research, our findings point to an interesting direction to explore: why do certain question constructs (e.g., Q7 "Why does the patient take [medication]?") appear to be more transferable (at least accuracy-wise) after being learned by BERT? Methods for inspecting the attention mechanism under the hood might help, but we suspect that new approaches with even better interpretability likely need to be designed. Specifically, one phenomenon that puzzled us was that the peak accuracy did not always occur when a single-question model was tested on questions of the same construct it was trained on.

Limitations
Our experiments on why-QAs and the focus on medication-related questions could limit the generalizability. The diversity of the question variants was bound to what emrQA offered, and we do not know how that compares to the natural distribution of variants asked by humans. Besides, the emrQA corpus might have embedded noise and quality issues (Yue et al., 2020) that affected the results. Lastly, we still do not have explanations for many findings, and it is unclear whether BERT can represent other QA models, especially in terms of the question-specific behaviors.

Future work
The current study looked mainly into syntactic variants of the questions, and we will further research the independent or interactive effect of lexico-syntactic variants of concepts (e.g., medication) mentioned in both the question and the answer document. Based on our findings, we plan to experiment with optimizing the QA accuracy through ensemble approaches such as voting; the hypothesis is that the chance of achieving a convergent (correct) answer should be increased by asking the same question in different ways.
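One possible reading of the voting idea is sketched below, purely as an illustration of the hypothesis and not as something evaluated in this study: the same question is posed through several paraphrases, and the most frequent answer wins. The function name and the simple lowercase/strip normalization are assumptions; an empty string stands for "no answer".

```python
from collections import Counter


def vote(answers):
    """Majority vote over answers obtained from paraphrased questions.
    Returns (winning answer, fraction of paraphrases that agreed).
    Ties are broken in favor of the answer seen first."""
    normalized = [a.strip().lower() for a in answers]
    top_answer, top_count = Counter(normalized).most_common(1)[0]
    return top_answer, top_count / len(normalized)


# Hypothetical answers returned for four paraphrases of one question.
answer, support = vote(["Hypertension", "hypertension", "", "chest pain"])
```

The returned support fraction could also act as a confidence signal, for instance abstaining when no answer reaches a chosen level of agreement.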

Conclusion
We found that how you ask matters in a BERT-based clinical QA task, especially at the training stage. By controlling the train and test questions to individual lexical/syntactic constructs, our crossover evaluation showed that certain question constructs consistently yielded higher accuracy. Accordingly, it suggests that the most effective way to secure robust performance is still by training with diverse, sizable questions. Our results also brought up a somewhat nuanced inquiry: why do some question constructs seem to act "linguistically superior" to others, and is this a universal or BERT-dependent phenomenon?