Neural Conversational QA: Learning to Reason vs Exploiting Patterns

Neural Conversational QA tasks like ShARC require systems to answer questions based on the contents of a given passage. On studying recent state-of-the-art models on the ShARC QA task, we found indications that the models learn spurious clues/patterns in the dataset. Furthermore, we show that a heuristic-based program designed to exploit these patterns can have performance comparable to that of the neural models. In this paper we share our ﬁndings about four types of patterns found in the ShARC corpus and describe how neural models exploit them. Motivated by the aforementioned ﬁndings, we create and share a modiﬁed dataset that has fewer spurious patterns, consequently allowing models to learn better.


Introduction
ShARC, a conversational QA task (Saeidi et al., 2018), requires a system to answer user questions based on rules expressed in natural language text. An example in Figure 1 shows a user sharing some background information (referred to as scenario) and asking a question about continuing to pay for 'UK National Insurance'. The rule text associated with this dialog exchange defines the policy that guides the conversation flow. At any turn in the conversation, a system may choose to respond with a final Yes/No answer; ask a follow-up question to obtain more information from the user; or reply that the question is irrelevant to the context.
Several deep learning models such as BERT-QA (Devlin et al., 2019), E3 (Zhong and Zettlemoyer, 2019), and BiSon (Lawrence et al., 2019) perform reasonably well on this task. However, our exploration of the ShARC dataset indicates that there are multiple spurious patterns that could be exploited by neural models. We observe that the performance of the models mentioned above drops when they are tested on a perturbed dataset, suggesting that *  Figure 1: Dialog Flow in ShARC ( (Saeidi et al., 2018)) the underlying neural models do not generalize and are rather over-sensitive (Welbl et al., 2020) to minor textual perturbations. By sensitivity we refer to a model's ability to generalize itself but not overfit, while still being invariant to perturbations or text transformations (Teney et al., 2020;Szegedy et al., 2013). Our observations about conversational QA models designed for ShARC learning spurious statistical clues are in line with those reported by Niven and Kao (2019). To the best of our knowledge, we are the first to demonstrate this problem in conversational QA. Patterns in the ShARC dataset: We discover four types of patterns in the ShARC dataset: (1) correlation between the last answer to a followup question and the predicted answer to a user question; (2) a high correlation between asking a new follow-up question and the number of turns in the dialog history; (3) correlation between the sequence of follow-up questions in the dialog history and the sequence of rule clauses in the ruletext; and (4) correlation between an empty history/scenario and the answer being irrelevant. Contributions: The main contributions of this work are as follows: (1) We present a simple heuristics driven program designed to exploit the aforementioned patterns which has performance comparable to the state-of-the-art models. (2) The performance of the state-of-the-art models drops when they are tested on a perturbed test set that has these patterns diluted or removed. (3) We also identify a weakness in the current evaluation procedure, and propose an improved evaluation criteria which penalizes BLEU scores if a follow-up question is not generated when it should be. We refer to this criteria as BLEU-P (BLEU penalized) in the rest of the paper. (4) We generate a new dataset which reduces the patterns identified in the original dataset, and re-benchmark existing state-ofthe-art models published on the leaderboard. We find that the models learn better on this dataset and their performance is consistent across the original and the perturbed dev sets. Our dataset and all accompanying scripts are available at https: //github.com/nikhilweee/neural-conv-qa.

Patterns in the ShARC Dataset
This section describes the spurious patterns in the ShARC dataset and presents a simple heuristic based program designed to exploit these patterns.
Pattern 1: Last follow-up answer is the predicted answer: Based on the asterisk ('*') as a separator between rule clauses, we found that 54.52% of the instances consist of a list of only-conjunctive or only-disjunctive clause conditions. Consider a case where a rule consists of only conjunctive clause statements. If any single follow-up question, generated based on one of these clauses, is answered with a 'No' by a user, the answer to the user's question shall be 'No'. In this case, no follow-up questions need to be asked. Thus, one often finds the sequence of follow-up answers in the dialog history as ('Yes', . . . , 'Yes', 'No'), for which the answer to the user question is a 'No'. Similarly, corresponding to a case where the rule consists of only disjunctive clauses, the follow-up answer sequence is ('No', . . . , 'No', 'Yes'), the answer to which is a 'Yes'. This indicates a high correlation between the final answer and the last answer of the history. We found that 74.6% of the instances in the train set with a 'Yes'/'No' answer have the same answer as that of the last follow-up question. Although this is reflective of real-world conversations, a model can do a good job on this task by exploiting just this pattern.
Pattern 2: Likelihood of asking a follow-up question decreases with number of turns: It is intuitive to expect that as the number of followup questions that have been asked increases, the likelihood of asking another follow-up question decreases (clauses are finite). Appendix A.1, contains an empirical study on the training data, and demonstrates the decrease in the probability of asking a follow-up question with the increase in the number of turns in the dialog history.
Pattern 3: Follow-up questions occur in the same sequence as the rule clauses in the passage: As discussed earlier, many of the rule clauses tend to be conjunctive/disjunctive. Thus, the next follow-up question that one needs to ask is not unique, since one can always consider any of the statements in the clause that has not been considered so far. However, the ground truth data considers these clauses in sequential order to generate the follow-up questions. Among all instances where a conjunctive/disjunctive clause can be discerned and have a follow-up question generated as a part of the ground truth, 62.8% satisfy the condition that the first clause that has not yet been asked is indeed the next follow-up question. We explain this pattern in detail and discuss how it affects computation of the BLEU metric in Appendix A.2.
Pattern 4: Answer as 'Irrelevant': Amongst the train instances where user background information and dialog history is empty, 66.67% have the final answer as Irrelevant.

A Simple Heuristics-based Program
To demonstrate the ease with which these patterns can be exploited by a model, we create a simple program that follows a set of hand-crafted rules. The program takes the following actions: 1. Answer 'Irrelevant': If the following conditions are jointly satisfied: a) no follow-up questions have been asked so far; b) the background information (scenario) is empty; c) there is low word overlap between the rule and the question; then the program answers Irrelevant. 2. Generate 'Follow-up Question': If the previous condition fails and the number of clauses in the rule are more than the number of follow-up questions asked, then a follow-up question is predicted, and the model asks the next clause in the rule text as a question by appending the words "Are you" in the beginning and a question mark at the end. 3. Answer with a 'Yes' or a 'No': If both the above scenarios are false, then the model response to the user question is the user's response to the last follow-up question.

Evaluation Metrics in ShARC
The following metrics are used in the evaluation of the ShARC task: 1. Micro and Macro Accuracy: At each turn, the model response is either a Yes/No/Irrelevant or a follow-up question. The micro and macro accuracy measures the ability of a model to correctly predict these four classes.
2. BLEU: This is used to assess the correctness of the follow-up question generated in case the model chooses to generate one.
Our experiments with the dataset suggest two weaknesses in the evaluation of follow-up questions which we discuss below. 1. Incomplete reference set: Recall that Pattern 3 suggests the existence of a sequential correlation between the rule clauses 1 in the rule text and the follow-up questions. This means that if an out of sequence follow-up question is generated by a model, then it is incorrectly penalized because the evaluation script expects the next follow-up question that would have occured in the original rule sequence. To mitigate this penalization, we create a list of alternative candidate references using the clauses in conjunctive-only or disjunctive-only instances in dataset. We make use of the standard implementation of BLEU which supports multiple references (Papineni et al., 2002). To generate multiple candidate references for the ShARC dataset, we identify instances which have a follow-up question as the gold answer and the follow-up question seems to be based on one of the clause statements in the rule text. We then create alternative follow-up questions from each of the clause statements which have not been a part of any follow-up question in the history. Please see Appendix A.3 for details about the algorithm to generate these alternative follow-ups. These alternative follow-up questions constitute a set of candidate questions. In the rest of the paper, we refer to this BLEU score computed using multiple references as Multi-BLEU, to distinguish it from the officially reported BLEU metric. 2. Improper Penalization for BLEU: As mentioned in the section 1, the official evaluation scripts 1 conjunctive-only / disjunctive-only clauses do not penalize BLEU scores of a model if it does not predict a follow-up question, and rather predicts a final answer (Yes/No or Irrelevant). This is because it considers only those cases where both the ground-truth and the model predictions are followups. A hypothetical model which classifies only 1 test instance as follow-up and produces it perfectly can get a BLEU score of 100 in this metric. We therefore update the evaluation script to penalize BLEU score in such cases, and refer to this as BLEU-P (BLEU-Penalized). This considers all instances where the ground truth is a follow-up question. We use the predicted response from the model as the answer to evaluate. This effectively counts a BLEU of 0 in almost all cases for these instances. When we use multiple references for computing BLEU, we refer to this as Multi-BLEU-P. Table 1 and 2 report results of our experimental study, providing evidence to support the following: (a) The heuristics-based program has performance comparable to the state-of-the-art models. (b) The performance of models drops when they are trained on the original dev set and tested on the perturbed dev set that has diluted or reduced patterns, indicating a reliance on patterns (c) The official evaluation scripts do not penalize BLEU scores and do not consider other candidate answers in their calculations.

Evidence of Patterns
To prepare a perturbed dev set, we modify the official dev set by shuffling the dialog turns of the history of a conversation (We modify approximately 20% of the dev set. For more details, please refer to Appendix A.5). We refer to this set as the "History-Shuffled" dev set, a perturbation introduced by shuffling dialog history to dilute Patterns 1 & 3. Note that shuffling the dialog history introduces examples which are unlikely to occur in real conversations (Eg. Asking a follow-up question based on a set of conjunctive rule-clauses even though a user has already responded with a "No"). For model details, please refer to Section 6.
Using the evaluation metrics (micro and macro accuracy for turn level classification; and Multi-BLEU and Multi-BLEU-P for answer generation accuracy), we report the performance of the top two ranked models 2 (E3 (Zhong and Zettlemoyer,

2019) and
BiSon (Lawrence et al., 2019)), the BERT based model (BERT-QA (Devlin et al., 2019)) and our heuristic based model by training them on the official ShARC training set and evaluating them on the original and History-Shuffled dev sets. We use the code released by respective authors for E3 and BiSon. We also use the same hyperparameters as mentioned in the respective papers.
Weakness in Offical BLEU scores: In Table 1, the BLEU scores are lower than the Multi-BLEU scores, by an average of 16.03%. This is as expected since the official scripts do not account for valid alternatives and the gold answers have been generated in accordance with Pattern 3. Furthermore the models result in significantly lower BLEU scores on the Multi-BLEU-P metric as it penalizes models if they dont generate a follow-up question when they were supposed to. This suggests that the official scripts grossly over-estimate model performance (BiSon's actual Multi-BLEU-P is score is only 14.25). In the rest of the experiments we only report Multi-BLEU and Multi-BLEU-P scores.
Heuristic model and effect of Patterns: Tables  1 and 2 show that the heuristic-based model not only has comparable performance across all other models but also across both dev sets (original and History-Shuffled). If we look at the first two columns (micro and macro accuracy), all models tested on the History-Shuffled dev set report a drop in performance. The average drops in micro & macro accuracies are 8.16% and 5.3% respectively. While changes in performance are can be attributed to change in train and test distributions, the goal of this experiment is to demonstrate that all models are relying on a spurious pattern induced by the sequence of follow-up answers in dialog history. Thus, it is interesting to note that a dilution of just 20% of the patterns leads to a sizeable drop in performance.

Modified ShARC dataset
In an attempt to mitigate the effects of the patterns listed in section 2 and to reduce the sensitivity of neural models, we create a modified version of the ShARC dataset. For each occurence of an instance conforming to any of the patterns, we automatically construct alternatives where we choose to either replace the current instance with an alternative instance which does not exhibit the pattern; or retain the original instance. The alternative instances are generated using pattern-specific modifications. For example, we shuffle dialog history to reduce the effect of Patterns 1 & 3 (For more details, please see appendices A.4 and A.5). We individually modify both the official train and dev datasets and refer to them as ShARC-Mod.

Benchmarking Experiments
We train and evaluate all models using the ShARC-Mod train dataset and then test on ShARC-Mod dev set as well as the original dev set (containing all patterns) and the History-Shuffled dev set (con-   Table 5: Models trained using ShARC-Mod and evaluated on the History-Shuffled dev set taining diluted patterns). Studying tables 3, 4 and 5 shows that all models perform consistently across all dev sets. This suggests that models that were earlier sensitive to perturbations now show consistent performance after being trained on a more robust data set. Note that, except for the heuristic model, this performance is indeed lower than what we had when trained on original data sets. This suggests that the neural models have been merely exploiting patterns in the training data and the performance sharply drops when these cues are absent from training data. The heuristic model has the same numbers as before. This is because the heuristic model is based on rules and is never actually trained. Its performance is indeed invariant to the training data.

Recommendations & Conclusion
In this paper we demonstrate how a popular Neural Conversational QA dataset inadvertently encodes patterns. We would like to emphasize that the patterns found, by their very nature, are likely to occur in real world tasks but the same patterns can also cause neural models to learn poorly. We release a modified version of this dataset and also improve evaluation criteria that better reflects model performance. We conclude the paper with a few recommendations for the community.
For Dataset creators: Patterns may exist in a real-world task and artificially introducing perturbations may be an easy way to help reduce their effects. This may result in 'unnatural' instances in the dataset but could help train better models.
For Model creators: (1) Model probing and experimenting with perturbed inputs can give deep insights about how a model is reasoning (2) Experimenting with adversarial inputs early on in the design process can help build better models.

A Appendix
A.1 Empirical Study for Pattern 2 Figure 2 plots the number of turns in an instance vs the probability of asking a follow-up question. It can be seen that with each response, the probability of asking another follow-up question decreases.

A.2 Detailed explanation for Pattern 3
Due to the presence of conjunctive/disjunctive clause statements, it becomes ambiguous as to which one to consider for framing the next possible question. As an example, consider the instance described in Table 6. The gold answer is a followup question which was framed using the first clause statement. However, follow-up questions framed using any of the other clause statements in the rule such as "Is the item a lifeboat or an associated equipment, including fuel?", "Is it medicine or an ingredient for a medicine?", or "Is it a resuscitation training model?" would have been equally valid. We found that this is a common pattern found in the dataset. More generally, the gold follow-up answer tends to be framed using the first clause which has not been asked so far. To quantify this hypothesis, we followed these steps.
1. We filter the instances with a follow-up question as the gold answer and identify clause statements which start with an asterisk (*).
2. For each follow-up question in the history, we use the longest common substring (LCS) algorithm to compute an intersecting span with the rule text. We then identify the clause statements which intersect with this span.
3. We use the same process as above to find a matching clause statement for the follow-up question listed as the answer of the instance.
Amongst the instances identified, we found that 62.8% of them were such that the follow-up question in the answer (step 3) intersects with the first clause statement identified in step 1 that does not appear in the history of the instance.
In an attempt to break this pattern, we identify clause statements in the rule text in the same way as step 1 above and then shuffle them to create a new instance in ShARC-Mod.

Rule
## Items that qualify for the zero rate You may be able to apply zero VAT when you sell the following to an eligible charity: * equipment for making 'talking' books and newspapers * lifeboats and associated equipment, including fuel * medicine or ingredients for medicine * resuscitation training models Scenario I used to work for the company, but I quit last month. Question Can I apply zero VAT to this item? History Answer Is it equipment for making 'talking' books and newspapers?

A.3 Details on creating multiple references
We discuss the details on how we add additional gold references to the dev dataset. This augmentation only affects the instances which have a followup question as the gold answer. If the LCS of the gold answer with the rule text intersects with one of the clause statements (identified in step 1), we suspect that other clause statements might also have been one of the possible answers. So we first eliminate the clauses that have already been asked in the history, and again use LCS to find the best matching span for each follow-up question that has been asked in the history so far. If the intersection is with one of the clauses, we eliminate the same. The remaining clauses are then considered potential candidates for the next follow-up question to consider. To create a question from the clause statement, we use a simple heuristic as to finding the words preceeding and following the best match span for the gold answer. These words are then prefixed and suffixed with the other candidate clauses to form potential questions. Algorithm 1 describes this process in a formal manner. For more details, please refer to the accompanying repository. On running this algorithm on the original dev dataset, we were able to add additional references in 183 out of the 2270 instances. It is interesting to