Logic-Guided Data Augmentation and Regularization for Consistent Question Answering

Many natural language questions require qualitative, quantitative or logical comparisons between two entities or events. This paper addresses the problem of improving the accuracy and consistency of responses to comparison questions by integrating logic rules and neural models. Our method leverages logical and linguistic knowledge to augment labeled training data and then uses a consistency-based regularizer to train the model. Improving the global consistency of predictions, our approach achieves large improvements over previous methods in a variety of question answering (QA) tasks, including multiple-choice qualitative reasoning, cause-effect reasoning, and extractive machine reading comprehension. In particular, our method significantly improves the performance of RoBERTa-based models by 1-5% across datasets. We advance state of the art by around 5-8% on WIQA and QuaRel and reduce consistency violations by 58% on HotpotQA. We further demonstrate that our approach can learn effectively from limited data.


Introduction
Comparison-type questions (Tandon et al., 2019;Tafjord et al., 2019;Yang et al., 2018) ask about relationships between properties of entities or events such as cause-effect, qualitative or quantitative reasoning. To create comparison questions that require inferential knowledge and reasoning ability, annotators need to understand context presented in multiple paragraphs or carefully ground a question to the given situation. This makes it challenging to annotate a large number of comparison questions. Most current datasets on comparison questions are much smaller than standard machine reading comprehension (MRC) datasets (Rajpurkar et al., 2016;Joshi et al., 2017). This poses new challenges to standard models, which are known to exploit statistical patterns or annotation artifacts in these datasets (Sugawara et al., 2018;Min et al., 2019a). Importantly, state-of-the-art models show inconsistent comparison predictions as shown in Figure 1. Improving the consistency of predictions has been previously studied in natural language inference (NLI) tasks (Minervini and Riedel, 2018;Li et al., 2019), but has not been addressed in QA.
In this paper, we address the task of producing globally consistent and accurate predictions for comparison questions leveraging logical and symbolic knowledge for data augmentation and training regularization. Our data augmentation uses a set of logical and linguistic knowledge to develop additional consistent labeled training data. Subsequently, our method uses symbolic logic to incorporate consistency regularization for additional supervision signal beyond inductive bias given by data augmentation. Our method generalizes previous consistency-promoting methods for NLI tasks (Minervini and Riedel, 2018;Li et al., 2019) to adapt to substantially different question formats.
Our experiments show significant improvement over the state of the art on a variety of QA tasks: a classification-based causal reasoning QA, a multiple choice QA for qualitative reasoning and an extractive MRC task with comparisons between entities. Notably, our data augmentation and consistency constrained training regularization improves performance of RoBERTa-based models (Liu et al., 2019) by 1.0%, 5.0% and 2.5% on WIQA, QuaRel and HotpotQA. Our approach advances the stateof-the-art results on WIQA and QuaRel with 4.7 and 8.4% absolute accuracy improvement, respectively, reducing inconsistent predictions. We further demonstrate that our approach can learn effectively from limited labeled data: given only 20% of the original labeled data, our method achieves performance on par with a competitive baseline learned with the full labeled data.

Related Work
Data augmentation has been explored in a variety of tasks and domains (Krizhevsky et al., 2009;Cubuk et al., 2019;Park et al., 2019). In NLP, using backtranslation (Yu et al., 2018) or dictionary based word replacement (Zhang et al., 2015) has been studied. Most relevant to our work, Kang et al. (2018) study NLI-specific logic and knowledgebased data augmentation. Concurrent to our work, Gokhale et al. (2020) study visual QA models' ability to answer logically composed questions, and show the effectiveness of logic-guided data augmentation. Our data augmentation does not rely on task-specific assumptions, and can be adapted to different formats of QA task. We further leverage consistency-promoting regularization, which gives improvements in accuracy and consistency.
Improving prediction consistency via training regularization has been studied in NLI tasks. Minervini and Riedel (2018) present model-dependent first-order logic guided adversarial example generation and regularization. Li et al. (2019) introduce consistency-based regularization incorporating the first-order logic rules. Previous approach is modeldependent or relies on NLI-specific rules, while our method is model-agnostic and is more generally applicable by combining it with data augmentation.
Regularizing loss to penalize violations of structural constraints in models' output has been also studied in previous work on constraint satisfaction in structured learning (Lee et al., 2019;Ganchev et al., 2010). Our work regularizes models to produce globally consistent predictions among augmented data following logical constraints, while those studies incorporates structured prediction models following linguistics rules.

Method
We present the components of our QA method: first-order logic guided data augmentation (Section 3.1 and Section 3.2), and consistency-based regularization (Section 3.3).

Consistent Question Answering
For globally consistent predictions in QA, we require responses to follow two important general logical rules: symmetric consistency and transitive consistency, which are illustrated in Figure 1 and are formally described below.
Let q, p, a be a question, a paragraph and an answer predicted by a model. A is a set of answer candidates. Each element of A can be a span in p, a class category, or an arbitrary answer choice. X = {q, p, a} represents a logic atom.
Symmetric consistency In a comparison question, small surface variations such as replacing words with their antonyms can reverse the answer, while keeping the overall semantics of the question as before. We define symmetry of questions in the context of QA as follows: (q, p, a * ) ↔ (q sym , p, a * sym ), where q and q sym are antonyms of each other, and a * sym is the opposite of the groundtruth answer a * in A. For example, the two questions in the first row of Figure 1 are symmetric pairs. We define the symmetric consistency of predictions in QA as the following logic rule: (q, p, a) → (q sym , p, a sym ), which indicates a system should predict a sym given (q sym , p), if it predicts a for (q, p).
Transitive consistency. Transitive inference between three predicates A, B, C is represented as: Gazes et al., 2012). In the context of QA, the transitive examples are mainly for causal reasoning questions that inquire about the effect e given the cause c. The second row of Figure 1 shows an example where transitive consistency is violated. For two questions q 1 and q 2 in which the effect of q 1 (= e 1 ) is equal to the cause of q 2 (= c 2 ), we define the transitive consistency of predictions as follows: (q 1 , p, a 1 ) ∧ (q 2 , p, a 2 ) → (q trans , p, a trans ).
(2)  The parts written in red or blue denote antonyms, and highlighted text is negation added by our data augmentation.

Logic-guided Data Augmentation
Given a set of training examples X in the form of (q, p, a * ), we automatically generate additional examples X aug = {q aug , p, a * aug } using symmetry and transitivity logical rules. The goal is to augment the training data so that symmetric and transitive examples are observed during training. We provide some augmented examples in Table 1.
Augmenting symmetric examples To create a symmetric question, we convert a question into an opposite one using the following operations: (a) replace words with their antonyms, (b) add, or (c) remove words. For (a), we select top frequent adjectives or verbs with polarity (e.g., smaller, increases) from training corpora, and expert annotators write antonyms for each of the frequent words (we denote this small dictionary as D). More details can be seen in Appendix A. For (b) and (c), we add negation words or remove negation words (e.g., not). For all of the questions in training data, if a question includes a word in D for the operation (a), or matches a template (e.g., which * is ↔ which * is not) for operations (b) and (c), we apply the operation to generate q sym . 2 We obtain a * sym by re-labeling the answer a * to its opposite answer choice in A (see Appendix B).
Sampling augmented data Adding all consistent examples may change the data distribution from the original one, which may lead to a deterioration in performance (Xie et al., 2019). One can select the data based on a model's prediction inconsistencies (Minervini and Riedel, 2018) or randomly sample at each epoch (Kang et al., 2018).
In this work, we randomly sample augmented data at the beginning of training, and use the same examples for all epochs during training. Despite its simplicity, this yields competitive or even better performance than other sampling strategies. 3

Logic-guided Consistency Regularization
We regularize the learning objective (task loss, L task ) with a regularization term that promotes consistency of predictions (consistency loss, L cons ).
The first term L task penalizes making incorrect predictions. The second term L cons 4 penalizes making predictions that violate symmetric and transitive logical rules as follows: where λ sym and λ trans are weighting scalars to balance the two consistency-promoting objectives.  Previous studies focusing on NLI consistency (Li et al., 2019) calculate the prediction inconsistency between a pair of examples by swapping the premise and the hypothesis, which cannot be directly applied to QA tasks. Instead, our method leverages consistency with data augmentation to create paired examples based on general logic rules. This enables the application of consistency regularization to a variety of QA tasks.

Inconsistency losses
The loss computes the dissimilarity between the predicted probability for the original labeled answer and the one for the augmented data defined as follows: L sym = |log p(a|q, p)−log p(a aug |q aug , p)|. (5) Likewise, for transitive loss, we use absolute loss with the product T-norm which projects a logical conjunction operation (q 1 , p, a 1 ) ∧ (q 2 , c, a 2 ) to a product of probabilities of two operations, p(a 1 |q 1 , p)p(a 2 |q 2 , p), following Li et al. (2019). We calculate a transitive consistency loss as: L trans = |log p(a 1 |q 1 , p) + log p(a 2 |q 2 , p)− log p(a trans |q trans , p)|.
Annealing The model's predictions may not be accurate enough at the beginning of training for consistency regularization to be effective. We perform annealing (Kirkpatrick et al., 1983;Li et al., 2019;Du et al., 2019). We first set λ {sym,trans} = 0 in Eq. (4) and train a model for τ epochs, and then train it with the full objective.  (Yang et al., 2018). 5 We train models on both bridge and comparison questions, and evaluate them on extractive comparison questions only.  We use standard F1 and EM scores for performance evaluation on HotpotQA and use accuracy for WIQA and QuaRel. We report a violation of consistency following Minervini and Riedel (2018) to evaluate the effectiveness of our approach for improving prediction consistencies. We compute the violation of consistency metric v as the percentage of examples that do not agree with symmetric and transitive logical rules. More model and experimental details are in Appendix. Table 2 demonstrates that our methods (DA and DA + Reg) constantly give 1 to 5 points improvements over the state-of-theart RoBERTa QA's performance on all three of the datasets, advancing the state-of-the-art scores on WIQA and QuaRel by 4.7% and 8.4%, respectively. On all three datasets, our method signifi-   Table 2 also shows that our approach is especially effective under the scarce training data setting: when only 20% of labeled data is available, our DA and Reg together gives more than 12% and 14% absolute accuracy improvements over the RoBERTa baselines on WIQA and QuaRel, respectively.

Results with limited training data
Ablation study We analyze the effectiveness of each component on Table 3. DA and Reg each improves the baselines, and the combination performs the best on WIQA and QuaRel. DA (standard) follows a previous standard data augmentation technique that paraphrases words (verbs and adjectives) using linguistic knowledge, namely Word-Net (Miller, 1995), and does not incorporate logical rules. Importantly, DA (standard) does not give notable improvement over the baseline model both in accuracy and consistency, which suggests that logic-guided augmentation gives additional inductive bias for consistent QA beyond amplifying the number of train data. As WIQA consists of some transitive or symmetric examples, we also report the performance with Reg only on WIQA. The performance improvements is smaller, demonstrating the importance of combining with DA.
Qualitative Analysis Table 4 shows qualitative examples, comparing our method with RoBERTa baseline. Our qualitative analysis shows that DA+Reg reduces the confusion between opposite choices, and assigns larger probabilities to the ground-truth labels for the questions where DA shows relatively small probability differences. On HotpotQA, the baseline model shows large consistency violations as shown in Table 2. The HotpotQA example in Table 4 shows that RoBERTa selects the same answer to both q and q sym , while DA answers correctly to both questions, demonstrating its robustness to surface variations. We hypothesize that the baseline model exploits statistical pattern, or dataset bias presented in questions and that our method reduces the model's tendency to exploit those spurious statistical patterns (He et al., 2019;Elkahky et al., 2018), which leads to large improvements in consistency.

Conclusion
We introduce a logic guided data augmentation and consistency-based regularization framework for accurate and globally consistent QA, especially under limited training data setting. Our approach significantly improves the state-of-the-art models across three substantially different QA datasets. Notably, our approach advances the state-of-the-art on QuaRel and WIQA, two standard benchmarks requiring rich logical and language understanding. We further show that our approach can effectively learn from extremely limited training data.

A Details of Human Annotations
In this section, we present the details of human annotations used for symmetric example creation (the (a) operation). We first sample the most frequent 500 verbs, 50 verb phrases and 500 adjectives from from the WIQA and QuaRel training data. Then, human annotators select words with some polarity (e.g., increase, earlier). Subsequently, they annotate the antonyms for each of the selected verbs and adjectives. Consequently, we create 64 antonym pairs mined from a comparison QA dataset. We reuse the same dictionary for all three datasets. Examples of annotated antonym pairs are shown in Table 5  B Details of answer re-labeling on WIQA and HotpotQA We present the details of answer re-labeling operations in WIQA and HotpotQA, where the number of the answer candidates is more than two.
Answer re-labeling in WIQA (symmetric) In WIQA, each labeled answer a * takes one of the following values: {more, less, no effects}. Although more and less are opposite, no effects is a neutral choice. In addition, in WIQA, a question q consists of a cause c and an effect e, and we can operate the three operations (a) replacement, (b) addition and (c) removal of words. When we add the operations to both of c and e, it would convert the question to opposite twice, and thus the original answer remains same. When we add one of the operation to either of c or e, it would convert the question once, and thus, the answer should be the opposite one. Given these two assumption, we re-label answer as: (i) if we apply only one operation to either e or c and a * is more or less, the a * sym will be the opposite of a * , (ii) if we apply only one operation to either e or c and a * is no effect, the a * sym will remain no effect, and (iii) if we apply one operation to each of e and c, the a sym remains the same.
Answer re-labeling in WIQA (transitive) For transitive examples, we re-label answers based on two assumptions on causal relationship. A transitive questions are created from two questions, X 1 = (q 1 , p, a * 1 ) and X 2 = (q 2 , p, a * 2 ), where q 1 and q 2 consist of (c 1 , e 1 ) and (c 2 , e 2 ) and e 1 = c 2 holds. If a 1 for X 1 is "more", it means that the c 1 causes e 1 . e 1 is equivalent to the cause for the second question (c 2 ), and a * 2 represents the causal relationship between c 2 and e 2 . Therefore, if a * 1 is a positive causal relationship, c 1 and e 2 have the relationship defined as a * 2 . We assume that if the a * 1 is "more", a * 3 (= a * trans ) will be same as a 2 , and re-label answer following this assumption.
Answer re-labeling in HotpotQA In Hot-potQA, answer candidates A are not given. Therefore, we extract possible answers from q. We extract two entities included in q by string matching with the titles of the paragraphs given by the dataset. If we find two entities to be compared and both of them are included in the gold paragraphs, we assume the two entities are possible answer candidates. The new answer a * sym will be determined as the one which is not the original answer a * .

C Details of Baseline Models
We use RoBERTa (Li et al., 2019) as our baseline.
Here, we present model details for each of the three different QA datasets.
Classification-based model for WIQA As the answer candidates for WIQA questions are set to {more, less, no effects}, we use a classification based models as studied for NLI tasks. The input for this model is [CLS] p [SEP] q [SEP]. We use the final hidden vector corresponding to the first input token ([CLS]) as the aggregate representation. We then predict the probabilities of an answer being a class C in the same manner as in (Devlin et al., 2019;Liu et al., 2019).

Multiple-choice QA model for QuaRel For
QuaRel, two answer choices are given, and thus we formulate the task as multiple-choice QA. In the original dataset, all of the p, q and A are combined together (e.g., The fastest land animal on earth, a cheetah was having a 100m race against a rabbit. Which one won the race? (A) the cheetah (B) the rabbit), and thus we process the given combined questions into p, q and A (e.g., the question written above will be p =The fastest land animal on earth, a cheetah was having a 100m race against a rabbit. , q =Which one won the race? and A ={the cheetah, rabbit}). Then the input will be [CLS] p [SEP] "Q: " q "A: " a i [SEP], and we will use the final hidden vector corresponding to the first input token ([CLS]) as the aggregate representation. We then predict the probabilities of an answer being an answer choice a i in the same manner as in (Liu et al., 2019).
Span QA model for HotpotQA We use the RoBERTa span QA model studied for SQuAD (Devlin et al., 2019;Liu et al., 2019) for HotpotQA. As we only consider the questions whose answers can be extracted from p, we do not add any modifications to the model unlike some previous studies in HotpotQA (Min et al., 2019b;Ding et al., 2019). Hyper-parameters For HotpotQA, we train a model for six epochs in total. For the model without data augmentation or regularization, we train on the original dataset for six epochs. For the models with data augmentation, we first train them on the original HotpotQA train data (including both bridge and comparison questions) for three epochs, and then train our model with augmented data and regularization for three epochs. For HotpotQA, we train our model with both bridge and comparison questions, and evaluate on comparison questions whose answers can be extracted from the context. Due to the high variance of the performance in the early stages of the training for small datasets such as QuaRel or WIQA, for these two datasets, we set the maximum number of training epochs to 150 and 15, respectively. We terminate the training when we do not observe any performance improvements on the development set for 5 epochs for WIQA and 10 epochs for QuaRel, respectively. We use Adam as an optimizer ( = 1E − 8) for all of the datasets. Other hyper-parameters can be seen from

E Qualitative Examples on HotpotQA
As shown in Table 2, the state-of-the-art RoBERTa model produces a lot of consistency violations.
Here, we present several examples where our competitive baseline model cannot answer correctly, while our RoBERTa+DA model answers correctly.
A question requiring world knowledge One comparison question asks "Who has more scope of profession, B. Reeves Eason or Albert S. Rogell", given context that B. Reeves is an American film director, actor and screenwriter and Albert S. Rogell is an American film director. The model correctly predicts "B. Reeves Eason" but fails to answer correctly to "Who has less scope of profession, B. Reeves Eason or Albert S. Rogell", although the two questions are semantically equivalent.
A question with negation We found that due to this reasoning pattern our model struggles on questions involving negation. Here we show one example. We create a question by adding a negation word, q sym ,"Which species is not native to asia, corokia or rhodotypos?", where we add negation word not and the paragraph corresponding to the question is p ="Corokia is a genus in the Argophyllaceae family comprising about ten species native to New Zealand and one native to Australia. Rhodotypos scandens is a deciduous shrub in the family Rosaceae and is native to China, possibly also Japan.". The model predicts Rhodotypos scandens, while the model predicts the same answer to the original question q, 'which species is native to asia, corokia or rhodotypos?". This example shows that the model strongly relies on surface matching (i.e., "native to") to answer the question, without understanding the rich linguistic phenomena or having world knowledge.