Tackling Adversarial Examples in QA via Answer Sentence Selection

Question answering systems deteriorate dramatically in the presence of adversarial sentences in articles. According to Jia and Liang (2017), the single BiDAF system (Seo et al., 2016) achieves an F1 score of only 4.8 on the ADDANY adversarial dataset. In this paper, we present a method to tackle this problem via answer sentence selection. Given a paragraph of an article and a corresponding query, instead of feeding the whole paragraph directly to the single BiDAF system, we first select the sentence that most likely contains the answer to the query, using a deep neural network based on Tree-LSTM (Tai et al., 2015). Experiments on the ADDANY adversarial dataset validate the effectiveness of our method: the F1 score improves from 4.8 to 52.3.


Introduction
Question answering is an important task for evaluating the language understanding ability of machines. Usually, given a paragraph and a corresponding question, a question answering system is supposed to generate the answer to the question from the paragraph. By comparing the predicted answer with human-approved answers, the performance of the system can be assessed. Recently, many systems have achieved strong results on this task (Shen et al., 2017b; Wang and Jiang, 2016; Hu et al., 2017). However, Jia and Liang (2017) show that these systems are very vulnerable to paragraphs with adversarial sentences. For instance, the single BiDAF system (Seo et al., 2016), which achieves an F1 of 75.5 on the Stanford Question Answering Dataset (SQuAD), deteriorates to an F1 of 4.8 on the ADDANY adversarial dataset. Besides the single BiDAF, the single Match LSTM, the ensemble Match LSTM, and the ensemble BiDAF achieve an F1 of 7.6, 11.7, and 2.7 respectively on the ADDANY adversarial dataset (Jia and Liang, 2017). Therefore, question answering with adversarial sentences in paragraphs is a prominent issue and is the focus of this study.
In this paper, we propose a method to improve the performance of the single BiDAF system on the ADDANY adversarial dataset. Given a paragraph and a corresponding question, our method works in two steps to generate an answer. In the first step, a deep neural network named the QA Likelihood neural network is deployed to predict the likelihood that each sentence in the paragraph is an answer sentence, i.e., a sentence that contains the answer. The architecture and the loss of the QA Likelihood neural network follow the neural network for semantic relatedness proposed by Tai et al. (2015). Its main ingredient is the Tree-LSTM model. While the neural network for semantic relatedness is used to predict the similarity between sentences A and B, the QA Likelihood neural network is used to predict whether sentence A contains the answer to query B. In the second step, only the sentence with the highest likelihood is paired with the question and passed to the single BiDAF to output an answer. In summary, compared to the original BiDAF, which is an end-to-end question answering system, our method first selects the sentence that is most likely to be an answer sentence. Since adversarial sentences are not supposed to contain the answer, they can be screened out, which reduces their distracting effect. Experiments on the ADDANY adversarial dataset demonstrate the effectiveness of our method.
The contributions of this study are threefold. First, to the best of our knowledge, it is the first work that tries to address the problem of question answering with adversarial examples. Our results show the effectiveness of answer sentence selection in tackling adversarial sentences in the ADDANY dataset. Second, the power of Tree-LSTM sentence representations has been demonstrated in different NLP tasks, such as semantic relatedness computation, machine translation evaluation, and natural language inference (Tai et al., 2015; Gupta et al., 2015; Chen et al., 2017); meanwhile, multiple methods have been proposed for answer sentence selection (Wang and Nyberg, 2015; Rao et al., 2016; Wang et al., 2017; Shen et al., 2017a; Choi et al., 2017). We are the first to design a framework that illustrates the effectiveness of Tree-LSTM in answer sentence selection. Third, two sampling methods are implemented to build the training set for the QA Likelihood neural network. We show that the choice of sampling method influences the performance of question answering in this scenario.

Methods
Given a paragraph $C$ and a corresponding query $Q$, the paragraph is split into a set of sentences $C = \{S_i \mid i = 1, 2, \ldots, |C|\}$. By combining each sentence $S_i$ with the query $Q$, a set of sentence pairs $P_{C,Q} = \{(S_i, Q) \mid i = 1, 2, \ldots, |C|\}$ is obtained. Then, dependency parsing is used to get the tree representation $T_{S_i}$ for $S_i$ and $T_Q$ for $Q$. Based on $T_{S_i}$ and $T_Q$, two Tree-LSTMs, Tree-LSTM$_{S_i}$ and Tree-LSTM$_Q$, are built respectively (Tai et al., 2015). The inputs to the leaves of both Tree-LSTMs are the GloVe word vectors of Pennington et al. (2014). The output hidden vectors of the Tree-LSTMs for $S_i$ and $Q$ are $h_{S_i}$ and $h_Q$ respectively. Then, $h_{S_i}$ and $h_Q$ are concatenated and passed to a feed-forward neural network that outputs the likelihood that $S_i$ contains the answer to $Q$. The architecture and the loss of the feed-forward neural network follow the neural network for semantic relatedness (Tai et al., 2015). During training, the likelihood is supervised by 1 if $S_i$ contains the answer and 0 otherwise. The procedure above is summarized as the QA Likelihood neural network, which is illustrated in Part 1 of Figure 1. Following that, the sentence that is most likely to be an answer sentence is selected:

$$S^* = \arg\max_{S_i \in C} L(S_i, Q),$$

where $L$ stands for the likelihood predicted by the QA Likelihood neural network. After that, the pair $(S^*, Q)$ is passed to the pre-trained single BiDAF (Seo et al., 2016) to generate an answer $\hat{a}$ to $Q$. This process is illustrated in Part 2 of Figure 1.
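The two-step procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `select_answer_sentence`, `answer_query`, `toy_likelihood`, and `toy_reader` are hypothetical, and the toy functions are simple stand-ins for the QA Likelihood neural network and the pre-trained BiDAF reader, neither of which is implemented here.

```python
from typing import Callable, List, Tuple


def select_answer_sentence(
    sentences: List[str],
    query: str,
    likelihood: Callable[[str, str], float],
) -> Tuple[str, float]:
    """Step 1: return the sentence with the highest likelihood L(S_i, Q) and its score."""
    if not sentences:
        raise ValueError("paragraph has no sentences")
    scored = [(likelihood(s, query), s) for s in sentences]
    best_score, best_sentence = max(scored, key=lambda pair: pair[0])
    return best_sentence, best_score


def answer_query(
    sentences: List[str],
    query: str,
    likelihood: Callable[[str, str], float],
    reader: Callable[[str, str], str],
) -> str:
    """Step 2: pass only the selected sentence and the query to the reader."""
    best_sentence, _ = select_answer_sentence(sentences, query, likelihood)
    return reader(best_sentence, query)


def toy_likelihood(sentence: str, query: str) -> float:
    """Hypothetical stand-in scorer: fraction of query tokens found in the sentence."""
    s_tokens = set(sentence.lower().split())
    q_tokens = set(query.lower().split())
    return len(s_tokens & q_tokens) / max(len(q_tokens), 1)


def toy_reader(sentence: str, query: str) -> str:
    """Hypothetical stand-in reader: return the numeric tokens of the sentence."""
    return " ".join(t for t in sentence.split() if t.isdigit())
```

Because the reader only ever sees the selected sentence, any adversarial sentence that receives a lower likelihood than the answer sentence cannot influence the final answer.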

Experiments
Dataset for Training. As Figure 1 shows, the input of our system is a pair of sentences. Thus, the training instances for the QA Likelihood neural network are in the form of sentence pairs. They are sampled from the training set of SQuAD v1.1 (Rajpurkar et al., 2016), which contains no adversarial sentences. Specifically, there are 87,599 queries over 18,896 paragraphs in the training set of SQuAD v1.1. While each query refers to one paragraph, a paragraph may be associated with multiple queries.
For the $k$-th query $Q_k$, by splitting its corresponding paragraph $C_k$ into separate sentences and combining them with the query, a set of sentence pairs is obtained:

$$D_k = \{(S_i^k, Q_k) \mid i = 1, 2, \ldots, m_k\},$$

where $D_k$ represents the set of sentence pairs for the $k$-th query, $m_k$ is the number of sentences in the paragraph $C_k$, and $S_i^k$ is the $i$-th sentence in $C_k$. A sentence pair $(S_i^k, Q_k)$ is called a positive instance if $S_i^k$ contains the answer to $Q_k$; otherwise, it is called a negative instance. Then, the union of the sets $D_k$ for all the 87,599 queries in SQuAD is

$$D = \bigcup_{k=1}^{d} D_k,$$

where $d = 87{,}599$ is the number of queries. The set $D$ contains 440,135 sentence pairs, among which 87,306 are positive instances and 352,829 are negative instances.
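The construction of the labeled sentence pairs can be illustrated with a minimal Python sketch. It assumes the paragraph has already been split into sentences and that a sentence "contains the answer" when the answer string is a substring of it; the exact matching rule is not specified in the text, so this is an assumption, and the names `SentencePair`, `build_pairs`, and `build_dataset` are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Tuple


class SentencePair(NamedTuple):
    sentence: str
    query: str
    label: int  # 1 = positive instance, 0 = negative instance


def build_pairs(sentences: List[str], query: str, answer: str) -> List[SentencePair]:
    """Build the set D_k for one query: pair every sentence with the query.

    Assumption: a sentence is positive if it contains the answer string verbatim.
    """
    return [SentencePair(s, query, int(answer in s)) for s in sentences]


def build_dataset(
    examples: Iterable[Tuple[List[str], str, str]],
) -> List[SentencePair]:
    """Build D as the union of D_k over all (sentences, query, answer) examples."""
    dataset: List[SentencePair] = []
    for sentences, query, answer in examples:
        dataset.extend(build_pairs(sentences, query, answer))
    return dataset
```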
In order to train our model properly and efficiently, both downsampling of $D$ and undersampling of negative instances must be done. In this paper, we implement two different sampling methods: pair-level sampling and paragraph-level sampling. In pair-level sampling, 45,000 positive instances and 45,000 negative instances are randomly selected from $D$ as the training set. By contrast, in paragraph-level sampling, we first randomly select a query $Q_k$ without replacement; then one positive instance and one negative instance are randomly sampled from the set of sentence pairs $D_k$. This operation is repeated until we get 45,000 positive instances and 45,000 negative instances. In this way, two different training sets are generated, one by pair-level sampling and one by paragraph-level sampling. Each set has 90,000 instances. A validation set with 3,000 instances is sampled by each of these two methods as well.
Dataset for Testing. Our test set is the ADDANY adversarial dataset of Jia and Liang (2017). It includes 1,000 paragraphs, and each paragraph refers to only one query, i.e., there are 1,000 $(C, Q)$ pairs. By splitting and combining, 6,154 sentence pairs are obtained.
Experimental Settings. The dimension of the GloVe word vectors (Pennington et al., 2014) is set to 300. The QA Likelihood neural network is trained by Adagrad (Duchi et al., 2011) with a learning rate of 0.01 and a batch size of 25. Model parameters are regularized by per-minibatch $L_2$ regularization with a strength of $10^{-4}$.
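The two sampling schemes used to build the training sets can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: sentence pairs are represented as `(sentence, query, label)` tuples, the function names are hypothetical, and queries whose paragraph has no positive or no negative sentence are skipped in paragraph-level sampling (the text does not say how such cases are handled).

```python
import random
from typing import Dict, List, Tuple

Pair = Tuple[str, str, int]  # (sentence, query, label), label 1 = positive


def pair_level_sampling(
    pairs: List[Pair], n_per_class: int, rng: random.Random
) -> List[Pair]:
    """Draw n positives and n negatives uniformly from the whole set D."""
    positives = [p for p in pairs if p[2] == 1]
    negatives = [p for p in pairs if p[2] == 0]
    return rng.sample(positives, n_per_class) + rng.sample(negatives, n_per_class)


def paragraph_level_sampling(
    pairs_by_query: Dict[str, List[Pair]], n_per_class: int, rng: random.Random
) -> List[Pair]:
    """Pick queries without replacement; take one positive and one negative from each D_k."""
    query_ids = list(pairs_by_query)
    rng.shuffle(query_ids)  # walking a shuffled list selects without replacement
    sampled: List[Pair] = []
    taken = 0
    for query_id in query_ids:
        if taken == n_per_class:
            break
        positives = [p for p in pairs_by_query[query_id] if p[2] == 1]
        negatives = [p for p in pairs_by_query[query_id] if p[2] == 0]
        if not positives or not negatives:
            continue  # assumption: skip queries that cannot supply both classes
        sampled.append(rng.choice(positives))
        sampled.append(rng.choice(negatives))
        taken += 1
    if taken < n_per_class:
        raise ValueError("not enough usable queries for the requested sample size")
    return sampled
```

One difference the sketch makes visible: paragraph-level sampling guarantees that every sampled positive is accompanied by a negative from the same paragraph and query, whereas pair-level sampling draws the two classes independently of each other.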

Results
The performance of question answering is evaluated by the Macro-averaged F1 score (Rajpurkar et al., 2016; Jia and Liang, 2017). It measures the average overlap between the predicted answer $\hat{a}$ and the ground-truth answers at the token level. We also compute the Macro-averaged Precision and Recall following the same procedure. The results are in Table 1. As it shows, the systems based on pair-level sampling and on paragraph-level sampling both significantly outperform the single BiDAF system. The Macro-averaged F1 improves from 4.8 to 52.3. In addition, paragraph-level sampling achieves better results than pair-level sampling.
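For concreteness, a simplified sketch of a token-level Macro-averaged F1 is given below. It assumes lowercased whitespace tokenization and takes the best score over the available ground-truth answers for each question; the official SQuAD evaluation script additionally strips punctuation and articles before comparing tokens, which is omitted here. The function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def token_prf(prediction: str, reference: str) -> Tuple[float, float, float]:
    """Token-level precision, recall, and F1 between one prediction and one reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return precision, recall, 2 * precision * recall / (precision + recall)


def macro_f1(predictions: List[str], references: List[List[str]]) -> float:
    """Average over questions of the best F1 against any reference, on a 0-100 scale."""
    per_question = [
        max(token_prf(pred, ref)[2] for ref in refs)
        for pred, refs in zip(predictions, references)
    ]
    return 100.0 * sum(per_question) / len(per_question)
```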
In order to analyze the source of the performance improvements, we further evaluate the performance of the QA Likelihood neural network and the single BiDAF system on answer sentence selection. Here, we treat the problem as binary classification. In the test set, positive instances are labeled with 1 and negative ones are labeled with 0. A sentence pair selected by a QA system (the QA Likelihood neural network or the single BiDAF) has a predicted label of 1, while the others have a predicted label of 0. The results are shown in Table 2. Both of our systems outperform the single BiDAF on all four metrics in the table.
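The binary-classification view of answer sentence selection can be sketched as below. The text does not restate which four metrics Table 2 reports, so the choice of accuracy, precision, recall, and F1 here is an assumption, as are the function name and the input layout (one list of gold labels per question plus the index of the selected sentence).

```python
from typing import Dict, List


def selection_metrics(
    gold_labels: List[List[int]], selected_indices: List[int]
) -> Dict[str, float]:
    """Score sentence selection: the selected sentence is predicted 1, all others 0."""
    tp = fp = fn = tn = 0
    for gold, selected in zip(gold_labels, selected_indices):
        for i, label in enumerate(gold):
            predicted = int(i == selected)
            if predicted == 1 and label == 1:
                tp += 1
            elif predicted == 1 and label == 0:
                fp += 1
            elif predicted == 0 and label == 1:
                fn += 1
            else:
                tn += 1
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```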
We further evaluate the performance of the QA Likelihood neural network and the single BiDAF system on answer sentence selection from another perspective. Here, we consider three types of sentences: adversarial sentences, answer sentences, and the sentences that include the answers returned by the single BiDAF system. Given a QA Likelihood neural network, we draw a histogram in which the x-axis denotes the ranked position of each sentence according to its likelihood score (the x-axis is truncated to save space), while the y-axis is the number of sentences of each type ranked at this position. The results are presented in Figure 2. It shows that among the 1,000 $(C, Q)$ pairs, 647 and 657 answer sentences are selected by the QA Likelihood neural network based on pair-level sampling and paragraph-level sampling respectively, whereas only 136 and 141 adversarial sentences are selected. This indicates the effectiveness of the QA Likelihood neural network in reducing the impact of adversarial sentences.
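The counts behind such a rank histogram could be computed along the following lines. This is a simplified sketch: each sentence carries a single type label (in practice a sentence could belong to more than one of the three types), and the function name and the label strings are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def rank_histogram(
    scored_paragraphs: List[List[Tuple[float, str]]],
) -> Dict[str, Counter]:
    """Count, per sentence type, how many sentences land at each likelihood rank.

    Each paragraph is a list of (likelihood score, sentence type) entries;
    rank 1 is the sentence with the highest likelihood, i.e., the selected one.
    """
    histogram: Dict[str, Counter] = defaultdict(Counter)
    for scored_sentences in scored_paragraphs:
        ranked = sorted(scored_sentences, key=lambda item: item[0], reverse=True)
        for rank, (_, kind) in enumerate(ranked, start=1):
            histogram[kind][rank] += 1
    return dict(histogram)
```

Under this layout, the number of times a given type is selected by the network is simply its count at rank 1.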

Related works
With the help of deep learning, many techniques have been investigated and have achieved strong results on answer sentence selection and QA. Wang and Nyberg (2015) measure the relevance between sentences through a stacked bidirectional LSTM network. They show that these scores are effective in answer sentence selection. He et al. (2015) embed sentences with a CNN at multiple levels of granularity to model the similarity between sentences. Rao et al. (2016) extend the method of Noise-Contrastive Estimation to questions paired with positive and negative sentences. Based on that, they present a pairwise ranking approach to select an answer from multiple candidate sentences. Wang et al. (2017) propose a bilateral multi-perspective matching model which achieves competitive results in the task of answer sentence selection. Shen et al. (2017a) measure the similarity between sentences by utilizing a word-level similarity matrix. This approach is validated in answer selection. To efficiently tackle question answering for long documents, Choi et al. (2017) propose a method based on answer sentence selection that first narrows down a document and then uses an RNN to generate an answer.
However, following the idea of adversarial examples in image recognition (Goodfellow et al., 2014; Kurakin et al., 2016; Papernot et al., 2016), Jia and Liang (2017) point out the unreliability of existing question answering models in the presence of adversarial sentences. In this study, we propose a method to tackle this problem through answer sentence selection. The main component of our system is Tree-LSTM, which is a powerful variant of Tree-RNN. Therefore, studies about Tree-RNN (Pollack, 1990; Goller and Küchler, 1996; Socher et al., 2011, 2012, 2013) are also related.

Conclusions
In this paper, we propose a method to address the problem of question answering with adversarial sentences in paragraphs. Specifically, our system, via the QA Likelihood neural network based on Tree-LSTMs, successfully boosts the performance of the single BiDAF on the ADDANY adversarial dataset. Experiments show that the F1 score improves from 4.8 to 52.3. To the best of our knowledge, we are the first to apply Tree-LSTMs to answer sentence selection and the first to tackle question answering with adversarial examples on the ADDANY adversarial dataset.
However, Jia and Liang (2017) also report the deterioration of QA systems on another dataset, the ADDSENT adversarial dataset. Question answering on this dataset remains unsolved. We leave it as future work.