Improving the Robustness of Deep Reading Comprehension Models by Leveraging Syntax Prior

Despite the remarkable progress on Machine Reading Comprehension (MRC) driven by open-source datasets, recent studies indicate that most current MRC systems unfortunately suffer from weak robustness against adversarial samples. To address this issue, we attempt to use sentence syntax as leverage in the answer prediction process, which previously took only phrase-level semantics into account. Furthermore, to better utilize sentence syntax and improve robustness, we propose a Syntactic Leveraging Network, which is designed to deal with adversarial samples by exploiting the syntactic elements of a question. The experimental results indicate that our method is promising for improving the generalization and robustness of MRC models against adversarial samples, with performance well maintained.


Introduction
As one of the ultimate goals of natural language processing, Machine Reading Comprehension (MRC) has been attracting much attention from both academia and industry (Richardson et al., 2013; Hermann et al., 2015). Recently, most outstanding studies have benefited from the rapid development of machine reading competitions with shared datasets, such as SQuAD (Rajpurkar et al., 2016) and MS MARCO (Nguyen et al., 2017). According to the competition results, deep-learning-based approaches have shown significant strength on MRC tasks and occupy most of the top-ranked positions (Wang et al., 2017; Yu et al., 2018).
Nevertheless, very recent research in MRC indicates that simply chasing performance improvements on given datasets is unwise, since generalization and robustness may be weakened by the great fitting capability of DL models trained on a specific corpus. In particular, the research on adversarial reading comprehension samples conducted by Jia and Liang (2017) has shown that the performance of most DL-based MRC models decreases significantly on adversarial samples. These adversarial samples are constructed by simply appending one sentence similar to the question to the paragraph, without changing the original answer. This work indicates that there apparently remains quite a gap between current MRC approaches and methodologies that truly comprehend natural language passages.
In this paper, we attempt to face the challenge brought by RC adversarial samples and aim to propose a reading comprehension system with better generalization and robustness. For this purpose, we present a method to improve the answer inference process of MRC by augmenting the probability function for answer estimation with information about sentence-question matching. Moreover, to further improve the robustness of the MRC system, we propose a novel model named the Syntactic Leveraging Network, which exploits the syntax of the question as prior information to match the answer-bearing sentence and the question more precisely.

Methodology
Most existing MRC methods predict answers by calculating the probabilities of answer spans (i, j). For an answer a that starts at position i, ends at position j, and is located in sentence k, we denote a = {i, j, k}. Given a question q and a paragraph p, the probability of a is computed by:

p(a|q, p) = p_s(i|q, p) · p_e(j|q, p)    (1)

and:

p_s(i|q, p) = f_s(i, q, p),  p_e(j|q, p) = f_e(j, q, p)    (2)

Here the functions f_s and f_e are usually implemented by neural networks that predict the probabilities. In most non-inference machine reading comprehension datasets such as SQuAD, all the information needed to identify an answer can be found inside a single sentence (Raiman and Miller, 2017). In such datasets, given one question and one phrase inside a sentence, whether this phrase is the answer depends on two conditions: 1) whether the phrase itself generally matches the question; 2) whether the syntactic elements in the sentence are precisely consistent with the syntactic elements in the question.
However, the experimental results in Jia and Liang (2017) have shown that current MRC systems pay little attention to the second condition, and can thus be easily attacked by question-related sentences serving as adversarial samples. We attribute this deficiency to the fact that current models solely take phrase-level information into account when predicting the probability p(a|q, p), but fail to exploit the sentence-level matching between the answer-bearing sentence and the question, which is important for evaluating the second condition. Consequently, we propose a new probability function for estimating answers that considers the sentence-level matching degree:

p(a|q, p) = p_s(i|q, p) · p_e(j|q, p) · p_sent(k|q, p)^α    (3)

where s_k is the k-th sentence in p. In general, p_sent predicts whether the answer a appears in the k-th sentence of the paragraph; it captures the matching between sentence and question as a leverage to improve system robustness. α is the leveraging factor for p_sent(k|q, p).
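The leveraged scoring of Eq. (3) can be sketched as a simple span search: multiply the start and end probabilities by the matching score of the containing sentence, raised to the power α. This is a minimal illustration, not the paper's implementation; the function and argument names are ours.

```python
def leveraged_answer_prob(p_start, p_end, p_sent, sent_of_token, alpha=0.25):
    """Score every answer span (i, j) as in Eq. (3):
    p(a|q,p) = p_start[i] * p_end[j] * p_sent[k]^alpha,
    where k = sent_of_token[i] is the sentence containing the span.
    Illustrative sketch; names and the brute-force search are assumptions."""
    n = len(p_start)
    best_score, best_span = -1.0, None
    for i in range(n):
        for j in range(i, n):
            k = sent_of_token[i]
            if sent_of_token[j] != k:   # answers are assumed to stay in one sentence
                continue
            score = p_start[i] * p_end[j] * (p_sent[k] ** alpha)
            if score > best_score:
                best_score, best_span = score, (i, j)
    return best_span, best_score
```

Note how the exponent α dampens the sentence-level term: with a small p_sent[k], an adversarial sentence's otherwise high-scoring span is penalized without the sentence score dominating the phrase-level probabilities.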

Syntactic Leveraging Network
Although in theory f_sent can be implemented by any model that evaluates the matching between two sentences, correctly distinguishing real answer-bearing sentences from semantically-close adversarial sentences requires a model capable of precisely extracting and comparing the syntactic elements within sentences and questions. We therefore propose the Syntactic Leveraging Network (SLN) to predict p_sent(k|q, p), so as to improve the robustness of MRC models. The structure of SLN is shown in Figure 1; it consists of an SRL (Semantic Role Labeling) extractor, a CNN encoder, a Matching operator performing optimal transport (Tam et al., 2019), and a classifier.

SRL Extractor
We utilize SRL (Gildea and Jurafsky, 2002; Khashabi et al., 2018) to analyze the syntax of sentences as prior information. In brief, SRL automatically produces syntactic analyses by exploiting generalizations from syntax-semantics links and assigns labels to the phrases in a sentence according to their syntactic roles.
Given a question q, the SRL extractor separates q into a sequence of phrases Q, specifically:

Q = [q_1, q_2, . . . , q_n]

with corresponding lengths L = [l_1, l_2, . . . , l_n].
Here each q_i represents one syntactic element within q, and each can also be considered a condition that answer-bearing sentences must satisfy. The SLN model takes this sequence of n-grams as input to represent the question.
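SRL taggers typically emit BIO-style labels per token; grouping them into the phrase sequence Q with lengths L can be sketched as below. This is an illustrative helper under the assumption of BIO tags (e.g. B-ARG0, I-ARG0, B-V, O), not the paper's code.

```python
def srl_tags_to_phrases(tokens, tags):
    """Group BIO-style SRL tags into phrases Q = [q_1, ..., q_n] and their
    lengths L = [l_1, ..., l_n]. A new phrase starts at every B- tag or
    whenever the role label changes. Illustrative helper, not the authors' code."""
    phrases, lengths, current = [], [], []
    prev_label = None
    for tok, tag in zip(tokens, tags):
        label = tag.split('-', 1)[-1] if tag != 'O' else 'O'
        if tag.startswith('B-') or label != prev_label:
            if current:                      # flush the finished phrase
                phrases.append(current)
                lengths.append(len(current))
            current = [tok]
        else:                                # continue the current phrase
            current.append(tok)
        prev_label = label
    if current:
        phrases.append(current)
        lengths.append(len(current))
    return phrases, lengths
```

For example, tags ["B-ARG0", "I-ARG0", "B-V"] over ["The", "cat", "sat"] yield the phrases [["The", "cat"], ["sat"]] with lengths [2, 1].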

CNN Encoder
The encoder projects the syntactic elements in Q and s into real-valued vectors. Assume the CNN's filter windows range from w_min to w_max, each with kernel size k. Each q_i in Q is fed only into the filter window of size l_i:

q_i^v = CNN_{l_i}(q_i)

The CNN follows Kim (2014), so that the size of each q_i^v equals the kernel size k. A sentence s of length L is first split into m separate phrases [s_1, s_2, . . . , s_m], which contain all n-grams (w_min ≤ n < w_max) in the sentence. Then each s_j is transformed into s_j^v of size k through the CNN filters:

s_j^v = CNN_n(s_j)

where s_j^v and q_i^v represent pieces of semantics in the sentence and question.
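The per-width convolution can be sketched in plain numpy: one filter bank of width n maps each n-gram's stacked embeddings to a k-dim vector. The weights below are random stand-ins, and the flatten-then-project formulation is our simplification of a Conv1d.

```python
import numpy as np

rng = np.random.default_rng(0)

def ngram_conv_encode(emb, width, W, b):
    """Encode every n-gram of `width` tokens in `emb` (seq_len x d) into a
    k-dim vector using one filter bank W of shape (width*d, k) with a tanh
    activation. Minimal numpy sketch of the CNN encoder; weights are dummies."""
    seq_len, d = emb.shape
    out = []
    for j in range(seq_len - width + 1):
        window = emb[j:j + width].reshape(-1)   # flatten the n-gram embeddings
        out.append(np.tanh(window @ W + b))     # one k-dim phrase vector
    return np.stack(out)                        # shape (seq_len - width + 1, k)

d, k = 100, 128                          # embedding dim and kernel size
emb = rng.normal(size=(10, d))           # toy sentence: 10 token embeddings
W3 = rng.normal(size=(3 * d, k)) * 0.1   # stand-in filter bank for trigrams
vecs = ngram_conv_encode(emb, 3, W3, np.zeros(k))
# one 128-dim vector per trigram of the 10-token sentence
```

In the full encoder this would be repeated for every width from w_min to w_max, pooling the question phrases q_i through the filter bank matching their length l_i.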

Matching Operator
The matching operator is designed to evaluate whether the sentence generally matches the syntactic elements of the question. It first computes the cosine similarity between each q_i^v and s_j^v, which gives a similarity matrix S ∈ R^{n×m}. Then we apply max pooling across each row of S to obtain q^sim:

q_i^sim = max_j S_ij

The value of each q_i^sim varies from 0 to 1, indicating the degree to which each syntactic element q_i is matched in s. In particular, q_i^sim equals 1 if the syntactic element q_i appears exactly in s, which is a strong signal for element matching.
Furthermore, given S, for each q_i^v we compute its corresponding h_i^v. Specifically:

h_i^v = [s^v_{argmax_j S_ij} ; q_i^sim]

where s^v_{argmax_j S_ij} is the vector representation of the most semantically-similar phrase in the sentence given q_i^v, and q_i^sim represents the degree of similarity. Overall, h_i^v represents the most-matched phrase in the sentence for one syntactic element of the question, together with its degree of matching. Finally, h^v is the output of the Matching Operator.
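The matching operator reduces to three array operations: a cosine-similarity matrix, a row-wise max pooling, and a gather of the best-matching phrase vectors. The sketch below assumes h_i concatenates the matched phrase vector with its similarity score, which is our reading of the text above.

```python
import numpy as np

def matching_operator(q_vecs, s_vecs):
    """Given question phrase vectors q_vecs (n x k) and sentence phrase
    vectors s_vecs (m x k), return the cosine-similarity matrix S (n x m),
    the row-max pooled q_sim (n,), and h (n x (k+1)) pairing each question
    element with its best-matching phrase vector and similarity score.
    Sketch only; the exact composition of h is an assumption."""
    qn = q_vecs / np.linalg.norm(q_vecs, axis=1, keepdims=True)
    sn = s_vecs / np.linalg.norm(s_vecs, axis=1, keepdims=True)
    S = qn @ sn.T                      # (n, m) cosine similarities
    q_sim = S.max(axis=1)              # best match for each question element
    best = S.argmax(axis=1)            # index of the matched sentence phrase
    h = np.concatenate([s_vecs[best], q_sim[:, None]], axis=1)
    return S, q_sim, h
```

When a question phrase appears verbatim in the sentence, its row of S contains a 1 and q_sim flags the exact match, which is the signal the operator is built to expose.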

Classifier
The final classifier of SLN is designed to predict whether the sentence matches the question. It first concatenates the outputs h_i^v from the matching operator with q_i^v as the LSTM inputs, such that:

c_n = LSTM([h_1^v ; q_1^v], . . . , [h_n^v ; q_n^v])

The last LSTM hidden state c_n is then passed to a dense layer followed by a sigmoid activation, and binary cross-entropy is adopted as the loss function.
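A minimal PyTorch sketch of this classifier follows, using the hidden sizes given in the experimental setup (LSTM 128, dense 64). The module name and exact layer arrangement are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class SLNClassifier(nn.Module):
    """Sketch of the SLN classifier: an LSTM over the concatenated
    [h_i^v ; q_i^v] inputs, whose last hidden state feeds a dense layer
    and a sigmoid for binary sentence-question matching."""
    def __init__(self, in_dim, hidden=128, dense=64):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(hidden, dense), nn.ReLU(),
                                nn.Linear(dense, 1))

    def forward(self, x):              # x: (batch, n, in_dim)
        _, (h_n, _) = self.lstm(x)     # last hidden state, (1, batch, hidden)
        return torch.sigmoid(self.ff(h_n[-1])).squeeze(-1)  # (batch,)
```

Training against binary cross-entropy (`nn.BCELoss`) on the 1:3 positive/negative sentence pairs described in the experimental setup would complete the sketch.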

Experimental Setups
Data Description. We implement our method on several end-to-end MRC models trained on the SQuAD dataset, and evaluate their robustness before and after considering p_sent(k|q, p) using the AddSent adversarial dataset (Jia and Liang, 2017). The training and test sets for the MRC models are generated from SQuAD. To compute p_sent(k|q, p), we take the answer-bearing sentences in SQuAD as positive samples. For each positive sample, three sentences from the same paragraph that do not contain the answer are randomly chosen as negative samples, giving a positive/negative ratio of 1:3. All sentence-level matching models are trained on these samples as a binary classification task using cross-entropy loss.
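The 1:3 sampling scheme above can be sketched as a small helper; the function name and signature are illustrative, not from the paper's codebase.

```python
import random

def build_matching_samples(paragraph_sents, answer_sent_idx, neg_per_pos=3, seed=0):
    """Build (sentence, label) training pairs for the sentence-matching task:
    the answer-bearing sentence is the positive (label 1), and up to
    `neg_per_pos` other sentences from the same paragraph are negatives
    (label 0). Illustrative sketch of the sampling described in the text."""
    rng = random.Random(seed)
    pos = [(paragraph_sents[answer_sent_idx], 1)]
    candidates = [s for i, s in enumerate(paragraph_sents) if i != answer_sent_idx]
    negs = rng.sample(candidates, min(neg_per_pos, len(candidates)))
    return pos + [(s, 0) for s in negs]
```

Paragraphs with fewer than three non-answer sentences simply contribute fewer negatives, which slightly perturbs the global 1:3 ratio.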
Baselines. Besides SLN, we use relevance-LSTM and Inner-Attention (Liu et al., 2016) as baselines to compute f_sent(q, s_k). Relevance-LSTM simply takes the last hidden states of the sentence and question for similarity computation, and is also used in the MRC model of Raiman and Miller (2017); Inner-Attention denotes the bidirectional LSTM encoder with intra-attention, which uses the sentence's own representation to attend to the words within the sentence. BiDAF (Seo et al., 2017) and MneReader (Hu et al., 2017) are chosen as the back-end MRC models, and the results are obtained with our Keras implementations (Chollet et al., 2015).

Parameter Settings. For SLN, we use AllenNLP to perform SRL (Gardner et al., 2017); the filter windows are set from 1 to 8, each with kernel size 128. The hidden size of the LSTM is set to 128, and the size of the dense layer to 64. Adam (Kingma and Ba, 2014) with learning rate 0.001 is used to optimize SLN; the batch size is set to 8 and the models are trained for 50 epochs, with early stopping once the loss on the validation set stops decreasing. The dropout rate is set to 0.2 to prevent overfitting (Srivastava et al., 2014). We use the pretrained 100-dim GloVe embeddings for all models and keep them frozen during training (Pennington et al., 2014). The leveraging factor α is set to 0.25 for relevance-LSTM, Inner-Attention, and SLN alike. For BiDAF and MneReader as back-end MRC models, we follow the exact hyperparameter settings of Seo et al. (2017) and Hu et al. (2017).

Results of the MRC Task

Table 1 details the performances of the models on the MRC datasets. The results show that the performances of both BiDAF and MneReader drop significantly on the adversarial dataset, indicating that current MRC models are not robust enough to distinguish semantically similar candidates from true answers. Concerning robustness, both Inner-Attention and SLN improve the EM and F1 of BiDAF and MneReader on the AddSent dataset.
This provides evidence that the robustness of MRC models can be improved by properly exploiting sentence-level matching information. It can also be observed that introducing sentence-level matching into the models is, overall, not detrimental to their performance on the regular dataset; Inner-Attention even slightly increases the EM and F1 on regular SQuAD.

By contrast, Relevance-LSTM fails to improve the performance of the current MRC models. We attribute this to two reasons: 1) Relevance-LSTM mainly relies on the semantics of the whole sentence to evaluate the relevance of two sentences, but current MRC models have already captured this information; 2) word-level or phrase-level correspondence is important for identifying whether two sentences are talking about the same thing, yet it is omitted both in Relevance-LSTM and in current end-to-end metric-oriented MRC models.

Analysis on Sentence Matching
The results of sentence matching are shown in Table 2. It can be observed that Inner-Attention achieves the best performance, which we attribute to its attention mechanism helping to capture semantic clues for detecting answer-related sentences given the question. However, although Inner-Attention significantly outperforms SLN on sentence matching, the results on the adversarial dataset show that SLN is more effective at promoting robustness, as reflected by the highest EM and F1 achieved by SLN on AddSent. Since most current MRC models already model the high-level semantics of sentences sufficiently, the attention mechanism in Inner-Attention may be redundant and thus less effective in identifying adversarial samples. The performance of SLN on robustness promotion further supports our hypothesis that introducing syntactic information as leverage in answer prediction is a feasible way to enhance the robustness of MRC systems.

Conclusions
In this paper, we exploit the use of sentence-level information, especially sentence syntax as leverage, in the machine reading comprehension task. The experimental results show that such an approach is capable of improving the robustness of MRC systems against adversarial samples while maintaining performance on regular datasets, although the improvements in robustness are currently relatively moderate.