Relation Module for Non-Answerable Predictions on Reading Comprehension

Machine reading comprehension (MRC) has attracted significant research attention recently, driven by a growing number of challenging reading comprehension datasets. In this paper, we aim to improve an MRC model's ability to determine whether a question has an answer in a given context (e.g. the recently proposed SQuAD 2.0 task). To this end, we propose a relation module that combines semantic extraction with relational information. We first extract high-level semantics as objects from both question and context with multi-head self-attentive pooling. These semantic objects are then passed to a relation network, which generates relationship scores for each object pair in a sentence. These scores are used to determine whether a question is non-answerable. We test the relation module on the SQuAD 2.0 dataset using both the BiDAF and BERT models as baseline readers. We obtain a 1.8% gain in F1 on top of the BiDAF reader, and a 1.0% gain on top of the BERT base model. These results show the effectiveness of our relation module on MRC.


Introduction
Ever since the release of many challenging large-scale datasets for machine reading comprehension (MRC) (Rajpurkar et al., 2016; Joshi et al., 2017; Trischler et al., 2016; Yang et al., 2018; Reddy et al., 2018; Jia and Liang, 2017), there have been correspondingly many models for these datasets (Yu et al., 2018; Seo et al., 2017; Liu et al., 2018b; Hu et al., 2017; Xiong et al., 2017; Wang et al., 2018; Liu et al., 2018c; Tay et al., 2018). Knowing what you don't know (Rajpurkar et al., 2018) is important in real applications of reading comprehension. Unanswerable questions are commonplace in the real world, and SQuAD 2.0 was released specifically to target this problem (see Figure 1 for an example of a non-answerable question).
Example 1
Context: Each year, the southern California area has about 10,000 earthquakes. Nearly all of them are so small that they are not felt. Only several hundred are greater than magnitude 3.0, and only about 15-20 are greater than magnitude 4.0. The magnitude 6.7 1994 Northridge earthquake was particularly destructive, causing a substantial number of deaths, injuries, and structural collapses. It caused the most property damage of any earthquake in U.S. history, estimated at over $20 billion. Question: What earthquake caused $20 million in damage? Answer: None.
One problem that most early MRC readers have in common is the inability to predict non-answerable questions. Readers built for the popular SQuAD dataset have to be modified in order to accommodate a non-answerable possibility. Current methods on SQuAD 2.0 generally attempt to learn a single fully connected layer (Clark and Gardner, 2018; Liu et al., 2018a; Devlin et al., 2018) in order to determine whether a question/context pair is answerable. This leaves out relational information that may be useful for determining answerability. We believe that relationships between different high-level semantics in the context help a model decide whether a question is answerable. For example, in Example 1, "Northridge earthquake" is mistakenly taken as the answer to the question about what earthquake caused $20 million in damage. Because "$20 billion" is positioned far away from "Northridge earthquake", it is hard for a model to link these two concepts together and recognize the mismatch between "$20 million" in the question and "$20 billion" in the context.
Motivated by exploiting high-level semantic relationships in the context, our first step is to extract meaningful high-level semantics from the question and context. Multi-head self-attentive pooling (Lin et al., 2017) has been shown to extract different views of a sentence with multiple heads. Each head of the multi-head self-attentive pooling places different weights on the context through learned parameters. This allows each head to act as a filter that emphasizes part of the context. By summing up the weighted context, we obtain a vector representing an instance of a high-level semantic, which we call an "object". With multiple heads, we generate different semantic objects, which are then fed into a relation network.
Relation networks (Santoro et al., 2017) are specifically designed to model relationships between pairs of objects. In the case of reading comprehension, an object would ideally capture phrase-level semantics within a sentence. Relation networks model these relationships by constraining the network to learn a score for each pair of objects. After learning all of the pairwise scores, the relation network summarizes all of the relations into a single vector. By taking a weighted sum of all of the relation scores for the sentence, we generate a non-answerable score that is trained jointly with the answer span scores from any MRC model to determine non-answerability.
In addition, we add in plausible answers from unanswerable examples to help train the relation module. These plausible answers help the base model learn a better span prediction and are also used to help guide our object extractor to extract relevant semantics. We train a separate layer for start-end probabilities based on the plausible answers. We then augment the context vector with hidden states from this layer. This allows the multi-head self-attentive pooling to focus on objects related to the proposed answer span, and to differentiate them from other objects that are not as relevant in the context.
In summary, we propose a new relation module dedicated to learning relationships between high-level semantics and deciding whether a question is answerable. Our contributions are four-fold:
1. Introduce the concept of using multi-head self-attentive pooling outputs as high-level semantic objects.
2. Exploit relation networks to model the relationships between different objects in a context. We then summarize these relationships to get a final decision.
3. Introduce a separate feed-forward layer trained on plausible answers so that we can augment the context vector passed into the object extractor. This results in the object extractor extracting phrases more relevant to the proposed answer span.
4. Combine all of the above into a flexible relation module that can be added to the end of a question answering model to boost non-answerable prediction.
To our knowledge, this is the first work in reading comprehension to use an object extractor to extract high-level semantics and a relation network to encode relationships between these semantics. Our results show improvement on top of the baseline BiDAF model and the state-of-the-art reader based on BERT on the SQuAD 2.0 task.

Related Work
Relation Networks (RN) were first proposed by Santoro et al. (2017) to help neural models reason over the relationships between two objects. Relation networks learn relationships between objects by learning a pairwise score for each object pair. Relation networks have been applied to CLEVR (Johnson et al., 2017) as well as bAbI (Weston et al., 2015). In the CLEVR dataset, the object inputs to the relation network are visual objects in an image, extracted by a CNN, and in bAbI the object inputs are sentence encodings. In both tasks, the relation network is then used to compute a relationship score over these objects. Relation Networks were further applied to general reasoning by training the model on images (You et al., 2018).
MAC (Memory, Attention and Composition) networks (Hudson and Manning, 2018) are a different family of models that have also been shown to learn relations from the CLEVR dataset. MAC networks operate with read and write cells. Each cell computes a relation score between a knowledge base and the question and writes it into memory. Multiple read and write cells are strung together sequentially in order to model long chains of multi-hop reasoning. Although MAC networks do not explicitly reason over pairs of objects as relation networks do, they are an interesting way of generating multi-hop reasoning between objects within a context. Another similar line of work investigated pre-training relationship embeddings across word pairs on a large unlabelled corpus (Jameel et al., 2018; Joshi et al., 2018). These pre-trained pairwise relational embeddings were added to the attention layers of BiDAF, where higher-level abstract reasoning occurs. The latter work showed an impressive gain of 2.7% on the SQuAD 2.0 development set on top of their version of BiDAF.
Many MRC models have been adapted to work on SQuAD 2.0 recently (Hu et al., 2019; Liu et al., 2018a; Sun et al., 2018; Devlin et al., 2018). Hu et al. (2019) added a separately trained answer verifier for no-answer detection to their Mnemonic Reader. The answer sentence proposed by the reader and the question are passed to three combinations of differently configured verifiers for fine-grained local entailment recognition. Liu et al. (2018a) added a single layer as the unanswerable binary classifier to their SAN reader. Sun et al. (2018) proposed the U-net with a universal node that encodes the fused information from both the question and passage. The summary U-node, question vector and two context vectors are passed on to predict whether the question is answerable. In that work, plausible answers were used for no-answer pointer prediction, while in our approach plausible answers are used to augment the context vector for object extraction, which later helps the no-answer prediction.
Pre-training embeddings on a large unlabelled corpus has been shown to improve many downstream tasks (Peters et al., 2018; Howard and Ruder, 2018; Alec et al., 2018). The recently released BERT (Devlin et al., 2018) greatly increased the F1 scores on the SQuAD 2.0 leaderboard. BERT consists of stacked Transformers (Vaswani et al., 2017) that are pre-trained on vast amounts of unlabeled data with a masked language model objective. The masked language model helps fine-tuning on downstream tasks, such as SQuAD 2.0. BERT models contain a special CLS token which is helpful for the SQuAD 2.0 task. This CLS token is trained during pre-training to predict whether a pair of sentences follow each other, which helps encode entailment information between the sentence pair. Due to a strong masked language model to help predict answers and a strong CLS token to encode entailment, BERT models are the current state of the art for SQuAD 2.0.

Relation Module
Our relation module is flexible, and can be placed on top of any MRC model. We now describe the relation module in detail.

Augmenting Inputs
Figure 2 shows our relation module on top of the base reader BERT. In addition to the original start-end prediction layers trained on true answers in the base reader, we include a separate start-end prediction layer, with separate parameters, trained specifically on the plausible and true answers available in SQuAD 2.0. The context output $C$ from BERT is projected into two hidden state layers $S$ and $E$, where $C, S, E \in \mathbb{R}^{L \times h}$, $L$ is the context length and $h$ is the hidden size. The $S$ and $E$ layers are then projected down to a hidden dimension of 1 and trained with cross-entropy loss against the plausible and true answer starts and ends. The hidden states $S$ and $E$ of this layer are concatenated with the last context layer output $C$ and projected back to the original dimension to obtain the augmented context vector $X$, which is fused with start-end span information.
Here $[\cdot\,;\cdot\,;\cdot]$ denotes the concatenation of multiple tensors, applied to $C$, $S$ and $E$ before the projection, and $X \in \mathbb{R}^{L \times h}$. This process is shown in Figure 2, where $S$ and $E$ are hidden states trained on plausible and true answer spans. The tensor $X$ and the last question layer output $Q$ are passed to the object extractor layer.
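The augmentation step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the weight names `W_s`, `W_e` and `W_x`, the absence of bias terms, and the random initial values are all hypothetical, and the 1-dimensional start/end projections with their cross-entropy training are omitted.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def random_matrix(rows, cols, rng):
    """Small random weights standing in for learned parameters (hypothetical)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def augment_context(C, W_s, W_e, W_x):
    """Return (S, E, X), where X fuses C with the span hidden states S and E.

    C is L x h; W_s and W_e are h x h; W_x is 3h x h (assumed shapes).
    """
    S = matmul(C, W_s)                                  # L x h start hidden states
    E = matmul(C, W_e)                                  # L x h end hidden states
    concat = [c + s + e for c, s, e in zip(C, S, E)]    # L x 3h, i.e. [C; S; E]
    X = matmul(concat, W_x)                             # projected back to L x h
    return S, E, X

rng = random.Random(0)
L, h = 6, 4
C = random_matrix(L, h, rng)
S, E, X = augment_context(
    C,
    random_matrix(h, h, rng),
    random_matrix(h, h, rng),
    random_matrix(3 * h, h, rng),
)
print(len(X), len(X[0]))  # 6 4
```

The point of the sketch is the shape bookkeeping: concatenation triples the hidden size, and the final projection restores it so that $X$ can replace $C$ as input to the object extractor.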

Object Extractor
The augmented context tensor $X$ (and, separately, the question tensor $Q$) is passed through the object extractor to generate object representations. We pass the inputs through a multi-head self-attentive pooling layer. This object extractor can be thought of as a set of filters extracting areas of interest within a sentence. We multiply the input tensor $X$ with a multi-head self-attention matrix $A$, which is defined as
$$A = \mathrm{softmax}\left(W_4\, \sigma\left(W_3 X^{\top}\right)\right), \qquad O = A X,$$
where $W_3 \in \mathbb{R}^{h \times h}$ and $W_4 \in \mathbb{R}^{n \times h}$; $\sigma$ is an activation function, such as tanh; $n$ is the number of heads, and $h$ is the hidden dimension. The softmax is taken over the $L$ context positions, so that $A \in \mathbb{R}^{n \times L}$. The output $O \in \mathbb{R}^{n \times h}$ contains the $n$ objects with hidden dimension $h$ that are passed to the next layer.
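As a concrete illustration of the object extractor, the following sketch implements multi-head self-attentive pooling in plain Python. It assumes the attention matrix has the form used by Lin et al. (2017), softmax(W4 · tanh(W3 · Xᵀ)), which matches the shapes stated above. The random weights and the small sizes are illustrative only.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]

def extract_objects(X, W3, W4):
    """X: L x h, W3: h x h, W4: n x h. Returns (A, O) with A: n x L and O: n x h."""
    hidden = [[math.tanh(v) for v in row] for row in matmul(W3, transpose(X))]  # h x L
    scores = matmul(W4, hidden)             # n x L, one row of scores per head
    A = [softmax(row) for row in scores]    # each head is a distribution over tokens
    O = matmul(A, X)                        # each object is a weighted sum of tokens
    return A, O

rng = random.Random(0)
L, h, n = 7, 4, 3
X = [[rng.uniform(-1, 1) for _ in range(h)] for _ in range(L)]
W3 = [[rng.uniform(-1, 1) for _ in range(h)] for _ in range(h)]
W4 = [[rng.uniform(-1, 1) for _ in range(h)] for _ in range(n)]
A, O = extract_objects(X, W3, W4)
print(len(O), len(O[0]))  # 3 4
```

Each row of `A` sums to one, which is what lets a head be read as a soft filter over the context tokens.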

Object Extraction Regularization
To encourage the multiple heads to extract different meaningful semantics in the text, a regularization loss (Xia et al., 2018) is introduced so that each head attends to slightly different sections of the context. Overlapping objects centered on the answer span are expected, due to the information fused from $S$ and $E$, but we do not want the entire weight distribution of a head to be focused solely on the answer span. As we show in later figures, many heads heavily weight the answer span, but also weight information relevant to the answer span that is needed to make a better non-answerable prediction. Our regularization term also helps prevent the multi-head attentive pooling from learning a noisy distribution over the whole context. This regularization loss is defined as
$$\mathcal{L}_{\text{reg}} = \alpha \left\lVert A A^{\top} - I \right\rVert_F^2,$$
where $A$ is the weight matrix for the attention heads, $I$ is the identity matrix, and $\lVert \cdot \rVert_F$ is the Frobenius norm. $\alpha$ is set to 0.0005 in our experiments.
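To make the effect of this regularizer concrete, here is a small sketch that assumes the penalty takes the squared Frobenius form α‖AAᵀ − I‖² used with self-attentive pooling by Lin et al. (2017). The function name and the toy attention matrices are hypothetical.

```python
def head_overlap_penalty(A, alpha=0.0005):
    """Return alpha * ||A A^T - I||_F^2 for an n x L attention matrix A."""
    n = len(A)
    total = 0.0
    for i in range(n):
        for j in range(n):
            overlap = sum(a * b for a, b in zip(A[i], A[j]))  # entry (i, j) of A A^T
            target = 1.0 if i == j else 0.0                   # entry (i, j) of I
            total += (overlap - target) ** 2
    return alpha * total

# Two heads attending to different tokens incur no penalty ...
disjoint = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# ... while two heads attending to the same token are penalized.
identical = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
print(head_overlap_penalty(disjoint))   # 0.0
print(head_overlap_penalty(identical))  # 0.001
```

The off-diagonal terms penalize overlap between heads, while the diagonal terms push each head toward a peaked rather than diffuse distribution.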

Relation Networks
Extracted objects are subsequently passed to a relation network. We use a two-layer MLP $g_\theta$ (in Figure 3) as a scoring function to compute the similarity between objects. In the question-answering task, the context contains the contextual information necessary to determine whether a question is answerable. Phrases and ideas from various parts of the context need to come together in order to fully understand whether or not a question is answerable. Therefore, our relation module scores all pairs of context objects and uses the question objects to guide the scoring function. We use 2 question heads $q_0, q_1$, so the scoring function $g_\theta$ takes a pair of context objects $(o_i, o_j)$ together with $q_0$ and $q_1$ as input. The output $r_i$ is the weighted sum of the relation values for object $o_i$ from $O$, and $z$ is a summarized relation vector. The weights $\Omega_i = [\omega_{i,0}, ..., \omega_{i,n}]$ and $\Gamma = [\gamma_0, ..., \gamma_n]$ are computed by projecting the relation scores down to a hidden size of 1 and applying a softmax.
$g_\theta$ and $f_\phi$ are two-layer MLPs with tanh activations that compute and aggregate relational scores. Figure 3 shows the process of a single relation network, where two context objects and the question objects are passed to $g_\theta$ to obtain the output $z$.
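The pairwise scoring and aggregation described above can be sketched as follows. Several details are assumptions rather than statements about the actual model: the inputs to the scoring MLP are concatenated, every ordered pair (including an object with itself) is scored, tanh is applied after both MLP layers, and the aggregation MLP is applied after the weighted sum over objects. The parameter names `w_omega` and `w_gamma` and the random initial values are hypothetical.

```python
import math
import random

def linear(x, W, b):
    """Affine map of a vector x; W is stored as a list of output rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def mlp2(x, params):
    """Two-layer MLP with tanh after each layer (assumed placement)."""
    (W1, b1), (W2, b2) = params
    hidden = [math.tanh(v) for v in linear(x, W1, b1)]
    return [math.tanh(v) for v in linear(hidden, W2, b2)]

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [v / total for v in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weighted_sum(weights, vectors):
    return [sum(w * vec[k] for w, vec in zip(weights, vectors))
            for k in range(len(vectors[0]))]

def relation_network(objects, q0, q1, g_params, f_params, w_omega, w_gamma):
    """Score all object pairs with g, pool them per object, then summarize with f."""
    r = []
    for o_i in objects:
        relations = [mlp2(o_i + o_j + q0 + q1, g_params) for o_j in objects]
        omega = softmax([dot(w_omega, g) for g in relations])   # weights over partners j
        r.append(weighted_sum(omega, relations))                # r_i
    gamma = softmax([dot(w_gamma, r_i) for r_i in r])           # weights over objects i
    return mlp2(weighted_sum(gamma, r), f_params)               # summary vector z

def init_mlp(d_in, d_hidden, d_out, rng):
    def layer(n_out, n_in):
        W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        return W, [0.0] * n_out
    return layer(d_hidden, d_in), layer(d_out, d_hidden)

rng = random.Random(0)
h, d, n = 3, 5, 4
objects = [[rng.uniform(-1, 1) for _ in range(h)] for _ in range(n)]
q0 = [rng.uniform(-1, 1) for _ in range(h)]
q1 = [rng.uniform(-1, 1) for _ in range(h)]
g_params = init_mlp(4 * h, d, d, rng)
f_params = init_mlp(d, d, d, rng)
w_omega = [rng.uniform(-1, 1) for _ in range(d)]
w_gamma = [rng.uniform(-1, 1) for _ in range(d)]
z = relation_network(objects, q0, q1, g_params, f_params, w_omega, w_gamma)
print(len(z))  # 5
```

With n context objects the network evaluates n² pairs, so the cost grows quadratically in the number of heads, which is one practical reason to keep the number of objects small.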
We project the weighted sum of $f_\phi$ with a linear layer to a single value as a representation of the non-answerable score. This score is combined with the start/end logits from the base reader, and trained jointly with the reader's cross-entropy loss. By training jointly, the model is able to make a better prediction based on the confidence of the span prediction, as well as the confidence based on the non-answerable score from the relation module.
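The text does not spell out how the non-answerable score is combined with the span logits. One common option, shown here purely as an assumption, is to treat the score as an extra "no answer" position in both the start and end softmax, so that a single cross-entropy loss covers answerable and unanswerable examples. The function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def joint_loss(start_logits, end_logits, na_score, gold_span=None):
    """Cross-entropy over positions plus one extra no-answer slot.

    gold_span is (start, end) for answerable questions and None otherwise.
    """
    no_answer_index = len(start_logits)               # index of the appended slot
    start_lp = log_softmax(start_logits + [na_score])
    end_lp = log_softmax(end_logits + [na_score])
    if gold_span is None:
        start, end = no_answer_index, no_answer_index
    else:
        start, end = gold_span
    return -(start_lp[start] + end_lp[end])

# For an unanswerable question, a higher non-answerable score lowers the loss.
flat = [0.0, 0.0, 0.0]
print(joint_loss(flat, flat, 5.0) < joint_loss(flat, flat, -5.0))  # True
```

Under this formulation the span logits and the non-answerable score compete inside the same normalization, which is what allows the two sources of confidence to trade off during training.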

Question Answering Baselines
We test the relation module on top of our own PyTorch implementation of the BiDAF model (Seo et al., 2017), as well as the recently released BERT base model (Devlin et al., 2018), on the SQuAD 2.0 task. For both of these models, we obtain improvements from adding the relation module. Note that we do not test our relation module on top of the current leaderboard leaders, as their details have not yet been published. We also do not test on top of BERT + Synthetic Self Training (Devlin, 2019) due to a lack of computational resources. Our aim is to show the effectiveness of our method, not to compete with the top of the leaderboard.

BiDAF
We implement the baseline BiDAF model for the SQuAD 2.0 task (Clark and Gardner, 2018) with some modifications: we add features that are commonly used in question answering tasks, such as TF-IDF and POS/NER tagging, as well as the auxiliary training losses from Hu et al. (2019). These modifications to the original BiDAF bring about a 3.8% gain in F1 on the SQuAD 2.0 development set (see Table 1).
The input to the relation module is the context vector generated by the bi-directional attention flow layer. This context layer is augmented with the hidden states of linear layers trained against plausible answers, which also take the context layer from the attention flow layer as input. This configuration is shown in Figure 4.

BERT
BERT is a masked language model pre-trained on large amounts of data, and it is the core component of all of the current state-of-the-art models on the SQuAD 2.0 task. The input to BERT is the concatenation of a question and context pair in the form ["CLS"; question; "SEP"; context; "SEP"]. BERT comes with its own special "CLS" token, which is pre-trained on a next-sentence prediction objective in order to encode entailment information between the two sentences.
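The input layout can be illustrated with a short sketch. The bracketed spellings `[CLS]` and `[SEP]` and the 0/1 segment ids follow the usual BERT convention; the helper name and the pre-tokenized example are hypothetical, and real inputs would use WordPiece tokens.

```python
def build_bert_input(question_tokens, context_tokens):
    """Lay out a question/context pair as [CLS] question [SEP] context [SEP]."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question and the first [SEP];
    # segment 1 covers the context and the final [SEP].
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_bert_input(
    ["what", "earthquake", "caused", "damage", "?"],
    ["the", "northridge", "earthquake", "was", "destructive", "."],
)
print(tokens[0], len(tokens), len(segment_ids))  # [CLS] 14 14
```

Because "CLS" always sits at position 0, its final hidden state gives a fixed-location summary of the pair that downstream layers can read off.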
We leverage this "CLS" node in the relation module by concatenating it with the output of our relation module and projecting the result down to a single dimension. This combines the information stored in the "CLS" token, learned during pre-training, with the information that we learn through our relation module. We allow gradients to pass through all layers of BERT, and fine-tune the initialized weights on the SQuAD 2.0 dataset.
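A minimal sketch of this fusion step is shown below, assuming a single linear layer over the concatenated vectors. The function name, the bias term and the toy values are hypothetical.

```python
def no_answer_logit(cls_vector, relation_vector, weights, bias=0.0):
    """Concatenate the CLS state with the relation output and project to one value."""
    features = cls_vector + relation_vector
    if len(weights) != len(features):
        raise ValueError("expected one weight per concatenated feature")
    return sum(w * x for w, x in zip(weights, features)) + bias

# Toy sizes: a 2-dimensional CLS state and a 1-dimensional relation summary.
print(no_answer_logit([1.0, 2.0], [3.0], [0.5, 0.5, 1.0]))  # 4.5
```

Since the projection is learned jointly, the model can weight the pre-trained entailment signal in "CLS" against the relational evidence on a per-feature basis.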

Experiments
We experiment on the SQuAD 2.0 dataset (Rajpurkar et al., 2018), which contains question and context examples that are crowd-sourced from Wikipedia. Each example contains an answer span in the passage, or an empty string indicating that an answer does not exist. The results are reported on the SQuAD 2.0 development set.
[Table 1. Columns: Model, EM(%), F1(%). First row: (Clark and Gardner, 2018), EM 61.9; remaining values not recoverable.]
We use the following parameters in our BiDAF experiment: 16 context heads and 2 question heads. We set the regularization loss weight for the object extractor to 0.0005. We use the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.0008, and decay the learning rate by a factor of 0.5 with a patience of 3 epochs. We add auxiliary losses for plausible answers, and re-rank the non-answerable loss as in Hu et al. (2019).
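The learning-rate schedule can be sketched as a small plateau-based scheduler. This is an illustration under assumptions: the monitored quantity is taken to be development-set F1, and the rate is halved once three consecutive epochs fail to improve on the best value so far. The class name and the example scores are hypothetical.

```python
class PlateauDecay:
    """Halve the learning rate after `patience` epochs without improvement."""

    def __init__(self, lr=0.0008, factor=0.5, patience=3):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = None
        self.bad_epochs = 0

    def step(self, dev_f1):
        """Record one epoch's score and return the learning rate to use next."""
        if self.best is None or dev_f1 > self.best:
            self.best = dev_f1
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

scheduler = PlateauDecay()
for score in [60.0, 61.0, 61.0, 60.0, 59.0]:
    lr = scheduler.step(score)
print(lr)  # 0.0004
```

Library schedulers may count the patience window slightly differently (for example, decaying only after more than `patience` bad epochs), so the exact epoch at which the decay fires is an implementation detail.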
BERT comes in two different sizes: a BERT-base model (comprising roughly 110 million parameters) and a BERT-large model (comprising roughly 340 million parameters). We use the BERT-base model for our experiments because of the computing resources that training the BERT-large model would require. We only use the BERT-large model to show that we still get improvements with the relation module. The relation module on top of the BERT-base model contains only roughly 10 million parameters.
We run our BERT-base experiments with the same hyper-parameters given in the official BERT GitHub repository. We use 16 context objects, 2 question heads, and a regularization loss weight of 0.0005. We also show that on top of the BERT-large model, on the development set, our relation module still obtains a performance gain. We use the same number of objects and the same regularization loss weight as in the BiDAF model experiments.
Table 1 presents the results of the baseline readers with and without the relation module on the SQuAD 2.0 development set. Our proposed relation module improves the overall F1 and EM accuracy: a 2.0% gain in EM and a 1.8% gain in F1 on BiDAF, as well as a 0.8% gain in EM and a 1.0% gain in F1 on the BERT-base model. Our relation module is able to take relational information between object pairs and form a better no-answer prediction than a model without it. The module obtains a smaller gain (0.5% F1) on the BERT-large model, since the stronger BERT-large baseline leaves less room for improvement. The module is reader independent and can be attached to any reading comprehension model that must handle non-answerable questions.
Table 2 presents the performance of three BERT-base models with minimal additions, taken from the official SQuAD 2.0 leaderboard. We see that our relation module gives more gain than an Answer Verifier on top of the BERT-base model. Our module gains 1.3% F1 over the Answer Verifier.
Since our relation module is designed to improve an MRC model's ability to judge non-answerable questions, we examine the accuracy separately for answerable and non-answerable questions. Table 3 compares these accuracy numbers with and without the relation module on top of the BERT-base model. The relation module improves prediction accuracy for both types of questions, with a larger gain on the non-answerable questions: close to 4%, which corresponds to more than 200 additional non-answerable questions being correctly predicted.

Ablation Study
We conduct an ablation study to show how different components of the relation module affect the overall performance of the BERT-base model. First, we test only adding plausible answers on top of the BERT-base model, in order to quantify the gain in span prediction that these extra answers give. We show that with just plausible answers added, the average over three seeds gains only about 0.3 F1. This gain in F1 is due to the BERT layers being fine-tuned on the additional answer span data that we provide. Next, we study the effect of removing the augmentation of the context vector with plausible answers. We feed the output of our BERT-base model directly into the object extractor and subsequently into the relation network. This quantifies the effect of forcing the self-attentive heads to focus on a plausible answer span. We notice that this performs comparably to just adding plausible answers, also with only around a 0.3 F1 gain.
Finally, we study the effect of different numbers of heads on our relation module. We experiment with 4, 16, and 64 heads, with 16 heads performing the best out of these three configurations. Having too few heads hinders performance because not enough information is propagated for the relation network to operate on. Having too many heads introduces redundant information, as well as extraneous noise that the model must sift through to generate meaningful relations.

Analysis
In order to gain a better understanding of how the relation module helps with the unanswerable prediction, we examine the objects extracted by the multi-head self-attentive pooling. This is to check whether the relevant semantics are extracted for the relation network. Examples are selected from the development set for data analysis.
In Example 1, the BERT-base model incorrectly outputs "Northridge earthquake" (in red) as the answer. However, after adding our relation module, the model rejects this possible answer and outputs a non-answerable prediction.
The two objects from the question highly attend to the token "million" (see the bottom subplot in Figure 5). The top-row purple object covers the tokens "1994" and "##ridge earthquake" in the possible answer span window, and "billion" near the end of the context window. We hypothesize that the relation network rejects the possible answer "Northridge earthquake" due to the mismatch between "million" in the question objects and "billion" in the purple context object, together with the relation scores from all other object pairs.

Figure 5: In each subplot, each row represents one object from our object extractor; for each object we highlight the top 5 tokens with the highest weights in the entire context and question. We show the two windows where the majority of these top 5 weights occur. For example, the top purple object in the context looks at key phrases such as "##ridge earthquake" in the top subplot and "billion" in the middle subplot; the blue object in the question looks at "20 million in" in the bottom subplot.

Example 2 shows another non-answerable question and context pair. The BERT-base model incorrectly outputs "input encoding" (in red) as its prediction, while adding our relation module to the BERT-base model correctly predicts that the question is not answerable. Figure 6 gives a visual illustration of the objects extracted from the context and question. In Figure 6, the upper plot illustrates the 16 semantic objects shown in this context window and the lower plot illustrates the two semantic objects from the question. We see that in the upper plot "some concrete" and "input encoding" are highlighted, while in the lower plot "what", "the abstract" and "most" are highlighted. The mismatch between "the abstract" from the question objects and "some concrete" from the context objects helps indicate that the question is unanswerable.

Example 2
Context: Even though some proofs of complexity-theoretic theorems regularly assume some concrete choice of input encoding, one tries to keep the discussion abstract enough to be independent of the choice of encoding. This can be achieved by ensuring that different representations can be transformed into each other efficiently. Question: What is the abstract choice typically assumed by most complexity-theoretic theorems? Answer: None.

Figure 6: In each subplot, each row represents one object from our object extractor; for each object we highlight the top 5 tokens with the highest weights in the entire context and question. We show a window where the majority of the top 5 weights occur. For example, there are numerous objects in the context window that look at the key phrase "some concrete" in the top subplot; the two objects in the question look at the key phrase "the abstract" in the bottom subplot.
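The highlighting used in the figures of this section, where each object is summarized by its highest-weighted tokens, can be reproduced with a few lines of Python. This is a sketch only: the helper name is hypothetical and the attention weights below are made-up illustrative numbers, not values produced by the model.

```python
def top_tokens(tokens, weights, k=5):
    """Return the k highest-weighted (token, weight) pairs in document order."""
    ranked = sorted(range(len(tokens)), key=lambda i: weights[i], reverse=True)[:k]
    return [(tokens[i], weights[i]) for i in sorted(ranked)]

question = ["what", "earthquake", "caused", "20", "million", "in", "damage"]
# Illustrative weights for one question object (not model output).
weights = [0.05, 0.12, 0.03, 0.25, 0.40, 0.10, 0.05]
print([token for token, _ in top_tokens(question, weights, k=3)])
# ['earthquake', '20', 'million']
```

Returning the selected tokens in document order keeps contiguous phrases such as "20 million" readable when several neighbouring tokens are selected.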

Conclusion
In this work, we propose a new relation module that can be applied to any MRC reader to help increase the prediction accuracy on non-answerable questions. We extract high-level semantics with multi-head self-attentive pooling. The semantic object pairs are fed into the relation network, which makes a guided decision as to whether a question is answerable. In addition, we augment the context vector with plausible answers, allowing us to extract objects focused on the proposed answer span and to differentiate them from other objects that are not as relevant in the context. Our results on the SQuAD 2.0 dataset show improvements from the relation module on both the BiDAF and BERT models. These results support the effectiveness of our relation module.
For future work, we plan to generalize the relation module to other aspects of question answering, including span prediction and multi-hop reasoning.

Figure 1: An example of a non-answerable question in SQuAD 2.0. Highlighted words are the output from the BERT base model. The true answer is "None".

Figure 2: Relation Module on BERT. S and E are hidden states trained on plausible answers. We concatenate S and E with the contextual representation and feed the result into the object extractor. The extracted objects are then fed into a Relation Network, whose output is used for the non-answerable (NA) prediction.

Figure 3: Illustration of a Relation Network. $g_\theta$ is an MLP that scores relationships between pairs

Table 1: Model performance on the SQuAD 2.0 development set, averaged over three random seeds.

Table 2: SQuAD 2.0 leaderboard numbers for the BERT-base models. Our model shows improvement over the public BERT-base models on the official evaluation.

Table 3: Prediction accuracies on answerable and non-answerable questions on the development set.

Table 4: Ablation study on our Relation Module. We experiment with just having plausible answers, just having the relation network, and different numbers of heads for the objects extracted by the relation module. Each of these values is averaged over three random seeds.