Look at the First Sentence: Position Bias in Question Answering

Many extractive question answering models are trained to predict the start and end positions of answers. Predicting answers as positions is popular mainly for its simplicity and effectiveness. In this study, we hypothesize that when the distribution of answer positions is highly skewed in the training set (e.g., answers lie only in the k-th sentence of each passage), QA models predicting answers as positions learn spurious positional cues and fail to give answers in different positions. We first illustrate this position bias in popular extractive QA models such as BiDAF and BERT and thoroughly examine how position bias propagates through each layer of BERT. To safely deliver position information without position bias, we train models with various de-biasing methods, including entropy regularization and bias ensembling. Among them, we find that using the prior distribution of answer positions as a bias model is very effective at reducing position bias, recovering the performance of BERT from 35.24% to 81.17% when trained on a biased SQuAD dataset.


Introduction
Question answering (QA) is the task of answering questions given a passage. Large-scale QA datasets have attracted many researchers to build effective QA models, and with the advent of deep learning, recent QA models outperform humans on some datasets (Rajpurkar et al., 2016; Yang et al., 2019). Extractive QA assumes that answers always lie in the passage. Based on this assumption, various QA models are trained to predict the start and end positions of the answers. Following the structure of earlier deep learning-based QA models (Wang and Jiang, 2016; Seo et al., 2017; Xiong et al., 2017), recent QA models predict positions of answers without much consideration (Yu et al., 2018; Yang et al., 2019). The popularity of predicting answer positions is credited to the fact that it reduces the prediction space to O(n), where n is the length of the input document. It is more efficient and effective than directly generating answers from a large vocabulary space. Furthermore, it reduces the QA task to a classification task, which is convenient to model. Nevertheless, very few studies have discussed the side effects of predicting answer positions. Could there be any unwanted biases when using answer positions as prediction targets?
In this paper, we demonstrate that models predicting positions can be severely biased when trained on datasets that have a very skewed answer position distribution. We define this as position bias, as shown in Figure 1. Models trained on a biased dataset where answers always lie in the same sentence position mostly give predictions in that position, regardless of where the true answers lie. To examine the cause of the problem, we thoroughly analyze the learning process of QA models trained on the biased training sets, especially focusing on BERT. Our analysis shows that the hidden representations of BERT preserve a different amount of word information depending on the word position when trained on the biased training set. The predictions of biased models also become more dependent on the first few words as the input passes through each layer.
To tackle the problem, we test various options, ranging from relative position encodings (Yang et al., 2019) to ensemble-based de-biasing methods (Clark et al., 2019;He et al., 2019). While simple baselines motivated by our analysis improve the test performance, our ensemble-based de-biasing method largely improves the performance of most models. Specifically, we use the prior distribution of answer positions as an additional bias model and train models to learn reasoning ability beyond the positional cues.
The contributions of our paper are threefold. First, we define position bias in extractive question answering and illustrate that common extractive QA models suffer from it. Second, we examine the reason for the failure of the biased models and show that positions can act as spurious biases. Third, we show that the prior distribution of answer positions helps us build positionally de-biased models, recovering the performance of BERT from 35.24% to 81.17%. We also generalize our findings to different positions and datasets. Our code will be publicly available.

Analysis
We first demonstrate the presence of position bias using synthetically created biased datasets. We sample examples from SQuAD (Rajpurkar et al., 2016) based on the positions of answers. We further visualize the hidden representations of the biased model.

Position Bias on Synthetic Datasets
From the original training set D_train, we subsample a biased training set D_train^{k} whose answers all lie in the k-th sentence. We conduct experiments on SQuAD (D = SQuAD), as most examples in SQuAD are answerable with a single sentence (Min et al., 2018). Our analysis mainly focuses on SQuAD_train^{k=1} (i.e., all answers are in the first sentence), which has the largest proportion of samples among all sentence positions in SQuAD (28,263 out of 87,599). The proportion in the development set (SQuAD_dev) is similar, with 3,617 out of 10,570 answers in the first sentence. Note that while our analysis is based on SQuAD_train^{k=1}, we also test various sentence positions in our main experiments (Section 4.2). We experiment with three popular QA models that predict positions as answers: BiDAF (Seo et al., 2017), BERT (Devlin et al., 2019), and XLNet (Yang et al., 2019). All three models are trained on SQuAD_train^{k=1} and evaluated on SQuAD_dev. For a fair comparison, we also randomly sample examples from the original training set to make SQuAD_train (Sampled), which has the same number of examples as SQuAD_train^{k=1}. Table 1 shows the performance of the three models trained on SQuAD_train^{k=1}. The performance of the recurrent model (BiDAF) and the self-attentive models (BERT, XLNet) drops significantly compared to models trained on SQuAD_train or SQuAD_train (Sampled). On average, F1 scores drop by 48.26% across the three models, which shows the position bias of existing QA models. The relative position encodings in XLNet mitigate position bias to some extent, but its performance still degrades significantly.
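As an illustration, constructing such a biased subset can be sketched as follows. This is a hypothetical sketch assuming SQuAD-style examples with character-level `answer_start` offsets and a naive period-based sentence splitter, not the paper's actual pre-processing code:

```python
def sentence_index(context, answer_start, sep=". "):
    """Return the 0-based index of the sentence containing answer_start.
    Uses a naive period-based splitter for illustration only."""
    boundary = 0
    for idx, sent in enumerate(context.split(sep)):
        boundary += len(sent) + len(sep)
        if answer_start < boundary:
            return idx
    return idx  # fall back to the last sentence

def subsample_biased(examples, k):
    """Keep only examples whose answer lies in the k-th (1-indexed) sentence."""
    return [ex for ex in examples
            if sentence_index(ex["context"], ex["answer_start"]) == k - 1]

examples = [
    {"context": "Paris is in France. Berlin is in Germany.", "answer_start": 0},
    {"context": "Paris is in France. Berlin is in Germany.", "answer_start": 20},
]
first = subsample_biased(examples, k=1)  # only the first example survives
```

A real pipeline would use a proper sentence splitter (e.g., NLTK or spaCy), since period-splitting breaks on abbreviations.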
To better understand the cause of position bias, we additionally apply two pre-processing methods to SQuAD_train^{k=1}. First, we truncate each passage to its first sentence (SQuAD_train^{k=1} + First Sentence). In this case, most of the performance is recovered, which indicates that the learned distributions of answer positions are relative to the maximum sequence length rather than absolute. Shuffling the sentence order of SQuAD_train^{k=1} (SQuAD_train^{k=1} + Sentence Shuffle) also recovers most of the performance, showing that the spread of answers matters. However, these pre-processing methods cannot be a solution to position bias, as models cannot learn proper multi-sentence reasoning from a corrupted context. Also, more fine-grained biases (e.g., word-level positions) could cause the problem again.
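The sentence-shuffle pre-processing can be sketched similarly. This is again a hypothetical sketch: it assumes a naive period-based splitter and that the answer string occurs exactly once in the passage:

```python
import random

def shuffle_sentences(context, answer_text, rng=random, sep=". "):
    """Shuffle the sentence order of a passage and recompute the answer
    start offset. Assumes the answer string occurs exactly once and a
    naive period-based splitter (for illustration only)."""
    sents = context.split(sep)
    rng.shuffle(sents)
    shuffled = sep.join(sents)
    return shuffled, shuffled.find(answer_text)

ctx = "The cat sat. The dog ran. The fox hid"
shuffled, start = shuffle_sentences(ctx, "dog", random.Random(0))
```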

Visualization of Position Bias
To visualize how position bias propagates throughout the layers, we compare three BERT models: one trained on SQuAD_train^{k=1}, one trained on SQuAD_train, and one without any fine-tuning. The uncased version of BERT-base is used for the analysis. Figure 2 (a) shows the amount of word information preserved in the hidden representations at the last layer of BERT. For each word position, we define the amount of word information as the cosine similarity between the word embedding and its hidden representation at each layer. The similarities are averaged over the passage-side hidden representations in SQuAD_dev. We observe that BERT trained on SQuAD_train^{k=1} (FIRST) has higher similarities at the front of the passages compared with BERT trained on SQuAD_train (ORIG). Also, in the biased model, similarities become smaller after the first few tokens, whereas the other models show relatively flat distributions over different word positions. Note that the large variation after word position 300 is due to the small number of samples at those positions. Figure 2 (b) shows the Spearman's rank correlation coefficient between the final output logits and the amount of word information at each layer, defined by the cosine similarity. A higher correlation means that the model relies more on the word information kept in that layer. The correlation coefficient is much higher in the biased model (FIRST), especially in the last few layers. Combined with the observation from Figure 2 (a), this indicates that the predictions of the biased model rely heavily on the information of the first few words.
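The word-information measure can be sketched as follows. This is a minimal pure-Python sketch; the actual analysis operates on BERT's embedding and hidden-state tensors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def word_information(embeddings, hidden_states):
    """Per-position word information: cosine similarity between each word's
    input embedding and its hidden representation at a given layer."""
    return [cosine(e, h) for e, h in zip(embeddings, hidden_states)]

# Toy check: identical vectors give similarity 1.0, orthogonal vectors 0.0.
emb = [[1.0, 0.0], [0.0, 1.0]]
hid = [[1.0, 0.0], [1.0, 0.0]]
info = word_information(emb, hid)
```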
To summarize, our analysis shows that BERT easily exploits spurious positional cues and loses its ability to reason in different word positions.

Method
Based on our observations, we test various de-biasing methods to tackle position bias. To prevent models from learning a direct correlation between word positions and answers, we first introduce simple baselines for BERT such as randomized positions and entropy regularization. Furthermore, we introduce bias ensemble methods with answer prior distributions to keep models from learning these easy shortcuts.

Baselines
Randomized Position To avoid learning a direct correlation between word positions and answers, we randomly perturb input positions. We first randomly sample t indices from the range 0 to the maximum sequence length of BERT; we use t = 384 with a maximum sequence length of 512. Then, we sort the indices to preserve the ordering of input words. However, sorting in ascending order could still bias the models to learn that low position indices are more suitable for answers in the case of SQuAD_train^{k=1}. Hence, we randomly choose between ascending and descending orders for each sample during training.
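A sketch of this baseline (hypothetical helper name; a real implementation would feed the sampled ids to BERT's position embeddings):

```python
import random

def randomized_position_ids(num_tokens, max_position, rng=random):
    """Sample num_tokens distinct position ids from [0, max_position),
    sort them to keep the word order, and randomly flip to descending
    order so the model cannot associate low ids with answers."""
    ids = sorted(rng.sample(range(max_position), num_tokens))
    if rng.random() < 0.5:
        ids.reverse()
    return ids

ids = randomized_position_ids(384, 512, random.Random(0))
```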
Entropy Regularization Inspired by the observation in Section 2.2, we force our model to preserve a constant amount of word information regardless of word position. Maximizing the entropy of the normalized cosine similarities between the word embeddings and their hidden representations encourages models to maintain a uniform amount of information. As the cosine similarities are not probabilities, we normalize them to sum to 1. We compute the entropy regularization term from the last layer and add it to the start/end prediction loss with a scaling factor λ.
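A sketch of the regularization term (assuming non-negative similarity scores for illustration; the returned value is the negative entropy, so adding it to the loss with factor λ maximizes entropy):

```python
import math

def entropy_regularizer(similarities):
    """Normalize per-position cosine similarities into a distribution and
    return the negative entropy. Minimizing this term (i.e., maximizing
    entropy) pushes the model toward uniform word information."""
    total = sum(similarities)
    probs = [s / total for s in similarities]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return -entropy

uniform = entropy_regularizer([0.5, 0.5, 0.5, 0.5])   # maximal entropy
skewed = entropy_regularizer([0.9, 0.05, 0.03, 0.02])  # low entropy
```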

Bias Ensemble with Answer Prior
Bias ensemble methods (Clark et al., 2019; He et al., 2019) combine the logits from a pre-defined bias model with the logits of a target model to de-bias it. Ensembling encourages the target model to learn signals other than the bias logits. In our case, we define the prior distribution of answer positions as our bias model. Specifically, we introduce the sentence-level answer prior and the word-level answer prior.
Bias Ensemble Method Given a passage and question pair, a model has to find the optimal start and end positions of the answer in the passage, denoted y_s and y_e. Typically, the model outputs two probability distributions, p^s and p^e, for the start and end positions. As our method is applied in the same manner to both start and end predictions, we drop the superscript from p^s, p^e and the subscript from y_s, y_e whenever possible.
To ensemble the two logits from the bias model and the target model, we use a product of experts (Hinton, 2002). Using the product of experts, the probability at the i-th position is calculated as:

p̂_i = exp(log(p_i) + log(b_i)) / Σ_j exp(log(p_j) + log(b_j))    (1)

where log(p_i) is a logit from the target model and log(b_i) is a logit from the bias model. The ensembled probability p̂ is used for training. To dynamically choose the amount of bias for each sample, Clark et al. (2019) introduce a learned mixing ensemble with a trainable parameter. Probabilities in the training phase are now defined as:

p̂_i = exp(log(p_i) + g(X)·log(b_i)) / Σ_j exp(log(p_j) + g(X)·log(b_j))    (2)

where g is a single linear layer (see Appendix A for a detailed description of g). As models often learn to simply ignore the biases by driving g(X) to 0, Clark et al. (2019) suggest adding an entropy penalty term to the loss function. However, the entropy penalty did not make much difference in our case, as g(X) was already large enough. Note that we only use the bias logit log(b_i) during training; predictions are based solely on the prediction logit log(p_i) from the model. We use pre-calculated answer priors as our bias model. Using prior distributions in machine learning has a long history, such as using class frequencies in the class imbalance problem (Domingos, 1999; Japkowicz and Stephen, 2002; Zhou and Liu, 2006; Huang et al., 2016). In our case, the class prior corresponds to the prior distribution of answer positions.
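Equations 1 and 2 can be sketched as follows. This is a minimal pure-Python sketch; here g is passed as a fixed scalar rather than computed by a learned layer:

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax."""
    m = max(logits)
    z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - z for x in logits]

def bias_product(model_logits, bias_log_probs):
    """Equation 1 (product of experts): add the bias model's log-probability
    to the target model's logit, then renormalize."""
    return log_softmax([p + b for p, b in zip(model_logits, bias_log_probs)])

def learned_mixin(model_logits, bias_log_probs, g):
    """Equation 2 (learned-mixin): scale the bias log-probabilities by the
    coefficient g before the product of experts; g = 0 ignores the bias."""
    return log_softmax([p + g * b for p, b in zip(model_logits, bias_log_probs)])

bias = [math.log(0.7), math.log(0.2), math.log(0.1)]
poe = bias_product([0.0, 0.0, 0.0], bias)          # uniform model: bias dominates
mix = learned_mixin([0.0, 0.0, 0.0], bias, g=0.0)  # g = 0: bias is ignored
```

During training, the cross-entropy loss would be taken on the ensembled distribution; at test time only `model_logits` are used, matching the paper's setup.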
Word-level Answer Prior First, we consider the word-level answer prior. Given a training set of N examples with answers {y^{(1)}, y^{(2)}, ..., y^{(N)}}, we compute the word-level answer prior at position i over the training set. In this case, our bias logit at the i-th position is:

log(b_i) = log( (1/N) Σ_{n=1}^{N} 1[y^{(n)} = i] )

where 1[cond] is the indicator function. Bias logits for the end position prediction are calculated in a similar manner. Note that the word-level answer prior gives an identical bias logit distribution for every passage, while the distribution is more fine-grained than the sentence-level prior described in the next section.
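Computing the word-level answer prior can be sketched as follows (the smoothing constant is our own addition, to keep the logits finite at positions where no answer occurs):

```python
import math

def word_level_prior(answer_positions, max_len, smoothing=1e-8):
    """Empirical frequency of answer start positions over the training set.
    The bias logit at position i is the log of this frequency, smoothed so
    that unseen positions get a finite (very negative) logit."""
    counts = [0] * max_len
    for y in answer_positions:
        counts[y] += 1
    n = len(answer_positions)
    return [math.log(c / n + smoothing) for c in counts]

# Toy training set: answers at positions 0, 0, 1, 2 (none at position 3).
prior = word_level_prior([0, 0, 1, 2], max_len=4)
```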

Sentence-level Answer Prior
We also use the sentence-level answer prior, which dynamically changes depending on the sentence boundaries of each sample. First, we define the set of sentences {S_1^{(j)}, ..., S_L^{(j)}} for the j-th training passage, where L is the maximum number of sentences over all training passages. Then, the sentence-level answer prior at the i-th word position (for the start prediction) of the j-th sample is derived from the frequency of answers appearing in the l-th sentence:

log(b_i^{(j)}) = log( (1/N) Σ_{n=1}^{N} 1[y^{(n)} ∈ S_l^{(n)}] )  for i ∈ S_l^{(j)}

Note that as the sentence boundaries of each sample are different, the bias logits are defined per sample. Again, the bias logits for the end positions are calculated similarly.
It is straightforward to calculate the answer priors for any dataset. For instance, on D_train^{k=1}, we use the first-sentence indicator as the sentence-level answer prior, as all answers are in the first sentence. More formally, the sentence-level answer prior for D_train^{k=1} is 1 for l = 1 and 0 for l > 1:

b_i^{(j)} = 1[i ∈ S_1^{(j)}]

which is a special case of the sentence-level answer prior. For general datasets where the distributions of answer positions are less skewed, the answer priors are more softly distributed. See Appendix B for a better understanding of the answer priors. Both word-level and sentence-level answer priors are tested with two bias ensemble methods: the product of experts with bias (Bias Product, Equation 1) and the learned mixing of two logits (Learned-Mixin, Equation 2).
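Expanding the sentence-level prior into per-word bias logits can be sketched as follows (a hypothetical helper, assuming pre-computed per-sentence answer frequencies and per-sentence word counts):

```python
import math

def sentence_level_prior(sent_lengths, sent_freq, smoothing=1e-8):
    """Expand per-sentence answer frequencies into per-word bias logits.
    sent_lengths[l] is the number of words in the l-th sentence of this
    passage; sent_freq[l] is the fraction of training answers whose
    position falls in sentence l. Smoothing keeps log(0) finite."""
    logits = []
    for l, n_words in enumerate(sent_lengths):
        f = sent_freq[l] if l < len(sent_freq) else 0.0
        logits.extend([math.log(f + smoothing)] * n_words)
    return logits

# On a k=1 biased set the prior reduces to a first-sentence indicator:
# sentence 1 has frequency 1.0, every later sentence 0.0.
logits = sentence_level_prior([3, 4], sent_freq=[1.0, 0.0])
```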

Experiments
We first examine the effects of various de-biasing methods on three different QA models using both biased and full training sets. Our next experiments generalize our findings to different sentence positions and different datasets, namely NewsQA (Trischler et al., 2017) and NaturalQuestions (Kwiatkowski et al., 2019).

Effect of De-biasing Methods
We first train all three models (BiDAF, BERT, and XLNet) on SQuAD_train^{k=1} with our de-biasing methods and evaluate them on SQuAD_dev (the original development set), SQuAD_dev^{k=1}, and SQuAD_dev^{k=2,3,...}. Note that SQuAD_dev^{k=2,3,...} is the subset of SQuAD_dev whose answers do not appear in the first sentence but in other sentences. We also experiment with BERT trained on the full training set, SQuAD_train.
For all models, we use the same hyperparameters and training procedures as suggested in their original papers (Seo et al., 2017; Devlin et al., 2019; Yang et al., 2019), except for batch sizes and training epochs (see Appendix A). λ for the entropy regularization is set to 1. Most of our implementation is based on the PyTorch library.

Results with SQuAD_train^{k=1}

The results of applying various de-biasing methods to the three models trained on SQuAD_train^{k=1} are in Table 2. The performance of all models without any de-biasing method (denoted as 'None') is very low on SQuAD_dev^{k=2,3,...} but fairly high on SQuAD_dev^{k=1}. This means that their predictions are highly biased towards the first sentences. In the case of BERT, the F1 score on SQuAD_dev^{k=1} is 86.65%, while the F1 score on SQuAD_dev^{k=2,3,...} is merely 8.25%. Our simple baseline approaches for BERT improve performance by up to 33.99% F1 (Randomized Position), while the entropy regularization is not significantly effective.
Bias ensemble methods using answer priors consistently improve the performance of all models. The sentence-level answer prior works best, obtaining a significant gain with the Learned-Mixin method. We found that the coefficient g(X) in Equation 2 averages 7.96 during training for BERT + Learned-Mixin, which demonstrates the need for proper balancing between the two probabilities. The word-level answer prior does not seem to provide strong signals of position bias, as its distribution is much softer than that of the sentence-level answer prior.

Results with SQuAD_train
The results of training BERT with our de-biasing methods on the full training set SQuAD_train are at the bottom of Table 2.
Note that the answer prior is softer than the one used for SQuAD_train^{k=1}, as answers are now spread over all sentence positions. While exploiting the positional distribution of the training set could be more helpful when evaluating on a development set with a similar positional distribution, our method achieves a nontrivial improvement (+1.5% EM), showing that 1) our method works safely when the positional distribution does not change much and 2) position bias might be harmful to the generalization of QA models.
Visualization To investigate the effect of the de-biasing methods, we visualize the word information in each layer as done in Section 2.2. Figure 3 shows BERT trained on SQuAD_train^{k=1} ensembled with the sentence-level answer prior. Although the bias product method (PRODUCT) makes our model preserve more information after the first sentence compared to the model without any de-biasing method (NONE), it still exhibits position bias. The learned-mixin method (MIXIN), on the other hand, safely delivers the word information across different positions.

Generalizing to Different Positions
As the SQuAD training set has many answers in the first sentence, we mainly test our methods on SQuAD_train^{k=1}. However, does our method generalize to different sentence positions? To answer this question, we construct four SQuAD_train^{k} datasets based on the sentence positions of answers. Note that unlike SQuAD_train^{k=1}, the number of samples becomes smaller and the sentence boundaries become blurrier when k > 1, making the answer priors much softer. We train the three QA models on the different biased datasets and evaluate them on SQuAD_dev with and without de-biasing methods.
Results As shown in Table 3, all three models suffer from position bias in every sentence position, while the learned-mixin method (+Learned-Mixin) successfully resolves the bias. Due to the blurred sentence boundaries, position bias is less problematic when k is large. We observe a similar trend in BERT and XLNet, while a huge performance drop is observed in BiDAF even with a large k. Figure 4 visualizes the sentence-wise position biases. We train BERT, BERT + Bias Product, and BERT + Learned-Mixin on different subsets of the SQuAD training set (SQuAD_train^{k}) and evaluate them on every SQuAD_dev^{k}, whose answers lie only in the k-th sentence. The low performance off the diagonal represents the presence of position bias. The figure shows that the biased model fails to predict answers in different sentence positions (Figure 4 (a)), while our de-biased model achieves high performance regardless of the sentence position (Figure 4 (c)). Again, as the value of k increases, the boundary of the k-th sentence varies a lot across samples, which makes the visualization of sentence-wise bias difficult.

NewsQA and NaturalQuestions
We test the effect of de-biasing methods on datasets with different domains and different degrees of position bias. NewsQA (Trischler et al., 2017) is an extractive QA dataset that includes passages from CNN news articles.
NaturalQuestions (Kwiatkowski et al., 2019) is an open-domain QA dataset containing queries and passages collected from the Google search engine. We train BERT with the sentence-level answer prior to see whether our methodology generalizes to these datasets. For each dataset, we construct two sub-training sets: D_train^{k=1}, which contains the samples whose answers lie in the first sentence, and D_train^{k=2,3,...}, which contains the remaining samples. Models are trained on the original dataset and the two sub-training sets and evaluated on the original development set.

Implementation Details For a fair comparison, we fix the sizes of the two sub-training sets to 17,000 samples (NewsQA) and 40,000 samples (NaturalQuestions). More details on the pre-processing of NewsQA and NaturalQuestions are in Appendix A.

Results
Table 4 and Table 5 show the results of applying our methods. On both datasets, BERT trained on the biased subsets (k = 1 and k = 2, 3, ...) significantly suffers from position bias. Position bias is generally more problematic on the k = 1 subsets, while for NaturalQuestions, k = 2, 3, ... is also problematic. Our de-biasing methods prevent the performance drops in all cases without sacrificing performance on the full training set (k = All).

Related Work
Various question answering datasets have been introduced with diverse challenges, including reasoning over multiple sentences (Joshi et al., 2017), answering multi-hop questions (Yang et al., 2018), and more (Trischler et al., 2017; Welbl et al., 2018; Dua et al., 2019). The introduction of these datasets rapidly advanced the development of effective QA models (Wang and Jiang, 2016; Seo et al., 2017; Xiong et al., 2017; Yu et al., 2018; Yang et al., 2019), but most models predict answers as positions without much discussion of this choice. Our work builds on analyses of dataset biases in machine learning models and ways to tackle them. For instance, sentence classification models in natural language inference and argument reasoning comprehension suffer from word statistics bias (Poliak et al., 2018; Minervini and Riedel, 2018; Kang et al., 2018; Belinkov et al., 2019; Niven and Kao, 2019). In visual question answering, models often ignore visual information due to language prior bias (Agrawal et al., 2016; Goyal et al., 2017; Johnson et al., 2017; Agrawal et al., 2018). Several studies in QA have also found that QA models do not leverage the full information in a given passage (Min et al., 2018; Chen and Durrett, 2019; Min et al., 2019). Adversarial datasets have also been proposed to deal with this type of problem (Jia and Liang, 2017; Rajpurkar et al., 2018). In this study, we define position bias arising from the prediction structure of QA models and show that positionally biased models can ignore information in different positions.
Our proposed methods are based on the bias ensemble method (Clark et al., 2019; He et al., 2019). Ensembling with a bias model encourages the target model to solve tasks without converging to bias shortcuts. Clark et al. (2019) conducted de-biasing experiments on various tasks, including two QA tasks, using tf-idf and named entities as the bias models.
It is worth noting that several models incorporate the pointer network to predict answer positions in QA (Vinyals et al., 2015; Wang and Jiang, 2016). Also, instead of predicting positions, some models predict n-grams as answers (Lee et al., 2016), generate answers in a vocabulary space (Raffel et al., 2019), or use a generative model (Lewis and Fan, 2019). We expect that these approaches suffer less from position bias and leave the evaluation of position bias in these models as future work.

Conclusion
Most QA studies utilize the start and end positions of answers as training targets without much consideration. Our study shows that most QA models fail to generalize over different positions when trained on datasets whose answers lie in a specific position. We introduce several de-biasing methods to make models ignore the spurious positional cues and find that the sentence-level answer prior is very useful. Our findings also generalize to different positions and different datasets. One limitation of our approach is that our method and analysis are based on a single-paragraph setting, which should be extended to a multiple-paragraph setting to be more practically useful.

A Implementation Details
Details of the Learned-Mixin Method In Equation 2, we have to define the function g(X), which returns a scalar coefficient. We use the hidden representations before the softmax layer as inputs of g. g(X) then applies an affine transformation to the representations to obtain a scalar value. A softplus activation followed by max pooling is used to obtain a positive value. As BiDAF has separate hidden representations for the start and end logits, we define g(X) separately for the start and end representations.
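A minimal sketch of g(X), with `weight` and `bias` standing in for the learned affine parameters:

```python
import math

def softplus(x):
    """Softplus activation: maps any real number to a positive value."""
    return math.log1p(math.exp(x))

def g(hidden, weight, bias):
    """Sketch of the learned-mixin coefficient g(X): affine-project each
    position's hidden vector to a scalar, apply softplus, then max-pool
    over positions to obtain a single positive coefficient."""
    scores = [softplus(sum(w * h for w, h in zip(weight, vec)) + bias)
              for vec in hidden]
    return max(scores)

# Two positions with 2-dim hidden vectors; the result is always positive.
coeff = g([[1.0, -2.0], [0.5, 0.5]], weight=[1.0, 1.0], bias=0.0)
```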

Details of Training
For all experiments, we use uncased BERT-base and cased XLNet-base. We modify open-sourced PyTorch implementations of the models. BiDAF is trained with a batch size of 64 for 30 epochs, and BERT and XLNet are trained for 2 epochs with batch sizes of 12 and 10, respectively. The choice of hyperparameters mainly comes from the limitations of our computational resources and mostly follows the default settings used in the original works.
Pre-processing of NewsQA and NaturalQuestions For NewsQA, we truncate each paragraph so that the length of each context is less than 300 words. We eliminate training and development samples that become unanswerable due to the truncation. For NaturalQuestions, we use the pre-processed dataset provided by the MRQA shared task (Fisch et al., 2019). We choose the first occurring answer for training extractive QA models, which is a common approach in the weakly supervised setting (Joshi et al., 2017; Talmor and Berant, 2019).

B Examples of Answer Prior
To provide a better understanding of our methods, Figure B.1 shows examples of the answer priors, which are used as bias models. See Section 3 for details.

C Prediction Samples from Biased and De-Biased Models
As shown in the following prediction samples, the de-biased model can find answers beyond the first sentence.

Passage (excerpt): In 1957, just before the television network began its first color broadcasts, the ABC logo consisted of a tiny lowercase "abc" in the center of a large lowercase letter a, a design known as the "ABC Circle A".

Question: What town was actually granted to the Huguenots on arrival?
Answer: Manakin Town
Passage: (Sent. 1) In 1700 several hundred French Huguenots migrated from England to the colony of Virginia, where the English Crown had promised them land grants in Lower Norfolk County. (Sent. 2) When they arrived, colonial authorities offered them instead land 20 miles above the falls of the James River, at the abandoned Monacan village known as Manakin Town, now in Powhatan County.

Figure caption: Prediction samples from BERT trained on SQuAD_train^{k=1} without any de-biasing method (NONE), with the sentence-level prior bias product (PRODUCT), and with learned-mixin (MIXIN). MIXIN preserves consistent information compared with NONE and prevents the bias propagation.