PolicyQA: A Reading Comprehension Dataset for Privacy Policies

Privacy policy documents are long and verbose. A question answering (QA) system can assist users in finding the information that is relevant and important to them. Prior studies in this domain frame the QA task as retrieving the most relevant text segment or a list of sentences from the policy document given a question. In contrast, we argue that providing users with a short text span from the policy document reduces the burden of searching for the target information in a lengthy text segment. In this paper, we present PolicyQA, a dataset that contains 25,017 reading comprehension style examples curated from an existing corpus of 115 website privacy policies. PolicyQA provides 714 human-annotated questions written for a wide range of privacy practices. We evaluate two existing neural QA models and perform rigorous analysis to reveal the advantages and challenges offered by PolicyQA.


Introduction
Security and privacy policy documents describe how an entity collects, maintains, uses, and shares users' information. Users need to read the privacy policies of the websites they visit or the mobile applications they use to learn about the privacy practices pertinent to them. However, prior work suggests that people do not read privacy policies because they are long and complicated (McDonald and Cranor, 2008) and confusing (Reidenberg et al., 2016). Hence, giving users access to a question answering system that searches for answers in long and verbose policy documents can help them better understand their rights.
In recent years, we have witnessed noteworthy progress in developing question answering (QA) systems, with a colossal effort to benchmark high-quality, large-scale datasets for a few application domains (e.g., Wikipedia, news articles). However, annotating large-scale QA datasets for domains such as security and privacy is challenging, as it requires expert annotators (e.g., law students). Due to the difficulty of annotating policy documents at scale, the only available QA dataset is PrivacyQA (Ravichander et al., 2019), covering the privacy policies of 35 mobile applications.

Table 1: Question-answer pairs that we collect from the OPP-115 (Wilson et al., 2016a) dataset. The evidence spans (highlighted in the original) are used to form the question-answer pairs.

Website: Amazon.com

Segment: Information You Give Us: We receive and store any information you enter on our Web site or give us in any other way. Click here to see ...
Question: How do you collect my information?
Answer: information you enter on our Web site

Segment: Promotional Offers: Sometimes we send offers to selected groups of Amazon.com customers on behalf of other businesses. When we do this, we do not give that business your name and address. If you do not want to receive such offers, ...
Question: Is my information shared with others?
Answer: we do not give that business your name and address
An essential characteristic of policy documents is that they are well structured, as they are written following guidelines set by policymakers. Besides, due to the homogeneous nature of different entities (e.g., Amazon, eBay), their privacy policies have a similar structure. Therefore, we can exploit the document structure (metadata) to form examples from existing corpora. In this paper, we present PolicyQA, a reading comprehension style question answering dataset with 25,017 question-answer examples curated from 115 website privacy policies. Unlike PrivacyQA (Ravichander et al., 2019), which focuses on extracting long text spans from policy documents, we argue that highlighting a shorter text span in the document helps users zoom into the policy and identify the target information quickly. To enable QA models to provide such short answers, PolicyQA provides examples with an average answer length of 13.5 words (in comparison, the PrivacyQA benchmark has examples with an average answer length of 139.6 words). We present a comparison between PrivacyQA and PolicyQA in Table 2.
In this work, we present two strong neural baseline models trained on PolicyQA and perform a thorough analysis to shed light on the advantages and challenges offered by the proposed dataset. The data and the implemented baseline models are publicly available.

PolicyQA
PolicyQA is built on OPP-115 (Wilson et al., 2016a), whose annotated text spans are labeled with privacy practice categories and their attributes. In the Appendix (in Table 9), we list all the attributes under the First Party Collection/Use category.
In total, OPP-115 contains 23,000 data practices, 128,000 practice attributes, and 103,000 annotated text spans. Each text span belongs to a policy segment, and OPP-115 provides its character-level start and end indices. We provide an example in Table 3. We use the annotated spans, corresponding policy segments, and the associated {Practice, Attribute, Value} triples to form PolicyQA examples. We exclude the spans with practices labeled as "Other" and the values labeled as "Unspecified". Next, we describe the question annotation process.
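To make this construction concrete, below is a minimal Python sketch of how one example can be formed from an annotated span. The helper and field names are hypothetical and simplify OPP-115's actual release format; the character-level indices follow the convention described above.

def make_example(segment_text, span_start, span_end, question):
    """Build one reading comprehension example from an annotated text span."""
    return {
        "context": segment_text,    # the policy segment
        "question": question,       # written for the {Practice, Attribute, Value} triple
        "answer_text": segment_text[span_start:span_end],
        "answer_start": span_start, # character-level start index
    }

segment = ("Information You Give Us: We receive and store any information "
           "you enter on our Web site or give us in any other way.")
answer = "information you enter on our Web site"
start = segment.index(answer)
example = make_example(segment, start, start + len(answer),
                       "How do you collect my information?")
print(example["answer_text"])  # information you enter on our Web site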
Question annotations. Two skilled annotators manually annotate the questions. During annotation, the annotators are provided with the triple {Practice, Attribute, Value} and the associated text span. For example, given the triple {First Party Collection/Use, Personal Information Type, Contact} and the associated text span "name, address, telephone number, email address", the annotators create questions such as (1) What type of contact information does the company collect? and (2) Will you use my contact information? For a specific triple, the process is repeated for 5-10 randomly chosen samples to form a list of questions. We then randomly assign a question from this list to the examples associated with the triple that were not chosen during the sampling process (see the sketch below). In total, we considered 258 unique triples and created 714 individual questions. In Table 4, we provide an example question for each practice category. We also compare the distribution of questions' trigram prefixes in PolicyQA (Figure 1a) with PrivacyQA (Figure 1b). It is important to note that PolicyQA questions are written in a generic fashion so that they apply to any text span associated with the same practice category. Therefore, PolicyQA questions are less diverse than PrivacyQA questions.
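The assignment step can be sketched as follows. This is an illustrative Python fragment; the variable names (examples_by_triple, questions_by_triple) are hypothetical.

import random

def assign_questions(examples_by_triple, questions_by_triple, seed=0):
    # For each {Practice, Attribute, Value} triple, attach a randomly chosen
    # annotator-written question to every example of that triple.
    rng = random.Random(seed)
    dataset = []
    for triple, examples in examples_by_triple.items():
        questions = questions_by_triple[triple]  # written by the annotators
        for example in examples:
            dataset.append({**example, "question": rng.choice(questions)})
    return dataset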
We split OPP-115 into 75/20/20 policies to form training, validation, and test examples, respectively. Table 5 summarizes the data statistics.

Experiment
In this section, we evaluate two neural question answering (QA) models on PolicyQA and present the findings from our analysis.
Baselines. PolicyQA frames the QA task as predicting the answer span present in the given policy segment. Hence, we consider two existing neural approaches from the literature as baselines for PolicyQA. The first model is BiDAF (Seo et al., 2017), which uses a bi-directional attention flow mechanism to extract evidence spans. The second baseline is based on BERT (Devlin et al., 2019), with two linear classifiers to predict the boundary of the evidence span, as suggested in the original work.
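As an illustration of the second baseline, the following is a minimal sketch of span-boundary prediction with BERT, using the Hugging Face transformers library for convenience; our actual implementation details may differ.

import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class BertSpanQA(nn.Module):
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        # One linear layer producing two logits per token: span start and end.
        self.qa_outputs = nn.Linear(self.bert.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        hidden = self.bert(input_ids, attention_mask=attention_mask,
                           token_type_ids=token_type_ids).last_hidden_state
        start_logits, end_logits = self.qa_outputs(hidden).split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
enc = tokenizer("Is my information shared with others?",
                "We do not give that business your name and address.",
                return_tensors="pt")
model = BertSpanQA()
start_logits, end_logits = model(**enc)
# At inference, pick the (start, end) pair that maximizes
# start_logit + end_logit, subject to start <= end and both indices
# falling inside the policy segment (not the question).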
Implementation. PolicyQA has a similar setting to SQuAD (Rajpurkar et al., 2016). Therefore, we pre-train the QA models, using their default settings, on the SQuAD dataset. In addition, we consider leveraging unlabeled privacy policies to fine-tune the models, as noted below.
• Fine-tuning. We train word embeddings using fastText (Bojanowski et al., 2017) on a corpus of 130,000 privacy policies (137M words) collected from apps on the Google Play Store. These word embeddings are used as fixed word representations in BiDAF while training on PolicyQA. Similarly, to adapt BERT to the privacy domain, we first fine-tune BERT using masked language modeling (Devlin et al., 2019) on the privacy policies and then train on PolicyQA (a sketch of both steps follows this list).
• No fine-tuning. In this setting, we use the publicly available fastText (Bojanowski et al., 2017) embeddings with BiDAF, and the BERT model is not fine-tuned on those privacy policies.
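A minimal sketch of the two domain-adaptation steps follows, assuming the unlabeled policies are stored one document per line in privacy_policies.txt. The file names and hyper-parameters are illustrative, and the fasttext and Hugging Face transformers packages stand in for whatever tooling is actually used.

import fasttext
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

# (1) Train fastText word vectors on the privacy-policy corpus; these serve
# as fixed word representations in BiDAF.
ft_model = fasttext.train_unsupervised("privacy_policies.txt",
                                       model="skipgram", dim=300)
ft_model.save_model("policy_vectors.bin")

# (2) Adapt BERT to the privacy domain with masked language modeling before
# training on PolicyQA.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="privacy_policies.txt",
                                block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)
trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="bert-privacy",
                                         num_train_epochs=1),
                  data_collator=collator, train_dataset=dataset)
trainer.train()
model.save_pretrained("bert-privacy")  # then fine-tune this checkpoint on PolicyQA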
We adopt the default model architecture and optimization setup for the baseline methods. We detail the hyper-parameters in the Appendix (in Table 10).

Results and Analysis
The experimental results are presented in Table 6. Overall, the BERT-base methods outperform the BiDAF models by 6.1% and 7.6% in terms of EM and F1 score (on the test split), respectively.
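For reference, EM and F1 follow the standard SQuAD-style definitions: exact match checks whether the normalized prediction equals the normalized gold answer, and F1 measures token overlap. A simplified Python sketch is shown below; the official SQuAD script additionally strips punctuation and articles, which we omit here for brevity.

from collections import Counter

def normalize(text):
    # Lowercase and collapse whitespace.
    return " ".join(text.lower().split())

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_score("your name and address",
               "we do not give that business your name and address"))  # ~0.57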
Impact of fine-tuning. Table 6 demonstrates that the fine-tuning step improves downstream task performance. For example, BERT-base performance improves by 0.5% EM and 1.0% F1 on the test split. This result encourages training/fine-tuning BERT on a larger collection of security and privacy documents.
Impact of SQuAD pre-training. Given a small number of training examples, it is challenging to train deep neural models. Hence, we pre-train the extractive QA models on SQuAD (Rajpurkar et al., 2016) and then fine-tune them on PolicyQA. The additional pre-training step improves performance. For example, in the no fine-tuning setting, BiDAF and BERT-base improve by 1.5% and 0.6% F1 score, respectively (on the test split).
Impact of model size. We experiment with different-sized BERT models (Turc et al., 2019), and the results in Table 7 show that performance improves with increased model size. The results also indicate that PolicyQA is a challenging dataset, and hence a larger model performs better.
"Type" "Purpose" "How"  Figure 2: BERT-base model's performance on (a) the three most frequent attributes of "First Party Collection/Use" and "Third Party Sharing/Collection" practice categories, and (b) questions with different answer lengths.
Analysis. We break down the test performance of the BERT-base method to examine model performance across practice categories. The results are presented in Table 8. The model performs comparably on the three most frequent categories (comprising 89.5% of the total examples).
We further analyze performance on questions associated with (1) the three most frequent attributes of the two most frequent practice categories, and (2) different answer lengths. The results are presented in Figures 2a and 2b. Our findings are that (1) shorter evidence spans (e.g., evidence spans for Personal Information Type questions) are easier to extract than longer spans, and (2) SQuAD pre-training helps more in extracting shorter evidence spans. Leveraging diverse extractive QA resources may reduce this length bias and boost QA performance on privacy policies.

Related Work
The Usable Privacy Project (Sadeh et al., 2013) has made several attempts to automate the analysis of privacy policies (Wilson et al., 2016a; Zimmeck et al., 2019). Noteworthy works include the identification of policy segments commenting on specific data practices (Wilson et al., 2016b), the extraction of opt-out choices and their provisions in policy text (Sathyendra et al., 2016; Mysore Sathyendra et al., 2017), and others (Bhatia and Breaux, 2015; Bhatia et al., 2016). The most closely related work is PrivacyQA (Ravichander et al., 2019), which provides questions and models to answer them with a list of sentences. In comparison to these prior QA approaches, we encourage the development of QA systems capable of providing precise answers by using PolicyQA.

Conclusion
This work proposes PolicyQA, a reading comprehension style question answering (QA) dataset. PolicyQA can contribute to the development of QA systems in the security and privacy domain that have a sizeable real-world impact. We evaluate two strong neural baseline methods on PolicyQA and provide a thorough ablation analysis to reveal important considerations that affect answer span prediction. In future work, we want to explore how transfer learning can benefit question answering in the security and privacy domain.