Question Answering for Privacy Policies: Combining Computational and Legal Perspectives

Privacy policies are long and complex documents that are difficult for users to read and understand. Yet, they have legal effects on how user data can be collected, managed and used. Ideally, we would like to empower users to inform themselves about the issues that matter to them, and enable them to selectively explore these issues. We present PrivacyQA, a corpus consisting of 1750 questions about the privacy policies of mobile applications, and over 3500 expert annotations of relevant answers. We observe that a strong neural baseline underperforms human performance by almost 0.3 F1 on PrivacyQA, suggesting considerable room for improvement for future systems. Further, we use this dataset to categorically identify challenges to question answerability, with domain-general implications for any question answering system. The PrivacyQA corpus offers a challenging corpus for question answering, with genuine real world utility.


Introduction
Privacy policies are the documents which disclose the ways in which a company gathers, uses, shares and manages a user's data. As legal documents, they function using the principle of notice and choice (Federal Trade Commission, 1998), where companies post their policies, and theoretically, users read the policies and decide to use a company's products or services only if they find the conditions outlined in its privacy policy acceptable. Many legal jurisdictions around the world accept this framework, including the United States and the European Union (Patrick, 1980; OECD, 2004). However, the legitimacy of this framework depends upon users actually reading and understanding privacy policies to determine whether company practices are acceptable to them (Reidenberg et al., 2015). In practice this is seldom the case (Cate, 2010; Cranor, 2012; Gluck et al., 2016; Jain et al., 2016; US Federal Trade Commission et al., 2012; McDonald and Cranor, 2008). This is further complicated by the highly individual and nuanced compromises that users are willing to make with their data (Leon et al., 2015), discouraging a 'one-size-fits-all' approach to notice of data practices in privacy documents.
With devices constantly monitoring our environment, including our personal space and our bodies, lack of awareness of how our data is being used easily leads to problematic situations where users are outraged by information misuse, but companies insist that users have consented. The discovery of increasingly egregious uses of data by companies, such as the scandals involving Facebook and Cambridge Analytica (Cadwalladr and Graham-Harrison, 2018), has further brought public attention to the privacy concerns of the internet and ubiquitous computing. This makes privacy a well-motivated application domain for NLP researchers, where advances in enabling users to quickly identify the privacy issues most salient to them can potentially have large real-world impact.
Motivated by this need, we contribute PRIVACYQA, a corpus consisting of 1750 questions about the contents of privacy policies, paired with over 3500 expert annotations. The goal of this effort is to kickstart the development of question-answering methods for this domain, to address the (unrealistic) expectation that a large population should be reading many policies per day. In doing so, we identify several understudied challenges to our ability to answer these questions, with broad implications for systems seeking to serve users' information-seeking intent. By releasing this resource, we hope to provide an impetus to develop systems capable of language understanding in this increasingly important domain.

Related Work
Prior work has aimed to make privacy policies easier to understand. Prescriptive approaches towards communicating privacy information (Kelley et al., 2009; Micheti et al., 2010; Cranor, 2003) have not been widely adopted by industry. Recently, significant research effort has been devoted to understanding privacy policies by leveraging NLP techniques (Oltramari et al., 2017; Mysore Sathyendra et al., 2017), especially by identifying specific data practices within a privacy policy. We adopt a personalized approach to understanding privacy policies, one that allows users to query a document and selectively explore content salient to them. Most similar is the PolisisQA corpus (Harkous et al., 2018), which examines questions users ask corporations on Twitter. Our approach differs in several ways: 1) The PRIVACYQA dataset is larger, containing 10x as many questions and answers. 2) Answers are formulated by domain experts with legal training. 3) PRIVACYQA includes diverse question types, including unanswerable and subjective questions.
Our work is also related to reading comprehension in the open domain, which is frequently based upon Wikipedia passages (Rajpurkar et al., 2016, 2018; Joshi et al., 2017; Choi et al., 2018) and news articles (Trischler et al., 2017; Hermann et al., 2015; Onishi et al., 2016). Table 1 presents the desirable attributes our dataset shares with past approaches. This work is also tied into research in applying NLP approaches to legal documents (Monroy et al., 2009; Quaresma and Rodrigues, 2005; Do et al., 2017; Kim et al., 2015; Mollá and Vicedo, 2007; Frank et al., 2007). While privacy policies have legal implications, their intended audience consists of the general public rather than individuals with legal expertise. This arrangement is problematic because the entities that write privacy policies often have different goals than the audience. Feng et al. (2015) and Tan et al. (2016) examine question answering in the insurance domain, another specialized domain similar to privacy, where the intended audience is the general public.

Data Collection
We describe the data collection methodology used to construct PRIVACYQA. With the goal of achieving broad coverage across application types, we collect privacy policies from 35 mobile applications representing a number of different categories in the Google Play Store. One of our goals is to include both policies from well-known applications, which are likely to have carefully-constructed privacy policies, and lesser-known applications with smaller install bases, whose policies might be considerably less sophisticated. Thus, setting 5 million installs as a threshold, we ensure each category includes applications with installs on both sides of this threshold. All policies included in the corpus are in English, and were collected before April 1, 2018, predating many companies' GDPR-focused (Voigt and Von dem Bussche, 2017) updates. We leave it to future studies (Gallé et al., 2019) to look at the impact of the GDPR (e.g., to what extent GDPR requirements contribute to making it possible to provide users with more informative answers, and to what extent their disclosures continue to omit issues that matter to users). The final application categories represented in the corpus consist of books, business, education, entertainment, lifestyle, music, health, news, personalization, photography, productivity, tools, travel and game applications.

Crowdsourced Question Elicitation
The intended audience for privacy policies consists of the general public. This informs the decision to elicit questions from crowdworkers on the contents of privacy policies. We choose not to show the contents of privacy policies to crowdworkers, a procedure motivated by a desire to avoid inadvertent biases (Weissenborn et al., 2017; Kaushik and Lipton, 2018; Poliak et al., 2018; Gururangan et al., 2018; Naik et al., 2018), and to encourage crowdworkers to ask a variety of questions beyond only asking questions based on practices described in the document.
Instead, crowdworkers are presented with public information about a mobile application available on the Google Play Store, including its name, description and navigable screenshots. Figure 2 shows an example of our user interface. Crowdworkers are asked to imagine they have access to a trusted third-party privacy assistant, to whom they can ask any privacy question about a given mobile application. We use the Amazon Mechanical Turk platform and recruit crowdworkers who have been conferred "master" status and are located within the United States of America. Turkers are asked to provide five questions per mobile application, and are paid $2 per assignment, which takes approximately eight minutes to complete.

Answer Selection
To identify legally sound answers, we recruit seven experts with legal training to construct answers to Turker questions. Experts identify relevant evidence within the privacy policy, as well as provide meta-annotation on the question's relevance, subjectivity, OPP-115 category, and how likely any privacy policy is to contain the answer to the question asked.

Analysis
Table 4 presents aggregate statistics of the PRIVACYQA dataset. 1750 questions are posed to our imaginary privacy assistant over 35 mobile applications and their associated privacy documents. As an initial step, we formulate the problem of answering user questions as an extractive sentence selection task, ignoring for now background knowledge, statistical data and legal expertise that could otherwise be brought to bear. The dataset is partitioned into a training set featuring 27 mobile applications and 1350 questions, and a test set consisting of 400 questions over 8 policy documents. This ensures that documents in training and test splits are mutually exclusive. Every question is answered by at least one expert. In addition, in order to estimate annotation reliability and provide for better evaluation, every question in the test set is answered by at least two additional experts. Table 2 describes the distribution over first words of questions posed by crowdworkers. We also observe low redundancy in the questions posed by crowdworkers over each policy, with each policy receiving approximately 49.94 unique questions on average, even though crowdworkers pose questions independently. Questions are on average 8.4 words long. As declining to answer a question can be a legally sound response but is seldom practically useful, answers to questions where a minority of experts abstain from answering are filtered from the dataset. Privacy policies are approximately 3000 words long on average. The answers to the questions asked by users typically have approximately 100 words of evidence in the privacy policy document.
... as 'Other' if at least one annotator has identified the 'Other' category to be relevant. If neither of these conditions is satisfied, we label the question as having no agreement. The distribution of questions in the corpus across OPP-115 categories is shown in Table 3. First party and third party related questions are the largest categories, forming nearly 66.4% of all questions asked to the privacy assistant.
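The document-level partition described in this section can be illustrated with a short sketch. It assumes each question is a dictionary with a hypothetical `policy` field, and it picks the held-out policies at random; the paper does not state how the 8 test policies were chosen, so the random selection and the function name `split_by_policy` are illustrative assumptions only.

```python
import random
from typing import Dict, List, Tuple

Question = Dict[str, str]


def split_by_policy(
    questions: List[Question], n_test_policies: int, seed: int = 0
) -> Tuple[List[Question], List[Question]]:
    """Split questions so that no policy appears in both train and test."""
    # Sort first so that the shuffle is reproducible for a given seed.
    policies = sorted({q["policy"] for q in questions})
    rng = random.Random(seed)
    rng.shuffle(policies)
    # Assumption: held-out policies are chosen uniformly at random.
    test_policies = set(policies[:n_test_policies])
    train = [q for q in questions if q["policy"] not in test_policies]
    test = [q for q in questions if q["policy"] in test_policies]
    return train, test
```

Splitting on the policy rather than on individual questions is what keeps the test documents unseen at training time.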

Answer Validation
When do experts disagree? We would like to analyze the reasons for potential disagreement on the annotation task, to ensure disagreements arise due to valid differences in opinion rather than lack of adequate specification in annotation guidelines. It is important to note that the annotators are experts rather than crowdworkers. Accordingly, their judgements can be considered valid, legally-informed opinions even when their perspectives differ. For the sake of this question we randomly sample 100 instances in the test data and analyze them for likely reasons for disagreements. We consider a disagreement to have occurred when more than one expert does not agree with the majority consensus. By disagreement we mean there is no overlap between the text identified as relevant by one expert and another.
We find that the annotators agree on the answer for 74% of the questions, even if the supporting evidence they identify is not identical (i.e., does not fully overlap). They disagree on the remaining 26%. Sources of apparent disagreement correspond to situations when different experts:
• have differing interpretations of question intent (11%). For example, when a user asks 'who can contact me through the app', the question admits multiple interpretations, including seeking information about the features of the app, asking about first party collection/use of data or asking about third party collection/use of data.
• identify different sources of evidence for questions that ask if a practice is performed or not (4%).
• have differing interpretations of policy content (3%).
• identify a partial answer to a question in the privacy policy (2%). For example, when the user asks 'who is allowed to use the app', a majority of our annotators decline to answer, but the remaining annotators highlight partial evidence in the privacy policy which states that children under the age of 13 are not allowed to use the app.
• disagree for other legitimate reasons (6%), which include personal subjective views of the annotators. For example, when the user asks 'is my DNA information used in any way other than what is specified', some experts consider the boilerplate text of the privacy policy, which states that it abides by the practices described in the policy document, as sufficient evidence to answer this question, whereas others do not.

Experimental Setup
We evaluate the ability of machine learning methods to identify relevant evidence for questions in the privacy domain. 13 We establish baselines for the subtask of deciding on the answerability (§4.1) of a question, as well as the overall task of identifying evidence for questions from policies (§4.2). We describe aspects of the question that can render it unanswerable within the privacy domain (§5.2).

Answerability Identification Baselines
We define answerability identification as a binary classification task, evaluating model ability to predict if a question can be answered, given a question in isolation. This can serve as a prior for downstream question-answering. We describe three baselines on the answerability task, and find they considerably improve performance over a majority-class baseline.
SVM: We define 3 sets of features to characterize each question. The first is a simple bag-of-words set of features over the question (SVM-BOW), the second is bag-of-words features of the question as well as length of the question in words (SVM-BOW + LEN), and lastly we extract bag-of-words features, length of the question in words as well as part-of-speech tags for the question (SVM-BOW + LEN + POS). This results in vectors of 200, 201 and 228 dimensions respectively, which are provided to an SVM with a linear kernel.
13 The task of evidence identification can serve as a first step for future question answering systems, which can further learn to form abstractive summaries when required, based on identifying relevant evidence.
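As a rough illustration of the SVM-BOW + LEN feature set, the sketch below builds a fixed-size bag-of-words vector and appends the question length in words. The tokenizer, the choice of the 200 most frequent training words as the vocabulary, and the function names are assumptions made for this example; the text only states the resulting dimensionalities. Part-of-speech features and the SVM itself are left out to keep the sketch dependency-free.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(questions: List[str], size: int = 200) -> List[str]:
    """Assumption: keep the `size` most frequent words in the training questions."""
    counts = Counter(token for q in questions for token in tokenize(q))
    return [word for word, _ in counts.most_common(size)]


def featurize(question: str, vocabulary: List[str], with_length: bool = True) -> List[float]:
    """Bag-of-words counts over the vocabulary, optionally followed by length in words."""
    tokens = tokenize(question)
    counts = Counter(tokens)
    vector = [float(counts[word]) for word in vocabulary]
    if with_length:
        vector.append(float(len(tokens)))  # the extra LEN dimension
    return vector
```

With a 200-word vocabulary, `featurize` yields 200 dimensions without the length feature and 201 with it, matching the first two sizes reported above.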

CNN: We utilize a CNN neural encoder for answerability prediction. We use GloVe word embeddings (Pennington et al., 2014), and a filter size of 5 with 64 filters to encode questions.
BERT: BERT is a bidirectional transformer-based language model (Vaswani et al., 2017). We fine-tune BERT-base on our binary answerability identification task with a learning rate of 2e-5 for 3 epochs, with a maximum sequence length of 128.

Privacy Question Answering
Our goal is to identify evidence within a privacy policy for questions asked by a user. This is framed as an answer sentence selection task, where models identify a set of evidence sentences from all candidate sentences in each policy.

Evaluation Metric
Our evaluation metric for answer-sentence selection is sentence-level F1, implemented similarly to Choi et al. (2018) and Rajpurkar et al. (2016). Precision and recall are computed by measuring the overlap between predicted sentences and sets of gold-reference sentences. We report the average of the maximum F1 from each n−1 subset, in relation to the heldout reference.
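One possible reading of this metric is sketched below. It represents each answer as a set of sentence indices, takes the maximum F1 of a prediction against a group of references, and averages that maximum over every subset of n−1 references. Two details are assumptions rather than something stated here: an empty prediction matched against an empty reference (both declining to answer) is scored as 1.0, and a question with a single reference is scored directly against it.

```python
from typing import List, Set


def sentence_f1(predicted: Set[int], gold: Set[int]) -> float:
    """F1 between two sets of sentence indices."""
    if not predicted and not gold:
        return 1.0  # assumption: agreeing that there is no answer counts as a match
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def max_f1(predicted: Set[int], references: List[Set[int]]) -> float:
    """Best F1 of the prediction against any single reference."""
    return max(sentence_f1(predicted, ref) for ref in references)


def leave_one_out_f1(predicted: Set[int], references: List[Set[int]]) -> float:
    """Average of the max F1 over each subset that holds out one reference."""
    if len(references) == 1:
        return max_f1(predicted, references)  # assumption: no subset to hold out
    scores = []
    for held_out in range(len(references)):
        subset = references[:held_out] + references[held_out + 1:]
        scores.append(max_f1(predicted, subset))
    return sum(scores) / len(scores)
```

Scoring system output against n−1 references at a time keeps it comparable to a human annotator, who can only be compared against the other n−1 annotators.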

Baselines
We describe baselines on this task, including a human performance baseline.
No-Answer Baseline (NA): Most of the questions we receive are difficult to answer in a legally-sound way on the basis of information present in the privacy policy. We establish a simple baseline to quantify the effect of identifying every question as unanswerable.
Word Count Baseline: To quantify the effect of using simple lexical matching to answer the questions, we retrieve the top candidate policy sentences for each question using a word count baseline (Yang et al., 2015), which counts the number of question words that also appear in a sentence. We include the top 2, 3 and 5 candidates as baselines.
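A minimal sketch of such a word count baseline follows. The tokenizer, the choice to count each distinct question word once, the absence of stopword filtering, and the tie-breaking by sentence position are all assumptions for illustration; the description above does not specify them.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_count_score(question: str, sentence: str) -> int:
    """Number of distinct question words that also appear in the sentence."""
    sentence_words = set(tokenize(sentence))
    return sum(1 for word in set(tokenize(question)) if word in sentence_words)


def top_k_sentences(question: str, sentences: List[str], k: int) -> List[int]:
    """Indices of the k highest-scoring sentences; ties go to the earlier sentence."""
    scores = [word_count_score(question, s) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))
    return ranked[:k]
```

Calling `top_k_sentences` with k set to 2, 3 or 5 gives the three variants of the baseline.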
BERT: We implement two BERT-based baselines for evidence identification. First, we train BERT on each query-policy sentence pair as a binary classification task to identify if the sentence is evidence for the question or not (BERT). We also experiment with a two-stage classifier, where we separately train the model on questions only to predict answerability. At inference time, if the answerability classifier predicts the question is answerable, the evidence identification classifier produces a set of candidate sentences (BERT + UNANSWERABLE).
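The control flow of the two-stage setup can be sketched independently of the underlying models. In the example below the two classifiers are passed in as plain callables, which stand in for the fine-tuned BERT models; the function name and the decision to return an empty evidence set when the question is predicted unanswerable are assumptions for illustration.

```python
from typing import Callable, List


def two_stage_select(
    question: str,
    sentences: List[str],
    is_answerable: Callable[[str], bool],
    is_evidence: Callable[[str, str], bool],
) -> List[int]:
    """Return indices of evidence sentences, gated by an answerability check."""
    # Stage 1: the answerability classifier sees the question only.
    if not is_answerable(question):
        return []  # assumption: predicted-unanswerable questions get no evidence
    # Stage 2: the evidence classifier scores each question-sentence pair.
    return [i for i, s in enumerate(sentences) if is_evidence(question, s)]
```

Keeping the gate separate from the evidence classifier means either stage can be swapped out or evaluated on its own.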
Human Performance: We pick each reference answer provided by an annotator, and compute the F1 with respect to the remaining references, as described in Section 4.2.1. Each reference answer is treated as the prediction, and the remaining n-1 answers are treated as the gold reference. The average of the maximum F1 across all reference answers is computed as the human baseline.

Results and Discussion
The results of the answerability baselines are presented in Table 5, and on answer sentence selection in Table 6. We observe that BERT exhibits the best performance on the binary answerability identification task. However, most baselines considerably exceed the performance of a majority-class baseline. This suggests that the question itself carries considerable information about its possible answerability within this domain. Table 6 describes the performance of our baselines on the answer sentence selection task.
The No-answer (NA) baseline performs at 28 F1, providing a lower bound on performance at this task. We observe that our best-performing baseline, BERT + UNANSWERABLE, achieves an F1 of 39.8. This suggests that BERT is capable of making some progress towards answering questions in this difficult domain, while still leaving considerable headroom for improvement to reach human performance. The performance of BERT + UNANSWERABLE suggests that incorporating information about answerability can help in this difficult domain. We examine this challenging phenomenon of unanswerability further in Section 5.2.

Error Analysis
Disagreements are analyzed based on the OPP-115 categories of each question (Table 7). We compare our best performing BERT variant against the NA model and human performance. We observe significant room for improvement across all categories of questions but especially for first party, third party and data retention categories.
We analyze the performance of our strongest BERT variant, to identify classes of errors and directions for future improvement (Table 8). We observe that a majority of answerability mistakes made by the BERT model are questions which are in fact answerable, but are identified as unanswerable by BERT. We observe that BERT makes 124 such mistakes on the test set. We collect judgments from our experts on relevance, subjectivity, silence, and how likely the question is to be answered from the privacy policy. We find that most of these mistakes are relevant questions. However, many of them were identified as subjective by the annotators, and at least one annotator marked 19 of these questions as having no answer within the privacy policy. However, only 6 of these questions were unexpected or do not usually have an answer in privacy policies. These findings suggest that a more nuanced understanding of answerability might help improve model performance in this challenging domain.

What makes Questions Unanswerable?
We further ask legal experts to identify potential causes of unanswerability of questions. This analysis has considerable implications. While past work (Rajpurkar et al., 2018) has treated unanswerable questions as homogeneous, a question answering system might wish to have different treatments for different categories of 'unanswerable' questions. The following factors were identified to play a role in unanswerability:
• Incomprehensibility: If a question is incomprehensible to the extent that its meaning is not intelligible.
• Relevance: Is this question in the scope of what could be answered by reading the privacy policy.
• Ill-formedness: Is this question ambiguous or vague. An ambiguous statement will typically contain expressions that can refer to multiple potential explanations, whereas a vague statement carries a concept with an unclear or soft definition.
• Silence: Other policies answer this type of question but this one does not.
• Atypicality: The question is of a nature such that it is unlikely for any privacy policy to have an answer to the question.
Our experts attempt to identify the different 'unanswerable' factors for all 573 such questions in the corpus. 4.18% of the questions were identified as being incomprehensible (for example, 'any difficulties to occupy the privacy assistant').
Amongst the comprehensible questions, 50% were identified as likely to have an answer within the privacy policy, 33.1% were identified as being privacy-related questions but not within the scope of a privacy policy (e.g., 'has Viber had any privacy breaches in the past?') and 16.9% of questions were identified as completely out-of-scope (e.g., 'will the app consume much space?'). In the questions identified as relevant, 32% were ill-formed questions that were phrased by the user in a manner considered vague or ambiguous. Of the questions that were both relevant as well as 'well-formed', 95.7% of the questions were not answered by the policy in question but it was reasonable to expect that a privacy policy would contain an answer. The remaining 4.3% were described as reasonable questions, but of a nature generally not discussed in privacy policies. This suggests that the answerability of questions over privacy policies is a complex issue, and future systems should consider each of these factors when serving users' information-seeking intent.
We examine a large-scale dataset of "natural" unanswerable questions (Kwiatkowski et al., 2019) based on real user search engine queries to identify if similar unanswerability factors exist. It is important to note that these questions have previously been filtered, according to criteria for bad questions defined as "(questions that are) ambiguous, incomprehensible, dependent on clear false presuppositions, opinion-seeking, or not clearly a request for factual information." Annotators made the decision based on the content of the question without viewing the equivalent Wikipedia page. We randomly sample 100 questions from the development set which were identified as unanswerable, and find that 20% of the questions are not questions (e.g., "all I want for christmas is you mariah carey tour"). 12% of questions are unlikely to ever contain an answer on Wikipedia, corresponding closely to our atypicality category. 3% of questions are unlikely to have an answer anywhere (e.g., 'what guides Santa home after he has delivered presents?'). 7% of questions are incomplete or open-ended (e.g., 'the south west wind blows across nigeria between'). 3% of questions have an unresolvable coreference (e.g., 'how do i get to Warsaw Missouri from here'). 4% of questions are vague, and a further 7% have unknown sources of error. 2% still contain false presuppositions (e.g., 'what is the only fruit that does not have seeds?') and the remaining 42% do not have an answer within the document. This reinforces our belief that, though they have been understudied in past work, any question answering system interacting with real users should expect to receive such unanticipated and unanswerable questions.

Conclusion
We present PRIVACYQA, the first significant corpus of privacy policy questions and more than 3500 expert annotations of relevant answers. The goal of this work is to promote question-answering research in the specialized privacy domain, where it can have large real-world impact. Strong neural baselines on PRIVACYQA achieve a performance of only 39.8 F1 on this corpus, indicating considerable room for future research. Further, we shed light on several important considerations that affect the answerability of questions. We hope this contribution leads to multidisciplinary efforts to precisely understand user intent and reconcile it with information in policy documents, from both the privacy and NLP communities.