Norman Sadeh


2022

A Tale of Two Regulatory Regimes: Creation and Analysis of a Bilingual Privacy Policy Corpus
Siddhant Arora | Henry Hosseini | Christine Utz | Vinayshekhar Bannihatti Kumar | Tristan Dhellemmes | Abhilasha Ravichander | Peter Story | Jasmine Mangat | Rex Chen | Martin Degeling | Thomas Norton | Thomas Hupperich | Shomir Wilson | Norman Sadeh
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Over the past decade, researchers have started to explore the use of NLP to develop tools aimed at helping the public, vendors, and regulators analyze disclosures made in privacy policies. With the introduction of new privacy regulations, the language of privacy policies is also evolving, and disclosures made by the same organization are not always the same in different languages, especially when used to communicate with users who fall under different jurisdictions. This work explores the use of language technologies to capture and analyze these differences at scale. We introduce an annotation scheme designed to capture the nuances of two new landmark privacy regulations, namely the EU’s GDPR and California’s CCPA/CPRA. We then introduce the first bilingual corpus of mobile app privacy policies consisting of 64 privacy policies in English (292K words) and 91 privacy policies in German (478K words), respectively with manual annotations for 8K and 19K fine-grained data practices. The annotations are used to develop computational methods that can automatically extract “disclosures” from privacy policies. Analysis of a subset of 59 “semi-parallel” policies reveals differences that can be attributed to different regulatory regimes, suggesting that systematic analysis of policies using automated language technologies is indeed a worthwhile endeavor.

2021

Breaking Down Walls of Text: How Can NLP Benefit Consumer Privacy?
Abhilasha Ravichander | Alan W Black | Thomas Norton | Shomir Wilson | Norman Sadeh
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Privacy plays a crucial role in preserving democratic ideals and personal autonomy. The dominant legal approach to privacy in many jurisdictions is the “Notice and Choice” paradigm, where privacy policies are the primary instrument used to convey information to users. However, privacy policies are long and complex documents that are difficult for users to read and comprehend. We discuss how language technologies can play an important role in addressing this information gap, reporting on initial progress towards helping three specific categories of stakeholders take advantage of digital privacy policies: consumers, enterprises, and regulators. Our goal is to provide a roadmap for the development and use of language technologies to empower users to reclaim control over their privacy, limit privacy harms, and rally research efforts from the community towards addressing an issue with large social impact. We highlight many remaining opportunities to develop language technologies that are more precise or nuanced in the way in which they use the text of privacy policies.

2019

Question Answering for Privacy Policies: Combining Computational and Legal Perspectives
Abhilasha Ravichander | Alan W Black | Shomir Wilson | Thomas Norton | Norman Sadeh
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Privacy policies are long and complex documents that are difficult for users to read and understand. Yet, they have legal effects on how user data can be collected, managed and used. Ideally, we would like to empower users to inform themselves about the issues that matter to them, and enable them to selectively explore these issues. We present PrivacyQA, a corpus consisting of 1750 questions about the privacy policies of mobile applications, and over 3500 expert annotations of relevant answers. We observe that a strong neural baseline underperforms human performance by almost 0.3 F1 on PrivacyQA, suggesting considerable room for improvement for future systems. Further, we use this dataset to categorically identify challenges to question answerability, with domain-general implications for any question answering system. The PrivacyQA corpus offers a challenging corpus for question answering, with genuine real world utility.

2018

Stress Test Evaluation for Natural Language Inference
Aakanksha Naik | Abhilasha Ravichander | Norman Sadeh | Carolyn Rose | Graham Neubig
Proceedings of the 27th International Conference on Computational Linguistics

Natural language inference (NLI) is the task of determining if a natural language hypothesis can be inferred from a given premise in a justifiable manner. NLI was proposed as a benchmark task for natural language understanding. Existing models perform well on standard datasets for NLI, achieving impressive results across different genres of text. However, the extent to which these models understand the semantic content of sentences is unclear. In this work, we propose an evaluation methodology consisting of automatically constructed “stress tests” that allow us to examine whether systems have the ability to make real inferential decisions. Our evaluation of six sentence-encoder models on these stress tests reveals strengths and weaknesses of these models with respect to challenging linguistic phenomena, and suggests important directions for future work in this area.

Supervised and Unsupervised Methods for Robust Separation of Section Titles and Prose Text in Web Documents
Abhijith Athreya Mysore Gopinath | Shomir Wilson | Norman Sadeh
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

The text in many web documents is organized into a hierarchy of section titles and corresponding prose content, a structure which provides potentially exploitable information on discourse structure and topicality. However, this organization is generally discarded during text collection, and collecting it is not straightforward: the same visual organization can be implemented in a myriad of different ways in the underlying HTML. To remedy this, we present a flexible system for automatically extracting the hierarchical section titles and prose organization of web documents irrespective of differences in HTML representation. This system uses features from syntax, semantics, discourse and markup to build two models which classify HTML text into section titles and prose text. When tested on three different domains of web text, our domain-independent system achieves an overall precision of 0.82 and a recall of 0.98. The domain-dependent variation produces very high precision (0.99) at the expense of recall (0.75). These results exhibit a robust level of accuracy suitable for enhancing question answering, information extraction, and summarization.

2017

Identifying the Provision of Choices in Privacy Policy Text
Kanthashree Mysore Sathyendra | Shomir Wilson | Florian Schaub | Sebastian Zimmeck | Norman Sadeh
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Websites’ and mobile apps’ privacy policies, written in natural language, tend to be long and difficult to understand. Information privacy revolves around the fundamental principle of notice and choice, namely the idea that users should be able to make informed decisions about what information about them can be collected and how it can be used. Internet users want control over their privacy, but their choices are often hidden in long and convoluted privacy policy texts. Moreover, little (if any) prior work has been done to detect the provision of choices in text. We address this challenge of enabling user choice by automatically identifying and extracting pertinent choice language in privacy policies. In particular, we present a two-stage architecture of classification models to identify opt-out choices in privacy policy text, labelling common varieties of choices with a mean F1 score of 0.735. Our techniques enable the creation of systems to help Internet users learn about their choices, thereby effectuating notice and choice and improving Internet privacy.

2016

The Creation and Analysis of a Website Privacy Policy Corpus
Shomir Wilson | Florian Schaub | Aswarth Abhilash Dara | Frederick Liu | Sushain Cherivirala | Pedro Giovanni Leon | Mads Schaarup Andersen | Sebastian Zimmeck | Kanthashree Mysore Sathyendra | N. Cameron Russell | Thomas B. Norton | Eduard Hovy | Joel Reidenberg | Norman Sadeh
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

Toward Abstractive Summarization Using Semantic Representations
Fei Liu | Jeffrey Flanigan | Sam Thomson | Norman Sadeh | Noah A. Smith
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

Unsupervised Alignment of Privacy Policies using Hidden Markov Models
Rohan Ramanath | Fei Liu | Norman Sadeh | Noah A. Smith
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

A Step Towards Usable Privacy Policy: Automatic Alignment of Privacy Statements
Fei Liu | Rohan Ramanath | Norman Sadeh | Noah A. Smith
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers