Notes on Teaching Ethics in NLP

The following notes were compiled from form submissions after the ACL Tutorial on Integrating Ethics in the NLP Curriculum conducted on July 5, 2020 as part of the virtual ACL 2020 workshop. In the first section, we provide some overarching conclusions; subsequent sections provide quoted and paraphrased text from our contributors with ideas of case studies, exercise formats, and reflections about setting up these exercises.

Contributors: Noëmi Aepli, Nader Akoury, Patrick Alba, Dan Bareket, Steven Bedrick, Emily M. Bender, Su Lin Blodgett, Jamie Brandon, Chris Brew, Trista Cao, Daniel Dahlmeier, Vidas Daudaravicius, Brent Davis, Chad DeChant, Marcia Derr, Guy Emerson, Hannah Eyre, Anjalie Field, Paige Finkelstein, Erick Fonseca, Vasundhara Gautam, Dimitra Gkatzia, Sharon Goldwater, Vivek Gupta, Samar Haider, Maartje ter Hoeve, Dirk Hovy, Aya Iwamoto, Amani Jamal, Micaela Kaplan, Suma Kasa, Katherine Keith, Haley Lepp, Angela Lin, Diane Litman, Yang Liu, Amnon Lotenberg, Emma Manning, Anna Marbut, Krystal Maughan, Sabrina Mielke, Sneha N, Silvia Necsulescu, Isar Nejadgholi, Vlad Niculae, Franziska Pannach, Ted Pedersen, Ben Peters, Gabriela Ramirez, Agata Savary, Tatjana Scheffler, Alexandra Schofield, Nora AlTwairesh, Nandan Thakur, Jannis Vamvas, Esaú Villatoro, Rich Wicentowski, Bock, Cissi, Manchego, Sonali, Tommaso, YeeMan

General Takeaways

These notes attempt to summarize themes and conclusions that arose in discussion repeatedly. While these themes and ideas arose frequently enough in discussion to merit recording, they do not necessarily represent the views of all of the participants above.

Participants reflecting on the communities of learners they were interested in all described them as having little or no common background in discussing research ethics, whether around human subjects research or broader research impact. There was strong agreement that relying on students to come in with the vocabulary to discuss these issues is unlikely to lead to a rich conversation, so participants highlighted the importance of providing advance access to resources that help shape a core vocabulary. This can include classic resources like the Nuremberg Code and the Belmont Report, as well as more recent readings in NLP (see Ethics in NLP for a thorough list of works and syllabi that may help).
Real-world case studies received much more attention in our discussion than toy problems, mirroring a general interest in talking about actual scenarios where language technology can cause (or has caused) harm. Using these problems as the basis for a short discussion presents a challenge, however, as in a computational setting it may be easy to abstract complex societal problems into classification tasks and metrics. It is therefore crucial to provide resources that reflect meaningfully on the social realities of these problems and that embody the principle of “nothing about us without us” (popularized by disability rights activists Michael Masutha, William Rowland, and later James Charlton) -- for example, case studies related to disability should include readings about disability and about the disabled community the technology would support, and those readings should represent the voices of those affected.
Since many courses or instructional settings related to NLP are embedded within a computer science or data science curriculum, both learners and instructors may be accustomed to rubrics with well-defined “correctness” of a solution. Many of the exercises below present a discussion, investigation, or reflection with inherent subjectivity and unanswerable questions. Learners and instructors will benefit from a clear rubric, shared before the exercise begins, of what meaningful participation in such an assessment looks like, so that they feel confident in how to engage with it.
Depending on background or personal experience with a topic, learners may vary in their comfort with expressing their views or experiences in a group setting and may desire a way to communicate that is more private than a group discussion, whether to ask questions on a social topic with which they are less familiar or to communicate about their own experiences related to discrimination without scrutiny from peers. One can help these students by providing clear information about how to access those channels and, if possible, by providing resources before they are solicited to help learn more or to get support.

Dual Use

Case Studies

A fellow student suggests a group project topic they want to explore: gendered language in the LGBTQ community. They are very engaged in the community themselves and have access to data. Their plan is to write a text classification tool that distinguishes LGBTQ from heterosexual language.
Depression classifier: advertisement targeting for people with depression. The classification result could be put to various uses, from positive ones (helping depressed people) to advertising, higher insurance premiums, or discrimination in job applications. Classification mistakes (false positives) could have further negative effects.
Consider technologies that support therapeutic interventions, e.g. USC ICT's ELITE Lite program to train counselors. Find scenarios where the tool may be more or less problematic.
Imagine a lightweight app meant to predict where someone is from using audio to determine accent. Think about different ways such an app could be used/what harm it might provoke.
Consider different perspectives on a dialogue system to assist truck drivers. E.g. the regulator: interests aligned with the state, e.g. automotive regulator who has to approve an interface / the company executive: profits / the consumer / the developer. Truckers complain about being overworked / possible aim of employer: rather than giving less work, contrive a technological solution that organises their attention, so that even if they're exhausted, it's safe for them to be driving on the road
Big language models (e.g. GPT-3) have pros and cons. They improve lots of NLP tasks, many of which we will just naively assume are good, and they generate a lot of fake data so you don’t have to touch real data. However, they can also ease the creation of large bodies of synthetic text that can be harmful in the form of fake news, bot armies, etc. Given the pros of any technology (like Big LMs above), pretend you’re an evil supervillain that wants to wreak havoc---how would you weaponize this technology?
Make-It and Break-It as a group debate: one group works on accurately flagging disinformation topics, the other on circumventing a disinformation classifier. Discuss what it means to do each task: are there negative aspects to disinformation flagging, e.g. censorship? What about the boundaries of our technology, e.g. sarcasm falsely flagged as misinformation? What about cultural aspects and differing cultural norms, e.g. is homeopathy information or misinformation?

Assessment Options

Class discussion of a case study. Grade participation based on creativity in thinking of the dual uses and on depth of discussion of positive and negative consequences. Reaching the 'analyze' level of Bloom's taxonomy is usually a good outcome. If students are able to explain or distinguish why some technologies behave as they do, it means they have understood the components of the systems and can visualize the possible outcomes. Would need: articles on dual use
Standard group discussion with slides and report for a small toy classifier from "artificial" data (maybe some less fraught social characteristic) and see what can actually be discovered this way. Would need: Some standard dataset, some programming skill with major classification framework e.g. NLTK.
Short paper on the topic, possibly with guiding questions. Would need: access to an existing system or classifier. Transcript of exploration of system characteristics.
Have students write an obligatory section with an ethical discussion in reports for their own project. Ground it in the three kinds of context from before: the sociological, philosophical, and technical perspectives. Would need: large project as part of the course, literature review and prior discussion of case studies in dual use.
Ask students to post one new thought on a class online forum on a topic and respond to a fellow student's ideas. That way, participation is “mandated”, and there is an understanding of what participation a student should have. Additionally, the online format allows people a certain amount of distance from their peers, which may make it easier to express their own opinions or ideas. Would need: online forum, introduction module for online discussion/understanding of dual use, case study or article as basis for discussion
Explore data subjectivity. E.g. show the complexity of labeling depression in a text corpus, e.g. by exploring the data, self-labeling some data, doing inter-annotator agreement study with the class. Would need: dataset, preparation on data labeling, metrics, and agreement
Exercise on taking perspectives on dual use using an essay or short play. Consider the effect on a specific person or specific groups, and take different perspectives (e.g. you are a false positive case who gets excluded from insurance vs. the insurance company). In groups: assign roles for students to take in making a decision, or assign secondary work to take a stance and argue why the tool should/should not be used. Would need: background on specific tool and how it would be used, possibly time to prepare with others
Assign groups to focus on the benefits of a mental health app, assign groups to focus on disadvantages, and then encourage discussion or even debate between groups. This would free students from feeling like a certain view is socially acceptable. The debate could be the assessment, where other students could provide feedback on how clearly points were made and how well prepared they were. Setup: The debate assessment would assume a class size of 10-25. You could do this as a webinar on zoom given virtual instruction. Or student groups could pre-record their arguments and other students could watch those arguments and then discuss via small groups and break out discussion and have a group representative report back in a plenary session.
Tertiary assignment for students to respond to their peers' position statements, taking the opposite side from the one they took in the second assignment. The goal is to see that students are capable of arguing for both sides. Adds on to an existing reflection or writing assignment.
Students discuss the pros and cons of releasing a personal attribute predictor app to the public, and of the app in general. Points for discussion:
- Who can access the app.
- Consent from where the data is collected.
- How valid is the data, if the labeling is done by professionals?
- Does the development process include people from NLP and mental health professionals?
- If the data is obtained from social media data
- Datasheets (See Gebru et al, 2018) and transparency of data collection and model development process
- If the data can really be anonymized
- How is the test data obtained? Is it from the same population or different population?
- How to evaluate the effectiveness of the model in the outside world
- Humans tend to over diagnose certain populations - black and indigenous students are diagnosed at much higher rates. Would an app replicate this kind of bias?
- How would such an app be used? Could a potential employer mine your tweets and thereby decide not to hire you? If it were more personal, could privacy be guaranteed?
- Should the apps be mandated reporters in the case where depression is detected?
Have teams do error analysis of either existing apps or other teams’ apps to audit for dual use. Perform error analysis, collect data for why things should/shouldn’t be used. Can conduct as peer double-blind review of papers (Practice presenting to unknown audience/reviewer)
Students compose a set of guidelines which could be given to a team of software engineers working on the development of a depression “detection” app or other possible project susceptible to dual use focused on risk vs reward, functionality, framing, and other deployment obligations
In classes of around 30 people, the standard procedure applies: form small groups (3 people), make a plan, have fun, then present to the class. This would really be a short exercise: 10 minutes for discussion and writing, 1 minute for each presentation. Could go through multiple technologies to “fill an hour” and hopefully find tasks of varying “difficulty” and diversity in answers! The same thing would work in industry (considerations: big LMs relevant to speech recognition, ~30 people, interactive, everything needs to fit into an hour-long session). If there are more than a few groups, they should not all work on the same example: cover a few! After presentations, every student writes up a short (300 words) defense of building or not building this technology. Bonus points for writing about mitigation of harm. For industry, essays won’t really work, so instead do this more informally. Ask people with a show of hands whether yes, they would use the technology or no, they wouldn’t. If they say yes, ask them to informally say why. Assess to a lesser degree based on what cons were found and the quality of the presentation, and to a larger extent based on the final write-up, graded as you would any other essay (i.e., did it engage with the cons brought up in class and does the conclusion follow from the arguments presented).
Industry setting. A team of 4-5 people. Format: go over the ethical implications of the product at a pre-existing company-wide ethics discussion series. The group comes up with arguments for both sides (positive and negative impacts of the product). At least 2-3 arguments on each side. Send the exercise out beforehand to give people time to familiarize themselves with the question. Ask each person to think about the topic by themselves, then work together. At the end, present a list of aspects of the question, and ask them whether their view on any of them has changed and how; Ask them what they think it means for their product, whether they can think of ways to change the product, whether they should make the product at all. Would need: both general terminology and specific domain knowledge for the use case, including community perspectives.
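For the data-subjectivity exercise in the list above, one common way to summarize an inter-annotator agreement study between two annotators is Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The following is a minimal sketch only: the function name `cohens_kappa` and the toy labels are hypothetical and do not come from the tutorial, and a class with more than two annotators would need a different statistic.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label for every item.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy annotations, invented for illustration (not real data).
annotator_1 = ["dep", "dep", "none", "none", "dep", "none", "none", "dep"]
annotator_2 = ["dep", "none", "none", "none", "dep", "none", "dep", "dep"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

In class, students could compare the kappa they obtain on their own labels with their intuitions about how "obvious" the labels felt, and discuss which items caused disagreement.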

Reflections

What is the point of a "gaydar" classifier? Is the "good" use the betterment of linguistic science? It’s easy to see examples of "bad" uses, harder to see what the good is.
This is a hard topic, and may be better fit for the end of a course to help grapple with it.
Topic requires background reading.
How can we have students look at both sides of an issue, whether or not they actually agree with both sides of an issue?
Classroom facilitation is important. While classes in person lend themselves well to discussion formats, online classes are not as good for engaging conversation. Additionally, it can be hard to have students open up and feel comfortable and safe. If students come from many backgrounds, they may also have different ideas of what is appropriate, or of how to participate appropriately in a group setting or in the presence of a teacher or professor. Another option is a post-discussion questionnaire for all participants (this makes it possible to get "minimal" information from students / course participants who do not engage in discussion/presentation).
Not everyone has the same exposure to thinking about "the sciences" as a place for critical thinking. While the US system puts a lot of emphasis on learning to think critically in college, not every culture does, and many people think of the sciences as being a place where if an expert writes a peer reviewed paper that gets published somewhere, you should trust that. Additionally, you can't ask students to believe in things they don't believe in, so how can you "make students care" in ways that might be contrary to their personal beliefs?
We thought that some sort of “introduction to ethics/thinking critically” lesson for the whole class would be a good way to introduce it to everyone at once, and to make sure that everyone is on the same page and has access to the same language, regardless of background. Additionally, we thought about the possibility of asking students not just to post their ideas, but to think about, write about, or submit a "pro/con" list, or one potentially positive outcome and one potentially negative outcome. That way, we take personal opinions out of the equation, which may help encourage looking at (or learning to look at) both sides of an issue, even if it might not be something you would naturally do for a given application.
Allowing students to engage with each other, either in person or virtually, is key to helping them learn to engage with the ideas of ethics.
It's really hard to grade someone on a subjective comment -- there's not a right or a wrong answer in ethics necessarily. An assessment that focuses on participation and demonstrating that they're thinking about it (e.g. post x number of times, make a pro/con list with x number of things) helps to take your personal ideas as the teacher out of the discussion. Reinforcing that you're not grading subjectively is crucial here.
Fundamentally, we need to connect this back to a value system. As computer scientists, we're not trained in this at all.
Set ground rules and expectations - say that we don't have all the answers. As a student, I would appreciate people admitting that they don't know what they don't know. (But could be frustrating!)
CS students are more optimistic about the performance of the models; however, they sometimes do not follow any ethical code, as they are not trained to do so.
For group discussions/mock reviews: graduate students are more involved in research institutions and reviewing processes, while undergrads may not be. It’s important for students to reflect on their personal stances and what role they might have in a decision like this, and that role is different for undergraduates and graduates (Keeping in mind that graduate students and undergraduate students may not know where they want to go or what their final role will be). In a classroom with both undergrad and grad students, it might be beneficial to have blended groups. This type of assignment might be applicable to corporate workshops also.

Privacy and Anonymity

Case Studies

Consider a Naive Bayes classifier trained on a subset of 20 Newsgroups. How would you test and build for anonymization? Even after de-identification there may be proxies, dual issues with anonymization, and different vocabularies across regions and communities that can be an issue in anonymizing data.
Design a small search engine around an inverted index that uses random integer noise from a two-sided geometric distribution (Ghosh et al., 2012) to shape which queries are retrieved. Analyze how much this changes the search results with different noise levels. Are there systematic changes?
Social media data is used for a lot of studies. However, many users of social media do not understand this possible set of use cases when they participate online. How would one establish an appropriate protocol for informed consent in these scenarios?
Differentially private polling for a binary question (noise added through coin flips, as in randomized response). Implement the model from the description and see how good an estimate you can recover, given a file of “gold” answers. Visualize and discuss the tradeoff. Then, a new dataset comes in where we have two groups with very different priors on the answer. If the groups are uneven in size, how does that affect the method?
Build an application for identifying Pokémon. Then challenge students to think about how this technology, with different data, could be used for applications that would reveal private data.
Students propose a way to anonymize data from Twitter so it can be used in NLP without being traceable to specific Twitter users. They need to think about different levels/definitions of anonymity (e.g. tracing a tweet to a specific user out of a given set, or identifying the country a user’s from).
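One possible reading of the coin-flip polling case study above is the classic randomized response scheme: each respondent flips a fair coin, answers truthfully on heads, and on tails reports the outcome of a second fair coin. The reported "yes" rate is then 0.5 * p + 0.25 for a true rate p, so p can be estimated as 2 * (observed - 0.25). The sketch below assumes that scheme; the function names, the 30% "gold" rate, and the sample size are hypothetical choices for illustration, not part of the original exercise.

```python
import random


def randomized_response(truth, rng):
    """Report the true answer with prob. 1/2, else a fair coin flip."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5


def estimate_yes_rate(reports):
    """Invert the noise: E[reported yes rate] = 0.5 * p + 0.25."""
    observed = sum(reports) / len(reports)
    return 2 * (observed - 0.25)


rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical "gold" answers: 30% yes, 70% no (invented for illustration).
gold = [True] * 3000 + [False] * 7000
reports = [randomized_response(answer, rng) for answer in gold]
estimate = estimate_yes_rate(reports)
print(f"true rate 0.30, estimated rate {estimate:.3f}")
```

Under this scheme a "yes" is reported with probability 0.75 by a true "yes" and 0.25 by a true "no", a ratio of 3, which corresponds to a privacy parameter of ln 3. Students can rerun the sketch with smaller samples, or with a small subgroup whose true rate differs sharply from the majority, and observe how the variance of the estimate grows for the smaller group.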

Assessment Options

Give a thought experiment or small exercise that doesn’t require a computer (e.g. to show how re-identification can happen even in anonymised data). Have students work together for 10 minutes to run through the exercise or thought experiment with each other, then discuss in small/large groups. Would require: toy dataset for short exercise.
Given a dataset, have students write a report where they propose which features are deanonymizing, how their presence or absence would affect the model, and who would be affected by the ability to look inside the model and work out which documents were in the training set. Could be done individually or in groups. Would need: statistics and programming background, dataset
In groups, make a game of looking for features that have an impact on identifying users: maybe give students a specific scenario, or maybe each group has to decide and justify what they think is an appropriate degree of privacy. Teams first have to come up with a system to privatize their data while still being able to perform certain tasks accurately. Other groups then try to identify users in each group’s “anonymized” dataset to adversarially test it.
Implement a particular differential privacy mechanism on a dataset and assess the accuracy of models trained with the data. Likely would be done in small groups over 1-2 weeks. Assessment will be based both on implementing the correct (i.e. to specification) model in the first part and finding and articulating problems that arise when imbalance and noise meet in the second part of the exercise. Would need: statistics and programming background, dataset
Short self-assessments about understanding of privacy: comparisons in answers before and after another exercise, presentation, or reading. Ask the students: How would you feel if your data was used or redistributed? What about someone you care about?
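For the implementation exercise above, one candidate mechanism is the two-sided geometric noise mentioned in the case studies: integer noise Z with P(Z = z) proportional to alpha^|z|, where alpha = exp(-epsilon), added to integer counts. The sketch below samples Z as the difference of two independent geometric variables. It assumes each individual changes exactly one count by at most 1; the function names and the toy counts are hypothetical, and this is an illustration rather than a vetted privacy implementation.

```python
import math
import random


def two_sided_geometric(epsilon, rng):
    """Sample integer noise with P(Z = z) proportional to exp(-epsilon * |z|)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    alpha = math.exp(-epsilon)

    def geometric():
        # Failures before the first success, success probability 1 - alpha.
        # Inverse-transform sampling with u drawn from (0, 1].
        u = 1.0 - rng.random()
        return math.floor(math.log(u) / math.log(alpha))

    # The difference of two i.i.d. geometric draws is two-sided geometric.
    return geometric() - geometric()


def noisy_counts(counts, epsilon, rng):
    """Add independent two-sided geometric noise to every count."""
    return {key: c + two_sided_geometric(epsilon, rng) for key, c in counts.items()}


rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical per-term document counts (invented for illustration).
term_counts = {"privacy": 12, "consent": 7, "twitter": 3}
for eps in (0.1, 1.0, 5.0):
    print(eps, noisy_counts(term_counts, eps, rng))
```

Smaller values of epsilon give stronger privacy and larger noise, so rare items such as the smallest count here are distorted proportionally more. That is one way into the second part of the exercise, where imbalance and noise meet.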

Reflections

Two core skills to assess: Understanding of the risks of not protecting privacy, and technical understanding of techniques to improve privacy, as well as adversarial techniques. The latter is more implementation of code, while the former may need more discussion/presentation of external resources.
Implementing anonymization or privacy on a full real-world dataset takes a lot of time.
This topic is somewhat connected with bias. When talking about factors that reveal race and gender, for example, geography can be a good predictor of race (in the U.S.). Some of these things don’t occur to students, so some preparation is needed on social conditions that they have not previously been made aware of.
Students see a separation between the technical skills and the ethics. See ethics as extraneous.
Many students (younger generation) say “but if you put it on the internet you should’ve known that it will be used/shared”
When do we need anonymization? / pseudonymization? We cannot predict the development of technology, so it's hard to perfect how to anonymize data.
It’s hard to get consent for future uses; people don’t understand what their data could be used for in the future
One interesting focus of conversation was whether there is a moral obligation to anonymize data *before* using it for training at all, treating non-anonymized data as completely off the table even in cases where there is a suspicion that anonymization might decrease model performance in some ways, and how to engage students in thinking about that possible "trade-off".
There are cultural and national differences in approaches to the ethics of privacy -- how could you come at it from a different perspective to justify including it in the curriculum?
In industry, how are privacy concerns actually handled? In a workplace setting, who are the domain experts here? E.g. there may be offices or high-level resources related to an ethical conversation, and low-level policies in place for applying/ensuring ethics. Management may not want to add steps for those procedures because they might make things take more time or be more difficult. This is especially hard when working on such a small component of the system that it's not clear what the ethical implications could be.
The importance of engaging professors in other domains-- law, health, other domains where the application could be used, as well as people with expertise in ethics in these domains and stakeholders who will be impacted.

Bias

To be added.
