Notes on Teaching Ethics in NLP
The following notes were compiled from discussions during the ACL Tutorial on Integrating Ethics in the NLP Curriculum (slides) conducted on July 5, 2020 as part of the virtual ACL 2020 conference. While the notes below do not necessarily give fully designed assignments, they can provide a starting point: ideas for ethical domains that could be useful for assignments, structures that could work for different classrooms approaching these subjects, and reflections from conversations that arose during the tutorial about what might be helpful or challenging in approaching ethical subjects in NLP. The tutorial organizers hope to further flesh out assignments using these notes at a later date.

In the first section, we provide some overarching conclusions; subsequent sections provide quoted and paraphrased text from our contributors with ideas for case studies, exercise formats, and reflections on setting up these exercises.
Contributors: Noëmi Aepli, Nader Akoury, Patrick Alba, Dan Bareket, Steven Bedrick, Emily M. Bender, Su Lin Blodgett, Jamie Brandon, Chris Brew, Trista Cao, Daniel Dahlmeier, Vidas Daudaravicius, Brent Davis, Chad DeChant, Marcia Derr, Guy Emerson, Hannah Eyre, Anjalie Field, Paige Finkelstein, Erick Fonseca, Vasundhara Gautam, Dimitra Gkatzia, Sharon Goldwater, Vivek Gupta, Samar Haider, Maartje ter Hoeve, Dirk Hovy, Aya Iwamoto, Amani Jamal, Micaela Kaplan, Suma Kasa, Katherine Keith, Haley Lepp, Angela Lin, Diane Litman, Yang Liu, Amnon Lotenberg, Emma Manning, Anna Marbut, Krystal Maughan, Sabrina Mielke, Sneha N, Silvia Necsulescu, Isar Nejadgholi, Vlad Niculae, Franziska Pannach, Ted Pedersen, Ben Peters, Gabriela Ramirez, Agata Savary, Tatjana Scheffler, Alexandra Schofield, Nora AlTwairesh, Nandan Thakur, Jannis Vamvas, Esaú Villatoro, Rich Wicentowski, Bock, Cissi, Manchego, Sonali, Tommaso, YeeMan
General Takeaways
These notes attempt to summarize themes and conclusions that arose in discussion repeatedly. While these themes and ideas arose frequently enough in discussion to merit recording, they do not necessarily represent the views of all of the participants above.
- Give students the time and resources to talk about ethical topics. Participants reflecting on the communities of learners they were interested in all described those learners as having little or no common background in discussing research ethics, whether around human subjects research or broader research impact. There was strong agreement that relying on students to come in with the vocabulary to discuss these issues is unlikely to lead to a rich conversation, so participants highlighted the importance of providing advance access to resources that help shape a core vocabulary. This can include classic resources like the Nuremberg Code and the Belmont Report, as well as more recent readings in NLP (see Ethics in NLP for a thorough list of works and syllabi that may help).
- Focus on real problems. Real-world case studies received much more attention in our discussion than toy problems, mirroring a general interest in talking about actual scenarios where language technology can cause (or has caused) harm. Using these problems as the basis for a short discussion presents a challenge, however, as in a computational setting it may be easy to abstract complex societal problems into classification tasks and metrics. It is therefore crucial to provide resources that reflect meaningfully on the social realities of these problems and that embody the principle of "nothing about us without us" (popularized by disability rights activists Michael Masutha, William Rowland, and later James Charlton) -- for example, case studies related to disability should include readings about disability and the disabled community the technology would support that represent the voices of those affected.
- Give clear assessment criteria. Since many courses or instructional settings related to NLP are embedded within a computer science or data science curriculum, both learners and instructors may be accustomed to rubrics with well-defined "correctness" of a solution. Many of the exercises below present a discussion, investigation, or reflection with inherent subjectivity and unanswerable questions. Learners and instructors will benefit from a clear rubric, provided in advance, of what meaningful participation in such an assessment looks like, so that they feel confident in how to engage with the exercise.
- Provide multiple paths to participation. Depending on background or personal experience with a topic, learners may vary in their comfort with expressing their views or experiences in a group setting and may desire a channel of communication that is more private than a group discussion, whether to ask questions about a social topic with which they are less familiar or to communicate about their own experiences related to discrimination without scrutiny from peers. One can help these students by providing such channels and clear information about how to access them and, if possible, by providing resources before they are solicited to help learn more or to get support.
- Emphasize that this is a core skill. Due to compressed syllabi, ethics is often relegated to one day or one assignment of a course, and changing from that paradigm takes time. However, for students to generalize from these lessons, it is important to emphasize that skills of critical inquiry into possible risks and harms of technology are not a side topic, but part of the work of doing NLP. This can be done through reference back to these exercises in other assignments or in class discussions, as well as reflections on your own work.
Dual Use
Case Studies
- A fellow student suggests a group project topic they want to explore: gendered language in the LGBTQ community. They are very engaged in the community themselves and have access to data. Their plan is to write a text classification tool that distinguishes LGBTQ from heterosexual language.
- Depression classifier: advertisement targeting for people with depression. Classification results could be put to a range of uses, from positive (helping people with depression) to negative (targeted advertising, higher insurance premiums, or discrimination in job applications). Classification mistakes (false positives) could have further negative effects.
- Consider technologies that support therapeutic interventions, e.g. USC ICT's ELITE Lite program to train counselors. Find scenarios where the tool may be more or less problematic.
- Imagine a lightweight app meant to predict where someone is from using audio to determine accent. Think about different ways such an app could be used/what harm it might provoke.
- Consider different perspectives on a dialogue system to assist truck drivers: the regulator, whose interests are aligned with the state (e.g. an automotive regulator who has to approve an interface); the company executive, focused on profits; the consumer; and the developer. Truckers complain about being overworked; a possible aim of the employer is, rather than giving less work, to contrive a technological solution that organises their attention, so that even if they're exhausted, it's safe for them to be driving on the road.
- Big language models (e.g. GPT-3) have pros and cons. They improve lots of NLP tasks, many of which we will just naively assume are good, and they generate a lot of fake data so you don’t have to touch real data. However, they can also ease the creation of large bodies of synthetic text that can be harmful in the form of fake news, bot armies, etc. Given the pros of any technology (like Big LMs above), pretend you’re an evil supervillain that wants to wreak havoc---how would you weaponize this technology?
- Make-It and Break-It as a group debate: one group accurately flagging disinformation topics, the other group circumventing a disinformation classifier. Discuss what it means to do each task: are there negative aspects to disinformation flagging, e.g. censorship? What about the boundaries of our technology, e.g. sarcasm falsely flagged as misinformation? Cultural aspects, different cultural norms, e.g. homeopathy information or misinformation?
Assessment Options
- Class discussion of a case study. Grade participation based on creativity in thinking of the dual uses and depth of discussion of positive and negative consequences. Reaching the 'analyze' level of Bloom's taxonomy is usually a good outcome. If students are able to explain or distinguish why some technologies behave as they do, it means they have understood the components of the systems and they can visualize the possible outcomes. Would need: articles on dual use
- Standard group discussion with slides and a report on a small toy classifier built from "artificial" data (maybe for some less fraught social characteristic), to see what can actually be discovered this way (the first sketch after this list shows one possible setup). Would need: some standard dataset, some programming skill with a major classification framework, e.g. NLTK.
- Short paper on the topic, possibly with guiding questions. Would need: access to an existing system or classifier. Transcript of exploration of system characteristics.
- Have them write an obligatory section with an ethical discussion in reports for their own project. Ground it in the three kinds of context from before: the sociological, philosophical, and technical perspectives. Would need: large project as part of the course, literature review and prior discussion of case studies in dual use.
- Ask students to post one new thought on a class online forum on a topic and respond to a fellow student's ideas. That way, participation is “mandated”, and there is an understanding of what participation a student should have. Additionally, the online format allows people a certain amount of distance from their peers, which may make it easier to express their own opinions or ideas. Would need: online forum, introduction module for online discussion/understanding of dual use, case study or article as basis for discussion
- Explore data subjectivity. E.g. show the complexity of labeling depression in a text corpus by exploring the data, self-labeling some data, and doing an inter-annotator agreement study with the class (the second sketch after this list shows one way to score agreement). Would need: dataset, preparation on data labeling, metrics, and agreement
- Exercise on taking perspectives on dual use using an essay or short play. Consider the effect on a specific person or specific groups, and take different perspectives (e.g. you are a false-positive case who gets excluded from insurance vs. the insurance company). In groups: assign roles for students to take in making a decision, or assign secondary work to take a stance and argue why the tool should/should not be used. Would need: background on the specific tool and how it would be used, possibly time to prepare with others
- Assign groups to focus on the benefits of a mental health app, assign groups to focus on disadvantages, and then encourage discussion or even debate between groups. This would free students from feeling like a certain view is socially acceptable. The debate could be the assessment, where other students provide feedback on how clearly points were made and how well prepared the groups were. Setup: the debate assessment would assume a class size of 10-25. Given virtual instruction, you could do this as a webinar on Zoom. Or student groups could pre-record their arguments, and other students could watch those arguments, discuss in small breakout groups, and have a group representative report back in a plenary session.
- Tertiary assignment for students to respond to their peers' position statements, taking the other side from the one they took in the second assignment. The goal is to see that students are capable of arguing for both sides. Adds on to an existing reflection or writing assignment.
- Have students discuss the pros and cons of releasing a personal attribute predictor app to the public, and the app in general. Points of discussion:
- Who can access the app.
- Consent from where the data is collected.
- How valid is the data, if the labeling is done by professionals?
- Does the development process include people from NLP and mental health professionals?
- If the data is obtained from social media
- Datasheets (see Gebru et al., 2018) and transparency of the data collection and model development process
- If the data can really be anonymized
- How is the test data obtained? Is it from the same population or different population?
- How to evaluate the effectiveness of the model in the outside world
- Humans tend to overdiagnose certain populations - Black and Indigenous students are diagnosed at much higher rates. Would an app replicate this kind of bias?
- How would such an app be used? Could a potential employer mine your tweets and thereby decide not to hire you? If it were more personal, could privacy be guaranteed?
- Should the apps be mandated reporters in the case where depression is detected?
- Have teams do error analysis of either existing apps or other teams' apps to audit for dual use. Perform error analysis and collect data for why things should/shouldn't be used. Can be conducted as a double-blind peer review of papers (practice presenting to an unknown audience/reviewer)
- Students compose a set of guidelines which could be given to a team of software engineers working on the development of a depression "detection" app or another possible project susceptible to dual use, focused on risk vs. reward, functionality, framing, and other deployment obligations
- In classes of around 30 people, standard procedure: form little groups (3 people), make a plan, have fun, then present to the class. This would really be a short exercise: 10 minutes for discussion and writing, 1 minute for each presentation. Could go through multiple technologies to "fill an hour" and hopefully find tasks of varying "difficulty" and diversity in answers! The same thing would work in industry (considerations: big LMs relevant to speech recognition, ~30 people, interactive, everything needs to fit into an hour-long session). If there are more than a few groups, they should not all work on the same example: cover a few! After presentations, every student writes up a short (300 words) defense of building or not building this technology. Bonus points when you write about mitigation of harm. For industry, essays won't really work, so instead do this more informally. Ask people with a show of hands whether yes, they would use the technology, or no, they wouldn't. If they say yes, ask them to informally say why. Assess to a lesser degree based on what cons were found and the quality of the presentation, and to a larger extent based on the final write-up, graded as you would any other essay (i.e., did it engage with the cons brought up in class and does the conclusion follow from the arguments presented).
- Industry setting. A team of 4-5 people. Format: go over the ethical implications of the product at a pre-existing company-wide ethics discussion series. The group comes up with arguments for both sides (positive and negative impacts of the product). At least 2-3 arguments on each side. Send the exercise out beforehand to give people time to familiarize themselves with the question. Ask each person to think about the topic by themselves, then work together. At the end, present a list of aspects of the question, and ask them whether their view on any of them has changed and how; Ask them what they think it means for their product, whether they can think of ways to change the product, whether they should make the product at all. Would need: both general terminology and specific domain knowledge for the use case, including community perspectives.
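For the toy-classifier exercise above, the following is a minimal sketch of what "artificial data plus a standard framework" could look like, assuming NLTK is installed; the texts, labels, and the deliberately low-stakes attribute (dog person vs. cat person) are invented for illustration. The point is not model quality but having students inspect which features the classifier actually relies on.

```python
# Minimal sketch: toy classifier on hand-built "artificial" data (assumes NLTK is installed).
# The attribute is deliberately low-stakes (dog person vs. cat person); students inspect
# which features the model latches onto before discussing higher-stakes labels.
import random
import nltk

def features(text):
    """Bag-of-words features for a short text."""
    return {word: True for word in text.lower().split()}

dog_texts = ["love long walks in the park", "my puppy chewed the couch again",
             "fetch is the best game ever", "big loyal goofy energy"]
cat_texts = ["naps in a sunbeam all afternoon", "knocked my mug off the desk",
             "purring on the keyboard while i work", "independent and quietly judging"]

labeled = [(features(t), "dog") for t in dog_texts] + [(features(t), "cat") for t in cat_texts]
random.seed(0)
random.shuffle(labeled)
train_set, test_set = labeled[:6], labeled[6:]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print("accuracy on held-out texts:", nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)   # what is the model really keying on?
print(classifier.classify(features("a nap in the sun sounds great")))
```

Discussion can then turn to what the most informative features reveal about the artificial data, and what the same inspection would mean if the label were a sensitive attribute such as sexual orientation or a mental health condition.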
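For the data-subjectivity exercise above, here is a small sketch of how class self-labeling could be scored for agreement; the items and annotator labels are invented, and Cohen's kappa is only one common choice of agreement metric.

```python
# Sketch: Cohen's kappa for two annotators labeling the same items
# (hypothetical labels; any categorical labeling scheme works the same way).
from collections import Counter

annotator_a = ["depressed", "not", "not", "depressed", "not", "depressed", "not", "not"]
annotator_b = ["depressed", "not", "depressed", "depressed", "not", "not", "not", "not"]

def cohen_kappa(a, b):
    """kappa = (p_o - p_e) / (1 - p_e): observed vs. chance-expected agreement."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    p_e = sum((counts_a[label] / n) * (counts_b[label] / n) for label in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

print("raw agreement:", sum(x == y for x, y in zip(annotator_a, annotator_b)) / len(annotator_a))
print("Cohen's kappa:", round(cohen_kappa(annotator_a, annotator_b), 3))
```

Raw agreement typically looks more reassuring than the chance-corrected kappa, which is a concrete way into the discussion of how subjective a label like "depressed" really is.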
Reflections
- What is the point of a "gaydar" classifier? Is the "good" use the betterment of linguistic science? It’s easy to see examples of "bad" uses, harder to see what the good is.
- This is a hard topic, and may be better fit for the end of a course to help grapple with it.
- Topic requires background reading.
- How can we have students look at both sides of an issue, whether or not they actually agree with both sides of an issue?
- Classroom facilitation is important. While classes in person lend themselves well to discussion formats, online classes are not as good for engaging conversation. Additionally, it can be hard to have students open up and feel comfortable and safe. If students come from many backgrounds, they may also have different ideas of what is appropriate or how it may be appropriate to participate in a group setting or in the presence of a teacher or professor. Another option is a post-discussion questionnaire for all participants (this allows the instructor to get at least "minimal" information from students/course participants who do not engage in discussion or presentation)
- Not everyone has the same exposure to thinking about "the sciences" as a place for critical thinking. While the US system puts a lot of emphasis on learning to think critically in college, not every culture does, and many people think of the sciences as being a place where if an expert writes a peer reviewed paper that gets published somewhere, you should trust that. Additionally, you can't ask students to believe in things they don't believe in, so how can you "make students care" in ways that might be contrary to their personal beliefs?
- We thought that some sort of "introduction to ethics/thinking critically" lesson for the whole class would be a good way to introduce it to everyone at once, and to make sure that everyone is on the same page and has access to the same language, regardless of background. Additionally, we thought about the possibility of asking students not just to post their ideas, but to think about, write about, or submit a "pro/con" list, or one potentially positive outcome and one potentially negative outcome. That way, we take personal opinions out of the equation, which may help encourage looking at (or learning to look at) both sides of an issue, even if it might not be something you would natively do for a given application.
- Allowing students to engage with each other, either in person or virtually, is key to helping them learn to engage with the ideas of ethics.
- It's really hard to grade someone on a subjective comment -- there's not a right or a wrong answer in ethics necessarily. An assessment that focuses on participation and demonstrating that they're thinking about it (e.g. post x number of times, make a pro/con list with x number of things) helps to take your personal ideas as the teacher out of the discussion. Reinforcing that you're not grading subjectively is crucial here.
- Fundamentally, we need to connect this back to a value system. As computer scientists, we're not trained in this at all.
- Set ground rules and expectations - say that we don't have all the answers. As a student, I would appreciate people admitting that they don't know what they don't know. (But could be frustrating!)
- CS students tend to be more optimistic about the performance of the models; however, they sometimes do not follow any ethical code, as they are not trained to do so.
- For group discussions/mock reviews: graduate students are more involved in research institutions and reviewing processes, while undergrads may not be. It’s important for students to reflect on their personal stances and what role they might have in a decision like this, and that role is different for undergraduates and graduates (Keeping in mind that graduate students and undergraduate students may not know where they want to go or what their final role will be). In a classroom with both undergrad and grad students, it might be beneficial to have blended groups. This type of assignment might be applicable to corporate workshops also.
Privacy and Anonymity
Case Studies
- Consider a Naive Bayes classifier trained on a subset of 20 Newsgroups. How would you test and build for anonymization? Even after de-identifying, there may be proxies; there are dual-use issues with anonymization itself; and different vocabularies across regions and communities can be an issue in anonymizing data.
- Design a small search engine around an inverted index that uses random integer noise from a two-sided geometric distribution (Ghosh et al., 2012) to shape which results are retrieved. Analyze how much this changes the search results with different noise levels. Are there systematic changes? (The first sketch after this list shows one possible setup.)
- Social media data is used for a lot of studies. However, many users of social media do not understand this possible set of use cases when they participate online. How would one establish an appropriate protocol for informed consent in these scenarios?
- Differentially private polling for a binary question (randomized response with coin-flip noise). Implement the model from the description and see how good the estimates you can get out are, given a file of "gold" answers. Visualize and discuss the tradeoff. Then, a new dataset comes in where we have two groups with very different priors on the answer: if they are uneven in size, how does that affect the method? (The second sketch after this list shows the basic protocol.)
- Build an application for identifying Pokémon. Then challenge students to think about how this technology, with different data, could be used for applications that would reveal private data.
- Students propose a way to anonymize data from twitter so it can be used in NLP without being traceable to specific Twitter users. Need to think about different levels/definitions of anonymity (e.g. tracing a tweet to a specific user out of a given set, or identifying the country a user’s from).
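For the noisy inverted index case study above, here is a rough sketch of one possible setup; the documents, query, and scoring rule (summed noisy counts) are invented for illustration, the noise is added once to the stored counts, and alpha is set to exp(-epsilon) as in the geometric mechanism.

```python
# Sketch: retrieval over an inverted index with two-sided geometric noise
# (Ghosh et al., 2012) added to the stored term counts. Smaller epsilon
# means noisier counts and therefore more scrambled rankings.
import math
from collections import Counter, defaultdict
import numpy as np

docs = {  # invented toy collection
    "d1": "the cat sat on the mat",
    "d2": "dogs and cats living together",
    "d3": "the stock market fell on monday",
    "d4": "cat videos dominate the market for attention",
}

index = defaultdict(dict)  # term -> {doc_id: count}
for doc_id, text in docs.items():
    for term, count in Counter(text.lower().split()).items():
        index[term][doc_id] = count

rng = np.random.default_rng(0)

def two_sided_geometric(alpha):
    """Integer noise with P(Z = k) proportional to alpha**abs(k)."""
    # The difference of two geometric draws has exactly this distribution.
    return int(rng.geometric(1 - alpha)) - int(rng.geometric(1 - alpha))

def noisy_index(epsilon=None):
    """Copy of the index with noise added once to each stored count (no noise if epsilon is None)."""
    alpha = math.exp(-epsilon) if epsilon is not None else None
    return {term: {d: c + (two_sided_geometric(alpha) if alpha is not None else 0)
                   for d, c in postings.items()}
            for term, postings in index.items()}

def search(query, idx):
    """Rank documents by summed (possibly noisy) counts of the query terms."""
    scores = Counter()
    for term in query.lower().split():
        for doc_id, count in idx.get(term, {}).items():
            scores[doc_id] += count
    return [doc_id for doc_id, _ in scores.most_common()]

print("no noise:   ", search("cat market", noisy_index()))
for eps in (2.0, 0.5, 0.1):
    print(f"epsilon={eps}:", search("cat market", noisy_index(eps)))
```

Students can rebuild the index at several epsilon values and measure, say, overlap of the top results with the noiseless ranking, to make the privacy/utility tradeoff concrete.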
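For the differentially private polling case study, here is a sketch of the classic randomized-response protocol with a fair coin; the "gold" answers are simulated in code rather than read from a file, and the two-group setup mirrors the imbalance question in the prompt.

```python
# Sketch: randomized response for a yes/no question.
# Each respondent flips a coin: heads -> answer truthfully;
# tails -> flip again and report "yes" on heads, "no" on tails.
# Then P(reported yes) = 0.5 * p_true + 0.25, so p_true ~= 2 * (observed - 0.25).
import random

random.seed(0)

def randomized_response(truth: bool) -> bool:
    if random.random() < 0.5:       # first flip: heads -> report the truth
        return truth
    return random.random() < 0.5    # tails -> report a random answer

def estimate_true_rate(reports):
    observed = sum(reports) / len(reports)
    return 2 * (observed - 0.25)

# Simulated "gold" answers: two groups with very different base rates and sizes.
group_a = [random.random() < 0.40 for _ in range(5000)]  # large group, 40% true
group_b = [random.random() < 0.05 for _ in range(200)]   # small group, 5% true

for name, gold in [("A", group_a), ("B", group_b), ("A+B", group_a + group_b)]:
    reports = [randomized_response(answer) for answer in gold]
    print(f"group {name}: true rate = {sum(gold)/len(gold):.3f}, "
          f"estimate = {estimate_true_rate(reports):.3f}")
```

The small group's estimate is far noisier and can even come out negative, which is exactly the effect of uneven group sizes that the case study asks students to explore.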
Assessment Options
- Give a thought experiment or small exercise that doesn't require a computer; work together for 10 minutes to run through the exercise or thought experiment with each other (e.g. to show how re-identification can happen even in anonymised data), then discuss in small/large groups. Would require: toy dataset for short exercise.
- Given a dataset, have students write a report where they propose which features are deanonymizing, how their presence or absence would affect the model, and who would be affected by the ability to look inside the model and work out which documents were in the training set. Could be done individually or in groups. Would need: statistics and programming background, dataset
- In groups, make a game of looking for features that have an impact on identifying users: maybe give students a specific scenario, or maybe each group has to decide and justify what they think is an appropriate degree of privacy. Teams first have to come up with a system to privatize their data while still being able to perform certain tasks accurately. Other groups then try to identify users in each group's "anonymized" dataset to adversarially test it.
- Implement a particular differential privacy mechanism on a dataset and assess the accuracy of models trained with the data. Likely would be done in small groups over 1-2 weeks. Assessment will be based both on implementing the correct (i.e. to specification) model in the first part and finding and articulating problems that arise when imbalance and noise meet in the second part of the exercise. Would need: statistics and programming background, dataset
- Short self-assessments about understanding of privacy: comparisons in answers before and after another exercise, presentation, or reading. Ask the students: How would you feel if your data was used or redistributed? What about someone you care about?
Reflections
- Two core skills to assess: Understanding of the risks of not protecting privacy, and technical understanding of techniques to improve privacy, as well as adversarial techniques. The latter is more implementation of code, while the former may need more discussion/presentation of external resources.
- Implementing anonymization or privacy on a full real-world dataset takes a lot of time.
- This topic connects somewhat with bias. Talking about factors that reveal race and gender: geography can be a good predictor of race (in the U.S.). Some things don't occur to students, so there is some preparation needed for the students in terms of social conditions that they have not been made aware of prior to the exposure.
- Students see a separation between the technical skills and the ethics. See ethics as extraneous.
- Many students (younger generation) say “but if you put it on the internet you should’ve known that it will be used/shared”
- When do we need anonymization? / pseudonymization? We cannot predict the development of technology, so it's hard to perfect how to anonymize data.
- It’s hard to get consent for future uses; people don’t understand what their data could be used for in the future
- One interesting focus of conversation was around whether it is a moral obligation to de-identify (anonymize) data *before* using it for training at all, considering anything else as completely off the table, even in cases where there is a suspicion that it might decrease model performance in some ways, and how to try to engage students in thinking about that possible "trade-off".
- There are cultural and national differences in approaches to the ethics of privacy -- how could you come at it from a different perspective to justify including it in the curriculum?
- In industry, how are privacy concerns actually handled? In a workplace setting, who are the domain experts here? E.g. there may be offices or high-level resources related to an ethical conversation, and low-level policies in place for applying/ensuring ethics. Management may not want to add more steps for those procedures because it might make things take more time or be more difficult. This is difficult when working on such a small component of the system that it is not clear what the ethical implications could be.
- The importance of engaging professors in other domains-- law, health, other domains where the application could be used, as well as people with expertise in ethics in these domains and stakeholders who will be impacted.
Bias
Case Studies
- Have students consider an existing system and try to determine ways to measure bias in order to check it, following e.g. the template of How to Make a Racist AI Without Really Trying by Robyn Speer.
- Consider an automatic speech recognition (ASR) system used to automatically caption conference presentations. What are the impacts of bias in such a system? What kinds of bias may exist?
- Suppose ASR is being used to assess call center performance. If ASR systems are being trained on mostly white voices and are actively used to assess performance of call center agents, how might the results of this bias affect performance metrics, and how might this effect be mitigated?
- Recommender systems are often customized to detect or respond to user preferences by comparing with similar users. How might bias manifest in these cases?
- Imagine if an MT or summarization system was used to communicate in a scenario with some kind of importance, e.g. something big and scary (refugee/disaster scenario) or just mundane but important (scheduling an appointment, opening a bank account, etc.). What kinds of errors might arise from common sources of data bias?
- Grammatical error correction often represents only a minimal number of dialects. How might these systems be made to better handle linguistic variation?
- Consider a sentiment analysis tool whose text input handles names. How might such a system respond to proper names of individuals? Could systems learn bias in this context? How would one mitigate that bias?
Assessment Options
- Consider a training data set for a task (e.g., ASR). Quantify in a short report what sorts of biases these data may have and how this might affect a downstream task. Will need: existing dataset for task, (optionally) an existing model for the task to see how it compares.
- Lead a discussion with seminar students. Prompt them with a case study, and ask them to discuss (in small, then large groups) possible sources of bias, how they might manifest, and how they might be mitigated. Will need: prior background on the domain, prompts and examples to brainstorm from. Include discussion and reflection questions, such as:
- What should go in documentation so users are aware of bias?
- How to improve by collecting data?
- Speculate on important sources of bias based on knowledge about data, insight from observed errors.
- What sources can be addressed? What cannot?
- Compare different ASR systems (trained on different dialects, tested on a different dialect)
- (post-discussion) Do you recognize issues with systems you interact with, which you didn’t recognize before?
- Consider a larger NLP “pipeline” to address a more complex task. Have students first break this process into “bins”, then split into groups that each focus on one “bin” of the pipeline. Groups can switch between parts after leaving feedback that following groups can see. Would need: background on domain of task and bias, whiteboards/collaborative writing technology, guidance on how to critique as constructive feedback
- Ask participants to provide their own examples of problematic NLP. Then, in groups, have team members later put together presentations or write a paper about problematic bias seen in past projects (or, in industry, in a work context). For less technical team members, ensure that they are able to ask the questions necessary to identify bias in data received. Provide a dataset and have each team member come up with relevant questions.
- Lab exercise for small groups: after implementing a simple recommender system and training it on instructor-provided data, each group is assigned an "identity" (one user of the recommender system, with a particular set of attributes and preferences) and asked to interact with the system as that user, then prepare a short write-up with their findings. The final report identifies which groups experience disparate outcomes from the system and how that bias is derived from the data, as well as how this would manifest in the broader context in which these systems are used. This report should include an opportunity for input from all students in ways that reflect their different backgrounds. Will need: access to a trained recommender system, background on the model.
- Have students role-play the possible outcomes of a malfunctioning system and reflect on how they would intervene in the way the system runs. After the exercises, students do brief presentations in their groups about the outcomes, focusing on different assigned elements of the exercise. Will need: a planned scenario or script that actually exhibits the bias problems we are "hoping" it does, an MT or related system relevant for the scenario, plus whatever background materials support the dialogue we want them to attempt. Students might be able to bring their own system, perhaps one they had built in an earlier phase of the class; looking at their own system's training and behavior could be part of the exercise.
- As a lab exercise using e.g. a notebook (as in Robyn Speer’s blog post), using existing references and models, find metrics that can capture different forms of bias, starting with basic examples. Discuss what the pros and cons of different existing metrics are and why there is no one “bias” metric. Ensure students also take the opportunity to look at misclassifications to understand what these metrics do or don’t capture. Will need: some background/reading on bias metrics, prepared code/notebook with data, prior unit on word embeddings, prerequisite understanding of statistics and rigorous evaluation
- In an industry setting, have a team working on a project with human-centered consequences (e.g. an automatic recognition, scoring, or classification tool for people) read some general background on bias in NLP systems and the ACL/ACM Code of Ethics. Then, have a formal presentation on these topics and an open-forum discussion among colleagues to brainstorm and discuss how these relate to the products being worked on. Isolate concrete steps to be taken to modify existing policies for handling data and systems. Meeting notes consolidated from the discussions, establishing that all colleagues are on the same page, are also crucial. Assessment will also include examination of additional changes in documentation about testing on different minority populations' voices or text inputs and gauging the associated performance in comparison with majority speakers.
- Instructors can provide an off-the-shelf model with test data sets of different language varieties. Students will experiment with the model and its performance over the different test data sets to appreciate the variation in outcomes. Will need: dataset with linguistic variation (if appropriate, from dialects spoken by members of the class). After the practical exercises, students write a report about their findings that addresses these questions:
- Describe the potential impact of linguistic variation on the functioning of NLP/speech technology
- Describe how to choose what kinds of language variation should be tested for
- Reason about how differential performance for different social groups can lead to adverse impacts
- Articulate what kind of documentation should accompany NLP/speech technology to facilitate safe deployment
- Have teams start by building some “debiased” model, e.g. for sentiment analysis or classification. Give examples of different stakeholders and different kinds of harm, in maybe a tabular form. Then, have those teams trade models and attempt to determine biases still in these systems. Would need: some shared background on that model and domain, starting models to adapt.
- Have students write a suite of "test cases" for existing models designed to perform a benchmark task to see if these models display certain kinds of biases. Students could design pairs of words/sentences to test for bias in word embeddings / sentiment analysis, e.g., "Emily" and "Shaniqua" as names. Then, have them write an essay about what they learned from designing the exercise and testing it on some public models. Would need: existing background on the model, coding for NLP, biases. (A sketch of such a test harness follows this list.)
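For the test-suite assignment above, here is a sketch of a name-swap harness; the templates and name lists are illustrative rather than a validated benchmark, and `sentiment_score` is a dummy stand-in to be replaced by whatever model the students are auditing (e.g. a classifier built earlier in the course).

```python
# Sketch: name-swap bias test. Score sentence pairs that differ only in a name
# and compare the average sentiment assigned to each name group.
from statistics import mean

def sentiment_score(text: str) -> float:
    """Dummy stand-in so the harness runs end to end; replace with the model under audit."""
    toy_positive = {"lunch", "friend"}  # toy lexicon, illustration only
    words = text.lower().replace(".", "").replace("'s", "").split()
    return float(sum(word in toy_positive for word in words))

TEMPLATES = [  # illustrative; students should write and justify their own
    "{name} is my neighbor.",
    "I had lunch with {name} today.",
    "{name} applied for the job.",
    "This is {name}'s first day at the company.",
]
NAME_GROUPS = {  # names chosen by students to probe a specific contrast
    "group_1": ["Emily", "Katie", "Connor"],
    "group_2": ["Shaniqua", "Lakisha", "DeShawn"],
}

def audit(score_fn):
    """Return per-group mean sentiment and the gap between the two groups."""
    group_means = {}
    for group, names in NAME_GROUPS.items():
        scores = [score_fn(template.format(name=name))
                  for template in TEMPLATES for name in names]
        group_means[group] = mean(scores)
    gap = group_means["group_1"] - group_means["group_2"]
    return group_means, gap

means, gap = audit(sentiment_score)
print(means, "gap:", gap)   # a nonzero gap suggests the model is sensitive to the name
```

The write-up can then report per-group means and the gap for each public model tested, along with the sentence pairs showing the largest differences.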
Reflections
- There is some concern about how to come up with an "exam" question for this topic, but one idea offered was an open-ended question that describes a dataset and algorithm and asks the students to describe the potential biases, how they would measure them, and mitigation strategies (where a thoughtful answer that referenced ideas from the course reading would receive full credit).
- Build specific activities around discussion to avoid people feeling defensive/targeted about products that they’ve built.
- Express an understanding that everything is going to be biased, no matter how hard we work to avoid it.
- For in-class discussion or role-play exercises, the exercise would be something that would need to work in an online/distance setting (COVID, etc.), but would also need to ensure lots of interactivity. The exercise would be designed around making that happen.
- Most examples discussing model bias come from the US/UK - how can they be interesting for all students from various countries and ethnicities?
- When engaging students about bias in data and models, how do students relate to the underlying data? Do they feel like victims or perpetrators?
- Formats like notebooks with pre-written code can be welcoming to students with non-CS background to still engage with the same questions. When students have mixed backgrounds, build on different students’ strengths (technical vs linguistic/psychology backgrounds).
- Annotators are different from programmers, who are also sometimes different from the people who check the annotations - how do we deal with bias entering from different stages? Can we train annotators using these sorts of exercises?
- Is the important thing to "solve" debiasing word embeddings, or to make people more cognizant of the fact that these biases exist?
- Speer’s work has an interesting technical aspect, but to some students the findings that some things are not acceptable to say might be novel. Work immediately critical of these models without explaining what the harms are might be hurtful to people who haven’t thought about these issues yet. The phrasing of “of course we don’t want such issues” can alienate people.
- Potentially have a(n anonymous) survey to figure out what the consensus is, and model the teaching on where the class stands.
- Use storytelling to convey how biased systems can impact people's lives without requiring technical knowledge. Sometimes stories work.