Contemporary NLP Modeling in Six Comprehensive Programming Assignments

We present a series of programming assignments, adaptable to a range of experience levels from advanced undergraduate to PhD, to teach students design and implementation of modern NLP systems. These assignments build from the ground up and emphasize full-stack understanding of machine learning models: initially, students implement inference and gradient computation by hand, then use PyTorch to build nearly state-of-the-art neural networks using current best practices. Topics are chosen to cover a wide range of modeling and inference techniques that one might encounter, ranging from linear models suitable for industry applications to state-of-the-art deep learning models used in NLP research. The assignments are customizable, with constrained options to guide less experienced students or open-ended options giving advanced students freedom to explore. All of them can be deployed in a fully autogradable fashion, and have collectively been tested on over 300 students across several semesters.


Introduction
This paper presents a series of assignments designed to give a survey of modern NLP through the lens of system-building. These assignments provide hands-on experience with concepts and implementation practices that we consider critical for students to master, ranging from linear feature-based models to cutting-edge deep learning approaches. The assignments are as follows:

A1. Sentiment analysis with linear models (Pang et al., 2002) on the Stanford Sentiment Treebank (Socher et al., 2013).

A3. Hidden Markov Models and linear-chain conditional random fields (CRFs) for named entity recognition (NER) (Tjong Kim Sang and De Meulder, 2003), using features similar to those from Zhang and Johnson (2003).
A1-A5 come with autograders. These train each student's model from scratch and evaluate its performance on the development set of each task, verifying whether the code behaves as intended. The autograders are bundled for deployment on Gradescope using its Docker framework. These coding assignments can also be supplemented with conceptual questions to form hybrid assignments, though we do not distribute those as part of this release.
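As a rough illustration of the shape of these checks, the following sketch trains a student's model from scratch and compares development-set accuracy against a threshold. All names here (check_submission, train_model, predict) and the threshold value are hypothetical, not the actual released harness:

```python
def evaluate(predictions, gold_labels):
    """Fraction of predictions that match the gold labels."""
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)

def check_submission(train_model, train_data, dev_examples, dev_labels,
                     threshold=0.74):
    """Train the student's model from scratch, then verify dev-set accuracy.

    train_model is the student-implemented entry point; returns a pass/fail
    flag along with the measured accuracy for reporting.
    """
    model = train_model(train_data)                    # student code runs here
    preds = [model.predict(ex) for ex in dev_examples]
    accuracy = evaluate(preds, dev_labels)
    return accuracy >= threshold, accuracy
```

Because the model is retrained on every submission, the check verifies the training loop and the inference code together rather than grading a fixed set of predictions.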
Other Courses and Materials Several other widely publicized courses, such as Stanford CS224N and CMU CS 11-747, take a more "neural-first" view of NLP: their assignments delve more deeply into word embeddings and low-level neural implementation such as backpropagation. By contrast, this course is designed to be a survey that also covers topics like linear classification, generative modeling (HMMs), and structured inference.
Other hands-on courses discussed in prior Teaching NLP papers (Klein, 2005; Madnani and Dorr, 2008; Baldridge and Erk, 2008) make some similar choices about how to blend linguistics and CS concepts, but our desire to integrate deep learning as a primary (but not the sole) focus area guides us towards a different set of assignment topics.

Design Principles
This set of assignments was designed after we asked ourselves: what should a student taking NLP know how to build? NLP draws on principles from machine learning, statistics, linguistics, algorithms, and more, and we set out to expose students to a range of ideas from these disciplines through the lens of implementation. This choice follows the "text processing first" (Bird, 2008) or "core tools" (Klein, 2005) views of the field, with the idea that students can undertake additional study of particular topic areas and quickly get up to speed on modeling approaches given the building blocks presented here.

Covering Model Types
There are far too many NLP tasks and models to cover in a single course. Rather than focus on exposing students to the most important applications, we instead designed these assignments to feature a range of models along the following typological dimensions.
Output space The prediction spaces of models considered here include binary/multiclass (A1, A2), structured (sequence in A3, span in A6), and natural language (sequence of words in A4, executable query in A5). While structured models have fallen out of favor with the advent of neural networks, we view tagging and parsing as fundamental pedagogical tools for getting students to think about linguistic structure and ambiguity, and these are emphasized in our courses.

Other Desiderata
A major consideration in designing these assignments was to enable understanding without large-scale computational resources. Maintaining simplicity and tractability is the main reason we do not feature more exploration of pre-trained models (Devlin et al., 2019). These factors are also why we chose character-level language modeling (rather than word-level) and seq2seq semantic parsing (rather than translation): training large autoregressive models to perform well when output vocabularies are in the tens of thousands requires significant engineering expertise. While we teach students skills like debugging and testing models in simplified settings, we still found it less painful to build our projects around these more tractable tasks, where students can iterate quickly.

Another core goal was to allow students to build systems from the ground up using simple, understandable code. We build on PyTorch primitives (Paszke et al., 2019), but otherwise avoid frameworks like Keras, Huggingface, or AllenNLP. The code is also somewhat "underengineered": we avoid overly heavy reliance on Pythonic constructs like list comprehensions or generators, as not all students come in with a high level of familiarity with Python.
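To illustrate the deliberately simple style (this snippet is representative, not taken from the released code, and the names extract_unigram_features and indexer are hypothetical), a feature extractor is written with explicit loops and named intermediates rather than dense comprehensions:

```python
def extract_unigram_features(sentence, indexer):
    """Map a tokenized sentence to a sparse count vector over the vocabulary.

    indexer: dict from word to feature index; words absent from the
    indexer are skipped. Returns a dict from feature index to count.
    """
    feature_counts = {}
    for word in sentence:
        if word in indexer:
            idx = indexer[word]
            if idx in feature_counts:
                feature_counts[idx] = feature_counts[idx] + 1
            else:
                feature_counts[idx] = 1
    return feature_counts
```

A one-line Counter-based version would be more idiomatic, but the explicit form lets students with weaker Python backgrounds step through the logic and extend it (e.g., to bigram features) without fighting the syntax.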
What's missing Parsing is notably absent from these assignments; we judged that both chart parsers and transition-based parsers involved too many engineering details specific to these settings. All of our classes do cover parsing and in some cases have other hands-on components that engage with parsing, but students do not actually build a parser. Instead, sequence models are taken as an example of structured inference, and other classification tasks are used instead of transition systems.
From a system-building perspective, the biggest omissions are pre-training and Transformers. These can be explored in the context of final projects, as we describe in the next section.
Finally, our courses integrate additional discussion of ethics, with specific discussions of bias in word embeddings (Bolukbasi et al., 2016; Gonen and Goldberg, 2019) and ethical considerations of pre-trained models (Bender et al., 2021), as well as an open-ended discussion of the social impact and ethical considerations of NLP, deep learning, and machine learning. These are not formally assessed at present, but we are considering this for future iterations of the course given these topics' importance.

Deployment
These assignments have been used in four different versions of an NLP survey course: an upper-level undergraduate course, a master's-level course (delivered online), and two PhD-level courses. In the online MS course, they constitute the only assessment. For courses delivered in a traditional classroom format, we recommend choosing a subset of the assignments and supplementing with additional written assignments testing conceptual understanding.
Our undergrad courses use A1, A2, A4, and a final project based on A6. We use additional written assignments covering word embedding techniques, syntactic parsing, machine translation, and pre-trained models. Our PhD-level courses use A1, A2, A3, A5, and an independent final project. The assignments also support further "extension" options: for example, in A3, beam search is presented as optional, and students can also explore parallel decoding for the CRF or additional features to make NER work better on German. For the seq2seq model, they can experiment with Transformers or implement constrained decoding to always produce valid logical forms. We believe that A1 and A2 could be adapted for use in a wide range of courses, but A3-A6 are most appropriate for advanced undergraduates or graduate students.
Syllabus Table 2 pairs these assignments with readings in texts by Jurafsky and Martin (2021) and Eisenstein (2019). See Greg Durrett's course pages for complete sets of readings.
Logistics We typically give students around two weeks per assignment. Their submission consists of either just the code or the code plus a brief report, depending on the course format. Students collaborate on assignments through a discussion board on Piazza as well as in person. We have seen a relatively low incidence of students copying code, as assessed using Moss over several semesters.
Pain Points Especially on A3, A4, and A5, we come across students who find debugging to be a major challenge. In the assignments, we suggest strategies to verify parts of the inference code independently of training, as well as simplified tasks to test models on, but some students find these avenues challenging or are unwilling to pursue them. On a similar note, students often do not have a prior sense of what the system should do. It might not raise a red flag that their code takes an hour per epoch, or gets 3% accuracy on the development set, and they end up getting stuck as a result. Understanding what these failures mean is something we emphasize. Finally, students sometimes have a (real or perceived) lack of background in either coding or the mathematical fundamentals of the course; however, many such students end up doing well even when these are their first ML/NLP courses.
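One such strategy for verifying inference code independently of training, sketched here with illustrative toy parameters rather than any assignment's actual code: on an HMM small enough to enumerate, the forward algorithm's log-likelihood must agree with brute-force summation over all state sequences, so a mismatch isolates a bug in the dynamic program before any training is run.

```python
import itertools
import math

def forward_ll(obs, pi, A, B):
    """Log P(obs) via the forward algorithm (raw probabilities, for clarity).

    pi: initial state distribution; A: transition matrix A[s'][s];
    B: emission matrix B[s][o]; obs: list of observation indices.
    """
    n_states = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        alpha = [sum(alpha[sp] * A[sp][s] for sp in range(n_states)) * B[s][o]
                 for s in range(n_states)]
    return math.log(sum(alpha))

def brute_force_ll(obs, pi, A, B):
    """Log P(obs) by explicitly summing over every possible state sequence."""
    n_states = len(pi)
    total = 0.0
    for seq in itertools.product(range(n_states), repeat=len(obs)):
        p = pi[seq[0]] * B[seq[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[seq[t - 1]][seq[t]] * B[seq[t]][obs[t]]
        total += p
    return math.log(total)
```

The brute-force version is exponential in the sequence length, but on a two-state HMM with a handful of observations it runs instantly and gives students a known-correct reference value, directly addressing the "no prior on what the system should do" problem.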