Flamingos and Hedgehogs in the Croquet-Ground: Teaching Evaluation of NLP Systems for Undergraduate Students

This report describes the course Evaluation of NLP Systems, taught for Computational Linguistics undergraduate students during the winter semester 20/21 at the University of Potsdam, Germany. It was a discussion-based seminar that covered different aspects of evaluation in NLP, namely paradigms, common procedures, data annotation, metrics and measurements, statistical significance testing, best practices and common approaches in specific NLP tasks and applications.


Motivation
"Alice soon came to the conclusion that it was a very difficult game indeed." 1 When the Queen of Hearts invited Alice to her croquet-ground, Alice had no idea how to play that strange game with flamingos and hedgehogs. NLP newcomers may be as puzzled as her when they enter the Wonderland of NLP and encounter a myriad of strange new concepts: Baseline, F1 score, glass box, ablation, diagnostic, extrinsic and intrinsic, performance, annotation, metrics, humanbased, test suite, shared task. . .
Although experienced researchers and practitioners may easily relate them to the evaluation of NLP models and systems, for newcomers like undergraduate students it is not simply a matter of looking up their definitions. It is necessary to show them the big picture of what and how we play in the croquet-ground of evaluation in NLP.
The NLP community clearly cares about doing proper evaluation. From early works like the book by Karen Spärck Jones and Julia R. Galliers (1995) to the winner of the ACL 2020 best paper award (Ribeiro et al., 2020) and recent dedicated workshops, e.g. Eger et al. (2020), the formulation of evaluation methodologies has been a prominent topic in the field.
Despite its importance, evaluation is usually covered only briefly in NLP courses due to tight schedules. Teachers barely have time to discuss dataset splits, simple metrics like accuracy, precision, recall and F1 score, and techniques like cross validation. As a result, students end up learning about evaluation on the fly as they begin their careers in NLP. This lack of structured knowledge may leave them unacquainted with the multifaceted metrics and procedures, and thus only partially able to evaluate models critically and responsibly. The leap from that one lecture to what is expected in good NLP papers and software should not be underestimated.
The course Evaluation of NLP Systems, which I taught for undergraduate Computational Linguistics students in the winter semester of 20/21 at the University of Potsdam, Germany, followed a reading- and discussion-based approach with three main goals: i) helping participants become aware of the importance of evaluation in NLP; ii) discussing different evaluation methods, metrics and techniques; and iii) showing how evaluation is being done for different NLP tasks.
The following sections provide an overview of the course content and structure. With some adaptation, this course can also be suitable for more advanced students.

Course Content and Format

The course covered the topics outlined below; details about the weekly reading lists are available on the course's website (https://briemadu.github.io/evalNLP/schedule).

Paradigms
Kinds of evaluation and main steps, e.g. intrinsic and extrinsic, manual and automatic, black box and glass box.

Common Procedures
Overview of the use of measurements, baselines, dataset splits, cross validation, error analysis, ablation, human evaluation and comparisons.
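As a minimal illustration of two of these procedures, the sketch below compares a majority-class baseline against a simple classifier with 5-fold cross validation in scikit-learn; the dataset and models are invented for illustration and were not part of the course materials.

# Minimal sketch: a model vs. a majority-class baseline under
# 5-fold cross validation (dataset and models are illustrative).
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

classifiers = {
    "majority baseline": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
}

for name, clf in classifiers.items():
    # cross_val_score trains and evaluates on 5 different splits.
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")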

Annotation
How to annotate linguistic data, how to evaluate the annotation, and how the annotation scheme can affect the evaluation of a system's performance.
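One common way to evaluate an annotation, among those covered in the readings, is a chance-corrected inter-annotator agreement coefficient such as Cohen's kappa. A minimal sketch, with invented labels:

# Minimal sketch: inter-annotator agreement via Cohen's kappa
# (labels invented for illustration).
from sklearn.metrics import cohen_kappa_score

annotator_a = ["NOUN", "VERB", "NOUN", "ADJ", "VERB", "NOUN"]
annotator_b = ["NOUN", "VERB", "ADJ", "ADJ", "VERB", "VERB"]

# Kappa corrects the raw agreement rate for agreement expected by chance.
print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")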

Metrics and Measurements
Outline of the different metrics commonly used in NLP, what they aim to quantify and how to interpret them.
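As a minimal example, the standard classification metrics can be computed directly with scikit-learn; the gold and predicted labels below are invented:

# Minimal sketch: accuracy, precision, recall and F1 on toy labels.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

gold = [1, 0, 1, 1, 0, 1, 0, 0]
pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(gold, pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    gold, pred, average="binary"
)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} F1={f1:.2f}")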

Statistical Significance Testing
Hypothesis testing for comparing the performance of two systems on the same dataset.
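A common choice for this in NLP, and one way to illustrate the idea, is a paired approximate randomization (permutation) test over per-item scores; the sketch below is a minimal self-contained version with invented per-item correctness values.

# Minimal sketch: paired approximate randomization test for the
# difference in mean per-item scores of two systems (invented data).
import random

def paired_permutation_test(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Two-sided test for the difference in mean per-item scores."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    count = 0
    for _ in range(n_permutations):
        swapped_a, swapped_b = [], []
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:  # randomly swap the paired outputs
                a, b = b, a
            swapped_a.append(a)
            swapped_b.append(b)
        diff = abs(sum(swapped_a) - sum(swapped_b)) / n
        if diff >= observed:
            count += 1
    return (count + 1) / (n_permutations + 1)  # smoothed p-value

# Per-item correctness (1 = correct) of two systems on the same test items.
system_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
system_b = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
print(f"p = {paired_permutation_test(system_a, system_b):.3f}")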

Best Practices
The linguistic aspect of NLP, reproducibility and the social impact of NLP.

NLP Case Studies
Group presentations on how evaluation is approached in four NLP tasks/applications (machine translation, natural language generation, dialogue and speech synthesis) and related themes (the history of evaluation, shared tasks, ethics and ACL's code of conduct, and the replication crisis).

The course took place 100% online due to the pandemic. It was divided into two parts. In the first half of the semester, students learned about the evaluation methods used in NLP in general and, to some extent, in machine learning. After each meeting, I posted a pre-recorded short lecture, slides and a reading list about the next week's content. The participants thus had one week to work through the material at any time before the next meeting slot. I provided diverse sources like papers, blog posts, tutorials, slides and videos.
I started the online meetings with a wrap-up and feedback about the previous week's content. Then, I randomly split the students into groups of 3 or 4 participants in breakout sessions so that they could discuss a worksheet together for about 45 minutes. I encouraged them to use this occasion to profit from the interaction and brainstorming with their peers and to exchange arguments and thoughts. After the meeting, they had one week to write down their solutions individually and submit them.
In the second half of the semester, the students divided into four groups to analyze how evaluation is done in specific NLP tasks. For larger classes, more NLP tasks can be added. They prepared group presentations and discussion topics according to general guidelines and an initial bibliography that they could expand. Students sent me anonymous feedback about each other's presentations, which I then shared with the presenters; routing it through me gave me the chance to filter out abusive or offensive comments.
The last lecture was a tutorial on useful metrics available in the scikit-learn and nltk Python libraries, using a Jupyter notebook (Kluyver et al., 2016).
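To give a flavor of that session (the example below is my reconstruction, not the original notebook), such a tutorial can combine ready-made metrics from both libraries:

# Minimal sketch in the spirit of the tutorial: classification metrics
# from scikit-learn and BLEU from nltk (all data invented).
from sklearn.metrics import classification_report
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

gold = ["pos", "neg", "pos", "neg"]
pred = ["pos", "neg", "neg", "neg"]
print(classification_report(gold, pred))

reference = [["the", "cat", "sat", "on", "the", "mat"]]
hypothesis = ["the", "cat", "is", "on", "the", "mat"]
smooth = SmoothingFunction().method1  # smoothing avoids zero n-gram counts
print(f"BLEU: {sentence_bleu(reference, hypothesis, smoothing_function=smooth):.2f}")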
Finally, they had six weeks to work on a final project. Students could select one of the following three options: i) a critical essay on the development and current state of evaluation in NLP, discussing the positive and negative aspects and where to go from here; ii) a hands-on detailed evaluation of an NLP system of their choice, which could be, for example, an algorithm they implemented for another course; or iii) a summary of the course in the format of a small newspaper.

Participants
Seventeen bachelor students of Computational Linguistics attended the course. At the University of Potsdam, this seminar falls into the category of a module called Methods of Computational Linguistics, which is intended for students in the 5th semester of their bachelor's program. Still, one student in the 3rd semester and many students in higher semesters also took part.
By the 5th semester, students are expected to have completed introductory courses on linguistics (phonetics and phonology, syntax, morphology, semantics, and psycho- and neurolinguistics), computational linguistics techniques, computer science and programming (finite state automata, advanced Python and other courses of their choice), introduction to statistics and empirical methods, and foundations of mathematics and logic, as well as various seminars related to computational linguistics.
Although there were no formal requirements for taking this course, students should preferably be familiar with some common tasks and practices in NLP and with the basics of statistics.

Outcomes
I believe this course successfully introduced students to several fundamental principles of evaluation in NLP. The quality of their submissions, especially the final projects, was, in general, very high. Knowing how to properly manage flamingos and hedgehogs, they will hopefully be spared the sentence "Off with their heads!" as they continue their careers in NLP. The game is not very difficult once one learns the rules.
Students gave very positive feedback at the end of the semester about the content, the literature and the format. They particularly enjoyed the opportunity to discuss with each other, saying it was good to exchange what they recalled from the reading. They also stated that what they learned contributed to their understanding in other courses and improved their ability to document and evaluate models they implement. The course was also useful for them to start reading more scientific literature.
In terms of improvements, they mentioned that the weekly workload could be reduced. They also reported that the reading for the week in which we covered statistical significance testing was too advanced. Still, they managed to complete the worksheet, since it did not dive deep into the theory.
The syllabus, slides and suggested readings are available on the course's website. The references here list the papers and books used to put together the course and make no claim to being exhaustive. If this course is replicated, the references should be updated with the most recent papers. I can share the worksheets and the guidelines for the group presentations and the final project upon request. Feedback from readers is very welcome.

Acknowledgments
In this course, I was inspired by and used material made available online by many people, to whom I am thankful. I also thank the students, who were very engaged during the semester and made it a rewarding experience for me. Moreover, I am grateful to the anonymous reviewers for their detailed and encouraging feedback.