Overview of the SustaiNLP 2020 Shared Task

We describe the SustaiNLP 2020 shared task: efficient inference on the SuperGLUE benchmark (Wang et al., 2019). Participants are evaluated based on performance on the benchmark as well as energy consumed in making predictions on the test sets. We describe the task, its organization, and the submitted systems. Across the six submissions to the shared task, participants achieved efficiency gains of 20× over a standard BERT (Devlin et al., 2019) baseline, while losing less than an absolute point in performance.


Introduction
While ever-larger pretrained language models have led to impressive gains across a variety of natural language processing (NLP) tasks, there is growing concern about the environmental impact of training and deploying these models (Strubell et al., 2019; Schwartz et al., 2019). In response, a growing body of research has focused on making these large models smaller and more efficient with minimal sacrifice to performance (Michel et al., 2019, i.a.). The SustaiNLP 2020 shared task focuses on the development of computationally and energy efficient NLP systems. The task uses the SuperGLUE benchmark (Wang et al., 2019), a standard benchmark for natural language understanding. Systems are evaluated on both their benchmark score and the energy consumed in evaluating them on the benchmark. Participants are therefore incentivized to develop models that are energy efficient while maintaining the high performance of recent models. The shared task received six submissions that employed a wide variety of optimizations to improve system efficiency. Overall, the submitted systems were on average 20× more efficient than a standard baseline using pretrained language models while nearly matching baseline performance.

Task
The shared task centers on the SuperGLUE benchmark, a suite of eight diverse NLU tasks designed to test a broad range of language understanding capabilities. The tasks vary substantially in task type, input size, and textual domain. We use seven of the eight SuperGLUE tasks, as the extremely small size of the Winograd Schema Challenge (WSC) dataset makes it challenging to obtain meaningful performance while improving the efficiency of the system. We briefly describe the seven tasks used here; see Wang et al. (2019) for an in-depth discussion of the tasks.
• Words in Context (WiC) is a word sense disambiguation task where each example consists of a pair of sentences that each contain the same marked word. The task is to determine whether the word has the same sense in both sentences. The test set consists of 1400 examples, and the evaluation metric is accuracy.
To participate, each submission produces predictions on the test set of each task and is scored according to the task's evaluation metrics. The overall task performance is determined by averaging the performance metrics across tasks; for tasks with multiple evaluation metrics, we first average within each task.
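For concreteness, a minimal sketch of this scoring scheme is given below. The task names are those of the benchmark, but the metric values are placeholders used only to illustrate the averaging, not actual results.

from statistics import mean

# Per-task metric values (accuracy, F1, exact match, etc.); the numbers below
# are placeholders, not results from the shared task.
task_metrics = {
    "BoolQ":   [75.0],        # accuracy
    "CB":      [80.0, 78.0],  # accuracy, F1
    "COPA":    [70.0],        # accuracy
    "MultiRC": [65.0, 25.0],  # answer-level F1, exact match
    "ReCoRD":  [70.0, 69.0],  # token-level F1, exact match
    "RTE":     [72.0],        # accuracy
    "WiC":     [68.0],        # accuracy
}

# Average within each task first, then take an unweighted average across tasks.
per_task = {task: mean(values) for task, values in task_metrics.items()}
overall = mean(per_task.values())
print(f"Overall task performance: {overall:.1f}")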

Efficiency
As the workshop focuses on developing computationally efficient systems, we additionally evaluate systems by how efficiently they produce predictions on the test set. We focus on measuring efficiency during inference rather than training, as, in the current paradigm, models are trained only a handful of (expensive) times but used for inference many more times. Additionally, measuring efficiency during training is complicated by the widespread reliance on pretrained model components.
Though there are many metrics for measuring efficiency, we follow the recommendation of Henderson et al. (2020) and measure efficiency by the power consumed throughout the course of inference. To do so, we use the experiment-impact-tracker library (Henderson et al., 2020).
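A minimal sketch of how a submission's inference run could be wrapped with experiment-impact-tracker is shown below. The class and method names follow the library's documented usage as we recall it and may differ across versions; model and test_examples are assumed to be defined elsewhere.

from experiment_impact_tracker.compute_tracker import ImpactTracker

def run_inference(model, examples):
    # Placeholder for a submission's actual prediction loop.
    return [model(example) for example in examples]

# Launch a background monitor that logs power draw (and derived energy and
# carbon estimates) to the given directory while inference runs.
tracker = ImpactTracker("impact_logs/")
tracker.launch_impact_monitor()

predictions = run_inference(model, test_examples)  # model, test_examples assumed defined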

Organization
We consider two tracks: one using GPUs and one restricted to CPU only. Submissions were welcome to use any programming language or libraries, but were run on standardized hardware environments. For the GPU track, participants had four Nvidia V100s (32GB) available to them, but all participants chose to use only one GPU due to the cost of parallelization overhead. We run all submissions three times and report the mean task and efficiency scores.

Submissions
We provided participants with a simple baseline that follows the standard paradigm of finetuning a pretrained language model on each task. For pretrained models, we use BERT-base (Devlin et al., 2019) and RoBERTa-large (Liu et al., 2019), as provided by the HuggingFace Transformers library.
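A minimal sketch of this baseline recipe for a single task is given below, using the HuggingFace Transformers and Datasets libraries. The model identifier, hyperparameters, and choice of BoolQ are illustrative, not the exact baseline configuration.

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)

# BoolQ casts naturally as sentence-pair classification: (question, passage) -> yes/no.
raw = load_dataset("super_glue", "boolq")

def encode(batch):
    return tokenizer(batch["question"], batch["passage"],
                     truncation=True, max_length=256)

encoded = raw.map(encode, batched=True)

args = TrainingArguments(output_dir="boolq-baseline",
                         learning_rate=2e-5,
                         per_device_train_batch_size=16,
                         num_train_epochs=3)

trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["validation"])
trainer.train()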
There were six submissions to the shared task: four to the GPU track and two to the CPU track. All submissions were provided by Kim and Hassan (2020). We provide a brief description of the six submissions below; see Kim and Hassan (2020) for in-depth descriptions. Systems 1-* are submissions to the GPU track and systems 3-* are submissions to the CPU track.

Table 2: Task performance for various systems. For BoolQ, COPA, RTE, and WiC, the evaluation metric is accuracy. For CB, the evaluation metrics are accuracy and F1. For MultiRC, the evaluation metrics are answer-level F1 and exact match. For ReCoRD, the evaluation metrics are token-level F1 and exact match. The overall task performance is an unweighted average of performance across tasks.

• 1-1: This submission applies both task-specific and task-agnostic knowledge distillation (Hinton et al., 2015) from the pretrained and finetuned BERT model. It then reduces the model sizes via network pruning (Karnin, 1990) and further decreases the memory footprint by using 16-bit precision. Finally, it improves the runtime by fusing specific operations using onnxruntime and by using a large evaluation batch size (a rough sketch of some of these optimizations appears after this list).
• 1-2: This submission is the same as 1-1 except that it uses a modified model for MultiRC.
• 1-3: This submission is a hybrid system that uses the GPU only for ReCoRD, due to its much larger size, and the CPU for all other tasks. It uses the same optimizations as 1-1.
• 1-4: This submission is the same as 1-3 except it uses the modified MultiRC model.
• 3-1: This submission uses the same models as 1-3, but runs only on CPUs. It includes additional CPU-specific optimizations such as 8-bit quantization for some matrix multiplications and an optimized number of CPU processes per task.
• 3-2: This submission uses the same models as 1-4, but only uses CPUs. It uses the same optimizations as 3-1.
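A rough sketch of two of the inference-time optimizations mentioned above (8-bit dynamic quantization for CPU inference and operator fusion via onnxruntime), applied to a generic PyTorch model, is given below. These are generic recipes under assumed inputs (model, model.onnx, input_ids, attention_mask), not the submissions' actual code.

import torch
import onnxruntime as ort

# (a) 8-bit dynamic quantization of linear layers for CPU inference (cf. 3-1, 3-2).
# `model` is assumed to be a finetuned PyTorch model loaded elsewhere.
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},   # quantize only the large matrix multiplications
    dtype=torch.qint8,
)

# (b) ONNX Runtime with full graph optimizations, which fuse specific
# operations into single kernels (cf. 1-1). Assumes the model has already
# been exported to "model.onnx".
options = ort.SessionOptions()
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession("model.onnx", options)

# A large evaluation batch size amortizes per-call overhead; input_ids and
# attention_mask are assumed to be NumPy arrays covering many test examples.
outputs = session.run(None, {"input_ids": input_ids,
                             "attention_mask": attention_mask})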

Results
Energy and task results are presented in Tables 1 and 2, respectively. We find that the submitted systems substantially improve total energy consumption over the baseline systems, by as much as 20× in both the GPU and CPU settings, while trading off less than one point of average task performance. The differences tend to be larger in the CPU setting than in the GPU setting, likely because large, unoptimized pretrained language models were developed to be run on GPUs. The improvements of the submitted systems vary widely between tasks, and do not scale linearly with the size of the test set. On CB and COPA, two of the smallest datasets, the improvements are as much as 50–100× in the GPU setting. On WiC and BoolQ, the improvements are a more modest 10×. Similarly, the improvements do not seem to scale with the size of the inputs, as improvements on the paragraph-input tasks (BoolQ, MultiRC, and ReCoRD) are frequently matched or even dwarfed by improvements on the sentence-level tasks.
Among the systems, we find that the hybrid submissions (1-3, 1-4) consistently consume more power than the GPU-only counterparts (1-1, 1-2). All of the submissions that use a GPU (1-*) substantially outperform those that do not (3-*), which is in large part due to the large test set for ReCoRD. We observe fairly high variance between similar systems (1-1 and 1-2; 1-3 and 1-4; 3-1 and 3-2). In the worst case, systems 3-1 and 3-2 only differ by the MultiRC model, but the energy consumption varies significantly. We attribute this variance to runtime differences in the environment.
Task performances are consistently around 2 absolute points lower for the submitted systems than for the baseline, except on COPA, where the submitted systems outperform the baseline. However, given the large efficiency improvements over the baseline, this tradeoff seems favorable.

Conclusion
We describe the results of the SustaiNLP 2020 Shared Task. The six submissions were able to substantially improve over the baseline systems, obtaining improvements of up to 20× in energy consumption while losing less than a point in performance. To achieve these results, the submissions employed efficiency optimizations at numerous levels, including model architecture, storage, and runtime, which hints at the rich design space for efficient machine learning models.