ALICE: Active Learning with Contrastive Natural Language Explanations

Training a supervised neural network classifier typically requires many annotated training samples. Collecting and annotating a large number of data points is costly and sometimes even infeasible. The traditional annotation process uses a low-bandwidth human-machine communication interface: classification labels, each of which provides only a few bits of information. We propose Active Learning with Contrastive Explanations (ALICE), an expert-in-the-loop training framework that utilizes contrastive natural language explanations to improve data efficiency in learning. ALICE first uses active learning to select the most informative pairs of label classes for which to elicit contrastive natural language explanations from experts. It then extracts knowledge from these explanations using a semantic parser. Finally, it incorporates the extracted knowledge by dynamically changing the learning model's structure. We applied ALICE to two visual recognition tasks, bird species classification and social relationship classification. We found that by incorporating contrastive explanations, our models outperform baseline models trained with 40-100% more training data, and that adding one explanation leads to a performance gain similar to adding 13-30 labeled training data points.


Introduction
The de-facto supervised neural network training paradigm requires a large dataset with annotations. Collecting a large number of data points is time-consuming, difficult, and sometimes even infeasible due to the nature of the task; medical diagnosis is a typical example. In addition, annotating datasets is costly, especially in domains where experts are difficult to recruit.

Figure 1: An example task that would benefit from learning with natural language explanation. The top-left corner shows an example image of a ring-billed gull. Of the other three images (A), (B), (C), which one is not a ring-billed gull but a California gull? Given the natural language explanation "Ring-billed gull has a bill with a black ring near the tip while California gull has a red spot near the tip of lower mandible", it is easier to see that (A) is the correct choice.
In a traditional annotation process, the human-machine communication bandwidth is narrow: each label provides at most log2(C) bits of information per sample for a C-class classification problem. Humans, however, do not rely solely on such low-bandwidth communication to learn. They instead learn through natural language communication, which is grounded in abstract concepts and knowledge. Psychologists and philosophers have long posited natural language explanations as central, organizing elements of human learning and reasoning (Chin-Parker and Cantelon, 2017; Lombrozo, 2006; Smith, 2003). Following this intuition, we explore methods to incorporate natural language explanations into learning paradigms to improve the data efficiency of learning algorithms.
Let us take a bird species classification task as an example to illustrate the advantage of learning with natural language explanation. Figure 1 shows several bird images. Based on visual dissimilarity, many people mistakenly conclude that Image C is not a ring-billed gull because its coat color differs from the example. However, ring-billed gulls change their coat color from light yellow to grey after their first winter, so color is not the deciding factor for distinguishing a California gull from a ring-billed gull. If we receive abstract knowledge from human experts in natural language form, such as "Ring-billed gull has a bill with a black ring near the tip while California gull has a red spot near the tip of lower mandible", and incorporate it into the model, then the model will discover that Image A is a California gull rather than a ring-billed gull based on its bill.
Previous work has shown that incorporating natural language explanations into the classification training loop is effective in various settings (Andreas et al., 2018; Mu et al., 2020). However, previous work neglects the fact that there is usually a limited time budget for interacting with domain experts (e.g., medical experts, biologists) and that high-quality natural language explanations are by nature expensive. Therefore, we focus on eliciting fewer but more informative explanations to reduce expert involvement.
We propose Active Learning with Contrastive Explanations (ALICE), an expert-in-the-loop training framework that utilizes contrastive natural language explanations to improve data efficiency in learning. Although we focus on image classification in this paper, our expert-in-the-loop training framework could be generalized to other classification tasks. ALICE first uses active learning to select the most informative query pairs for eliciting contrastive natural language explanations from experts. It then extracts knowledge from these explanations using a semantic parser. Finally, it incorporates the extracted knowledge by dynamically updating the learning model's structure. Our experiments on bird species classification and social relationship classification show that our method, which incorporates natural language explanations, has better data efficiency than methods that simply increase training sample volume.

Related Work
Learning with Natural Language Explanation Psychologists and philosophers have long posited natural language explanations as central organizing elements of human learning and reasoning (Chin-Parker and Cantelon, 2017). Several attempts have been made to incorporate natural language explanations into supervised classification tasks. Andreas et al. (2018) and Mu et al. (2020) adopt a multi-task setting by learning classification and captioning simultaneously. Murty et al. (2020) and He and Peng (2017) encode natural language explanations as additional features to assist classification. Orthogonal to their approaches, we focus on eliciting fewer but more informative explanations to reduce expert involvement with class-based active learning. Another line of research collects heuristic rules as explanations (e.g., 'honeymoon' for predicting the SPOUSE relationship) to automatically label unlabeled data (Srivastava et al., 2017; Hancock et al., 2018). Different from their settings, we assume no additional training data points. In addition, we leverage natural language explanations by extracting knowledge and incorporating it into classifiers. Distantly related to our work, Hendricks et al. (2016) propose generating explanations for image classifiers, but they do not explore improving the classifiers with the explanations.

Active Learning
The key hypothesis of active learning is that if the learning algorithm is allowed to choose the data from which it learns, it will perform better than it would with randomly selected training samples (Settles, 2009). Existing work in active learning focuses primarily on sampling methods for selecting additional data points to label from a pool of unlabeled data (Sener and Savarese, 2018; Settles, 2009, 2011). Luo and Hauskrecht (2017) propose group-based active learning, where the annotator labels a group of data points at a time rather than a single data point. However, they still rely on classification labels as the interface for human-machine communication. Instead, we focus on incorporating natural language explanations into the classification training framework. Contrastive learning has previously been shown to substantially improve unsupervised learning (Abid et al., 2018), feature learning (Zou et al., 2015), and learning probabilistic models (Zou et al., 2013). However, it has not been applied to the setting of active learning with explanations that we explore here.
Hierarchical Visual Recognition Categorical hierarchy is inherent in visual recognition (Biederman, 1987). Xiao et al. (2014) propose expanding the model based on category hierarchy for incremental learning. Yan et al. (2015) decompose the classification task into a coarse category classification and a fine category classification. Different from previous work, we focus on incorporating contrastive natural language explanations into the model hierarchy to achieve better data efficiency.

Problem Formulation
Contrastive Natural Language Explanations Existing research in social science and cognitive science (Miller, 2019; Mittelstadt et al., 2019) suggests that contrastive explanations are more effective for human learning than descriptive explanations. Therefore, we choose contrastive natural language explanations to benefit our learners. A contrastive explanation answers a question of the form "Why P rather than Q?", in which P is the target event and Q is a counterfactual contrast case that did not occur (Lipton, 1990). In the example in Figure 1, if we ask the expert to differentiate the ring-billed gull from the California gull, the expert would give the following natural language explanation: "Ring-billed gull has a bill with a black ring near the tip while California gull has a red spot near the tip of lower mandible". Our explanations are class-based and are not specifically associated with any particular images.

Problem Setup
We are interested in a C-class classification problem defined over an input space X and a label space Y = {1, ..., C}. Initially, the training set D_train = {(x_i, y_i)}_{i=1}^{N_train} is small, since our setting is low-resource. We also assume that there is a limited budget for asking domain experts to provide explanations during training. Specifically, we consider k rounds of interaction with domain experts, where each round has a query budget b. For each query, we specify two classes y_p, y_q for domain experts to compare. The domain experts return a contrastive natural language explanation e. Each explanation e guides us to focus on the most discriminating semantic segments for differentiating between y_p and y_q. In this paper, a semantic segment refers to either a semantic part of an object (e.g., "bill" in bird species classification) or a semantic object (e.g., "soccer" in social relationship classification).
To make our framework more general, we start from a standard image classification neural architecture. We formulate our initial model as M(φ, g_pool, f) = f(g_pool(φ(x))). Here φ is an image encoder that maps each input image x to an activation map φ(x) ∈ R^{H×W×d}; g_pool is a global pooling layer with g_pool(φ(x)) ∈ R^{d_pool}; and f is a fully connected layer that performs flat C-way classification. This formulation covers most off-the-shelf pre-trained image classifiers.
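For concreteness, the following is a minimal PyTorch sketch of this initial formulation. It is not the paper's released code: the ResNet-50 backbone is a stand-in for the encoder φ (our experiments use Inception v3), and the class count is illustrative.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class FlatClassifier(nn.Module):
    """M(phi, g_pool, f) = f(g_pool(phi(x))): encoder -> global pool -> linear."""
    def __init__(self, num_classes: int, d: int = 2048):
        super().__init__()
        backbone = models.resnet50(weights="DEFAULT")
        # phi: all layers up to (but excluding) the pooling and fc head,
        # so phi(x) is an H x W x d activation map.
        self.phi = nn.Sequential(*list(backbone.children())[:-2])
        self.g_pool = nn.AdaptiveAvgPool2d(1)   # global average pooling
        self.f = nn.Linear(d, num_classes)      # flat C-way classifier

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.phi(x)                  # [B, d, H, W]
        z = self.g_pool(a).flatten(1)    # [B, d] = g_pool(phi(x))
        return self.f(z)                 # [B, C] logits

logits = FlatClassifier(num_classes=25)(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 25])
```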

ALICE: Active Learning with Contrastive Explanations

Overview
ALICE is an expert-in-the-loop training framework that utilizes contrastive natural language explanations to improve data efficiency in learning. ALICE performs multiple rounds of interaction with domain experts and dynamically updates the learning model's structure during each round. Figure 2 describes ALICE's three-step workflow for each round: (A) Class-based Active Learning: ALICE first projects each class's training data into a shared feature space, then selects the b most confusing class pairs and queries domain experts for explanations. (B) Semantic Explanation Grounding: ALICE extracts knowledge from the b contrastive natural language explanations by semantic parsing, and grounds the extracted knowledge on the training data of the b class pairs by cropping the corresponding semantic segments. (C) Neural Architecture Morphing: ALICE allocates b new local classifiers and merges the b class pairs in the global classifier. The cropped image patches are used as additional training data for the newly added local classifiers to emphasize these patches' importance. The model is re-trained after each round.
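The per-round workflow can be summarized by the following sketch. The step functions are assumptions standing in for the components described in the following subsections (their names are hypothetical); the sketch only fixes the control flow of the k rounds.

```python
from typing import Callable, List, Tuple

def alice_training_loop(
    model,
    train_set,
    select_pairs: Callable[..., List[Tuple[int, int]]],  # step (A)
    query_expert: Callable[[int, int], str],              # asks the expert
    parse_segments: Callable[[str], List[str]],           # step (B): parsing
    ground: Callable[..., list],                          # step (B): cropping
    morph: Callable[..., object],                         # step (C)
    retrain: Callable[..., object],
    k: int = 4,
    b: int = 3,
):
    """High-level sketch of ALICE's k rounds of expert interaction."""
    for _ in range(k):
        pairs = select_pairs(model, train_set, b)      # (A) active learning
        patches = []
        for y_p, y_q in pairs:                         # (B) grounding
            explanation = query_expert(y_p, y_q)
            segments = parse_segments(explanation)
            patches += ground(train_set, (y_p, y_q), segments)
        model = morph(model, pairs)                    # (C) morphing
        model = retrain(model, train_set, patches)     # re-train each round
    return model
```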

Class-based Active Learning
ALICE optimizes toward requesting the most informative explanations to reduce expert involvement. Since each explanation provides knowledge for distinguishing one class pair, we aim to identify the class pairs that confuse the model most; explanations for these pairs should intuitively help the model the most. ALICE identifies confusing class pairs by first projecting each class's training data into a shared feature space g_pool(φ(x)). As shown in Figure 2 (A), if the training data of two classes lie close together in the feature space, the model usually has difficulty distinguishing them, so it would be helpful to solicit an explanation for this class pair. Based on this intuition, we first define a distance between two classes and then select the class pairs with the lowest distance. We profile each class j by fitting a multivariate Gaussian distribution N_j(μ_j, Σ_j) on class j's training sample features, and define the distance between class j and class k as the Jensen-Shannon divergence (JSD) between N_j and N_k.
After calculating the distance between all possible class pairs, we select the b class pairs with the lowest JSD distance to query domain experts.
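A sketch of this pair-selection step follows. Since the JSD between two Gaussians has no closed form, the sketch uses a Monte Carlo estimate; the paper does not specify its estimator, so this choice, the ridge regularization, and the function names are assumptions.

```python
import numpy as np
from itertools import combinations
from scipy.stats import multivariate_normal

def fit_class_gaussian(feats: np.ndarray):
    """Fit N_j(mu_j, Sigma_j) on one class's pooled features [n_j, d].
    A ridge term keeps Sigma well-conditioned with few samples; in practice
    the features may first be reduced (e.g., PCA) for the same reason."""
    mu = feats.mean(axis=0)
    sigma = np.cov(feats, rowvar=False) + 1e-3 * np.eye(feats.shape[1])
    return multivariate_normal(mu, sigma, allow_singular=True)

def jsd(p, q, n_samples: int = 4096, seed: int = 0) -> float:
    """Monte Carlo estimate of the Jensen-Shannon divergence."""
    xp = p.rvs(size=n_samples, random_state=seed)
    xq = q.rvs(size=n_samples, random_state=seed + 1)
    # log of the mixture density M = (P + Q) / 2
    log_m = lambda x: np.logaddexp(p.logpdf(x), q.logpdf(x)) - np.log(2)
    return 0.5 * (np.mean(p.logpdf(xp) - log_m(xp)) +
                  np.mean(q.logpdf(xq) - log_m(xq)))

def most_confusing_pairs(class_feats: dict, b: int):
    """class_feats: {class_id: [n_i, d] array of g_pool(phi(x)) features}."""
    gaussians = {c: fit_class_gaussian(f) for c, f in class_feats.items()}
    dist = {(j, k): jsd(gaussians[j], gaussians[k])
            for j, k in combinations(sorted(gaussians), 2)}
    return sorted(dist, key=dist.get)[:b]  # lowest JSD = most confusable
```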

Semantic Explanation Grounding
After identifying the b class pairs that the model is most confused about, we send b queries to domain experts. For each query, we ask the expert: "How would you differentiate class P and class Q?". Since we want the expert to provide general class-level knowledge, each query contains only text, and no visual examples are provided to the experts. We obtain b contrastive natural language explanations from the queries. Next, we parse the natural language explanations into a machine-understandable form. We choose a simple rule-based semantic parser for simplicity, following Hancock et al. (2018). This parser can be used without any additional training and requires minimal effort to develop. Formally, the parser uses a set of rules of the form α → β, which means that α can be replaced by the token(s) in β. Our rules focus primarily on identifying the discriminating semantic segments (§3) mentioned in the explanations (e.g., "bill" for differentiating between the ring-billed gull and the California gull). We also allow the parser to skip unexpected tokens so that it always succeeds in generating a valid output.
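Below is a minimal sketch of such a parser. The lexicon of α → β rules is illustrative, not our actual grammar (which is built on SippyCup; see Implementation); the skip behavior falls out of simply ignoring unmatched tokens.

```python
import re

# Illustrative alpha -> beta rules: surface phrases rewrite to canonical
# semantic segment names. Unmatched tokens are skipped, so parsing always
# succeeds with a (possibly empty) list of segments.
RULES = {
    "lower mandible": "bill", "beak": "bill", "bill": "bill",
    "wingbars": "wing", "wing": "wing",
    "eye-ring": "eye", "eye": "eye",
    "crown": "crown", "breast": "breast",
}

def parse_segments(explanation: str) -> list[str]:
    text = explanation.lower()
    found = []
    # Match multi-word phrases first so "lower mandible" wins over "mandible".
    for phrase in sorted(RULES, key=len, reverse=True):
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            seg = RULES[phrase]
            if seg not in found:
                found.append(seg)
    return found

print(parse_segments("Ring-billed gull has a bill with a black ring near the "
                     "tip while California gull has a red spot near the tip "
                     "of lower mandible"))  # -> ['bill']
```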
Since each explanation e provides class-level knowledge for distinguishing classes y_p, y_q, we need to propagate the knowledge to all the training data points in classes y_p, y_q so that the learning model can incorporate the knowledge during training. We denote the semantic segments mentioned in an explanation e as S = {s_1, s_2, ...}. For each training data point of classes y_p, y_q, we apply off-the-shelf semantic segment localization models to crop out the image patches of the semantic segments mentioned in S (Figure 2 (B)). The number of patches cropped from each image equals the number of mentioned semantic segments (i.e., |S|). We then resize the image patches to full resolution. The intuition behind our crop-and-resize approach comes from the popular image-crop data augmentation, which augments the training data with "sampling of various sized patches of the image whose size is distributed evenly between 8% and 100% of the image area" (Szegedy et al., 2015). This data augmentation technique is widely adopted and is supported by common deep learning frameworks such as PyTorch (torchvision.transforms.RandomResizedCrop).
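A sketch of the crop-and-resize step, assuming a localizer has already produced a bounding box per mentioned segment (via an off-the-shelf model or part-location annotations); the box format and function names are assumptions.

```python
from PIL import Image

def crop_and_resize(image_path: str, boxes: dict, segments: list,
                    out_size=(299, 299)) -> list:
    """Crop the patch of each mentioned semantic segment and resize it to
    full input resolution; the patches become extra training samples for
    the corresponding local classifier.

    boxes: {segment_name: (left, top, right, bottom)}, assumed to come from
    an off-the-shelf localizer or part-location annotations.
    """
    image = Image.open(image_path).convert("RGB")
    patches = []
    for seg in segments:
        if seg not in boxes:      # the localizer may miss a segment; skip it
            continue
        patch = image.crop(boxes[seg]).resize(out_size, Image.BILINEAR)
        patches.append(patch)
    return patches                # one patch per localized segment in S

# e.g.: crop_and_resize("gull.jpg", {"bill": (40, 60, 120, 110)}, ["bill"])
```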
ALICE does not need the localization model during testing (more details in §4.4). The off-the-shelf semantic segment localization models can be localization models pre-trained on large-scale datasets such as Visual Genome (Krishna et al., 2017) and PASCAL-Part (Chen et al., 2014). If no off-the-shelf localization model is available, we can recruit non-expert annotators to annotate the locations of the semantic segments, given that our training set D_train is small.

Neural Architecture Morphing
Overview ALICE incorporates contrastive natural language explanations by dynamically updating the learning model's structure. The high-level idea is to allocate a number of local classifiers, guided by the explanations, to assist the original model. Specifically, for each explanation e that provides knowledge for distinguishing two classes y_p, y_q, we allocate a local classifier dedicated to the binary classification between y_p and y_q. We incorporate the knowledge extracted from explanation e into the local classifier so that it learns to focus on the discriminating semantic segments pointed out by the domain experts. We first discuss the case where all local classifiers perform binary classification and then discuss how to extend them to support general m-ary classification.
Progressive Architecture Update The initial flat C-way classification architecture can be viewed as a composition of an image encoder φ and a global classifier f ∘ g_pool. We now discuss how local classifiers are progressively added to assist the global classifier. As shown in Figure 2 (C), we first merge the b class pairs into b super-classes in the global classifier. For example, in the first round, the global classifier changes from C-way to (C − 2b + b) = (C − b)-way. We then allocate b new local classifiers, each performing binary classification for one class pair. Each local classifier is only called when the global classifier predicts its super-class as the most confident. We leave more complex conditional execution schemes as future work; we also note that conditional execution schemes have the potential to reduce computation at runtime (Chen et al., 2020; Mailthody et al., 2019). During training, we fine-tune the image encoder φ and reset the global classifier after each round, since it is only a linear layer.
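The sketch below illustrates the morphed architecture with this simple conditional execution. The label bookkeeping is simplified: `members[s]` lists the original class ids inside super-class s, singleton super-classes keep the global decision, and the local heads are plain linear layers for brevity (the default attention-based design is described under Local Classifier Design).

```python
import torch
import torch.nn as nn

class HierarchicalClassifier(nn.Module):
    """Global classifier over super-classes plus local classifiers that are
    dispatched only when their super-class wins (inference-time sketch)."""
    def __init__(self, phi: nn.Module, d: int, members: dict,
                 local_heads: nn.ModuleDict):
        super().__init__()
        self.phi = phi
        self.g_pool = nn.AdaptiveAvgPool2d(1)
        self.global_f = nn.Linear(d, len(members))  # (C - b)-way after round 1
        self.members = members          # super-class id -> original class ids
        self.locals = local_heads       # str(super-class id) -> local head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.g_pool(self.phi(x)).flatten(1)      # shared features
        super_pred = self.global_f(z).argmax(dim=1)  # pick a super-class
        out = []
        for i, s in enumerate(super_pred.tolist()):
            if len(self.members[s]) == 1:            # untouched class
                out.append(self.members[s][0])
            else:                                    # merged: call local head
                j = self.locals[str(s)](z[i:i + 1]).argmax(dim=1).item()
                out.append(self.members[s][j])
        return torch.tensor(out)
```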

Knowledge Grounded Training
The global classifier is trained on D_train, with labels adjusted according to the class-pair merging. For a local classifier corresponding to the class pair y_p, y_q, the training data consist of two parts: the training data points of classes y_p, y_q in D_train, and the resized image patches of classes y_p, y_q obtained in semantic explanation grounding (§4.3). We use the resized image patches as additional training data to emphasize these patches' importance. Take the local classifier distinguishing the ring-billed gull and the California gull as an example (Figure 2 (B, C)). This local classifier is trained on the training images of the ring-billed gull and the California gull, as well as the bill patches cropped from each of those training images. During testing, we only feed the whole image into the model.
Supporting m-ary local classifiers So far we have assumed that each local classifier is a binary classifier, which implicitly assumes that the b class pairs have no overlap. We support overlapping class pairs as follows. If some class pairs overlap (e.g., class pairs (P, Q), (P, T), and (T, U)), we allocate only one local classifier for them (e.g., a 4-ary local classifier for classes (P, Q, T, U)) and merge all the involved classes in the global classifier into a single super-class (e.g., super-class {P, Q, T, U}). The local classifier is trained on the union of the overlapping class pairs' training data, including patches. A small sketch of this merging logic follows.
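The merging amounts to grouping queried pairs into connected components, e.g., with union-find; each component becomes one super-class served by one m-ary local classifier.

```python
def merge_overlapping_pairs(pairs):
    """Group queried class pairs that share a class into one super-class
    (connected components via union-find), so overlapping pairs get a single
    m-ary local classifier instead of several binary ones."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = {}
    for c in list(parent):
        groups.setdefault(find(c), set()).add(c)
    return list(groups.values())

# (P,Q), (P,T), (T,U) collapse into one 4-ary super-class:
print(merge_overlapping_pairs([("P", "Q"), ("P", "T"), ("T", "U")]))
# -> [{'P', 'Q', 'T', 'U'}]
```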
Figure 3: Local classifiers with shared attention mechanism

Local Classifier Design Our framework is agnostic to the design of the local classifiers: any design can be plugged into ALICE. We provide a default design as follows (Figure 3). Ideally, each local classifier should learn which semantic segments to focus on and how to detect them. Since different local classifiers might need to detect the same semantic segments (e.g., the bill), the knowledge of detecting semantic segments can be shared among all local classifiers. Therefore, we introduce a shared attention mechanism, parameterized by M learnable latent attention queries q_1, q_2, ..., q_M ∈ R^d that represent M different latent semantic segments. To keep our design general, we do not bind the latent attention queries to any concrete semantic segments (e.g., we do not bind q_1 to "bill"); the queries are trained in a weakly supervised manner. Following Lin et al. (2015) and Hu and Qi (2019), we view the activation map φ(x) ∈ R^{H×W×d} of each image x as H×W attention keys k_1, ..., k_{H×W} ∈ R^d. Stacking the queries into Q ∈ R^{M×d} and the keys into K ∈ R^{HW×d}, we compute the attention output as A = softmax(QK^T / √d) K. Each row of the attention output matrix A ∈ R^{M×d} is the attention output for one attention query q_i, i.e., a descriptor of the i-th latent semantic segment. After the shared attention mechanism, each local classifier applies a private fully connected layer to flattened(A) to make predictions. A local classifier can ignore irrelevant semantic segments simply by setting the corresponding weights of its fully connected layer to zero.
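A sketch of this default design follows, instantiating the shared attention as standard scaled dot-product attention consistent with the description above; the number of latent queries M and the initialization are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedAttention(nn.Module):
    """M learnable latent queries attended over the H*W activation-map keys;
    a single instance is shared by all local classifiers."""
    def __init__(self, d: int, num_queries: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d) / d ** 0.5)

    def forward(self, act_map: torch.Tensor) -> torch.Tensor:
        # act_map: [B, d, H, W] -> keys: [B, H*W, d]
        keys = act_map.flatten(2).transpose(1, 2)
        d = act_map.shape[1]
        # A = softmax(Q K^T / sqrt(d)) K, one output row per latent query
        attn = F.softmax(self.queries @ keys.transpose(1, 2) / d ** 0.5, dim=-1)
        return attn @ keys                      # A: [B, M, d]

class LocalClassifier(nn.Module):
    """Private head on flattened(A); irrelevant latent segments can be
    ignored by learning zero weights for their rows."""
    def __init__(self, shared_attn: SharedAttention, d: int, m_way: int):
        super().__init__()
        self.attn = shared_attn                 # shared across all local heads
        self.fc = nn.Linear(self.attn.queries.shape[0] * d, m_way)

    def forward(self, act_map: torch.Tensor) -> torch.Tensor:
        return self.fc(self.attn(act_map).flatten(1))

shared = SharedAttention(d=2048)
head = LocalClassifier(shared, d=2048, m_way=2)
print(head(torch.randn(2, 2048, 7, 7)).shape)   # torch.Size([2, 2])
```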
Implementation Our image encoder φ can be any off-the-shelf visual backbone model; we use Inception v3 (Szegedy et al., 2016). We implement our semantic parser on top of the Python-based SippyCup (Liang and Potts, 2015), following Hancock et al. (2018). Our framework can support applications in other languages by swapping in a semantic parser for the corresponding language. We provide more details in the Appendix.

Bird Species Classification Task
Dataset We use the CUB-200-2011 dataset (Wah et al., 2011), which contains 11,788 images of 200 species of North American birds. We randomly sample 25 bird species due to our limited expert query budget. Following Vedantam et al. (2017), we make sure that each sampled species has one or more confusing species from the same subfamilia, so that the species are challenging to classify. In addition, each image in the CUB dataset is annotated with the locations of 15 semantic segments (e.g., "bill", "eye"). We use these location annotations to crop training image patches based on the explanations; we do not use any location annotation during testing. More details, including the list of the 25 sampled species, are provided in the Appendix. We experiment with a low-resource setting with only 15 images per bird species. We employ an amateur bird watcher as the domain expert, since we do not expect general MTurk workers to have enough domain expertise. To further ensure annotation quality, our domain expert consults a professional birding field guide before writing each explanation. We ask the expert, "How would you differentiate bird species P and bird species Q?". In total, we collect 67 contrastive natural language explanations (avg. length 18.45 words). We collect the explanations on demand because our class-based active learning is empirically insensitive to changes of random seeds and hyper-parameters. Our semantic parser identifies 2.36 semantic segments per explanation on average. In each experiment, we conduct k = 4 rounds of expert queries, with a query budget of b = 3 per round.

Figure 5: Comparing the performance gain of adding contrastive natural language explanations and adding training data points on bird species prediction accuracy. Empirically, adding 1 explanation leads to a similar performance gain as adding 30 labeled training data points.

Discussion on CUB Description Dataset
The CUB description dataset collects descriptions of the visual appearance of each image rather than explanations of why the bird in the image belongs to a certain class (Reed et al., 2016; Hendricks et al., 2016). For example, an image of a ring-billed gull has the description: "This is a white bird with a grey wing and orange eyes and beak." However, this description also fits a California gull perfectly (Figure 1). The crowd-sourced descriptions in the CUB description dataset are therefore not ideal for supporting classification. Instead, we collect expert explanations such as "Ring-billed gull has a bill with a black ring near the tip while California gull has a red spot near the tip of lower mandible." to improve classification data efficiency. We also conducted experiments incorporating the CUB descriptions (5 sentences per image), but did not find improved performance in our setting.

Model Ablations and Metrics
We compare ALICE to several of its ablations (Table 2) and evaluate performance on the test set. We report classification accuracy on species as well as subfamilia. For subfamilia accuracy, a prediction is counted as correct as long as the predicted species' subfamilia is the same as the labeled species' subfamilia. (1) Base (Inception v3) fine-tunes the pre-trained Inception v3 to perform flat 25-way classification. (2) ALICE w/o Grounding copies the final neural architecture of ALICE but is trained without the grounded image patches. (3) ALICE w/o Hierarchy has the same neural architecture as (1) but has access to the discriminating semantic segments. (4) ALICE w/ Random Grounding uses semantic segments that are randomly sampled. (5) ALICE w/ Random Pairs replaces class-based active learning with randomly selected class pairs; the randomly selected class pairs are used to query experts and to change the learning model's neural architecture. (9-12) ALICE (i-th round) shows ALICE's performance after the i-th round of expert queries.

Results
Our first takeaway is that incorporating contrastive natural language explanations is more data-efficient than adding extra training data points. Figure 5 visualizes the performance gain of adding explanations versus adding data points ((6-12) in Table 2). As shown in Figure 5, adding 1 explanation leads to roughly the same performance gain as adding 30 labeled data points. For example, adding 12 explanations (ALICE (4th round), 76.05%) achieves a performance gain comparable to adding 375 training images (RandomSampling + 100% extra data, 75.91%). We note that writing one explanation is typically faster for an expert than labeling 15-30 examples. As an estimate, Hancock et al. (2018) and Zaidan and Eisner (2008) performed user studies and found that collecting natural language explanations is only about twice as costly as collecting labels for their tasks. Our experiment shows that adding 1 explanation leads to a performance gain similar to adding 30 labeled training data points, yielding a 6× speedup. Our second takeaway is that both the grounding and the hierarchy are important to ALICE's performance (Table 2).
Visualization We show how the explanations help the learning model in Figure 4. We visualize the saliency maps (Simonyan et al., 2014) corresponding to the correct class on four example images. As shown in Figure 4, the base model does not know which semantic segments to focus on and makes wrong predictions. In contrast, ALICE's local classifiers obtain knowledge from the expert explanations and successfully learn to focus on the discriminating semantic segments, making the correct predictions.
Social Relationship Classification Task

Dataset We also evaluate ALICE on the People in Photo Album (PIPA) Relation dataset; an example is shown in Figure 6. The dataset was originally collected from Flickr photo albums and covers 5 social domains and 16 social relations. We focus on images that contain exactly two people, since handling more than two people requires a task-specific neural architecture. The details of dataset pre-processing are included in the Appendix. After pre-processing, we obtain 1,679 training images and 802 testing images. We experiment with a low-resource setting using 15% of the remaining training images (i.e., 264 images). We obtain explanations by converting a previously collected knowledge graph into a parsed format. The semantic segments here are contextual objects such as soccer. The knowledge graph contains heuristics for distinguishing social relations by the occurrence of contextual objects (e.g., "soccer" for sports vs. colleagues).
We use a Faster-RCNN-based object detector (Ren et al., 2015) trained on MS COCO (Lin et al., 2014) to localize the semantic segments (contextual objects) during training. The object detector is not used during testing. We set the number of expert query rounds to k = 2 and the query budget to b = 4 per round.

Results
We compare ALICE to several of its ablations (Table 4) and evaluate performance on the test set. We report classification accuracy on social relationships as well as social domains. We observe benefits of incorporating explanations into ALICE similar to those in the bird species classification task. As shown in Table 4, the base model with 40% extra training data (i.e., 105 images) still slightly underperforms ALICE with 8 explanations (RandomSampling + 40% extra data, 36.28% vs. ALICE (2nd round), 36.41%). As shown in Figure 7, adding 1 explanation leads to a performance gain similar to adding 13 labeled training data points. Our ablation experiment also confirms the importance of class-based active learning: if we replace class-based active learning with a random selection of class pairs, ALICE learns a bad model structure that leads to reduced performance (ALICE w/ Random Pairs, 22.94%). The drop in domain accuracy is also significant. We suspect this is because the bad model structure confuses the global classifier: when the global classifier calls a wrong local classifier, that local classifier is forced to make a prediction on out-of-distribution data. In addition, our ablation experiment verifies the importance of the expert knowledge beyond merely having the localization model: substituting the discriminating semantic segments' image patches with other segments' patches leads to worse performance (ALICE w/ Random Grounding, 27.20%).
One reason is that there are many objects in each image. Under our low-resource setting, learning on the image patches of random semantic segments may make the model latch on to sample-specific artifacts in the training images, which leads to poor generalization.

Conclusion
We propose ALICE, an expert-in-the-loop training framework that utilizes contrastive natural language explanations to improve a learning algorithm's data efficiency. We extend the concept of active learning to class-based active learning for choosing the most informative query pairs. We incorporate the knowledge extracted from expert natural language explanations by changing our algorithm's neural network structure. Our experiments on two visual recognition tasks show that incorporating natural language explanations is far more data-efficient than adding extra training data. In the future, we plan to examine the hierarchical classification architecture's potential for reducing computational runtime.