Shaping Visual Representations with Language for Few-Shot Classification

By describing the features and abstractions of our world, language is a crucial tool for human learning and a promising source of supervision for machine learning models. We use language to improve few-shot visual classification in the underexplored scenario where natural language task descriptions are available during training, but unavailable for novel tasks at test time. Existing models for this setting sample new descriptions at test time and use those to classify images. Instead, we propose language-shaped learning (LSL), an end-to-end model that regularizes visual representations to predict language. LSL is conceptually simpler, more data efficient, and outperforms baselines in two challenging few-shot domains.


Introduction
Humans are powerful and efficient learners partially due to the ability to learn from language (Chopra et al., 2019; Tomasello, 1999). For instance, we can learn about robins not by seeing thousands of examples, but by being told that a robin is a bird with a red belly and brown feathers. This language further shapes the way we view the world, constraining our hypotheses for new concepts: given a new bird (e.g., a seagull), even without language we know that features like belly and feather color are relevant (Goodman, 1955).
In this paper, we guide visual representation learning with language, studying the setting where no language is available at test time, since rich linguistic supervision is often unavailable for new concepts encountered in the wild. How can one best use language in this setting? One option is to just regularize, training representations to predict language descriptions. Another is to exploit the compositional nature of language directly by using it as a bottleneck in a discrete latent variable model.

Figure 1: We propose few-shot classification models whose learned representations are constrained to predict natural language task descriptions during training (auxiliary training, discarded at test), in contrast to models which explicitly use language as a bottleneck for classification (Andreas et al., 2018).
For example, the recent Learning with Latent Language (L3; Andreas et al., 2018) model does both: during training, language is used to classify images; at test time, with no language, descriptions are sampled from a decoder conditioned on the language-shaped image embeddings. Whether the bottleneck or regularization most benefits models like L3 is unclear. We disentangle these effects and propose language-shaped learning (LSL), an end-to-end model that uses visual representations shaped by language (Figure 1), thus avoiding the bottleneck. We find that discrete bottlenecks can hurt performance, especially with limited language data; in contrast, LSL is architecturally simpler, faster, uses language more efficiently, and outperforms L3 and baselines across two few-shot transfer tasks.

Related Work
Language has been shown to assist visual classification in various settings, including traditional visual classification with no transfer (He and Peng, 2017) and with language available at test time in the form of class labels or descriptions for zero-shot (Frome et al., 2013; Socher et al., 2013) or few-shot (Xing et al., 2019) learning. Unlike past work, we have no language at test time and test tasks differ from training tasks, so language from training cannot be used as additional class information (cf. He and Peng, 2017) or weak supervision for labeling additional in-domain data (cf. Hancock et al., 2018). Our setting can be viewed as an instance of learning using privileged information (LUPI; Vapnik and Vashist, 2009), where richer supervision augments a model only during training.
In this framework, learning with attributes and other domain-specific rationales has been tackled extensively (Zaidan et al., 2007; Donahue and Grauman, 2011; Tokmakov et al., 2019); language less so. Gordo and Larlus (2017) use METEOR scores between captions as a similarity measure for specializing embeddings for image retrieval, but do not directly ground language explanations. Srivastava et al. (2017) explore a supervision setting similar to ours, except in simple text and symbolic domains where descriptions can be easily converted to executable logical forms via semantic parsing.
Another line of work studies the generation of natural language explanations for interpretability across language (e.g., entailment; Camburu et al., 2018) and vision (Hendricks et al., 2016, 2018) tasks, but here we examine whether predicting language can actually improve task performance; similar ideas have been explored in text (Rajani et al., 2019) and reinforcement learning (Bahdanau et al., 2019; Goyal et al., 2019) domains.

Language-shaped learning
We are interested in settings where language explanations can help learn representations that generalize more efficiently across tasks, especially when training data for each task is scarce and there are many spurious hypotheses consistent with the input. Thus, we study the few-shot (meta-)learning setting, where a model must learn from a set of train tasks, each with limited data, and then generalize to unseen tasks in the same domain.
Specifically, in N-way, K-shot learning, a task t consists of N support classes $\{S^{(t)}_n\}_{n=1}^{N}$, each containing K examples. Given a query example $x^{(t)}_m$ as input, the goal is to predict its class $y^{(t)}_m \in \{1, \dots, N\}$. After learning from a set of tasks $\mathcal{T}_{\text{train}}$, a model is evaluated on unseen tasks $\mathcal{T}_{\text{test}}$.
While the language approach we propose is applicable to nearly any meta-learning framework, we use prototype networks (Snell et al., 2017), which have a simple but powerful inductive bias for few-shot learning. Prototype networks learn an embedding function f_θ for examples; the embeddings of the support examples of a class n are averaged to form a class prototype (omitting the task index (t) for clarity):

$$c_n = \frac{1}{K} \sum_{x \in S_n} f_\theta(x). \tag{1}$$

Given a query example (x_m, y_m), we predict class n with probability proportional to some similarity function s between c_n and f_θ(x_m):

$$p(\hat{y}_m = n \mid x_m) = \frac{\exp\left(s(c_n, f_\theta(x_m))\right)}{\sum_{n'=1}^{N} \exp\left(s(c_{n'}, f_\theta(x_m))\right)}. \tag{2}$$

f_θ is then trained to minimize the classification loss

$$\mathcal{L}_{\text{CLS}} = -\sum_{m} \log p(\hat{y}_m = y_m \mid x_m). \tag{3}$$
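As a concrete illustration, the prototype computation and classification loss can be sketched in PyTorch. This is a minimal sketch, not the authors' code: function names are ours, and the similarity function is left as a pluggable argument.

```python
import torch
import torch.nn.functional as F

def prototypes(f_theta, support):
    """Average embeddings of each class's K support examples.
    support: (N, K, *input_shape) -> prototypes: (N, D)."""
    N, K = support.shape[:2]
    emb = f_theta(support.reshape(N * K, *support.shape[2:]))  # (N*K, D)
    return emb.reshape(N, K, -1).mean(dim=1)

def classification_loss(f_theta, support, query_x, query_y, s):
    """Cross-entropy over similarities s(f_theta(x_m), c_n), i.e. Eqs. 2-3."""
    c = prototypes(f_theta, support)   # (N, D) class prototypes
    q = f_theta(query_x)               # (M, D) query embeddings
    logits = s(q, c)                   # (M, N) similarity logits
    return F.cross_entropy(logits, query_y)
```

A dot product for `s` (i.e. `lambda q, c: q @ c.t()`) recovers the standard prototype-network classifier.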

Shaping with language
Now assume that during training we have for each class S_n a set of J_n associated natural language descriptions W_n = {w_1, . . . , w_{J_n}}. Each w_j should explain the relevant features of S_n and need not be associated with individual examples. In Figure 1, we have one description w_1 = (A, red, . . . , square). Our approach is simple: we encourage f_θ to learn prototypes that can also decode the class language descriptions. Let c̃_n be the prototype formed by averaging the support and query examples of class n. Then define a language model g_φ (e.g., a recurrent neural network) which, conditioned on c̃_n, provides a probability distribution over descriptions g_φ(ŵ_j | c̃_n), with a corresponding natural language loss:

$$\mathcal{L}_{\text{NL}} = -\sum_{n=1}^{N} \sum_{j=1}^{J_n} \log g_\phi(w_j \mid \tilde{c}_n), \tag{4}$$

i.e. the total negative log-likelihood of the class descriptions across all classes in the task. Since L_NL depends on the parameters θ through the prototype c̃_n, this objective should encourage our model to better represent the features expressed in language. We now jointly minimize both losses:

$$\mathcal{L}(\theta, \phi) = \mathcal{L}_{\text{CLS}} + \lambda_{\text{NL}} \, \mathcal{L}_{\text{NL}}, \tag{5}$$

where the hyperparameter λ_NL controls the weight of the natural language loss. At test time, we simply discard g_φ and use f_θ to classify. We call our approach language-shaped learning (LSL; Figure 1).
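A minimal PyTorch sketch of this joint objective, assuming a GRU decoder g_φ whose hidden state is initialized from the prototype (all sizes and names here are illustrative, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageDecoder(nn.Module):
    """g_phi: a GRU that decodes a description conditioned on a prototype.
    Sizes are hypothetical; the paper uses 512-d (ShapeWorld) / 200-d (Birds) GRUs."""
    def __init__(self, vocab_size, emb_dim=8, hidden_dim=16, proto_dim=12):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.init_h = nn.Linear(proto_dim, hidden_dim)  # prototype -> initial hidden state
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def nll(self, protos, tokens):
        """-log g_phi(w | c~): teacher-forced NLL of token sequences (B, T)."""
        h0 = self.init_h(protos).unsqueeze(0)   # (1, B, H)
        inp = self.embed(tokens[:, :-1])        # condition on previous tokens
        hidden, _ = self.gru(inp, h0)
        logits = self.out(hidden)               # (B, T-1, V)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tokens[:, 1:].reshape(-1), reduction="sum")

def lsl_loss(cls_loss, decoder, protos, descriptions, lambda_nl):
    """Joint objective in the style of Eq. 5: L_CLS + lambda_NL * L_NL."""
    return cls_loss + lambda_nl * decoder.nll(protos, descriptions)
```

At test time only the embedding network is kept; the decoder exists purely to shape the prototypes during training.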

Relation to L3
L3 (Andreas et al., 2018) has the same basic components as LSL, but instead defines the concepts c_n to be embeddings of the language descriptions themselves, generated by an additional recurrent neural network (RNN) encoder h_η: c_n = h_η(w_n). During training, the ground-truth description is used for classification, while g_φ is trained to produce the description; at test time, L3 samples candidate descriptions ŵ_n from g_φ, keeping the description most similar to the images in the support set according to the similarity function s (Figure 1). Compared to L3, LSL is simpler since it (1) does not require the additional embedding module h_η and (2) does not need the test-time language sampling procedure. This also makes LSL much faster to run than L3 in practice: without the language machinery, LSL is up to 50x faster during inference in our experiments.
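L3's test-time procedure can be sketched as follows. The interfaces are hypothetical: `decoder_sample` stands in for sampling a candidate description from g_φ, and `h_eta` for the description encoder; the real model scores candidates with its learned similarity, approximated here by a dot product.

```python
import torch

def l3_infer_concept(decoder_sample, h_eta, support_emb, num_samples=10):
    """Sample candidate descriptions, embed each with h_eta, and keep the
    one whose embedding best matches the support set on average."""
    best, best_score = None, float("-inf")
    for _ in range(num_samples):
        w_hat = decoder_sample()                  # candidate description tokens
        c = h_eta(w_hat)                          # (D,) concept embedding
        score = (support_emb @ c).mean().item()   # mean similarity to support
        if score > best_score:
            best, best_score = c, score
    return best
```

LSL skips this loop entirely, which is where its inference speedup over L3 comes from.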

Experiments
Here we describe our two tasks and models. For each task, we evaluate LSL, L3, and a prototype network baseline trained without language (Meta; Figure 1). For full details, see Appendix A.
ShapeWorld. First, we use the ShapeWorld (Kuhnle and Copestake, 2017) dataset used by Andreas et al. (2018), which consists of 9000 training, 1000 validation, and 4000 test tasks (Figure 2). Each task contains a single support set of K = 4 images representing a visual concept with an associated (artificial) English language description, generated with a minimal recursion semantics representation of the concept (Copestake et al., 2016). Each concept is a spatial relation between two objects, each object optionally qualified by color and/or shape, with 2-3 distractor shapes present. The task is to predict whether a query image x belongs to the concept.
For ease of comparison, we report results with models identical to Andreas et al. (2018), where f_θ is the final convolutional layer of a fixed ImageNet-pretrained VGG-16 (Simonyan and Zisserman, 2015) fed through two fully-connected layers:

$$f_\theta(x) = \text{FC}(\text{ReLU}(\text{FC}(\text{VGG-16}(x)))). \tag{6}$$

However, because fixed ImageNet representations may not be the most appropriate choice for artificial data, we also run experiments with convolutional networks trained from scratch: either the 4-layer convolutional backbone used in much of the few-shot literature, as used in the Birds experiments we describe next, or a deeper ResNet-18 (He et al., 2016). This is a special binary case of the few-shot learning framework, with a single positive support class S and prototype c. Thus, we define the similarity function to be the sigmoid function s(a, b) = σ(a · b) and the positive prediction P(ŷ = 1 | x) = s(f_θ(x), c). g_φ is a 512-dimensional gated recurrent unit (GRU) RNN (Cho et al., 2014) trained with teacher forcing. Through a grid search on the validation set, we set λ_NL = 20.
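A sketch of the resulting binary episode loss, assuming a single positive support set and the sigmoid dot-product similarity above (the sigmoid is applied implicitly inside the BCE-with-logits loss; names are ours):

```python
import torch
import torch.nn.functional as F

def shapeworld_binary_loss(f_theta, support, query_x, query_y):
    """Binary special case: one positive support set S, one prototype c.
    P(y_hat = 1 | x) = sigmoid(f_theta(x) . c), trained with binary cross-entropy."""
    c = f_theta(support).mean(dim=0)   # (D,) single positive prototype
    logits = f_theta(query_x) @ c      # (M,) pre-sigmoid scores
    return F.binary_cross_entropy_with_logits(logits, query_y.float())
```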
Birds. To see if LSL can scale to more realistic scenarios, we use the Caltech-UCSD Birds dataset (Wah et al., 2011), which contains 200 bird species, each with 40-60 images, split into 100 train, 50 validation, and 50 test classes. During training, tasks are sampled dynamically by selecting N classes from the 100 train classes; K support and 16 query examples are then sampled from each class (similarly for val and test). For language, we use the descriptions collected by Reed et al. (2016), where AMT crowdworkers were asked to describe individual images of birds in detail, without reference to the species (Figure 2). While 10 English descriptions per image are available, we assume a more realistic scenario where we have much less language, available only at the class level: removing associations between images and their descriptions, we aggregate D descriptions for each class, and for each K-shot training task we sample K descriptions from each class n to use as descriptions W_n. This makes learning especially challenging for LSL due to noise from captions that describe features only applicable to individual images. Despite this, we found improvements with as few as D = 20 descriptions per class, which we report as our main results, but also vary D to see how efficiently the models use language.
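The class-level description sampling described above can be sketched with the standard library alone (function and variable names are ours):

```python
import random

def sample_class_descriptions(class_descriptions, K, rng=random):
    """Given D pooled descriptions per class (image-description associations
    discarded), sample K descriptions per class for a K-shot episode.
    class_descriptions: dict mapping class name -> list of D strings."""
    return {n: rng.sample(descs, K) for n, descs in class_descriptions.items()}
```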

Birds
We evaluate on the N = 5-way, K = 1-shot setting, and as f_θ we use the 4-layer convolutional backbone proposed by Chen et al. (2019). Here we use a learned bilinear similarity function, s(a, b) = aᵀWb, where W is learned jointly with the model. g_φ is a 200-dimensional GRU, and with another grid search we set λ_NL = 5.
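The learned bilinear similarity might look like the following minimal PyTorch sketch (the identity initialization is our choice, not specified in the text):

```python
import torch
import torch.nn as nn

class BilinearSimilarity(nn.Module):
    """s(a, b) = a^T W b, with W learned jointly with the rest of the model."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Parameter(torch.eye(dim))  # identity init: starts as a dot product

    def forward(self, queries, protos):
        # queries: (M, D), protos: (N, D) -> similarity logits: (M, N)
        return queries @ self.W @ protos.t()
```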

Results
Results are in Table 1. For ShapeWorld, LSL outperforms the meta-learning baseline (Meta) by 6.7%, and does at least as well as L3; Table 2 shows similar trends when f_θ is trained from scratch. For Birds, LSL has a smaller but still significant 3.3% increase over Meta, while L3 drops below baseline. Furthermore, LSL uses language more efficiently: Figure 3 shows Birds performance as the number of captions per class D increases from 1 (100 total) to 60 (6000 total). LSL benefits from a remarkably small number of captions, with limited gains past 20; in contrast, L3 requires much more language to even approach baseline performance.

Figure 4: Examples of language generated by the L3 decoder g_φ for Birds validation images (e.g., "This bird has a white belly and breast with brown wings and a black crown"; "This is a dark gray bird with a light brown belly"; "Stripes tarsuses are both light, olive colored head, small songbird edges to light brown"; "Dark grey feathers and bright red with a black pointed beak"). Since the LSL decoder is identically parameterized, it generates similar language.
In the low-data regime, L3's lower performance is unsurprising, since it must generate language at test time, which is difficult with so little data. Example output from the L3 decoder in Figure 4 highlights this fact: the language looks reasonable in some cases, but in others has factual errors (dark gray bird; black pointed beak) and fluency issues.
These results suggest that any benefit of L3 is likely due to the regularizing effect that language has on its embedding model f θ , which has been trained to predict language for test-time inference; in fact, the discrete bottleneck actually hurts in some settings. By using only the regularized visual representations and not relying exclusively on the generated language, LSL is the simpler, more efficient, and overall superior model.

Language ablation
To identify which aspects of language are most helpful, in Figure 5 we examine LSL performance under ablated language supervision: (1) keeping only a list of common color words, (2) filtering out color words, (3) shuffling the words in each caption, and (4) shuffling the captions across tasks (see Figure 6 for examples). We find that while the benefits of color-only and no-color language vary across tasks, neither component provides the benefit of complete language, demonstrating that LSL leverages both colors and other attributes (e.g., size, shape) described in language. Word order is important for Birds but surprisingly unimportant for ShapeWorld, suggesting that even with decoupled colors and shapes, the model can often infer the correct relation from the shapes that consistently appear in the examples. Finally, when captions are shuffled across tasks, LSL for Birds does no worse than Meta, while ShapeWorld suffers, suggesting that language is more important for ShapeWorld than for the fine-grained, attribute-based Birds task.
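The first three ablations, applied to a single caption, can be sketched as follows (the color list and function names are illustrative; the fourth ablation, shuffling captions across tasks, operates at the dataset level by permuting the task-to-caption assignment):

```python
import random

# Illustrative list of common color words; the paper's exact list is not shown here.
COLORS = {"red", "blue", "green", "yellow", "cyan", "magenta",
          "white", "black", "brown", "gray", "grey"}

def ablate(caption, mode, rng=random):
    """Apply one of the per-caption language ablations."""
    words = caption.split()
    if mode == "color_only":        # (1) keep only common color words
        words = [w for w in words if w in COLORS]
    elif mode == "no_color":        # (2) filter out color words
        words = [w for w in words if w not in COLORS]
    elif mode == "shuffle_words":   # (3) shuffle words within the caption
        rng.shuffle(words)
    return " ".join(words)
```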

Discussion
We presented LSL, a few-shot visual recognition model that is regularized with language descriptions during training. LSL outperforms baselines across two tasks and uses language supervision more efficiently than L3. We find that if a model is trained to expose the features and abstractions in language, a linguistic bottleneck on top of these language-shaped representations is unnecessary, at least for the kinds of visual tasks explored here.

Figure 6: Examples of ablated language supervision for ShapeWorld and Birds (color-only, no-color, word-shuffled, and task-shuffled captions).
The line between language and sufficiently rich attributes and rationales is blurry, and recent work (Tokmakov et al., 2019) suggests that similar performance gains can likely be observed by regularizing with attributes. However, unlike attributes, language (1) is a more natural medium for annotators, (2) does not require preconceived restrictions on the kinds of features relevant to the task, and (3) is abundant in unsupervised forms. This makes shaping representations with language a promising and easily accessible way to improve the generalization of vision models in low-data settings.

A Model and training details
A.1 ShapeWorld

f_θ. Like Andreas et al. (2018), f_θ starts with features extracted from the last convolutional layer of a fixed ImageNet-pretrained VGG-19 network (Simonyan and Zisserman, 2015). These 4608-d embeddings are then fed into two fully connected layers (with weight matrices in ℝ^{4608×512} and ℝ^{512×512}) with one ReLU nonlinearity in between.
LSL. For LSL, the 512-d embedding from f θ directly initializes the 512-d hidden state of the GRU g φ . We use 300-d word embeddings initialized randomly. Initializing with GloVe (Pennington et al., 2014) made no significant difference.
L3. f_θ and g_φ are the same as in LSL and Meta. h_η is a unidirectional 1-layer GRU with hidden size 512, sharing the same word embeddings as g_φ. The output of the last hidden state is taken as the embedding of the description w^(t). Like Andreas et al. (2018), a total of 10 descriptions per task are sampled at test time.
Training. We train for 50 epochs, each epoch consisting of 100 batches with 100 tasks in each batch, with the Adam optimizer (Kingma and Ba, 2015) and a learning rate of 0.001. We select the model with highest epoch validation accuracy during training. This differs slightly from Andreas et al. (2018), who use different numbers of epochs per model and did not specify how they were chosen; otherwise, the training and evaluation process is the same.
Data. We recreated the ShapeWorld dataset using the same code as Andreas et al. (2018), except generating 4x as many test tasks (4000 vs 1000) for more stable confidence intervals.
Note that results for both L3 and the baseline model (Meta) are 3-4 points lower than the scores reported in Andreas et al. (2018) (because performance is lower for all models, we are not being unfair to L3). This is likely due to our PyTorch reimplementation and/or the recreation of the dataset with more test tasks.
A.2 Birds

f_θ. The 4-layer convolutional backbone f_θ is the same as the one used in much of the few-shot literature (Snell et al., 2017). The model has 4 convolutional blocks, each consisting of a 64-filter 3×3 convolution, batch normalization, ReLU nonlinearity, and 2×2 max-pooling layer. With an input image size of 84 × 84 this results in 1600-d image embeddings. Finally, the bilinear matrix W used in the similarity function has dimension 1600 × 1600.
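A sketch of this backbone in PyTorch (the layer recipe follows the description above; class and function names are ours):

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """One block: 64-filter 3x3 conv, batch norm, ReLU, 2x2 max-pool."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
        nn.MaxPool2d(2))

class Conv4(nn.Module):
    """Standard 4-layer few-shot backbone: 84x84 RGB input -> 1600-d embedding.
    Spatial sizes: 84 -> 42 -> 21 -> 10 -> 5, so the output is 64 * 5 * 5 = 1600."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.Sequential(
            conv_block(3, 64), conv_block(64, 64),
            conv_block(64, 64), conv_block(64, 64))

    def forward(self, x):
        return self.blocks(x).flatten(1)  # (B, 1600)
```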
LSL. The resulting 1600-d image embeddings are fed into a single linear layer ∈ R 1600×200 which initializes the 200-d hidden state of the GRU. We initialize embeddings with GloVe. We did not observe significant gains from increasing the size of the decoder g φ .
L3. f_θ and g_φ are the same as in LSL. h_η is a unidirectional GRU with hidden size 200, sharing the same embeddings as g_φ. The last hidden state is taken as the concept c_n. 10 descriptions per class are sampled at test time. We did not observe significant gains from increasing the size of the decoder g_φ or encoder h_η, nor from increasing the number of descriptions sampled per class at test time.
Training. For ease of comparison to the few-shot literature we use the same training and evaluation process as Chen et al. (2019). Models are trained for 60000 episodes, each episode consisting of one randomly sampled task with 16 query images per class. Like Chen et al. (2019), they are evaluated on 600 episodes. We use Adam with a learning rate of 0.001 and select the model with the highest validation accuracy after training.
Data. Like Chen et al. (2019), we use standard data preprocessing and training augmentation: ImageNet mean pixel normalization, random cropping, horizontal flipping, and color jittering.