ZEST: Zero-shot Learning from Text Descriptions using Textual Similarity and Visual Summarization

We study the problem of recognizing visual entities from the textual descriptions of their classes. Specifically, given birds' images with free-text descriptions of their species, we learn to classify images of previously-unseen species based on species descriptions. This setup has been studied in the vision community under the name zero-shot learning from text, focusing on learning to transfer knowledge about visual aspects of birds from seen classes to previously-unseen ones. Here, we suggest focusing on the textual description and distilling from the description the most relevant information to effectively match visual features to the parts of the text that discuss them. Specifically, (1) we propose to leverage the similarity between species, reflected in the similarity between text descriptions of the species; and (2) we derive visual summaries of the texts, i.e., extractive summaries that focus on the visual features that tend to be reflected in images. We propose a simple attention-based model augmented with the similarity and visual summaries components. Our empirical results consistently and significantly outperform the state-of-the-art on the largest benchmarks for text-based zero-shot learning, illustrating the critical importance of texts for zero-shot image-recognition.


Introduction
In computer vision, zero-shot learning (ZSL) for image classification is the problem of classifying images given auxiliary information. An image classification model is trained to classify images from a pre-defined set of classes. At test time, images from new classes are given, and the task is to transfer knowledge learned from seen classes during training to unseen test classes.
Figure caption: (1) we aim to leverage the similarity within texts (red) via document clustering (bottom box); (2) we aim to extract similar (red) and dissimilar (black) visual descriptions, and remove the non-visually relevant (blue) ones.
A common setup for ZSL assumes that the auxiliary information is a set of semantically meaningful properties (called attributes) describing the class (e.g., black-beak, long-tail) (Wah et al., 2011; Farhadi et al., 2009). A different ZSL setup uses image captions as auxiliary information (Reed et al., 2016; Felix et al., 2018). Typically, this auxiliary information is manually collected by human raters for each image (test and train alike) and averaged across images. A more realistic approach relies on available online text descriptions of classes (e.g., Wikipedia) (Elhoseiny et al., 2017). It avoids expensive annotation and exposure to test images.
In this work, we classify bird species according to Wikipedia descriptions. This task raises many challenges: (1) Differences between the birds are very small, which makes it a fine-grained classification task; (2) This is an expert task, and the text contains terminology that is unlikely to be familiar to a layman; and, on top of that (3) The text descriptions of the classes are long, containing few visually relevant sentences.
As opposed to previous work on text-based ZSL employing textual descriptions (Zhu et al., 2018; Elhoseiny et al., 2017) that focused on the visual modality, here we focus on the text modality, and address a key question in ZSL: How can we identify text components that are visual in nature?
To get an intuition about the task setup and our proposed solution, consider the following situation. Imagine you have never seen a zebra but have seen a horse. What if you were given a text describing a zebra: "Zebras have hooves, mane, tail, pointed ears, and white and black stripes". This description would probably be very close to a description of a horse having "hooves, mane, tail, pointed ears" and you would probably be looking for an image that reminds you of a horse but has "white and black stripes". So, even without ever seeing a zebra, using text-descriptions of the zebra and knowledge already acquired about horses, one can correctly classify unknown classes like a zebra.
Our proposed solution has two phases. First, based on the intuition that similar objects (or images thereof) tend to have similar texts, we encode a similarity feature that enhances the separability of the text descriptions. Second, we leverage the intuition that the differences between text descriptions of species would be their most salient visual features, and extract visually relevant descriptions from the text.
Our experiments empirically demonstrate both the efficacy and generalization capacity of our proposed solution. On two large ZSL datasets, in both the easy and hard scenarios, the similarity method obtains a ratio improvement of up to 18.3%. With the addition of extracting visually relevant descriptions, we obtain a ratio improvement of up to 48.16% over the state-of-the-art. We further show that our visual-summarization method generalizes from the CUB dataset (Wah et al., 2011) to the NAB dataset (Van Horn et al., 2015), and we demonstrate its contribution to additional models by a ratio improvement of up to 59.62%.
The contributions of this paper are threefold. First, to the best of our knowledge, we are the first to showcase the critical importance of the text representation in zero-shot image-recognition scenarios, and we present two concrete text-based processing methods that vastly improve the results. Second, we demonstrate the efficacy and generalizability of our proposed methods by applying them to both the zero-shot and generalized zero-shot tasks, outperforming all previously reported results on the CUB and NAB benchmarks. Finally, we show that visual aspects learned from one dataset can be transferred effectively to another dataset without the need to obtain dataset-specific captions. The efficacy of our proposed solution on these benchmarks illustrates that purposefully exposing the visual features in texts is indispensable for tasks that learn to align the vision-and-language modalities.

Background and Related Work
Zero-shot learning (ZSL) aims at overcoming the need to label massive datasets for new categories, by learning the connections between images and prior auxiliary knowledge about their classes. At test-time, this auxiliary information compensates for the lack of previously-attained visual information about the new categories.
Text-based ZSL is a specific multimodal instantiation of this learning task that uses natural language descriptions as the auxiliary information. Models for text-based ZSL are typically composed of three parts: (1) the text representation; (2) the image representation; (3) a compatibility function between the two. While most previous work focused mainly on the latter two components, here we focus on the text.
Most ZSL studies for object recognition are aimed at processing the image modality. For example, Xu et al. (2018), Lei Ba et al. (2015), and Qiao et al. (2016) rely on visual features extracted using a Convolutional Neural Network (CNN). More recent studies use object detection to detect the semantic parts of the object and extract visual features at the part-level (Elhoseiny et al., 2017; Zhu et al., 2018; Zhang et al., 2016). This approach makes the image more compatible with the text, as it enables text-terms such as "crest" to be linked to the visual representation of parts like "head".
The auxiliary information provided to ZSL tasks may be of various kinds, ranging from pre-defined semantic attributes (Lampert et al., 2009; Changpinyo et al., 2020; Atzmon and Chechik, 2018), to captions (Xian et al., 2018; Sariyildiz and Cinbis, 2019), to Wikipedia articles describing the species (Elhoseiny et al., 2017). Here we assume the latter scenario. ZSL studies that rely on Wikipedia articles as auxiliary information improve the visual representation and the compatibility function, and use text representations such as Bag-of-Words (BOW) and TF-IDF, without further text processing (Lei Ba et al., 2015; Elhoseiny et al., 2013, 2017; Zhu et al., 2018). Qiao et al. (2016) used a simple BOW and an $L_{1,2}$-norm objective to suppress the noisy signal in the text. However, this basic treatment of the text is problematic, as it misses crucial information for detecting the correct class. Recent studies (Lu et al., 2019; Tan and Bansal, 2019) have shown improved performance on multiple vision-and-language tasks using pre-trained BERT-based models that jointly learn a representation for vision and language. However, they are tuned on relatively short texts and are not optimal for classifying long textual descriptions.
In this work, we proceed in a different, yet complementary, direction to previous work, aiming to purposefully model the contribution of the textual modality to ZSL. We aim to establish the importance of adequately processing the text into a sound representation of visually salient features, in order to increase the vision-and-language compatibility, which can then be effectively learned in an end-to-end manner.

Strong Baseline Model
The basic architecture, which we term ZEST vanilla, is a simple multiplicative attention mechanism (Luong et al., 2015) inspired by Romera-Paredes and Torr (2015). We model the problem using an attention-based model, where the image is queried against a set of candidate documents.
Formally, let $x^S_1, \ldots, x^S_M$ be image feature vectors from a training set, where $x^S_i \in \mathbb{R}^m$. The set of $M$ training images corresponds to a set of $L$ seen classes. Each class has a single "class description", which is a document written by experts in free language (e.g., Wikipedia). We denote by $d^S_1, \ldots, d^S_L$ the set of $L$ document feature vectors, where $d^S_i \in \mathbb{R}^m$. Likewise, let $x^U_1, \ldots, x^U_N$ be the image feature vectors from a test set, where $x^U_i \in \mathbb{R}^m$. The set of test images corresponds to a set of $K$ unseen classes, each of which likewise has a single "class description". We denote by $d^U_1, \ldots, d^U_K$ the set of document feature vectors, where $d^U_i \in \mathbb{R}^m$. Finally, $W \in \mathbb{R}^{m \times m}$ is our learned matrix.
At inference, an image $x^U_i$ is assigned the label of the unseen class whose document $d^U_j$ scores highest against the image under the attention parameterized by $W$. For training, given an image representation $x^S_i$ and a text representation $d^S_j$, the target is 1 if $x^S_i$ corresponds to the class described by $d^S_j$ and 0 otherwise. The matrix $W$ is then learned by minimizing the categorical cross-entropy loss.
Image Encoding: The image encoder's goal is to transform the image into a vector representation of the most salient visual features for the classification. We adopt the image encoder for text-based ZSL of Zhang et al. (2016), Zhu et al. (2018), and Elhoseiny et al. (2017). It is based on Fast R-CNN (Girshick, 2015) with a VGG16 backbone for object detection, which detects seven semantic parts in the CUB dataset: "head", "back", "belly", "breast", "leg", "wing", and "tail". Each visual part's encoded features are then concatenated into a feature vector that functions as the image representation for the text-based ZSL.
Text Processing: Our basic encoder processes the text into a feature vector. Similar to previous studies, we employ a TF-IDF representation (Salton and Buckley, 1988). We preprocess the text to tokenize words, remove stop words, and stem the remaining words. Then, we extract a feature vector using TF-IDF. This processing procedure is similar to the text processing presented by Zhu et al. (2018). The dimensionalities of TF-IDF features for CUB and NAB are 7,551 and 13,217, respectively.
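To make the baseline of this section concrete, the following is a minimal sketch of a multiplicative-attention classifier trained with categorical cross-entropy. It is an illustration under stated assumptions rather than the implementation used in the experiments: the bilinear score x^T W d, the softmax over candidate documents, the plain gradient step, the zero initialization, and all function names are our own choices for this toy example, and the sketch allows the image and text dimensions to differ.

```python
import math

def scores(x, W, docs):
    """Bilinear score x^T W d for every candidate document d (assumed form)."""
    xW = [sum(x[a] * W[a][b] for a in range(len(x))) for b in range(len(W[0]))]
    return [sum(p * q for p, q in zip(xW, d)) for d in docs]

def softmax(z):
    m = max(z)                                  # subtract max for stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(x, W, docs, target):
    """Categorical cross-entropy of one image against its true class document."""
    return -math.log(softmax(scores(x, W, docs))[target])

def predict(x, W, docs):
    """Index of the document with the highest score for image x."""
    s = scores(x, W, docs)
    return max(range(len(s)), key=s.__getitem__)

def sgd_step(x, W, docs, target, lr=0.5):
    """One gradient step on W for a single (image, class) training pair."""
    p = softmax(scores(x, W, docs))
    # dL/dW[a][b] = x[a] * sum_j (p_j - y_j) * d_j[b]
    g = [sum((p[j] - (j == target)) * docs[j][b] for j in range(len(docs)))
         for b in range(len(W[0]))]
    return [[W[a][b] - lr * x[a] * g[b] for b in range(len(W[0]))]
            for a in range(len(W))]

# Toy data (hypothetical): two seen classes, two unseen classes.
seen_docs = [[1.0, 0.0], [0.0, 1.0]]
train = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]       # (image features, class index)
W = [[0.0, 0.0], [0.0, 0.0]]                     # zero init, for the toy only

loss_before = cross_entropy(train[0][0], W, seen_docs, train[0][1])
for x, y in train:
    W = sgd_step(x, W, seen_docs, y)
loss_after = cross_entropy(train[0][0], W, seen_docs, train[0][1])

# Zero-shot inference: the same W scores an image against unseen-class documents.
unseen_docs = [[0.9, 0.1], [0.1, 0.9]]
zero_shot_label = predict([1.0, 0.0], W, unseen_docs)
```

Only W is learned here; the documents of unseen classes enter at inference time exactly as the seen-class documents do during training, which is what allows the transfer.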

The Proposed Approach
Our solution's key idea is to replace the general text representation of a class with a text representation focusing on the most salient features for the visual recognition task. To do so, we employ two different (complementary) methods: (i) induce a similarity measure used for clustering; and (ii) extract visually relevant text descriptions. Both methods are incorporated in our proposed end-to-end model.

The Importance of being Similar
Our proposed method leverages the similarities between images and texts. That is, when the images look similar, the texts describing their classes are also similar, and vice versa. Here, we propose to reconstruct this similarity link.
To this end, we propose two models: (1) a strong baseline based on two nearest-neighbor searches, which create a link between images and texts; (2) our ZEST vanilla model with an added similarity component. For both models we use the Image Encoder (Section 3) to process the images x, and the Text Processing (Section 3) to process the documents D.

Nearest Neighbor Similarity (NNS)
Figure 3 presents our Nearest Neighbor Similarity (NNS) method, which aims to reconstruct the parallel similarity links between the vision and text latent spaces.
The algorithm is as follows. Given an image $x^U$ from an unseen class in the zero-shot phase, we first look for its nearest-neighbor image in the set of training images, using cosine similarity. The closest training image $x^S_k$ corresponds to a training document $d^S_k$. We then look for the nearest-neighbor text of $d^S_k$ in the test set, $d^U_y$, and predict the corresponding class label $y$.
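The two nearest-neighbor hops can be sketched as follows. This is a minimal illustration on toy vectors; the function names and the data are hypothetical, and we assume cosine similarity for the text hop as well as the image hop.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def nearest(query, candidates):
    """Index of the candidate most similar to the query."""
    return max(range(len(candidates)), key=lambda i: cosine(query, candidates[i]))

def nns_predict(x_unseen, train_images, train_labels, train_docs, test_docs):
    """Image -> nearest training image -> its class document -> nearest test document."""
    k = nearest(x_unseen, train_images)          # hop 1: vision space
    d_seen = train_docs[train_labels[k]]         # document of the seen class
    return nearest(d_seen, test_docs)            # hop 2: text space

# Toy example (hypothetical): seen class 0 resembles unseen class 1.
train_images = [[1.0, 0.0], [0.0, 1.0]]
train_labels = [0, 1]
train_docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
test_docs = [[0.1, 0.9, 0.2], [0.9, 0.1, 0.3]]

label = nns_predict([0.9, 0.2], train_images, train_labels, train_docs, test_docs)
```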

ZEST similarity
A different way to incorporate textual similarity into the classification is to embed it into our ZEST vanilla model, to benefit from it in the learning procedure. To this end, we add on top of our text feature vector a representation of the text's similarity to its neighbors.
The Basic Encoder captures similarities and differences at the word level. To find similarities at the document level, we add to this vector our similarity component, which applies unsupervised clustering to all class descriptions in the training and test texts. We use two different clustering methods that capture different aspects of text similarities. The cluster indexes are then embedded as a BOW (henceforth, cluster embedding).
We hypothesize that the similarity component will work well in the "easy" scenario, where closely related birds are seen during training, and their texts can cluster together to indicate these similarities.
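The cluster embedding described above can be sketched as follows, assuming the cluster indexes are already given (one label per class document, with -1 marking a document that belongs to no cluster) and that the embedding is a one-hot vector per clustering appended to the TF-IDF vector. The function names and the toy values are hypothetical.

```python
def cluster_embedding(labels):
    """One-hot (bag-of-words style) encoding of cluster indexes; noise (-1) -> all zeros."""
    ids = sorted({l for l in labels if l >= 0})
    index = {c: k for k, c in enumerate(ids)}
    rows = []
    for l in labels:
        row = [0.0] * len(ids)
        if l >= 0:
            row[index[l]] = 1.0
        rows.append(row)
    return rows

def augment(tfidf, *label_sets):
    """Append one cluster embedding per clustering method to each TF-IDF vector."""
    out = [list(v) for v in tfidf]
    for labels in label_sets:
        for vec, emb in zip(out, cluster_embedding(labels)):
            vec.extend(emb)
    return out

# Toy example: three documents, two clusterings with different groupings.
tfidf = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
labels_a = [0, 0, -1]     # e.g. from one clustering method
labels_b = [0, 1, 1]      # e.g. from the other
features = augment(tfidf, labels_a, labels_b)
```

Documents that share a cluster end up with identical entries in the appended block, which gives the attention model a document-level similarity signal that word-level TF-IDF weights do not provide.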

The Importance of Being Seen
Here we extract visually relevant features from the text, making the texts that enter the classification more compatible with the salient visual information typically reflected in images.
While the similarity method takes advantage of the similarity between objects seen in training and objects seen at test time, here we want to address the harder scenario, where similar objects are observed together during test time only (e.g. zebras and mules), and they may be very different from those observed during training.
To differentiate between classes in the test set, we need to emphasize the parts that are different, both in the image and in the text, and these are typically their most salient visual features.

Visually Relevant Summaries (VRS)
Our method for enhancing the textual description is based on visually relevant extractive summaries. Extractive summarization is the task of extracting a small number of sentences that summarize a given document. In this work, we define visually relevant extractive summarization (VRS) as the task of extracting only sentences that represent visually relevant language. The term visually relevant language (VRL) was coined by Winn et al. (2016) to indicate sentences which are visually descriptive with respect to the object (i.e., bird species).
A naïve approach for VRS would be to extract sentences with parts that we know are visually salient in our domain (e.g., the 7 parts employed by the vision recognition representation). However, this naïve approach has several drawbacks. First, bird parts can be described using many different terms and paraphrases; additionally, a bird can be described by its property values (e.g., black), without any mention of the attribute (e.g., beak). Instead, we propose to use the similarity of sentences in the documents and compare them to naturally occurring sentences ('in the wild') containing VRL.
Note that we cannot rely on descriptions of particular species due to the zero-shot setup. We must make do with descriptions of objects in the general domain of objects we are interested in classifying.

ZEST similarity +VRS
One way to obtain naturally-occurring descriptions of birds is from captions that describe bird images. Critically, these captions need not be from our dataset; they can describe any bird image.
We propose to use a set of L bird captions to create an unsupervised classifier. The classifier will receive a set of sentences (assembled as a document), and for each sentence, the classifier will predict whether the sentence is relevant, that is, whether it contains descriptions that can be seen in a bird image.
For each document, we propose to calculate the pairwise similarity between captions and sentences in the Wikipedia description, and based on this similarity, assign a VRS-score to each sentence.
We calculate the VRS-score of a sentence $s_j$ with respect to a caption by computing the cosine similarity between the embeddings of the captions ($c_{0:L}$) and of the sentences ($s_{0:M}$) in the document. For fixed-size sentence embeddings, we use a pre-trained siamese-and-triplet network (Reimers and Gurevych, 2019; Schroff et al., 2015) on top of a pre-trained BERT network (Devlin et al., 2019).
The VRS-score of sentence $s_j$ with respect to all available captions $c_{1:L}$ is then computed from these pairwise similarities. We take the $k$ highest-scoring sentences of the document to be its visually relevant extractive summary. We can then concatenate the similarity embedding to the VRS summary of the text, and perform the multiplicative attention on this revised encoding of the documents and the same image encoding as before.
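The selection step can be sketched as follows. This is an illustration under explicit assumptions: the sentence and caption embeddings are given as plain vectors (in practice they would come from the pre-trained sentence encoder), and the per-sentence score is taken to be the maximum cosine similarity over the captions, which is our choice for the sketch and not necessarily the aggregation used in the experiments. Function names and toy data are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def vrs_scores(sentence_embs, caption_embs):
    """Score each sentence by its best match among the captions (assumed aggregation)."""
    return [max(cosine(s, c) for c in caption_embs) for s in sentence_embs]

def vrs_summary(sentences, sentence_embs, caption_embs, k):
    """Keep the k highest-scoring sentences, preserving document order."""
    sc = vrs_scores(sentence_embs, caption_embs)
    top = sorted(range(len(sentences)), key=lambda j: sc[j], reverse=True)[:k]
    return [sentences[j] for j in sorted(top)]

# Toy embeddings (hypothetical): the first axis stands for "visual" content.
sentences = ["The head is white.", "It migrates north.", "The wings are brown."]
sentence_embs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3]]
caption_embs = [[1.0, 0.0], [0.9, 0.2]]

summary = vrs_summary(sentences, sentence_embs, caption_embs, k=2)
```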
A bird's eye overview of our overall architecture is presented in Figure 2. The text that enters the similarity (clustering) component is the original Wikipedia document, not the document's VRS summary. Documents contain many non-visual descriptions that are unobserved in the images. However, these non-visual descriptions might still be essential to capture the similarity between documents. For example, similar-looking birds are likely to be in the same habitat. Thus, the VRL sentence extraction and the similarity enhancement operate in parallel on the original document.

Experiment setting
Datasets: We evaluate our method on the Caltech-UCSD Birds-2011 dataset (CUB) (Wah et al., 2011) and the North America Birds dataset (NAB) (Van Horn et al., 2015), using class descriptions obtained from Wikipedia and the AllAboutBirds website, collected by Elhoseiny et al. (2017). Both are fine-grained datasets of birds but from different species. The CUB dataset contains 11,788 images of 200 bird species, and the NAB is a larger dataset of birds with 48,562 images of 404 classes. The texts of both CUB and NAB are long, containing non-visual information. CUB has an average of 869 tokens and 42 sentences in class documents. NAB has an average of 1277 tokens and 58 sentences in class documents.
Two Split Settings: We use the two splits presented by Elhoseiny et al. (2017): (1) Super-Category-Shared (SCS), also referred to as the 'easy' split; and (2) Super-Category-Exclusive (SCE), also referred to as the 'hard' split. In the SCS, for each class in the test set, at least one class in the training set belongs to the same category (categories are organized taxonomically). For example, in Figure 1, the Rufous Hummingbird and the Ruby-throated Hummingbird are both from the Hummingbird category. In the SCE, all classes in a category are in the same set. Namely, if a class is in the test set, then other classes from the same category are also in the test set, and will never be seen during training. Intuitively, classes from the same category have high similarity in both images and texts, so while in SCS similar images have been seen during training, in the SCE a class from an entirely new category is seen for the first time.
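The two split conditions can be stated as simple checks over a class-to-category mapping. The sketch below is illustrative only; the function names are hypothetical, and the "Pelican A" class is a made-up placeholder.

```python
def is_scs(train_cats, test_cats):
    """SCS: every test class shares its category with at least one training class."""
    seen = set(train_cats.values())
    return all(cat in seen for cat in test_cats.values())

def is_sce(train_cats, test_cats):
    """SCE: no category appears on both the training and the test side."""
    return set(train_cats.values()).isdisjoint(test_cats.values())

# Easy split: a hummingbird is seen in training, another one appears at test time.
easy_train = {"Rufous Hummingbird": "Hummingbird", "Pelican A": "Pelican"}
easy_test = {"Ruby-throated Hummingbird": "Hummingbird"}

# Hard split: the whole Hummingbird category is held out for testing.
hard_train = {"Pelican A": "Pelican"}
hard_test = {"Rufous Hummingbird": "Hummingbird",
             "Ruby-throated Hummingbird": "Hummingbird"}
```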
Training Details: The parameters of our model include cluster parameters. We use two clustering methods: (1) density-based spatial clustering of applications with noise (DBSCAN) (Ester et al., 1996); (2) Hierarchical DBSCAN (HDBSCAN) (McInnes et al., 2017). The DBSCAN algorithm takes two parameters: (1) "minimal cluster": the number of samples in a neighborhood for a point to be considered a core point; (2) "max distance": the maximum distance between two samples for one to be considered in the neighborhood of the other. We set "minimal cluster" to two, since a pair of birds is the minimal similarity we want to capture (as in the NNS model). We optimize the "max distance" parameter on validation sets (10% of the data) for each of the two splits. In addition, the similarity model includes a threshold for applying the similarity component, also optimized on the validation set. The VRS algorithm includes a sentence-score threshold that determines the number of sentences to extract; this threshold was also chosen on the validation set.
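To illustrate the roles of the two DBSCAN parameters, the following is a compact, self-contained sketch of the algorithm with "minimal cluster" set to two. It is a didactic re-implementation, not the library code used in the experiments; the cosine distance, the convention that a point counts toward its own neighborhood, the value of "max distance", and the toy points are assumptions of this sketch.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def dbscan(points, max_distance, minimal_cluster=2, dist=cosine_distance):
    """Return one cluster label per point; -1 marks noise."""
    n = len(points)
    labels = [None] * n                      # None = not visited yet

    def neighbors(i):
        # The point itself is included in its own neighborhood.
        return [j for j in range(n) if dist(points[i], points[j]) <= max_distance]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < minimal_cluster:        # not a core point (for now: noise)
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nb if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:              # former noise becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbj = neighbors(j)
            if len(nbj) >= minimal_cluster:  # core point: keep expanding
                queue.extend(nbj)
    return labels

# Toy document vectors: two tight pairs and one outlier.
docs = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0), (0.1, 0.99), (-1.0, -1.0)]
labels = dbscan(docs, max_distance=0.05)
```

With "minimal cluster" equal to two, any pair of documents within "max distance" of each other already forms a cluster, which mirrors the pairwise links of the NNS model.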
The weights W were initialized with normalized initialization (Glorot and Bengio, 2010). The cross-entropy loss function was optimized with the Adam optimizer (Kingma and Ba).
Human Summarization: To evaluate our proposed VRS extraction method, we designed an oracle experiment using ground-truth visually relevant summarization. To this end, two independent human experts manually annotated the CUB dataset by reading each sentence in the document and marking the sentence as yes/no VRL. We set guidelines to resolve disagreements (e.g., descriptions of hatchlings were marked as not VRL). On average, only 11.9% of the sentences were found to include VRL.
Image Captions: To create visual summaries we use image captions of birds from the CUB train set, provided by Reed et al. (2016). Each image in the CUB dataset has been annotated with ten fine-grained captions. These captions describe only the birds' visual appearance while avoiding mentioning the names of the bird species, e.g., "This bird has a long beak, a creamy breast, and body, with brown wings". In this work, we use the first five captions of each image.
To showcase this approach's generality, we use these captions in both in-domain (CUB) and out-of-domain (NAB) scenarios. In all cases, we avoid using captions of unseen (test) bird classes. In NAB, we effectively use captions from CUB to extract VRS for the entirely different species presented in NAB. Note that only models that include the VRS component (+VRS) employ these image captions. We report the accuracy achieved per the number of captions used in the VRS, to indicate the number of captions that are realistically needed.
Baselines: Our approach is compared against ten leading algorithms (see Table 1): MCZSL, WAC-Linear (Elhoseiny et al., 2013), WAC-Kernel, ESZSL (Romera-Paredes and Torr, 2015), SJE (Akata et al., 2015), SynC fast, SynC OVO, ZSLNS (Qiao et al., 2016), and GAZSL (Zhu et al., 2018).

Table 1:
Method                                   CUB SCS   CUB SCE   NAB SCS   NAB SCE
MCZSL                                    34.7      -         -         -
WAC-Linear (Elhoseiny et al., 2013)      27.0      5.0       -         -
WAC-Kernel                               33.5      7.7       11.4      6.0
ESZSL (Romera-Paredes and Torr, 2015)    28.5      7.4       24.3      6.3
SJE (Akata et al., 2015)                 29.9      -         -         -
ZSLNS (Qiao et al., 2016)                29.1      7.3       24.5      6.8
SynC fast                                28 ...

Generalized Zero-Shot Learning: The conventional zero-shot learning task considers only unseen classes during the zero-shot phase. However, in a realistic scenario, seen objects might also appear. In Generalized Zero-Shot Learning (GZSL), test data might also come from seen classes, and the labeling space is the union of the seen and unseen classes. GZSL is thus considered a more challenging problem setting than ZSL due to the model's bias towards the seen classes. We follow a previously proposed metric to evaluate our models on the GZSL task: we compute the Seen-Unseen accuracy Curve (SUC) and use the Area Under SUC to measure the general capability of ZSL methods.

Results
ZEST vanilla: In contrast to the very sophisticated approaches of Zhu et al. (2018), the vanilla cross-entropy-based approach outperforms all previous methods on the SCE-split on both CUB (+14.27% ratio of improvement) and NAB (+18.37% ratio of improvement). As the SCE-split is the more challenging split, this sheds light on the strengths as well as the limitations of this simple framework.
Table 5: Example sentences (columns: Sentence, HUMAN, VRS Model).
1. After nesting, north american birds move in flocks further north along the coasts, returning to warmer waters for winter.
2. Red foxes and coyotes readily predate colonies that they can access, the later being the only known species to hunt adult pelicans (which are too large for most bird predators to subdue).
3. when foraging, they dive bill-first like a kingfisher often submerging completely below the surface momentarily as they snap up prey.
4. It is one of only three pelican species found in the western hemisphere.
5. Due to their small size, they are vulnerable to insect-eating birds and animals.
6. Hummingbirds show a slight preference for red, tubular flowers as a nectar source.
7. The head is white but often gets a yellowish wash in adult birds.

ZEST similarity: We then combined the strengths of the ZEST vanilla and NNS models over the two different scenarios, "hard" and "easy", respectively. The ZEST similarity model adds the cluster index embedding to the TF-IDF representation only if a significant percentage of the documents from the test set are clustered with documents from the train set. The threshold, optimized on the validation set, is 15%. Thus, in the case of the SCE-split, no or few similarities are found, and ZEST similarity performs at the same level as the ZEST vanilla model.
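The gating rule for the similarity component can be sketched as follows. This is one plausible reading, stated as an assumption: we count a test document as "clustered with the train set" if its cluster also contains at least one training document, and we enable the cluster embedding when the fraction of such test documents reaches the 15% threshold. Function names and the toy labels are hypothetical.

```python
def shared_cluster_fraction(labels, is_test):
    """Fraction of test documents whose cluster also contains a training document."""
    train_clusters = {l for l, t in zip(labels, is_test) if not t and l >= 0}
    test_labels = [l for l, t in zip(labels, is_test) if t]
    if not test_labels:
        return 0.0
    # Noise documents (label -1) never count, since -1 is not a cluster.
    return sum(l in train_clusters for l in test_labels) / len(test_labels)

def use_similarity(labels, is_test, threshold=0.15):
    """Enable the cluster embedding only if enough test documents link to train ones."""
    return shared_cluster_fraction(labels, is_test) >= threshold

# SCS-like toy case: half of the test documents share a cluster with training ones.
scs_labels = [0, 0, 1, -1, 1, 2]
scs_is_test = [False, True, False, True, True, True]

# SCE-like toy case: test documents only cluster among themselves.
sce_labels = [0, 0, 1, 1, -1]
sce_is_test = [False, False, True, True, True]
```

In the SCE-like case the function returns False, so the model falls back to the plain TF-IDF representation, which is consistent with the two models performing at the same level on that split.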
The two clustering algorithms we applied find real similarities, achieving high accuracy when tested on predicting the correct label according to the ground-truth taxonomical category. HDBSCAN and DBSCAN achieved 88% and 84.5% accuracy on CUB, and 93.07% and 95.05% on NAB, respectively.
Interestingly, though, different clustering methods find different sources of similarity, which are essentially additive. Table 3 compares different similarity-enhancing methods. The ZEST vanilla +bird category method is a BOW of the bird category added to the original text embedding and then passed as before to a ZEST vanilla model. Using two clustering methods that capture different similarities performs better than embedding the bird category in the text representation, by a ratio improvement of up to 8.63%. This suggests that our ZEST similarity method captures similarities that go beyond the bird category.
Finally, in Table 4 we present the results of ZEST similarity in the GZSL setup. On both datasets and splits, the ZEST similarity achieves state-of-the-art results with up to 30.88% ratio improvement.
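For readers unfamiliar with the Area Under the Seen-Unseen accuracy Curve, the sketch below shows one common way to compute it. It rests on assumptions not spelled out in this paper: a calibration constant gamma is subtracted from the scores of seen classes, gamma is swept over a range of values, accuracy is computed per sample (not per class), and the area is obtained with the trapezoid rule. Function names and the toy scores are hypothetical.

```python
def accuracies(samples, seen, gamma):
    """Return (acc_seen, acc_unseen) over the joint label space for one gamma.

    samples: list of (scores, true_class), where scores maps class -> score.
    Assumes at least one sample from a seen class and one from an unseen class.
    """
    hits = {True: 0, False: 0}
    totals = {True: 0, False: 0}
    for scores, y in samples:
        pred = max(scores, key=lambda c: scores[c] - (gamma if c in seen else 0.0))
        from_seen = y in seen
        totals[from_seen] += 1
        hits[from_seen] += (pred == y)
    return hits[True] / totals[True], hits[False] / totals[False]

def ausuc(samples, seen, gammas):
    """Trapezoid area under the curve of acc_seen (y) against acc_unseen (x)."""
    # Increasing gamma penalizes seen classes more, so acc_unseen can only grow
    # and acc_seen can only shrink: the points come out in curve order.
    pts = []
    for g in sorted(gammas):
        acc_seen, acc_unseen = accuracies(samples, seen, g)
        pts.append((acc_unseen, acc_seen))
    area = 0.0
    for (u0, s0), (u1, s1) in zip(pts, pts[1:]):
        area += (u1 - u0) * (s0 + s1) / 2.0
    return area

# Toy example: one seen class "s", one unseen class "u".
seen = {"s"}
samples = [
    ({"s": 2.0, "u": 0.0}, "s"),
    ({"s": 1.0, "u": 0.0}, "u"),
    ({"s": 0.5, "u": 0.0}, "s"),
]
area = ausuc(samples, seen, gammas=[-10.0, 0.75, 1.5, 10.0])
```

The extreme values of gamma pin down the two ends of the curve: a very negative gamma sends every prediction to a seen class, and a very large one sends every prediction to an unseen class.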
ZEST vanilla +VRS and ZEST similarity +VRS use the captions from training images in the CUB in order to generate visually relevant extractive summaries of the original Wikipedia documents.
We test the summarized representation on the ZEST vanilla model, the ZEST similarity , and the GAZSL (Zhu et al., 2018) model. In Table 2 we show the experimental results. We compare the models before and after the use of the Visually Relevant Extractive Summarization component. We see an improvement in accuracy in both models on both datasets and on both splits.
In contrast to ZEST similarity, GAZSL does not have a component that embeds similarities. The VRS reduces similarity by removing non-VRL sentences that might be similar across documents. The HUMAN summary is an especially lean summary, with only 11.9% of the sentences extracted. Thus, the similarity between texts of similar objects diminishes. GAZSL+HUMAN performs poorly on the SCS-split due to this diminished similarity. In contrast, GAZSL+HUMAN+our VRS restores the similarity that was lost, and the performance improves.
Figure 4: Accuracy per number of captions used to focus summarization, measured on the hard SCE split of CUB, showing that as few as 5 captions in total are sufficient to focus the summarization process.
To assess the quality of the VRS summarization performance, we treat HUMAN summarization as the ground truth. The VRS method succeeds in removing 49.4% of the sentences in the CUB dataset with 96.23% recall and 22.59% precision. For comparison, removing 49.4% of the sentences randomly produces a recall of 50.6% and a precision of 11.9%.
Table 5 shows a qualitative analysis of our VRS results. In sentences 1-3, the VRS model correctly marked the sentences as non-VRL: sentence 1 is a typical case of non-visually-relevant language describing bird migration; in sentence 2 the VRS model correctly marks the sentence as non-VRL despite the mention of a color (red), since the color does not refer to the object to be classified (the bird); in sentence 3 the VRS model correctly marks the sentence as non-VRL despite the mention of a body part (bill), since that description is not visually relevant in that particular context. Sentences 4-6 show examples of false-positive predictions of the VRS model. For example, in sentence 5 the VRS model incorrectly predicted VRL, which we attribute to the mention of "their small size". In sentence 6 the VRS model incorrectly marks the sentence as VRL, a mistake we attribute to the mention of the flower's "red" color.
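The precision and recall figures above, and the random-removal baseline they are compared with, follow from standard definitions. The sketch below spells them out; the function names are hypothetical, and the baseline uses the fact that under uniformly random removal the expected precision equals the VRL base rate and the expected recall equals the fraction of sentences kept.

```python
def precision_recall(kept, relevant):
    """Precision and recall of the kept sentence ids against human-marked VRL ids."""
    kept, relevant = set(kept), set(relevant)
    tp = len(kept & relevant)
    precision = tp / len(kept) if kept else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

def random_baseline(removed_fraction, vrl_rate):
    """Expected (precision, recall) when sentences are removed uniformly at random."""
    return vrl_rate, 1.0 - removed_fraction

# Toy document: sentences 2, 4 and 7 are VRL; the summary keeps 1, 2, 3 and 4.
p, r = precision_recall(kept=[1, 2, 3, 4], relevant=[2, 4, 7])

# Numbers from the text: 49.4% of sentences removed, 11.9% of sentences are VRL.
base_p, base_r = random_baseline(removed_fraction=0.494, vrl_rate=0.119)
```

Against this baseline, the reported VRS recall of 96.23% at the same removal rate shows that the removed half of the text consists almost entirely of non-VRL sentences.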
We then compare both ZEST similarity and the GAZSL to the use of HUMAN summarization in the CUB dataset and see additional improvement in both models on the two splits. The gap between the performance on the VRS and the Human summarization indicates that improvement in the summarization of documents will improve the models' performance, and is, therefore, a promising path for text-based zero-shot learning research.
Finally, we experiment to assess the number of captions that are realistically needed for the VRS method. The results, presented in Figure 4, show that only a few (∼5) sentences (captions) from arbitrary birds are needed to achieve the maximum accuracy with this method. Testing the VRS with five arbitrary captions from the CUB dataset on the NAB dataset with the SCS-split, we achieved 39.28% accuracy.
For comparison, Reed et al. (2016) showed that their model needed at least 512 captions per class to achieve the maximum accuracy, i.e., all the captions available.

Conclusion
This work aims to establish a better way to represent the language modality in text-based ZSL for image classification. Our approach only relies on semantic information about visual features, and not on the visual features themselves. Specifically, our two orthogonal text-processing methods, employing textual similarity and visually-relevant summaries, lead to significant improvements across models, splits, and datasets, and illustrate that adequate text-processing is essential in text-based ZSL tasks. We conjecture that text-processing methods will be essential in a range of vision and language-based tasks, and hope this work will assist future research in better representing the language modality in various multi-modal tasks.