Detecting Visually Relevant Sentences for Fine-Grained Classification

Detecting discriminative semantic attributes from text which correlate with image features is one of the main challenges of zero-shot learning for fine-grained image classification. In particular, using full-length encyclopedic articles as textual descriptions has had limited success, one reason being that such documents contain many non-visual or unrelated sentences. We propose a method to automatically extract visually relevant sentences from Wikipedia documents. Our model, based on a convolutional neural network, is robustly tested against ground-truth labels obtained via Amazon Mechanical Turk, achieving an 81.73% F1 measure.


Introduction
Current research in multimodal fusion and crossmodal mapping relies primarily on pre-aligned datasets of images and their short captions or tags, where the text is known to contain visually descriptive content directly related to its image (Baroni, 2016). These texts are usually manually collected, and restricted in length to words, phrases, and sentences. Using full-length documents such as Wikipedia articles would potentially allow automated access to already available rich descriptive content and would greatly aid the task of fine-grained classification across numerous domains, many of which have rich image datasets (such as birds (Welinder et al., 2010), flowers (Nilsback and Zisserman, 2008), aircraft (Maji et al., 2013), and dogs (Khosla et al., 2011)). Unfortunately, most full-length documents contain predominantly non-visual text, making them noisy with respect to visual information and limiting the success of zero-shot learning techniques for fine-grained classification (Elhoseiny et al., 2013; Elhoseiny et al., 2015; Lei Ba et al., 2015). Furthermore, the visual portion of the text often describes objects outside the classifier's interest, such as the color of a bird's eggs when the task is identifying bird species (see Figure 1).
Thus, the question we address in this paper is as follows: can we automatically identify visually descriptive sentences relevant to a particular object from documents that may contain predominantly non-visual text? We refer to this type of sentence as 'visually relevant'. Answering this question would allow us to automatically build aligned datasets of images with rich sentence-level descriptions, removing the necessity of manually creating aligned image-text datasets.
In this work, we focus on bird species, as this is one of the most well-studied and challenging fine-grained classification domains, using Wikipedia articles as our text (Section 2). To build our computational models, we must first define the notion of 'visually relevant' sentences. We use the definition of Visually Descriptive Language (VDL) introduced by Gaizauskas et al. (2015), with some restrictions. Like VDL, we aim to identify 'visually confirmed' rather than 'visually concrete' segments of text, as our descriptions correspond to a class (the bird species) rather than a particular image. For example, a sentence describing a bird's feet can be a 'visually relevant' sentence for a bird, though it would not be 'visually concrete' for an image of the bird flying with its feet hidden. Unlike VDL, for the scope of this paper we are interested only in the sentences which are visually descriptive with respect to the object (i.e., bird species). We define such sentences as containing visually relevant language (VRL).
To build our training data, we make a simplifying assumption: a sentence is only considered to contain visually relevant language if it is in the 'Description' section of the article. While other sections may contain visually descriptive language, we assume they describe other objects such as the eggs. This simplifying assumption allows us to approach our problem as a sentence classification task (is a sentence VRL or non-VRL), and provides an automatic, though noisy, approach for labeling the training data. We collect a dataset of 1150 Wikipedia articles about birds to train the non-linear, non-consecutive convolutional neural network architecture proposed by Lei et al. (2015). The architecture of this particular CNN is well suited to model sentences in our corpus such as "Adults have upperparts streaked with brown, grey, black and white" as it captures non-consecutive grams such as "upperparts brown", "upperparts grey", "streaked white", etc.
To test our model in a robust manner, we use crowdsourcing to manually annotate all sentences as either VRL or non-VRL from an unseen set of 200 Wikipedia articles (for a total of 6342 sentences) (Section 2), corresponding to the bird classes in the Caltech-UCSD Birds-200-2011 dataset (Welinder et al., 2010).
Our experiments show that the CNN model trained on the noisy VRL dataset performs very well when tested on a human-labeled VRL dataset: 83.4% Precision, 80.13% Recall, 81.73% F1 measure (Section 4). Our analysis highlights several findings: 1) VRL sentences outside of the Description section, or in documents with no Description section, are properly labeled by the model as VRL; 2) non-VRL sentences within the Description section (e.g., descriptions of the birds' song or weight) are properly labeled by the model as non-VRL. The human-labeled dataset will be useful to advance research on fine-grained classification, given that the Caltech-UCSD Birds-200-2011 is one of the most highly used datasets for this task.

Datasets
To train our models we collected a set of 1150 Wikipedia articles of bird species. As a future goal of this work is to correlate the extracted textual information with image data, the training documents were specifically chosen not to correspond to the 200 bird species in the Caltech-UCSD Birds-200-2011 dataset, which were set aside as test data. Of these 1150 documents, 690 contained sections labeled "Description" or related headings such as "Appearance", which allowed us to build our training and development sets. All sentences in the sections labeled "Description", "Appearance" and "Identification" were considered instances of the VRL class and everything else as instances of the non-VRL class; we refer to this labeling scheme as 'noisy'. Table 1 shows the statistics of the number of training and development instances used to build the computational models. The dataset is highly unbalanced: VRL sentences comprise 19% of both the training and development sets. This skew is typical of many descriptive documents, and as such provides an appropriate model to train on.
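The noisy labeling heuristic above can be sketched as follows (a minimal illustration; the function name and the toy article structure are ours, not the paper's actual code):

```python
# Noisy labeling heuristic: a sentence is VRL (1) iff it appears under a
# description-type section heading; everything else is non-VRL (0).
# The section names follow the paper; the toy article below is hypothetical.
VRL_SECTIONS = {"Description", "Appearance", "Identification"}

def noisy_label(section_heading):
    """Return 1 (VRL) or 0 (non-VRL) from the heading of the section
    a sentence appears in."""
    return 1 if section_heading in VRL_SECTIONS else 0

article = [
    ("Taxonomy", "The species was first described by Linnaeus in 1758."),
    ("Description", "Adults have upperparts streaked with brown, grey, black and white."),
    ("Breeding", "The egg coloring is a brown spotted greenish-white."),
]
labels = [noisy_label(section) for section, _ in article]  # -> [0, 1, 0]
```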
To test our models we use the Wikipedia articles of the 200 birds in the Caltech-UCSD Birds-200-2011 dataset previously collected by Elhoseiny et al. (2013), consisting of 6342 sentences, which we call 200-VRL. To see whether our computational models trained on the noisy VRL dataset are able to detect VRL sentences as judged by humans, we conducted a crowdsourcing experiment.

Crowdsourcing to Annotate Sentences as Visually Relevant
We define a sentence-level annotation task, where each sentence in a document is assigned one of the following labels: 1 if the sentence contains visually relevant language (VRL), i.e., it is visually descriptive with respect to the object under consideration (bird species) (see examples (1) and (2)); and 0 if the sentence does not contain visually relevant language (see examples (3), (4), (5)). Label 1 (VRL sentence) is assigned when the entire sentence is visually relevant (example (1)) or when only part of it is (e.g., in example (2) only the underlined part is visually relevant):

(1) It has a black cap and a chestnut lower belly
(2) Males give increasingly vocal displays and show off the white markings of the wings in flight and of the tail [...]

Label 0 (non-VRL sentence) is assigned when the sentence describes the object of interest (bird species) but is not visually descriptive (example (3)), when it is visually descriptive but not relevant to the object (example (4)), or when it is neither visually descriptive nor associated with the bird species (example (5)):

(3) Males have 2 distinct types of songs, classified as short and long songs.
(4) The egg coloring is a brown spotted greenish-white.
(5) Finally, volcanic eruptions on Torishima continue to be a threat.
In addition to the above labeling, for cases where a Turker chose the label 1 they were asked to provide information about the particular visually relevant text segments by specifying the bird, the body part and the description. While these phrase-level annotations are not used for our current task, they could be used in future work on joint learning from text and images, especially to align information related to each body part of the bird. In addition, they could be used to build a graph-based representation of image descriptions similar to scene graphs (Schuster et al., 2015).
The annotation task was done at the sentence level and each sentence was annotated by three Turkers on Amazon Mechanical Turk. Besides the two labels 1 and 0, the Turkers could also select "I don't know" and provide an explanation for why they could not determine whether or not the sentence contains VRL. We used highly skilled Turkers (≥ 500 completed HITs and ≥ 95% approval rate) and we paid 5 cents per HIT (each HIT contained only one sentence). The inter-annotator agreement was very high, with a Fleiss' kappa score of 0.8273. Only 8.64% of the sentences did not have a unanimous vote. Less than 2% of the sentences had at least one Turker vote 'I don't know'; of these, less than 0.05% garnered one vote each of 1, 0 and 'I don't know'.
To build the test set for the computational models we use majority voting (at least two annotators selected the label). For the few cases where we did not have majority voting (0.05% of the data) we selected the 0 label, as only one Turker voted 1 while the other two said 0 and 'I don't know'. This test set, which we call 200-HumVRL, contains 1248 sentences of class 1 (VRL) and 5094 sentences of class 0 (non-VRL).
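The majority-vote aggregation, including the fallback for the rare three-way splits, can be sketched as follows (a minimal illustration; the vote encodings and function name are hypothetical, not the paper's code):

```python
from collections import Counter

def aggregate(votes):
    """Majority vote over three Turker labels: '1', '0', or 'idk'
    ("I don't know"). Falls back to 0 when there is no 1/0 majority,
    matching the paper's treatment of the 0.05% of sentences that got
    one vote each of 1, 0 and 'I don't know'."""
    label, n = Counter(votes).most_common(1)[0]
    if n >= 2 and label in ("0", "1"):
        return int(label)
    return 0  # no 1/0 majority: default to non-VRL
```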

Detecting Visually Relevant Sentences
As mentioned earlier, our task can be framed as a binary sentence classification problem, where each sentence is labeled either as VRL or non-VRL. Deep learning methods, and in particular convolutional neural networks (CNNs), have become some of the top performing methods on various NLP tasks that can be modeled as sentence classification (e.g., sentiment analysis, question type classification) (Kim, 2014; Kalchbrenner et al., 2014; Lei et al., 2015). We use the non-linear, non-consecutive convolutional neural network architecture proposed by Lei et al. (2015), which we refer to as CNN-Lei. This CNN uses tensor products to combine non-consecutive n-grams of each sentence to create an embedding per sentence. The non-consecutive aspect of the n-gram allows it to capture co-occurrence of words spread across a sentence: "yellow crown, rump and flank patch" will generate representations of the relevant noun-adjective pairs "yellow crown", "yellow rump", and "yellow flank patch". The tensor product is used as a "generalized approach" to linear concatenation of the n-grams, as concatenation is "insufficient to directly capture relevant information in the n-gram" (Lei et al., 2015, p. 1). We use the training and development set described in Table 1 that comes from the 690 documents with 'Description' headings.
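What "non-consecutive n-grams" means can be sketched with a plain enumeration (illustrative only: we use a hard span cutoff where the actual model instead applies a learned decay over the gaps, and the function below is our own, not part of the model's code):

```python
from itertools import combinations

def nonconsecutive_ngrams(tokens, n=2, max_span=5):
    """Enumerate n-grams whose tokens keep their order but need not be
    adjacent. max_span caps the distance between the first and last
    token (a hard cutoff standing in for the learned decay over gaps
    used by the actual model)."""
    return [tuple(tokens[i] for i in idxs)
            for idxs in combinations(range(len(tokens)), n)
            if idxs[-1] - idxs[0] <= max_span]

tokens = "yellow crown , rump and flank patch".split()
grams = nonconsecutive_ngrams(tokens, n=2)
# Includes non-adjacent pairs such as ('yellow', 'rump').
```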
Hyperparameters and Word Vectors. The word vectors are pre-trained on the entire set of 1150 Wikipedia articles about birds using the word2vec model of Mikolov et al. (2013) with a window context of 20 words and vectors of 150 dimensions. Note that we do not use the documents in the test set 200-VRL for training the word vectors. We chose domain-specific text to pre-train the word vectors in order to make sure we capture domain-specific semantics such as proper word senses. A word such as "crown", when trained on a different corpus, would typically have an embedding very close to words such as "royalty" and "tiara"; in the domain of bird descriptions, "crown" maps most closely to "feathers" and "head". The hyperparameters for the CNN model are: L2 regularization weight 0.0001, n-gram order 3, and hidden feature dimension 50.
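The effect of the 20-word window is that "crown" shares training pairs with nearby description vocabulary rather than with royalty-related words. A minimal sketch of the (center, context) pair extraction underlying the skip-gram variant of word2vec (illustrative only; it ignores subsampling and the dynamic window shrinking used in practice, and the example sentence is hypothetical):

```python
def skipgram_pairs(tokens, window=20):
    """Enumerate (center, context) training pairs for skip-gram
    word2vec with a fixed window (the paper uses 20 words).
    Simplification: no subsampling, no dynamic window shrinking."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical domain sentence: with such pairs, "crown" co-occurs
# with "feathers" rather than with words like "royalty".
pairs = skipgram_pairs("the crown feathers are grey".split(), window=2)
```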

Experimental Setup and Results
Test Datasets. We first evaluate the CNN-Lei model on the 200-HumVRL dataset described in Section 2, which contains the 6342 sentences labeled by Turkers (class distribution: 1248 sentences in class 1 and 5094 sentences in class 0). Since our computational model was trained on the noisy visually relevant sentences (where the labels were determined by the 'Description' section of the documents), we wanted to evaluate how the model performed on a similarly constructed test set. Thus, instead of considering the human labels for the 6342 sentences, a sentence was assigned to class 1 if it belonged to the Description, Appearance or Identification sections and to class 0 otherwise. We call this dataset 200-NoisyVRL (class distribution: 1258 sentences in class 1 and 5084 sentences in class 0). Note that while it seems as if only 10 sentences changed, many of the sentences in the 'Description' sections were labeled by humans as class 0, and many sentences outside these sections were labeled as class 1. However, one possible issue with the 200-NoisyVRL dataset is that some documents do not contain any description-type sections and thus all their sentences are labeled 0, which might affect measuring the performance of the model. Thus, we considered additional test sets containing only the documents that had sections labeled 'Description', 'Appearance' or 'Identification' (142 documents out of the original 200). Using these documents, we constructed a dataset 142-NoisyVRL, where class 1 contained sentences that were part of the three description-type sections, and class 0 contained all other sentences (class distribution: 1156 class 1 and 3836 class 0). In addition, we also used the Turkers' labels (majority voting) for the corresponding sentences in these 142 documents. We call this dataset 142-HumVRL (class distribution: 992 class 1 and 4000 class 0).
Since the CNN model was trained on the noisy labeling, a reasonable assumption is that the classification results would be better on the 200-NoisyVRL and 142-NoisyVRL datasets than on the 200-HumVRL and 142-HumVRL datasets.
Baseline. As a baseline, we used the same neural bag-of-words model (nBoW) as Lei et al. (2015). We use the same training and development sets as for the CNN model (Table 1), along with the same word embeddings.
Results and Discussion. Table 2 shows the results of the CNN-Lei model and the nBoW model on the four datasets. The CNN model performs slightly better than the baseline on all datasets in terms of F1 measure, with a much better Recall but worse Precision. Given that the end goal is to use the extracted visually relevant sentences together with images for fine-grained classification, and that the number of visually relevant sentences in a document is small with respect to the document length, having high Recall is important.
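For reference, the F1 measure is the harmonic mean of Precision and Recall, and the reported numbers can be checked directly:

```python
def f1(precision, recall):
    """F1 measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Plugging in the CNN model's reported Precision (83.4%) and Recall (80.13%):
score = f1(0.834, 0.8013)  # ~0.8173, matching the reported 81.73% F1
```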
One of the most interesting findings of this study is that both of the computational models perform much better on the human-labeled visually relevant datasets (200-HumVRL, 142-HumVRL) than on the noisy visually relevant datasets (200-NoisyVRL, 142-NoisyVRL). In particular, the recall increases significantly (e.g., from 63.24% on 142-NoisyVRL to 80.15% on 142-HumVRL using the CNN-Lei model).
An error analysis highlights that the computational models are more 'conservative' in classifying sentences as VRL than the noisy labeling. As mentioned earlier, the Description sections of the Wikipedia articles often (though not always) contain details pertaining to the birds' song. However, despite being trained on such a labeling, the computational models do not classify most sentences related primarily to the description of birds' song as VRL. This result was most likely aided by the fact that some of the training documents contain song descriptions outside of the description-type sections, so the words pertaining to sound were not correlated as strongly with the VRL class. It is also possible that the abundance of appearance descriptions in each Description section encouraged the visual words to have a much stronger effect on the 'visualness' of a sentence. One such example is the sentence "The song is a series of musical notes which sound like: wheeta wheeta whee-tee-oh, for which a common mnemonic is 'The red, the red T-shirt'". Even the repetition of the word 'red' is not enough to make the classifier label the sentence as VRL.
Another type of example that explains these results involves sentences that describe the weight of the birds, such as "Recorded weights range from 0.69 to 2 kg [...]". These sentences were part of the Description section, but were not marked as VRL by either the Turkers or the computational models.
We also analyzed some of the false positives of the CNN-Lei model on the 142-HumVRL and 200-HumVRL datasets. One type of error comes from sentences that are visually descriptive, but not visually relevant, such as sentences that describe other objects like eggs. For example, the sentence "The egg shells are of various shades of light or bluish grey with irregular, dark brown spots or greyish-brown splotches" was labeled as VRL by the model but not by the Turkers. More interesting are the false positives that contain comparison words, such as "clapping or clicking has been observed more often in females than in males", and words having to do with appearance that do not specifically describe how the bird looks, such as "this bird is more often seen than heard".

Related Work
There are two lines of work most closely related to ours. First, Gaizauskas et al. (2015) propose a definition and typology of Visually Descriptive Language (VDL). They show that humans are able to reliably annotate text segments as containing 'visually descriptive' language or not, providing evidence that standalone text can be classified by the visualness of its contents. In our work, motivated by the end task of fine-grained classification, we restrict the definition to 'visually relevant'. As Gaizauskas et al. (2015) do, we show that humans can reliably annotate text as visually relevant or not. Unlike Gaizauskas et al. (2015), we propose a method to automatically detect visually relevant sentences from full-text documents. Second, Dodge et al. (2012) propose a method to separate visual text from non-visual text in image captions. However, their method focuses just on noun-phrases, while our approach finds visually relevant sentences in full-length documents.
While our end result is a set of visually relevant text descriptions, our approach is complementary to the rich body of work on generating text descriptions from images (see (Bernardi et al., 2016) for a survey), since our method extracts such descriptions from existing text.

Conclusion
Our work shows that it is possible to take domain-specific full-length documents, such as Wikipedia articles for bird species, and classify their sentences by visual relevancy using a CNN model trained on a noisy dataset. As many documents generally have a small proportion of visually relevant sentences, this approach automatically generates high-quality visually relevant textual descriptions for images to be used by zero-shot learning approaches for fine-grained image classification tasks (e.g., (Wang et al., 2009)). While our study has focused on bird species, we believe that this method is generally applicable to other domains used in fine-grained classification research such as flowers and dogs (all have associated Wikipedia articles and Description/Appearance sections). In future work, we plan to use the outcomes of this work for joint learning from text and images.