Neural Naturalist: Generating Fine-Grained Image Comparisons

We introduce the new Birds-to-Words dataset of 41k sentences describing fine-grained differences between photographs of birds. The language collected is highly detailed, while remaining understandable to the everyday observer (e.g., “heart-shaped face,” “squat body”). Paragraph-length descriptions naturally adapt to varying levels of taxonomic and visual distance—drawn from a novel stratified sampling approach—with the appropriate level of detail. We propose a new model called Neural Naturalist that uses a joint image encoding and comparative module to generate comparative language, and evaluate the results with humans who must use the descriptions to distinguish real images. Our results indicate promising potential for neural models to explain differences in visual embedding space using natural language, as well as a concrete path for machine learning to aid citizen scientists in their effort to preserve biodiversity.


Introduction
Humans are adept at making fine-grained comparisons, but sometimes require aid in distinguishing visually similar classes. Take, for example, a citizen science effort like iNaturalist,^1 where everyday people photograph wildlife, and the community reaches a consensus on the taxonomic label for each instance. Many species are visually similar (e.g., Figure 1, top), making them difficult for a casual observer to label correctly. This puts an undue strain on lieutenants of the citizen science community to curate and justify labels for a large number of instances. While everyone may be capable of making such distinctions visually, non-experts require training to know what to look for.

Work done during an internship at Google.
^1 https://www.inaturalist.org

[Figure 1: The Birds-to-Words dataset: comparative descriptions adapt naturally to the appropriate level of detail (orange underlines). A difficult distinction (top) is given a longer and more fine-grained comparison than an easier one (bottom). Annotators organically use everyday language to refer to parts (green highlights).
Top: "Animal 2 looks smaller and has a stouter, darker bill than Animal 1. Animal 2 has black spots on its wings. Animal 2 has a black hood that extends down onto its breast, and the rest of its breast is white with orange only on its sides. In comparison, Animal 1's breast is entirely orange."
Bottom: "Animal 2 is brightly red-colored all over, except for a black oval around its beak. Animal 1 has more muted red and grey colors."]
Field guides exist for the purpose of helping people learn how to distinguish between species. Unfortunately, field guides are costly to create because writing such a guide requires expert knowledge of class-level distinctions.
In this paper, we study the problem of explaining the differences between two images using natural language. We introduce a new dataset called Birds-to-Words of paragraph-length descriptions of the differences between pairs of bird photographs. We find several benefits from eliciting comparisons: (a) without a guide, annotators naturally break down the subject of the image (e.g., a bird) into pieces understood by the everyday observer (e.g., head, wings, legs); (b) by sampling comparisons from varying visual and taxonomic distances, the language exhibits naturally adaptive granularity of detail based on the distinctions required (e.g., "red body" vs "tiny stripe above its eye"); (c) in contrast to requiring comparisons between categories (e.g., comparing one species vs. another), non-experts can provide high-quality annotations without needing domain expertise.
arXiv:1909.04101v1 [cs.CL] 9 Sep 2019
We also propose the Neural Naturalist model architecture for generating comparisons given two images as input. After embedding images into a latent space with a CNN, the model combines the two image representations with a joint encoding and comparative module before passing them to a Transformer decoder. We find that introducing a comparative module (an additional Transformer encoder) over the combined latent image representations yields better generations.
Our results suggest that these classes of neural models can assist in fine-grained visual domains when humans require aid to distinguish closely related instances. Non-experts, such as amateur naturalists trying to tell apart two species, stand to benefit from comparative explanations. Our work approaches this sweet spot of visual expertise, where any two in-domain images can be compared, and the language is detailed, adaptive to the types of differences observed, and still understandable by laypeople.
Recent work has made impressive progress on context-sensitive image captioning. One direction of work uses class labels as context, with the objective of generating captions that distinguish why the image belongs to one class over others (Hendricks et al., 2016; Vedantam et al., 2017). Another choice is to use a second image as context, and generate a caption that distinguishes one image from another. Previous work has studied ways to generalize single-image captions into comparative language (Vedantam et al., 2017), as well as comparing two images with high pixel overlap (e.g., surveillance footage) (Jhamtani and Berg-Kirkpatrick, 2018). Our work complements these efforts by studying directly comparative, everyday language on image pairs with no pixel overlap.
Our approach outlines a new way for models to aid humans in making visual distinctions. The Neural Naturalist model requires two instances as input; these could be, for example, a query image and an image from a candidate class. By differentiating between these two inputs, a model may help point out subtle distinctions (e.g., one animal has spots on its side), or features that indicate a good match (e.g., only a slight difference in size). These explanations can aid in understanding both differences between species and variance within instances of a single species.

Birds-to-Words Dataset
Our goal is to collect a dataset of tuples $(i_1, i_2, t)$, where $i_1$ and $i_2$ are images, and $t$ is a natural language comparison between the two. Given a domain $D$, this collection depends critically on the criteria we use to select image pairs. If we sample image pairs uniformly at random, we will end up with comparisons encompassing a broad range of phenomena. For example, two images that are quite different will yield categorical comparisons ("One is a bird, one is a mushroom."). Alternatively, if the two images are very similar, such as two angles of the same creature, comparisons between them will focus on highly detailed nuances, such as variations in pose. These phenomena support rich lines of research, such as object classification (Deng et al., 2009) and pose estimation (Murphy-Chutorian and Trivedi, 2009).
We aim to land somewhere in the middle. We wish to consider sets of distinguishable but intimately related pairs. This sweet spot of visual similarity is akin to the genre of differences studied in fine-grained visual classification (Wah et al., 2011; Krause et al., 2013a). We approach this collection with a two-phase data sampling procedure. We first select pivot images by sampling from our full domain uniformly at random. We then branch from these images into a set of secondary images that emphasizes fine-grained comparisons, but yields broad coverage over the set of sensible relations. Figure 2 provides an illustration of our sampling procedure.

Domain
We sample images from iNaturalist, a citizen science effort to collect research-grade^2 observations of plants and animals in the wild. We restrict our domain $D$ to instances labeled under the taxonomic CLASS^3 Aves (i.e., birds). While a broader domain would yield some comparable instances (e.g., bird and dragonfly share some common body parts), choosing only Aves ensures that all instances will be similar enough structurally to be comparable, and avoids the gut reaction comparison pointing out the differences in animal type. This choice yields 1.7M research-grade images and corresponding taxonomic labels from iNaturalist. We then perform pivot-branch sampling on this set to choose pairs for annotation.

^2 Research-grade observations have met or exceeded iNaturalist's guidelines for community consensus of the taxonomic label for a photograph.
^3 To disambiguate class, we use CLASS to denote the taxonomic rank in scientific classification, and simply "class" to refer to the machine learning usage of the term as a label in classification.

Pivot Images
The Aves domain in iNaturalist contains instances of 9k distinct species, with heavy observation bias toward more common species (such as the mallard duck). We uniformly sample species from the set of 9k to help overcome this bias. In total, we select 405 species and corresponding photographs to use as $i_1$ images.

Branching Images
We use both a visual similarity measure and taxonomy to sample a set of comparison images $i_2$ branching off from each pivot image $i_1$. We use a branching factor of $k = 12$ from each pivot image.
To capture visually similar images to $i_1$, we employ a similarity function $V(i_1, i_2)$. We use an Inception-v4 (Szegedy et al., 2017) network pretrained on ImageNet (Deng et al., 2009) and then fine-tuned to perform species classification on all research-grade observations in iNaturalist. We take the embedding for each image from the last layer of the network before the final softmax. We perform a k-nearest neighbor search by quantizing each embedding and using L2 distance (Wu et al., 2017; Guo et al., 2016), selecting the $k_v = 2$ closest images in embedding space.
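The visual branch of the sampling can be illustrated with an exact nearest-neighbor search. The snippet below is only a minimal sketch: it assumes a handful of in-memory embeddings, uses brute-force L2 distance in place of the quantized search described above, and the function name `nearest_images` and the toy vectors are hypothetical.

```python
import math

def nearest_images(pivot_id, embeddings, k_v=2):
    """Return ids of the k_v images closest to the pivot in L2 distance.

    `embeddings` maps image id -> list of floats (hypothetical toy data).
    Brute force only; the quantized search from the text is not reproduced.
    """
    pivot = embeddings[pivot_id]
    dists = [
        (math.dist(pivot, vec), img_id)
        for img_id, vec in embeddings.items()
        if img_id != pivot_id  # never pair an image with itself
    ]
    return [img_id for _, img_id in sorted(dists)[:k_v]]

# Made-up 2-D embeddings standing in for pre-softmax features.
toy = {"a": [0.0, 0.0], "b": [0.1, 0.0], "c": [1.0, 1.0], "d": [0.0, 0.3]}
neighbors = nearest_images("a", toy)
```

With real embeddings, the brute-force loop would be replaced by an approximate index; the selection rule (smallest distance, pivot excluded) stays the same.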
We also use the iNaturalist scientific taxonomy $T(D)$ to sample images at varying levels of taxonomic distance from $i_1$. We select $k_t = 10$ taxonomically branched images by sampling two images each from the same SPECIES ($\ell = 1$), GENUS, FAMILY, ORDER, and CLASS ($\ell = 5$) as $c$ (the class of the pivot image $i_1$). With 405 pivot images and $k = 12$ branches each, this yields 4,860 raw image pairs $(i_1, i_2)$.

Language Collection
For each image pair $(i_1, i_2)$, we elicit five natural language paragraphs describing the differences between them.
An annotator is instructed to write a paragraph (usually 2-5 sentences) comparing and contrasting the animal appearing in each image. We instruct annotators not to explicitly mention the species (e.g., "Animal 1 is a penguin"), and to instead focus on visual details (e.g., "Animal 1 has a black body and a white belly"). They are additionally instructed to avoid mentioning aspects of the background, scenery, or pose captured in the photograph (e.g., "Animal 2 is perched on a coconut").
We discard all annotations for an image pair where either image did not have at least 4/5 positive ratings of image clarity. This yields a total of 3,347 image pairs, annotated with 16,067 paragraphs. Detailed statistics of the Birds-to-Words dataset are shown in Figure 3, and examples are provided in Figure 5. Further details of both our algorithmic approach and dataset construction are given in Appendices A and B.

Neural Naturalist Model
Task Given two images $(i_1, i_2)$ as input, our task is to generate a natural language paragraph $t = x_1 \ldots x_n$ that compares the two images.
Architecture Recent image captioning approaches (Xu et al., 2015; Sharma et al., 2018) extract image features using a convolutional neural network (CNN) which serve as input to a language decoder, typically a recurrent neural network (RNN) (Mikolov et al., 2010) or Transformer (Vaswani et al., 2017). We extend this paradigm with a joint encoding step and comparative module to study how best to encode and transform multiple latent image embeddings. A schematic of the model is outlined in Figure 4, and its key components are described in the upcoming sections.

[Figure 5: Example image pairs (Animal 1, Animal 2), each shown with a model-generated paragraph (M) and one of the reference paragraphs (G).]

Image Embedding
Both input images are first processed using CNNs with shared weights. In this work, we consider ResNet (He et al., 2016) and Inception (Szegedy et al., 2017) architectures. In both cases, we extract the representation from the deepest layer immediately before the classification layer. This yields a dense 2D grid of local image feature vectors, shaped $(d, d, f)$. We then flatten each feature grid into a $(d^2, f)$-shaped matrix, giving the embedded images $E_1$ and $E_2$.

Joint Encoding
We define a joint encoding $J$ of the images which contains either the embedded images $(E_1, E_2)$, a mutated combination ($M$), or both. As possible mutations, we consider four element-wise operations over $E_1$ and $E_2$, including subtraction and multiplication. We try these encoding variants to explore whether simple mutations can effectively combine the image representations.
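To make the shapes concrete, the sketch below flattens two toy feature grids and builds joint encodings from them. It is an illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, the feature values are made up, and stacking the parts row-wise (along the position axis) is an assumed way of combining them.

```python
def flatten_grid(grid):
    """Flatten a (d, d, f) feature grid into a (d*d, f) matrix (list of rows)."""
    return [cell for row in grid for cell in row]

def elementwise(op, e1, e2):
    """Apply a binary op to matching entries of two (d*d, f) matrices."""
    return [[op(a, b) for a, b in zip(r1, r2)] for r1, r2 in zip(e1, e2)]

def joint_encoding(e1, e2, mutation=None, keep_images=True):
    """Stack the embedded images and/or one mutated combination row-wise.

    Row-wise stacking is an assumption made for this sketch.
    """
    parts = [e1, e2] if keep_images else []
    if mutation is not None:
        parts.append(elementwise(mutation, e1, e2))
    return [row for part in parts for row in part]

# Toy grids with d = 2 and f = 2 (values are made up).
g1 = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
g2 = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
e1, e2 = flatten_grid(g1), flatten_grid(g2)

# Both images plus their element-wise difference.
j_sub = joint_encoding(e1, e2, mutation=lambda a, b: a - b)
# The element-wise product alone, without the original embeddings.
j_mul_only = joint_encoding(e1, e2, mutation=lambda a, b: a * b, keep_images=False)
```

Keeping the original embeddings alongside a mutation triples the number of rows, while a mutation-only encoding keeps it at $d^2$.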

Comparative Module
Given the joint encoding of the images ($J$), we would like to represent the differences in feature space ($C$) in order to generate comparative descriptions. We explore two variants at this stage.
The first is a direct passthrough of the joint encoding ($C = J$). This is analogous to "standard" CNN+LSTM architectures, which embed images and pass them directly to an LSTM for decoding. Because we try different joint encodings, a passthrough here also allows us to study their effects in isolation.
Our second variant is an $N$-layer Transformer encoder. This provides additional self-attentive mutations over the latent representations $J$. Each layer contains a multi-headed attention mechanism ($\mathrm{ATTN}_{MH}$). The intent is that self-attention in Transformer encoder layers will guide comparisons across the joint image encoding.
We denote LN as Layer Norm and FF as Feed Forward, with $C_i$ as the output of the $i$th layer of the Transformer encoder, $C_0 = J$, and $C = C_N$.
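The layer update itself is not reproduced in this text. Assuming the standard post-layer-norm Transformer encoder layer of Vaswani et al. (2017), one layer can be sketched as follows; the intermediate symbol $C_i^{H}$ is notation introduced here for illustration.

```latex
\begin{aligned}
C_i^{H} &= \mathrm{LN}\big(C_{i-1} + \mathrm{ATTN}_{MH}(C_{i-1})\big) \\
C_i     &= \mathrm{LN}\big(C_i^{H} + \mathrm{FF}(C_i^{H})\big)
\end{aligned}
```

Under this reading, each of the $N$ layers applies self-attention over all rows of the joint encoding, so positions from one image can attend to positions from the other.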

Decoder
We use an $N$-layer Transformer decoder architecture to produce distributions over output tokens. The Transformer decoder is similar to an encoder, but it contains an intermediary multi-headed attention which has access to the encoder's output $C$ at every time step.
Here we denote the text observed during training as X, which is modulated with a position-based encoding and masked in the first multi-headed attention.

Experiments
We train the Neural Naturalist model to produce descriptions of the differences between images in the Birds-to-Words dataset. We partition the dataset into train (80%), val (10%), and test (10%) sections by splitting based on the pivot images $i_1$.
This ensures $i_1$ species are unique across the different splits. We provide model hyperparameters and optimization details in Appendix C.
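Splitting by pivot rather than by pair can be sketched as below. This is a minimal illustration, assuming pairs are stored as `(pivot_id, branch_id)` tuples; the function name, the seed, and the toy ids are hypothetical.

```python
import random

def split_by_pivot(pairs, seed=0, train=0.8, val=0.1):
    """Assign each (pivot_id, branch_id) pair to a split via its pivot."""
    pivots = sorted({p for p, _ in pairs})
    random.Random(seed).shuffle(pivots)
    n_train = int(len(pivots) * train)
    n_val = int(len(pivots) * val)
    bucket = {}
    for idx, p in enumerate(pivots):
        if idx < n_train:
            bucket[p] = "train"
        elif idx < n_train + n_val:
            bucket[p] = "val"
        else:
            bucket[p] = "test"
    # Every pair follows its pivot, so no pivot appears in two splits.
    splits = {"train": [], "val": [], "test": []}
    for pair in pairs:
        splits[bucket[pair[0]]].append(pair)
    return splits

# 10 toy pivots with 12 branches each.
pairs = [(f"pivot{p}", f"branch{p}_{b}") for p in range(10) for b in range(12)]
splits = split_by_pivot(pairs)
```

Because all pairs that share a pivot land in the same split, a pivot image seen in training never reappears at evaluation time.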

Baselines and Variants
The most frequent paragraph baseline produces only the most observed description in the training data, which is that the two animals appear to be exactly the same. Text-Only samples captions from the training data according to their empirical distribution. Nearest Neighbor embeds both images and computes the lowest total L2 distance to a training set pair, sampling a caption from it. We include two standard neural baselines, CNN (+ Attention) + LSTM, which concatenate the image embeddings, optionally perform attention, and decode with an LSTM. The main model variants we consider are a simple joint encoding ($J = (E_1, E_2)$), no comparative module ($C = J$), a small (1-layer) decoder, and our full Neural Naturalist model. We also try several other ablations and model variants, which we describe later.
For human performance, we use a one-vs-rest scheme to hold one reference paragraph out and compute its metric using the other four. We average this score across twenty-five runs over the entire split in question.
Results using these metrics are given in Table 2 for the main baselines and model variants. We observe improvement across BLEU-4 and ROUGE-L scores compared to baselines. Curiously, we observe that the CIDEr-D metric is susceptible to common patterns in the data; our model, when stopped at its highest CIDEr-D score, outputs a variant of "these animals appear exactly the same" for 95% of paragraphs, nearly mimicking the behavior of the most frequent paragraph (Freq.) baseline. The corpus-level behavior of CIDEr-D gives these outputs a higher score. We observed anecdotally that higher quality outputs correlated with ROUGE-L score, which we verify using a human evaluation (see the Human Evaluation paragraph below).
Ablations and Model Variants We ablate and vary each of the main model components, running the automatic metrics to study coarse changes in the model's behavior. Results for these experiments are given in Table 3. For the joint encoding, we try combinations of four element-wise operations with and without both encoded images.
To study the comparative module in greater detail, we examine its effect on the top three joint encodings: $(i_1, i_2, -)$, $-$, and $\odot$. After fixing the best joint encoding and comparative module, we also try variations of the decoder (Transformer depth), as well as decoding algorithms (greedy decoding, multinomial sampling, and beam search).
Overall, we see that the choice of joint encoding requires a balance with the choice of comparative module. More disruptive joint encodings (like element-wise multiplication) appear too destructive when passed directly to a decoder, but yield the best performance when paired with a deep comparative module. Others (like subtraction) function moderately well on their own, and are further improved when a comparative module is introduced.
Human Evaluation To verify our observations about model quality and automatic metrics, we also perform a human evaluation of the generated paragraphs. We sample 120 instances from the test set, taking twenty each from the six categories for choosing comparative images (visual similarity in embedding space, plus five taxonomic distances). We provide annotators with the two images in a random order, along with the output from the model at hand. Annotators must decide which image contains Animal 1, and which contains Animal 2, or they may say that there is no way to tell (e.g., for a description like "both look exactly the same").
We collect three annotations per datum, and score a decision only if at least 2/3 of annotators made that choice. A model receives +1 point if annotators decide correctly, 0 if they cannot decide or agree there is no way to tell, and -1 point if they decide incorrectly (label the images backwards). This scheme penalizes a model for confidently writing incorrect descriptions. The total score is then normalized to the range $[-1, 1]$. Note that Human uses one of the five gold paragraphs sampled at random.
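The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the evaluation code used for Table 4: the vote labels (`"correct"`, `"backwards"`, `"no_way"`) are hypothetical names, and normalizing by dividing the total by the number of instances is an assumption about how the range $[-1, 1]$ is obtained.

```python
from collections import Counter

POINTS = {"correct": 1, "backwards": -1, "no_way": 0}

def score_instance(votes):
    """Score one instance from its annotator votes.

    A decision counts only if at least 2/3 of annotators agree on it;
    otherwise the instance scores 0.
    """
    label, count = Counter(votes).most_common(1)[0]
    if 3 * count < 2 * len(votes):
        return 0
    return POINTS[label]

def model_score(all_votes):
    """Average per-instance points, which lies in [-1, 1] (assumed normalization)."""
    return sum(score_instance(v) for v in all_votes) / len(all_votes)

# Made-up votes for five instances.
votes = [
    ["correct", "correct", "backwards"],      # +1
    ["backwards", "backwards", "backwards"],  # -1
    ["correct", "backwards", "no_way"],       # no agreement: 0
    ["no_way", "no_way", "correct"],          # agreed "no way to tell": 0
    ["correct", "correct", "correct"],        # +1
]
score = model_score(votes)
```

A model that always outputs "both look the same" would score 0 under this rule, while a model that confidently reverses the animals is pushed below 0.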
Results for this experiment are shown in Table 4. In this measure, we see the frequency and text-only baselines now fall flat, as expected. The frequency baseline never receives any points, and the text-only baseline is often penalized for incorrectly guessing. Our model is successful at making distinctions between visually distinct species (GENUS column and ones further right), which is near the challenge level of current fine-grained visual classification tasks. However, it struggles on the two data subsets with highest visual similarity (VISUAL, SPECIES). The significant gap between all methods and human performance in these columns indicates ultra fine-grained distinctions are still possible for humans to describe, but pose a challenge for current models to capture.

Qualitative Analysis
In Figure 5, we present several examples of the model output for pairs of images in the dev set, along with one of the five reference paragraphs. In the following section, we split an analysis of the model into two parts: largely positive findings, as well as common error cases.

Positive Findings
We find that the model exhibits dynamic granularity, by which we mean that it adjusts the magnitude of the descriptions based on the scale of differences between the two animals. If two animals are quite similar, it generates fine-grained descriptions such as, "Animal 2 has a slightly more curved beak than Animal 1," or "Animal 1 is more iridescent than Animal 2." If instead the two animals are very different, it will generate text describing larger-scale differences, like, "Animal 1 has a much longer neck than Animal 2," or "Animal 1 is mostly white with a black head. Animal 2 is almost completely yellow."
We also observe that the model is able to produce coherent paragraphs of varying linguistic structure. These include a range of comparisons set up across both single and multiple sentences. For example, it generates straightforward comparisons of the form, Animal 1 has X, while Animal 2 has Y. But it also generates contrastive expressions with longer dependencies, such as Animal 1 is X, Y, and Z. Animal 2 is very similar, except W. Furthermore, the model will mix and match different comparative structures within a single paragraph.
Finally, in addition to varying linguistic structure, we find the model is able to produce coherent semantics through a series of statements. For example, consider the following full output: "Animal 1 has a very long neck compared to Animal 2. Animal 1 has shorter legs than Animal 2. Animal 1 has a black beak, Animal 2 has a brown beak. Animal 1 has a yellow belly. Animal 2 has darker wings than Animal 1." The range of concepts in the output covers neck, legs, beak, belly, and wings without repeating any topic or getting sidetracked.

Error Analysis
We also observe several patterns in the model's shortcomings. The most prominent error case is that the model will sometimes hallucinate differences (Figure 5, bottom row). These range from pointing out significant changes that are missing (e.g., "a black head" where there is none (Fig. 5, bottom left)), to clawing at subtle distinctions where there are none (e.g., "[its] colors are brighter ... and [it] is a bit bigger" (Fig. 5, bottom right)). We suspect that the model has learned some associations between common features in animals, and will sometimes favor these associations over visual evidence.
The second common error case is missing obvious distinctions. This is observed in Fig. 5 (bottom middle), where the prominent beak of Animal 1 is ignored by the model in favor of mundane details. While outlying features make for lively descriptions, we hypothesize that the model may sometimes avoid taking them into account given its per-token cross entropy learning objective.
Finally, we also observe the model sometimes swaps which features are attributed to which animal. This is partially observed in Fig. 5 (bottom left), where the "black head" actually belongs to Animal 1, not Animal 2. We suspect that mixing up references may be a trade-off for the representational power of attending over both images; there is no explicit bookkeeping mechanism to enforce which phrases refer to which feature comparisons in each image.

Related Work
Employing visual comparisons to elicit focused natural language observations was proposed by Maji (2012), and later investigated in the context of crowdsourcing by Zou et al. (2015). We take inspiration from these works.
Previous work has collected natural language captions of bird photographs: CUB Captions (Reed et al., 2016) and CUB-Justify (Vedantam et al., 2017) are both language annotations on top of the CUB-2011 dataset of bird photographs (Wah et al., 2011). In addition to describing two photos instead of one, the language in our dataset is more complex by comparison, containing a diversity of comparative structures and implied semantics. We also collect our data without an anatomical guide for annotators, yielding everyday language in place of scientific terminology.
Conceptually, our paper offers a complementary approach to works that generate single-image class-discriminative or image-discriminative captions (Hendricks et al., 2016; Vedantam et al., 2017). Rather than discriminative captioning, we focus on comparative language as a means for bridging the gap between varying granularities of visual diversity.
Methodologically, our work is most closely related to the Spot-the-diff dataset (Jhamtani and Berg-Kirkpatrick, 2018). While that dataset captions two images with only a small section of pixels that change (surveillance footage), we consider image pairs with no pixel overlap, which motivates our stratified sampling approach for drawing good comparisons.
Finally, the recently released NLVR2 dataset (Suhr et al., 2018) introduces a challenging natural language reasoning task using two images as context. Our work instead focuses on generating comparative language rather than reasoning.

Conclusion
We present the new Birds-to-Words dataset and Neural Naturalist model for generating comparative explanations of fine-grained visual differences. This dataset features paragraph-length, adaptively detailed descriptions written in everyday language. We hope that continued study of this area will produce models that can aid humans in critical domains like citizen science.

A Algorithmic Approach to Dataset Construction
We present here an algorithmic approach to collecting a dataset of image pairs with natural language text describing their differences. The central challenge is to balance empirical desiderata (mainly, sample coverage and model relevance) with practical constraints of data quality and cost. This algorithmic approach underpins the dataset collection we outlined in the paper body.

A.1 Goals
Our goal is to collect a dataset of tuples $(i_1, i_2, t)$, where $i_1$ and $i_2$ are images, and $t$ is a textual comparison of them. We can consider each image $i$ as drawn from some domain $D \in \{\text{furniture}, \text{trees}, \ldots\}$, or a completely open domain of all concepts. There are several criteria we would like to balance:
1. Coverage: A dataset should sufficiently cover $D$ so that generalization across the space is possible.
2. Relevance: Given the capabilities for models to distinguish $i_1$ and $i_2$, $t$ should provide value.
3. Comparability: Each pair $(i_1, i_2)$ must have sufficient structural similarities that a human annotator can reasonably write $t$ comparing them. Pairs that are too different will yield lengthy and uninteresting descriptions without direct contrasting statements. Pairs that are too similar for human perception may yield "I can't see any difference."
4. Efficiency: Image judgements and textual annotations require human labor. With a fixed budget, we would like to yield a dataset of the largest size possible.
We describe sampling algorithms for addressing these issues given the choice of a domain.

A.2 Pivot-Branch Sampling
Drawing a single image $i$ from domain $D$, there is a chance $p \in [0, 1]$ that each image is ill-suited for comparisons. For example, $i$ might be out-of-focus or contain multiple instances.
If a pair of images is drawn, and each has probability $p$ of being discarded, then $\frac{1}{(1-p)^2}$ times more pairs must be selected and annotated. For example, if $p = \frac{1}{3}$, then the annotation cost is scaled by 2.25. This severely impacts annotation efficiency.
To combat this, we employ a stratified sampling strategy we call pivot-branch sampling. Each image on one side of the comparison (say, $i_{pivot}$) is vetted individually, and $k$ images on the other side (say, $i_{branch}$) are sampled to produce pairs. With $k$-times fewer $i_{pivot}$ images, it is feasible to check each instance for usability. This lowers the annotation cost scale to $\frac{1}{1-p}$ (e.g., with $p = \frac{1}{3}$, this is 1.5).
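The two cost scales can be checked with exact arithmetic. The sketch below is illustrative only; the function names are hypothetical, and it uses $p = 1/3$, the discard probability that reproduces the 2.25 and 1.5 figures quoted in this section.

```python
from fractions import Fraction

def naive_cost_scale(p):
    """Both images of a pair must be usable: 1 / (1 - p)^2."""
    return 1 / (1 - p) ** 2

def pivot_branch_cost_scale(p):
    """Pivots are vetted up front, so only the branch image can fail: 1 / (1 - p)."""
    return 1 / (1 - p)

p = Fraction(1, 3)  # assumed per-image discard probability
naive = naive_cost_scale(p)
pivot_branch = pivot_branch_cost_scale(p)
```

With these values, pivot-branch sampling cuts the expected annotation overhead from 9/4 to 3/2 of the target number of pairs.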
Splitting our selection from $D$ into two parts allows us to define two distinct sampling strategies. One choice is for $s_{pivot}(D)$ to select pivot images. The second is for $s_{branch}(D, i_{pivot}, k)$ to sample $k$ images given a single pivot image.

A.3 Designing $s_{pivot}(D)$
Selecting $i_{pivot}$ is important because each pivot will contribute to $k$ image pairs in a dataset. Here we consider the case where there are class labels $c \in C$ available for each image in the domain. We propose selecting $s_{pivot}$ to sample uniformly over $C$. This strategy attempts to provide coverage over $D$ using class labels as a coarse measure of diversity. It accounts for category-level dataset bias (e.g., where most images belong to only a few classes). This pushes the need to address relevance and comparability to the sampling procedure for branched images.
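Sampling uniformly over classes, rather than over images, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and toy data are hypothetical, classes are drawn without replacement, and one image is then drawn uniformly from each chosen class.

```python
import random

def sample_pivots(images_by_class, n_pivots, seed=0):
    """Pick n_pivots distinct classes uniformly, then one image from each."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(images_by_class), n_pivots)
    return [(c, rng.choice(images_by_class[c])) for c in classes]

# Heavily imbalanced toy domain: one common class, three rare ones.
images_by_class = {
    "mallard": [f"mallard_{i}" for i in range(1000)],
    "kiwi": ["kiwi_0"],
    "owl": ["owl_0", "owl_1"],
    "wren": ["wren_0"],
}
pivots = sample_pivots(images_by_class, n_pivots=3)
```

Each class is equally likely to be chosen regardless of how many images it has, so the 1,000 mallard photos do not dominate the pivot set.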
A.4 Designing $s_{branch}(D, i_{pivot}, k)$
Given each pivot image $i_{pivot}$, we will choose $k$ images from $D$ for comparison. We can make use of additional functions and structure available on $D$:

$V(i_1, i_2)$: A function that measures the visual similarity between any two images.
$T(D)$: A taxonomy over $D$, with image class labels $c \in C$ as leaves.

We can partition $k = k_v + k_t$ to sample $k_v$ visually similar images using $V$ and $k_t$ taxonomically related images using $T(D)$. A simple strategy for visually similar images is to pick the image that $V$ rates as most similar to $i_{pivot}$, $k_v$ times without replacement. This samples the $k_v$ most visually similar images to $i_{pivot}$, excluding the image itself.
To employ taxonomic information, we propose a walk over mutually exclusive subsets of $T(D)$. We define a function $a_{T(D)}(c, \ell)$ that gives the set of other taxonomic leaves that share a common ancestor exactly $\ell$ taxonomic levels above $c$, and no levels lower. More formally, if we use $p(c, c', \ell)$ to express that $c$ and $c'$ share a parent $\ell$ taxonomic levels above $c$, then we can define $a_{T(D)}(c, \ell) = \{c' \in C : p(c, c', \ell) \wedge \neg p(c, c', \ell - 1)\}$. The function $a_{T(D)}(c, \ell)$ partitions the taxonomy $T(D)$ into disjoint subtrees. For example, $a_{T(D)}(c, 1)$ is the set of sibling classes to $c$ which share its direct parent; $a_{T(D)}(c, 2)$ is the set of cousin classes to $c$ which share its grandparent, but not its parent.
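The sibling and cousin sets described above can be computed directly from each leaf's list of ancestors. The sketch below is a minimal illustration: the `lineage` encoding (leaf -> `[parent, grandparent, ...]`), the function name `a_T`, and the toy taxonomy are hypothetical, and it assumes ancestor names are unique within the taxonomy.

```python
def a_T(lineage, c, level):
    """Leaves sharing c's ancestor exactly `level` levels up, and none lower.

    `lineage[leaf]` lists ancestors from the direct parent upward.
    """
    own = lineage[c]
    result = set()
    for other, path in lineage.items():
        if other == c:
            continue  # only *other* leaves are returned
        shares_here = path[level - 1] == own[level - 1]
        shares_lower = level > 1 and path[level - 2] == own[level - 2]
        if shares_here and not shares_lower:
            result.add(other)
    return result

# Toy three-level taxonomy (genus, family, order); names are made up.
lineage = {
    "sp_a": ["genus_1", "family_1", "order_1"],
    "sp_b": ["genus_1", "family_1", "order_1"],
    "sp_c": ["genus_2", "family_1", "order_1"],
    "sp_d": ["genus_3", "family_2", "order_1"],
}
siblings = a_T(lineage, "sp_a", 1)
cousins = a_T(lineage, "sp_a", 2)
distant = a_T(lineage, "sp_a", 3)
```

The three sets are disjoint and together cover every other leaf, which is the partitioning property the text relies on.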
We can employ $a_{T(D)}(c, \ell)$ by choosing class $c$ from our pivot image $i_{pivot}$ and varying $\ell$. As we increase $\ell$, we define mutually exclusive sets of classes with greater taxonomic distance from $c$.
To sample images using this scheme, we can further split our $k_t$ budget for taxonomically sampled images into $k_t = k_{t_1} + k_{t_2} + \cdots + k_{t_\ell}$ for different levels $\ell$. Then, if we write the set of classes $C' = a_{T(D)}(c, \ell)$, we can sample $k_{t_\ell}$ images from $C'$. One scheme is to perform round-robin sampling: rotate through each class $c' \in C'$ and sample one image from each until $k_{t_\ell}$ are chosen.
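Round-robin sampling over a set of classes can be sketched as below. This is an illustration only: the function name, the seed, and the toy image ids are hypothetical, and images within a class are drawn in a shuffled order without replacement.

```python
import random

def round_robin_sample(images_by_class, classes, budget, seed=0):
    """Rotate through classes, taking one unused image from each in turn,
    until `budget` images are chosen or every class is exhausted."""
    rng = random.Random(seed)
    # Shuffled copy of each class's images; popping never repeats an image.
    pools = {
        c: rng.sample(images_by_class[c], len(images_by_class[c]))
        for c in sorted(classes)
    }
    chosen = []
    while len(chosen) < budget and any(pools.values()):
        for c in pools:
            if pools[c] and len(chosen) < budget:
                chosen.append(pools[c].pop())
    return chosen

images_by_class = {"x": ["x1", "x2", "x3"], "y": ["y1"], "z": ["z1", "z2"]}
picked = round_robin_sample(images_by_class, {"x", "y", "z"}, budget=4)
```

Rotating through classes spreads a small budget across as many classes as possible before any class contributes a second image.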
A.5 Analyzing s branch (D, i pivot , k) Given a good visual similarity function V, image pairs will exhibit enough similarity to satisfy requirement that they be semantically close enough to be comparable.They may also be so visually similar that comparability is difficult.However, this aspect counter-balances with relevance: if V(i 1 , i 2 ) is small under a visual model, but their differences are describable by humans, their difference description has high value because it distinguishes two points with high similarity in visual embeddings space.
The use of the taxonomy $T(D)$ complements $V$ by providing controllable coverage over $D$ while maintaining relevance and comparability. Tuning the range of $\ell$ values used in the taxonomic splits $a_{T(D)}(c, \ell)$ ensures comparability is maintained. Clamping $\ell$ below a threshold ensures images have sufficient similarity, and controlling the proportion of $k_t$ for small values of $\ell$ mitigates the risk of too-similar image pairs.
Similarly, we can adjust the relevance of taxonomic sampling by controlling the distribution of $k_{t_1}, \dots, k_{t_\ell}$ with respect to the particular structure of the taxonomy $T(D)$. If the taxonomy is well-balanced, then fixing a constant $k_{t_\ell}$ will draw proportionally more samples from subtrees close to $c$. This can be seen by considering that $a_{T(D)}(c, \ell)$ defines exponentially larger subsets of $T(D)$ as $\ell$ increases. Drawing the same number of samples from each subset biases the collection towards relevant pairs (which should be more difficult to distinguish) while maintaining sparse coverage over the entirety of $D$.
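The exponential growth can be made concrete with a short worked calculation. As an idealization not stated in the text, assume every internal node of the taxonomy has exactly $b \ge 2$ children and all leaves sit at the same depth. The symbols $b$ and $\kappa$ are introduced here for illustration only. The leaves under the ancestor $\ell$ levels above $c$ number $b^{\ell}$, and those under the ancestor $\ell - 1$ levels above number $b^{\ell-1}$, so

\[ |a_{T(D)}(c, \ell)| = b^{\ell} - b^{\ell-1} = (b - 1)\, b^{\ell-1}. \]

With a constant per-level budget $k_{t_\ell} = \kappa$, the fraction of classes at level $\ell$ that can contribute a sample is at most

\[ \frac{\kappa}{(b - 1)\, b^{\ell-1}}, \]

which shrinks geometrically in $\ell$. For instance, with $b = 3$ and $\kappa = 2$, the sets for $\ell = 1, 2, 3$ contain 2, 6, and 18 classes, so two samples cover at most 100%, 33%, and 11% of them respectively. If classes hold comparable numbers of images, the same ratios apply to images.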

B Details for Constructing Birds-to-Words Dataset
We provide here additional details for constructing the Birds-to-Words dataset. This is meant to link the high-level overview in Section 2 with the algorithmic approach presented in the previous section (Appendix A).

B.1 Clarity
To build a dataset emphasizing fine-grained comparisons between two animals, we impose stricter restrictions on the images than iNaturalist research-grade observations (photographs). An iNaturalist observation that is research-grade indicates the community has reached consensus on the animal's species, that the photo was taken in the wild, and several other qualifications (footnote 5). We include four additional criteria that we define together as clarity:

1. Single instance: A photo must include only a single instance of the target species. Bird photography often includes flocks in trees, in the air, or on land. In addition, some birds appear in male/female pairs. For our dataset, all of those photos must be discarded.
2. Animal: A photo must include the animal itself, rather than a record of it (e.g., tracks).
3. Focus: A photo must be sufficiently in-focus to describe the animal in detail.
4. Visibility: The animal in the photo must not be too obscured by the environment, and must take up enough pixels in the photo to be clearly described.

B.2 Pivot Images
To pick pivot images, we first uniformly sample from the set of 9k species in the taxonomic CLASS Aves in iNaturalist. We consider only species with at least four recorded observations to promote the likelihood that at least one image is clear. We also perform look-ahead branch sampling to ensure that a species will yield sufficient comparisons taxonomically. For each species, we manually review four images sampled from this species to select the clearest image to use as the pivot image. If none are suitable, we move to the next species. With this manual process, we select 405 species and corresponding photographs to use as pivot images $i_1$.

B.3 Branching Images
See Section 2.3 for the description of selecting $k_v = 2$ visually similar branching images using a function $V(i_1, i_2)$. We highlight here the use of the taxonomy $T(D)$ to select $k_t = 10$ branching images with varying levels of taxonomic distance.
For the class $c$ corresponding to image $i_1$, we split the taxonomic tree into disjoint subtrees rooted $\ell \in \{1..5\}$ taxonomic levels above $c$. Each higher level excludes the levels beneath it. For example, at $\ell = 1$ we consider all images of the same species as $i_1$; at $\ell = 2$, we consider all images of the same genus as $i_1$, but that have a different species. We set each $k_{t_\ell} = 2$ for a total of $k_t = 10$.
5 More details on iNaturalist research-grade specification: https://www.inaturalist.org/pages/help#quality

B.4 Annotations
Clarity: Annotators first label whether $i_1$ and $i_2$ are clear. While we manually verified each $i_1$ is clear, each $i_2$ must still be vetted (footnote 6). Starting from 405 pivot images $i_1$, and selecting $k = 12$ branching images $i_2$ for each, we annotated a total of 4,860 image pairs. After restricting images to have $\geq 4/5$ positive clarity judgments, we ended up with the 3,347 image pairs in our dataset, a retention rate of 68.9%.
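The counts in this paragraph can be checked with a few lines of arithmetic, alongside a sketch of the clarity filter. The sketch assumes each image pair receives five boolean clarity judgments, which is our reading of the 4/5 threshold. The function name `is_clear` and the vote representation are hypothetical.

```python
def is_clear(votes, min_positive=4):
    """Keep an image pair when at least `min_positive` clarity votes are positive."""
    return sum(votes) >= min_positive

num_pivots = 405
branches_per_pivot = 12  # k = k_v + k_t = 2 + 10
annotated_pairs = num_pivots * branches_per_pivot

retained_pairs = 3347    # reported size of the final dataset
retention_rate = retained_pairs / annotated_pairs

print(annotated_pairs)           # 4860
print(f"{retention_rate:.1%}")   # 68.9%
```

The reported totals are therefore consistent with one another: 405 pivots times 12 branches gives the 4,860 annotated pairs, and 3,347 of those is 68.9%.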
Quality: We vet each annotator individually by manually reviewing five reference annotations from a pilot round, and perform random quality assessments during data collection. We found manually vetting the writing quality and guideline adherence of each individual annotator vital for ensuring high data quality.

C Model Details
For the image embedding component of our model, we use a ResNet-101 network as our CNN. We use a model pretrained on ImageNet and fix the CNN weights before starting training for our task. We also experimented with an Inception-v4 model, but found ResNet-101 to have better performance.
For both the Transformer encoder and decoder, we use $N = 6$ layers, a hidden size of 512, 8 attention heads, and dot product self-attention. Each paragraph is clipped at 64 tokens during training (chosen empirically to cover 94% of paragraphs). The text is preprocessed using standard techniques (tokenization, lowercasing), and we replace mentions referring to each image with special tokens ANIMAL1 and ANIMAL2.
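The text preprocessing can be sketched as follows. The exact tokenizer and mention-matching rules are not given in the text, so this sketch assumes mentions appear as "Animal 1" / "Animal 2" (as in the example descriptions) and uses a simple regular-expression tokenizer. The function name `preprocess` is hypothetical.

```python
import re

MAX_TOKENS = 64  # clipping length stated in the text

def preprocess(paragraph, max_tokens=MAX_TOKENS):
    """Lowercase, replace image mentions with special tokens, tokenize, and clip."""
    text = paragraph.lower()
    # Assumed mention format: "animal 1" / "animal 2" (with or without the space).
    text = re.sub(r"\banimal\s*([12])\b", lambda m: f"ANIMAL{m.group(1)}", text)
    # Assumed tokenizer: runs of letters/digits, or single punctuation marks.
    tokens = re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)
    return tokens[:max_tokens]

tokens = preprocess("Animal 2 has a black hood. Animal 1's breast is orange.")
```

Replacing mentions after lowercasing keeps the special tokens distinct from ordinary vocabulary. Whether the original pipeline orders the steps this way is an assumption.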
For inference, we experiment with greedy decoding, multinomial sampling, and beam search. Beam search performs best, so we use it with a beam size of 5 for all reported results (except the decoding ablations, where we report each).
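To illustrate why beam search can outperform greedy decoding, here is a minimal, generic beam search over a toy next-token distribution. It is a sketch, not the decoder used in the paper: `step_fn`, `toy_model`, and the token names are hypothetical, and no length normalization is applied.

```python
import math

def beam_search(step_fn, start, end, beam_size=5, max_len=64):
    """step_fn(prefix) returns a dict mapping each next token to its probability."""
    beams = [(0.0, [start])]  # (log-probability, tokens)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for token, prob in step_fn(tokens).items():
                if prob > 0.0:
                    candidates.append((score + math.log(prob), tokens + [token]))
        if not candidates:
            break
        # Keep only the highest-scoring expansions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_size]:
            if tokens[-1] == end:
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[0])[1]

def toy_model(prefix):
    """Toy distribution where the locally best first token leads to a worse sequence."""
    table = {
        ("<s>",): {"a": 0.6, "b": 0.4},
        ("<s>", "a"): {"x": 0.5, "y": 0.5},
        ("<s>", "b"): {"z": 0.9, "w": 0.1},
    }
    return table.get(tuple(prefix), {"</s>": 1.0})

best = beam_search(toy_model, "<s>", "</s>", beam_size=5)
greedy = beam_search(toy_model, "<s>", "</s>", beam_size=1)
```

With a beam of 5, the search finds the sequence through "b" and "z" (probability 0.36), while a beam of 1, which is greedy decoding, commits to "a" and ends with probability 0.30.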
We train with Adagrad for 700k steps using a learning rate of 0.01 and batch size of 2048. We decay the learning rate after 20k steps by a factor of 0.9. Gradients are clipped at a magnitude of 5.
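The schedule and clipping can be sketched in plain Python. Two details are assumptions, since the text does not specify them: the decay is applied repeatedly every 20k steps (staircase) rather than once, and clipping is by global L2 norm rather than per value. The function names are hypothetical.

```python
import math

def learning_rate(step, base_lr=0.01, decay_steps=20_000, decay_rate=0.9):
    """Assumed staircase decay: multiply by decay_rate once per completed interval."""
    return base_lr * decay_rate ** (step // decay_steps)

def clip_by_global_norm(grads, max_norm=5.0):
    """Assumed clipping rule: rescale so the L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

lr_start = learning_rate(0)
lr_after_first_decay = learning_rate(20_000)
clipped = clip_by_global_norm([6.0, 8.0])  # norm 10 rescaled to norm 5
```

Under these assumptions the rate falls from 0.01 to 0.009 at step 20k. If the decay were applied only once, the rate would then stay at 0.009 for the remainder of training.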

D Image Attributions
The table above provides attributions for all photographs used in this paper.
Figure 1: The Birds-to-Words dataset: comparative descriptions adapt naturally to the appropriate level of detail (orange underlines). A difficult distinction (TOP) is given a longer and more fine-grained comparison than an easier one (BOTTOM). Annotators organically use everyday language to refer to parts (green highlights).

Figure 2: Illustration of pivot-branch stratified sampling algorithm used to construct the Birds-to-Words dataset. The algorithm harnesses visual and taxonomic distances (increasing vertically) to create a challenging task with broad coverage.

Figure 4: The proposed Neural Naturalist model architecture. The multiplicative joint encoding and Transformer-based comparative module yield the best comparisons between images.

Figure 5: Samples from the dev split of the proposed Birds-to-Words dataset, along with Neural Naturalist model output (M) and one of five ground truth paragraphs (G). The second row highlights failure cases in red. The model produces coherent descriptions of variable granularity, though emphasis and assignment can be improved.

Table 1: Comparison with recent fine-grained language-and-vision datasets. Lang values: S = scientific, E = everyday, M = mixed. Images Ctx = number of images shown, Images Cap = number of images described in caption. Dataset citations: R = Reed et al., V = Vedantam et al., J&B = Jhamtani and Berg-Kirkpatrick.

Table 2: Experimental results for comparative paragraph generation on the proposed dataset. For human captions, mean and standard deviation are given for a one-vs-rest scheme across twenty-five runs. We observed that CIDEr-D scores had little correlation with description quality. The Neural Naturalist model benefits from a strong joint encoding and Transformer-based comparative module, achieving the highest BLEU-4 and ROUGE-L scores.

Table 3: Variants and ablations for the Neural Naturalist model. We find the best performing combination is an elementwise multiplication ($\odot$) for the joint encoding, a 6-layer Transformer comparative module, a 6-layer Transformer decoder, and using beam search to perform inference.

Table 4: Human evaluation results on 120 test set sam-