Learning to Describe Differences Between Pairs of Similar Images

In this paper, we introduce the task of automatically generating text to describe the differences between two similar images. We collect a new dataset by crowd-sourcing difference descriptions for pairs of image frames extracted from video-surveillance footage. Annotators were asked to succinctly describe all the differences in a short paragraph. As a result, our novel dataset provides an opportunity to explore models that align language and vision, and capture visual salience. The dataset may also be a useful benchmark for coherent multi-sentence generation. We perform a first-pass visual analysis that exposes clusters of differing pixels as a proxy for object-level differences. We propose a model that captures visual salience by using a latent variable to align clusters of differing pixels with output sentences. We find that, for both single-sentence and multi-sentence generation, the proposed model outperforms models that use attention alone.


Introduction
The interface between human users and collections of data is an important application area for artificial intelligence (AI) technologies. Can we build systems that effectively interpret data and present their results concisely in natural language? One recent goal in artificial intelligence has been to build models that are able to interpret and describe visual data to assist humans in various tasks. For example, image captioning systems (Vinyals et al., 2015b; Xu et al., 2015; Rennie et al., 2017; Zhang et al., 2017) and visual question answering systems (Antol et al., 2015; Lu et al., 2016; Xu and Saenko, 2016) can help visually impaired people in interacting with the world. Another way in which machines can assist humans is by identifying meaningful patterns in data, selecting and combining salient patterns, and generating concise and fluent 'human-consumable' descriptions. For instance, text summarization (Mani and Maybury, 1999; Gupta and Lehal, 2010; Rush et al., 2015) has been a long-standing problem in natural language processing aimed at providing a concise text summary of a collection of documents.

Figure 1: Examples from the Spot-the-diff dataset: We collect text descriptions of all the differences between a pair of images. Note that the annotations in our dataset are exhaustive with respect to the differences in the two images, i.e., annotators were asked to describe all the visible differences. Thus, the annotations contain multi-sentence descriptions.
In this paper, we propose a new task and accompanying dataset that combines elements of image captioning and summarization: the goal of 'spot-the-diff' is to generate a succinct text description of all the salient differences between a pair of similar images. Apart from being a fun puzzle, solutions to this task may have applications in assisted surveillance, as well as computer-assisted tracking of changes in media assets. We collect and release a novel dataset for this task, which will potentially be useful for both the natural language and computer vision research communities. We used crowd-sourcing to collect text descriptions of differences between pairs of image frames from video-surveillance footage (Oh et al., 2011), asking annotators to succinctly describe all salient differences. In total, our dataset consists of descriptions for 13,192 image pairs. Figure 1 shows a sample data point: a pair of images along with a text description, by a human annotator, of the differences between the two images.
There are multiple interesting modeling challenges associated with the task of generating natural language summaries of differences between images. First, not all low-level visual differences are sufficiently salient to warrant description. The dataset presents an interesting source of supervision for methods that attempt to learn models of visual salience (we additionally conduct exploratory experiments with a baseline salience model, as described later). Second, humans use different levels of abstraction when describing visual differences. For example, when multiple nearby objects have all moved in coordination between images in a pair, an annotator may refer to the group as a single concept (e.g. 'the row of cars'). Third, given a set of salient differences, planning the order of description and generating a fluent sequence of multiple sentences is itself a challenging problem. Together, these aspects of the proposed task make it a useful benchmark for several directions of research.
Finally, we experiment with neural image captioning based methods. Since salient differences are usually described at an object level rather than at a pixel level, we condition these systems on a first-pass visual analysis that exposes clusters of differing pixels as a proxy for object-level differences. We propose a model which uses latent discrete variables in order to directly align difference clusters to output sentences. Additionally, we incorporate a learned prior that models the visual salience of these difference clusters. We observe that the proposed model, which uses alignment as a discrete latent variable, outperforms those that use attention alone.

'Spot-the-diff' Task and Dataset
We introduce the 'spot-the-diff' dataset, consisting of 13,192 image pairs along with corresponding human-provided text annotations stating the differences between the two images. Our goal was to create a dataset wherein there are meaningful differences between two similar images. To achieve this, we work with image frames extracted from the VIRAT surveillance video dataset (Oh et al., 2011), which consists of 329 videos across 11 frames of reference, totalling about 8.5 hours of video.

Table 1: Summary statistics for the spot-the-diff dataset. (Only one row is recoverable here: % Long sentences (> 20 words): 5%.)

Extracting Pairs of Image Frames
To construct our dataset, we first need to identify image pairs such that some objects have changed positions, or have entered or left the scene in the second image compared to the first. To achieve this, we first extract a number of randomly selected image frame pairs from a given video. Thereafter, we compute the L2 distance between the two images in each pair (under the RGB representation). Finally, we set a lower and an upper threshold on the computed L2 distance values to filter out image pairs with potentially too few or too many changes. These thresholds are selected based on manual inspection. The resulting image pairs are used for collecting the difference descriptions.
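As an illustration of the filtering step above, here is a minimal sketch; the helper names (`l2_distance`, `filter_pairs`) and the threshold values are ours, standing in for the manually inspected thresholds described in the text:

```python
import numpy as np

def l2_distance(img1, img2):
    """Overall L2 distance between two images under the RGB representation."""
    diff = img1.astype(np.float64) - img2.astype(np.float64)
    return np.sqrt((diff ** 2).sum())

def filter_pairs(pairs, lower, upper):
    """Keep only frame pairs whose L2 distance lies between the two
    thresholds, discarding pairs with too few or too many changes."""
    return [(a, b) for a, b in pairs if lower <= l2_distance(a, b) <= upper]
```

An identical pair (distance 0) falls below the lower threshold and is discarded, while a pair with moderate change passes the filter.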

Human Annotation
We crowd-sourced natural language descriptions of differences between images using Amazon Mechanical Turk. Since we are working with English language annotations, we restrict to annotators from primarily Anglophone countries: USA, Australia, United Kingdom, and Canada. We limit to participants with a lifetime HIT approval rate above 80%. We award 5 cents per HIT (Human Intelligence Task) to participants. We provide the annotators with an example of how to work on the task. We request the annotators to write complete English sentences, with each sentence on a separate line. We collect a total of 13,192 annotations. Table 1 shows some summary statistics about the collected dataset. Since we deal with a focused domain, we observe a small vocabulary size. On average there are 1.86 reported differences / sentences per image pair. We also report inter-annotator agreement as measured using text overlap of multiple annotations for the same image pair. We collect three sets of annotations for a small subset of the data (467 data points) for the purpose of reporting inter-annotator agreement. We then calculate BLEU and ROUGE-L scores by treating one set of annotations as the 'hypothesis' while the remaining two sets act as 'references' (Table 2). We repeat the same analysis for the MS-COCO dataset and report these measures for reference. The BLEU and ROUGE-L values for our dataset seem reasonable and are comparable to the values observed for the MS-COCO dataset.
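The overlap-based agreement above can be illustrated with a simplified sketch: the function below computes only the clipped unigram precision, i.e., the n = 1 core of BLEU without the brevity penalty or higher-order n-grams, and its name is ours:

```python
from collections import Counter

def unigram_precision(hypothesis, references):
    """Clipped unigram precision of a hypothesis sentence against multiple
    references (the n=1 component of BLEU, brevity penalty omitted)."""
    hyp_counts = Counter(hypothesis.lower().split())
    # Clip each word's count by its maximum count across all references.
    max_ref = Counter()
    for ref in references:
        for w, c in Counter(ref.lower().split()).items():
            max_ref[w] = max(max_ref[w], c)
    clipped = sum(min(c, max_ref[w]) for w, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0
```

Treating one annotation as the hypothesis and the other two as references, as described above, then averaging over image pairs, yields an agreement score in the same spirit as the reported BLEU.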

Modeling Difference Description Generation
We propose a neural model for describing visual differences in an input pair of images that uses a latent alignment variable to capture visual salience. Since most descriptions talk about higher-level differences rather than individual pixels, we first perform a visual analysis that precomputes a set of difference clusters in order to approximate object-level differences, as described next. The output of this analysis is treated as input to a neural encoder-decoder text generation model that incorporates a latent alignment variable and is trained on our new dataset.

Exposing Object-level Differences
We first analyze the input image pair for pixel-level differences by computing a pixel-difference mask, followed by a local spatial analysis which segments the difference mask into clusters that approximate the set of object-level differences. Thereafter, we extract image features using convolutional neural models and use these as input to a neural text generation model, described later.
Pixel-level analysis: The lowest level of visual difference is the individual difference between corresponding pixels in the input pair. Instead of requiring our description model to learn to compute pixel-level differences as a first step, we precompute and directly expose these to the model. Let X = (I_1, I_2) represent the image pair in a datum. For each such image pair in our dataset, we obtain a corresponding pixel-difference mask M. M is a binary-valued matrix of the same dimensions (length and width) as each of the images in the corresponding image pair, wherein each element in the matrix is 1 (active) if the corresponding pixel is different between the input pair, and 0 otherwise. To decide whether a pair of corresponding pixels in the input image pair are sufficiently different, we calculate the L2 distance between the vectors corresponding to each pixel's color value (three channels) and check whether this distance is greater than a threshold δ (set based on manual inspection). While the images are extracted from supposedly still cameras, we do find some minor shifts in the camera alignment, probably due to occasional wind but possibly also due to manual human intervention. These shifts are rare and small, and we align the images in the pair by iterating over a small range of vertical and horizontal shifts to find the shift with the minimum corresponding L2 distance between the two images.

Figure 2: Before training a model to describe visual difference, we first compute pixel-level differences, as well as a segmentation of these differences into clusters, as a proxy for exposing object-level differences. The first row shows the original image pair. Bottom left depicts the pixel-difference mask, which represents extracted pixel-level differences. The segmentation of the pixel-difference mask into clusters is shown in the bottom right.
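A minimal sketch of this pixel-level analysis, assuming NumPy image arrays; the helper names (`pixel_difference_mask`, `best_shift`) are ours, and the threshold and shift range are placeholders for the manually tuned values:

```python
import numpy as np

def pixel_difference_mask(img1, img2, delta):
    """Binary mask M: 1 where the per-pixel L2 distance over the three
    color channels exceeds the threshold delta, 0 otherwise."""
    d = np.sqrt(((img1.astype(np.float64) - img2.astype(np.float64)) ** 2).sum(axis=-1))
    return (d > delta).astype(np.uint8)

def best_shift(img1, img2, max_shift=3):
    """Search a small range of vertical/horizontal shifts and return the one
    minimizing the L2 distance on the overlapping region (camera-jitter fix)."""
    best, best_d = (0, 0), np.inf
    h, w, _ = img1.shape
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            a = img1[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)]
            b = img2[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
            d = np.sqrt(((a.astype(np.float64) - b.astype(np.float64)) ** 2).sum())
            if d < best_d:
                best, best_d = (dy, dx), d
    return best
```

In the aligned pair, the mask is computed only over the overlapping region selected by the best shift.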
Object-level analysis: Most visual descriptions refer to object-level differences rather than pixel-level differences. Again, rather than requiring the model to learn to group pixel differences into objects, we attempt to expose this to the model via pre-processing. As a proxy for object-level difference, we segment the pixel-level differences in the pixel-difference mask into clusters, and pass these clusters as additional inputs to the model. Based on manual inspection, we find that with the right clustering technique, this process results in groupings that roughly correspond to objects that have moved, appeared, and disappeared between the input pair. Here, we find that density-based clustering algorithms like DBScan (Ester et al., 1996) work well in practice for this purpose. In our scenario, the DBScan algorithm predicts clusters of nearby active pixels, and marks outliers consisting of small groups of isolated active pixels, based on a calculation of local density. This also serves as a method for pruning any noisy pixel differences which may have passed through the pixel-level analysis.

Figure 4: The pixel-difference mask for the running example, along with the two original images, with bounding boxes around clusters. Typically one or more difference clusters are used to frame one reported difference / sentence, and it is rare for a difference cluster to participate in more than one reported difference.
As the output of DBScan, we obtain a segmentation of the pixel-difference mask M into difference clusters. Let the number of difference clusters be represented by K (DBScan is a non-parametric clustering method, and as such the number of clusters K differs across data points). Now, we define C_k as another binary-valued mask matrix, such that the elements corresponding to the k-th difference cluster are 1 (active) while the rest of the elements are 0.
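This clustering step can be sketched with scikit-learn's DBSCAN implementation; the `eps` and `min_samples` values below are illustrative rather than the tuned ones, and the function name is ours:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def difference_clusters(mask, eps=3.0, min_samples=5):
    """Segment the active pixels of a pixel-difference mask into clusters
    with DBSCAN; isolated active pixels are labeled as noise and pruned.
    Returns a list of binary cluster masks C_k."""
    coords = np.argwhere(mask == 1)          # (row, col) of active pixels
    if len(coords) == 0:
        return []
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(coords).labels_
    masks = []
    for k in sorted(set(labels) - {-1}):     # label -1 marks DBSCAN noise
        C_k = np.zeros_like(mask)
        C_k[tuple(coords[labels == k].T)] = 1
        masks.append(C_k)
    return masks
```

A dense blob of active pixels becomes one cluster mask, while a lone active pixel is treated as noise and dropped, matching the pruning behavior described above.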

Text Generation Model
We observe from annotated data that each individual sentence in a full description typically refers only to visual differences within a single cluster (see Figure 4). Further, on average, there are more clusters than there are sentences. While many uninteresting and noisy pixel-level differences get screened out in preprocessing, some uninteresting clusters are still identified. These are unlikely to be described by annotators because, even though they correspond to legitimate visual differences, they are not visually salient. Thus, we can roughly model description generation as a cluster selection process.
In our model, which is depicted in Figure 5, we assume that each output description, which consists of sentences S_1, ..., S_T, is generated sentence by sentence conditioned on the input image pair X = (I_1, I_2). Further, we let each sentence S_i be associated with a latent alignment variable, z_i ∈ {1, ..., K}, that chooses a cluster to focus on (Vinyals et al., 2015a). The choice of z_i is itself conditioned on the input image pair, and parameterized in a way that lets the model learn which types of clusters are visually salient and therefore likely to be described as sentences. Together, the probability of a description given an image pair is given by:

P(S_1, ..., S_T | X; w, θ) = ∏_{i=1}^{T} ∑_{z_i=1}^{K} P(z_i | X; w) P(S_i | z_i, X; θ)     (1)

The various components of this equation are described in detail in the next few sections. Here, we briefly summarize each. The term P(z_i | X; w) represents the prior over the latent variable z_i and is parameterized in a way that lets the model learn which types of clusters are visually salient. The term P(S_i | z_i, X; θ) represents the likelihood of sentence S_i given the input image pair and alignment z_i. We employ masking and attention mechanisms to encourage this decoder to focus on the cluster chosen by z_i. Each of these components conditions on visual features produced by a pretrained image encoder. The alignment variable z_i for each sentence is chosen independently, and thus our model is similar to IBM Model 1 (Brown et al., 1993) in terms of its factorization structure. This allows tractable learning and inference as described in Section 3.3. We refer to our approach as DDLA (Difference Description with Latent Alignment).
Alignment prior: We define a learnable prior over the alignment variable z_i. In particular, we let the multinomial distribution on z_i be parameterized in a log-linear fashion using a feature function g(z_i). We consider the following four features: the length, width, and area of the smallest rectangular region enclosing cluster z_i, and the number of active elements in the mask C_{z_i}. Specifically, we let P(z_i | X; w) ∝ exp(w^T g(z_i)).
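A sketch of this log-linear prior, computing the four features from a cluster's binary mask and normalizing with a softmax over the K clusters (function names are ours):

```python
import numpy as np

def cluster_features(C):
    """g(z): length, width, and area of the smallest rectangle enclosing
    the cluster, plus the number of active elements in its mask."""
    rows, cols = np.nonzero(C)
    length = rows.max() - rows.min() + 1
    width = cols.max() - cols.min() + 1
    return np.array([length, width, length * width, C.sum()], dtype=np.float64)

def alignment_prior(cluster_masks, w):
    """P(z | X; w) proportional to exp(w^T g(z)): a softmax over clusters."""
    scores = np.array([w @ cluster_features(C) for C in cluster_masks])
    scores -= scores.max()                   # stabilize the softmax
    p = np.exp(scores)
    return p / p.sum()
```

With w = 0 this reduces to a uniform prior (the DDLA-UNIFORM setting used as a baseline later); a positive weight on the active-pixel-count feature makes larger clusters more likely to be selected.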
Visual encoder: We extract image features using ResNet (He et al., 2016) pre-trained on ImageNet data. Similar to prior work (Xu et al., 2015), we extract features from a lower-level convolutional layer instead of a fully connected layer. In this way, we obtain image features of dimensionality 14 × 14 × 2048, where the first two dimensions correspond to a grid of coarse, spatially localized feature vectors. Let F_1 and F_2 represent the extracted feature tensors for I_1 and I_2 respectively.

Sentence decoder: We use an LSTM decoder (Hochreiter and Schmidhuber, 1997) to generate the sequence of words in each output sentence, conditioned on the image pair and latent alignments. We use a matrix transformation of the extracted image features to initialize the hidden state of the LSTM decoder for each sentence, independent of the setting of z_i. Additionally, we use an attention mechanism over the image features at every decoding step, similar to previous work (Xu et al., 2015). However, instead of considering attention over the entire image, we restrict attention over image features to the cluster mask determined by the alignment variable, C_{z_i}. Specifically, we project the binary mask C_{z_i} from the input image dimensionality (224 × 224) to the dimensionality of the visual features (14 × 14). To achieve this, we use pyramid reduce downsampling on a smoothed version of the cluster mask C_{z_i}. The resulting projection roughly corresponds to the subset of visual features with the cluster region in their receptive field. This projection is multiplied with the attention weights.

Figure 5: We incorporate a discrete latent variable z which selects one of the clusters as a proxy for object-level focus. Conditioned on the cluster and the visual features in the corresponding region, the model generates a sentence using an LSTM decoder. During training, each sentence in the full description receives its own latent alignment variable, z.
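The mask projection can be sketched as follows; here simple block averaging stands in for the pyramid reduce downsampling used in the model, and the helper names are ours:

```python
import numpy as np

def downsample_mask(C, grid=14):
    """Project a 224x224 binary cluster mask onto the 14x14 feature grid by
    block averaging (a simple stand-in for pyramid reduce downsampling).
    Each output cell roughly measures how much of the cluster falls in the
    receptive field of the corresponding feature vector."""
    s = C.shape[0] // grid                   # 224 // 14 == 16
    return C.reshape(grid, s, grid, s).mean(axis=(1, 3))

def masked_attention(att, C):
    """Multiply attention weights over the 14x14 feature grid with the
    projected cluster mask, then renormalize to a distribution."""
    masked = att * downsample_mask(C)
    return masked / masked.sum()
```

Attention mass outside the cluster's projected region is zeroed out, so the decoder can only attend to features whose receptive fields overlap the chosen cluster.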

Learning and Decoding
Learning in our model is accomplished by stochastic gradient ascent on the marginal likelihood of each description, with the alignment variables marginalized out. Since the alignment variables are independent of one another, we can marginalize over each z_i separately. This means running backpropagation through the decoder K times for each sentence, where K is the number of clusters. In practice K is relatively small, and this direct approach to training is feasible. Following equation 1, we train both the generation model and the prior in an end-to-end fashion.
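The per-sentence marginal likelihood in this objective reduces to a log-sum-exp over the K clusters, as in the sketch below (a NumPy stand-in for the actual PyTorch computation):

```python
import numpy as np

def sentence_marginal_loglik(log_prior, log_lik):
    """log P(S_i | X) = log sum_k P(z_i = k | X) P(S_i | z_i = k, X),
    computed with a numerically stable log-sum-exp over the K clusters.
    log_prior and log_lik are length-K arrays; during training, gradients
    would flow through all K decoder evaluations."""
    scores = log_prior + log_lik             # length-K array of log joints
    m = scores.max()
    return m + np.log(np.exp(scores - m).sum())
```

Summing this quantity over the sentences of a description gives the log of the marginal likelihood in equation 1, the training objective.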
For decoding, we consider two problem settings. In the first setting, we consider the task of producing a single sentence in isolation. We evaluate in this setting by treating the sentences in the ground-truth description as multiple reference captions. This setting is similar to the typical image captioning setting. In the second setting, we consider the full multi-sentence generation task, where the system is required to produce a full description consisting of multiple sentences describing all the differences in the input. Here, the generated multi-sentence text is directly evaluated against the multi-sentence annotation in the crowd-sourced data.
Single-sentence decoding: For single sentence generation, we first select the value of z i which maximizes the prior P (z i |X; w). Thereafter, we simply use greedy decoding to generate a sentence conditioned on the chosen z i and the input image pair.
Multi-sentence decoding: Here, we first select a set of clusters to include in the output description, and then generate a single sentence for each cluster using greedy decoding. Since typically there are more clusters than sentences, we condition on the ground truth number of sentences and choose the corresponding number of clusters. We rank clusters by decreasing likelihood under the alignment prior and then choose the top T .
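A sketch of this multi-sentence decoding procedure, with `greedy_decode` standing in for the LSTM decoder conditioned on a chosen cluster (both helper names are ours):

```python
import numpy as np

def select_clusters(prior_probs, T):
    """Rank clusters by decreasing likelihood under the alignment prior
    and keep the top T (T is the ground-truth number of sentences)."""
    order = np.argsort(-np.asarray(prior_probs))
    return order[:T].tolist()

def multi_sentence_decode(prior_probs, T, greedy_decode):
    """Generate one sentence per selected cluster with greedy decoding;
    greedy_decode(k) stands in for the decoder conditioned on z = k."""
    return [greedy_decode(k) for k in select_clusters(prior_probs, T)]
```

Single-sentence decoding is the special case T = 1: select the single cluster maximizing the prior and decode one sentence for it.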

Experiments
We split the videos used to create the dataset into train, validation, and test sets in the ratio 80:10:10. This ensures that all data points using images from the same video fall entirely within one split. We report quantitative metrics such as CIDEr (Vedantam et al., 2015), BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014), and ROUGE-L, as is common in image captioning work. We report these measures for both the single-sentence and multi-sentence generation settings. Thereafter, we also discuss some qualitative examples. We implement our models in PyTorch (Paszke et al., 2017). We use mini-batches of size 8 and the Adam optimizer. We use the CIDEr score on the validation set as the criterion for early stopping.
Baseline models: We consider the following baseline models. The CAPT model applies soft attention over the input pair of images (this attention mechanism is similar to that used in prior image captioning work (Xu et al., 2015), except that we have two images instead of a single image as input). We do not perform any masking in the CAPT model, and simply ignore the cluster information. The model is trained to generate a single sentence; thus, it is similar to a typical captioning model but with soft attention over two images. The CAPT-MASK model is similar to the CAPT model except that it incorporates the masking mechanism defined earlier, using the union of all the cluster masks in the corresponding image. We also consider a version of the CAPT model wherein the target prediction is the whole multi-sentence description (CAPT-MULTI); for this setting, we simply concatenate the sentences in an arbitrary order. Additionally, we consider a nearest neighbor baseline (NN-MULTI), wherein we simply use the annotation of the closest matching training data point. We compute closeness based on the extracted features of the image pair, and leverage scikit-learn's (Pedregosa et al., 2011) nearest-neighbor module. For the single-sentence setting (NN), we randomly pick one of the sentences in the annotation. We also consider a version of the DDLA model with a fixed uniform prior, and refer to this model as DDLA-UNIFORM.
For single-sentence generation with DDLA-UNIFORM, we sample z_i randomly from the uniform distribution and then perform decoding. For the multi-sentence generation setting, we employ simple heuristics to order the clusters at test time. One such heuristic is to order the clusters by decreasing area of the bounding box (the smallest rectangular region enclosing the cluster).
Results: We report automated metrics for the different methods under single-sentence generation and multi-sentence generation in Tables 4 and 5 respectively. For the single-sentence generation setting, we observe that the DDLA model outperforms the various baselines on most of the scores on the test split. The DDLA-UNIFORM method performs similarly to the CAPT baselines. For multi-sentence generation, the DDLA model again outperforms the other methods, indicating that the learned prior is useful in our proposed method. Figure 6 shows an example data point with the outputs predicted by different methods.

Discussion and Analysis
Qualitative Analysis of Outputs We perform a qualitative analysis of the outputs to understand the drawbacks of the current methods. One apparent limitation is the failure to explicitly model the movement of the same object across the two images (Figure 7); prior work on object tracking may be useful here. Sometimes the models get certain attributes of the objects wrong, e.g. 'blue car' instead of 'red car'. Some output predictions state that an object has 'appeared' instead of 'disappeared', and vice versa.

Figure 7: Some drawbacks of the current models: One apparent drawback of single cluster selection is that it misses the opportunity to identify an object which has moved significantly, considering it as appeared or disappeared as the case may be. In this example, the blue truck moved, but the DDLA model predicts that the truck is no longer there.

Do models learn alignments between sentences and difference clusters?
We performed a study on 50 image pairs by having two humans manually annotate gold alignments between sentences and difference clusters.
We then computed alignment precision for the model's predicted alignments. To obtain the model's predicted alignment for a given sentence S_i, we compute argmax_k P(z_i = k | X) P(S_i | z_i = k, X). Our proposed model achieved a precision of 54.6%, an improvement over random chance at 27.4%.
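The predicted alignment and its precision against gold annotations can be sketched as follows (helper names are ours):

```python
import numpy as np

def predicted_alignment(prior_probs, sentence_liks):
    """argmax_k P(z_i = k | X) P(S_i | z_i = k, X) for one sentence,
    given the prior and the per-cluster sentence likelihoods."""
    return int(np.argmax(np.asarray(prior_probs) * np.asarray(sentence_liks)))

def alignment_precision(predictions, gold):
    """Fraction of sentences whose predicted cluster matches the gold one."""
    hits = sum(p == g for p, g in zip(predictions, gold))
    return hits / len(gold)
```

Applied to the 50 manually annotated image pairs described above, this is the quantity reported as 54.6% for the proposed model.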
Clustering for pre-processing Our generation algorithm assumes that each sentence uses only one cluster, and as such we tune the hyper-parameters of the clustering method to produce large clusters, so that typically a cluster will entirely contain a reported difference. On inspecting randomly selected data points, we observe that in some cases overly large clusters are produced by the clustering procedure. One way to mitigate this is to tune the clustering parameters to yield smaller clusters and update the generation component to use a subset of clusters. As mentioned earlier, we consider clustering as a means to achieve object-level pre-processing. One possible future direction is to leverage pre-trained object detection models to detect cars, trucks, people, etc., and make these predictions readily available to the generation model.
Multi-sentence Training and Decoding As mentioned previously, we query the models for a desired number of sentences. In future work we would like to relax this assumption and design models which can predict the number of sentences as well. Additionally, our proposed model does not explicitly ensure consistency among the latent variables for different sentences of a given data point, i.e., the model does not make explicit use of the fact that sentences report non-overlapping visual differences. Enforcing this knowledge while retaining the feasibility of training is a potential direction for future work.

Related Work
Modeling pragmatics: The dataset presents an opportunity to test methods which model pragmatics and reason about semantic, spatial, and visual similarity to generate a textual description of what has changed from one image to another. Some prior work in this direction (Andreas and Klein, 2016; Vedantam et al., 2017) contrastively describes a target scene in the presence of a distractor. In another related task, referring expression comprehension (Kazemzadeh et al., 2014; Mao et al., 2016; Hu et al., 2017), the model has to identify which object in the image is being referred to by a given sentence. However, our proposed task comes with a pragmatic goal related to summarization: to identify and describe all the differences. Since the goal is well defined, it may be used to constrain models that attempt to learn how humans describe visual difference.
Natural language generation: Natural language generation (NLG) has a rich history of prior work, including, for example, recent work on biography generation (Lebret et al., 2016), weather report generation (Mei et al., 2016), and recipe generation (Kiddon et al., 2016). Our task can be viewed as a potential benchmark for coherent multi-sentence text generation, since it involves assembling multiple sentences to succinctly cover a set of differences.
Visual grounding: Our dataset may also provide a useful benchmark for training unsupervised and semi-supervised models that learn to align vision and language. Plummer et al. (2015) collected annotations for phrase-region alignment in an image captioning dataset, and follow-up work has attempted to predict these alignments (Wang et al., 2016; Plummer et al., 2017; Rohrbach et al., 2016). Our proposed dataset poses a related alignment problem: aligning sentences or phrases to visual differences. However, since differences are contextual and depend on visual comparison, our new task may represent a more challenging scenario for such modeling techniques.
Image change detection: There is prior work on land use pattern change detection (Radke et al., 2005). Such work is related since it tries to screen out noise and mark the regions of change between two images of the same area at different time stamps. Bruzzone and Prieto (2000) propose an unsupervised change detection algorithm that aims to discriminate between changed and unchanged pixels in multi-temporal remote sensing images. Zanetti and Bruzzone (2016) propose a method that allows the unchanged class to be more complex, rather than modeling a single unchanged class. Though image difference detection is part of our pipeline, our end task is to generate natural language descriptions.
Moreover, we observe that simple clustering seems to work well for our dataset.
Other relevant works: Maji (2012) aims to construct a lexicon of parts and attributes by formulating an annotation task where annotators are asked to describe differences between two images. Some other related works model phrases describing changes in color (Winn and Muresan, 2018), move-by-move game commentary describing changes in game state (Jhamtani et al., 2018), and code commit messages summarizing changes in a code base from one commit to another (Jiang et al., 2017). There also exists prior work on fine-grained image classification and captioning (Wah et al., 2014; Nilsback and Zisserman, 2006; Khosla et al., 2011). The premise of such work is that it is difficult for a machine to find discriminative features between similar objects, e.g. birds of different species. Such work is relevant for us, as the image pairs we deal with are usually of the same object or scene taken at a different time or under different conditions.

Conclusion
In this paper, we proposed the new task of describing differences between pairs of similar images and introduced a corresponding dataset. Compared to many prior image captioning datasets, the text descriptions in the 'Spot-the-diff' dataset are often multi-sentence, in most cases covering all the differences between the two similar images. We performed exploratory analysis of the dataset and highlighted potential research challenges. We discussed how our 'Spot-the-diff' dataset is useful for tasks such as language-vision alignment, referring expression comprehension, and multi-sentence generation. We performed pixel- and object-level preprocessing on the images to identify clusters of differing pixels. We observed that the proposed model, which aligns clusters of differing pixels to output sentences, performs better than models which use attention alone. We also discussed some limitations of the current methods and directions for future work.