Unsupervised Discovery of Multimodal Links in Multi-image, Multi-sentence Documents

Images and text co-occur constantly on the web, but explicit links between images and sentences (or other intra-document textual units) are often not present. We present algorithms that discover image-sentence relationships without relying on explicit multimodal annotation in training. We experiment on seven datasets of varying difficulty, ranging from documents consisting of groups of images captioned post hoc by crowdworkers to naturally-occurring user-generated multimodal documents. We find that a structured training objective based on identifying whether collections of images and sentences co-occur in documents can suffice to predict links between specific sentences and specific images within the same document at test time.


Introduction
Images and text act as natural complements on the modern web. News stories include photographs, product listings show multiple images providing detail for online shoppers, and Wikipedia pages include maps, diagrams, and pictures. But the exact matching between words and images is often left implicit. Algorithms that identify document-internal connections between specific images and specific passages of text could have both immediate and long-term promise. On the user-experience front, alt-text for vision-impaired users could be produced automatically (Wu et al., 2017) via intra-document retrieval, and user interfaces could explicitly link images to descriptive sentences, potentially improving the reading experience of sighted users. In terms of improving other applications, the text in multimodal documents can be viewed as a noisy form of image annotation: inferred image-sentence associations can serve as training pairs for vision models, particularly in domains lacking readily-available labeled data.

Figure 1: At training time, we assume we are given a set of multi-image/multi-sentence documents. At test time, we predict links between individual images and individual sentences within single documents. Because no explicit multimodal annotation is available at training time, we refer to this task as unsupervised.
In this work, we develop unsupervised models that learn to identify multimodal within-document links despite not having access to supervision at the individual image/sentence level during training. Rather, the training documents contain multiple images and multiple sentences that are not aligned, as illustrated in Figure 1.
Our intra-document setting poses challenges beyond those encountered in the usual cross-modal retrieval framework, wherein "documents" generally consist of a single image associated with a single piece of text, e.g., an image caption. For the longer documents we consider, a sentence may have many corresponding images or no corresponding images, and vice versa. Furthermore, we expect that images within documents will be, on average, more similar than images across documents, thus making disambiguation more difficult than in the usual one-image/one-sentence case.
Our approach for this difficult setting is ranking-based: we train algorithms to score image collections and sentence collections that truly co-occur more highly than image collections and sentence collections that do not co-occur. The matching functions we consider predict a latent similarity-weighted bipartite graph over a document's images and sentences; at test time, we evaluate this internal bipartite graph representation learned by our models for the task of intra-document link prediction.
We work with a variety of datasets (one of which we introduce), ranging from concatenations of individually-captioned images to organically-multimodal documents scraped from noisy, user-generated web content. Despite having no supervision at the individual image-sentence level, our algorithms perform well on the same-document link prediction task. For example, on a visual storytelling dataset, we achieve 90+ AUC, even in the presence of a large number of sentences that do not correspond to any images in the document. Similarly, for organically-multimodal web data, we are able to surpass object-detection baselines by a wide margin, e.g., for a step-by-step recipe dataset, we improve precision by 20 points on link prediction within documents by leveraging document-level co-occurrence during training.
We conclude by using our algorithm to discover links within a Wikipedia image/text dataset that lacks ground-truth image-sentence links. While the predictions are imperfect, the algorithm qualitatively identifies meaningful patterns, such as matching an image of a dodo bird to one of two sentences (out of 100) in the corresponding article that mention "dodo".

Task Formulation
We assume as given a set of documents where each document $d_i = \langle S_i, V_i \rangle$ consists of a set $S_i$ of $n_i = |S_i|$ sentences and a set $V_i$ of $m_i = |V_i|$ images. For example, $d_i$ could be an article about Paris with $n_i = 100$ sentences and $m_i = 3$ images of, respectively, the Eiffel Tower, the Arc de Triomphe, and a map of Paris. For each $d_i$, we are to predict an alignment (where some sentences or images may not be aligned to anything) represented by a (potentially sparse) bipartite graph on $n_i$ sentence nodes and $m_i$ image nodes. During training, we are given no access to ground-truth image-sentence association graphs, i.e., we do not know a priori which images correspond to which sentences, only that all images/sentences in a document co-occur together; this is why we refer to our task as unsupervised.

(Data and code: www.cs.cornell.edu/jhessel/multiretrieval/multiretrieval.html. Sentences and images can be considered as sequences rather than sets in our framework, but unordered sets are more appropriate for modeling some of the crowd-sourced corpora we used in our experiments.)
We produce a dense sentence-to-image association matrix $M_i \in \mathbb{R}^{n_i \times m_i}$, in which each entry is the confidence that there is an (undirected) edge between the corresponding nodes. Applying different thresholding strategies to $M_i$'s values yields different alignment graphs.

Evaluation. When we have ground-truth alignment graphs for test documents, we evaluate the correctness of the association matrix $M_i$ predicted by our algorithms according to two metrics: AUROC (henceforth AUC) and precision-at-C (p@C). AUC, commonly used in evaluating link prediction (see Menon and Elkan (2011)), is the area under the curve of the true-positive/false-positive rate produced by sweeping over possible confidence thresholds; random is 50, perfect is 100. p@C measures the accuracy of the algorithm's C most confident predicted edges (in our case, the most confident edges correspond to the largest entries in $M_i$). This metric models cases where only a small number of high-confidence predictions need be made per document. We evaluate using $C \in \{1, 5\}$.
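Both metrics are simple to compute from a predicted confidence matrix and a binary ground-truth adjacency matrix. The sketch below (function names are ours, not from the released code) uses the rank-based formulation of AUC:

```python
import numpy as np

def auc(pred, truth):
    """AUC of link prediction: the probability that a randomly chosen true
    edge scores higher than a randomly chosen non-edge (Mann-Whitney U)."""
    scores, labels = pred.ravel(), truth.ravel().astype(bool)
    pos, neg = scores[labels], scores[~labels]
    # Count (true edge, non-edge) pairs where the true edge outranks
    # the non-edge; ties count 0.5. Scaled to 0-100 as in the paper.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return 100.0 * (greater + 0.5 * ties) / (len(pos) * len(neg))

def precision_at_c(pred, truth, c):
    """Fraction of the c most confident predicted edges that are true."""
    top = np.argsort(pred.ravel())[::-1][:c]
    return truth.ravel()[top].mean()

# Toy document: 3 sentences x 2 images, one true link per image.
M = np.array([[0.9, 0.1], [0.2, 0.8], [0.3, 0.2]])
truth = np.array([[1, 0], [0, 1], [0, 0]])
print(auc(M, truth))              # perfect ranking -> 100.0
print(precision_at_c(M, truth, 1))  # top edge (0.9) is a true link -> 1.0
```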

Models
Our algorithm is inspired by work in cross-modal retrieval (Rasiwasia et al., 2010; Hodosh et al., 2013; Costa Pereira et al., 2014; Kiros et al., 2014b). Instead of operating at the level of individual images/sentences, however, our training objective encourages image sets and sentence sets appearing in the same document to be more similar than non-co-occurring sets.

Alignment Model and Loss Function
We assume that the dimensionality $d_{multi}$ of the multimodal text-image space is predetermined.

Extracting sentence representations. We pass the words in each sentence through a 300D word-embedding layer initialized with GoogleNews-pretrained word2vec embeddings (Mikolov et al., 2013). We then pass the sequence of word vectors to a GRU (Cho et al., 2014) and extract and L2-normalize a $d_{multi}$-dimensional sentence representation from the final hidden state.
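As a sketch of this pipeline (in numpy, with randomly initialized weights standing in for the learned embedding table and GRU parameters; the dimensionality here is illustrative, the paper uses $d_{multi} = 1024$):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D_EMB, D_MULTI = 1000, 300, 64  # D_MULTI shrunk for illustration

# Stand-ins for learned parameters (word2vec-initialized embeddings, GRU weights).
E = rng.normal(size=(VOCAB, D_EMB))
Wz, Wr, Wh = (rng.normal(size=(D_EMB + D_MULTI, D_MULTI)) * 0.01
              for _ in range(3))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode_sentence(token_ids):
    """Embed tokens, run a GRU over them, L2-normalize the final state."""
    h = np.zeros(D_MULTI)
    for x in E[token_ids]:
        xh = np.concatenate([x, h])
        z = sigmoid(xh @ Wz)                                  # update gate
        r = sigmoid(xh @ Wr)                                  # reset gate
        h_tilde = np.tanh(np.concatenate([x, r * h]) @ Wh)    # candidate
        h = (1 - z) * h + z * h_tilde
    return h / np.linalg.norm(h)

s = encode_sentence([5, 17, 42])
print(s.shape, round(float(np.linalg.norm(s)), 3))  # (64,) 1.0
```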
Extracting image representations. We first compute a representation for each image using a convolutional neural network (CNN). The network's output is then mapped via affine projection to $\mathbb{R}^{d_{multi}}$ and L2-normalized.

Correspondence prediction. The result of running the two steps above on an image-set/text-set pair $\langle S, V \rangle$ is $|S| + |V|$ vectors, all in $\mathbb{R}^{d_{multi}}$. From these, we compute the similarity matrix $M \in \mathbb{R}^{|S| \times |V|}$, where the $(j, k)^{th}$ entry is the cosine similarity between the $j^{th}$ sentence vector and the $k^{th}$ image vector.

Training Objective. We train under the assumption that co-occurring image-set/sentence-set pairs should be more similar than non-co-occurring image-set/sentence-set pairs. We hope that use of this document-level objective will produce an $M_i$ offering reasonable intra-document information at test time, even though such information is not available at training time.
The training process is modulated by a similarity function $\mathrm{sim}(S, V)$ that measures the similarity between a set of sentences and a set of images by examining the entries of the individual image/sentence similarity matrix $M_i$ (specific definitions of $\mathrm{sim}(S, V)$ are proposed in §3.2). We use a max-margin loss with negative sampling: we iterate through true documents $d_i = \langle S_i, V_i \rangle$, and negatively sample at the document level a set of $b$ sets of images that did not co-occur with $S_i$, $\mathcal{V}' = \{V'_1, \ldots, V'_b\}$, and a set of $b$ sets of sentences that did not co-occur with $V_i$, $\mathcal{S}' = \{S'_1, \ldots, S'_b\}$. We then compute a loss for $\langle S_i, V_i \rangle$ by comparing the true similarities to the negative-sample similarities. We find that hard-negative mining (Dalal and Triggs, 2005; Schroff et al., 2015; Faghri et al., 2018), the technique of selecting the negative cases that maximally violate the margin within the minibatch, performs better than simple averaging. The loss for a single positive example is:

$$\max_{k} \, h_\alpha\big(\mathrm{sim}(S_i, V_i),\, \mathrm{sim}(S_i, V'_k)\big) \;+\; \max_{k} \, h_\alpha\big(\mathrm{sim}(S_i, V_i),\, \mathrm{sim}(S'_k, V_i)\big) \quad (1)$$

for hinge loss $h_\alpha(p, n) = \max(0, \alpha - p + n)$, where we set margin $\alpha = 0.2$ (Kiros et al., 2014a; Faghri et al., 2018).
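Given precomputed set-level similarities, the hard-negative hinge loss of Eq. 1 reduces to a few lines. A sketch (variable names are ours):

```python
def hinge(pos, neg, alpha=0.2):
    """h_alpha(p, n) = max(0, alpha - p + n)."""
    return max(0.0, alpha - pos + neg)

def document_loss(sim_true, sims_neg_images, sims_neg_sentences, alpha=0.2):
    """Hard-negative max-margin loss for one positive document.

    sim_true: sim(S_i, V_i) for the true pair.
    sims_neg_images: sim(S_i, V'_k) for each negatively sampled image set.
    sims_neg_sentences: sim(S'_k, V_i) for each negatively sampled sentence set.
    Hard-negative mining keeps only the worst margin violator on each side.
    """
    return (max(hinge(sim_true, n, alpha) for n in sims_neg_images) +
            max(hinge(sim_true, n, alpha) for n in sims_neg_sentences))

# The true pair is well separated from all but one negative on each side.
loss = document_loss(0.9, [0.1, 0.85, 0.2], [0.3, 0.8])
print(round(loss, 2))  # (0.2 - 0.9 + 0.85) + (0.2 - 0.9 + 0.8) = 0.25
```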

Similarity Functions
We explore several functions for measuring how similar a set of $n$ sentences $S$ is to a set of $m$ images $V$. All similarity functions convert the matrix $M \in \mathbb{R}^{n \times m}$ corresponding to $\langle S, V \rangle$ into a bipartite graph based on the magnitude of the entries. The functions differ in how they determine which entries $M_{jk}$ correspond to edges and edge weights.

Dense Correspondence (DC). The DC function assumes a dense correspondence between images and sentences; each sentence must be aligned to its most similar image, and vice versa, regardless of how small the similarity might be:

$$\mathrm{sim}_{DC}(S, V) = \frac{1}{n + m}\Big(\sum_{j} \max_{k} M_{jk} + \sum_{k} \max_{j} M_{jk}\Big)$$

The underlying assumption of this function can clearly be violated in practice: sentences can have no image, and images no sentence.
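A sketch of DC over a cosine-similarity matrix, alongside the Top-K variant described next; the averaged-max form of DC and the Top-K normalization here are our illustrative reading of the definitions, not the released implementation:

```python
import numpy as np

def sim_dc(M):
    """Dense Correspondence: every sentence aligns to its best image and
    every image to its best sentence; average the selected edge weights."""
    n, m = M.shape
    return (M.max(axis=1).sum() + M.max(axis=0).sum()) / (n + m)

def sim_tk(M, k):
    """Top-K: keep only the k strongest sentence->image and the k strongest
    image->sentence edges, so non-visual sentences can align to nothing."""
    s2i = np.sort(M.max(axis=1))[::-1][:k]   # best image per sentence
    i2s = np.sort(M.max(axis=0))[::-1][:k]   # best sentence per image
    return (s2i.sum() + i2s.sum()) / (2 * k)

M = np.array([[0.9, 0.0],
              [0.1, 0.7],
              [0.2, 0.1]])  # third sentence matches nothing well
print(round(sim_dc(M), 3))    # (0.9+0.7+0.2 + 0.9+0.7) / 5 = 0.68
print(round(sim_tk(M, 2), 3)) # drops the weak third sentence: 0.8
```

Note how Top-K simply ignores the low-similarity third sentence, while DC is forced to average it in.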
Top-K (TK). Instead of assuming that every sentence has a corresponding image and vice versa, in this function only the top $k$ most likely sentence ⇒ image (and image ⇒ sentence) edges are aligned. This process mitigates the effect of non-visual sentences by allowing algorithms to align them to no image. We discuss choices of $k$ for particular experimental settings in §4.1.

Assignment Problem (AP). We may wish to consider the image-sentence alignment task as a bipartite linear assignment problem (Kuhn, 1955), such that each image/sentence in a document has at most one association. Each time we compute $\mathrm{sim}(S, V)$ in the forward pass of our models, we solve the integer programming problem of maximizing $\sum_{i,j} M_{ij} x_{ij}$ subject to the constraints:

$$x_{ij} \in \{0, 1\}, \qquad \sum_{j} x_{ij} \le 1 \;\;\forall i, \qquad \sum_{i} x_{ij} \le 1 \;\;\forall j$$

Although computing the assignment involves a discrete optimization step, the model remains trainable: holding the solution fixed, the resulting similarity is differentiable with respect to the entries of $M$. Our forward pass uses tensorflow's python interface, tf.py_func, and the lapjv implementation of the JV algorithm (Jonker and Volgenant, 1987) to solve the integer program itself. Given the solution $x^*_{ij}$, we compute (and backpropagate gradients through) the similarity function $\mathrm{sim}(S, V) = \sum_{i,j} M_{ij} x^*_{ij} / r$, where $r$ is the number of nonzero $x^*_{ij}$. Should we want to impose an upper bound $k$ on the number of links, we can add the additional constraint $\sum_{i,j} x_{ij} \le k(S, V)$, applying Volgenant's (2004) polynomial-time algorithm. For example, one could set $k(S, V) = \frac{1}{2}\min(|S|, |V|)$. The JV algorithm's runtime is $O(\max(n, m)^3)$, and each positive example requires computing similarities for the positive case and the $2b$ negative samples from Eq. 1, for a per-example runtime of $O(b \cdot \max(n, m)^3)$. Fortunately, lapjv is highly optimized, so despite solving many integer programs, AP often runs faster than DC.

Figure 2: Sample documents from six of our datasets (RecipeQA, DIY, Imageclef-Wiki, Story-SIS, Story-DII, and MSCOCO). Image sets and sentence sets may be truncated due to space constraints. The example from Story-DII is harder than is typical, but we include it to illustrate a point regarding image spread made in §4.1. *** denotes text-chunk delimiters present in the original data.
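The AP similarity can be sketched with scipy's assignment solver; the paper's own implementation uses tf.py_func plus the lapjv package, and scipy.optimize.linear_sum_assignment is a convenient stand-in here:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def sim_ap(M):
    """Assignment-Problem similarity: solve the bipartite assignment
    maximizing sum_ij M_ij x_ij, with each sentence and each image used
    at most once, then average M over the selected edges."""
    rows, cols = linear_sum_assignment(M, maximize=True)
    return M[rows, cols].mean()  # divide by r = number of nonzero x*_ij

M = np.array([[0.9, 0.1, 0.0],
              [0.8, 0.7, 0.2]])
# A greedy alignment would give both sentences the first image; the
# assignment constraint forces sentence 2 onto image 2: edges (0,0), (1,1).
print(round(sim_ap(M), 2))  # (0.9 + 0.7) / 2 = 0.8
```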

Baselines
We construct two baseline similarity functions, as we are not aware of existing models that directly address our task in an unsupervised fashion.

Object Detection. For each image in the document, we use DenseNet169 (Huang et al., 2017) to find its K most probable ImageNet classes (e.g., "stingray"), and represent the image as the average of the word2vec embeddings of those K labels. We represent each sentence in a document as the mean word2vec embedding of its words. To form the strongest possible baseline, we compute the cosine similarity between all sentence-image pairs to form M for $K \in \{1, \ldots, 20\}$ and report the variant with the best post-hoc performance on the test set.

NoStruct. The similarity functions described in §3.2 rely on document-level, structural information, i.e., for a single image in a document, the other images in a document affect the overall similarity (and vice versa for sentences). However, this structural information may not be worth incorporating. Thus, we train a baseline that solely relies on
single image/single sentence co-occurrence statistics. At training time, we randomly sample a single image and a single sentence from a document, compute the cosine similarity of their vector representations, and treat that value as the document similarity. While the randomly sampled image/sentence will not truly correspond for every sample, we still expect this baseline to produce above-random results when averaged over many iterations, as true correspondences have some (low) probability of being sampled.
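The Object Detection baseline's scoring step can be sketched as follows; the embedding table and detector labels here are toy stand-ins for word2vec and DenseNet169 predictions:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy stand-in for a word2vec table.
emb = {w: rng.normal(size=300) for w in
       ["stingray", "fish", "a", "swims", "reef", "dog"]}

def avg_embedding(words):
    return np.mean([emb[w] for w in words], axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Image -> mean embedding of its top-K predicted ImageNet labels (K = 2 here);
# sentence -> mean embedding of its words.
image_vec = avg_embedding(["stingray", "fish"])
sent_vecs = [avg_embedding("a stingray swims".split()),
             avg_embedding("a dog".split())]
M_row = [cosine(image_vec, s) for s in sent_vecs]
# The sentence sharing the word "stingray" with the detector labels
# should score highest.
print(M_row.index(max(M_row)))
```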

Experiments on Crowdlabeled Data
Our first set of experiments uses four pre-existing datasets created by asking crowdworkers to add sentence-long textual descriptions to images in a collection. Image-sentence alignments are therefore known by construction. We do not use these labels at training time: gold-standard alignments are only used at evaluation time to compare performance between algorithms. Statistics of these datasets are given in the top half of Table 1, and example documents are given in Figure 2. Each crowdlabeled dataset is constructed to address a different question about our learning setting.

Q: Is this task even possible? Test: MSCOCO. MSCOCO (Lin et al., 2014) was created by crowdsourced manual captioning of single images. We construct "documents" from this data by first randomly aggregating five image-caption pairs. We then add five "distractor" images with no captions and five "distractor" captions with no images. Thus, a non-distractor image truly corresponds to the single caption that was written about it, and not to the other 9 captions in the document. There are a total of 10 images/sentences per document, and 5 ground-truth image-sentence links. A priori, we expect this to be the easiest setting for within-document disambiguation because mismatched images and sentences are completely independent.

Q: What if the images/sentences within a document are similar? Test: Story-DII. The visual storytelling corpus was built by asking crowdworkers to collect subsets of images contained in the same Flickr album (Thomee et al., 2016) that could be arranged into a visual story. In the Story-DII ("descriptions in isolation") case, (possibly different) crowdworkers subsequently captioned the images, but only saw each image in isolation. We construct a set of documents from Story-DII so that each contains five images and five sentences. Because images come from the same album, images and captions in our Story-DII "documents" are more similar to each other than those in our MSCOCO "documents."
Q: What if the sentences are cohesive and refer to each other? Test: Story-SIS. The creators of the same corpus also presented all the images in a subset from the same Flickr album to crowdworkers simultaneously and asked them to caption the image subsets collectively to form a story (SIS = "story in sequence"). In contrast to Story-DII, the generated sentences are generally not stand-alone descriptions of the corresponding image's contents, and may, for example, use pronouns to refer to elements from neighboring sentences and images.

Q: What if there are many sentences with no corresponding images? Test: DII-Stress. Because documents often have many sentences that do not directly refer to visual content, we constructed a setting with many more sentences than images. We augment documents from Story-DII with 45 randomly negatively sampled distractor captions.

We train for 50 epochs, and decrease the learning rate by a factor of 5 each time the loss in Eq. 1 over the dev set plateaus for more than 3 epochs. We set $d_{multi} = 1024$, and apply dropout with $p = .4$. At test time, we use the model checkpoint with the lowest dev error.

Crowdlabeled-Data Results
We tried all combinations of $b \in \{10, 20, 30\}$ and $\mathrm{sim} \in \{\mathrm{DC}, \mathrm{TK}, \mathrm{AP}\}$. For TK and AP we set the maximum link threshold $k$ to $\min(|S_i|, |V_i|)$ or $\lceil \frac{1}{2} \min(|S_i|, |V_i|) \rceil$ (denoted $\frac{1}{2}k$ in the results table). Table 2 shows test-set prediction results for $b = 10$ (results for $b \in \{20, 30\}$ are similar). The retrieval-style objectives we consider encourage algorithms to learn useful within-document representations, and incorporating a structured similarity is beneficial. All our algorithms outperform the strongest baseline (NoStruct) in all cases, e.g., by at least 10 absolute percentage points in p@1 on Story-DII.
We next show, as a sanity check, that our inter-document training objective (Eq. 1) corresponds to intra-document prediction performance (the actual function of interest). Figure 3 plots how both quantities vary with the number of epochs, for two different validation datasets. In general, inter-document performance and intra-document performance rise together during training; for a fixed neural architecture, models better at optimizing the inter-document loss in Eq. 1 also generally produce better intra-document representations.
In addition, we found that: i) DC, despite assuming every sentence corresponds to an image, achieves high performance on DII-Stress, even though 90% of its sentences do not correspond to an image; ii) allowing AP/TK to make fewer connections (i.e., setting $\frac{1}{2}k$) did not result in significant performance changes, even in the MSCOCO case, where the true number of links (5) was the same as the number of links accounted for by AP/TK + $\frac{1}{2}k$; and iii) adding topical cohesion (MSCOCO → Story-DII) makes the task more difficult, as does adding textual cohesion (Story-DII → Story-SIS).

Models have trouble with the same documents. We calculated AUC for each test document individually. The Spearman correlation between these individual-instance AUC values is very high: of all pairs in DC/TK/AP, over all crowdlabeled datasets at $b = 10$, DC vs. AP on MSCOCO had the lowest correlation, with $\rho = .89$.

Error analysis: content vs. spread. Why are some instances difficult for all of our algorithms? We consider two hypotheses. The "content" hypothesis is that multimodal relationships are harder to find for some concepts than for others: "beauty" may be hard to visualize, whereas "dog" is a concrete concept (Lu et al., 2008; Berg et al., 2010; Parikh and Grauman, 2011; Hessel et al., 2018; Mahajan et al., 2018). The "spread" hypothesis, which we introduce, is that documents with lower diversity among images/sentences may be harder to disambiguate at test time. For example, a document in which all images and all sentences are about horses requires finer-grained distinctions than a document with a horse, a barn, and a tractor. The Story-DII vs. Story-SIS example in Fig. 2 illustrates this contrast.
To quantify the spread of a document, we first extract vector representations of each test image/sentence. We then L2-normalize the vectors and compute the mean squared distance to their centroid; higher "spread" values indicate that a document's sentences/images are more diverse. To quantify the content of a document, for simplicity, we mean-pool the image/sentence representations and reduce to 20 dimensions with PCA.
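The spread statistic is straightforward to compute. A sketch (the toy vectors below stand in for the DenseNet169 image features and mean word2vec sentence features the analysis uses):

```python
import numpy as np

def spread(vectors):
    """Mean squared distance of L2-normalized vectors to their centroid.
    Higher values indicate a more diverse document."""
    X = np.asarray(vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    centroid = X.mean(axis=0)
    return float(((X - centroid) ** 2).sum(axis=1).mean())

# Two toy "documents" of four 3-D feature vectors each: one tightly
# clustered (all about the same thing), one diverse.
tight = [[1, 0.1, 0], [1, 0, 0.1], [0.9, 0.1, 0.1], [1, 0.05, 0]]
diverse = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
print(spread(tight) < spread(diverse))  # True: diverse set has higher spread
```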
We first compute an OLS regression of image spread + text spread on test AUC scores for Story-DII/Story-SIS/DII-Stress for AP with $b = 10$: 42/23/16% respectively (F-test $p \ll .01$) of the variance in AUC can be explained by the spread hypothesis alone. In general, documents with less diverse content are harder, with image spread explaining more variance than text spread. When we add the image+text content features, the proportion of AUC variance explained increases to 52/35/38%; thus, for these datasets, both the "content" and "spread" hypotheses independently explain document difficulty, though the relative importance of each varies across datasets.

Experiments on RQA and DIY
The previous datasets had captions added by crowdworkers for the explicit purpose of aiding research on grounding: for MSCOCO, annotators providing image captions were explicitly instructed to provide literal descriptions and "not describe what a person might say" (Chen et al., 2015). The manner in which users interact with multimodal content "in the wild" differs significantly from crowdlabeled data: Marsh and Domas White's (2003) 49-element taxonomy of multimodal relationships (e.g., "decorate", "reiterate", "humanize") observed in 45 web documents highlights the diversity of possible image-text relationships. We thus consider two datasets (one of which we release ourselves) of organically-multimodal documents scraped from web data, where the original authors created or selected both images and sentences. Statistics of these datasets are given in the bottom half of Table 1.

RQA. RecipeQA (Yagcioglu et al., 2018) is a question-answering dataset scraped from instructables.com consisting of images/descriptions of food preparation steps; we construct documents by treating each recipe step as a sentence. Users of the Instructables web interface put images and recipe steps in direct correspondence, which gives us a graph for test-time evaluation.

DIY (new). We downloaded a sample of 9K Reddit posts made to the community DIY ("do it yourself"). These posts consist of multiple images that users have taken of the progression of their construction projects, e.g., building a rock climbing wall (see Figure 2).

(Notes on the §4.1 spread analysis: we use DenseNet169 features for images and mean word2vec for sentences; we don't use internal model representations as we aim to quantify aspects of the dataset itself. MSCOCO is omitted from the regression because the AUC scores are all large.)
Users are encouraged to explicitly annotate individual images with captions, and, for evaluation, we treat a caption written alongside a given image as corresponding to a true link. (Recipe steps have variable length, are often not strictly grammatical sentences, and can contain lists, linebreaks, etc.; as with RQA, DIY captions are not always grammatical. For DIY, we required at least 25 upvotes per Reddit post to filter out spam and low-quality submissions.)

We adopt the same experimental protocols as in §4, but increase the maximum sentence token-length from 20 to 50; Table 3 shows the test-set results. In general, the algorithms we introduce again outperform the NoStruct baseline. In contrast to the crowdlabeled experiments, AP (slightly) outperformed the other algorithms. DIY is the most difficult among the datasets we consider.
To see if the algorithms err on the same instances, we again compute the Spearman correlation $\rho$ between test-instance AUC scores for DC/TK/AP, for $b = 10$. We find greater variation in performance on organically-multimodal data than on crowdlabeled data. For example, on RQA, DC and AP have a $\rho$ of only .64. We also repeat the regression on test-instance AUC scores introduced in §4.1, with different results: content generally explains more variance than spread, e.g., for AP, for RQA/DIY respectively, only 2/1% of the variance is explained by spread alone, but 18/13% is explained by spread+content.

Qualitative Exploration
To visualize the within-document predictions for document $i$, we compute $M_i$ and solve the linear assignment problem described in §3.2, taking the selected edges with the highest weights to be the most confident. Figure 4 contains example test predictions (along with $M_i$) from the datasets with ground-truth annotation. In an effort to provide representative cases, the selected examples have AUC scores close to average performance for their corresponding datasets.
The model mostly succeeds at associating literal objects and their descriptions: tennis players in MSCOCO, castles in Story-DII, a stapler in DIY, and bacon in a blender in RQA. Errors are often justifiable. For example, for the MSCOCO document, the chosen caption for a picture of two people playing baseball accurately describes the image, despite it having been written for a different image and thus counting as an error in our quantitative evaluation. Similarly, for RQA, a container of maple syrup is associated with a caption mentioning "syrup", which seems reasonable even though the recipe's author did not link that image/sentence.
In other cases, the algorithm struggles with what part of the image to "pay attention" to. In the Story-DII case (Figure 4b), the algorithm erroneously (but arguably justifiably) decides to assign a caption about a bride, groom, and a car to a picture of the couple, instead of to a picture of a vehicle.
For more difficult datasets like Story-SIS (Figure 4c), the algorithm struggles with ambiguity. For 2/5 sentences that refer to literal objects/actions (soup cans/laughter), the algorithm works well. The remaining 3 captions are general musings about working at a grocery store that could be matched to any of the three remaining images depicting grocery store aisles. DIY is similarly difficult, as many images/sentences could reasonably be assigned to each other.

WIKI.
We also constructed a dataset from English sentence-tokenized Wikipedia articles (not including captions) and their associated images from ImageCLEF2010 (Popescu et al., 2010). In contrast to RQA and DIY, there are no explicit connections between individual images and individual sentences, so we cannot compute AUC or precision, but this corpus represents an important organically-multimodal setting. We follow the same experimental settings as in §4 at training time, but instead of using pre-extracted features, we fine-tune the vision model's parameters. Examining the predictions of the AP + fine-tuned CNN model trained on WIKI shows many of the model's predictions to be reasonable. Figure 5 shows the model's 5 most confident predictions on the 100-sentence Wikipedia article about Mauritius, chosen for its high image/text spread.

Additional related work
Our similarity functions are inspired by work in aligning image fragments, such as object bounding boxes, with portions of sentences without explicit alignment supervision.

Figure 5: Predicted sentences, with cosine similarities, for images in a 100-sentence ImageCLEF Wikipedia article on Mauritius. The first three predictions are reasonable, the last two are not. The third result is particularly good given that only two sentences mention dodos; for comparison, the object-detection baseline's choice began "(Mauritian Creole people usually known as 'Creoles')".

Conclusion and Future Directions
We have demonstrated that a family of models for learning fine-grained image-sentence links within documents can produce good test-time results even if only given access to document-level co-occurrence at training time. Future work could incorporate better models of sequence within document context (Alikhani and Stone, 2018). While using structured loss functions improved performance, the image and sentence representations themselves have no awareness of neighboring images/sentences; this information should prove useful if modeled appropriately.