Linking Entities Across Images and Text

This paper describes a set of methods to link entities across images and text. As a corpus, we used a data set of images, where each image is accompanied by a short caption and where the regions in the images are manually segmented and labeled with a category. We extracted the entity mentions from the captions and we computed a semantic similarity between the mentions and the region labels. We also measured the statistical associations between these mentions and the labels and we combined them with the semantic similarity to produce mappings in the form of pairs consisting of a region label and a caption entity. In a second step, we used the syntactic relationships between the mentions and the spatial relationships between the regions to rerank the lists of candidate mappings. To evaluate our methods, we annotated a test set of 200 images, where we manually linked the image regions to their corresponding mentions in the captions. Eventually, we could match objects in pictures to their correct mentions for nearly 89 percent of the segments, when such a matching exists.


Introduction
Linking an object in an image to a mention of that object in an accompanying text is a challenging task, one that we can imagine being useful in a number of settings. It could, for instance, improve image retrieval by complementing the geometric relationships extracted from the images with textual descriptions from the text. A successful mapping would also make it possible to translate knowledge and information across image and text.
In this paper, we describe methods to link mentions of entities in captions to labeled image segments and we investigate how the syntactic structure of a caption can be used to better understand the contents of an image. We do not address the closely related task of object recognition in the images. This latter task can be seen as a complement to entity linking across text and images. See Russakovsky et al. (2015) for a description of progress and results to date in object detection and classification in images.
An Example

Figure 1 shows an example of an image from the Segmented and Annotated IAPR TC-12 data set (Escalantea et al., 2010). It has four regions labeled cloud, grass, hill, and river, and the caption: a flat landscape with a dry meadow in the foreground, a lagoon behind it and many clouds in the sky, containing mentions of five entities that we identify with the words meadow, landscape, lagoon, cloud, and sky. A correct association of the mentions in the caption to the image regions would map clouds to the region labeled cloud, meadow to grass, and lagoon to river.

[Figure 1: Image from the Segmented and Annotated IAPR TC-12 data set with the caption: a flat landscape with a dry meadow in the foreground, a lagoon behind it and many clouds in the sky]
This image, together with its caption, illustrates a couple of issues: the objects or regions labelled or visible in an image are not always mentioned in the caption, and for most of the images in the data set, more entities are mentioned in the captions than there are regions in the images. In addition, for the same entity, the words used to mention it are usually different from the words used as labels (the categories), as in the case of grass and meadow.

Previous Work
Related work includes the automatic generation of image captions that describe relevant objects in an image and their relationships. Kulkarni et al. (2011) assign each detected image object a visual attribute and a spatial relationship to the other objects in the image. The spatial relationships are translated into selected prepositions in the resulting captions. Elliott and Keller (2013) used manually segmented and labeled images and introduced visual dependency representations (VDRs) that describe spatial relationships between the image objects. The captions are generated using templates. Both Kulkarni et al. (2011) and Elliott and Keller (2013) used the BLEU score and human evaluators to assess the grammaticality of the generated captions and how well they describe the image.
Although much work has been done to link complete images to a whole text, there are only a few papers on the association of elements inside a text and an image. Naim et al. (2014) analyzed parallel sets of videos and written texts, where the videos show laboratory experiments. Written instructions are used to describe how to conduct these experiments. The paper describes models for matching objects detected in the video with mentions of those objects in the instructions. The authors mainly focus on objects that get touched by a hand in the video. For manually annotated videos, Naim et al. (2014) could match objects to nouns nearly 50% of the time. Karpathy et al. (2014) proposed a system for retrieving related images and sentences. They used neural networks and showed that the results improve when image objects and sentence fragments are included in the model. Sentence fragments are extracted from dependency graphs, where each edge in the graphs corresponds to a fragment.

Data Set
We used the Segmented and Annotated IAPR TC-12 Benchmark data set (Escalantea et al., 2010) that consists of about 20,000 photographs with a wide variety of themes. Each image has a short caption that describes its content, most often consisting of one to three sentences separated by semicolons. The images are manually segmented into regions with, on average, about 5 segments in each image.
Each region is labelled with one out of 275 predefined image labels. The labels are arranged in a hierarchy, where all the nodes are available as labels and where object is the top node. The labels humans, animals, man-made, landscape/nature, food, and other form the next level.

Entities and Mentions
An image caption describes a set of entities, the caption entities CE, where each entity CE_i is referred to by a set of mentions M. To detect them, we applied the Stanford CoreNLP pipeline (Toutanova et al., 2003) that consists of a part-of-speech tagger, lemmatizer, named entity recognizer (Finkel et al., 2005), dependency parser, and coreference resolver. We considered each noun in a caption as an entity candidate. If an entity CE_i had only one mention M_j, we identified it by the head noun of its mention. We represented the entities mentioned more than once by the head noun of their most representative mention. We applied the entity extraction to all the captions in the data set, and we found 3,742 different nouns or noun compounds to represent the entities.
In addition to the caption entities, each image has a set of labeled segments (or regions) corresponding to the image entities, IE. The Cartesian product of these two sets results in a set of pairs P enumerating all the possible mappings of caption entities to image labels. We considered a pair (IE_i, CE_j) a correct mapping if the image label IE_i and the caption entity CE_j referred to the same entity. We represented a pair by the region label and the identifier of the caption entity, i.e. the head noun of the entity mention. In Fig. 1, the correct pairs are (grass, meadow), (river, lagoon), and (cloud, clouds).
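As a sketch of this pair-generation step, the Cartesian product can be written directly with Python's itertools; the labels and identifiers below are the ones from the example in Fig. 1:

```python
from itertools import product

def candidate_pairs(image_labels, caption_entities):
    """All possible (image label, caption entity) mappings for one image."""
    return list(product(image_labels, caption_entities))

# Region labels and entity identifiers from the example in Fig. 1
pairs = candidate_pairs(
    ["cloud", "grass", "hill", "river"],
    ["meadow", "landscape", "lagoon", "cloud", "sky"],
)
print(len(pairs))  # 4 regions x 5 entities = 20 candidate pairs
```

Only three of these twenty candidates are correct mappings, which is what the ranking functions in the following sections are designed to find.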

Building a Test Set
As the Segmented and Annotated IAPR TC-12 data set does not provide information on links between the image regions and the mentions, we annotated a set of 200 randomly selected images from the data set to evaluate the automatic linking accuracy. We assigned the image regions to entities in the captions and we excluded these images from the training set. The annotation does not always produce a 1:1 mapping of caption entities to regions. In many cases, objects are grouped or divided into parts differently in the captions and in the segmentation. We created a set of guidelines to handle these mappings in a consistent way. Table 1 shows the sizes of the different image sets and the fraction of image regions that have a corresponding entity mention in the caption.

[Table 1: Sizes of the image sets and the fraction of image regions that have a corresponding entity mention in the caption]

Ranking Entity Pairs
To identify the links between the regions of an image and the entity identifiers in its caption, we first generated all the possible pairs. We then ranked these pairs using a semantic distance derived from WordNet (Miller, 1995), statistical association metrics, and finally, a combination of both techniques.

Semantic Distance
The image labels are generic English words that are semantically similar to those used in the captions. In Fig. 1, cloud and clouds are used both as label and in the caption, but the region labeled grass is described as a meadow and the region labeled river, as a lagoon. We used the WordNet Similarity for Java library (WS4J) (Shima, 2014) to compute the semantic similarity of the region labels and the entity identifiers. WS4J comes with a number of metrics that approximate similarity as distances between WordNet synsets: PATH, WUP (Wu and Palmer, 1994), RES (Resnik, 1995), JCN (Jiang and Conrath, 1997), HSO (Hirst and St-Onge, 1998), LIN (Lin, 1998), LCH (Leacock and Chodorow, 1998), and LESK (Banerjee and Pedersen, 2002).
We manually lemmatized and simplified the image labels and the entity mentions so that they are compatible with WordNet entries. It resulted in a smaller set of labels: 250 instead of the 275 original labels. We also simplified the named entities from the captions. When a person or location was not present in WordNet, we used its named entity type as identifier. In some cases, it was not possible to find an entity identifier in WordNet, mostly due to misspellings in the caption, like buldings, or buidling, or because of POS-tagging errors. We chose to identify these entities with the word entity. The normalization reduced the 3,742 entity identifiers to 2,216 unique ones.
Finally, we computed a 250 × 2,216 matrix containing the similarity scores for each (image label, entity identifier) pair under each of the WS4J semantic similarity metrics.
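The WS4J computation itself cannot be reproduced in a few lines, but the PATH metric it implements reduces to 1/(1 + shortest path length) between synsets. The sketch below applies that formula to a small hypothetical hypernym graph (our own invention, not the real WordNet hierarchy) to fill part of such a similarity matrix:

```python
from collections import deque

# Toy hypernym graph (undirected edges between a node and its hypernym);
# the real system queries WordNet through WS4J instead.
edges = {
    "entity": ["object"],
    "object": ["entity", "vegetation", "water", "sky"],
    "vegetation": ["object", "grass", "meadow"],
    "water": ["object", "river", "lagoon"],
    "grass": ["vegetation"], "meadow": ["vegetation"],
    "river": ["water"], "lagoon": ["water"],
    "sky": ["object"],
}

def path_similarity(a, b):
    """PATH metric: 1 / (1 + length of the shortest path between two nodes)."""
    if a == b:
        return 1.0
    seen, queue = {a}, deque([(a, 0)])
    while queue:  # breadth-first search for the shortest path to b
        node, dist = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt == b:
                return 1.0 / (1 + dist + 1)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return 0.0

labels = ["grass", "river"]
identifiers = ["meadow", "lagoon", "sky"]
sim = {(l, i): path_similarity(l, i) for l in labels for i in identifiers}
# grass and meadow share the hypernym "vegetation": path length 2, score 1/3
```

In the toy hierarchy, grass is closer to meadow (score 1/3) than to sky (score 1/4), mirroring the kind of evidence the real metrics provide.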

Statistical Associations
We used three functions to reflect the statistical association between an image label and an entity identifier:
• Co-occurrence counts, i.e. the frequencies of the region labels and entity identifiers that occur together in the pictures of the training set;
• Pointwise mutual information (PMI) (Fano, 1961) that compares the joint probability of the occurrence of an (image label, entity identifier) pair to the independent probability of the region label and the caption entity occurring by themselves; and finally
• The simplified Student's t-score as described in Church and Mercer (1993).
As with the semantic similarity scores, we used matrices to hold the scores for all the (image label, entity identifier) pairs for the three association metrics.
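Under the usual definitions, PMI and the simplified t-score can be computed from such count matrices as sketched below; the counts here are invented for illustration, not taken from the data set:

```python
import math

# Illustrative counts over a hypothetical training set of N images
N = 10000
count_label = {"grass": 3000}            # images whose segments carry the label
count_entity = {"meadow": 800}           # captions mentioning the identifier
count_pair = {("grass", "meadow"): 600}  # images where both occur together

def pmi(label, entity):
    """Pointwise mutual information: log2 of joint over independent probability."""
    p_joint = count_pair[(label, entity)] / N
    p_l = count_label[label] / N
    p_e = count_entity[entity] / N
    return math.log2(p_joint / (p_l * p_e))

def t_score(label, entity):
    """Simplified Student's t-score: (observed - expected) / sqrt(observed)."""
    f = count_pair[(label, entity)]
    expected = count_label[label] * count_entity[entity] / N
    return (f - expected) / math.sqrt(f)

print(round(pmi("grass", "meadow"), 2), round(t_score("grass", "meadow"), 2))
# prints: 1.32 14.7
```

Both scores are positive here because grass and meadow co-occur far more often than chance would predict.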

The Mapping Algorithm
To associate the region labels of an image to the entities in its caption, we mapped the label L_i to the caption entity E_j that had the highest score with respect to L_i. We did this for the three association scores and the eight semantic metrics. Note that a region label is not systematically paired with the same caption entity, since each caption contains different sets of entities.
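A minimal sketch of this argmax assignment (the scores are invented, not values from the corpus):

```python
def map_regions(image_labels, caption_entities, score):
    """Assign each region label the caption entity with the highest score."""
    return {
        label: max(caption_entities, key=lambda e: score(label, e))
        for label in image_labels
    }

# Hypothetical co-occurrence-style scores for two regions of Fig. 1
scores = {
    ("grass", "meadow"): 600, ("grass", "sky"): 900,
    ("cloud", "cloud"): 1200, ("cloud", "sky"): 700,
}
score = lambda l, e: scores.get((l, e), 0)
mapping = map_regions(["grass", "cloud"], ["meadow", "cloud", "sky"], score)
# With these counts, grass is (incorrectly) mapped to sky -- the kind of
# error the reranker in Sect. 6 is designed to correct.
```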
Background and foreground are two of the most frequent words in the captions and they were frequently assigned to image regions. Since they rarely represent entities, but merely tell where the entities are located, we included them, together with middle, left, right, and front, in a list of stop words that we removed from the identifiers.
We applied the linking algorithm to the annotated set. We formed the Cartesian product of the image labels and the entity identifiers and, for each image region, we ranked the caption entities using the individual scoring functions. This resulted in an ordered list of entity candidates for each region. Table 2 shows the average ranks of the correct candidate for each of the scoring functions and the total number of correct candidates at different ranks.

Reranking
The algorithm in Sect. 5.3 determines the relationship holding between a pair of entities, where one element in the pair comes from the image and the other from the caption. The entities on each side are considered in isolation. We extended their description with relationships inside the image and the caption. Weegar et al. (2014) showed that pairs of entities in a text that were linked by the prepositions on, at, with, or in often corresponded to pairs of segments that were close to each other. We further investigated the idea that spatial relationships in the image relate to syntactic relationships in the captions and we implemented it in the form of a reranker.
For each label-identifier pair, we included the relationship between the image segment in the pair and the closest segment in the image. As in Weegar et al. (2014), we defined the closeness as the Euclidean distance between the gravity centers of the bounding boxes of the segments. We also added the relationship between the caption entity in the label-identifier pair and the entity mentions which were the closest in the caption. We parsed the captions and we measured the distance as the number of edges between the two entities in the dependency graph.
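The closeness computation on the image side is a plain Euclidean distance between bounding-box centers; the coordinates below are hypothetical:

```python
import math

def bbox_center(x_min, y_min, x_max, y_max):
    """Gravity center of a segment's bounding box."""
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)

def segment_distance(box_a, box_b):
    """Euclidean distance between the gravity centers of two bounding boxes."""
    (xa, ya), (xb, yb) = bbox_center(*box_a), bbox_center(*box_b)
    return math.hypot(xa - xb, ya - yb)

# Hypothetical pixel coordinates (x_min, y_min, x_max, y_max) for two segments
grass = (0, 300, 480, 360)
river = (0, 250, 480, 300)
print(segment_distance(grass, river))  # 55.0
```

For each segment, the segment minimizing this distance is taken as its closest neighbor.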

Spatial Features
The Segmented and Annotated IAPR TC-12 data set comes with annotations for three different types of spatial relationships holding between the segment pairs in each image: topological, horizontal, and vertical (Hernández-Gracidas and Sucar, 2007). The possible values are adjacent or disjoint for the topological category, beside or horizontally aligned for the horizontal one, and finally above, below, or vertically aligned for the vertical one.

Syntactic Features
The syntactic features are all based on the structure of the sentences' dependency graphs. We followed the graph from the caption entity in the pair to extract its closest ancestors and descendants. We only considered children to the right of the candidate. We also included all the prepositions between the entity and these ancestors and descendants. Figure 2 shows the dependency graph of the sentence a flat landscape with a dry meadow in the foreground. The descendants of the landscape entity are meadow and foreground, linked respectively by the prepositions with and in. Its ancestor is the root node and the distance between landscape and meadow is 2. The syntactic features we extract for the entities in this sentence, arranged in the order ancestor, distance to ancestor, preposition, descendant, distance to descendant, and preposition, are for landscape, (root, 1, null, meadow, 2, with) and (root, 1, null, foreground, 2, in); for meadow, (landscape, 2, with, null, -, null); and for foreground, (landscape, 2, in, null, -, null). We discard foreground as it is part of the stop words.
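The distance computation can be sketched over a toy dependency graph of the example sentence; we keep the prepositions as graph nodes so that the edge counts match those given in the text (this encoding is our own simplification):

```python
from collections import deque

# Toy dependency graph of "a flat landscape with a dry meadow in the
# foreground", with prepositions kept as nodes (undirected adjacency).
graph = {
    "root": ["landscape"],
    "landscape": ["root", "with", "in"],
    "with": ["landscape", "meadow"],
    "in": ["landscape", "foreground"],
    "meadow": ["with"],
    "foreground": ["in"],
}

def graph_distance(a, b):
    """Number of edges between two nodes in the dependency graph (BFS)."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return -1

# (ancestor, distance, preposition, descendant, distance, preposition)
landscape_features = [
    ("root", graph_distance("landscape", "root"), None,
     "meadow", graph_distance("landscape", "meadow"), "with"),
    ("root", graph_distance("landscape", "root"), None,
     "foreground", graph_distance("landscape", "foreground"), "in"),
]
```

The computed distances (1 to root, 2 to meadow and foreground) reproduce the feature tuples listed for landscape above.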

Pairing Features
The single features consist of the label, entity identifier, and score of the pair. To take interaction into account, we also paired features characterizing properties across image and text. The list of these features is (Table 3):
1. The label of the image region and the identifier of the caption entity. In Fig. 1, we create grass meadow from (grass, meadow).
2. The label of the closest image segment paired with the ancestor of the caption entity. The closest segment of the grass segment is river and the ancestor of meadow is landscape. This gives the paired feature meadow landscape. The labels of the segments closest to the current segment and the descendant of meadow are also paired.

[Table 3: The reranking features using the current segment and its closest segment in the image]
Anc_ClosestSeg: Closest segment label with the ancestor of the caption entity
Desc_ClosestSeg: Closest segment label with the descendant of the caption entity
AncDist: Distance between the ancestor and the caption entity, and distance between the segments
DescDist: Distance between the descendant and the caption entity, and distance between the segments
TopoRel_DescPreps: Topological relationship between the segments and the prepositions linking the caption entity with its descendant
TopoRel_AncPreps: Topological relationship between the segments and the prepositions linking the caption entity with its ancestor
XRel_DescPreps: Horizontal relationship between the segments and the prepositions linking the caption entity with its descendant
XRel_AncPreps: Horizontal relationship between the segments and the prepositions linking the caption entity with its ancestor
YRel_DescPreps: Vertical relationship between the segments and the prepositions linking the caption entity with its descendant
YRel_AncPreps: Vertical relationship between the segments and the prepositions linking the caption entity with its ancestor
SegmentDist: Distance (in pixels) between the gravity centers of the bounding boxes framing the two closest segments
3. The distance between the segment pairs in the image, divided into seven intervals, paired with the distance between the caption entities. We measured the image distance in pixels since all the images have the same pixel dimensions.
4. The spatial relationships of the closest segments paired with the prepositions found between their corresponding caption entities. The segments grass and river in the image are adjacent and horizontally aligned, and grass is located below the segment labeled river. Each of the spatial features is paired with the prepositions for both the ancestor and the descendant.
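Putting the list above together, the paired feature strings could be assembled as in the sketch below; the helper, its context dictionary, and the chosen subset of Table 3 feature names are our own illustration, with values taken from the running example:

```python
def paired_features(pair, context):
    """Cross-modal features pairing image-side and caption-side properties.

    `context` holds hypothetical precomputed values for the pair."""
    label, entity = pair
    return {
        "Label_Identifier": f"{label}_{entity}",
        "Anc_ClosestSeg": f"{context['closest_segment']}_{context['ancestor']}",
        "AncDist": f"{context['segment_dist_interval']}_{context['ancestor_dist']}",
        "TopoRel_AncPreps": f"{context['topological']}_{context['anc_prep']}",
        "XRel_AncPreps": f"{context['horizontal']}_{context['anc_prep']}",
        "YRel_AncPreps": f"{context['vertical']}_{context['anc_prep']}",
    }

# Values for the pair (grass, meadow) from the example in Fig. 1
feats = paired_features(("grass", "meadow"), {
    "closest_segment": "river", "ancestor": "landscape",
    "segment_dist_interval": "a", "ancestor_dist": 2,
    "topological": "adjacent", "horizontal": "horizontally_aligned",
    "vertical": "below", "anc_prep": "with",
})
print(feats["Label_Identifier"])  # grass_meadow
```

Each such string becomes one sparse feature in the reranker's input vector.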
We trained the reranking models from the pairs of labeled segments and caption entities, where the correct mappings formed the positive examples and the rest, the negative ones. In Fig. 1, the mapping (grass, meadow) is marked as correct for the region labeled grass, while the mappings (grass, lagoon) and (grass, cloud) are marked as incorrect. We used the manually annotated images (200 images, Table 1) as training data, a leave-one-out cross-validation, and L2-regularized logistic regression from LIBLINEAR (Fan et al., 2008). We applied a cutoff of 3 for the list of candidates in the reranking and we multiplied the original score of the label-identifier pairs with the reranking probability.

[Table 4: An example of an assignment before (upper part) and after (lower part) reranking. The caption entities are ranked according to the number of co-occurrences with the label. We obtain the new score for a label-identifier pair by multiplying the original score by the output of the reranker for this pair.]

Table 4, upper part, shows the assignments for the four regions in Fig. 1. The column Entity 1 shows that the scoring function maps the caption entity sky to all of the regions. We created a reranker feature vector for each of the 8 label-identifier pairs. Table 5 shows two of them, corresponding to the pairs (grass, sky) and (grass, meadow). The pair (grass, meadow) is a correct mapping, but it has a lower co-occurrence score than the incorrect pair (grass, sky).

Reranking Example
In the cross-validation evaluation, we applied the classifier to these vectors and we obtained the reranking scores of 0.0244 for (grass, sky) and 0.79 for (grass, meadow) resulting in the respective final scores of 36 and 699. Table 4, lower part, shows the new rankings, where the highest scores correspond to the associations: (cloud, cloud), (grass, meadow), (hill, landscape), and (river, cloud), which are all correct except the last one.
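The rescoring step itself is a multiplication followed by a re-sort; the sketch below reproduces the numbers of this example under the assumption (ours) that the original co-occurrence scores were roughly 1475 and 885:

```python
def rerank(candidates, rerank_prob):
    """Multiply each candidate's original score by the reranker's probability
    and sort the candidates by the combined score."""
    rescored = [
        (entity, score * rerank_prob(entity)) for entity, score in candidates
    ]
    return sorted(rescored, key=lambda t: t[1], reverse=True)

# Candidates for the region labeled grass, scored by co-occurrence counts
# (assumed values), with the reranking probabilities reported above
candidates = [("sky", 1475), ("meadow", 885)]
probs = {"sky": 0.0244, "meadow": 0.79}
print(rerank(candidates, probs.get))
# meadow now outranks sky: 885 * 0.79 ≈ 699 vs 1475 * 0.0244 ≈ 36
```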

Individual Scoring Functions
We evaluated the three scoring functions: co-occurrence, mutual information, and t-score, as well as the semantic similarity functions. Each labeled segment in the annotated set was assigned the caption entity that gave the highest scoring label-identifier pair.
To address the lack of annotated data, we also investigated a self-training method. We used the statistical associations we derived from the training set and we applied the mapping procedure in Sect. 5.3 to this set. We repeated this procedure with the three statistical scoring functions. We counted all the mappings we obtained between the region labels and the caption identifiers and we used these counts to create three new scoring functions, denoted with a distinguishing sign. Table 6 shows the performance comparison between the different functions. The second column shows how many correct mappings were found by each function. The fourth column shows the improved score when the stop words were removed. The removal of the stop words as entity candidates improved the co-occurrence and t-score scoring functions considerably, but provided only marginal improvement for the scoring functions based on semantic similarity and pointwise mutual information. The percentage of correct mappings is based on the 730 regions that have a matching caption entity in the annotated test set.

[Table 5: Feature vectors for the pairs (grass, meadow) and (grass, sky). The ancestor distance 2 means that there are two edges in the dependency graph between the words meadow and landscape, and a represents the smallest of the distance intervals, meaning that the two segments grass and river are less than 50 pixels apart.]
The semantic similarity functions (PATH, HSO, JCN, LCH, LESK, LIN, RES, and WUP) outperform the statistical ones, and the self-trained versions of the statistical scoring functions yield better results than the original ones.
We applied an ensemble voting procedure with the individual scoring functions, where each function was given a number of votes to place on its preferred label-identifier pair. We counted the votes and the entity that received the majority of the votes was selected as the mapping for the current label.
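A minimal sketch of this majority vote, with three invented scoring functions that disagree on the grass region:

```python
from collections import Counter

def ensemble_vote(label, scoring_functions, caption_entities):
    """Each scoring function votes for its preferred caption entity;
    the entity with the most votes is selected for the region label."""
    votes = Counter(
        max(caption_entities, key=lambda e: score(label, e))
        for score in scoring_functions
    )
    return votes.most_common(1)[0][0]

# Three hypothetical scoring functions: two prefer meadow, one prefers sky
f1 = lambda l, e: {"meadow": 0.9, "sky": 0.2}[e]
f2 = lambda l, e: {"meadow": 0.1, "sky": 0.8}[e]
f3 = lambda l, e: {"meadow": 0.7, "sky": 0.5}[e]
print(ensemble_vote("grass", [f1, f2, f3], ["meadow", "sky"]))  # meadow
```

The vote lets functions with complementary strengths outvote a single function's mistake.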

Reranking
We reranked all the scoring functions using the methods described in Sect. 6. We used the three label-identifier pairs with the highest score for each segment and function to build the model and we also reranked the top three label-identifier pairs for each of the assignments. Table 8 shows the results we obtained with the reranker compared to the original scoring functions. The reranking improved the results for all the scoring functions. Figure 3 shows the comparison between the original scoring functions, the scoring functions without stop words, and the reranked versions. There is a total of 928 segments, where 730 have a matching entity in the caption.
We applied an ensemble voting with the reranked functions (Table 9). Reranking yields a significant improvement for the statistical scoring functions. When they get one vote each in the ensemble voting, the results increase from 52% correct mappings to 75%. When used in an ensemble with the semantic similarity scoring functions, the results improve further.

We also evaluated ensemble voting with different numbers of votes for the different functions. We tested all the permutations of integer weights in the interval {0, ..., 3} on the development set. Table 10 shows the best result for both the original assignments and the reranked assignments on the test set. The reranked assignments gave the best results, 88.76% correct mappings, which is also the best result we have been able to reach.
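The weight search can be sketched as a grid search over integer vote counts; the evaluation function below is a stand-in for the paper's development-set accuracy, not the real measure:

```python
from itertools import product

def best_weighting(functions, eval_accuracy, max_weight=3):
    """Grid-search integer vote weights in {0, ..., max_weight} for each
    scoring function, keeping the weighting with the best accuracy."""
    best = (None, float("-inf"))
    for weights in product(range(max_weight + 1), repeat=len(functions)):
        acc = eval_accuracy(dict(zip(functions, weights)))
        if acc > best[1]:
            best = (weights, acc)
    return best

# Stand-in evaluation: pretend accuracy peaks when the semantic functions
# get more votes than the statistical one
functions = ["co-occurrence", "PATH", "LIN"]
fake_eval = lambda w: w["PATH"] + w["LIN"] - 0.5 * w["co-occurrence"]
weights, acc = best_weighting(functions, fake_eval)
print(weights)  # (0, 3, 3)
```

With three functions and weights 0 to 3, only 64 weightings exist, so exhaustive search is cheap; the paper's eleven-odd functions make the grid larger but still tractable.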

Conclusion and Future Work
The extraction of relations across text and image is a new area of research. We showed in this paper that we could use semantic and statistical functions to link the entities in an image to mentions of the same entities in captions describing this image. We also showed that using the syntactic structure of the caption and the spatial structure of the image improves the linking accuracy. Eventually, we managed to correctly map nearly 89% of the image segments in our data set, counting only segments that have a matching entity in the caption.
The semantic similarity functions form the most accurate mapping tool when the functions are used in isolation. The statistical functions significantly improve their results when they are used in an ensemble. This shows that it is preferable to use multiple scoring functions, as their different properties contribute to the final score.
Including the syntactic structures of the captions and pairing them with the spatial structures of the images is also useful when mapping entities to segments. By training a model on such features and using this model to rerank the assignments, the ordering of entities in the assignments is improved, with better precision for all the scoring functions.
Although we used images manually annotated with segments and labels, we believe the methods we described here can be applied to automatically segmented and labeled images. Using image recognition would certainly introduce incorrectly classified image regions and would thus probably decrease the linking scores.