The Lexical Gap: An Improved Measure of Automated Image Description Quality

The challenge of automatically describing images and videos has stimulated much research in Computer Vision and Natural Language Processing. In order to test the semantic abilities of new algorithms, we need reliable and objective ways of measuring progress. We show that standard evaluation measures do not take into account the semantic richness of a description, and give the impression that sparse machine descriptions outperform rich human descriptions. We introduce and test a new measure of semantic ability based on relative lexical diversity. We show how our measure can work alongside existing measures to achieve state-of-the-art correlation with human judgement of quality. We also introduce a new dataset, Rich-Sparse Descriptions, which provides 2K human and machine descriptions to stimulate interest in the semantic evaluation of machine descriptions.


Introduction
Image and video processing systems are being developed for a wide variety of semantically rich tasks, such as storytelling (Zhu et al., 2015), Visual Question Answering (VQA) (Anderson et al., 2017; Teney et al., 2016; Wu et al., 2016), and engaging in visual dialogue (Jain et al., 2018). In this paper, we consider the task of Image Description (Lin et al., 2014; Hodosh et al., 2015; Plummer et al., 2017). Closing the semantic gap between human and machine descriptions requires robust and standardised measures of performance. In classical computer vision problems such as object detection, segmentation and classification, quality can be defined easily as a comparison between machine predictions and reference answers. Standard measures of image description quality consider the alignment of candidate sentences with ground truth sentences. However, defining a set of "correct" answers for a given image is restrictive, as an image may contain diverse semantic information. Consequently, we find that semantically rich and detailed content is regarded very poorly by such measures: the more sparse and simplistic the reference data and predictions, the higher the score. In summary:
1. We sourced 2K human and machine descriptions, which we used to show that standard automated measures of quality give an incomplete picture of semantic ability. The measures produce higher scores when candidates and reference data are semantically sparse, and lower scores on richer descriptions.
2. We show that measuring the relative lexical diversity of a system is a better indicator of semantic ability. We define two measures of relative diversity, and show that when combined with standard measures, they achieve state-of-the-art correlation with human judgement.
We hope our work will stimulate research into more advanced measures of semantic ability, helping to close the gap between human and machine descriptions.

Relevant Literature
The predominant approach to generating original descriptions is to encode visual data into semantically useful features, which are then decoded into language. The capability of Convolutional Neural Networks (CNNs) and their variants for extracting spatial features is well established in Computer Vision. Pre-training the network on a dataset such as ImageNet (which already embeds images based on the WordNet nouns contained within them) provides spatial features which accurately predict common nouns. In language generation, it is common to use a gated recurrent neural network which predicts a probability distribution across the vocabulary, given prior states and spatial features (Long et al., 2014). Many systems have evolved from this fundamental approach, and we refer interested readers to surveys on such developments (Bernardi et al., 2016; Aafaq et al., 2018). Systems are typically trained end-to-end on one of a number of image description datasets. Relevant to this paper are MS-COCO (Lin et al., 2014), Flickr8k (Hodosh et al., 2015), and Flickr30k (Plummer et al., 2017).

Methods of Evaluation
Objective measures of performance enable the automatic evaluation of systems across large datasets, avoiding the laborious process of sourcing human judgements. The measures divide into three groups:
1. Machine Translation measures: Early description systems treated image description as a translation task, in which information in the visual domain is translated to the linguistic domain. As such, they adopted machine translation measures based on n-gram alignment, such as BLEU (Papineni et al., 2002), ROUGE (Lin and Hovy, 2003) and METEOR (Denkowski and Lavie, 2014).
2. Captioning Measures: CIDEr (Vedantam et al., 2015) and SPICE (Anderson et al., 2016) were designed specifically for the description task. CIDEr addresses the problem of description diversity by rewarding candidates that match the consensus of references. SPICE applies work from scene graph generation (Schuster et al., 2015) to create semantic graph representations of the candidate and ground truth.
3. Neural Network Evaluation: Neural networks can be trained to evaluate descriptions. NNEVAL (Sharif et al., 2018) is a network trained to predict whether a description is human or machine, using both the captioning and translation measures as linguistic features.
As automated measures are a substitute for human evaluation, they are compared on the basis of their ability to correlate with human judgement. The poor correlation of translation measures is well known (Bernardi et al., 2016; Chen and Dolan, 2011), while captioning measures show improved results. In this work we assess correlation using the Composite dataset (Aditya et al., 2015), in which human and machine captions for images in subsets of MS-COCO, Flickr8K and Flickr30K were judged by Amazon Mechanical Turk workers and rated for correctness and completeness.

Lexical Diversity (LD)
The ability of text or speech to convey information specifically and articulately is widely studied, and is of interest in areas such as language learning, educational psychology and the study of speech impediments (Durán et al., 2004; Jarvis, 2013). An indicator of such fluency is Lexical Diversity (LD), a measure of the distribution of words used in a sample text. A simple measure such as the Type Token Ratio (TTR) considers the number of unique words used relative to the total number of words in a sample. However, TTR disadvantages longer texts, because for every additional word added to a corpus, the probability that it will be novel decreases. Such a measure would therefore be difficult to apply to a large scale image description corpus. A variety of measures derived from TTR have been proposed to address the issue of sample size, such as the rate at which the TTR falls as successive tokens are added to the text (Jarvis, 2013). A curve with a smaller negative gradient (a slower decay) demonstrates more diversity than one which declines rapidly, and its parameters can be found with a numerical method (Durán et al., 2004). We later illustrate the application of this to image descriptions. More recent measures such as MTLD (McCarthy and Jarvis, 2010) consider the mean length of word strings for a particular TTR.
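TTR and its decay with sample length can be computed in a few lines of Python; the example sentences below are illustrative stand-ins, not items from our dataset.

```python
def ttr(tokens):
    """Type Token Ratio: unique word types divided by total tokens."""
    return len(set(tokens)) / len(tokens)

def ttr_curve(tokens):
    """TTR after each successive token; repetitive text decays quickly,
    while diverse text stays close to 1."""
    types, curve = set(), []
    for i, tok in enumerate(tokens, start=1):
        types.add(tok)
        curve.append(len(types) / i)
    return curve

rich = "weary commuters shelter beneath torn umbrellas during heavy rain".split()
sparse = "a man with a umbrella in the rain with a man".split()

print(ttr(rich))    # 1.0 -- every word type is novel
print(ttr(sparse))  # ~0.64 -- repetition lowers the ratio
```

Because every appended token can only keep the type count the same or raise it by one, the curve is non-increasing in expectation, which is exactly the length bias discussed above.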
The Hypergeometric Distribution D (HD-D) measure (McCarthy and Jarvis, 2010) calculates the probability that, for a random sample of words from a corpus, a particular token will be selected a certain number of times.
Here we use HD-D for its simple implementation, lower sensitivity to corpus size and wide use in the literature, but our method could be applied with a different LD measure.
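A minimal HD-D implementation, following McCarthy and Jarvis (2010): for each word type, it computes the hypergeometric probability that the type appears at least once in a random sample of tokens drawn without replacement, and scales each contribution by the sample size (42 tokens in the original formulation).

```python
from collections import Counter
from math import comb

def hdd(tokens, sample_size=42):
    """HD-D: sum over word types of P(type appears in a random
    `sample_size`-token sample drawn without replacement), with each
    contribution scaled by 1/sample_size."""
    n = len(tokens)
    sample_size = min(sample_size, n)
    score = 0.0
    for freq in Counter(tokens).values():
        # Hypergeometric probability that this type is absent from the sample.
        p_absent = comb(n - freq, sample_size) / comb(n, sample_size)
        score += (1.0 - p_absent) / sample_size
    return score
```

Because the sample size is fixed, novel types becoming rarer in a longer corpus no longer depresses the score directly, which is why HD-D is less sensitive to corpus size than raw TTR.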

Evaluation Measures and Rich Descriptions
A desirable quality of a description is to convey semantically insightful information. In this section we describe how we sourced a set of human and machine descriptions, comparing them on their semantic richness. We compared standard evaluation measures on semantically sparse and rich captions.

Sourcing Rich and Sparse Descriptions
We showed a total of 20 images to volunteers (Figure 1), asking them to describe each image in an informative sentence: "Describe this image as if describing it to a friend". Unlike large scale data collection, where participants have many images to process, our smaller scale collection gave participants unlimited time to consider their description. We also sourced machine descriptions by training a common image captioning baseline (Xu et al., 2016) on MS-COCO. After validating the performance of our system against the original paper, we sourced 1K machine descriptions of our images. From a subjective comparison between the human and machine descriptions, we noted a gap in semantic richness, illustrated in Figure 2. Humans incorporate information extrinsic to the images, such as from current affairs, cultural background and human experience, reacting with empathy to emotional cues. Machine descriptions, however, are produced sequentially one word at a time, with each word selected from a probability distribution predicted from object and attribute features. As all human descriptions were semantically more insightful than the corresponding machine descriptions, we refer to the machine descriptions as "sparse" and the human descriptions as "rich". Table 2 shows that the distinction between rich and sparse is also evident in the vocabulary and lexical diversity of the datasets.

Evaluation Measures on Human and Machine Descriptions
We evaluated human and machine descriptions separately, using the standard evaluation measures. For each image we performed 1000 evaluations, in which 5 sentences were randomly selected from the set of descriptions to act as the ground truth, with the remaining sentences evaluated as candidates. Table 1 shows that when both ground truth and candidate sentences are semantically sparse, the candidates score very well. However, descriptions of a higher semantic complexity are penalised as a result of their more diverse and rich language, with many insightful descriptions scoring zero. Figure 3 shows examples where the SPICE metric scores rich descriptions as zero. When rich descriptions were used as ground truth, the machine descriptions performed very poorly.
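The repeated-evaluation protocol can be sketched as follows; `score_fn` is a placeholder for any of the standard measures (BLEU, METEOR, CIDEr, SPICE, ...), not an implementation of them, and the reference/candidate split reflects our reading of the protocol.

```python
import random

def repeated_evaluation(descriptions, score_fn, n_refs=5, n_trials=1000, seed=0):
    """For one image: repeatedly split its (unique) descriptions into a
    random ground-truth set and candidate set, score each candidate
    against the references, and return the mean across all trials."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        refs = rng.sample(descriptions, n_refs)
        candidates = [d for d in descriptions if d not in refs]
        scores.extend(score_fn(cand, refs) for cand in candidates)
    return sum(scores) / len(scores)
```

Averaging over many random splits reduces the dependence of the score on any one choice of ground truth sentences.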

Comparison of Lexical Diversity
We measured and compared the LD of human and machine descriptions. Our human descriptions were universally richer and more semantically detailed than the machine descriptions. For each of the 40 TTR curves we plotted (machine and human for each image), we found that LD was an accurate indication of whether a description was from the rich or the sparse set. Figure 4 shows the TTR curves for the examples presented in Figure 2, illustrating the faster decline of the sparse descriptions relative to the semantically richer descriptions.

Comparison of Linguistic Complexity
Readability measures have long been used to automatically grade the complexity of language. We tested several measures, including Flesch-Kincaid (Kincaid JP, 1988), Coleman-Liau (Coleman and Liau, 1975), Dale-Chall (Dale E, 1948) and Automated Readability (Senter, 1967). However, we found they did not correlate well with semantic quality. Informative descriptions tend to be lexically diverse, but are not necessarily complex. Rich descriptions can contain a higher syllable count and more 'difficult' words than sparse descriptions; however, this is not always the case. Furthermore, a description corpus which generates exactly the same complex sentence for every image conveys no information, and yet would score highly on complexity.
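This last point can be made concrete with a toy contrast, using average word length as a deliberately crude stand-in for a syllable-based readability score (the real measures above are more involved):

```python
def avg_word_length(tokens):
    """Crude complexity proxy: longer words loosely track the syllable
    counts used by readability formulas such as Flesch-Kincaid."""
    return sum(len(t) for t in tokens) / len(tokens)

def ttr(tokens):
    """Type Token Ratio, as a simple diversity measure."""
    return len(set(tokens)) / len(tokens)

# The same elaborate sentence emitted for every image: "complex" yet uninformative.
repeated = ("photosynthetic organisms metabolise electromagnetic radiation " * 4).split()
# Varied, plainer sentences: less "complex" but far more diverse.
varied = "a dog naps while two gulls squabble over chips on the pier".split()

print(avg_word_length(repeated) > avg_word_length(varied))  # True: higher complexity
print(ttr(repeated) < ttr(varied))                          # True: lower diversity
```

The repeated corpus ranks higher on the complexity proxy while its diversity collapses, illustrating why complexity alone is a poor indicator of semantic quality.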

The Lexical Gap
One indication of the performance of a machine description system is its ability to convey semantically rich information. We propose a measure which considers the entire output of a description system (which we call c_m) and compares it with its training data (which we call c_r). Thus, instead of solely considering a machine's ability to predict n-grams or words, we also measure its ability to maintain the linguistic diversity of its training corpus. Our key finding is that measuring the LD of a description corpus relative to its ground truth data is a good indication of semantic quality, and can be used to weight standard performance measures, increasing their correlation with human subjective judgement. In this section we define our measures, which we later compare with standard captioning measures.

Measuring the Lexical Gap
The Lexical Diversity Ratio (LDR) is a straightforward measure of the ability of a machine to match the semantic depth of its source material. Given a function L which calculates LD for a reference description corpus c_r and a machine description corpus c_m, we define the Lexical Diversity Ratio (LDR) l_d as:

l_d = L(c_m) / L(c_r)

A machine with a score of 1 is able to match the lexical diversity of its training source; a lower score indicates a reduction in semantic richness. We also define the Lexical Gap (L_g), a bounded measure of the ability of a system to maintain lexical diversity. An l_d below some constant µ will tend to zero, indicating a larger lexical gap; as l_d increases, the system closes that gap towards a score of 1, which indicates ideal performance. The Lexical Gap L_g is defined in terms of l_d and the constants µ and α.

Considering our rich and sparse descriptions independently, we split them into sub-corpora and calculated l_d scores for each sub-corpus, in every case using the richer descriptions as our reference c_r. Figure 5 shows the distribution of LDR (l_d) scores for the rich and sparse parts of our description dataset. The richer descriptions, although more broadly distributed, have a higher mean l_d. We define µ as the value that produces the Bayes minimum error between the two distributions of l_d (0.81), and we set α = 5 to distribute our values broadly across the range 0..1. Then, given a description metric M, we calculate the gap-weighted score m_gap for each sentence s_n in a corpus c_r.
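The measures can be sketched in a few lines. The ratio l_d follows directly from its definition; the logistic squashing used for L_g below is our assumption of one natural bounded form with the stated behaviour (tending to 0 below µ and towards 1 at ideal performance), and the multiplicative weighting of a metric score is likewise an assumed reading of m_gap.

```python
import math

def ldr(L, c_m, c_r):
    """Lexical Diversity Ratio l_d = L(c_m) / L(c_r) for any LD measure L
    (e.g. HD-D): 1 means the machine corpus matches the diversity of its
    reference corpus; lower values indicate lost semantic richness."""
    return L(c_m) / L(c_r)

def lexical_gap(l_d, mu=0.81, alpha=5.0):
    """Bounded gap score in (0, 1). The logistic curve is an assumed
    realisation: l_d well below mu tends to 0 (a large gap), and values
    towards ideal performance tend to 1."""
    return 1.0 / (1.0 + math.exp(-alpha * (l_d - mu)))

def gap_weighted(metric_score, l_d, mu=0.81, alpha=5.0):
    """Assumed multiplicative form of the gap-weighted score m_gap for a
    sentence-level metric score M(s_n)."""
    return lexical_gap(l_d, mu, alpha) * metric_score
```

Under this form, a system with poor relative diversity has its metric scores discounted towards zero, while a system that preserves the diversity of its reference corpus keeps most of its original score.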

Results
We evaluated the performance of weighted lexical measures using the Composite dataset. The dataset contains selected human and machine descriptions for images sourced from Flickr30k, Flickr8K and MS-COCO. Using the measures defined previously, we calculated l_d and L_g for each subset of the Composite dataset (Table 3), using the relevant source corpus as our reference c_r. We thus measured the lexical diversity of the human and machine subsets of the Composite dataset. Even before combining them with standard evaluation measures, we found that l_d and L_g alone correlated well with human subjective judgements, as presented in Table 4. We then calculated m_gap and m_ldr for each evaluation measure over the entire Composite dataset, and computed their correlation with the human evaluation scores. Table 5 compares the gap-weighted scores with standard measures of performance. We found that for all measures, weighting by l_d and L_g improves the correlation between human judgements and objective measures.

Conclusion

Much progress has been made in visual description, with many systems capable of generating original sentences which convey salient objects and attributes. However, building systems capable of conveying semantically insightful information remains a significant challenge, in part because of the difficulty of developing effective and insightful evaluation measures. We find that the LD of descriptions is a useful indicator of semantic quality, and propose that description systems be measured not only on the accuracy of their predictions, but also on their ability to convey lexically specific information. Measuring LD rewards systems which are able to preserve rich and diverse descriptions, but penalises sparse systems, which have a poor lexical capability. We hope that our work will inspire larger datasets of semantically richer and more detailed descriptions, and the development of more effective evaluation criteria for descriptions.