Bridging by Word: Image Grounded Vocabulary Construction for Visual Captioning

Image captioning aims at generating a short description for an image. Existing research usually employs a CNN-RNN architecture that views generation as a sequential decision-making process and uses the entire dataset vocabulary as the decoding space. Such models tend to generate high-frequency n-grams containing words that are irrelevant to the image. To tackle this problem, we propose to construct an image-grounded vocabulary, based on which captions are generated with limitation and guidance. Specifically, a novel hierarchical structure is proposed to construct the vocabulary, incorporating both visual information and relations among words. For generation, we propose a word-aware RNN cell that incorporates vocabulary information directly into the decoding process. The REINFORCE algorithm is employed to train the generator using the constrained vocabulary as the action space. Experimental results on MS COCO and Flickr30k show the effectiveness of our framework compared to some state-of-the-art models.


Introduction
Recent years have witnessed growing popularity of research in multimodal learning across vision and language. Image captioning, one of the most widely studied multimodal tasks, aims at constructing a short text description given an image. Existing research on image captioning usually employs a CNN-RNN architecture with a Convolutional Neural Network (CNN) used for image feature extraction and a Recurrent Neural Network (RNN) for caption generation (Vinyals et al., 2015). Although impressive results have been achieved, existing models suffer from the problem of generating N-grams that occur frequently in the training set but are irrelevant to the given image (Anderson et al., 2016; Dai et al., 2017).
Figure 1: Two images from MS-COCO (Lin et al., 2014) with captions generated by NIC (Vinyals et al., 2015) and the corresponding ground truth (GT) captions.
Footnote 1: https://github.com/LibertFan/ImageCaption
Examples of caption generation are shown in Figure 1. Two images are presented with model-generated captions (Vinyals et al., 2015) and their corresponding human-constructed ones. As we can see, the N-gram "a woman sitting at a table" is generated mistakenly for both images. This is because, when generating a text sequence, the RNN-based generator tends to ignore the semantic meaning encoded in the given image and instead generates the text sequences that occurred most often in the training set. Although different image grounding strategies have been proposed to address this problem, they usually treat visual information as external features for caption generation via various attention mechanisms (You et al., 2016; Lu et al., 2017). We argue that visual information should be embedded into the generation process in a more principled way.
In the CNN-RNN architecture, the RNN-based generator constructs image captions word by word. In each step, a word is selected from the vocabulary built on the entire training set. Generally, the size of the full vocabulary is on the order of $10^4$. When describing a particular image, the possible words to be used should be drawn from a much smaller word set. As an illustration, we show in Figure 2 the statistics of the number of distinct words in human-generated captions for images from MS-COCO (Lin et al., 2014). We can see that the average size of the pool of words used for the description of a particular image is around 30. Based on this observation, we speculate that if we can efficiently constrain the word selection space during the image caption generation process, we should be able to address the irrelevant N-gram problem.
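The per-image statistic described above can be illustrated with a minimal sketch. The toy captions, the whitespace tokenisation and the function name `distinct_word_count` are assumptions made for illustration; they are not taken from the paper or from MS-COCO.

```python
from statistics import mean

def distinct_word_count(captions):
    """Number of distinct lowercase tokens across all captions of one image."""
    words = set()
    for caption in captions:
        words.update(caption.lower().split())
    return len(words)

# Toy stand-in for a captioned dataset: image id -> reference captions.
toy_captions = {
    "img_1": ["a dog runs on the grass", "a brown dog running outside"],
    "img_2": ["two people ride bikes", "people riding bikes on a road"],
}

# Size of the word pool for each image, and the average over images.
per_image = {image_id: distinct_word_count(caps) for image_id, caps in toy_captions.items()}
average_pool_size = mean(per_image.values())
```

On real data the same count would be taken over the five reference captions of each image, ideally with the tokeniser used to build the dataset vocabulary.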
In this paper, we propose to construct an image-grounded vocabulary as a way to leverage the image semantics for image captioning. For vocabulary construction, we propose a two-step approach which incorporates both visual semantics and the relations among words. For text generation, we explore two strategies to utilize the constructed vocabulary. One uses the vocabulary as a hard constraint and the other encodes the weight of each word obtained from the image-grounded vocabulary into the RNN cell as a soft constraint. Experimental results on two public datasets show the effectiveness of using image-grounded vocabulary for visual captioning compared to several state-of-the-art approaches in terms of automatic evaluation metrics. Further analysis reveals that our model has the advantage of generating more novel captions compared to existing approaches.

Our Approach
The overall architecture of our model is shown in Figure 3, which consists of two main stages, image-grounded vocabulary construction and text generation with vocabulary constraints. The image-grounded vocabulary constructor builds a vocabulary related to a given image by considering the visual information encoded and the relationships among words. The text generator with vocabulary constraints generates captions using the constructed vocabulary in two different ways. First, words generated are strictly limited to those in the image-grounded vocabulary. Second, words in the image-grounded vocabulary are re-weighted within the RNN cell such that they are more likely to be generated. We also study the use of the image-grounded vocabulary under the framework of reinforcement learning treating the image-grounded vocabulary as the action space for caption generation.

Image-Grounded Vocabulary Construction
The image-grounded vocabulary constructor aims to identify words required for the description of a given image $I_i$. Intuitively, words used to describe an image can be divided into two groups. One group of words are directly related to the image (e.g., entities or objects depicted in the image) and the other group of words are function words or words that do not correspond directly to elements of the image. We assume that the directly-related words can be determined based on the visual information, while the identification of words in the second group requires the consideration of their relationship with those in the first group. Therefore, we propose a two-step strategy to construct the image-grounded vocabulary.
In the first step, we identify words that are directly related to a given image. Taking each word as a label, the construction of the image-grounded vocabulary can be treated as a multi-label classification problem. We take the visual features of the image as input and obtain a probability distribution $S_i$ for words, indicating the relevance of words for image $I_i$. Following Fang et al. (2015), we only consider a list of words with high frequency in the dataset as seeds, denoted as $H$. The relevance distribution of words in $H$ for an image $I_i$ is computed as follows:
$$S_i^{(H)} = \sigma(M_1(v_i)) \qquad (1)$$
where $v_i$ is the visual features of image $I_i$, $M_k$ is a multi-layer perceptron (MLP) with $k$ layers (one layer in this case), $\sigma(\cdot)$ denotes a sigmoid function, and $S_i^{(H)}$ is the resulting relevance distribution over the words in $H$.
In the second step, we compute the relevance scores of words in the full vocabulary $V$ given the image $I_i$ and the probabilities of directly-related words $S_i^{(H)}$. Specifically, a 2-layer MLP with sigmoid function is employed. The probability distribution of words in $V$ considering both visual information and relations among words is computed in Equation 2:
$$S_i^{(V)} = \sigma(M_2([v_i, S_i^{(H)}])) \qquad (2)$$
where $[\cdot, \cdot]$ is the concatenation operation. During inference, we pick the top $k$ words in terms of their relevance scores to form the image-grounded vocabulary for image $I_i$, denoted as $W_i$. Note that $S_i^{(V)}$ stands for the relevance scores of words in $V$ for image $I_i$ and has the same size as $V$.
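The two-step construction can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the layer sizes, the random initialisation, the `tanh` hidden activation of the 2-layer MLP and all function names are hypothetical choices made so that the example runs on its own.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(weights, bias, x):
    """Affine map; `weights` holds one row of coefficients per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (hypothetical initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def build_vocabulary(v, seed_layer, hidden_layer, output_layer, k):
    # Step 1: relevance of the high-frequency seed words H from visual features only.
    s_h = [sigmoid(z) for z in linear(*seed_layer, v)]
    # Step 2: a 2-layer MLP over the concatenation [v, S^(H)] scores the full vocabulary V.
    hidden = [math.tanh(z) for z in linear(*hidden_layer, v + s_h)]  # tanh is an assumption
    s_v = [sigmoid(z) for z in linear(*output_layer, hidden)]
    # Inference: the indices of the k highest-scoring words form W_i.
    top_k = sorted(range(len(s_v)), key=lambda j: s_v[j], reverse=True)[:k]
    return s_h, s_v, set(top_k)

rng = random.Random(0)
feat_dim, seed_size, hidden_dim, vocab_size = 8, 5, 6, 20
seed_layer = init_layer(seed_size, feat_dim, rng)
hidden_layer = init_layer(hidden_dim, feat_dim + seed_size, rng)
output_layer = init_layer(vocab_size, hidden_dim, rng)
v_i = [rng.random() for _ in range(feat_dim)]
s_h, s_v, w_i = build_vocabulary(v_i, seed_layer, hidden_layer, output_layer, k=4)
```

Because each word is scored by an independent sigmoid, the scores do not need to sum to one. This matches the multi-label view of the problem, in which several words can be relevant to the same image.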

Text Generation with Vocabulary Constraints
In order to utilize the image-grounded vocabulary $W_i$ and word relevance distribution $S_i^{(V)}$ for caption generation, we explore two different strategies. One uses $W_i$ as a hard constraint and the other integrates the relevance of each word into the RNN cell for caption generation. In what follows, we first introduce the basic RNN-based text generator, and then describe each of the two strategies in turn.

RNN-based Generator
The RNN-based generator takes the visual features as input and generates an image caption word by word. In each step, an RNN cell takes the hidden state $h_{t-1}$ and the output word $a_{t-1}$ from the previous step as input and computes the hidden state $h_t$ for the current step. Based on $h_t$, a softmax layer is used to compute the probability distribution of words in the vocabulary, and the top one is selected as the output (Equation 3). In our case, we use an LSTM (Gers et al., 1999) as the RNN cell. Suppose the hidden state, the cell state and the output in the $(t-1)$-th step are denoted as $h_{t-1}$, $c_{t-1}$ and $a_{t-1}$, respectively. The states and output at the $t$-th step are then computed by the LSTM update, where $\odot$ denotes element-wise multiplication, and $W_{*a}$, $U_{*a}$, $* \in \{i, f, o, c\}$, are the parameters of the LSTM cell.
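For reference, a standard LSTM update written with the parameter names used above might look as follows. This is a sketch under two assumptions: that $W_{*a}$ acts on the previous output word $a_{t-1}$ (taken as its embedding) and $U_{*a}$ on the previous hidden state $h_{t-1}$, and that bias terms are omitted. The exact form used in the paper may differ.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{ia}\, a_{t-1} + U_{ia}\, h_{t-1}\right) \\
f_t &= \sigma\left(W_{fa}\, a_{t-1} + U_{fa}\, h_{t-1}\right) \\
o_t &= \sigma\left(W_{oa}\, a_{t-1} + U_{oa}\, h_{t-1}\right) \\
\tilde{c}_t &= \tanh\left(W_{ca}\, a_{t-1} + U_{ca}\, h_{t-1}\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

Here $i_t$, $f_t$ and $o_t$ are the input, forget and output gates, and $\tilde{c}_t$ is the candidate cell state.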

Generator with Hard Constraint
A straightforward way of utilizing the image-grounded vocabulary for text generation is to limit the decoding space to $W_i$ (refer to word constraint in Figure 3). The word selection in each step within the RNN cell is thus modified so that only words in $W_i$ can be chosen. In practice, a mask operation $m_i$ is introduced to replace the $j$-th value in the vector with $-\infty$ if $w_j$ is not found in $W_i$, as shown in Equation 6:
$$[m_i(x)]_j = \begin{cases} x_j, & w_j \in W_i \\ -\infty, & \text{otherwise} \end{cases} \qquad (6)$$
where $x$ denotes the vector being masked.
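A minimal sketch of the masking step, assuming the mask is applied to the pre-softmax scores so that excluded words receive zero probability. The function name `masked_softmax`, the toy vocabulary and the score values are hypothetical, and the allowed set is assumed to be non-empty.

```python
import math

def masked_softmax(scores, allowed):
    """Softmax over `scores` with positions outside `allowed` set to -inf."""
    masked = [s if j in allowed else -math.inf for j, s in enumerate(scores)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["a", "dog", "table", "woman", "grass"]
scores = [2.0, 0.5, 3.0, 2.5, 1.0]  # hypothetical pre-softmax scores
w_i = {0, 1, 4}                     # indices of the words kept in W_i
probs = masked_softmax(scores, w_i)
next_word = vocab[max(range(len(vocab)), key=lambda j: probs[j])]
```

Without the mask, greedy decoding would pick "table", the highest-scoring word overall. With the mask, the choice falls on the best word inside $W_i$.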

Generator with Soft Constraint
Instead of using the image-grounded vocabulary as a hard constraint, we further explore integrating the probability distribution $S_i^{(V)}$ of words in vocabulary $V$ for the given image $I_i$ into the decoding RNN cell (refer to word-aware in Figure 3). In the $t$-th step, we simply combine $a_t$, $h_t$ and $S_i^{(V)}$ with element-wise multiplication. The parameters of the resulting cell are $W_{*s}$, $W_{*a}$, $U_{*s}$, $U_{*a}$, $* \in \{i, f, o, c\}$.
The new RNN cell integrates information about the image-grounded vocabulary so that words in that vocabulary are more likely to be generated.
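One plausible way to realise such a word-aware cell is sketched below. It is an assumption based only on the parameter names $W_{*s}$, $W_{*a}$, $U_{*s}$, $U_{*a}$ and on the description of element-wise combination, and it should not be read as the paper's exact formulation. The symbol $z_*$ (the pre-activation of gate $*$) is introduced here for illustration.

```latex
z_{*} = \left(W_{*s}\, S_i^{(V)}\right) \odot \left(W_{*a}\, a_t\right)
      + \left(U_{*s}\, S_i^{(V)}\right) \odot \left(U_{*a}\, h_t\right),
\qquad * \in \{i, f, o, c\}
```

Under this reading, the input, forget and output gates would be obtained as $\sigma(z_i)$, $\sigma(z_f)$ and $\sigma(z_o)$, and the candidate cell state as $\tanh(z_c)$, so that the word relevance scores modulate every gate of the cell.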

Reinforcement Learning for Text Generation
Although it is straightforward to impose the hard vocabulary constraint during inference, it is not easy to train the text generator with the hard constraint, since words in the ground-truth caption may not appear in the image-grounded vocabulary $W_i$ constructed for image $I_i$. We denote the set of such words as $W_i^{\mathrm{GT}} \setminus W_i$, where $W_i^{\mathrm{GT}}$ is the ground-truth vocabulary for image $I_i$. In order to tackle this problem, we employ reinforcement learning to train the generator under the vocabulary constraint so that it is less likely to select words not in $W_i$. This strategy not only aligns the behavior of word selection during training and testing, but also makes the generator better accustomed to the distribution of $W_i$ through the feedback reward.
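Recall that the goal of reinforcement learning is to maximize the expected reward of the generator with parameter $\theta$:
$$L(\theta) = \mathbb{E}_{(a_1, \cdots, a_T) \sim p_\theta}\left[r(a_1, \cdots, a_T)\right] \qquad (9)$$
The policy gradient of Equation 9 with a baseline $b$ is shown in Equation 10:
$$\nabla_\theta L(\theta) \approx \left(r(a_1, \cdots, a_T) - b\right) \nabla_\theta \log p_\theta(a_1, \cdots, a_T) \qquad (10)$$
where $p_\theta$ is the distribution over captions defined by the generator and the expectation is approximated with a sampled caption.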
Following Rennie et al. (2017), we utilize CIDEr-D (Vedantam et al., 2015) as the reward of the generated sentence $(a_1, \cdots, a_T)$ and set $b = r(\hat{a}_1, \cdots, \hat{a}_T)$, which is the reward obtained by the current model with greedy decoding.
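The reward-minus-baseline weighting can be sketched as a surrogate loss. This is a minimal sketch under stated assumptions: the function name `self_critical_loss` and the numeric values are hypothetical, and in a real system the log-probabilities would come from the generator while the rewards would be CIDEr-D scores.

```python
def self_critical_loss(sample_log_probs, sample_reward, greedy_reward):
    """Surrogate loss whose gradient is the baseline-corrected policy gradient.

    sample_log_probs: log-probabilities of the sampled words a_1..a_T
    sample_reward:    reward of the sampled caption
    greedy_reward:    baseline b, the reward of the greedily decoded caption
    """
    advantage = sample_reward - greedy_reward
    # Minimising this value raises the likelihood of samples that beat the baseline
    # and lowers the likelihood of samples that fall below it.
    return -advantage * sum(sample_log_probs)

log_probs = [-0.5, -1.0, -0.25]  # hypothetical per-word log-probabilities
loss_better = self_critical_loss(log_probs, sample_reward=1.5, greedy_reward=1.0)
loss_worse = self_critical_loss(log_probs, sample_reward=0.5, greedy_reward=1.0)
loss_equal = self_critical_loss(log_probs, sample_reward=1.0, greedy_reward=1.0)
```

Under the hard constraint, the sampled and greedy captions would both be drawn from the masked distribution, so the action space of the policy is the image-grounded vocabulary.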
In summary, the training strategy with reinforcement learning under the vocabulary constraint can be described as follows:
Algorithm 1 Caption generation with reinforcement learning
1: for t = 1 : T do
2:

Training
The overall training procedure of our proposed framework can be described by the following four steps:
Algorithm 2 Training procedure

Experiment

Dataset
We evaluate our proposed framework on MS-COCO (Lin et al., 2014) and Flickr30k (Plummer et al., 2015). In MS-COCO, there are 113,287 images in the training set and 5,000 images in each of the validation and test sets. In Flickr30k, the number of images for the training, validation and test sets is 29,000, 1,000 and 1,000, respectively. Each image has five human-annotated captions. We split the datasets following the process described by Karpathy and Fei-Fei (2015).

Implementation Details
For image representation, we rescale the image to 224 × 224 and use ResNet-152 (He et al., 2016) pre-trained on ImageNet (Russakovsky et al., 2015) to extract features of dimension 2,048. The mini-batch size is 64. The dimensions of the LSTM hidden unit and the word embedding are 512 and 300, respectively, and the word embedding is initialized with GloVe (Pennington et al., 2014), which is pre-trained on Wikipedia 2014 and Gigaword 5. We prune the vocabulary by dropping words that appear fewer than five times. For the generator, we first train the model with cross-entropy using Adam (Kingma and Ba, 2014) with an initial learning rate of $1 \times 10^{-3}$, which decreases by a factor of 0.8 every $2 \times 10^4$ iterations. Then we train the generator with reinforcement learning but without hard constraints using Adam with an initial learning rate of $5 \times 10^{-5}$, which decreases by a factor of 0.8 every $3 \times 10^4$ iterations. Finally, we train the generator with reinforcement learning under the hard constraints using Adam with an initial learning rate of $5 \times 10^{-5}$, which decays by a factor of 0.8 every $2 \times 10^4$ iterations. For each model, we evaluate on the validation set to select the best parameters with grid search. We set the size of $W_i$ to 64 for all models with hard constraints.
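The step-wise learning-rate decay described above can be written as a one-line rule. The sketch assumes the decay is applied at exact multiples of the interval; the function name `step_decay_lr` and the iteration value 45,000 are hypothetical.

```python
def step_decay_lr(initial_lr, iteration, decay=0.8, every=20_000):
    """Learning rate after `iteration` steps, multiplied by `decay` once per `every` steps."""
    return initial_lr * decay ** (iteration // every)

# Cross-entropy stage: 1e-3, decayed every 2e4 iterations (two decays by iteration 45,000).
xe_lr = step_decay_lr(1e-3, iteration=45_000)
# Reinforcement learning without hard constraints: 5e-5, decayed every 3e4 iterations.
rl_lr = step_decay_lr(5e-5, iteration=45_000, every=30_000)
```

The third stage (reinforcement learning under hard constraints) would use the same rule with an initial rate of 5e-5 and the default interval of 20,000 iterations.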

Models for Comparison
We compare our model with the state-of-the-art approaches listed below. In addition, we also performed ablation studies of our proposed model. We denote the hard word constraint mechanism, soft word-aware mechanism and reinforcement learning as WC, WA and RL, respectively.
- NIC (Vinyals et al., 2015) is the baseline CNN-RNN model trained with cross-entropy loss. NIC+RL is trained with reinforcement learning.
- ATT (You et al., 2016) detects a list of visual concepts from a given image, which is used to guide the caption generation process through an attention mechanism.
- AdapAtt (Lu et al., 2017) utilizes the context information of RNN cells in the decoder to better predict non-visual words.
- TopDown (Anderson et al., 2018) employs a visual attention mechanism with a two-layer LSTM.
- NIC+WC uses the hard word constraints (WC). NIC+WC+RL is trained with reinforcement learning using the image-grounded vocabulary as the action space.
- NIC+WC+WA employs the soft word-aware (WA) mechanism on top of NIC+WC. NIC+WC+WA+RL is trained with reinforcement learning using the image-grounded vocabulary as the action space.
- NIC+WC(GT) utilizes the ground-truth vocabulary $W_i^{\mathrm{GT}}$ as the word constraints instead of the constructed vocabulary $W_i$. This is an oracle.

Overall Performance
We report scores of several widely used metrics for image captioning evaluation, including BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), ROUGE (Lin and Hovy, 2003) and CIDEr-D (Vedantam et al., 2015). The overall performance is shown in Table 1. Several findings stand out:
- Both NIC+WC and NIC+WC+RL perform better than their counterpart models NIC and NIC+RL across all metrics. This shows the effectiveness of the word constraint mechanism in reducing irrelevant words for a given image.
- Both NIC+WC+WA and NIC+WC+WA+RL outperform NIC+WC and NIC+WC+RL, respectively. This shows that the word-aware mechanism effectively guides the generator to better capture the semantics of a given image.
- Compared to NIC+WC and NIC+WC+WA, both NIC+WC+RL and NIC+WC+WA+RL achieve better performance. This demonstrates that training the generator under the word constraints with reinforcement learning encourages the generator to adhere to the constraints set by the image-grounded vocabulary.
- Our proposed model NIC+WC+WA+RL outperforms all the baselines and its variants. However, we notice that there is still a large gap between our proposed model and the oracle model NIC+WC(GT). This shows that there is still potential to improve the construction of the image-grounded vocabulary.
We conducted statistical significance tests (Student's paired t-test) to verify that the differences seen among the different approaches were significant. Results showed that NIC+WC+WA+RL outperformed NIC+RL significantly across all the metrics ($p < 0.01$), and likewise NIC+WC+WA outperformed NIC ($p < 0.01$). This confirms the effectiveness of using the image-grounded vocabulary to improve visual captioning.

Further Analysis
We conduct further analysis to evaluate the sensitivity of our model to parameter and component settings, and present case studies to illustrate the merits of our model in comparison to baseline models.
Influence of the size of $W_i$. We explore the influence of the size of the image-grounded vocabulary on the performance of the generator. We test three models, namely NIC+WC, NIC+WC+RL and NIC+WC+WA+RL, using various sizes of $W_i$, and report the CIDEr-D scores. In addition, we also report the recall and precision of $W_i$ compared to the ground-truth vocabulary $W_i^{\mathrm{GT}}$. The results are shown in Figure 4. Note that models without word constraints can be interpreted as taking the full vocabulary $V$ as $W_i$.
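The precision and recall of a constructed vocabulary against the ground truth can be computed with simple set operations. This is a minimal sketch; the function name `vocabulary_precision_recall` and the toy word sets are hypothetical.

```python
def vocabulary_precision_recall(predicted, reference):
    """Precision and recall of a predicted word set against a reference word set."""
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall

w_i = {"a", "dog", "grass", "on", "table"}            # constructed vocabulary
w_gt = {"a", "dog", "on", "the", "grass", "running"}  # words in the reference captions
precision, recall = vocabulary_precision_recall(w_i, w_gt)
```

Enlarging the constructed vocabulary can only keep or raise recall, while precision tends to fall as more irrelevant words are admitted. This is the trade-off behind choosing the size of $W_i$.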
We observe a similar trend of CIDEr-D for all three models. It gradually goes up with the increasing size of $W_i$, reaches its peak at 48, 48 and 64, respectively, and then gradually drops as the size of $W_i$ increases further. It is worth noting that the peak sizes are quite close to the average number of words (i.e., around 30) in the ground-truth vocabulary $W_i^{\mathrm{GT}}$ shown in Figure 2. The performance of the generator is poor when the size of $W_i$ is too small because the possible word choices are too limited. As the size of $W_i$ gets larger, more irrelevant words are included, which introduces noise to the generator and thus performance drops.
Robustness of our model. In Figure 5, we show the mean and standard deviation of CIDEr-D scores of NIC+WC+WA+RL with $|W_i| = 64$ and $|W_i| = |V|$ in three different runs for training the generator with reinforcement learning under word constraints. We can see that the model with $|W_i| = 64$ consistently outperforms the one with $|W_i| = |V|$ across the training iterations.
Influence of vocabulary constructor. We analyze the influence of the vocabulary constructor on the performance of the generator. Instead of using the vocabulary constructor introduced in Section 2.1, we build another baseline model that takes visual features as input and employs a single-layer MLP with sigmoid for generating the vocabulary. The variants of the models are named NIC+WC$_b$ and NIC+WC$_b$+RL. Experimental results are shown in Table 2. We report precision and recall of the generated vocabulary to evaluate the constructor directly, and report BLEU-4 and CIDEr-D to see its influence on the generator. It can be observed that our constructor is able to build a better vocabulary than the baseline constructor in terms of both precision and recall. This indicates the effectiveness of our two-step approach. Moreover, with our proposed vocabulary constructor, both NIC+WC and NIC+WC+RL outperform NIC+WC$_b$ and NIC+WC$_b$+RL, respectively, achieving better BLEU-4 and CIDEr-D scores in image captioning.

Effectiveness of generating novel captions
Novel caption generation is crucial for automatic image captioning because retrieval-based models that simply retrieve existing captions from the training set often produce less human-like results, even though they can achieve high scores in terms of automatic evaluation metrics (Devlin et al., 2015b). The worst case of the N-gram problem is that the model directly reproduces frequent captions from the training set (Devlin et al., 2015a). Thus the capability of generating novel captions for an image that is not seen in the training set indicates that the generator is able to understand a given image better instead of simply generating frequent N-grams found in the training set.
In this experiment, we consider generated captions that do not appear in the training set as novel captions. We show the ratio of novel captions generated by different models in Figure 6. Our proposed model outperforms NIC and two other competitive baselines, TopDown and AdapAtt, by a large margin. Moreover, NIC+WC is also able to generate more novel captions compared to NIC, indicating that the word constraint mechanism helps reduce generic words.
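The ratio of novel captions can be sketched as follows. The sketch assumes that a caption counts as seen when it matches a training caption exactly after lowercasing and whitespace normalisation; the paper does not spell out the matching rule, and the function names and toy captions are hypothetical.

```python
def normalise(caption):
    """Lowercase and collapse whitespace so trivially different strings compare equal."""
    return " ".join(caption.lower().split())

def novel_caption_ratio(generated, training_captions):
    """Fraction of generated captions that do not occur in the training set."""
    seen = {normalise(c) for c in training_captions}
    if not generated:
        return 0.0
    novel = sum(1 for c in generated if normalise(c) not in seen)
    return novel / len(generated)

training = ["a woman sitting at a table", "a dog on the grass"]
generated = [
    "A woman sitting at a table",  # seen (differs only in case)
    "a dog running on the grass",  # novel
    "two children on a bench",     # novel
    "a dog on the grass",          # seen
]
ratio = novel_caption_ratio(generated, training)
```

A ratio close to zero would indicate that the generator mostly reproduces training captions, while a higher ratio indicates more captions composed for the specific image.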

Case Study
We show example captions generated by our model in Figure 7. Results from two models are presented, namely NIC+RL and NIC+WC+WA+RL. In order to show how the image-grounded vocabulary $W_i$ regulates the generation process, we cross out the words in the caption generated by NIC+RL that are not included in $W_i$. The crossed-out words are entity words such as "grass" and "field" in the first image (top-left), the preposition "on" in the second one (top-right) and the entity word "bench" in the third one (bottom-left).
The examples also indicate the effectiveness of the word-aware mechanism, which guides the generator to replace "standing" with "walking" in the first image, "people" with "children" in the third one, and "standing" with "is flying over" in the last one (bottom-right).

Related Work
Research investigating the connection between vision and language has attracted increasing attention in the past few years. Popular tasks include image captioning, visual question answering (VQA), and visual question generation. In image captioning, most of the proposed models (You et al., 2016; Lu et al., 2017; Anderson et al., 2018) employ a CNN to extract visual features and an RNN to generate captions word by word (Vinyals et al., 2015). Visual question answering (Antol et al., 2015; Goyal et al., 2017) aims to provide an answer to a question related to a given image. Existing architectures designed for VQA (Malinowski et al., 2015) utilize an RNN to encode the question and a CNN to encode the image. Most efforts are made to align the visual and text information for generating the answer. Visual question generation is a relatively new task that generates natural questions about an image (Mostafazadeh et al., 2016). Approaches have been explored to generate diverse questions (Tang et al., 2017; Zhang et al., 2017; Fan et al., 2018a) and questions with a specific property (Fan et al., 2018b).
Instead of using high-level visual features extracted from the image for text generation, some researchers explore identifying fine-grained information from the image, i.e., objects and attributes, to guide the process of text generation. Traditionally, template-based approaches are used to compose the caption (Farhadi et al., 2010; Kulkarni et al., 2013; Lin et al., 2015). After that, different attention mechanisms are proposed to align visual information and text for better generation (You et al., 2016; Lu et al., 2017; Anderson et al., 2018).
To better align visual information and text, some researchers explore identifying semantic concepts related to the image. Some work employs retrieved sentences as additional semantic information to assist generation. Others (Fang et al., 2015; Wu et al., 2016; You et al., 2016; Gan et al., 2017) utilize high-frequency words as semantic concepts. Fang et al. (2015) develop features based on detected concepts to re-rank the generated captions. You et al. (2016) employ an attention mechanism over concepts to enhance the generator. Gan et al. (2017) apply weight tensors in LSTM units to integrate the semantic concepts into the generator. In contrast, in our proposed approach, the image-grounded vocabulary is built at the word level and imposed as constraints on caption generation.
The work most relevant to ours is from Yao et al. (2017) and Wu et al. (2018). Yao et al. (2017) incorporate a copy mechanism to encourage the generator to generate visually related words. Wu et al. (2018) dynamically construct a vocabulary with a lightweight network and then pick a word from this smaller vocabulary with a more complex network to improve computational efficiency. Our model is novel in three ways. First, we observe that the large mismatch between the dataset vocabulary and the vocabulary required for describing an image is one of the main reasons for the generation of irrelevant N-grams. Second, we propose a novel two-step approach for image-grounded vocabulary construction. Third, we explore two different strategies for caption generation using the constructed vocabulary.

Conclusion and Future Work
In this paper, we have proposed a novel framework which constructs an image-grounded vocabulary to leverage the image semantics for image captioning in order to tackle the problem of generating irrelevant N-grams. A novel two-step approach has been proposed to construct the vocabulary considering both visual information and relations among words. Two strategies have then been explored to utilize the constructed vocabulary via hard constraints and soft constraints. Reinforcement learning has been adopted for the training of the generator to encourage it to only choose words from the image-grounded vocabulary. Experiments on two public datasets, namely MS COCO and Flickr30k, show that the image-grounded vocabulary is able to enhance the quality of image captions compared to existing state-of-the-art approaches. In the future, we plan to study more effective ways to construct the image-grounded vocabulary. Furthermore, it would also be interesting to design a mutual reinforcement mechanism between the vocabulary constructor and the text generator to improve both components simultaneously.