What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?

Image captioning has evolved into a core task for Natural Language Generation and has also proved to be an important testbed for deep learning approaches to handling multimodal representations. Most contemporary approaches rely on a combination of a convolutional network to handle image features, and a recurrent network to encode linguistic information. The latter is typically viewed as the primary “generation” component. Beyond this high-level characterisation, a CNN+RNN model supports a variety of architectural designs. The dominant model in the literature is one in which visual features encoded by a CNN are “injected” as part of the linguistic encoding process, driving the RNN’s linguistic choices. By contrast, it is possible to envisage an architecture in which visual and linguistic features are encoded separately, and merged at a subsequent stage. In this paper, we address two related questions: (1) Is direct injection the best way of combining multimodal information, or is a late merging alternative better for the image captioning task? (2) To what extent should a recurrent network be viewed as actually generating, rather than simply encoding, linguistic information?


Introduction
Image captioning (Bernardi et al., 2016) has emerged as an important testbed for solutions to the fundamental AI challenge of grounding symbolic or linguistic information in perceptual data (Harnad, 1990; Roy and Reiter, 2005). Most captioning systems focus on what Hodosh et al. (2013) refer to as concrete conceptual descriptions, that is, captions that describe what is strictly within the image, although recently there has been growing interest in moving beyond this, with research on visual question-answering (Antol et al., 2015) and image-grounded narrative generation (Huang et al., 2016), among others.
Approaches to image captioning can be divided into three main classes (Bernardi et al., 2016):
1. Systems that rely on computer vision techniques to extract object detections and features from the source image, using these as input to an NLG stage (Kulkarni et al., 2011; Mitchell et al., 2012; Elliott and Keller, 2013). The latter is roughly akin to the microplanning and realisation modules in the well-known NLG pipeline architecture (Reiter and Dale, 2000).
2. Systems that frame the task as a retrieval problem, where a caption, or parts thereof, is identified by computing the proximity/relevance of strings in the training data to a given image. This is done by exploiting either a unimodal (Ordonez et al., 2011; Gupta et al., 2012; Mason and Charniak, 2014) or multimodal (Hodosh et al., 2013; Socher et al., 2014) space. Many retrieval-based approaches rely on neural models to handle both image features and linguistic information (Ordonez et al., 2011; Socher et al., 2014).
3. Systems that also rely on neural models, but rather than performing partial or wholesale caption retrieval, generate novel captions using a recurrent neural network (RNN), usually a long short-term memory (LSTM). Typically, such models use image features extracted from a pre-trained convolutional neural network (CNN) such as the VGG CNN (Simonyan and Zisserman, 2014) to bias the RNN towards sampling terms from the vocabulary in such a way that a sequence of such terms produces a caption that is relevant to the image (Kiros et al., 2014b; Kiros et al., 2014a; Vinyals et al., 2015; Mao et al., 2015a; Hendricks et al., 2016).
This paper focuses on the third class. The key property of these models is that the CNN image features are used to condition the predictions of the best caption to describe the image. However, this can be done in different ways, and the role of the RNN depends in large measure on the mode in which CNN and RNN are combined.
It is quite typical for RNNs to be viewed as 'generators'. For example, Bernardi et al. (2016) suggest that 'the RNN is trained to generate the next word [of a caption]', a view also expressed by LeCun et al. (2015). A similar position has also been taken in work focusing on the use of RNNs as language models for generation (Sutskever et al., 2011; Graves, 2013). However, an alternative view is possible, whereby the role of the RNN can be thought of as primarily to encode sequences, but not directly to generate them. These two views can be associated with different architectures for neural caption generators, which we discuss below and illustrate in Figure 1. In one class of architectures, image features are directly incorporated into the RNN during the sequence encoding process (Figure 1a). In these models, it is natural to think of the RNN as the primary generation component of the image captioning system, making predictions conditioned by the image. A different architecture keeps the encoding of linguistic and perceptual features separate, merging them in a later multimodal layer, at which point predictions are made (Figure 1b). In this type of model, the RNN functions primarily as an encoder of sequences of word embeddings. The multimodal layer is the one that drives the generation process, since the RNN never sees the image and hence would not be able to direct the generation process.
While both architectural alternatives have been attested in the literature, their implications have not, to our knowledge, been systematically discussed and comparatively evaluated. In what follows, we first discuss the distinction between the two architectures (Section 2) and then present some experiments comparing the two (Sections 3 and 4). Our conclusion is that grounding language generation in image data is best conducted in an architecture that first encodes the two modalities separately, before merging them to predict captions.

Background: Neural Caption Generation Architectures
In a neural language model, an RNN encodes a prefix (for example, the caption generated so far) and either predicts the next item in the sequence itself, with the help of a feedforward layer, or passes the encoding to the next layer, which makes the prediction. This new item is added to the prefix at the next iteration to predict another item, until an end-of-sequence symbol is reached. Typically, the prediction is carried out using a softmax function to sample the next item according to a probability distribution over the vocabulary items, based on their activation. This process is illustrated in Figure 2.
One way to condition the RNN to predict image captions is to inject both visual and linguistic features directly into the RNN, as depicted in Figure 1a. We refer to this as 'conditioning-by-inject' (or inject for short). Different types of inject architectures have become the most widely attested among deep learning approaches to image captioning (Chen and Zitnick, 2015; Donahue et al., 2015; Hessel et al., 2015; Karpathy and Fei-Fei, 2015; Liu et al., 2016; Yang et al., 2016; Zhou et al., 2016). Given training pairs consisting of an image and a caption, the RNN component of such models is trained by exposure to prefixes of increasing length extracted from the caption, in tandem with the image.

Figure 2: How RNNs work: each state of the RNN encodes a prefix, which incorporates the output word derived from the previous state. In practice the neural network does not output a single word but a probability distribution over all known words in the vocabulary. Legend: FF: feedforward layer; <beg>: the start-of-sentence token; <end>: the end-of-sentence token.
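The iterative prediction process of a neural language model described in this section can be sketched as follows. This is only a minimal illustration, not the implementation used in the experiments: `toy_activations` is a hypothetical stand-in for the RNN and feedforward layers, and the four-word vocabulary is invented for the example.

```python
import math
import random


def softmax(activations):
    """Turn raw activations into a probability distribution."""
    peak = max(activations)  # subtract the maximum for numerical stability
    exps = [math.exp(a - peak) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


def generate(next_activations, vocab, max_len=20, rng=None):
    """Grow a prefix one item at a time until <end> is sampled."""
    rng = rng or random.Random(0)
    prefix = ["<beg>"]
    for _ in range(max_len):
        probs = softmax(next_activations(prefix))
        # Sample the next item according to the distribution over the vocabulary.
        item = rng.choices(vocab, weights=probs, k=1)[0]
        if item == "<end>":
            break
        prefix.append(item)  # the new item becomes part of the next prefix
    return prefix[1:]


# Hypothetical toy model (an assumption for illustration only).
VOCAB = ["a", "dog", "runs", "<end>"]


def toy_activations(prefix):
    # Strongly prefer the vocabulary item whose index equals the number of
    # words generated so far.
    target = min(len(prefix) - 1, len(VOCAB) - 1)
    return [50.0 if k == target else 0.0 for k in range(len(VOCAB))]


caption = generate(toy_activations, VOCAB)
```

In a real system the activations would come from the trained network, and sampling is often replaced by a search procedure over the same distributions.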
An alternative architecture, which we refer to as 'conditioning-by-merge' (Figure 1b), treats the RNN exclusively as a 'language model' to encode linguistic sequences of varying length. The linguistic vector resulting from this encoding is subsequently combined with the image features in a separate multimodal layer. This amounts to viewing the RNN as primarily an encoder of linguistic information. This type of architecture is also attested in the literature, albeit to a lesser extent than the inject architecture (Mao et al., 2014; Mao et al., 2015a; Mao et al., 2015b; Song and Yoo, 2016; Hendricks et al., 2016; You et al., 2016). A limited number of approaches have also been proposed in which both architectures are combined (Lu et al., 2016; Xu et al., 2015).
Notice that both architectures are compatible with the inclusion of attentional mechanisms (Xu et al., 2015). The effect of attention in the inject architecture is to combine a different representation of the image with each word. In the case of merge, a different representation of the image can be combined with the final RNN state before each prediction. Attentional mechanisms are, however, beyond the scope of the present work.
The main differences between inject and merge architectures can be summed up as follows. In an inject model, the RNN is trained to predict sequences based on histories consisting of both linguistic and perceptual features. Hence, in this model, the RNN is primarily responsible for image-conditioned language generation. By contrast, in the merge architecture, RNNs in effect encode linguistic representations, which themselves constitute the input to a later prediction stage that comes after a multimodal layer. It is only at this late stage that image features are used to condition predictions.
As a result, a model involving conditioning by inject is trained to learn linguistic representations directly conditioned by image data; a merge architecture maintains a distinction between the two representations, but brings them together in a later layer.
Put somewhat differently, it could be argued that at a given time step, the merge architecture predicts what to generate next by combining the RNN-encoded prefix of the string generated so far (the 'past' of the generation process) with non-linguistic information (the guide of the generation process). The inject architecture, on the other hand, uses the full image features with every word of the prefix during training, in effect learning a 'visuo-linguistic' representation of each word. One effect of this is that image features can serve to further specify or disambiguate the 'meaning' of words, by disambiguating tokens of the same word which are correlated with different image features (such as 'crane' as in the bird versus the construction equipment). This implies that inject models learn a larger vocabulary during training.
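The difference between the two ways of combining image and text can be sketched in a few lines. This is an illustration under simplifying assumptions: a plain tanh RNN stands in for the recurrent network, vectors are combined by concatenation, and all sizes, names, and random weights are hypothetical. The point is only where the image vector enters the computation.

```python
import math
import random


def init(rows, cols, rng):
    """Random weight matrix stored as a list of rows (one row per output unit)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def rnn_encode(inputs, weights, state_size):
    """Simple tanh RNN (a stand-in for an LSTM); returns the final state."""
    state = [0.0] * state_size
    for x in inputs:
        joined = x + state  # concatenate the current input with the previous state
        state = [math.tanh(sum(w * v for w, v in zip(row, joined)))
                 for row in weights]
    return state


def inject_features(word_vecs, image_vec, rnn_weights, state_size):
    # Inject: the image vector is concatenated with every word vector,
    # so the RNN itself encodes a mixed visuo-linguistic history.
    mixed = [w + image_vec for w in word_vecs]
    return rnn_encode(mixed, rnn_weights, state_size)


def merge_features(word_vecs, image_vec, rnn_weights, state_size):
    # Merge: the RNN encodes the words only; the image vector is
    # concatenated with the final state just before prediction.
    return rnn_encode(word_vecs, rnn_weights, state_size) + image_vec


rng = random.Random(0)
E, I, S = 3, 2, 4  # embedding, image, and state sizes (toy values)
words = [[rng.uniform(-1, 1) for _ in range(E)] for _ in range(5)]
image = [rng.uniform(-1, 1) for _ in range(I)]

inject_rnn = init(S, E + I + S, rng)  # RNN input includes the image
merge_rnn = init(S, E + S, rng)       # RNN input is the word embedding only

inject_out = inject_features(words, image, inject_rnn, S)
merge_out = merge_features(words, image, merge_rnn, S)
```

In both cases the returned vector is what a final softmax layer would receive: in inject it has the size of the RNN state, whereas in merge it is the RNN state extended with the image vector.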
The two architectures also differ in the number of parameters they need to handle. As noted above, since an inject architecture combines the image with each word during training, it is effectively handling a larger vocabulary than merge. Assume that the image vectors are concatenated with the word embedding vectors (inject) or the final RNN state (merge). Then, in the inject architecture, the number of weights in the RNN is a function of both the caption embedding and the images, whereas in merge, it is only the word embeddings that contribute to the size of this layer of the network. Let e be the size of the word embedding, v the size of the vocabulary, i the image vector size and s the state size of the RNN. In the inject case, the number of weights in the RNN is w ∝ (e + i) × s, whereas it is w ∝ e × s in merge. The smaller number of weights handled by the RNN in merge is offset by a larger number of weights at the final softmax layer, which has to take as input the RNN state and the image, having size w ∝ (s + i) × v.
A systematic comparison of these two architectures would shed light on the best way to conceive of the role of RNNs in neural language generation. Apart from the theoretical implications concerning the stage at which language should be grounded in visual information, such a comparison also has practical implications. In particular, if it turns out that merge outperforms inject, this would imply that the linguistic representations encoded in an RNN could be pre-trained and re-used for a variety of tasks and/or image captioning datasets, with domain-specific training only required for the final feedforward layer, where the tuning required to make perceptually grounded predictions is carried out. We return to this point in Section 6.1.
In the following sections, we describe some experiments to conduct such a comparison.
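The weight counts discussed in this section can be made concrete with a small calculation. The sketch below uses only the proportionalities given in the text for the RNN, and assumes that the softmax layer has weights proportional to s × v in inject (where it sees only the RNN state) and (s + i) × v in merge (where it also sees the image). Constant factors, such as the number of gates in an LSTM, and bias terms are deliberately ignored, so these are relative rather than exact counts.

```python
def rnn_input_weights(e, i, s, architecture):
    """Weights connecting the RNN input to its state (up to a constant factor)."""
    if architecture == "inject":
        return (e + i) * s  # word embedding and image both enter the RNN
    if architecture == "merge":
        return e * s        # only the word embedding enters the RNN
    raise ValueError(f"unknown architecture: {architecture}")


def softmax_weights(i, s, v, architecture):
    """Weights of the final softmax layer (bias terms ignored)."""
    if architecture == "inject":
        return s * v        # assumption: softmax sees the RNN state only
    if architecture == "merge":
        return (s + i) * v  # softmax sees the RNN state and the image
    raise ValueError(f"unknown architecture: {architecture}")


# Toy setting: all layer sizes equal to 512 and a vocabulary of 2,539 items.
e = i = s = 512
v = 2539
for arch in ("inject", "merge"):
    total = rnn_input_weights(e, i, s, arch) + softmax_weights(i, s, v, arch)
    print(arch, rnn_input_weights(e, i, s, arch), softmax_weights(i, s, v, arch), total)
```

Under these simplified counts, merge saves i × s weights in the RNN and spends i × v more in the softmax layer, so its total grows relative to inject as the vocabulary gets larger.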

Experiments
To evaluate the performance of the inject and merge architectures, and thus the roles of the RNN, we trained and evaluated them on the Flickr8k (Hodosh et al., 2013) and Flickr30k (Young et al., 2014) datasets of image-caption pairs. For the purposes of these experiments, we used the version of the datasets distributed by Karpathy and Fei-Fei (2015). The dataset splits are identical to those used by Karpathy and Fei-Fei (2015): Flickr8k is split into 6,000 images for training, 1,000 for validation, and 1,000 for testing, whilst Flickr30k is split into 29,000 images for training, 1,014 images for validation, and 1,000 images for testing. Each image in both datasets has five different captions. The distributed datasets also include 4,096-element image feature vectors extracted from the pre-trained VGG CNN (Simonyan and Zisserman, 2014). We normalised the image vectors to unit length during preprocessing.
Tokens with frequency lower than a threshold in the training set were replaced with the 'unknown' token. In our experiments we varied the threshold between 3 and 5 in order to measure the performance of each model as vocabulary size changes. Since a higher threshold removes more tokens, thresholds of 5, 4, and 3 give vocabulary sizes of 2,539, 2,918, and 3,478 for Flickr8k and 7,415, 8,275, and 9,584 for Flickr30k.
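The frequency threshold described above can be sketched as follows. The spelling of the unknown token and the toy captions are assumptions made for the example; only the rule itself (tokens with training frequency below the threshold are replaced) comes from the text.

```python
from collections import Counter

UNKNOWN = "<unk>"  # hypothetical spelling of the 'unknown' token


def build_vocab(train_captions, min_freq):
    """Keep tokens whose frequency in the training set is at least min_freq."""
    counts = Counter(tok for caption in train_captions for tok in caption)
    return {tok for tok, n in counts.items() if n >= min_freq}


def replace_rare(caption, vocab):
    """Replace every out-of-vocabulary token with the unknown token."""
    return [tok if tok in vocab else UNKNOWN for tok in caption]


train = [
    ["a", "dog", "runs"],
    ["a", "dog", "sits"],
    ["a", "cat", "sits"],
]
vocab = build_vocab(train, min_freq=2)
cleaned = replace_rare(["a", "cat", "runs"], vocab)
```

Raising `min_freq` can only shrink the vocabulary, which is why the larger thresholds correspond to the smaller vocabulary sizes.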
Since our purpose is to compare the performance of architectures, we used the 'barest' models possible, with the smallest number of hyperparameters. This means that complexities that are usually introduced in order to reach state-of-the-art performance, such as regularization, were avoided, since it is difficult to determine which combination of hyperparameters does not give an unfair advantage to one architecture over the other.
We constructed a basic neural language model consisting of a word embedding matrix, a basic LSTM (Hochreiter and Schmidhuber, 1997), and a softmax layer. The LSTM is defined in terms of the following quantities: x_n is the nth input; s_n is the hidden state after n inputs, with s_0 the all-zeros vector; c_n is the cell state after n inputs, with c_0 the all-zeros vector; i_n, f_n, and o_n are the input, forget, and output gates after n inputs; g_n is the modified input used to calculate c_n after n inputs; W_αβ is the weight matrix between α and β; b_α is the bias vector for α; ⊙ is the elementwise vector multiplication operator; and 'sig' refers to the sigmoid function. The hidden state and the cell state always have the same size.
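For reference, a standard LSTM update that is consistent with the symbols defined above can be written as follows. This is a sketch of the usual formulation without peephole connections, and not necessarily the exact parameterisation used in the experiments: the use of tanh for g_n and for the output, and the row-vector convention (inputs multiplied on the left of the weight matrices), are assumptions.

```latex
\begin{align}
i_n &= \mathrm{sig}\left(x_n W_{xi} + s_{n-1} W_{si} + b_i\right) \\
f_n &= \mathrm{sig}\left(x_n W_{xf} + s_{n-1} W_{sf} + b_f\right) \\
o_n &= \mathrm{sig}\left(x_n W_{xo} + s_{n-1} W_{so} + b_o\right) \\
g_n &= \tanh\left(x_n W_{xg} + s_{n-1} W_{sg} + b_g\right) \\
c_n &= f_n \odot c_{n-1} + i_n \odot g_n \\
s_n &= o_n \odot \tanh\left(c_n\right)
\end{align}
```

Here s_0 and c_0 are the all-zeros vectors, so the first update depends only on x_1 and the biases.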
In the experiments, this basic neural language model is used as a part of two different architectures. In the inject architecture, the image vector is concatenated with each of the word vectors in a caption. In the merge architecture, it is only concatenated with the final LSTM state. The layer sizes of the embedding, LSTM state, and projected image vector were also varied in the experiments in order to measure the effect of increasing the capacity of the networks. The layer sizes used are 128, 256, and 512. The details of the architectures used in the experiments are illustrated in Figure 3.
Training was performed using the Adam optimisation algorithm (Kingma and Ba, 2014) with default hyperparameters and a minibatch size of 50 captions. The cost function used was sum cross-entropy. Training was carried out with an early stopping criterion which terminated training as soon as performance on the validation data started to deteriorate (validation performance is measured after each training epoch). Initialization of weights was done using Xavier initialization (Glorot and Bengio, 2010) and biases were set to zero.
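The cost function and stopping rule described above can be sketched as follows. The function names, the epoch cap, and the simulated validation losses are hypothetical; the text does not say which weights are kept once training stops, so the sketch only reports how many epochs were run and the best validation loss seen.

```python
import math


def sum_cross_entropy(target_probs):
    """Sum (not mean) of negative log probabilities assigned to the correct words."""
    return -sum(math.log(p) for p in target_probs)


def train_with_early_stopping(run_epoch, validation_loss, max_epochs=100):
    """Stop as soon as the validation loss gets worse than in the previous epoch."""
    best = float("inf")
    epochs_run = 0
    for epoch in range(max_epochs):
        run_epoch(epoch)               # one pass over the training data
        loss = validation_loss(epoch)  # measured after each training epoch
        epochs_run += 1
        if loss > best:                # validation performance deteriorated
            break
        best = loss
    return epochs_run, best


# Simulated validation losses standing in for a real training run.
simulated = [3.0, 2.5, 2.2, 2.4, 2.0]
epochs_run, best_loss = train_with_early_stopping(
    run_epoch=lambda epoch: None,
    validation_loss=lambda epoch: simulated[epoch],
    max_epochs=len(simulated),
)
```

In the simulated run, training stops after the fourth epoch even though a later epoch would have been better, which is the price of stopping at the first sign of deterioration.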
Each architecture was trained three separate times; the results reported below are averages over these three separate runs.
To evaluate the trained models we generated captions for images in the test set using beam search with a beam width of 3 and a clipped maximum length of 20 words. The MSCOCO evaluation code was used to measure the quality of the captions by using the standard evaluation metrics BLEU-(1,2,3,4) (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), CIDEr (Vedantam et al., 2015), and ROUGE-L (Lin and Och, 2004). We also calculated the percentage of word types that were actually used in the generated captions out of the vocabulary of available word types. This measure indicates how well each architecture exploits the vocabulary it is trained on.
The code used for the experiments was implemented with TensorFlow and is available online.
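Beam search with a fixed width and a length cap can be sketched as below. This is one simple variant written for illustration, not the code used in the experiments: `toy_model` is a hypothetical next-word model, finished hypotheses are set aside as they appear, and the highest-scoring finished hypothesis is returned.

```python
import math


def beam_search(next_log_probs, beam_width=3, max_len=20, end="<end>"):
    """Return the highest-scoring caption found with the given beam width."""
    beams = [(0.0, ["<beg>"])]  # (log probability, prefix)
    complete = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            for word, logp in next_log_probs(prefix).items():
                candidates.append((score + logp, prefix + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, prefix in candidates[:beam_width]:
            # Finished captions leave the beam; the rest are extended further.
            (complete if prefix[-1] == end else beams).append((score, prefix))
        if not beams:
            break
    best = max(complete or beams, key=lambda c: c[0])
    return [w for w in best[1] if w not in ("<beg>", end)]


def toy_model(prefix):
    """Hypothetical next-word distribution that depends on the last word only."""
    table = {
        "<beg>": {"a": 0.6, "the": 0.4},
        "a": {"dog": 0.5, "cat": 0.5},
        "the": {"dog": 0.9, "cat": 0.1},
        "dog": {"<end>": 1.0},
        "cat": {"<end>": 1.0},
    }
    return {w: math.log(p) for w, p in table[prefix[-1]].items()}


wide = beam_search(toy_model, beam_width=3)
greedy = beam_search(toy_model, beam_width=1)
```

In this toy example a beam width of 1 commits to the locally most probable first word, whereas a width of 3 keeps the alternative alive long enough to find the caption with the higher overall probability.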

Results
Table 1 reports means and standard deviations over the three runs of all the MSCOCO measures and the vocabulary usage. Since the point is to compare the effects of the architectures rather than to reach state-of-the-art performance, we do not include results from other published systems in our tables.
Across all experimental variables (dataset, vocabulary, and layer sizes), the performance of the merge architecture is generally superior to that of the inject architecture in all measures except for ROUGE-L and BLEU (ROUGE-L is designed for evaluating text summarization whilst BLEU is criticized for its lack of correlation with human-given scores). In what follows, we focus on the CIDEr measure for caption quality as it was specifically designed for captioning systems.
Although merge outperforms inject by a rather narrow margin, the low standard deviation over the three training runs suggests that this is a consistent performance advantage across train-and-test runs. In any case, there is clearly no disadvantage to the merge strategy with respect to injecting image features.
One peculiarity is that results on Flickr8k are better than those on Flickr30k. This could mean that Flickr8k captions contain less variation and hence are easier to perform well on. Preliminary results on the larger MSCOCO dataset (Lin et al., 2014) (currently in progress) show CIDEr results over 0.7, which means that either Flickr8k is too easy or Flickr30k is too hard when compared to the much larger MSCOCO.

Table 1: Results on the captions generated using the inject and merge architectures. Values are means over three separately retrained models, together with the standard deviation in parentheses. Panel (d) reports BLEU-n scores on Flickr30k. Legend: Layer: the layer size used ('x' in Figure 3); Vocab.: the vocabulary size used.
The best-performing models are merge with a state size of 256 on Flickr8k, and merge with a state size of 128 on Flickr30k, both with a minimum token frequency threshold of 3. Inject models tend to improve with increasing state size on both datasets, while the relationship between the performance of merge and the state size shows no discernible trend. Inject therefore does not seem to overfit as state size increases, even on the larger dataset. At the same time, inject only seems to be able to outperform the best scores achieved by merge if it has a much larger layer size. Therefore, in practical terms, inject models have to have larger capacity to be on a par with merge. Put differently, merge has a higher performance to model size ratio and makes more efficient use of limited resources (this observation holds even when model size is defined in terms of number of parameters instead of layer sizes).
Given the same layer sizes and vocabulary, the number of parameters for merge is greater than for inject. The difference becomes greater as the vocabulary size is increased. For a vocabulary size of 2,539 and layer size of 512, merge has about 3% more parameters than inject, whilst for a vocabulary size of 9,584 and layer size of 512, merge has about 20% more parameters. However, the foregoing remarks concerning over- and under-fitting also apply when the difference between the number of parameters is small. That is, the difference in performance is due at least in part to architectural differences, not just to differences in number of parameters.
Merge models use a greater proportion of the training vocabulary on test captions. However, the proportion of vocabulary used is generally quite small for both architectures: less than 16% for Flickr8k and less than 7% for Flickr30k. Overall, the trend is for smaller proportions of the overall training vocabulary to be used as the vocabulary grows larger, suggesting that neural language models find it harder to use infrequent words (which are more numerous in larger vocabularies by definition). In practice, this means that reducing training vocabularies results in minimal performance loss.
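The vocabulary usage measure discussed here, the percentage of available word types that actually appear in the generated captions, can be computed as in the sketch below. The function name and the toy data are assumptions for illustration.

```python
def vocabulary_usage(generated_captions, vocab):
    """Percentage of vocabulary word types used in the generated captions."""
    used = {tok for caption in generated_captions for tok in caption
            if tok in vocab}
    return 100.0 * len(used) / len(vocab)


vocab = {"a", "dog", "cat", "runs", "sits"}
generated = [["a", "dog", "runs"], ["a", "dog", "sits"]]
usage = vocabulary_usage(generated, vocab)
```

Because the measure counts word types rather than tokens, repeating 'a' and 'dog' across captions does not increase it; only producing a word type not yet seen does.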
Overall, the evidence suggests that delaying the merging of image features with linguistic encodings to a late stage in the architecture may be advantageous, at least as far as corpus-based evaluation measures are concerned. Furthermore, the results suggest that a merge architecture has a higher capacity than an inject architecture and can generate better quality captions with smaller layers.

Discussion
If the RNN had the primary role of generating captions, then it would need to have access to the image in order to know what to generate. This does not seem to be the case, as including the image in the RNN is not generally beneficial to its performance as a caption generator.
When viewing RNNs as having the primary role of encoding rather than generating, it makes sense that the inject architecture generally suffers in performance when compared to the merge architecture. The most plausible explanation has to do with the handling of variation. Consider once more the task of the RNN in the image captioning task: during training, captions are broken down into prefixes of increasing length, with each prefix compressed to a fixed-size vector, as illustrated in Figure 2 above.
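The breakdown of a caption into prefixes of increasing length can be sketched as follows. The function name is hypothetical; the `<beg>` and `<end>` tokens follow the legend of Figure 2. Each prefix is what the RNN has to compress into a fixed-size vector, and the paired word is the prediction target.

```python
def training_prefixes(caption):
    """Pair every prefix of the caption with the word that should follow it."""
    tokens = ["<beg>"] + caption + ["<end>"]
    return [(tokens[:n], tokens[n]) for n in range(1, len(tokens))]


pairs = training_prefixes(["a", "dog", "runs"])
for prefix, target in pairs:
    print(prefix, "->", target)
```

A caption of k words therefore yields k + 1 training examples, the last of which teaches the model when to stop.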
In the inject architecture, the encoding task is made more complex by the inclusion of image features. Indeed, in the version of inject used in our experiments, which is the most commonly used solution in the caption generation literature, image features are concatenated with every word in the caption. The upshot is (a) a requirement to compress caption prefixes together with image data into a fixed-size vector and (b) a substantial growth in the vocabulary size the RNN has to handle, because each image+word is treated as a single 'word'. This problem is alleviated in merge, where the RNN encodes linguistic histories only, at the expense of more parameters in the softmax layer.
One practical consequence of these findings is that, while merge models can handle more variety with smaller layers, increasing the state size of the RNN in the merge architecture is potentially quite profitable, as the entire state will be used to remember a greater variety of previously generated words. By contrast, in the inject architecture, this increase in memory would be used to better accommodate information from two distinct, but combined, modalities.
This paper has presented two views of the role of the RNN in an image caption generator. In the first, an RNN decides on which word is the most likely to be generated next, given what has been generated before. In multimodal generation, this view encourages architectures where the image is incorporated into the RNN along with the words that were generated, in order to allow the RNN to make visually informed predictions.
The second view is that the RNN's role is purely memory-based and is only there to encode the sequence of words that have been generated thus far. This representation informs caption prediction at a later layer of the network as a function of both the RNN encoding and perceptual features. This view encourages architectures where vision and language are brought together late, in a multimodal layer.
Caption generation turns out to perform worse, in general, when image features are injected into the RNN. Thus, the role of the RNN is better conceived in terms of the learning of linguistic representations, to be used to inform later layers in the neural network, where predictions are made based on what has been generated in the past together with the image that is guiding the generation. Had the RNN been the component primarily involved in generating the caption, it would need to be informed about the image in order to know what needs to be generated; however, this line of reasoning seems to hurt performance when applied to an architecture. This suggests that the RNN is not the main component of the caption generator that is involved in generation.
In short, given a neural network architecture that is expected to process input sequences from multiple modalities, arriving at a joint representation, it would be better to have a separate component to encode each input, bringing them together at a late stage, rather than to pass them all into the same RNN through separate input channels. With respect to the question of how language should be grounded in perceptual data, the tentative answer offered by these experiments is that the link between the symbolic and perceptual should be established late, once encoding has been performed. To this end, recurrent networks are best viewed as learning representations, not as generating sequences.

Future work
The experiments reported here were conducted on two separate datasets. One concern is that results on Flickr8k and Flickr30k are not entirely consistent, though the superiority of merge over inject is clear in both. We are currently extending our experiments to the larger MSCOCO dataset (Lin et al., 2014).
The insights discussed in this paper invite future research on how generally applicable the merge architecture is in different domains. We would like to investigate whether similar changes in architecture would work in sequence-to-sequence tasks such as machine translation, where instead of conditioning a language model on an image we are conditioning a target language model on sentences in a source language. A similar question arises in image processing. If a CNN were conditioned to be more sensitive to certain types of objects or saliency differences among regions of a complex image, should the conditioning vector be incorporated at the beginning, thereby conditioning the entire CNN, or would it be better to instead incorporate it in a final layer, where saliency differences would then be based on high-level visual features?
There are also more practical advantages to merge architectures, such as for transfer learning. Since merge keeps the image separate from the RNN, the RNN used for captioning can conceivably be transferred from a neural language model that has been trained on general text. This cannot be done with an inject architecture since the RNN would need to be trained to combine image and text in the input. In future work, we intend to see how the performance of a caption generator is affected when the weights of the RNN are initialized from those of a general neural language model, along lines explored in neural machine translation (Ramachandran et al., 2016).
(a) Conditioning by injecting the image means injecting the image into the same RNN that processes the words.
(b) Conditioning by merging the image means merging the image with the final state of the RNN in a "multimodal layer" after processing the words.

Figure 1: The inject and merge architectures for caption generation. The RNN's previous state going into the RNN is not shown. Legend: RNN: Recurrent Neural Network; FF: Feed Forward layer.

Figure 3: An illustration of the different architectures that are tested in this paper. The numbers or letters at the bottom of each box refer to the vector size output of a layer. 'x' is an arbitrary layer size that is varied in the experiments and 'v' is the vocabulary size, which is also varied in the experiments. 'Dense' means fully connected layer with bias.