Generating Description for Sequential Images with Local-Object Attention Conditioned on Global Semantic Context

In this paper, we propose an end-to-end CNN-LSTM model for generating descriptions for sequential images with a local-object attention mechanism. To generate coherent descriptions, we capture global semantic context using a multi-layer perceptron, which learns the dependencies between sequential images. A parallel LSTM network is exploited for decoding the sequence descriptions. Experimental results show that our model outperforms the baseline across three different evaluation metrics on the datasets published by Microsoft.


Introduction
Recently, automatically generating image descriptions has attracted considerable interest in the fields of computer vision and natural language processing. Such a task is easy for humans but highly non-trivial for machines, as it requires not only capturing the semantic information of images (e.g., objects and actions) but also generating human-like natural language descriptions.
Existing approaches to generating image descriptions are dominated by neural network-based methods, which mostly focus on generating a description for a single image (Karpathy and Li, 2015; Xu et al., 2015; Jia et al., 2015; You et al., 2016). Generating descriptions for sequential images, in contrast, is much more challenging: the information of individual images as well as the dependencies between images in a sequence needs to be captured. Huang et al. (2016) introduce the first sequential vision-to-language dataset and exploit a Gated Recurrent Unit (GRU) (Cho et al., 2014) based encoder and decoder for the task of visual storytelling. However, their approach only considers the image information of a sequence at the first time step of the decoder and ignores the local attention mechanism, which is important for capturing the correlation between the features of an individual image and the corresponding words in a description sentence. Yu et al. (2017) propose hierarchically-attentive Recurrent Neural Networks (RNNs) for album summarisation and storytelling. To generate descriptions for an image album, their hierarchical framework selects representative images from several image sequences of the album, where the selected images might not necessarily correlate with each other.
In this paper, we propose an end-to-end CNN-LSTM model with a local-object attention mechanism for generating story-like descriptions for the multiple images of a sequence. To improve the coherence of the generated descriptions, we exploit a parallel long short-term memory (LSTM) network and learn global semantic context by embedding the global features of the sequential images as an initial input to the hidden layer of the LSTM model. We evaluate the performance of our model on the task of generating story-like descriptions for an image sequence on the sequence-in-sequence (SIS) dataset published by Microsoft. We hypothesise that by taking global context into account, our model can also generate better descriptions for individual images. Therefore, in another set of experiments, we further test our model on the Descriptions of Images-in-Isolation (DII) dataset for generating descriptions for each individual image of a sequence. Experimental results show that our model outperforms a baseline developed based on the state-of-the-art image captioning model (Xu et al., 2015) in terms of BLEU, METEOR and ROUGE, and can generate sequential descriptions which preserve the dependencies between sentences.

Related Work
Recent successes in machine translation using Recurrent Neural Networks (RNNs) (Bahdanau et al., 2014; Cho et al., 2014) have also spurred neural approaches to image description generation. Recently, the attention mechanism (Xu et al., 2015; You et al., 2016; Lu et al., 2016; Zhou et al., 2016) has been widely used and proved to be effective in the task of image description generation. For instance, Xu et al. (2015) explore two kinds of attention mechanisms for generating image descriptions, i.e., soft attention and hard attention, whereas You et al. (2016) exploit a selective semantic attention mechanism for the same task.
There is also a surge of research interest in visual storytelling (Kim and Xing, 2014; Sigurdsson et al., 2016; Huang et al., 2016; Yu et al., 2017). Huang et al. (2016) collect stories using Mechanical Turk and translate a sequence of images into story-like descriptions by extending a GRU-GRU framework. Yu et al. (2017) utilise hierarchically-attentive structures with combined RNNs for photo selection and story generation. However, the above-mentioned approaches for generating descriptions of sequential images do not explicitly capture the dependencies between the individual images of a sequence, which is the gap that we try to address in this paper.

Methodology
In this section, we describe the proposed CNN-LSTM model with local-object attention. In order to generate coherent descriptions for an image sequence, we introduce global semantic context and a parallel LSTM in our framework, as shown in Figure 1. Our model works by first extracting the global features of the sequential images using a CNN (VGG16) (Simonyan and Zisserman, 2014), which has been extensively used in image recognition. A VGG16 model contains 13 convolutional layers, 5 pooling layers and 3 fully connected layers. The extracted global features are then embedded into a global semantic vector with a multi-layer perceptron, which serves as the initial input to the hidden layer of a parallel LSTM model. Our model then applies the last convolutional layer of the VGG16 model to generate the local features of each image in the sequence. Finally, we introduce a parallel LSTM model and a local-object attention mechanism to decode the sentence descriptions.
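As a rough sketch of this setup, the following NumPy snippet shows the one property that distinguishes the parallel LSTM from independent per-image decoders: all five branches share the same initial hidden state derived from the global context. The hidden size of 512 follows the feature-embedding section below; everything else (random vectors, variable names) is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

N, d_h = 5, 512  # 5 images per sequence; 512-d context vector (see Sec. "Features Extraction")

# Stand-in for the global semantic vector produced by the MLP from the
# FC7 features of all five images.
h0 = rng.standard_normal(d_h)

# "Parallel" decoding: one LSTM branch per image, every branch starting
# from the SAME global context h0, so each sentence is conditioned on the
# whole sequence rather than on its own image alone.
hidden = np.tile(h0, (N, 1))   # h_0^j identical for j = 1..N
cell = np.zeros((N, d_h))      # memory cells start at zero

assert hidden.shape == (N, d_h)
assert all(np.array_equal(hidden[j], h0) for j in range(N))
```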

Features Extraction and Embedding
Sequential image description differs from single image description due to the spatial correlation between images. Therefore, in the encoder, we exploit both global and local features for describing the content of the sequential images. We extract the global features of the sequential images from the second fully connected layer (FC7) of the VGG16 model. The global features are denoted by G, a set of 4096-dimensional vectors. We then select the features of the final convolutional layer (Conv5) of the VGG16 model to represent the local features of each image in the sequence. The local features are denoted as L_j (j = 1, ..., N), where N is the number of images in the sequence. In our experiments, we follow Huang et al. (2016) and set the number of images in a sequence to 5. Finally, we embed the global features G into a 512-dimensional context vector via a multi-layer perceptron, which is then used as the initial input of the hidden layer in the LSTM model.
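The embedding step can be sketched as follows. The 4096-dimensional FC7 features, the 512-dimensional context vector, and the 14x14x512 Conv5 maps are from the text above; the two-layer shape of the perceptron, its 1024-unit hidden layer, and the tanh activations are our assumptions, since the exact MLP architecture is not specified.

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp_embed(G, W1, b1, W2, b2):
    """Embed the concatenated global features into a 512-d context vector
    via a two-layer perceptron (layer sizes are our assumption)."""
    h = np.tanh(G @ W1 + b1)
    return np.tanh(h @ W2 + b2)

N = 5
G = rng.standard_normal((N, 4096)).reshape(-1)   # concatenated FC7 features
W1 = rng.standard_normal((4096 * N, 1024)) * 0.01
b1 = np.zeros(1024)
W2 = rng.standard_normal((1024, 512)) * 0.01
b2 = np.zeros(512)

context = mlp_embed(G, W1, b1, W2, b2)
assert context.shape == (512,)

# Local features: the Conv5 output of VGG16 for a 224x224 image is a
# 14 x 14 x 512 map, i.e. 196 region vectors of dimension 512.
L_j = rng.standard_normal((14 * 14, 512))
assert L_j.shape == (196, 512)
```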

Sequential Descriptions Generation
In the decoding stage, our goal is to obtain the most likely text descriptions for a given sequence of images. This can be achieved by training a model to maximise the log likelihood of a sequence of sentences S, given the corresponding sequential images I and the model parameters θ, as shown in Eq. 1:

log p(S | I; θ) = Σ_{j=1}^{N} log p(s_j | I; θ)    (1)

Here s_j denotes a sentence in S, and N is the total number of sentences in S.
Assuming a generative model of each sentence s_j that produces the words of the sentence in order, the log probability of s_j is given by the sum of the log probabilities over its words:

log p(s_j | I; θ) = Σ_{t=1}^{C} log p(s_{j,t} | s_{j,1:t−1}, I; θ)    (2)

where s_{j,t} represents the t-th word in the j-th sentence and C is the total number of words in s_j.
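The two-level factorisation over sentences and words can be checked numerically with toy probabilities (the values below are made up for illustration):

```python
import math

# Toy per-word conditional probabilities p(s_{j,t} | s_{j,1:t-1}, I) for a
# sequence of N = 2 sentences.
word_probs = [
    [0.5, 0.25, 0.125],   # sentence 1: C = 3 words
    [0.5, 0.5],           # sentence 2: C = 2 words
]

# Per sentence: log p(s_j) = sum_t log p(s_{j,t} | s_{j,1:t-1}, I)
sentence_logps = [sum(math.log(p) for p in sent) for sent in word_probs]

# Over the sequence: log p(S | I) = sum_j log p(s_j | I)
total_logp = sum(sentence_logps)

assert abs(sentence_logps[0] - math.log(0.5 * 0.25 * 0.125)) < 1e-12
assert abs(total_logp - (sentence_logps[0] + sentence_logps[1])) < 1e-12
```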
We utilise an LSTM network (Hochreiter and Schmidhuber, 1997) to produce the sequence descriptions conditioned on the local feature vectors, the previously generated words, and a hidden state carrying the global semantic context. Formally, our LSTM model is formulated as follows:

i_t^j = σ(W_i [E s_{j,t−1}; v_t^j; h_{t−1}^j] + b_i)
f_t^j = σ(W_f [E s_{j,t−1}; v_t^j; h_{t−1}^j] + b_f)    (3)
o_t^j = σ(W_o [E s_{j,t−1}; v_t^j; h_{t−1}^j] + b_o)
q_t^j = ϕ(W_q [E s_{j,t−1}; v_t^j; h_{t−1}^j] + b_q)

c_t^j = f_t^j ⊙ c_{t−1}^j + i_t^j ⊙ q_t^j    (4)
h_t^j = o_t^j ⊙ ϕ(c_t^j)

where i_t^j, f_t^j, o_t^j and c_t^j represent the input gate, forget gate, output gate and memory, respectively, and q_t^j represents the updating information for the memory c_t^j. E is the word embedding matrix, σ denotes the sigmoid activation function, ⊙ represents element-wise multiplication, and ϕ indicates the hyperbolic tangent function. W_• and b_• are the parameters to be estimated during training. The hidden state h_t^j at time step t is used as an input to the LSTM unit at the next time step.
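A minimal NumPy implementation of one such LSTM step might look as follows. The stacked-weight layout and the choice of input (previous word embedding concatenated with the local context vector) are our assumptions about details the formulation leaves open:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One decoder LSTM step with gates i, f, o and candidate update q.

    x      : input at time t (here assumed to be the previous word embedding
             concatenated with the local context vector v_t)
    h_prev : hidden state h_{t-1}
    c_prev : memory cell c_{t-1}
    W, b   : parameters for the four gates, stacked along the last axis
    """
    z = np.concatenate([x, h_prev]) @ W + b
    d = h_prev.shape[0]
    i = sigmoid(z[0 * d:1 * d])   # input gate
    f = sigmoid(z[1 * d:2 * d])   # forget gate
    o = sigmoid(z[2 * d:3 * d])   # output gate
    q = np.tanh(z[3 * d:4 * d])   # candidate memory update
    c = f * c_prev + i * q        # new memory
    h = o * np.tanh(c)            # new hidden state
    return h, c

rng = np.random.default_rng(2)
d_in, d_h = 64, 512               # input size is illustrative; 512 matches the context vector
W = rng.standard_normal((d_in + d_h, 4 * d_h)) * 0.01
b = np.zeros(4 * d_h)
h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h), W, b)
assert h.shape == (d_h,) and c.shape == (d_h,)
```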
Here, we utilise a multilayer perceptron to model the global semantic context, which serves as the initial value of the hidden state. Every initial hidden state h_0^j in the LSTM model is equal and is defined as:

h_0^j = MLP(G)    (5)

When modelling local context, the local context vector v_t^j is a dynamic representation of the relevant part of the j-th image in a sequence at time t:

k_t^j = softmax(f_att(L_j, h_{t−1}^j)),   v_t^j = Σ_i k_{t,i}^j L_{j,i}    (6)

In Eq. 6, we use the attention mechanism f_att proposed by Bahdanau et al. (2014) to compute the local attention vector v_t^j, where the corresponding weight k_{t,i}^j of each local feature vector L_{j,i} is computed by a softmax function with input from a multilayer perceptron which considers both the current local features L_j and the hidden state h_{t−1}^j at time t − 1.
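This additive (Bahdanau-style) attention can be sketched in NumPy as below; the parameter names W_L, W_h and w and the 128-dimensional attention space are ours, chosen for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def local_attention(L_j, h_prev, W_L, W_h, w):
    """Additive attention over the region features L_j of one image.

    Scores each of the R region vectors against the previous hidden state,
    normalises with softmax (the weights k_t), and returns the weighted
    sum v_t as the local context vector.
    """
    scores = np.tanh(L_j @ W_L + h_prev @ W_h) @ w   # (R,)
    k = softmax(scores)                              # attention weights, sum to 1
    v = k @ L_j                                      # local context vector
    return v, k

rng = np.random.default_rng(3)
R, d_L, d_h, d_a = 196, 512, 512, 128
L_j = rng.standard_normal((R, d_L))
h_prev = rng.standard_normal(d_h)
v, k = local_attention(L_j, h_prev,
                       rng.standard_normal((d_L, d_a)) * 0.01,
                       rng.standard_normal((d_h, d_a)) * 0.01,
                       rng.standard_normal(d_a))
assert v.shape == (d_L,)
assert abs(k.sum() - 1.0) < 1e-9 and (k >= 0).all()
```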
Experiments

Datasets. Both the SIS and DII datasets are published by Microsoft and have a similar data structure.

DII (our model)
(1) a group of people that are on the beach. (2) a man and a woman pose for a picture together. (3) a city at night with many buildings in the background. (4) a bridge that is next to the water. (5) a large ship is being enjoyed by the crowd.

DII (cnn-att-lstm)
(1) a group of people that are next to each other. (2) a man and a woman sitting at a table. (3) a group of friends pose for a picture. (4) the man is blowing out into the camera. (5) a woman is smiling.

DII (ground truth)
(1) a variety of people sitting in a window-filled restaurant. (2) closeup of a woman looking to her right in a restaurant setting. (3) many buildings by the beach. (4) a waterfront scene from an outside restaurant at night. (5) people on the ferris wheel.

SIS (our model)
(1) the family went to restaurant. (2) the family was very excited to have a party. (3) the sun was going down to the beach. (4) the family decide to go to restaurant. (5) i was so excited to have a great time.

SIS (cnn-att-lstm)
(1) the city is a small windows. (2) the girls are ready to go to the day. (3) the beautiful fireworks. (4) the city has a great view. (5) we drove up.

SIS (ground truth)
(1) me and my lover went on a vacation to see some sights. here we are getting something to eat. (2) we liked the food but the place was rather crowded for our tastes. here is a view of the city from our hotel. (3) it was so lovely to look out every night as the sun went down. another shot from high up. (4) it was breath taking to watch the city light up as the sun went down. (5) we where in line for a ferris wheel. i thought that this would make a good pic, and i think it came out well.
Evaluation. We compare our model with a sequence-to-sequence baseline with attention (cnn-att-lstm) (Xu et al., 2015). The cnn-att-lstm baseline only utilises the local attention mechanism, which combines the visual concepts of an image with the corresponding words in a sentence. Our model, apart from adopting local-object attention, further models global semantic context for capturing the correlation between sequential images.
Table 2 shows the experimental results of our model on the task of generating descriptions for sequential images with three popular evaluation metrics, i.e., BLEU, METEOR and ROUGE. It can be observed from Table 2 that our model outperforms the baseline on both the SIS and DII datasets for all evaluation metrics. It is also observed that the evaluation scores are generally higher for the DII dataset than for the SIS dataset. The main reason is that the SIS dataset contains more description sentences per sequence and more abstract content words such as "breathtaking" and "excited", which are difficult to learn and prone to overfitting.

Figure 2 shows an example sequence of five images as well as the corresponding descriptions generated by our model, the baseline (cnn-att-lstm), and the ground truth. For the SIS dataset, it can be observed that our model generates more coherent story-like descriptions. For instance, our model learns the social word "family" to connect the whole story and the emotional words "great time" to summarise the description, whereas the baseline model fails to capture such important information. Our model can also learn dependencies between the visual scenes of images on the DII dataset. For example, compared to the descriptions generated by cnn-att-lstm, our model learns the visual word "beach" in image 1 by reasoning from the visual word "water" in image 4.
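For reference, a stripped-down BLEU-1 (unigram precision with a brevity penalty) can be computed as below; the actual evaluation uses the full multi-n-gram BLEU, METEOR and ROUGE implementations, so this is only a sketch of the idea:

```python
from collections import Counter
import math

def bleu1(candidate, reference):
    """Unigram precision with brevity penalty: a minimal single-reference
    BLEU-1 sketch, not the full metric."""
    cand, ref = candidate.split(), reference.split()
    overlap = sum((Counter(cand) & Counter(ref)).values())  # clipped unigram matches
    precision = overlap / len(cand)
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

score = bleu1("a group of people on the beach",
              "a group of people that are on the beach")
assert 0.0 < score <= 1.0
```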
Our model generally achieves good results by capturing the global semantics of an image sequence, as in the example in the first row of Figure 3. However, it also has difficulties in generating meaningful descriptions in a number of cases. For instance, our model generates fairly abstract descriptions such as "a great time" due to severe overfitting, as shown in the second row of Figure 3. We expect the overfitting issue could be alleviated by adding more training data or using a more effective algorithm for image feature extraction.

Conclusion
In this paper, we present a local-object attention model with global semantic context for sequential image description. Unlike other CNN-LSTM models that only employ a single image as input for image captioning, our proposed method generates descriptions of sequential images by exploiting the global semantic context to learn the dependencies between the images of a sequence. Extensive experiments on two image datasets (DII and SIS) show promising results for our model.

Figure 1: The architecture of our CNN-LSTM model with global semantic context.

Figure 2: Example of sequential descriptions generated by our model, the baseline, and the ground truth.

Figure 3: Error analysis of our model. First row: our model generates correct captions. Second row: failure cases due to severe overfitting.

Table 2: Evaluation of the quality of descriptions generated for sequential images.

In both datasets, each image sequence consists of five images and their corresponding descriptions. The key difference is that the descriptions in SIS consider the dependencies between images, whereas the descriptions in DII are written for each individual image, i.e., no dependencies are considered. As the full DII and SIS datasets are quite large, we only used part of both datasets for our initial experiments; the dataset statistics are shown in Table 1.