Hierarchically-Attentive RNN for Album Summarization and Storytelling

We address the problem of end-to-end visual storytelling. Given a photo album, our model first selects the most representative (summary) photos, and then composes a natural language story for the album. For this task, we make use of the Visual Storytelling dataset and a model composed of three hierarchically-attentive Recurrent Neural Nets (RNNs) to: encode the album photos, select representative (summary) photos, and compose the story. Automatic and human evaluations show our model achieves better performance on selection, generation, and retrieval than baselines.


Introduction
Since we first developed language, humans have always told stories. Fashioning a good story is an act of creativity, and developing algorithms to replicate this has been a long-running challenge. Adding pictures as input can guide story construction by offering visual illustrations of the storyline. In the related task of image captioning, most methods generate descriptions only for individual images or for short videos depicting a single activity. Very recently, datasets have been introduced that extend this task to longer temporal sequences such as movies or photo albums (Rohrbach et al., 2016; Pan et al., 2016; Lu and Grauman, 2013).
The type of data we consider in this paper provides input illustrations for story generation in the form of photo albums, sampled over a few minutes to a few days of time. For this type of data, generating textual descriptions involves telling a temporally consistent story about the depicted visual information, where stories must be coherent and take into account the temporal context of the images. Applications include constructing visual and textual summaries of albums, or even enabling search through personal photo collections to find photos of life events.
Previous visual storytelling works can be classified into two types, vision-based and language-based, where image or language stories are constructed respectively. Among the vision-based approaches, unsupervised learning is commonly applied: e.g., (Sigurdsson et al., 2016) learn the latent temporal dynamics given a large amount of albums, and (Kim and Xing, 2014) formulate photo selection as a sparse time-varying directed graph. However, these visual summaries tend to be difficult to evaluate, and the selected photos may not agree with human selections. In language-based approaches, a sequence of natural language sentences is generated to describe a set of photos. To drive this work, (Park and Kim, 2015) collected a dataset mined from blog posts. However, this kind of data often contains extraneous contextual information or loosely related language. A more direct dataset was recently released, in which multi-sentence stories describing photo albums were collected via Amazon Mechanical Turk. In this paper, we make use of this Visual Storytelling Dataset. While its authors provide a seq2seq baseline, they only deal with the task of generating stories given 5 representative (summary) photos hand-selected by people from an album. Instead, we focus on the more challenging and realistic problem of end-to-end generation of stories from entire albums. This requires us either to generate a story from all of an album's photos or to learn selection mechanisms that identify representative photos and then generate stories from those summary photos. We evaluate each type of approach.
Ultimately, we propose a model of hierarchically-attentive recurrent neural nets, consisting of three RNN stages. The first RNN encodes the whole album context and each photo's content, the second RNN provides weights for photo selection, and the third RNN takes the weighted representation and decodes it into the resulting sentences. Note that during training, we are only given the full input albums and the output stories, and our model needs to learn the summary photo selections latently.
We show that our model achieves better performance over baselines under both automatic metrics and human evaluations. As a side product, we show that the latent photo selection also reasonably mimics human selections. Additionally, we propose an album retrieval task that can reliably pick the correct photo album given a sequence of sentences, and find that our model also outperforms the baselines on this task.

Related work
Recent years have witnessed an explosion of interest in vision and language tasks, reviewed below.

Visual Captioning: Most recent approaches to image captioning (Vinyals et al., 2015b; Xu et al., 2015) have used CNN-LSTM structures to generate descriptions. For captioning video or movie content (Pan et al., 2016), sequence-to-sequence models are widely applied, where the first sequence encodes video frames and the second decodes the description. Attention techniques (Xu et al., 2015; Yu et al., 2016; Yao et al., 2015) are commonly incorporated in both tasks to localize salient temporal or spatial information.

Video Summarization: Similar to document summarization (Rush et al., 2015; Cheng and Lapata, 2016; Woodsend and Lapata, 2010), which extracts key sentences and words, video summarization selects key frames or shots. While some approaches use unsupervised learning (Lu and Grauman, 2013; Khosla et al., 2013) or intuitive criteria to pick salient frames, recent models learn from human-created summaries (Gygli et al., 2015; Zhang et al., 2016b,a; Gong et al., 2014). Recently, to better exploit semantics, (Choi et al., 2017) proposed textually customized summaries.

Visual Storytelling: Visual storytelling tries to tell a coherent visual or textual story about an image set. Previous works include storyline graph modeling (Kim and Xing, 2014), unsupervised mining (Sigurdsson et al., 2016), blog-photo alignment, and language retelling (Park and Kim, 2015). While (Park and Kim, 2015) collect data by mining blog posts, the Visual Storytelling Dataset collects stories using Mechanical Turk, providing more directly relevant stories.

Model
Our model (Fig. 1) is composed of three modules: Album Encoder, Photo Selector, and Story Generator, jointly learned during training.

Album Encoder
Given an album A = {a_1, a_2, ..., a_n}, composed of a set of photos, we use a bi-directional RNN to encode the local album context for each photo. We first extract the 2048-dimensional visual representation f_i ∈ R^k (k = 2048) for each photo using ResNet101; a bi-directional RNN is then applied to encode the full album. We choose a Gated Recurrent Unit (GRU) as the RNN unit to encode the photo sequence. The sequence output at each time step encodes the local album context for each photo (from both directions). Fusing this context with the visual representation, followed by ReLU, our final photo representation is (top module in Fig. 1):

v_i = ReLU(f_i + (h_i^fwd + h_i^bwd)),    (1)

where h_i^fwd and h_i^bwd are the forward and backward GRU hidden states for photo a_i.
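As a toy illustration of the fusion step, the sketch below (plain Python, with 4-dimensional vectors standing in for the k-dimensional features) fuses a photo's ResNet feature with its bi-directional context vectors via element-wise sum followed by ReLU. The sum-then-ReLU fusion and the function names are our illustrative reading of the text, not a verbatim implementation.

```python
# Sketch of the album-encoder fusion step, assuming the bi-directional
# GRU outputs (h_fwd, h_bwd) for photo i are already computed.

def relu(x):
    # element-wise ReLU over a plain list
    return [max(0.0, v) for v in x]

def fuse_photo(f, h_fwd, h_bwd):
    """Fuse the visual feature f with local album context from both
    GRU directions: v = ReLU(f + h_fwd + h_bwd)."""
    return relu([a + b + c for a, b, c in zip(f, h_fwd, h_bwd)])

# toy 4-d vectors standing in for k = 2048
v = fuse_photo([0.5, -1.0, 0.2, 0.0],
               [0.1, 0.3, -0.5, 0.2],
               [0.2, 0.2, 0.1, -0.4])
```

Negative fused coordinates are clipped to zero by the ReLU, so only positively reinforced feature dimensions survive into v_i.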

Photo Selector
The Photo Selector (illustrated in the middle yellow part of Fig. 1) identifies representative photos to summarize an album's content. As discussed, we do not assume that we are given the ground-truth album summaries during training; instead, we regard selection as a latent variable in the end-to-end learning. Inspired by Pointer Networks (Vinyals et al., 2015a), we use another GRU-RNN to perform this task.
Given the album representation V ∈ R^{n×k}, the photo selector outputs probabilities p_t ∈ R^n (the likelihood of each photo being selected as the t-th summary image) using soft attention.

Figure 1: Model: the album encoder is a bi-directional GRU-RNN that encodes all album photos; the photo selector computes the probability of each photo being the t-th album-summary photo; and finally, the story generator outputs a sequence of sentences that combine to tell a story for the album.
At each summarization step t, the GRU takes the previous probabilities p_{t-1} and the previous hidden state as input, and outputs the next hidden state h_t. h_t is fused with each photo representation v_i to compute the i-th photo's attention p_t(i) = p(y_{a_i}(t) = 1). At test time, we simply pick the photo with the highest probability to be the summary photo at step t.
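A minimal sketch of one soft-attention selection step follows. The dot-product scoring between the selector state h_t and each photo representation v_i is an illustrative assumption (the model's actual fusion is learned); the softmax and the test-time argmax match the description above.

```python
import math

def softmax(scores):
    # numerically stable softmax over a list of scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def select_step(h_t, V):
    """Return attention p_t over the n album photos (rows of V),
    scoring each photo by a dot product with the selector state h_t."""
    scores = [sum(h * v for h, v in zip(h_t, v_i)) for v_i in V]
    return softmax(scores)

V = [[0.2, 0.1], [0.9, 0.4], [0.1, 0.8]]   # n = 3 photos, k = 2 (toy)
p_t = select_step([1.0, 0.5], V)
best = max(range(len(p_t)), key=p_t.__getitem__)  # test-time hard pick
```

During training the full distribution p_t is kept (selection stays latent and differentiable); only at test time is the argmax photo taken as the t-th summary photo.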

Story Generator
To generate an album's story, given the album representation matrix V and photo summary probabilities p_t from the first two modules, we compute the visual summary representation g_t ∈ R^k (for the t-th summary step). This is a weighted sum of the album representations, i.e., g_t = p_t^T V. Each of these 5 embeddings g_t (for t = 1 to 5) is then used to decode one of the 5 story sentences respectively, as shown in the blue part of Fig. 1.
Given a story S = {s_t}, where s_t is the t-th summary sentence, the l-th word probability of the t-th sentence is

p_{t,l} = softmax(W_o h_{t,l}),  h_{t,l} = GRU(g_t, w_{t,l-1}, h_{t,l-1}),

where w_{t,l} = W_e s_{t,l} and W_e is the word embedding. The GRU takes the joint input of the visual summarization g_t, the previous word embedding w_{t,l-1}, and the previous hidden state, then outputs the next hidden state. The generation loss is then the sum of the negative log likelihoods of the correct words:

L_gen(S) = − Σ_{t=1}^{T} Σ_{l=1}^{L_t} log p_{t,l}(s_{t,l}).

To further exploit the notion of temporal coherence in a story, we add an order-preserving constraint on the sequence of sentences within a story (related to prior story-sorting ideas). For each story S we randomly shuffle its 5 sentences to generate negative story instances S′. We then apply a max-margin ranking loss to encourage correctly-ordered stories:

L_rank(S, S′) = max(0, m − log p(S) + log p(S′)).

The final loss is then a combination of the generation and ranking losses:

L = L_gen(S) + λ L_rank(S, S′).    (2)

Experiments
We use the Visual Storytelling Dataset, consisting of 10,000 albums with 200,000 photos. Each album contains 10-50 photos taken within a 48-hour span, with two types of annotation: 1) two album summarizations, each with 5 selected representative photos, and 2) 5 stories describing the selected photos.

Story Generation
This task is to generate a 5-sentence story describing an album. We compare our model with two sequence-to-sequence baselines: 1) an encoder-decoder model (enc-dec), where the sequence of album photos is encoded and the last hidden state is fed into the decoder for story generation; 2) an encoder-attention-decoder model (Xu et al., 2015) (enc-attn-dec) with weights computed using a soft-attention mechanism, where at each decoding time step a weighted sum of hidden states from the encoder is decoded. For fair comparison, we use the same album representation (Sec. 3.1) for the baselines. We test two variants of our model, trained with and without ranking regularization by controlling λ in our loss function, denoted h-attn (without ranking) and h-attn-rank (with ranking). Evaluations of each model are shown in Table 1. h-attn outperforms both baselines, and h-attn-rank achieves the best performance on all metrics. Note that we use beam search with beam size 3 during generation for a reasonable performance-speed trade-off (we observe similar improvement trends with beam size 1). To test performance under optimal image selection, we use one of the two ground-truth human-selected 5-photo sets as an oracle to hard-code the photo selection, denoted h-(gd)attn-rank. This achieves only a slightly higher METEOR than our end-to-end model.
Additionally, we run human evaluations in a forced-choice task where people choose between stories generated by different methods. For this evaluation, we select 400 albums, each evaluated by 3 Turkers. Results are shown in Table 2, and indicate a significant preference for our model over both baselines. As a simple Turing test, we also compare our results with human-written stories (last row of Table 2), indicating that there is still room for improvement.

Album Summarization
We evaluate the precision and recall of our generated summaries (output by the photo selector) against human selections (the combined set of both human-selected 5-photo sets). For comparison, we evaluate enc-attn-dec on the same task by aggregating its predicted attention and selecting the 5 photos with the highest accumulated attention. Additionally, we run DPP-based video summarization (Kulesza et al., 2012) using the same album features. As shown in Table 3, our models achieve higher performance than the baselines (though DPP also achieves strong results, indicating that there is still room to improve the pointer network).
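The summarization metrics above can be sketched as follows; the photo ids are illustrative. Gold is taken as the union of the two human-selected 5-photo sets, as described in the text.

```python
def precision_recall(predicted, human_a, human_b):
    """Precision/recall of a predicted summary set against the
    combined (union) set of two human 5-photo selections."""
    gold = set(human_a) | set(human_b)
    hit = len(set(predicted) & gold)
    return hit / len(predicted), hit / len(gold)

p, r = precision_recall(predicted=[1, 4, 7, 9, 12],
                        human_a=[1, 4, 5, 9, 20],
                        human_b=[1, 7, 8, 9, 21])
```

Since the two annotators may pick overlapping photos, the gold set has between 5 and 10 elements, which is why precision and recall use different denominators.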

Album Retrieval
Given a human-written story, we introduce a task to retrieve the album described by that story. We randomly select 1000 albums and one ground-truth story from each for evaluation. Using the generation loss, we compute the likelihood of each album A_m given the query story S and retrieve the album with the highest generation likelihood, A = argmax_{A_m} p(S|A_m). We use Recall@k and Median Rank for evaluation. As shown in Table 4, our models outperform the baselines, but the ranking term in Eqn. 2 does not improve performance significantly.
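The retrieval metrics can be sketched as below: for each query story, candidate albums are ranked by log p(S|A_m), and Recall@k counts queries whose true album lands in the top k. The log-likelihood scores here are made-up toy numbers.

```python
def rank_of_true(scores, true_idx):
    """1-based rank of the true album when candidates are sorted by
    descending log-likelihood log p(S | A_m)."""
    order = sorted(range(len(scores)), key=lambda m: -scores[m])
    return order.index(true_idx) + 1

def recall_at_k(ranks, k):
    # fraction of queries whose true album is ranked within the top k
    return sum(r <= k for r in ranks) / len(ranks)

# log p(S | A_m) for 3 queries over 4 candidate albums (true album = 0)
ranks = [rank_of_true(s, 0) for s in ([-2.0, -5.0, -3.0, -4.0],
                                      [-3.0, -1.0, -6.0, -2.0],
                                      [-4.0, -2.5, -3.5, -9.0])]
r_at_1 = recall_at_k(ranks, 1)
med = sorted(ranks)[len(ranks) // 2]       # median rank (odd count)
```

Median rank complements Recall@k by summarizing the full rank distribution rather than just the top of it.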

Conclusion
Our proposed hierarchically-attentive RNN based models for end-to-end visual storytelling can jointly summarize and generate relevant stories from full input photo albums effectively. Automatic and human evaluations show that our method outperforms strong sequence-to-sequence baselines on selection, generation, and retrieval tasks.

Figure 2: Examples of album summarization and storytelling by enc-attn-dec (blue), h-attn-rank (red), and ground-truth (green). We randomly select 1 out of 2 human album summaries as ground-truth here.