Incorporating Textual Evidence in Visual Storytelling

Previous work on visual storytelling has mainly focused on the image sequence as evidence for storytelling and has neglected textual evidence for guiding story generation. Motivated by the human storytelling process, in which people recall stories for familiar images, we exploit textual evidence from similar images to help generate coherent and meaningful stories. To pick the images that may provide textual experience, we propose a two-step ranking method based on image object recognition techniques. To utilize the textual information, we design an extended Seq2Seq model with a two-channel encoder and attention. Experiments on the VIST dataset show that our method outperforms state-of-the-art baseline models without heavy engineering.


Introduction
Multi-image visual storytelling is extended from a long trend of research in image captioning and has attracted considerable attention in recent years.
To generate the stories, previous work employed a Seq2Seq framework, using an image encoder to encode the image sequences and a sentence decoder to generate stories from the encoded image sequences. Most studies (Smilevski et al., 2018; Kim et al., 2018; Gonzalez-Rico and Pineda, 2018; Wang et al., 2018b; Huang et al., 2018; Yu et al., 2017) focused on improving the decoder and used simple concatenation or an LSTM as the encoder. With such a design, only images are used as input when generating the stories.
However, we observe that images alone are inadequate for visual storytelling. Storytelling is creative and diverse, so background knowledge is often required to convert a few images into a complete story. Extracting such background knowledge is very difficult, especially with limited data.
To alleviate this drawback, it is important to take previous experience of story-writing into account. Imagine a person who starts to tell stories from images: he/she may not understand the implications of those images and may fail to write a proper story. However, if he/she has heard others telling stories, he/she may be able to tell a story by drawing on the stories of similar image sequences that he/she previously heard. Motivated by this process, we propose to use a large corpus as an inventory and to improve the visual storytelling model by including stories from similar image sequences in the corpus as input, thereby strengthening the encoder design.
To build such a model, two major problems need to be solved: (1) how to measure the relatedness of stories from a pair of image sequences; (2) how to incorporate the textual information into the model so as to fully exploit it for storytelling.
To handle the first problem of picking the most relevant stories, we propose a two-step ranking method for their image sequences. We first filter out the 'dissimilar' images with object co-occurrence, and then sort the remaining candidates with feature vectors. For the second problem of incorporating textual information, we design an enhanced Seq2Seq model with a two-channel encoder, one channel for visual input and the other for textual input.
We conduct experiments on the VIST dataset (Huang et al., 2016), a widely used multi-image visual storytelling dataset. We show that with textual evidence, our model outperforms our baselines and state-of-the-art models.

Method
Our method is based on the Seq2Seq framework, composed of a two-channel encoder and an RNN-based decoder. The whole architecture of our method is shown in Figure 1.
In the two-channel encoder, one channel encodes visual evidence from the image sequence and the other encodes textual evidence from relevant stories. In the decoder, we adopt another RNN model to generate stories from the two encoder outputs. To integrate the two types of information, we use Luong attention (Luong et al., 2015) to dynamically attend to the stories. There are also other modifications, as further explained in Section 2.1.
To collect the textual evidence for encoder input, we design a selection method described in Section 2.2 to get stories from the most similar images.

Visual Storytelling Framework
Most previous work on visual storytelling followed the Seq2Seq framework: an image recognition model such as ResNet (He et al., 2015) or Inception (Szegedy et al., 2016) extracts image features, the features are fed into a story-level RNN encoder, and the encoder output is passed to the sentence-level decoder throughout the generation of the corresponding sentence.
We base our model on this framework with two key modifications: first, we design a text encoder to model the most similar stories, which may provide evidence for story generation; second, we adopt the Luong attention mechanism (Luong et al., 2015) on the textual side of the encoded input to better utilize its information.

Text Encoder
We use an RNN encoder to model the textual inputs. For each story, we feed its 5 sentences into the RNN one by one, retaining the hidden state across sentences. We pass the RNN output at every step through fully connected layers to obtain the encoder output.

Joint Decoder
Unlike previous methods, our decoder depends on both the image encoder and the text encoder. The key problem is how to combine the two encoders, and we adopt two approaches. First, we use the concatenation of the image encoder output, the embedding of the last word, and the last hidden states of the sentence encoder as the decoder input. Second, we add a Luong attention layer in the decoder to attend to the sentence encoder outputs. In the formal notation for the concatenation decoder and the downstream attention mechanism, DEC is the decoder RNN, $s^i_t$ is the RNN output for image $i$ at step $t$, emb is the word embedding, img and sent are the image and sentence encoder outputs, and $W_c$ and $b_c$ are the weight matrix and bias of the appended linear layer.
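The symbols defined above suggest the following form for the two decoder steps. This is only a plausible sketch and not the original equations: the previous word $w^i_{t-1}$, the final text-encoder state $\mathrm{sent}_{\mathrm{last}}$, the attention weights $\alpha$, the context vector $c$, and the choice of the dot-product score are all assumptions introduced here, following the general form of Luong et al. (2015).

```latex
% Assumed notation (not from the source): w^i_{t-1}, sent_last, alpha, c, dot-product score.
\begin{align}
  % Concatenation decoder: image feature, previous word embedding, last text-encoder state
  s^i_t &= \mathrm{DEC}\big(\big[\,\mathrm{img}_i \,;\, \mathrm{emb}(w^i_{t-1}) \,;\, \mathrm{sent}_{\mathrm{last}}\,\big]\big) \\
  % Attention weights over the text-encoder outputs sent_j (dot-product score assumed)
  \alpha^i_{t,j} &= \frac{\exp\!\big(s^i_t \cdot \mathrm{sent}_j\big)}{\sum_{j'} \exp\!\big(s^i_t \cdot \mathrm{sent}_{j'}\big)} \\
  % Context vector: weighted sum of the text-encoder outputs
  c^i_t &= \sum_j \alpha^i_{t,j}\, \mathrm{sent}_j \\
  % Attentional output through the appended linear layer (W_c, b_c)
  \tilde{s}^i_t &= \tanh\!\big(W_c\,\big[\,c^i_t \,;\, s^i_t\,\big] + b_c\big)
\end{align}
```

Under this reading, the concatenation supplies a fixed summary of the retrieved story at every step, while the attention layer lets the decoder revisit individual positions of that story.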
Note that in our model, both the decoder RNN and the image encoder are generic and not limited to one particular design. The image encoder can be of arbitrary architecture as long as it generates a vector for each image, and the decoder RNN can also be designed flexibly as long as it takes a vector as input and outputs another vector at each step.
Specifically, we implemented these modifications on two popular systems: GLACNet (Kim et al., 2018), from the group with the best human evaluation scores in the Visual Storytelling Challenge at NAACL 2018, who use a residual encoder to generate GLOCAL vectors; and XE-ss, a baseline model of Wang et al. (2018b), who proposed to improve performance with a reinforcement learning model (AREL). We call our two models GLAC-TG and XE-TG (see Section 3.1 for details).

Textual Evidence Selection
To provide strong textual evidence for story generation, we aim to select stories which are most similar to the expected story for the given sequence of images.
Under the assumption that similar images usually have similar stories, we take the stories of similar images as similar stories. While the most straightforward choice is the image with the most similar feature vector, our experiments show that comparing each pair of feature vectors over a large image corpus is computationally expensive and suffers severely from false positives. Therefore, we propose a two-step filter-and-sort method to pick out the most similar stories.

Filter
In the filter step, we use object co-occurrence to discriminate 'roughly similar' image sequences from 'dissimilar' ones. We filter by image object information because it conforms with the intuition that images with similar objects describe relevant events, and because object information has been widely used as helpful information in image captioning (Mishra and Liwicki, 2019; Liu et al., 2018; Jiang et al., 2018; Anderson et al., 2017; Yin and Ordonez, 2017; Wang et al., 2018a).
We first obtain the types and numbers of objects in each image using an object recognition model, and then measure image similarity with a categorical criterion and a numerical criterion. As mentioned above, we compare images in sequences. We measure the similarity between two sequences as the average score of their images. By filtering the corpus and keeping only the top-scored image sequences, we narrow down our candidate sequences to a modest size.
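The filter step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that images are compared position by position within the two sequences, and the names `sequence_score`, `filter_candidates`, and the `image_score` callback are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical representation: an image is a mapping from object label to count.
Image = Dict[str, int]
ImageScore = Callable[[Image, Image], float]


def sequence_score(seq_a: Sequence[Image], seq_b: Sequence[Image],
                   image_score: ImageScore) -> float:
    """Average the image-level scores over aligned positions (assumed pairing)."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be non-empty and of equal length")
    total = sum(image_score(a, b) for a, b in zip(seq_a, seq_b))
    return total / len(seq_a)


def filter_candidates(query: Sequence[Image], corpus: List[Sequence[Image]],
                      image_score: ImageScore, k: int) -> List[int]:
    """Return the corpus indices of the k sequences scored highest against the query."""
    scored = [(sequence_score(query, cand, image_score), idx)
              for idx, cand in enumerate(corpus)]
    # Sort by descending score; ties are broken by corpus order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]
```

Any image-level criterion (categorical or numerical) can be passed as `image_score`; `k` plays the role of the candidate size.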

Sort
After obtaining a small set of roughly similar image sequences, we use feature vectors to rank similarity more precisely. We experiment with two approaches: a simple cosine similarity measure, and a Bi-Linear model trained with METEOR score as gold annotation, inspired by Cao et al. (2018). Empirically, we find that the Bi-Linear model shows no advantage over cosine similarity. Thus, we simply sort the roughly similar sequences by cosine similarity for the downstream models.
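The cosine-based sort can be illustrated with a short sketch. It assumes that each image sequence has already been reduced to a single feature vector (how the per-image features are combined is not specified here), and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sort_candidates(query_vec: Sequence[float],
                    candidates: List[Tuple[str, Sequence[float]]]
                    ) -> List[Tuple[str, float]]:
    """Rank (candidate_id, vector) pairs by descending cosine similarity to the query."""
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in candidates]
    # list.sort is stable, so equal scores keep their original order.
    scored.sort(key=lambda pair: -pair[1])
    return scored
```

The stories attached to the top-ranked candidates would then serve as textual evidence for the encoder.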

Experiment Setup
Our experiments are built on the VIST dataset (Huang et al., 2016), which is organized in 5-image sequences annotated with complete 5-sentence stories. The dataset contains 40098 sequences for training, 4988 for validation and 5050 for testing.
In both models, we use ResNet152 (He et al., 2015) pre-trained on ImageNet (Krizhevsky et al., 2012) to extract image features, and we use a Bi-LSTM and a bidirectional GRU, respectively, as the image encoder.
In both models, we keep the hyper-parameters of the baseline models unmodified. As the loss function, we use cross-entropy averaged over the sentence lengths.
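One way to read the length-averaged loss is sketched below. This is an illustration under stated assumptions rather than the training code: the inputs are the probabilities the model assigns to each gold token, and averaging the per-sentence losses over the sentences of a story is an assumption made here; both function names are hypothetical.

```python
import math
from typing import List, Sequence


def sentence_cross_entropy(token_probs: Sequence[float]) -> float:
    """Negative log-likelihood of the gold tokens, divided by the sentence length."""
    if not token_probs:
        raise ValueError("sentence must contain at least one token")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def story_loss(story_token_probs: List[Sequence[float]]) -> float:
    """Mean of the length-normalized sentence losses (assumed aggregation)."""
    losses = [sentence_cross_entropy(sent) for sent in story_token_probs]
    return sum(losses) / len(losses)
```

Normalizing by length keeps long sentences from dominating the gradient relative to short ones.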
For textual evidence selection, we use all stories and image sequences in the training and validation sets as the reference corpus, and a Fast RCNN (He et al., 2017; Abdulla, 2017) model pre-trained on the COCO dataset (Lin et al., 2014) to detect objects in each image. Roughly similar stories are filtered with the numerical criterion at a candidate size of 500, as this setting shows the best performance.
Table 1: Performance of our method compared to existing visual storytelling models. R is ROUGE-L, C is CIDEr, and M is METEOR. Models we re-trained in the same setting as the original are listed in the (re-trained) rows.
In Table 1, we compare our models with several strong baselines on three automatic evaluation metrics: ROUGE-L, CIDEr and METEOR. In the top block of Table 1, we present 4 previous baselines: 1) a standard Seq2Seq baseline model developed by Huang et al. (2016); 2) a hierarchically attentive model designed by Yu et al. (2017); 3) the Seq2Seq model with sentence-wise separate decoders by Gonzalez-Rico and Pineda (2018); 4) reinforcement learning with topic-guided decoders by Huang et al. (2018). In the middle block, we present the GLACNet model (Kim et al., 2018) and our improved GLAC-TG model. In the bottom block, we present our XE-TG models, which improve on the XE-ss model in the AREL framework (Wang et al., 2018b). For a fair comparison, we evaluate all models with the open-source evaluation code (Yu et al., 2017).
The results show that both of our models outperform their corresponding baselines. Even when using textual evidence only, our XE-TG-only model shows competitive performance compared to the baselines. Moreover, our XE-TG models trained with cross-entropy loss outperform state-of-the-art baselines that use reinforcement learning techniques (Wang et al., 2018b; Huang et al., 2018). Because they use a simple cross-entropy loss, our models are also less costly to train, easier to tune and more stable when re-trained.
We conduct a qualitative analysis of the XE-TG-top1 model in Figure 2 as an example. The selected similar story shares the same topic of wilderness adventure and has a similar story flow. The generated story also captures the essence of the image sequence, with closely relevant basic details. This suggests that our textual evidence selection method is capable of selecting proper textual evidence, and that our storytelling framework is capable of capturing the provided information and telling fluent and coherent stories.

Analysis on Textual Evidence Selection
In this section, we further explore the effectiveness of similar stories. We experiment with filtering at candidate sizes of 50, 100 and 500 under both the categorical and the numerical criteria, use sorting over the entire reference corpus for comparison, and take the METEOR score as the metric of actual story similarity. Table 2 shows that for all methods, the selected stories are significantly more similar to the gold stories than randomly selected ones, and stories with higher rankings are generally better than those with lower rankings. Moreover, for both criteria, the candidate size has a negligible effect.

Conclusion
In this paper, we show that textual evidence from similar image sequences contains rich information for visual storytelling and is therefore capable of boosting storytelling performance. We propose a feasible two-step approach to extract textual evidence from a large corpus. We also design a two-channel encoder to incorporate textual and visual evidence into Seq2Seq visual storytelling models and achieve state-of-the-art performance without heavy engineering.

Figure 1: Overall architecture of our proposed method.
Formally, $O_a$ and $O_b$ are the sets of objects present in images $a$ and $b$ respectively, and $c^k_x$ is the number of occurrences of object $k$ in image $x$. The categorical criterion concerns the types of common objects, namely $\mathrm{score}_{cat} = \frac{|O_a \cap O_b|}{\sqrt{|O_a||O_b|}}$; the numerical criterion concerns the differences in the number of occurrences, namely $\mathrm{score}_{num} = \frac{|O_a||O_b|}{\left|\sum_{k \in (O_a \cup O_b)} (c^k_a - c^k_b)^2\right|}$. Additionally, we set similarity scores to 0 when no objects are recognized in either image.
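The two criteria defined above can be sketched in a few lines. This is an illustrative reading of the formulas, not the authors' code: images are represented as hypothetical label-to-count mappings, the score is taken to be 0 whenever at least one of the two images has no recognized objects, and returning infinity when the two count vectors are identical (zero denominator) is an assumption made here, since that case is not covered by the definition.

```python
import math
from typing import Dict, Set


def _objects(counts: Dict[str, int]) -> Set[str]:
    """Set of object types that occur at least once in the image."""
    return {label for label, n in counts.items() if n > 0}


def score_cat(counts_a: Dict[str, int], counts_b: Dict[str, int]) -> float:
    """Categorical criterion: shared object types over sqrt(|O_a| * |O_b|)."""
    objs_a, objs_b = _objects(counts_a), _objects(counts_b)
    if not objs_a or not objs_b:
        return 0.0  # no objects recognized in one of the images
    return len(objs_a & objs_b) / math.sqrt(len(objs_a) * len(objs_b))


def score_num(counts_a: Dict[str, int], counts_b: Dict[str, int]) -> float:
    """Numerical criterion: |O_a| * |O_b| over the summed squared count differences."""
    objs_a, objs_b = _objects(counts_a), _objects(counts_b)
    if not objs_a or not objs_b:
        return 0.0  # no objects recognized in one of the images
    sq_diff = sum((counts_a.get(k, 0) - counts_b.get(k, 0)) ** 2
                  for k in objs_a | objs_b)
    if sq_diff == 0:
        return math.inf  # assumption: identical counts are treated as maximally similar
    return len(objs_a) * len(objs_b) / sq_diff
```

For example, an image with two dogs and one ball compared against an image with one dog and one tree shares one of two object types on each side, giving a categorical score of 0.5.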

Figure 2: An example sequence of visual storytelling.

Table 2: METEOR scores for the top 1 to 5 similar stories under the two criteria. B-L refers to Bi-Linear.