Reading Between the Lines: Exploring Infilling in Visual Narratives

Generating long-form narratives such as stories and procedures from multiple modalities has been a long-standing goal of artificial intelligence. In such narratives, crucial subtext is often derived from the surrounding contexts. General seq2seq training methods leave models ill-equipped to bridge the gap between these neighbouring contexts. In this paper, we tackle this problem with infilling techniques, which involve predicting missing steps in a narrative while generating textual descriptions from a sequence of images. We also present a new large-scale visual procedure telling (ViPT) dataset, rich in such contextual dependencies, with a total of 46,200 procedures and around 340k pairwise images and textual descriptions. Generating steps with the infilling technique proves effective on visual procedures, producing more coherent texts. We conclusively show a METEOR score of 27.51 on procedures which is higher than the state-of-the-art on visual storytelling. We also demonstrate the effects of interposing new text with missing images during inference. The code and the dataset will be publicly available at https://visual-narratives.github.io/Visual-Narratives/.


Introduction
Humans process information from their surrounding contexts across multiple modalities. These situated contexts are often derived from one modality (source) and expressed in another modality (target). Recent advances have seen a surge of interest in vision and language as source and target modalities respectively. One such widely studied task is image captioning (Hossain et al., 2019), which provides a textual description T given an image I. In contrast, visual storytelling (Huang et al., 2016) is the task of generating a sequence of textual descriptions ({T_1, T_2, ..., T_n}) from a sequence of images ({I_1, I_2, ..., I_n}). This sequential context is the differentiating factor in the generation of visual narratives in comparison to image captioning in isolation. This long-form generation comprises a coherent sequence of multiple sentences.
Figure 1 (panel labels: Visual Feature Extraction, ResNet-152, Narrative Context Layer). Example recipe steps shown: (1) First cream the butter and the vanilla extract together with a hand mixer. It only takes a few minutes. (2) Scrape the sides of the bowl with a spatula as needed. Then give it a mix again for about 15 seconds. (3) Next add in your powdered sugar little by little, with the mixer on low, until all of the powdered sugar is blended in. (4) Next you can add a little bit of milk at a time to get the right consistency that you want. (5) To use a piping bag fold it over one of your hands, and open up the middle, or you can fold it over a tall glass. Caption (partially recovered): [...] in the second step is masked while the model generates the corresponding textual description from surrounding context.
A fundamental incongruity between how humans process information from multiple modalities and how we teach machines to do the same is that humans are capable of bridging the information gap from surrounding contexts, whereas our training procedures do not accommodate the same ability in a supervised learning paradigm. Traditionally, the problem of missing context in long text generation is addressed using additional input such as entities and actions (Fan et al., 2019; Dong et al., 2019), latent templates, external knowledge, etc. These are explicit methods to inject content during generation. In contrast, in the spirit of simplicity, we propose infilling techniques to implicitly interpolate the gap between surrounding contexts from a stream of images. The training procedure incorporates masked contexts with the objective of masked span prediction. We focus on two kinds of visual narratives, namely stories and procedures. We curated a large-scale ViPT dataset with pairwise image and text descriptions comprising 46k procedures and 340k images. The percentage of unique words in each step in comparison to the rest of the narrative is about 60% for ViST and 39% for ViPT. This implies that overlapping contexts are more predominant in procedures than in stories. This is usually because stories are more creative and diverse while procedures are in-domain. For both these reasons, we hypothesize that the infilling technique is more effective in scenarios where it can leverage the vast context from the surrounding information to fill in the missing pieces.
To this end, we present our infilling-based model to perform visual narrative generation and compare its effects on visual stories and procedures. The overview of the infilling-based training procedure is presented in Figure 1. We conclusively observe that it is more effective in procedural texts with stronger contextual dependencies. We also present the effects of infilling during the training and inference phases, and observe that infilling shows benefits during inference as well. The infilling-based techniques are also capable of generating longer sentences. Interpolating contexts to generate narrative descriptions has potential applications in fields such as digital education (Hollingshead, 2018), social media content (Gella et al., 2018), augmented reality (Dudley et al., 2018), video games (Kurihara et al., 2019; Ammanabrolu et al., 2019) [...] (Antol et al., 2015) and visual dialog (Das et al., 2017; Mostafazadeh et al., 2017; De Vries et al., 2017). While the task of generating a sentence from a single image, i.e., image captioning, has been well studied in the literature, generating a long-form sequence of sentences from a sequence of images has attracted attention only in the recent past. Hence, the natural next step is towards long-form sequential generation in the form of stories, procedures, etc., i.e., visual narrative telling.
Visual Storytelling: Huang et al. (2016) ventured into sequential step-wise generation of stories by introducing visual storytelling (ViST). Recent methods have tackled ViST using adversarial learning, reinforcement learning (Wang et al., 2018; Huang et al., 2019; Hu et al., 2020), modality fusion (Smilevski et al., 2018), traditional seq2seq models (Kim et al., 2018; Jung et al., 2020; Hsu et al., 2018) and explicit structures (Bosselut et al., 2016; Bisk et al., 2019). [...] also proposed a dataset of 16k recipes in a similar form. While these are all cooking recipes, the ViPT dataset comprises a mixture of ten different domains. Also, our dataset is about 2.8 times larger than the storyboarding dataset, with almost double the number of procedures in the domain of cooking recipes itself. Though the stories in ViST demonstrate a sense of continuity, the overarching sequential context is feeble. Procedures such as cooking recipes (Salvador et al., 2019; [...] paper studies the effects of infilling techniques for visual narrative generation. An alternate stream of work to improve the context in stories includes explicitly providing supporting information such as entities (Clark et al., 2018; Xu et al., 2018), latent templates (Wiseman et al., 2018), knowledge graphs (Yang et al., 2019), etc. In contrast to this, infilling provides an opportune platform to implicitly learn the contextual information. Our work is positioned at the intersection of infilling and multimodal language generation.

ViPT Description
While there are several types of narratives such as literary, factual and persuasive, this paper looks into stories and procedures. This section describes our new ViPT dataset and highlights the differences with ViST.
Procedures vs Stories: Long-form narratives are often characterized by three crucial properties: content, structure and surface form realization (Gatt and Krahmer, 2018). Content and structure differ markedly between stories and procedures. Content in stories includes characters and events, while content in procedures includes ingredients, materials and actions. In terms of structure, stories typically start by setting a scene and the era, followed by characterizing the participants, and culminate in a solution if an obstacle is encountered. In contrast, a procedural text is often goal oriented and thereby typically begins by listing the ingredients or materials needed, followed by a step-by-step description to arrive at the final goal. While stories can be metaphoric, sarcastic and humorous in surface realization, the sentences in procedures are often in an imperative or instructional tone.

Data Collection Process:
We manually examined around 10 blogging websites with user-written text on several how-to activities. Among these, we found that snapguide and instructables consistently pair textual descriptions with their images. We will release the scripts used to collect and preprocess this data. We removed all procedures in which any step lacks an image. After this preprocessing, the data contained the categories listed below for the two websites. These categories are based on the tags that bloggers assign to their articles from among the categories each website offers. The categories for each website are:
• snapguide: recipes, games-tricks, sports-fitness, gardening, style, lifestyle, outdoors, beauty, arts-crafts, home, music, photography, pets, automotive, technology
• instructables: crafts, cooking, teachers, circuits, living, workshop, outside
In union, they are a total of 18 categories. We manually examined a few procedures in each of the categories and regrouped them into 10 broad categories that are presented in Table 1. A list of URLs corresponding to the data is submitted along with the paper.
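The image-completeness filter described above can be sketched as follows. This is a minimal illustration and not the released preprocessing script: the record layout (a procedure as a list of step dictionaries with "text" and "image" keys) and the function names are assumptions made for this example.

```python
from typing import Dict, List, Optional

# Hypothetical record layout: a procedure is a list of steps, and each step
# is a dict with a "text" string and an optional "image" path.
Step = Dict[str, Optional[str]]


def has_complete_images(procedure: List[Step]) -> bool:
    """Return True only if every step carries a non-empty image reference."""
    return all(step.get("image") for step in procedure)


def filter_procedures(procedures: List[List[Step]]) -> List[List[Step]]:
    """Keep non-empty procedures whose steps all have an associated image."""
    return [p for p in procedures if p and has_complete_images(p)]
```

Under this sketch, a procedure is dropped as soon as a single step has no image, which keeps every remaining description tethered to an image.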
Visualization of topics: Each of the categories in our Visual Procedure Telling (ViPT) dataset is analyzed for the topics present in it. To give a more detailed understanding of these topics, we host the topic visualizations here: visual-narratives.github.io/Visual-Narratives/.
ViPT dataset: Though stories have the potential to exhibit the properties listed above, it is challenging to observe them in the ViST dataset (Huang et al., 2016) owing to the shorter sequence lengths. The extent to which adjacent groups of sentences have overlapping contexts is higher in procedures than in stories. We had previously gathered cooking recipes to experimentally demonstrate a scaffolding technique to improve structure in long [...] (2019). We extend this work to gather procedures or 'how-to' articles in several domains that have step-by-step instructions with an image paired to each step. To facilitate multi-domain research with stronger interleaved contexts between surrounding steps, we present a large-scale visual procedure telling dataset with 46k procedures comprising 340k pairwise images and textual descriptions. It is carefully curated from a number of how-to blogging websites. Each textual description corresponds to an image and typically describes one step in a procedure. This means that each step description is tethered to an image, which makes it a visual narrative telling task. We categorized the dataset into 10 distinct domains: recipes, crafts, outdoors, lifestyle, technology, styling, fitness, hobbies, pets and miscellaneous. The category-wise details of the dataset are presented in Table 1. As we can observe, the dataset is dominated by cooking recipes, which, compared to the rest of the domains, are of a size similar to ViST.

Differences between ViPT and ViST datasets:
As observed in Table 1, the average number of steps in ViPT is higher than that of ViST. However, the average numbers of steps in recipes and stories are similar, at 5.96 and 5.00 respectively. The average number of words per step in ViPT is also much higher, thereby presenting a more challenging long-form text generation task. Despite the average number of steps being similar, the average length of each step, i.e., the number of words per step, in cooking recipes is about 7 times that of stories. Typically, each step in the ViPT dataset comprises multiple sentences that describe the corresponding image. This is as opposed to the ViST dataset, which has a single sentence per step. These long sequences also present a case for dealing with larger vocabularies. The recipes category alone has a vocabulary of 109k tokens while that of stories is 25k. We also compared the diversity in vocabulary of each step by computing the average percentage of unique words in a step with respect to the rest of the narrative. While this number is a high 60% for ViST, it is 39% for ViPT. This means that about 40% of the words in each step in ViST overlap with the rest of the story. This could be owing to the way the dataset was gathered, by asking the annotators to pick a sequence of images that are likely to make a coherent story and then to describe these images in sequence. While the stories-in-sequences sufficiently distinguish themselves from descriptions-in-isolation, the overlapping contexts are not high compared to procedures. The overlapping contexts for procedures are about 61%. This reveals the stronger cohesive and overlapping contexts in the ViPT dataset, as compared to the ViST dataset. These overlapping contexts motivate the idea of generating a sentence by bridging the contexts from surrounding sentences. Hence ViPT forms a suitable test bed to learn interpolation from surrounding contexts with the infilling technique.
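The unique-word statistic used in this comparison can be sketched in a few lines. This is an illustrative sketch only: lowercasing, whitespace tokenization and counting over word types rather than tokens are assumptions of this example, and the function name is hypothetical; the paper does not specify these details.

```python
from typing import List


def unique_word_percentage(steps: List[str]) -> float:
    """Average, over steps, of the share of a step's word types that do not
    occur in any other step of the same narrative (in percent)."""
    # Assumption: lowercase + whitespace tokenization, counted over word types.
    tokenized = [set(step.lower().split()) for step in steps]
    percentages = []
    for i, words in enumerate(tokenized):
        if not words:
            continue  # skip empty steps
        rest = set().union(*(w for j, w in enumerate(tokenized) if j != i))
        unique = words - rest
        percentages.append(100.0 * len(unique) / len(words))
    return sum(percentages) / len(percentages) if percentages else 0.0
```

The overlap reported in the text is the complement of this quantity, so a unique-word percentage of 39% corresponds to an overlap of about 61%.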

Models Description
This section describes the baseline model and the infilling techniques adopted on top of it. We present infilling-based techniques for learning missing visual contexts to generate narrative text from a sequence of images. As ViST and the recipes category in ViPT are of comparable sizes (both in terms of data size and the average number of steps per instance), we perform comparative experimentation on these two categories. We leave experimenting with all the domains for future work, especially learning from one domain to generate the sequences in other domains. For our ViPT category, we use 80% for training, 10% for validation and 10% for testing. The stories are composed of 5 steps and the cooking recipes are truncated to 5 steps to perform a fair comparison of the effect of the index being infilled. An overview of infilling-based training is depicted in Figure 1. The underlying encoding and decoding stages are described here.
Encoding: Models 1, 2 and 3 here show different variants of encoding with and without infilling. Model 4 is the state of the art model for generating stories on ViST. Note that the encoding part of the missing contexts varies between these models while the decoding strategy remains the same to compare (i) the performance of encoding masked contexts as opposed to not masking, and (ii) the performance of masked span prediction between stories and procedures.

XE (baseline):
We choose a strong baseline model based on sequence-to-sequence modeling with cross entropy (XE) loss, inspired by Wang et al. (2018). It is a CNN-RNN architecture. The visual features are extracted from the penultimate layer of ResNet-152 by passing the resized images ({I_1, I_2, ..., I_n}) of size 224 × 224. These represent the image-specific local features ({l_1, l_2, ..., l_n}). These features are then passed through a bidirectional GRU layer to obtain narrative-level global features ({g_1, g_2, ..., g_n}), constituting the narrative context layer in Figure 1.

V-Infill:
We introduce an infilling indicator function on the underlying XE model by randomly sampling an infilling index (in_idx). This is used to construct the final infilled local features as follows.
The local features at all indices other than the sampled in_idx remain the same. The local features for in_idx are all masked to a zero tensor. Dropping the entire set of local features of an image forces the model to learn to bridge the context from the images to the left and right of in_idx. The model is optimized to predict the steps where images are present along with the masked span prediction. In this way, the infilling mechanism encourages our underlying seq2seq model to learn the local representation of the missing context from contextual global features in the narrative context layer.
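The masking step described above can be illustrated with a small sketch. It is a simplified stand-in, assuming each local feature is a plain list of floats rather than a ResNet-152 tensor; the function name and the optional `rng` argument are hypothetical and introduced only for this example.

```python
import random
from typing import List, Optional, Tuple


def mask_local_features(
    local_features: List[List[float]],
    in_idx: Optional[int] = None,
    rng: Optional[random.Random] = None,
) -> Tuple[List[List[float]], int]:
    """Return a copy of the local features with the step at in_idx zeroed.

    If in_idx is not given, it is sampled uniformly over the steps.
    """
    rng = rng or random.Random()
    if in_idx is None:
        in_idx = rng.randrange(len(local_features))
    masked = [
        # Zero vector for the infilled step, unchanged copy otherwise.
        [0.0] * len(feat) if k == in_idx else list(feat)
        for k, feat in enumerate(local_features)
    ]
    return masked, in_idx
```

The masked features would then be passed to the narrative context layer as usual, so that the global feature at in_idx has to be built from the neighbouring steps.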

V-InfillR:
This model varies the Rate at which local features are masked as training proceeds, based on the indicator function of the V-Infill model above. The schedule for the number of missing features is itself a hyperparameter, and we used the following setting: in the first quarter of training epochs, none are masked; in the next quarter, 1 local feature is masked; and in the last two quarters, 2 local features are masked. This is similar to the settings observed in the INet model. We experimented with other scheduling settings, but this one performed better than the others.
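The schedule above can be written as a small function. This is a sketch under stated assumptions: epochs are 0-indexed, and quarter boundaries are obtained by integer division when the number of epochs is not a multiple of four; neither detail is specified in the text, and the function name is hypothetical.

```python
def num_masked_features(epoch: int, total_epochs: int) -> int:
    """Number of local features to mask at a given 0-indexed epoch.

    Quarter 0 masks nothing, quarter 1 masks one feature, and the
    last two quarters mask two features.
    """
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs)")
    quarter = (4 * epoch) // total_epochs  # 0, 1, 2 or 3
    return min(quarter, 2)
```

For example, with 8 training epochs this sketch masks 0 features in epochs 0 and 1, 1 feature in epochs 2 and 3, and 2 features from epoch 4 onwards.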
As mentioned earlier, the encoding of the local features changes based on the infilling technique used in each of the above strategies. The contribution of the global features to reconstructing the missing local context is intuitively expected to work well for narratives with overlapping contexts. Hence, we hypothesize that infilling is most beneficial when the interpolated steps contain words or phrases similar to those of the surrounding steps. A 'how-to' style of narrative explaining a procedure is more in-domain than a story, and hence we hypothesize that our infilling-based encoding approaches perform relatively better on procedures. We then use the encoded representation to decode each step of the procedure or story. The decoding strategy, which is the same in all three of the aforementioned models, is explained next.
Decoding: In all the above models, the global features g_k are fed into a GRU decoder to predict each word (ŵ_t) of step k. The same is done for each of the five steps. In the infilling methods, the decoding strategy is agnostic to the missing context in the local features. The global features that bridge the contexts in the encoding are used directly as input to the decoder. In other words, the network remains the same once the global features are predicted. We perform beam search with a beam size of 3 during inference. Here τ is the number of words in each step and t is the current time step.
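For readers unfamiliar with beam search, the following sketch shows the general procedure with a beam size of 3. It is a generic textbook version, not the decoder used in the experiments: the `step_fn` interface, the `<eos>` marker, the absence of length normalization and the toy model are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: given the words generated so far, return a mapping
# from each candidate next word to its log-probability.
StepFn = Callable[[List[str]], Dict[str, float]]


def beam_search(step_fn: StepFn, beam_size: int = 3, max_len: int = 20,
                eos: str = "<eos>") -> List[str]:
    """Return the highest-scoring word sequence found by beam search."""
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]
    finished: List[Tuple[float, List[str]]] = []
    for _ in range(max_len):
        candidates = []
        for score, words in beams:
            for word, logp in step_fn(words).items():
                candidates.append((score + logp, words + [word]))
        # Keep only the beam_size best hypotheses at this time step.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, words in candidates[:beam_size]:
            if words[-1] == eos:
                finished.append((score, words[:-1]))  # drop the end marker
            else:
                beams.append((score, words))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    if not finished:
        return []
    return max(finished, key=lambda c: c[0])[1]


def toy_step_fn(words: List[str]) -> Dict[str, float]:
    """Toy next-word model (made-up log-probabilities) for demonstration."""
    table = {
        (): {"add": -0.5, "mix": -0.9},
        ("add",): {"salt": -0.7, "<eos>": -0.7},
        ("mix",): {"well": -0.1, "<eos>": -2.3},
    }
    return table.get(tuple(words), {"<eos>": 0.0})
```

On the toy model, a beam of size 3 recovers the sequence "mix well", whereas a beam of size 1 (greedy decoding) commits to "add" at the first step and ends with a lower-scoring sequence.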

INet:
We re-implemented the model achieving the state of the art results (Hu et al., 2020) on the visual storytelling dataset. Additionally, they use a relational embedding layer that captures relations across spatio-temporal sub-spaces. Our replication of their model is close to the scores reported in their paper, though not exact. Our re-implementation achieved a 35.5 METEOR and 63.3 BLEU-1 in comparison to the scores reported in their paper which are 35.6 and 64.4.
Hyperparameter Setup: We use a GRU with a hidden dimension of 256 for the encoder and 512 for the decoder. The word embedding dimension is 512. We optimize with Adam using a learning rate of 4e-4 and a smoothing term of 1e-8. We use a dropout of 0.2 and a momentum of 0.9 with a gradient clipping of 10. A transformer-based encoder with autoregressive decoding performed comparatively worse in our experiments, and hence we proceed with a GRU-based model. Based on the average number of steps in recipes from Table 1, which is 5.96, we truncate the recipes to 6 steps.

Results and Discussion
In this section, we present the effects of infilling during both training and inference on the ViST and ViPT datasets. We also present an analysis based on the length of generated sequences along with a qualitative demonstration.
Infilling during training: The overall performance of the models is presented in Table 3. Both the infilling model variants achieve higher scores on the recipes while not decreasing their performances on stories. We also observed that increasing the number of masked local features beyond 2 drastically decreases the performance on both datasets.
Infilling during inference for Visual Procedure Telling: Acquiring parallel pairwise image and narrative data in the wild is often not feasible. Hence, we perform infilling not only at train time but also at inference time, to evaluate the ability of the model to bridge contexts when the corresponding image is absent and to deal with real-world data imputation scenarios. Table 4 shows the effects of infilling different indices during the inference stage. As observed, the automatic scores are affected detrimentally when the infilled index is to the left, i.e., a lower index. This is because the beginning of a procedure usually introduces the dish and then lists the ingredients. For this reason, the density of entities at the beginning of the procedure is usually higher, and reconstructing them from the rest of the recipe is difficult. However, as we move from left to right, i.e., as we gradually increase the infilled index, we observe an increasing trend in the automatic metrics.
Figure 2: Comparison of V-Infill and XE dealing with infilling context during inference (for making chicken roast). GT corresponds to the ground truth step. The index in each row corresponds to the index of the missing image. Columns are the step indices 0 to 5.
GT (steps recovered from the figure): Heat a little oil on a stainless steel or cast iron skillet on high heat and wait for the oil to start smoking. / After 3-5 minutes pass, use a spatula to check the meat. / Once the meat is to your desired temperature, rest on a plate and cover with foil, 5-10 minutes.
XE, missing index 0:
0: Cut the chicken into small pieces.
1: The first thing you will need to do is put the meat in the pan.
2: Heat the oil in a pan and add the olive oil.
3: Add the salt and pepper to the pan.
4: Once the chicken is cooked, remove the chicken from the oven and let it cool for about 10 minutes.
5: Now that you have a hot pan, it's time to cook the chicken.
Remarks: The content of step 1 is being copied to step 0. Recovering from incorrect steps until step 2.
XE, missing index 4:
0: For this recipe you will need the following ingredients: -1 cup of water, salt and pepper
1: Cut the chicken breast into small pieces.
2: The first thing you need to do is put the meat in the pan and let it cool for about 10 minutes.
3: Place the pan on the stove and heat the pan on medium heat.
4: Make sure it is properly cooked.
5: Once the chicken is done, remove the chicken from the oven and let it cool for a few minutes
Remarks: The infilled sentence at step 4 is generic and underspecified.
V-Infill, missing index 0:
0: You will need the following ingredients
1: Cut the chicken breast into small pieces.
2: Add the chicken to the pan.
3: Heat the oil in a frying pan over medium heat.
4: Once the chicken is done, add the chicken to the pan and stir it up.
5: Now that you have a little bit, you will need to put the meat in the oven at 180 degrees celsius circulating air for about 20 minutes.
Remarks: The ingredients are still underspecified.
V-Infill, missing index 4:
0: You will need the following ingredients: -meat, 1 cup of butter and salt
1: Cut the onion into small pieces.
2: Cut the chicken breast into small pieces.
3: Heat the oil in a frying pan over medium heat.
4: Once the chicken is ready, add the chicken to the pan.
5: Once the chicken is done, remove the pan from the pan and let it sit for about 10 minutes.
Remarks: Phrase from step 5 copied to step 4.
Infilling during inference for Visual Story Telling: Table 5 demonstrates the effects of infilling various indices during inference. This table is the analogue of Table 4 for stories. As we can see, a similar increasing trend in all the automatic metrics is present as we move the infill index to the right of the story. Even so, a very interesting observation is that the difference between the performance of the XE and Infill models for any given index is much higher for recipes than for stories. The infilling technique brings much more value to the task when the nature of the text is procedural and depends more on the surrounding contexts.
Lengths of generated sequences: We compare infilling during inference between the baseline XE model and our V-Infill model in Table 4. While the METEOR scores remain comparable, the BLEU scores steadily increase as we move in_idx to the right. Specifically, these jumps are bigger after step 3. Quantitatively, this is the result of the model being able to produce longer sequences as we move to the right, since BLEU penalizes short sentences. Qualitatively, this implies that the initial steps, such as specifying the ingredients, are more crucial than the later ones. A similar observation emerges from analyzing the effects of infilling during training. The average length of recipes generated by XE is 71.26 and by V-Infill is 76.49. A similar trend is observed for stories in Table 5.
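The penalty on short outputs mentioned above comes from BLEU's brevity penalty, which the following sketch computes. It follows the standard definition (1 when the candidate is at least as long as the reference, exp(1 - r/c) otherwise); the function name is chosen for this example, and returning 0 for an empty candidate is a convention assumed here.

```python
import math


def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """Standard BLEU brevity penalty for a candidate of length c and an
    effective reference length r: 1 if c >= r, else exp(1 - r / c)."""
    if candidate_len <= 0:
        return 0.0  # assumed convention for an empty candidate
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)
```

Because the penalty grows towards 1 as the candidate length approaches the reference length, a model that generates longer steps can score higher on BLEU even when its n-gram precision is unchanged.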
Qualitative Discussion: Figure 2 shows examples generated by infilling different indices. The top row shows the ground truth steps for the corresponding images. The indices on the top row are the indices of the images or steps, and the indices in the left column (in blue) are the indices whose local features are masked. As observed, the XE model exhibits two strategies to recover the missing context. The first is directly copying similar content from the adjacent step. For instance, when the image at index 0 is masked, the XE model generates cutting (from trimming) and chicken (from meat) based on the following step. This has nothing to do with the actual description of the corresponding step. In contrast, our V-Infill model is able to generate a sentence indicating that ingredients are being listed in this case. Since the first step is incorrectly generated by the baseline, it becomes harder to recover and generate the correct sequence for the rest of the procedure. The second strategy is generating generic sentences. When the infilled index is 4, the baseline model generates a sentence that is generic and not specific to the given set of images. In this case, it generates a statement that says to make sure that it is properly cooked. Our V-Infill model is able to bridge the context from step 3 about heating the oil and step 5 about removing the pan, and hence interpolates the missing context as placing the chicken in the pan. Despite the recovery strategies used in both these methods, there is a common problem in the generated steps: details are omitted, leading to under-specification. For instance, the actions in step 4 are under-specified by XE when the infilled index is 4. Similarly, the V-Infill model underspecifies the ingredients.
Human Evaluation: Figure 3 depicts a screenshot of our human evaluation interface. A sequence of images is presented at the top of the screen. This evaluation is conducted to compare the XE and V-Infill models. The generated sentences from both models are presented after the images. Note that the generated outputs are presented in random order for each example to ensure there is no bias during preference testing. Human subjects are asked to pick one of the generated recipes for the given sequence of images based on its relevance to the images. Each user is presented with 10 such recipes, and we averaged the preference scores across 20 evaluators.
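The preference test described above can be sketched as follows. This is an illustration of the protocol and not the evaluation code: the function names, the system labels and the choice to average per-evaluator selection rates are assumptions of this example, and no real evaluation data is shown.

```python
import random
from typing import List, Tuple


def shuffled_pair(xe_text: str, infill_text: str,
                  rng: random.Random) -> List[Tuple[str, str]]:
    """Return the two system outputs in random order, each tagged with a
    system label that would be hidden from the evaluator."""
    pair = [("XE", xe_text), ("V-Infill", infill_text)]
    rng.shuffle(pair)
    return pair


def average_preference(choices_per_evaluator: List[List[str]],
                       system: str) -> float:
    """Mean over evaluators of the fraction of examples for which
    `system` was the preferred output."""
    rates = [
        choices.count(system) / len(choices)
        for choices in choices_per_evaluator
        if choices
    ]
    return sum(rates) / len(rates) if rates else 0.0
```

In this sketch, each inner list would hold one evaluator's picks over the recipes shown to them, and the returned value is the averaged preference score for the chosen system.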

Conclusions and Future Work
We demonstrate that infilling is a simple yet effective technique and a step towards maximizing the utilization of surrounding contexts in visual narratives. Infilling is the strategy of enabling the model to learn surrounding contextual information by masking spans of the input while the decoder attempts to generate the entire text. The model is provided with masked contexts as input and is optimized with the objective of masked span prediction. We hypothesize that this technique provides gains in narratives with a higher extent of overlapping contexts, since this provides an opportunity to reconstruct the missing local context from the overall global context. To this end, we introduce a new large-scale ViPT dataset of 46k procedures and 340k image-text pairs comprising 10 categories. To experimentally support our hypothesis, we compare the performance of our model on stories and procedures. We conclusively show the higher significance of infilling-based techniques in visual procedures compared to visual stories. We also compare infilling during the training and inference phases. With infilling during training, our V-Infill model performs better on visual procedures than on stories. With infilling during inference, our V-Infill model performs better on both stories and procedures. In the case of stories, infilling during inference is surprisingly better than the fully supervised seq2seq model and very close to the state-of-the-art model as well. In future, we plan to explore the following two directions: (1) interpolating the contexts between consecutive steps by introducing a new infilled image, which addresses the data imputation problem as well as generating longer explanations for unclear steps; and (2) addressing the underspecification problem by controlling the content in the infilled image with explicit guidance, as opposed to the implicit content filling that we perform through interpolation.
These infilling techniques are also immensely useful when dealing with data imputation with missing contexts and collaborative authoring in real world scenarios.