Storyboarding of Recipes: Grounded Contextual Generation

Information need of humans is essentially multimodal in nature, enabling maximum exploitation of situated context. We introduce a dataset for sequential procedural (how-to) text generation from images in cooking domain. The dataset consists of 16,441 cooking recipes with 160,479 photos associated with different steps. We setup a baseline motivated by the best performing model in terms of human evaluation for the Visual Story Telling (ViST) task. In addition, we introduce two models to incorporate high level structure learnt by a Finite State Machine (FSM) in neural sequential generation process by: (1) Scaffolding Structure in Decoder (SSiD) (2) Scaffolding Structure in Loss (SSiL). Our best performing model (SSiL) achieves a METEOR score of 0.31, which is an improvement of 0.6 over the baseline model. We also conducted human evaluation of the generated grounded recipes, which reveal that 61% found that our proposed (SSiL) model is better than the baseline model in terms of overall recipes. We also discuss analysis of the output highlighting key important NLP issues for prospective directions.


Introduction
Interpretation is heavily conditioned on context. Real world interactions provide this context in multiple modalities. In this paper, the context is derived from vision and language. The description of a picture changes drastically when seen in a sequential narrative context. Formally, this task is defined as: given a sequence of images I = {I 1 , I 2 , ..., I n } and pairwise associated textual descriptions, T = {T 1 , T 2 , ..., T n }; for a new sequence I , our task is to generate the corresponding T . Figure 1 depicts an example for making vegetable lasagna, where the input is the first row and the output is the second row. We call this a 'storyboard', since it unravels the most important steps of a procedure associated with corresponding natural language text. The sequential context differ-Lasagna ingredients: tomato sauce or canned tomatoes for making sauce -at least 4-6 cups, one box no boil lasagna noodles, one zucchini, one yellow squash, one jalapeno (or bell pepper!), 1/2 an onion, pinch of oregano, pinch of basil.
To make the sauce, cook diced onion in olive oil, and then add the ground beef, garlic and tomato paste. Stir until fragrant and then meat starts to brown and break up, and then add the crushed tomatoes. Pour some water into tomato can and swish it around and then pour that into the pot. Stir well and let simmer while the veg continue to lose moisture.
Spoon your ricotta into a bowl and add a good pinch of Italian seasoning and crushed red pepper. I like to add a little black pepper too. Mix around until well combined. Shred your mozzarella or cut into small slices. This is the way I layered: spoonful of sauce on the bottom of the pan, lasagna noodles, 1/2 the ricotta cheese, 1/2 the sauteed vegetables, mozzarella cheese, sauce to cover. Do that twice and then sprinkle parmesan cheese on the top.
Bake in a 400 F oven for 30-40 minutes or until you can easily pierce through the noodles with a knife and the top is lightly browned. Try not to eat it all at once. The boy and I have eaten 1/2 of it, and it's only been a day since I made it. :D Figure 1: Storyboard for the recipe of vegetable lasagna entiates this task from image captioning in isolation. The dataset is similar to that of ViST (Huang et al., 2016) with an apparent difference between stories and instructional in-domain text which is the clear transition in phases of the narrative. This task supplements the task of ViST with richer context of goal oriented procedure (how-to). Numerous online blogs and videos depict various categories of how-to guides for games, do-it-yourself (DIY) crafts, technology etc. This task lays initial foundations for full fledged storyboarding of a given video, by selecting the right junctions/clips to ground significant events and generate sequential textual descriptions. We are going to focus on the domain of cooking recipes in the rest of this paper.In this paper, we discuss our approach in generating more structural/coherent cooking recipes by explicitly modeling the state transitions between different stages of cooking (phases). We introduce a framework to apply traditional FSMs to incorporate more structure in neural generation.
The two main contributions of this paper are: (1) A dataset of 16k recipes targeted for sequential multimodal procedural text generation, (2) Two models (SSiD: Structural Scaffolding in Decoder ,and SSiL: Structural Scaffolding in Loss) for incorporating high level structure learnt by an FSM into a neural text generation model to improve structure/coherence.

Related Work
Why domain constraint? Martin et al. (2017) and Khalifa et al. (2017) demonstrated that the predictive ability of a seq2seq model improves as the language corpus is reduced to a specialized domain with specific actions. Our choice of restricting domain to recipes is inspired from this, where the set of events are specialized (such as 'cut', 'mix', 'add') although we are not using event representations explicitly. These specialized set of events are correlated to phases of procedural text as described in the following sections. Planning while writing content: A major challenge faced by neural text generation (Lu et al., 2018) while generating long sequences is the inability to maintain structure, contravening the coherence of the overall generated text. This aspect was also observed in various tasks like summarization (Liu et al., 2018), story generation (Fan et al., 2019). Pre-selecting content and planning to generate accordingly was explored by Puduppully et al. (2018) and Lukin et al. (2015) in contrast to generate as you proceed paradigm. Fan et al. (2018) adapt a hierarchical approach to generate a premise and then stories to improve coherence and fluency. Yao et al. (2018) experimented with static and dynamic schema to realize the entire storyline before generating. However, in this work we propose a hierarchical multi task approach to perform structure aware generation. Comprehending Food: Recent times have seen large scale datasets in food, such as Recipe1M (Marin et al., 2018), Food-101 (Bossard et al., 2014).Food recognition (Arora et al., 2019) addresses understanding food from a vision perspective.  worked on generating cooking instructions by inferring ingredients from an image. Zhou et al. (2018) proposed a method to generate procedure segments for YouCook2 data. In NLP domain, this is studied as generating procedural text by including ingredients as checklists (Kiddon et al., 2016) or treating the recipe as a flow graph (Mori et al., 2014). Our work is at the intersection of two modalities (language and vision) by generating procedural text for recipes from a sequence of images. (Bosselut et al., 2017) worked on reasoning non-mentioned causal effects thereby improving the understanding and generation of procedural text for cooking recipes. This is done by dynamically tracking entities by modeling actions using state transformers. Visual Story Telling: Research at the intersection of language and vision is accelerating with tasks like image captioning (Hossain et al., 2019), visual question answering (Wu et al., 2017), visual dialog (Das et al., 2017;Mostafazadeh et al., 2017;De Vries et al., 2017;de Vries et al., 2018).
ViST (Huang et al., 2016) is a sequential vision to language task demonstrating differences between descriptions in isolation and stories in sequences. Similarly, Gella et al. (2018) created VideoStory dataset from videos on social media with the task of generating a multi-sentence story captions for them. Smilevski et al. (2018) proposed a late fusion based model for ViST challenge. Kim et al. (2018) attained the highest scores on human readability in this task by attending to both global and local contexts. We use this as our baseline model and propose two techniques on top of this baseline to impose structure needed for procedural text.

Data Description
We identified two how-to blogs from: instructables.comand snapguide.com, comprising stepwise instructions (images and text) of various how-to activities like games, crafts etc,. We gathered 16,441 samples with 160,479 photos for food, dessert and recipe topics. We used 80% for training, 10% for validation and 10% for testing our models. In some cases, there are multiple images for the same step and we randomly select an image from the set of images. We indicate that there is a potential space for research here, in selecting most distinguishing/representative/meaningful image. Details of the datasets are presented in Table 1. The data and visualization of distribution of topics is here 1 . A trivial extension could be done on other domains like gardening, origami crafts, fixing guitar strings etc, which is left for future work.

Model Description
We first describe a baseline model for the task of storyboarding cooking recipes in this section. We then propose two models with incremental improvements to incorporate the structure of procedural text in the generated recipes : SSiD (Scaffolding Structure in Decoder) and SSiL (Scaffolding Structure in Loss). The architecture of scaffolding structure is presented in Figure 2, of which different aspects are described in the following subsections.

Baseline Model (Glocal):
The baseline model is inspired from the best performing system in ViST challenge with respect to  (Kim et al., 2018). The images are first resized into 224 X 224. Image features for each step are extracted from the penultimate layer of pre-trained ResNet-152 . These features are then passed through an affinity layer to obtain an image feature of dimension 1024. To maintain the context of the entire recipe (global context), the sequence of these image features are passed through a two layered Bi-LSTM with a hidden size of 1024. To maintain specificity of the current image (local context), the image features for the current step are concatenated using a skip connection to the output of the Bi-LSTM to obtain glocal representation. Dropout of 0.5 is applied systematically at the affinity layer to obtain the image feature representation and after the Bi-LSTM layer. Batch normalization is applied with a momentum 0.01. This completes the encoder part of the sequence to sequence architecture. These glocal vectors are used for decoding each step. These features are passed through a fully connected layer to obtain a representation of 1024 dimension followed by a non-linear transformation using ReLU. These features are then passed through a decoder LSTM for each step in the recipe which are trained by teacher forcing. The overall coherence in generation is addressed by feeding the decoder state of the previous step to the next one. This is a seq2seq model translating one modality into another. The model is optimized using Adam with a learning rate of 0.001 and weight decay of 1e-5. The model described above does not explicitly cater to the structure of the narration of recipes in the generation process. However, we know that procedural text has a high level structure that carries a skeleton of the narrative. In the subsequent subsections, we present two models that impose this high level narrative structure as a scaffold. While this scaffold lies external to the baseline model, it functions on imposing the structure in decoder (SSiD) and in the loss term (SSiL).

Scaffolding Structure in Decoder (SSiD):
There is a high level latent structure involved in a cooking recipe that adheres to transitions between steps, that we define as phases. Note that the steps and phases are different here. To be specific, according to our definition, one or more steps map to a phase (this work does not deal with multiple phases being a part of a single step). Phases may be 'listing ingredients', 'baking', 'garnishing' etc., The key idea of the SSiD model is to incorporate the sequence of phases in the decoder to impose structure during text generation There are two sources of supervision to drive the model: (1) multimodal dataset M = {I, T} from Section 3, (2) unimodal textual recipes 2 U to learn phase sequences. Finer phases are learnt using clustering followed by an FSM. Clustering: K-Means clustering is performed on the sentence embeddings with compositional ngram features (Pagliardini et al., 2018) on each step of the recipe in U. Aligning with our intu-ition, when k is 3, it is observed that these clusters roughly indicate categories of desserts, drinks and main course foods (pizza, quesadilla etc,). However, we need to find out finer categories of the phases corresponding to the phases in the recipes. We use k-means clustering to obtain the categories of these phases. We experimented with different number of phases P as shown in Table 2. For example, let an example recipe comprise of 4 steps i.e, a sequence of 4 images. At this point, each recipe can be represented as a hard sequence of phases r = p 1 , p 2 , p 3 , p 4 . FSM: The phases learnt through clustering are not ground truth phases. We explore the usage of an FSM to individually model hard and a softer representation of the phase sequences by leveraging the states in an FSM. We first describe how the hard representation is modeled. The algorithm was originally developed for building language models for limited token sets in grapheme to phoneme prediction. The iterative algorithm starts with an ergodic state for all phase types and uses entropy to find the best state split that would maximize the prediction. As opposed to phase sequences, each recipe is now represented as a state sequence (decoded from FSM) i.e, r = s 1 , s 2 , s 3 , s 4 (hard states). This is a hard representation of the sequence of states.
We next describe how a soft representation of these states is modeled. Since the phases are learnt in an unsupervised fashion and the ground truth of the phases is not available, we explored a softer representation of the states. We hypothesize that a soft representation of the states might smooth the irregularities of phases learnt. From the output of the FSM, we obtain the state transition probabilities from each state to every other state. Each state s i can be represented as q ij ∀ j ∈ S (soft states), where q ij is the state transition probability from s i to s j and S is the total number of states. This is the soft representation of state sequences.
The structure in the recipe is learnt as a sequence of phases and/or states (hard or soft). This is the structural scaffold that we would like to incorporate in the baseline model. In SSiD model, for each step in the recipe, we identify which phase it is in using the clustering model and use the phase sequence to decode state transitions from the FSM. The state sequences are concatenated to the decoder in the hard version and the state transition probabilities are concatenated in the decoder in the soft version at every time step.
At this point, we have 2 dimensions, one is the complexity of the phases (P) and the other is the  complexity of the states in FSM (S). Comprehensive results of searching this space is presented in Table 2. We plan to explore the usage of hidden markov model in place of FSM in future.

Scaffolding Structure in Loss (SSiL):
In addition to imposing structure via SSiD, we explored measuring the deviation of the structure learnt through phase/state sequences from the original structure. This leads to our next model where the deviation of the structure in the generated output from that of the original structure is reflected in the loss. The decoded steps are passed through the clustering model to get phase sequences and then state transition probabilities are decoded from FSM for the generated output. We go a step further to investigate the divergence between the phases of generated and original steps. This can also be viewed as hierarchical multitask learning (Sanh et al., 2018). The first task is to decode each step in the recipe (which uses a cross entropy criterion, L 1 ). The second task uses KL divergence between phase sequences of decoded and original steps to penalize the model (say, L 2 ).When there are τ steps in a recipe, we obtain o(s τ 1 ) and g(s τ 1 ) as the distributions of phases comprising of soft states for the original and generated recipes respectively. We measure the KL divergence(D KL ) between these distributions: Each task optimizes different functions and we minimize the combination of the two losses.
This combined loss is used to penalize the model. Here, α is obtained from KL annealing (Bowman et al., 2015) function that gradually increases the weight of KL term from 0 to 1 during train time.

Results and Discussion
The two dimensions explored in FSM are P and S and exhaustive results are presented in Table  Models Phenomena

Glocal Model
This is a simple recipe for making a delicious chicken salad.
You will need: a butter knife a plate of bread flour a little bit of salt a dash of pepper flakes a couple of tablespoons of olive oil a pinch of sugar.
Add butter evenly on the pan. Put the chicken on the grill and set aside.

SSiD Model
This is a simple recipe for making a delicious and easy dish.
Ingredients: 4 pounds chicken 2 tsp salt, ½ tsp sugar, marinara sauce, mozzarella cheese ( i used provolone ). Tools: a knife, an oven for the chicken, tongs. Mix all ingredients in a bag.
Add butter evenly on the pan. Serve the baked chicken wings and enjoy the evening! -Learnt majority structure (step 1) + Got 'tongs' right because of separate tools mention. -The action of baking is not explicitly mentioned (before 'baked' wings).

SSiL Model
You will need: 5 pounds of chicken wings, ½ cup all purpose flour, ½ tsp salt, 2 tsp of paprika, melted butter, silicon mat, baking pan.
Preheat oven to 450 F. Mix dry ingredients in the dry ziplock bag.
Place a mat on the baking pan and spread butter evenly on it.
Spread the chicken pieces on butter on the baking pan. Bake until crispy for 30 minutes. Serve and enjoy! + Global context of baking maintained in preheating. + Non-repetitive ingredients phase. + Referring expressions (baking pan -> it). -Not mentioned tools (tongs).    (Papineni et al., 2002) is the highest when P is 40 and S is 100. Fixing these values, we compare the models proposed in Table 3. The models with hard phases and hard states are not as stable as the one with soft phases since backprop affects the impact of the scaffolded phases. Upon manual inspection, a key observation is that for SSiD model, most of the recipes followed a similar structure. It seemed to be conditioned on a global structure learnt from all recipes rather than the current input. However, SSiL model seems to generate recipe that is conditioned on the structure of that particular example. Human Evaluation: We have also performed human evaluation by conducting user preference study to compare the baseline with our best performing SSiL model. We randomly sampled generated outputs of 20 recipes and asked 10 users to answer two preferences: (1) overall recipe based on images, (2) structurally coherent recipe. Our SSiL model was preferred 61% and 72.5% for overall and structural preferences respectively. This shows that while there is a viable space to improve structure, generating an edible recipe needs to be explored to improve the overall preference.

Qualitative Analysis:
Figure 3 presents the generated text from the three models with an analysis described below.
Coherence of Referring Expressions: Introducing referring expressions is a key aspect of co-herence (Dale, 2006(Dale, , 1992, as seen in the case of 'baking pan' being referred as 'it' in the SSiL model. Context Maintenance: Maintaining overall context explicitly affects generating each step. This is seen in SSiL model where 'preheating' in the second step is learnt from baking step that appears later although the image does not show an oven. Schema for Procedural Text: Explicit modeling of structure has enabled SSiD and SSiL models to conclude the recipe by generating words like 'serve' and 'enjoy'. Lacking this structure, glocal model talks about 'setting aside' at the end. Precision of Entities and Actions: SSiD model introduces 'sugar' in ingredients after generating 'salt'. A brief manual examination revealed that this co-occurrence is a common phenomenon. SSiL model misses 'tongs' in the first step.

Conclusions
Our main focus in this paper is instilling structure learnt from FSMs in neural models for sequential procedural text generation with multimodal data. We gather a dataset of 16k recipes where each step has text and associated images. We setup a baseline inspired from the best performing model in ViST. We propose two ways of imposing structure from phases and states of a recipe derived from FSM. The first model imposes structure on the decoder and the second model imposes structure on the loss function by modeling it as a hierarchical multi-task learning problem. We show that our proposed approach improves upon the baseline and achieves a METEOR score of 0.31. We plan to explore explicit evaluation of the latent structure learnt. We plan on exploring backpropable variants as a scaffold for structure and also extend the techniques to other how-to domains in future.