Procedural Reasoning Networks for Understanding Multimodal Procedures

This paper addresses the problem of comprehending procedural commonsense knowledge. This is a challenging task as it requires identifying key entities, keeping track of their state changes, and understanding temporal and causal relations. Contrary to most of the previous work, in this study, we do not rely on strong inductive bias and explore the question of how multimodality can be exploited to provide a complementary semantic signal. Towards this end, we introduce a new entity-aware neural comprehension model augmented with external relational memory units. Our model learns to dynamically update entity states in relation to each other while reading the text instructions. Our experimental analysis on the visual reasoning tasks in the recently proposed RecipeQA dataset reveals that our approach improves the accuracy of the previously reported models by a large margin. Moreover, we find that our model learns effective dynamic representations of entities even though we do not use any supervision at the level of entity states.


Introduction
A great deal of commonsense knowledge about the world we live in is procedural in nature and involves steps that show ways to achieve specific goals. Understanding and reasoning about procedural texts (e.g. cooking recipes, how-to guides, scientific processes) is very hard for machines, as it demands modeling the intrinsic dynamics of the procedures (Bosselut et al., 2018; Dalvi et al., 2018; Yagcioglu et al., 2018). That is, one must be aware of the entities present in the text, infer relations among them, and even anticipate changes in the states of the entities after each action. For example, consider the cheeseburger recipe presented in Fig. 1. The instruction "salt and pepper each patty and cook for 2 to 3 minutes on the first side" in Step 5 entails mixing three basic ingredients (the ground beef, salt and pepper) together and then applying heat to the mix, which in turn causes chemical changes that alter both the appearance and the taste. From a natural language understanding perspective, the main difficulty arises when a model sees the word patty again at a later stage of the recipe. It still corresponds to the same entity, but its form is totally different.
Over the past few years, many new datasets and approaches have been proposed that address this inherently hard problem (Bosselut et al., 2018; Dalvi et al., 2018; Tandon et al., 2018; Du et al., 2019). To mitigate the aforementioned challenges, existing works rely mostly on heavy supervision and focus on predicting the individual state changes of entities at each step. Although these models can accurately learn to make local predictions, they may lack global consistency (Tandon et al., 2018; Du et al., 2019), not to mention that building such annotated corpora is very labor-intensive. In this work, we take a different direction and explore the problem from a multimodal standpoint. Our basic motivation, as illustrated in Fig. 1, is that accompanying images provide complementary cues about causal effects and state changes. For instance, it is quite easy to distinguish raw meat from cooked meat in the visual domain.
In particular, we take advantage of the recently proposed RecipeQA dataset (Yagcioglu et al., 2018), a dataset for multimodal comprehension of cooking recipes, and ask whether it is possible to have a model which employs dynamic representations of entities in answering questions that require multimodal understanding of procedures. To this end, inspired by Santoro et al. (2018), we propose Procedural Reasoning Networks (PRN), which incorporate entities into the comprehension process and allow the model to keep track of entities, understand their interactions, and accordingly update their states across time. We report that our proposed approach significantly improves upon previously published results on the visual reasoning tasks in RecipeQA, which test the understanding of causal and temporal relations from images and text. We further show that the dynamic entity representations can capture the semantics of the state information in the corresponding steps.

[Figure 1: An illustrated cheeseburger recipe with a list of ingredients and steps including Form Patties, Season, Toast Buns, Cook, Chop Onions & Tomatoes, and Enjoy.]

Visual Reasoning in RecipeQA
In our study, we particularly focus on the visual reasoning tasks of RecipeQA, namely the visual cloze, visual coherence, and visual ordering tasks, each of which examines a different reasoning skill. We briefly describe these tasks below.
Visual Cloze. In the visual cloze task, the question is formed by a sequence of four images from consecutive steps of a recipe where one of them is replaced by a placeholder. A model should select the correct one from a multiple-choice list of four answer candidates to fill in the missing piece. In that regard, the task inherently requires aligning visual and textual information and understanding temporal relationships between the cooking actions and the entities.
Visual Coherence. The visual coherence task tests the ability to identify the image within a sequence of four images that is inconsistent with the text instructions of a cooking recipe. To succeed in this task, a model should have a clear understanding of the procedure described in the recipe and at the same time connect language and vision.
Visual Ordering. The visual ordering task is about grasping the temporal flow of visual events with the help of the given recipe text. The questions show a set of four images from the recipe and the task is to sort the jumbled images into the correct order. Here, a model needs to infer the temporal relations between the images and align them with the recipe steps.

Procedural Reasoning Networks
In the following, we explain our Procedural Reasoning Networks model. Its architecture is based on a bi-directional attention flow (BiDAF) model (Seo et al., 2017a), with our implementation building on the one publicly available in AllenNLP (Gardner et al., 2018), but it is also equipped with an explicit reasoning module that acts on entity-specific relational memory units. Fig. 2 shows an overview of the network architecture. It consists of five main modules: an input module, an attention module, a reasoning module, a modeling module, and an output module. Note that the question answering tasks we consider here are multimodal in that, while the context is a procedural text, the question and the multiple-choice answers are composed of images.

[Figure 2: Overview of the PRN architecture, illustrated on a pumpkin cheesecake recipe.]

1. Input Module extracts vector representations of inputs at different levels of granularity by using several different encoders.
2. Reasoning Module scans the procedural text and tracks the states of the entities and their relations through a recurrent relational memory core unit (Santoro et al., 2018).
3. Attention Module computes context-aware query vectors and query-aware context vectors as well as query-aware memory vectors.
4. Modeling Module employs two multi-layered RNNs to encode the outputs of the previous layers.
5. Output Module scores a candidate answer from the given multiple-choice list.
At a high level, as the model is reading the cooking recipe, it continually updates the internal memory representations of the entities (ingredients) based on the content of each step; it keeps track of changes in the states of the entities, providing an entity-centric summary of the recipe. The response to a question and a possible answer depends on the representation of the recipe text as well as the last states of the entities. All this happens in a series of implicit relational reasoning steps, and there is no need to explicitly encode the state in terms of a predefined vocabulary.

Input Module
Let the triple (R, Q, A) be a sample input. Here, R denotes the input recipe, which contains textual instructions composed of N words in total. Q represents the question, which consists of a sequence of M images. A denotes an answer that is either a single image or a series of L images, depending on the reasoning task. In particular, for the visual cloze and the visual coherence type questions, the answer contains a single image (L = 1), and for the visual ordering task, it includes a sequence.
We encode the input recipe R at character, word, and step levels. The character-level embedding layer uses a convolutional neural network, namely the Char-CNN model by Kim (2014), which outputs character-level embeddings for each word and alleviates the issue of out-of-vocabulary (OOV) words. In the word embedding layer, we use a pretrained GloVe model (Pennington et al., 2014) and extract word-level embeddings. The concatenation of the character and the word embeddings is then fed to a two-layer highway network (Srivastava et al., 2015) to obtain a contextual embedding for each word in the recipe. This results in the matrix $R \in \mathbb{R}^{2d \times N}$.
On top of these layers, we have another layer that encodes the steps of the recipe individually. Specifically, we obtain a step-level contextual embedding of the input recipe containing T steps as $S = (s_1, s_2, \ldots, s_T)$, where $s_i$ represents the final state of a BiLSTM encoding the i-th step of the recipe, obtained from the character and word-level embeddings of the tokens in the corresponding step.
We represent both the question Q and the answer A in terms of visual embeddings. Here, we employ a pretrained ResNet-50 model (He et al., 2016) trained on the ImageNet dataset (Deng et al., 2009) and represent each image as a real-valued 2048-d vector using features from the penultimate average-pool layer. These embeddings are first passed to a multilayer perceptron (MLP), and its outputs are then fed to a BiLSTM. We then form a matrix $Q \in \mathbb{R}^{2d \times M}$ for the question by concatenating the cell states of the BiLSTM. For the visual ordering task, to represent the sequence of images in the answer with a single vector, we additionally use a BiLSTM and define the answer embedding as the summation of the cell states of the BiLSTM. Finally, for all tasks, these computations produce answer embeddings denoted by $a \in \mathbb{R}^{2d \times 1}$.

Reasoning Module
As mentioned before, comprehending a cooking recipe is mostly about entities (basic ingredients) and actions (cooking activities) described in the recipe instructions. Each action leads to changes in the states of the entities, which usually affect their visual characteristics. A change rarely occurs in isolation; in most cases, the action affects multiple entities at once. Hence, in our reasoning module, we have an explicit memory component implemented with relational memory units (Santoro et al., 2018). This helps us to keep track of the entities, their state changes, and their relations to each other over the course of the recipe (see Fig. 3). As we will examine in more detail in Section 4, it also greatly improves the interpretability of model outputs.
Specifically, we set up the memory with a memory matrix $E \in \mathbb{R}^{d_E \times K}$ by extracting K entities (ingredients) from the first step of the recipe. We initialize each memory cell $e_i$ representing a specific entity by its CharCNN and pretrained GloVe embeddings. From now on, we will use the terms memory cells and entities interchangeably throughout the paper. Since the input recipe is given in the form of a procedural text decomposed into a number of steps, we update the memory cells after each step, reflecting the state changes that happened to the entities. This update procedure is modelled via a relational recurrent neural network (R-RNN), recently proposed by Santoro et al. (2018). It is built on a 2-dimensional LSTM model whose matrix of cell states represents our memory matrix E. Here, each row i of the matrix E refers to a specific entity $e_i$ and is updated after each recipe step t as follows:

$$\phi_{i,t} = \text{R-RNN}(\phi_{i,t-1}, s_t)$$

where $s_t$ denotes the embedding of recipe step t and $\phi_{i,t} = (h_{i,t}, e_{i,t})$ is the cell state of the R-RNN at step t, with $h_{i,t}$ and $e_{i,t}$ being the i-th row of the hidden state of the R-RNN and the dynamic representation of entity $e_i$ at step t, respectively. The R-RNN model exploits a multi-headed self-attention mechanism (Vaswani et al., 2017) that allows memory cells to interact with each other and attend to multiple locations simultaneously during the update phase.
In Fig. 3, we illustrate how this interaction takes place in our relational memory module by considering a sample cooking recipe and by presenting how the attention matrix changes throughout the recipe. In particular, the attention matrix at a specific time shows the attention flow from one entity (memory cell) to another, along with the attention weights to the corresponding recipe step (offset column). The color intensity shows the magnitude of the attention weights. As can be seen from the figure, the internal representations of the entities are actively updated at each step. Moreover, as argued by Santoro et al. (2018), this can be interpreted as a form of relational reasoning, as each update on a specific memory cell is operated in relation to others. Here, we should note that it is often difficult to make sense of these attention weights. However, we observe that the attention matrix changes very gradually near the completion of the recipe.
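To make the memory update more concrete, the following is a minimal sketch of a single attention pass in which every entity attends over all entities plus the current step embedding (the "offset column" of the attention matrix). It is a simplification under stated assumptions: it uses one attention head, no learned projections, and a plain residual update instead of the LSTM-style gating of the actual R-RNN. The function name `attend_memory` and the toy vectors are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend_memory(memory, step_embedding):
    """One simplified attention pass over the entity memory.

    memory: list of K entity vectors, each of length d.
    step_embedding: vector s_t of length d for the current recipe step.
    Returns (new_memory, attention); attention[i] holds K + 1 weights,
    one per entity plus a final weight on the step embedding.
    """
    d = len(step_embedding)
    keys = memory + [step_embedding]  # entities followed by the step column
    new_memory, attention = [], []
    for query in memory:
        scores = [dot(query, k) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        attended = [sum(w * k[j] for w, k in zip(weights, keys)) for j in range(d)]
        # Assumption: residual update instead of learned gates.
        new_memory.append([q + a for q, a in zip(query, attended)])
        attention.append(weights)
    return new_memory, attention

memory = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # three toy entities
steps = [[1.0, 1.0], [0.0, 2.0]]               # two toy step embeddings
for s_t in steps:
    memory, attention = attend_memory(memory, s_t)
```

Each row of `attention` corresponds to one row of the attention matrices visualized in Fig. 3: the first K weights describe entity-to-entity attention and the last weight describes attention to the recipe step.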

Attention Module
The attention module is in charge of linking the question with the recipe text and the entities present in the recipe. It takes the matrices Q and R from the input module, and E from the reasoning module, and constructs the question-aware recipe representation G and the question-aware entity representation Y. Following the attention flow mechanism described in (Seo et al., 2017a), we specifically calculate attentions in four different directions: (1) from question to recipe, (2) from recipe to question, (3) from question to entities, and (4) from entities to question.

[Figure 3: A sample cooking recipe (steps include "Slicin', Dicin'...", "Ready the Bottle", and "Put the Piggies to Sleep") together with the attention matrices of the relational memory after each step.]
The first two of these attentions require computing a shared affinity matrix $S^R \in \mathbb{R}^{N \times M}$, with $S^R_{i,j}$ indicating the similarity between the i-th recipe word and the j-th image in the question, estimated by

$$S^R_{i,j} = w_R^\top [R_i; Q_j; R_i \circ Q_j]$$

where $w_R$ is a trainable weight vector, and $\circ$ and $[;]$ denote elementwise multiplication and concatenation operations, respectively. Recipe-to-question attention determines the images within the question that are most relevant to each word of the recipe. Let $\tilde{Q} \in \mathbb{R}^{2d \times N}$ represent the recipe-to-question attention matrix, with its i-th column given by $\tilde{Q}_i = \sum_j a_{ij} Q_j$, where the attention weight is computed by $a_i = \text{softmax}(S^R_i) \in \mathbb{R}^M$. Question-to-recipe attention signifies the words within the recipe that have the closest similarity to each image in the question, and constructs an attended recipe vector given by $\tilde{r} = \sum_i b_i R_i$, with the attention weight calculated by $b = \text{softmax}(\max_{col}(S^R)) \in \mathbb{R}^N$, where $\max_{col}$ denotes the maximum function across the column. The question-to-recipe matrix is then obtained by replicating $\tilde{r}$ N times across the column, giving $\tilde{R} \in \mathbb{R}^{2d \times N}$.
Then, we construct the question-aware representation of the input recipe, $G \in \mathbb{R}^{8d \times N}$, with its i-th column $G_i \in \mathbb{R}^{8d}$ denoting the final embedding of the i-th word, given by

$$G_i = [R_i; \tilde{Q}_i; R_i \circ \tilde{Q}_i; R_i \circ \tilde{R}_i]$$

Attentions from question to entities, and from entities to question, are computed in a way similar to the ones described above. The only difference is that a different shared affinity matrix is computed between the memory encoding the entities E and the question Q. These attentions are then used to construct the question-aware representation of entities, denoted by Y, which links and integrates the images in the question and the entities in the input recipe.

Modeling Module
The modeling module takes the question-aware representations of the recipe G and the entities Y, and forms their combined vector representation. For this purpose, we first use a two-layer BiLSTM to read the question-aware recipe G and to encode the interactions among the words conditioned on the question. For each direction of the BiLSTM, we use its hidden state after reading the last token as its output. In the end, we obtain a vector embedding $c \in \mathbb{R}^{2d \times 1}$. Similarly, we employ a second BiLSTM, this time over the entities Y, which results in another vector embedding $f \in \mathbb{R}^{2d_E \times 1}$. Finally, these vector representations are concatenated and then projected to a fixed-size representation using $o = \varphi_o([c; f]) \in \mathbb{R}^{2d \times 1}$, where $\varphi_o$ is a multilayer perceptron with tanh activation function.

Output Module
The output module takes the output of the modeling module, which encodes the question-aware recipe G and the entities Y, together with the embedding of the answer A, and returns a similarity score which is used while determining the correct answer. Among all the candidate answers, the one having the highest similarity score is chosen as the correct answer. To train our proposed procedural reasoning network, we employ a hinge ranking loss (Collobert et al., 2011), similar to the one used in (Yagcioglu et al., 2018), given below.

$$\mathcal{L} = \max\{0,\ \gamma - \mathrm{sim}(o, a_+) + \mathrm{sim}(o, a_-)\}$$

where $\mathrm{sim}(\cdot, \cdot)$ is the similarity score returned by the output module, γ is the margin parameter, and $a_+$ and $a_-$ are the correct and the incorrect answers, respectively.
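As a small illustration of the scoring and training objective, the sketch below computes a hinge ranking loss and picks the highest-scoring candidate. It assumes cosine similarity as the similarity score and a margin of 0.1; both are assumptions made for this example, as are the function names `hinge_ranking_loss` and `pick_answer`.

```python
import math

def cosine(u, v):
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def hinge_ranking_loss(o, a_pos, a_neg, margin=0.1):
    """Zero once the correct answer outscores the incorrect one by at least `margin`."""
    return max(0.0, margin - cosine(o, a_pos) + cosine(o, a_neg))

def pick_answer(o, candidates):
    """Index of the candidate answer embedding most similar to o."""
    scores = [cosine(o, a) for a in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)

o = [1.0, 0.0]                      # toy output of the modeling module
good, bad = [1.0, 0.0], [0.0, 1.0]  # toy answer embeddings
loss_when_right = hinge_ranking_loss(o, good, bad)
loss_when_wrong = hinge_ranking_loss(o, bad, good)
choice = pick_answer(o, [bad, good, [-1.0, 0.0]])
```

The loss only penalizes a training example when the incorrect answer comes within the margin of the correct one, which is the behavior a ranking objective is meant to encourage.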

Experiments
In this section, we describe our experimental setup and then analyze the results of the proposed Procedural Reasoning Networks (PRN) model.

Entity Extraction
Given a recipe, we automatically extract the entities from the initial step of the recipe by using a dictionary of ingredients. While determining the ingredients, we exploit the Recipe1M (Marin et al., 2018) and Kaggle Whats Cooking Recipes (Yummly, 2015) datasets, and form our dictionary using the most commonly used ingredients in the training set of RecipeQA. For the cases when no entity can be extracted from the recipe automatically (20 recipes in total), we manually annotate those recipes with the related entities.
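Dictionary-based extraction of this kind can be sketched as follows. This is only an illustration under assumptions: the matching rule (case-insensitive whole-word match with an optional plural "s"), the function name `extract_entities`, and the tiny dictionary are hypothetical and not taken from the paper.

```python
import re

def extract_entities(first_step_text, ingredient_dictionary):
    """Return dictionary ingredients found in the text, ordered by first occurrence."""
    text = first_step_text.lower()
    hits = []
    for ingredient in ingredient_dictionary:
        # Whole-word match, allowing a simple plural form.
        pattern = r"\b" + re.escape(ingredient.lower()) + r"s?\b"
        match = re.search(pattern, text)
        if match:
            hits.append((match.start(), ingredient))
    return [ingredient for _, ingredient in sorted(hits)]

dictionary = ["onion", "tomato", "milk", "ground beef", "salt", "pepper"]
step_one = "1 hamburger bun, 4 oz. ground beef, salt and pepper, 1 large tomato, 1 whole onion"
entities = extract_entities(step_one, dictionary)
```

An empty result from such a function corresponds to the cases that, according to the text, were annotated manually.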

Training Details
In our experiments, we trained separate models on each task, and we also investigated multi-task learning, where a single model is trained to solve all these tasks at once. In total, the PRN architecture consists of ∼12M trainable parameters. We implemented our models in PyTorch (Paszke et al., 2017) using the AllenNLP library (Gardner et al., 2018). We used the Adam optimizer with a learning rate of 1e-4 and an early stopping criterion with the patience set to 10, meaning that training ends if the performance does not improve for 10 iterations. We used a batch size of 32 due to our hardware constraints. In the multi-task setting, batches are sampled round-robin from all tasks, where each batch is solely composed of examples from one task. We performed our experiments on a system containing four NVIDIA GTX-1080Ti GPUs, and training a single model took around 2 hours. We employed the same hyperparameters for all the baseline systems. We plan to share our code and model implementation after the review process.
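The round-robin batch schedule and the patience-based stopping rule can be sketched as below. This is a hedged illustration rather than the training code: the helper names `round_robin_batches` and `should_stop` are hypothetical, and the choice to skip a task once its batches are exhausted is an assumption not stated in the text.

```python
def round_robin_batches(task_batches):
    """Yield (task, batch) pairs, cycling over tasks; each batch comes from one task only.

    task_batches: dict mapping a task name to its list of batches.
    Assumption: a task is dropped from the cycle once it runs out of batches.
    """
    iterators = {task: iter(batches) for task, batches in task_batches.items()}
    while iterators:
        for task in list(iterators):
            try:
                yield task, next(iterators[task])
            except StopIteration:
                del iterators[task]

def should_stop(val_scores, patience=10):
    """True when the last `patience` evaluations did not beat the best earlier score."""
    if len(val_scores) <= patience:
        return False
    best_before = max(val_scores[:-patience])
    return max(val_scores[-patience:]) <= best_before

schedule = list(round_robin_batches({
    "visual_cloze": ["c1", "c2"],
    "visual_coherence": ["h1"],
    "visual_ordering": ["o1", "o2"],
}))
batch_order = [batch for _, batch in schedule]
```

In this toy schedule the three tasks alternate until the shortest one is exhausted, after which the remaining tasks continue to take turns.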

Baselines
We compare our model with several baseline models as described below. We note that the results of the first two were previously reported in (Yagcioglu et al., 2018).
Hasty Student (Yagcioglu et al., 2018) is a simple heuristics-based model which ignores the recipe and gives an answer by examining only the question and the answer set using distances in the visual feature space.
Impatient Reader (Hermann et al., 2015) is a simple neural model that takes its name from the fact that it repeatedly computes attention over the recipe after observing each image in the query.
BiDAF (Seo et al., 2017a) is a strong reading comprehension model that employs a bi-directional attention flow mechanism to obtain a question-aware representation and bases its predictions on this representation. Originally, it is a span-selection model over the input context. Here, we adapt it to work in a multimodal setting and answer multiple-choice questions instead.
BiDAF w/ static memory is an extended version of the BiDAF model which resembles our proposed PRN model in that it includes a memory unit for the entities. However, it does not make any updates on the memory cells. That is, it uses the static entity embeddings initialized with GloVe word vectors. We propose this baseline to test the significance of the relational memory updates.

Results
Table 1 presents the quantitative results for the visual reasoning tasks in RecipeQA. In the single-task training setting, PRN gives state-of-the-art results compared to other neural models. Moreover, it achieves the best performance on average. These results demonstrate the importance of having a dynamic memory and keeping track of entities extracted from the recipe. In the multi-task training setting, where a single model is trained to solve all the tasks at once, PRN and BiDAF w/ static memory perform comparably and give much better results than BiDAF. Note that the model performances in the multi-task training setting are worse than the single-task performances. We believe that this is due to the nature of the tasks, in that some are more difficult than others. We think that the performance could be improved by employing a carefully selected curriculum strategy (McCann et al., 2018).

[Figure 4: Visualization of the learned entity embedding space. Zoomed-in regions show step-specific entity states (labels of the form "Step: 4, Entity: water") from recipes such as Vanilla-Apricot Shortbread Cookies, Toffee Bottomed Brownies, Cherry Almond Torrone (Italian Nougat), Apple Pie, Henderson's Sauce, Absolutely Amazing Cream of Celery Soup, Miniature Doughnut Coconut Creatures, Mango Mint Ice Tea, and Creme Brulee Recipe.]
In Fig. 4, we illustrate the entity embedding space by projecting the learned embeddings from the step-by-step memory snapshots from the 200-d vector space to 3-d space with t-SNE. Color codes denote the categories of the cooking recipes. As can be seen, these step-aware embeddings show clear clustering of these categories. Moreover, within each cluster, the entities are grouped together in terms of their state characteristics. For instance, in the zoomed parts of the figure, chopped and sliced, or stirred and whisked entities are placed close to each other. Here, we show that the learned embeddings from the memory snapshots can effectively capture the contextual information about the entities at each time point in the corresponding step while taking the recipe data into account. Moreover, simple arithmetic on these embeddings (Fig. 5) suggests that the proposed model can successfully capture the semantics of each entity's state in the corresponding step (see footnote 7).
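The entity arithmetic referred to above can be illustrated with a short sketch: the difference between two states of one entity is added to another entity, and candidates are ranked by cosine similarity to the result. The paper reports using Gensim for this; the plain-Python version below, the function name `next_states`, and the 2-d toy vectors are assumptions made purely for illustration.

```python
import math

def cosine(u, v):
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def next_states(before, after, query, candidates, top_k=3):
    """Rank candidate entity states by similarity to query + (after - before)."""
    delta = [a - b for a, b in zip(after, before)]
    target = [q + d for q, d in zip(query, delta)]
    ranked = sorted(candidates, key=lambda name: cosine(target, candidates[name]), reverse=True)
    return ranked[:top_k]

# Toy embeddings (made up): the second coordinate loosely encodes "chopped".
onions_raw, onions_chopped = [1.0, 0.0], [1.0, 1.0]
tomatoes_raw = [0.9, 0.1]
candidates = {
    "tomatoes (chopped)": [0.8, 1.0],
    "tomatoes (raw)": [1.0, 0.0],
    "milk (poured)": [-1.0, 0.2],
}
top = next_states(onions_raw, onions_chopped, tomatoes_raw, candidates, top_k=1)
```

Here the difference vector stands in for the chopping action, so adding it to the raw tomatoes moves the query toward the chopped tomatoes candidate.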

Related Work
In recent years, tracking entities and their state changes has been explored in the literature from a variety of perspectives. In an early work, Henaff et al. (2017) proposed a dynamic memory based network which updates entity states using a gating mechanism while reading the text. Bansal et al. (2017) presented a more structured memory-augmented model which employs memory slots for representing both entities and their relations. Pavez et al. (2018) suggested a conceptually similar model in which the pairwise relations between attended memories are utilized to encode the world

Footnote 7: We used Gensim for calculating entity arithmetics using cosine distances between entity embeddings.
[Figure 5 content: two analogy queries over step-aware entity embeddings, each shown with the corresponding recipe step text.
(1) onions (Flowerpot Chicken, Step 1) : onions (Flowerpot Chicken, Step 3) :: tomatoes (Flowerpot Chicken, Step 1) : ? Top three results: tomatoes (Chilli Con Carne, Step 1), tomatoes (Seven Layer Seven Grain Bread, Step 1), tomatoes (How to Make Chicken Cacciatore, Step 1).
(2) water (Caribbean Curried Goat, Step 1) : water (Caribbean Curried Goat, Step 4) :: milk (Birdcage-BQ, Step 1) : ? Top three results: milk (Potato Soup for One, Step 3), milk (Family Size Lasagne, Step 2), milk (Potato Soup, Step 1).]
Figure 5: Step-aware entity representations can be used to discover the changes that occurred in the states of the ingredients between two different recipe steps. The difference vector between two entities can then be added to other entities to find their next states. For instance, in the first example, the difference vector encodes the chopping action done on onions. In the second example, it encodes the pouring action done on the water. When these vectors are added to the representations of raw tomatoes and milk, the three most likely next states capture the semantics of state changes in an accurate manner.
state. The main difference between our approach and these works is that, by utilizing relational memory core units, we also allow memories to interact with each other during each update. Perez and Liu (2017) showed that similar ideas can be used to compile supporting memories in tracking dialogue state. Wang et al. (2017) showed the importance of coreference signals for the reading comprehension task. More recently, Dhingra et al. (2018) introduced a specialized recurrent layer that uses coreference annotations to improve reading comprehension. For language modeling, Ji et al. (2017) proposed a model that explicitly incorporates entities while dynamically updating their representations, and applied it to a variety of tasks such as language modeling, coreference resolution, and entity prediction.
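To make the idea of memories interacting during an update concrete, the following is a minimal sketch in plain Python, not the PRN implementation. It assumes a single attention head and omits the learned query, key, and value projections, the MLP, and the gating that a full relational memory core would use. The function name `attend_memories` and the toy vectors are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend_memories(memory, step_input):
    """One simplified update: each memory slot attends over all slots
    plus the new step input, then adds the attended mix as a residual.
    Assumption: no learned projections, one head, no gating."""
    keys = memory + [step_input]
    scale = math.sqrt(len(step_input))
    updated = []
    for slot in memory:
        # Scaled dot-product scores between this slot and every key.
        scores = [sum(q * k for q, k in zip(slot, key)) / scale for key in keys]
        weights = softmax(scores)
        # Weighted average of the keys, dimension by dimension.
        mixed = [sum(w * key[d] for w, key in zip(weights, keys))
                 for d in range(len(slot))]
        updated.append([s + m for s, m in zip(slot, mixed)])
    return updated

# Toy example: two entity memories and one encoded recipe step.
memory = [[1.0, 0.0], [0.0, 1.0]]
step_input = [0.5, 0.5]
updated = attend_memories(memory, step_input)
```

Because every slot attends over all other slots, changing one entity's memory changes the update of the others, which is the interaction referred to above.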
Our work builds upon and contributes to the growing literature on tracking state changes in procedural text. Bosselut et al. (2018) presented a neural model that can learn to explicitly predict state changes of ingredients at different points in a cooking recipe. Dalvi et al. (2018) proposed another entity-aware model to track entity states in scientific processes. Tandon et al. (2018) demonstrated that the prediction quality can be boosted by including hard and soft constraints to eliminate unlikely state changes or favor probable ones. In a follow-up work, Du et al. (2019) exploited the notion of label consistency in training to enforce similar predictions in similar procedural contexts. Das et al. (2019) proposed a model that dynamically constructs a knowledge graph while reading the procedural text to track the ever-changing entity states. As discussed in the introduction, however, these previous methods use a strong inductive bias and assume that state labels are present during training. In our study, we deliberately focus on unlabeled procedural data and ask the question: can multimodality help to identify state changes and provide insights into understanding them?

Conclusion
We have presented a new neural architecture called Procedural Reasoning Networks (PRN) for multimodal understanding of step-by-step instructions. Our proposed model is based on the successful BiDAF framework but is also equipped with an explicit memory unit that provides an implicit mechanism to keep track of the changes in the states of the entities over the course of the procedure. Our experimental analysis on visual reasoning tasks in the RecipeQA dataset shows that the model significantly improves on the results of the previous models, indicating that it better understands the procedural text and the accompanying images. Additionally, we carefully analyze our results and find that our approach learns meaningful dynamic representations of entities without any entity-level supervision. Although we achieve state-of-the-art results on RecipeQA, there is clearly still room for improvement compared to human performance. We also believe that the PRN architecture will be of value to other visual and textual sequential reasoning tasks.

Figure 1: A recipe for preparing a cheeseburger (adapted from the cooking instructions available at https://www.instructables.com/id/In-N-Out-Double-Double-Cheeseburger-Copycat). Each basic ingredient (entity) is highlighted by a different color in the text and with bounding boxes on the accompanying images. Over the course of the recipe instructions, ingredients interact with each other and change their states with each cooking action (underlined in the text), which in turn alters the visual and physical properties of the entities. For instance, the tomato changes its form by being sliced up and then stacked on a hamburger bun.

Figure 2: An illustration of our Procedural Reasoning Networks (PRN). For a sample question from the visual coherence task in RecipeQA, while reading the cooking recipe, the model constantly performs updates on the representations of the entities (ingredients) after each step and makes use of their representations along with the whole recipe when it scores a candidate answer. Please refer to the main text for more details.

Figure 3: Sample visualizations of the self-attention weights demonstrating both the interactions among the ingredients and between the ingredients and the textual instructions throughout the steps of a sample cooking recipe from RecipeQA (darker colors imply higher attention weights). The attention maps do not change much after the third step as the steps after that mostly provide some redundant information about the completed recipe.

Figure 4: t-SNE visualizations of learned embeddings from each memory snapshot mapping to each entity and their corresponding states from each step for the visual cloze task.

Fig. 5 demonstrates entity arithmetic using the learned embeddings from each entity step.
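The entity arithmetic described for Fig. 5 can be sketched as follows. This is an illustrative example under stated assumptions, not the authors' code: the three-dimensional embeddings, the entity keys such as `onion@raw`, and the helper `next_states` are all hypothetical, and cosine similarity is assumed as the ranking measure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def next_states(before, after, query, candidates, k=3):
    """Add the (after - before) state-change vector to `query` and
    return the k candidate keys closest to the resulting vector."""
    target = [q + (a - b) for q, a, b in zip(query, after, before)]
    ranked = sorted(candidates,
                    key=lambda name: cosine(target, candidates[name]),
                    reverse=True)
    return ranked[:k]

# Hypothetical toy embeddings; the last axis loosely encodes "chopped".
embeddings = {
    "onion@raw":      [1.0, 0.0, 0.0],
    "onion@chopped":  [1.0, 0.0, 1.0],
    "tomato@raw":     [0.0, 1.0, 0.0],
    "tomato@chopped": [0.0, 1.0, 1.0],
}
# Exclude the query itself from the search space.
candidates = {k: v for k, v in embeddings.items() if k != "tomato@raw"}

result = next_states(embeddings["onion@raw"], embeddings["onion@chopped"],
                     embeddings["tomato@raw"], candidates, k=1)
```

In this toy setup the chopping offset learned from the onion carries over to the tomato, so `result` contains `tomato@chopped`. With real step-aware embeddings, the candidates would be the entity representations collected from every recipe step.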


Table 1: Quantitative comparison of the proposed PRN model against the baselines.
* Taken from the RecipeQA project website, based on 100 questions sampled randomly from the validation set.