Procedural Text Generation from a Photo Sequence

Multimedia procedural texts, such as illustrated instructions and manuals, help people share how-to knowledge. In this paper, we propose a method for generating a procedural text from a given photo sequence, allowing users to obtain a multimedia procedural text. We propose a single embedding space for both images and text that interconnects the two modalities and enables the selection of appropriate words to describe a photo. We implemented our method and tested it on cooking instructions, i.e., recipes. Various experimental results show that our method outperforms standard baselines.


Introduction
A multimedia procedural text, e.g., instruction sentences with photos, helps users learn a new skill. Some web services, such as Cookpad and Instructables, capitalize on this characteristic by allowing users to submit photos or video clips in addition to instruction sentences to explain procedures better. An automatic system that outputs instruction sentences for a given photo sequence would support authors on such services.
In this paper, we propose a method for generating a procedural text from a photo sequence. As shown in Figure 1, given a photo sequence, it outputs a step consisting of one or more instruction sentences for each photo. Among the various kinds of procedural texts, we take the cooking domain as an example because cooking is a daily activity and the recipe is one of the most familiar procedural texts.
Our task may resemble visual storytelling (Huang et al., 2016), which shares the same input. The main difference is, however, that the output of our task is a procedural text that should be concise and concrete, allowing its readers to execute it. In the cooking domain the output, a recipe consisting of multiple sentences, should contain the necessary and sufficient foods, tools, and actions in the correct order.

Figure 1: An overview of our task. The input is a photo sequence (left). The task is to output a step consisting of instruction sentences (right) for each photo.

For this reason procedural text generation has been considered difficult and was initially formulated as a retrieval task (Salvador et al., 2017; Zhu et al., 2019; Chen and Ngo, 2016). Another similar task setting is recipe generation from a photo of the final dish using an ingredient predictor (Salvador et al., 2019). This setting may be, however, very difficult or even impossible because a single photo of the final dish does not contain sufficient information about its production procedure.
Against this background, we focus on procedural text generation from a photo sequence and, as a solution, we propose to incorporate a retrieval method into a generation model. Our method generates a procedural text in two phases. First, given a photo sequence, it retrieves related steps using a joint embedding model that has been pre-trained on a large number of image/step pairs available on the Web. Then it generates word sequences referring to these retrieved steps.
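The retrieval phase amounts to a nearest-neighbor search by cosine similarity in the joint embedding space. A minimal pure-Python sketch (the toy vectors and function names are ours, not the paper's implementation):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve_top_k(photo_vec, step_vecs, k):
    """Return the k step vectors closest to the photo in the joint space."""
    ranked = sorted(step_vecs, key=lambda s: cosine(photo_vec, s), reverse=True)
    return ranked[:k]

photo = [1.0, 0.0]
steps = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.3]]
top2 = retrieve_top_k(photo, steps, 2)
```

In the actual method the candidates are the embedded steps of the entire training set, so an approximate-nearest-neighbor index would replace the exhaustive sort.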
We conducted experiments to evaluate our method against existing methods in BLEU, ROUGE-L, and CIDEr-D. The results showed the effectiveness of the proposed method. However, as is often pointed out, these metrics are imperfect because they ignore the importance of each token. Thus we investigated the ratios of correctly verbalized important terms, i.e., foods, tools, and actions in the recipe case. The result showed that the proposed method verbalizes them more accurately. Qualitative analyses also suggested that the proposed method generates a suitable procedural text for a given photo sequence.

Related Work
Several researchers have tackled the problem of generating a procedural text from various inputs. In the cooking domain, Salvador et al. (2019) tried to generate a recipe from an image of a completed dish. Bosselut et al. (2018) and Kiddon et al. assumed a title and ingredients as the input. It may be, however, almost impossible to generate a good recipe in these settings due to the lack of information on the intermediate states of ingredients. Mori et al. (2014a) generated a procedural text from a meaning representation that takes intermediate states into account. A close look at these studies suggests the importance of information on intermediate processes for a procedural text generator to be practical.
Thus we assume a photo sequence as the input. Since authors of multimedia procedural texts take a photo at least at each important step, this setting is realistic. Sharing the input and output media, the most similar task may be visual storytelling (Huang et al., 2016). Liu et al. (2017) proposed a joint embedding model for images and text to interconnect them. In contrast, we propose to generate sentences directly from the vectors in this shared space.

Proposed Method

Figure 2 shows an overview of our method. (i) We pre-train the joint embedding model using image/text pairs. Then, given a photo sequence, our method repeats the following procedures for each photo: (ii) retrieve the top K nearest steps to the photo in the embedding space, (iii) compute a vector with the encoder from the input photo and the average of the K vectors of the retrieved steps, and (iv) decode the step represented by the photo.

Joint embedding model
First, (i) we train a joint embedding model based on the two branch networks (Wang et al., 2016), which transform representations of different modalities, i.e., text and image, into a common feature space using multi-layer perceptrons with nonlinear activation functions. With the resulting joint embedding model we can calculate the similarity between a step and an image. In a preliminary experiment, the original networks did not achieve good performance because procedural texts contain many omissions (Malmaud et al., 2014). To solve this problem, we propose to insert a bi-directional LSTM (biLSTM) into the textual encoder so that it refers to the preceding and following steps in addition to the current one.
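The core idea of the two branch networks can be sketched as two projections into one common space, where cosine similarity between a step and an image becomes well defined. In this toy sketch each branch is collapsed to a single linear layer with a tanh nonlinearity, and all weights are illustrative:

```python
import math

def branch(x, W):
    """One branch, reduced to a single linear layer plus tanh
    (a toy stand-in for the branch's multi-layer perceptron)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# A 3-dim image feature and a 4-dim text feature are both projected
# into the same 2-dim common space.
W_img = [[0.5, 0.1, 0.0], [0.0, 0.2, 0.3]]
W_txt = [[0.4, 0.0, 0.1, 0.0], [0.0, 0.3, 0.0, 0.2]]
img_vec = branch([1.0, 0.5, 0.2], W_img)
txt_vec = branch([0.8, 0.1, 0.3, 0.6], W_txt)
sim = cosine(img_vec, txt_vec)
```

In the proposed model, the text branch additionally runs a biLSTM over the step sequence before this projection, so each step vector also reflects its neighbors.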

Procedural text generation assisted by vector retrieval
The input is a photo sequence $V = (v_1, v_2, \dots, v_N)$. Each photo $v_n$ is converted into an image embedding vector $\bar{v}_n$ through the image encoder of the joint embedding model. For each photo we execute the following procedures.

Image vector enhancement (ii): We retrieve the top $K$ nearest vectors $R = (r_1, r_2, \dots, r_K)$ among those converted from the steps in the training dataset for the embedding space. Then we calculate their average
$$\bar{r}_n = \frac{1}{K} \sum_{k=1}^{K} r_k \quad (1)$$
and concatenate it to the image embedding vector for the photo to obtain $u_n = (\bar{v}_n, \bar{r}_n)$.

Encoding (iii): We provide the enhanced image embedding vector to a biLSTM:
$$o_n = \mathrm{biLSTM}(u_n). \quad (2)$$

Decoding (iv): We provide an LSTM with the output of the encoder $o_n$ as the initial vector. It decodes by repeatedly outputting a token from the vocabulary, which includes the period, the beginning-of-step symbol ($\langle$step$\rangle$), and the end-of-step symbol ($\langle$/step$\rangle$), to form a step consisting of multiple sentences. We also use the general attention mechanism (Luong et al., 2015), which helps the model generate important terms by receiving feedback from the retrieved step embedding vectors. Based on the hidden vector $h_t$ at the decoding of the $t$-th token and the series of retrieved step embedding vectors $R$, we calculate the attention weight $a_{t,k}$ of the $k$-th step at the $t$-th token decoding as follows:
$$a_{t,k} = \frac{\exp(h_t^\top W_a r_k)}{\sum_{k'=1}^{K} \exp(h_t^\top W_a r_{k'})}, \qquad \tilde{h}_t = \tanh\!\left(W_c \left[\sum_{k=1}^{K} a_{t,k} r_k \, ; \, h_t\right]\right), \quad (3)$$
where $W_a$ and $W_c$ are trainable parameters. The probability distribution of the output tokens $p(y_t \mid y_{<t}, o_n)$ is calculated as follows:
$$p(y_t \mid y_{<t}, o_n) = \mathrm{softmax}(W_o \tilde{h}_t + b_o), \quad (4)$$
where $W_o$ is the weight matrix that transforms the vector $\tilde{h}_t$ to the vocabulary size and $b_o$ is a bias vector. In the test phase the model outputs the token with the highest probability. After decoding a step, the last hidden state of the decoder is used as the initial state of the decoder to get ready to generate the next step.
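The general attention step over the retrieved vectors can be illustrated numerically. The sketch below scores each retrieved step vector against the decoder hidden state, softmax-normalizes the scores, and forms the attention-weighted context; the toy identity matrix stands in for the trainable $W_a$:

```python
import math

def attention(h_t, R, W_a):
    """General (Luong) attention over retrieved step vectors R,
    given the decoder hidden state h_t at the current time step."""
    def score(h, r):
        # h^T W_a r, with W_a a square trainable matrix
        Wr = [sum(w * x for w, x in zip(row, r)) for row in W_a]
        return sum(a * b for a, b in zip(h, Wr))
    scores = [score(h_t, r) for r in R]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Context vector: attention-weighted sum of the retrieved vectors.
    dim = len(R[0])
    context = [sum(w * r[i] for w, r in zip(weights, R)) for i in range(dim)]
    return weights, context

h_t = [1.0, 0.0]
R = [[1.0, 0.0], [0.0, 1.0]]
W_a = [[1.0, 0.0], [0.0, 1.0]]  # toy identity in place of learned weights
weights, context = attention(h_t, R, W_a)
```

A retrieved step aligned with the hidden state receives the larger weight, which is how the mechanism feeds the vocabulary of the retrieved steps back into generation.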
In the training phase, we minimize the sum of the negative log likelihood over all the tokens in the training set:
$$\mathcal{L}(\theta) = -\sum_{D} \sum_{t=1}^{T} \log p(y_t \mid y_{<t}, o_n; \theta),$$
where $D$ is the entire training dataset, $\theta$ denotes all the parameters, and $T$ is the length of the target instruction sentences.
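For intuition, the per-step contribution to this objective is simply the summed negative log of the probabilities the model assigns to the gold tokens (the probabilities below are toy values, not model outputs):

```python
import math

def step_nll(gold_token_probs):
    """Negative log likelihood summed over the tokens of one target step;
    the training loss sums this quantity over all steps in the dataset."""
    return -sum(math.log(p) for p in gold_token_probs)

loss = step_nll([0.9, 0.5, 0.8])  # lower probabilities -> higher loss
```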

Evaluation
In order to evaluate our method, we implemented it and tested it in the cooking domain.

Parameter setting
We employed ResNet-50 (He et al., 2016) as the image encoder. We set the size of the hidden layer of the biLSTM to 1,024; hence the dimension of the output vector is double that (2,048), because the bi-directional output vectors are concatenated.
The training procedure is the same as for the two branch networks (Wang et al., 2016). In our generation model, we set the dimension of the hidden vector to 512 in both the biLSTM encoder and the LSTM decoder. To train the model, we froze the joint embedding weights and optimized all other weights with Adam (Kingma and Ba, 2015) with the initial value α = 0.001. The number of retrieved steps K was set to ten.

Dataset
To prepare the dataset we selected all recipes (in Japanese) from the Cookpad Image Dataset (Harashima et al., 2017) under the condition that an image is attached to every step in a recipe. To obtain reliable results, we extracted recipes of reasonable length (7-10 steps), denoted by D_gen, for the text generation test. We used the rest, D_emb, as the training set for the joint embedding model. The size of D_gen is not sufficient to train the joint embedding model and the generation model jointly, so we trained each model using D_gen and D_emb independently. All tokens appearing fewer than three times were replaced with the unknown word symbol. Table 1 shows statistics of the datasets.

Table 3: Results of overlap metrics for the procedural texts generated by the models and the baselines.
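The rare-token replacement described above can be sketched as follows (the threshold of three is from the text; the function names are ours):

```python
from collections import Counter

def build_vocab(corpus, min_count=3):
    """Keep tokens that appear at least min_count times in the corpus."""
    counts = Counter(tok for sent in corpus for tok in sent)
    return {tok for tok, c in counts.items() if c >= min_count}

def replace_rare(sent, vocab, unk="<unk>"):
    """Map out-of-vocabulary tokens to the unknown word symbol."""
    return [tok if tok in vocab else unk for tok in sent]

corpus = [["cut", "the", "onion"],
          ["cut", "the", "carrot"],
          ["cut", "the", "potato"]]
vocab = build_vocab(corpus)              # only "cut" and "the" occur 3 times
out = replace_rare(["cut", "the", "leek"], vocab)
```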

Effect on the joint embedding space
First we check the effect of the biLSTM insertion.
We calculated the cosine similarity in the common space to rank the relevant steps and relevant images, and measured image2step and step2image retrieval performance by median rank (MedR). Table 2 shows the results on a subset of 1,000 randomly selected step-image pairs from the test set. From this result, we see that the insertion of the biLSTM improves on the original two branch networks by enabling the encoder to refer to the context.
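MedR can be computed directly from a similarity matrix. A sketch, under the common convention that query i's correct match is candidate i (lower MedR is better):

```python
import statistics

def median_rank(sim):
    """sim[i][j] is the similarity of query i to candidate j; the correct
    candidate for query i is assumed to be candidate i. Rank 1 is best."""
    ranks = []
    for i, row in enumerate(sim):
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranks.append(order.index(i) + 1)
    return statistics.median(ranks)

medr = median_rank([[0.9, 0.2, 0.1],
                    [0.3, 0.1, 0.8],
                    [0.2, 0.4, 0.7]])
```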

Results and Discussion
To evaluate our method, we measured overall generation qualities as well as ratios of important terms. We also present some generated examples.

Overlap metrics
To evaluate the proposed method, we calculated BLEU1, BLEU4, ROUGE-L, and CIDEr-D scores over all the recipes in the test set. As baselines, we trained models that output texts using an LSTM from multiple images (Huang et al., 2016) and from the mean word vectors of a title and ingredients, calculated by word2vec (Mikolov et al., 2013). The results in Table 3 show that the proposed method achieves higher performance than the baselines on these metrics.
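For intuition about the simplest of these metrics, BLEU1 is essentially a clipped unigram precision (this sketch omits the brevity penalty and multi-reference handling of full BLEU):

```python
from collections import Counter

def bleu1_precision(candidate, reference):
    """Clipped unigram precision: each candidate token counts only as
    often as it appears in the reference."""
    cand, ref = Counter(candidate), Counter(reference)
    clipped = sum(min(n, ref[tok]) for tok, n in cand.items())
    return clipped / max(1, sum(cand.values()))

p = bleu1_precision("mix the eggs and the sugar".split(),
                    "mix the eggs with sugar".split())
```

BLEU4 applies the same clipped counting to n-grams up to length four, which is why it rewards the correct ordering of instructions, not just the correct words.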

Important term verbalization
Traditional overlap metrics do not measure the verbalization of important terms in the generated procedural text. In the cooking domain, these are foods (F), tools (T), and actions (Ac), as the statistics of the flow graph corpus (Mori et al., 2014b) indicate. Thus we calculated the ratios of correctly verbalized terms in these categories. Although this is more important than the ordinary overlap metrics, synonyms and spelling variants prevent automatic calculation. Therefore we randomly selected 50 generated recipes from the test set and manually counted the numbers of important terms occurring in the generated recipes, in their references, and in both. Table 4 shows the results. We see that top-1 retrieval clearly outperforms the baseline and top-K is far better than top-1 for all the term categories, showing the advantage of our image vector enhancement in procedural text generation.
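The counting itself, once terms have been manually normalized for synonyms and spelling variants, reduces to a set overlap per category (a sketch with hypothetical term sets; the paper's counts were done by hand):

```python
def verbalization_ratio(generated_terms, reference_terms):
    """Fraction of reference important terms (foods, tools, or actions)
    that also occur in the generated recipe."""
    gen, ref = set(generated_terms), set(reference_terms)
    return len(gen & ref) / len(ref) if ref else 0.0

r = verbalization_ratio({"onion", "cut", "pan"},
                        {"onion", "cut", "boil", "pot"})
```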

Qualitative analysis
In Figure 3 we present example sentences generated by the baseline, those generated by the proposed method, and their reference. It can be seen that the proposed method is capable of generating recipes that contain the ingredients actually shown in the photos, while the baseline tends simply to enumerate frequent ingredients from the training set.

Conclusion
In this paper, we proposed a method for generating a procedural text from a photo sequence and tested it in the cooking domain. Our main ideas are (1) a biLSTM to overcome omissions on the text side of the joint embedding space, (2) image vector enhancement by top-K retrieval, and (3) the overall design for procedural text generation from a photo sequence. Various analyses of the experimental results, which are also important contributions of this paper, showed that our method outperforms standard baselines and that each of our ideas contributes to it. The generated sentences correspond to the source photos, allowing us to produce multimedia procedural texts as a natural extension of our method.