Routing Enforced Generative Model for Recipe Generation

One of the most challenging parts of recipe generation is dealing with the complex restrictions among the input ingredients. Previous research simplifies the problem by treating the inputs independently and generating recipes that contain as much of the input information as possible. In this work, we propose a routing method that addresses content selection under these internal restrictions. The routing enforced generative model (RGM) generates appropriate recipes according to the given ingredients and user preferences. Our model yields new state-of-the-art results on the recipe generation task, with significant improvements on BLEU, F1 and human evaluation.


Introduction
Food is a critical contributor to physical well-being, a major source of pleasure, worry and stress, a major occupant of waking time, and, across the world, the single greatest category of expenditures for human beings (Rozin et al., 1999). Recipes are a specific genre of instructional language that teaches people how to prepare delicious food. They have been gaining interest in recent research, as recipes contain immensely rich information about the real world (Yagcioglu et al., 2018).
Previous efforts in computational recipe studies follow two lines for obtaining cooking recipes for users: recipe retrieval and recipe generation. Recipe retrieval (Chen and Ngo, 2016; Min et al., 2017) matches the entities from the given dish pictures or text inputs to find the corresponding recipes. Provided with ingredients (Yang et al., 2017), recipe titles (Kiddon et al., 2016), or dish photos (Salvador et al., 2019), recipe generation models introduce additional mechanisms to ensure that the generated recipes contain as many of the given ingredients as possible. In previous studies, the target recipes are composed of exactly the given ingredients. In practice, however, people usually have a number of ingredients at hand and do not know what to cook. They have difficulty choosing an appropriate set of ingredients, and it can be hard for them to provide an accurate recipe title to the models (Kiddon et al., 2016; Majumder et al., 2019). Moreover, users may have preferences for certain ingredients (e.g. "olive oil") or categories (e.g. "Low Sugar"). As Figure 1 shows, given the same ingredient list, different subsets of ingredients can contribute to recipes that meet different user demands. Previous research has not discussed this common everyday scenario. There is a clear need to find suitable cooking recipes that match user demands. As increasing the variety and creativity of daily dishes can promote happiness, in this work we obtain the desired recipes through generation.
* The two authors contributed equally to this paper. Contribution was done at Peking University.
Our task is generally defined as follows: given objective background information $I$ and subjective semantic constraints $C$ as input, the model automatically grounds the inputs to the text output $Y$.
Such application contexts are common for many specific tasks. For recipe generation, we define $I$ as the set of ingredients (e.g. "beef", "olive oil") the user has. $C = \{C_I, C_D\}$ denotes the user preferences on some ingredients ($C_I \subseteq I$) and the category demands ($C_D \subseteq D$, where $D$ is the set of dish categories, e.g. "Low Fat", "Low Sugar"). The generated $Y$ is the text of a recipe that satisfies the user preferences with the given ingredients. There are two significant challenges: how to select an appropriate set of ingredients to satisfy user demands, and how to generate recipes accordingly.
We propose a novel approach to solve these problems. Inspired by Sabour et al. (2017), we propose a selective routing algorithm that clusters the given ingredients into five categories (Low Sugar, High Fiber, Low Fat, Grilling and Frying) and produces the category vectors. The length of a category vector represents the probability that the generated $Y$ belongs to that category. We augment the attention mechanism to capture ingredient information according to the routing weights between ingredients and categories. The decoder then generates words in sequence. We use both human and automatic metrics to evaluate the generated recipes. Experimental results demonstrate the efficacy of our approach. To summarize, the contributions of our work are as follows:
• To our knowledge, our work is the first endeavor to take ingredient selection into consideration in the recipe generation process. We propose a novel algorithm that calculates ingredient collocation weights to enforce the recipe generation model.
• Given ingredients with noise, our model can satisfy personalized user demands by taking ingredient and category constraints into consideration.
• Our approach yields significant improvements on both automatic and human evaluation.

Related Work
Recipes have been gaining interest in recent research, including recipe processing (Mori et al., 2012, 2014; Bosselut et al., 2017), recipe parsing (Malmaud et al., 2014; Jermsurawong and Habash, 2015), recipe retrieval (Chen and Ngo, 2016; Min et al., 2017), regional cuisine style transformation (Kazama et al., 2018), recipe QA (Yagcioglu et al., 2018) and recipe generation (Kiddon et al., 2016; Yang et al., 2017). Among these lines of recipe research, our study is closest to recipe generation. Kiddon et al. (2016) define the task as generating a complete recipe given a goal (recipe title) and an agenda (ingredient list). They present the neural checklist model to improve the semantic coverage of the agenda in the generated texts. Yang et al. (2017) develop a language model that treats reference as an explicit stochastic latent variable and creates mentions of entities together with their attributes by accessing external databases. Majumder et al. (2019) take historical user preference records into consideration. They generate personalized recipes from incomplete input specifications (name and incomplete ingredient details). Unlike the existing recipe generation methods, which focus on covering all of the given ingredients, we extend the generation task with a selection over the given items. Selective generation is the task of producing a natural language description for a salient subset of a rich set of records (Mei et al., 2016). A lot of attention has been paid to the individual content selection and selective realization sub-problems (Barzilay and Lee, 2004; Barzilay and Lapata, 2005; Liang et al., 2009). Other works (Chen and Mooney, 2008; Chen et al., 2010; Mei et al., 2016) explore full selective generation and learn alignments between generated texts and input data using a translation model.
We find an appropriate set of ingredients with a selective routing algorithm, and our model can generate texts according to user constraints on both ingredients and categories. The core inspiration for our routing module comes from the following works. Hinton et al. (2011) propose transformation matrices that learn to encode the intrinsic spatial relationship between a part and a whole. Later, Sabour et al. (2017) propose an iterative routing-by-agreement mechanism to learn the intrinsic relationship between two layers. Hinton et al. (2018) propose a new iterative routing procedure based on the EM algorithm. Inspired by this work, Yang et al. (2018) are the first to investigate the performance of dynamic routing on text classification. They propose three strategies (orphan category, leaky-softmax, coefficient amendment) to stabilize the dynamic routing process and reduce the disturbance caused by noise.

Our Approach
The basic structure of our model is shown in Figure 2. Generally speaking, we take two steps to achieve recipe generation:
Select with Routing: In this part, our task is to select an appropriate set of ingredients (a soft selection, expressed as a weight distribution) to support a dish for each category under the given constraints. The procedure takes the ingredients $I$ and the constraints $C = \{C_I \subseteq I, C_D \subseteq D\}$ as inputs. We propose a selective routing algorithm to cluster the ingredients $I$ into the different dish categories $D$. Assuming that $I$ and $D$ contain $n$ and $m$ items respectively, the output is $O \in \mathbb{R}^{n \times m}$. We define $O$ as the routing weights, where $o_{i,j}$ stands for the importance of ingredient $i$ in category $j$.
Generate with Attention: After the content selection, we choose a proper category $d$ and use the routing weights $o_{*,d}$ to help with the attention-based generation process, which produces the recipe $Y$. The generative module is an improved encoder-decoder framework with a hierarchical attention mechanism.
In the following sections, we will give more details about the model design and training objective.

Routing Module (RM)
The routing module is designed to find routing weights that contribute to the generation process. The process is shown in Algorithm 1. We first apply an LSTM network as the encoder to obtain the representation of the $i$-th ingredient, $h^{(i)} \in \mathbb{R}^{z}$ ($z$ is the size of the hidden vectors). $h^{(i)}$ is the average of the encoder hidden states $H^{(i)} \in \mathbb{R}^{n_i \times z}$ ($n_i$ is the number of words in the $i$-th ingredient). We then obtain the corresponding routing vectors $U \in \mathbb{R}^{n \times z}$ from the ingredient representations, where $u_{i,*} = h^{(i)} M$ and $M \in \mathbb{R}^{z \times z}$ is a mapping matrix that maps the ingredient semantic information to the routing space. Given the routing vectors $U = \{u_{1,*}, u_{2,*}, \ldots, u_{n,*}\}$ and the routing iteration number $r$, we use the selective routing algorithm $f_R$ to obtain the routing weights $O \in \mathbb{R}^{n \times m}$ and the category vectors $V \in \mathbb{R}^{m \times z}$.
We define the coupling coefficients as $B \in \mathbb{R}^{n \times m}$. The initial values $b_{i,j}$ are the log prior probabilities that ingredient $i$ should be coupled to category $j$. We then calculate the routing weights $o_{i,j}$ by Eq 3. Inspired by Sabour et al. (2017), we use the length of the category vector $v_{j,*}$ to represent the probability that an appropriate recipe of category $j$ exists. To get the category vector, we apply a non-linear squash function to the weighted sum $s_{j,*}$ by Eq 4. In this way, short vectors are shrunk to almost zero length and long vectors are shrunk to a length slightly below 1 (Sabour et al., 2017). In the training phase (detailed in Section 4.3), the category vectors and routing weights are used to calculate the loss for the routing module. When making predictions with user constraints $C$, we set the prior probabilities of the desired ingredients $C_I$ to $\alpha$ to emphasize the ingredient preferences. This presetting leads the routing to converge to a desired category. We also mask the vectors of the undesired categories ($D \setminus C_D$) in each iteration. If the preferred category is not specified in $C_D$, we extract the routing weights $o_{*,\hat{j}}$ as the routing weights for the generation model, where $\hat{j} = \arg\max_j \|v_{j,*}\|$ denotes the most probable category of the target recipe.
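The routing loop described above can be sketched roughly as follows. This is a minimal illustration, not a reproduction of Algorithm 1: it follows the routing-by-agreement scheme and squash function of Sabour et al. (2017), and several details are assumptions made here for the sake of a runnable example. In particular, the function name `selective_routing`, the normalization of the routing weights over ingredients within each category, and the choice of writing the preference boost as `log(alpha)` into the coupling coefficients are all hypothetical.

```python
import numpy as np

def squash(s, eps=1e-9):
    # Squash of Sabour et al. (2017): keeps the direction of s and maps
    # its length into [0, 1), so the length can be read as a probability.
    sq_norm = np.sum(s * s, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(x, axis):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def selective_routing(U, log_prior, r=3, preferred=(), allowed=None, alpha=100.0):
    """U: (n, z) routing vectors; log_prior: (n, m) initial coupling coefficients.
    Returns routing weights O (n, m) and category vectors V (m, z)."""
    n, m = log_prior.shape
    B = log_prior.astype(float).copy()
    if len(preferred) > 0:
        # Assumption: preferred ingredients get a prior of alpha, stored in log space.
        B[list(preferred), :] = np.log(alpha)
    mask = np.ones(m)
    if allowed is not None:
        # Categories outside the demanded set are masked in every iteration.
        mask = np.zeros(m)
        mask[list(allowed)] = 1.0
    for step in range(r):
        O = softmax(B, axis=0)               # assumption: normalize over ingredients
        V = squash(O.T @ U) * mask[:, None]  # weighted sum, squash, then mask
        if step < r - 1:
            B = B + U @ V.T                  # agreement update between u_i and v_j
    return O, V

# Toy example: 5 ingredients, 3 categories, hidden size 4.
rng = np.random.default_rng(0)
U = 0.1 * rng.standard_normal((5, 4))
log_prior = 0.1 * rng.standard_normal((5, 3))
O_free, V_free = selective_routing(U, log_prior)
O_pref, V_pref = selective_routing(U, log_prior, preferred=[0], allowed=[1])
j_hat = int(np.argmax(np.linalg.norm(V_pref, axis=1)))  # most probable category
```

In this toy run, boosting ingredient 0 raises its routing weight in every category, and masking leaves category 1 as the only candidate for generation.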

Routing Enforced Generative Model (RGM)
The generation module shares the same LSTM encoder with RM. We augment the attention mechanism (Luong et al., 2015) to capture relevant ingredient information that helps predict the current target word. The alignments $a_i \in \mathbb{R}^{n_i}$ between the last target hidden state $h_{t-1} \in \mathbb{R}^{z}$ and the hidden states $H^{(i)} \in \mathbb{R}^{n_i \times z}$ of ingredient $i$ are calculated as in Eq 5, where $W_T \in \mathbb{R}^{z \times z}$ is a linear transformation matrix for the ingredient representations. Unlike the previous attention mechanism, we obtain the context vector $c_t \in \mathbb{R}^{z}$ by taking both the routing weights and the alignment vector into consideration, as Eq 6 shows. In this way, the undesired ingredients (those with lower routing weights) receive less attention. We produce an attention hidden state by concatenating the context vector and the hidden state. We then use an LSTM decoder to get the word distribution $p_t$ formulated in Eq 7, where $\hat{Y} = \{\hat{Y}_1, \ldots, \hat{Y}_{|\hat{Y}|}\}$ denotes the word sequence of the target text. The word with the highest probability in the distribution is selected as the generated word.
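To make the role of the routing weights in attention concrete, here is a small sketch of one plausible form of the routing-weighted context vector. Eq 5 and Eq 6 are not reproduced in this text, so the exact form is an assumption: alignments are taken as a softmax over the bilinear scores $H^{(i)} W h_{t-1}$ (Luong-style "general" attention), and the per-ingredient contexts are scaled by the routing weights $o_{i,d}$ of the chosen category. The function name `routed_context` is hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def routed_context(H_list, h_prev, W, o_d):
    """H_list: list of (n_i, z) encoder states, one array per ingredient.
    h_prev: (z,) last target hidden state; W: (z, z) transformation matrix.
    o_d: (n,) routing weights of the chosen category d."""
    c = np.zeros(h_prev.shape)
    for H_i, o in zip(H_list, o_d):
        a_i = softmax(H_i @ W @ h_prev)  # word-level alignments within ingredient i
        c += o * (a_i @ H_i)             # ingredient context scaled by routing weight
    return c

# Two ingredients: one single-word, one two-word; hidden size 3.
H_list = [np.array([[1.0, 2.0, 3.0]]),
          np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])]
h_prev = np.array([0.5, -0.5, 1.0])
W = np.eye(3)
c_first = routed_context(H_list, h_prev, W, np.array([1.0, 0.0]))
c_second = routed_context(H_list, h_prev, W, np.array([0.0, 1.0]))
```

With all routing weight on one ingredient, the context vector is built only from that ingredient's hidden states, which is how low-weight (undesired) ingredients end up with little influence on the decoder.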

Model Training
We pre-train the routing module. We mix the input ingredients and target categories of $n_r$ recipes into one datum to build a mixed training set $MT$. The loss function of RM consists of two parts: a classification loss and a routing loss. For multi-label classification, we use the classification loss $L_j$ for each category $j$ as in Eq 8, where $e_j = 1$ iff category $j$ exists in the mixed target categories, and $e_j = 0$ otherwise. For the routing loss, we define the gold routing weights as $G \in \mathbb{R}^{n \times m \times n_r}$, where $g_{i,j,k} = 1$ iff the $i$-th ingredient in the inputs is used for the $j$-th category in the $k$-th recipe, and $g_{i,j,k} = 0$ otherwise. As there may be multiple input combinations for one target category in one mixed datum, we want the predicted routing weights of the category to be consistent with one gold combination. Therefore, we maximize the maximum, over the gold combinations, of the summed weights, as in Eq 9. The loss function for RM is calculated on the mixed training set $MT$ as in Eq 10.
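The construction of a mixed datum and its targets can be illustrated with a short sketch. The classification loss follows the form of Eq 8 in this section. The routing score is only one reading of the verbal description (for each category, the maximum over gold combinations of the summed routing weights), since Eq 9 is not reproduced here. The helper names `mix`, `classification_loss` and `routing_score`, and the simplification that each recipe carries a single category, are assumptions made for illustration.

```python
import numpy as np

CATEGORIES = ["Low Sugar", "High Fiber", "Low Fat", "Grilling", "Frying"]

def mix(recipes):
    """recipes: list of (ingredient list, category name) pairs, n_r in total.
    Returns the mixed inputs, the indicator e (m,) and gold weights G (n, m, n_r)."""
    inputs = sorted({ing for ings, _ in recipes for ing in ings})
    e = np.zeros(len(CATEGORIES))
    G = np.zeros((len(inputs), len(CATEGORIES), len(recipes)))
    for k, (ings, cat) in enumerate(recipes):
        j = CATEGORIES.index(cat)
        e[j] = 1.0
        for ing in ings:
            G[inputs.index(ing), j, k] = 1.0
    return inputs, e, G

def classification_loss(V, e):
    # Eq 8: L_j = e_j (1 - ||v_j||)^2 + (1 - e_j) ||v_j||^2, one value per category.
    lengths = np.linalg.norm(V, axis=1)
    return e * (1.0 - lengths) ** 2 + (1.0 - e) * lengths ** 2

def routing_score(O, G):
    # Assumed reading of the routing objective: for each category j,
    # max over recipes k of sum_i o_ij * g_ijk.
    return np.max(np.einsum("ij,ijk->jk", O, G), axis=1)

recipes = [(["beef", "olive oil"], "Grilling"), (["oats", "honey"], "High Fiber")]
inputs, e, G = mix(recipes)
loss = classification_loss(np.zeros((5, 4)), e)   # all category vectors of length 0
score = routing_score(np.full((4, 5), 0.25), G)   # uniform routing weights
```

With zero-length category vectors the loss is 1 for each present category and 0 elsewhere, and uniform routing weights give each present category a score equal to the share of inputs that belong to its gold combination.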
$$L_j = e_j \left(1 - \|v_{j,*}\|\right)^2 + (1 - e_j)\,\|v_{j,*}\|^2. \tag{8}$$
For training RGM, we use the expected negative log-likelihood loss over the generation training set $GT$, as in Eq 11. The probability is modeled by the encoder and decoder of RGM with parameters $\theta$. As RM and the generative module (GM) share the encoders, the whole model is jointly trained.

Experiments

Data sets
To stay in line with previous work on recipe generation (Yang et al., 2017), we use the recipe data from Allrecipe to train the generative model. We exclude recipes that contain fewer than 10 tokens or more than 500 tokens. As the vocabulary of recipes is limited, we keep all the words appearing in the training set. The vocabulary sizes of ingredients and recipes are 6,121 and 19,168 respectively. The training set $GT$ contains 73,088 recipes, while the validation set and the test set (Standard Inputs) each contain 8,000 recipes. To explore the generality of our model, we build another test set (Mixed Inputs), which mixes the ingredients of two recipes for each test case to provide redundant ingredients.
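The length filter and the Mixed Inputs construction are simple enough to sketch directly. The snippet below is only an illustration: whitespace tokenization, the removal of duplicate ingredients when two lists are merged, and the function names `keep_recipe` and `mix_inputs` are assumptions, not details given in the text.

```python
def keep_recipe(text, min_tokens=10, max_tokens=500):
    # Keep recipes with at least 10 and at most 500 tokens.
    # Assumption: tokens are obtained by whitespace splitting.
    n_tokens = len(text.split())
    return min_tokens <= n_tokens <= max_tokens

def mix_inputs(ingredients_a, ingredients_b):
    # Merge the ingredient lists of two recipes into one test input.
    # Assumption: order is preserved and duplicates are dropped.
    return list(dict.fromkeys(list(ingredients_a) + list(ingredients_b)))

mixed = mix_inputs(["beef", "salt"], ["salt", "rice"])
```

Each mixed test case then has two candidate gold recipes, one for each source recipe.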
To pre-train the routing module, we use 312,707 ingredients with their corresponding recipe categories from Yummly. We mix the input ingredients and target categories of $n_r = 2$ recipes to build the mixed training set $MT$.

Baseline Models
In this work, we investigate how to improve recipe generation over strong baselines in both our setting (Mixed Inputs) and common setting (Standard Inputs). We compare our routing enforced generative model against the baseline models below.
Attseq: As the bidirectional LSTM encoder has shown strong representation capability (Devlin et al., 2019), we use a model with a bidirectional encoder and a Luong attention decoder (Luong et al., 2015) as our baseline model.
Pointer: As the vocabulary used in instructional language is limited and there is a strong relationship between the given ingredients and the recipes, the seq2seq model with a pointer network performs particularly well in previous recipe generation work (Yang et al., 2017). We use the pointer network (Vinyals et al., 2015) with ingredient attention, which provides performance comparable to the reference-aware language model (Yang et al., 2017) and a higher BLEU.
Retrieve: The model retrieves recipes from the training set according to the overlap of the input ingredients.
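As a rough illustration of an overlap-based retrieval baseline, the sketch below returns the training recipe whose ingredient list shares the most ingredients with the query. The text does not specify the exact overlap measure or tie-breaking, so the raw intersection size, the first-match tie-breaking, and the function name `retrieve` are assumptions.

```python
def retrieve(query_ingredients, training_set):
    """training_set: list of (ingredient list, recipe text) pairs.
    Returns the recipe text with the largest ingredient overlap with the query."""
    query = set(query_ingredients)
    # max() keeps the first item among ties.
    best = max(training_set, key=lambda item: len(query & set(item[0])))
    return best[1]

training_set = [
    (["beef", "onion", "salt"], "Brown the beef with onion and salt."),
    (["rice", "salt", "butter", "thyme"], "Cook rice with butter, thyme and salt."),
]
result = retrieve(["butter", "thyme", "rice"], training_set)
```

Because the score counts only shared ingredients, a retrieved recipe can easily contain ingredients that are not in the query.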
To explore the importance of the routing algorithm, we also report the performance of our model without the routing module (w/o RM).

Training Details
Attseq, Pointer and our model RGM all use 2-layer LSTM encoders and decoders. All hidden sizes in the three generation models are 512. To avoid over-fitting, we set the dropout rate to 0.3 in these models. We use Adam (Kingma and Ba, 2014) as the optimizer with a learning rate of 0.0001. As for hyperparameters, we set the number of routing iterations $r = 3$ and the weight increase factor $\alpha = 100$.

Automatic Evaluation
Metrics. To evaluate the effectiveness of our methods, we use the following automatic metrics. BLEU-4 (Papineni et al., 2002) is a commonly used metric for measuring the quality of machine-generated texts. Dis2 (Li et al., 2016) is the ratio of distinct bi-grams in the generated recipes, which reflects diversity. We denote the sets of ingredients used in the model outputs and in the gold references as $S_O$ and $S_G$ respectively. Prec. denotes the ratio of $S_O \cap S_G$ in $S_O$. Rec. denotes the ratio of $S_O \cap S_G$ in $S_G$. F1 is the harmonic mean of Prec. and Rec., and serves as an overall measurement. For all the metrics, a higher value is better.
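The ingredient-level metrics and Dis2 defined above can be computed as in the following sketch. It assumes the ingredient mentions have already been extracted and normalized into sets, and that Dis2 is taken as the number of distinct bi-grams divided by the total number of bi-grams over all generated recipes, with whitespace tokenization. The function names are hypothetical.

```python
def ingredient_prf(used_in_output, used_in_gold):
    # Prec. = |SO ∩ SG| / |SO|, Rec. = |SO ∩ SG| / |SG|, F1 = harmonic mean.
    so, sg = set(used_in_output), set(used_in_gold)
    overlap = len(so & sg)
    prec = overlap / len(so) if so else 0.0
    rec = overlap / len(sg) if sg else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
    return prec, rec, f1

def distinct_2(texts):
    # Ratio of distinct bi-grams to all bi-grams across the generated recipes.
    bigrams = []
    for text in texts:
        tokens = text.split()
        bigrams.extend(zip(tokens, tokens[1:]))
    return len(set(bigrams)) / len(bigrams) if bigrams else 0.0

prec, rec, f1 = ingredient_prf(["beef", "onion", "salt"],
                               ["beef", "salt", "basil", "oil"])
dis2 = distinct_2(["mix the flour", "mix the sugar"])
```

In this example two of the three output ingredients are in the gold set, giving a precision of 2/3, a recall of 1/2 and an F1 of 4/7; three of the four bi-grams are distinct, giving a Dis2 of 0.75.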
Analysis. Results for the texts generated by all the models, given Mixed Inputs or Standard Inputs, are shown in Table 1. As we mix the ingredients of two recipes to create Mixed Inputs, we take each of the two recipes in turn as the ground truth and report the maximum score over the two ground truths for each automatic evaluation metric. The general trends are consistent across Standard Inputs and Mixed Inputs, so we discuss them together.
As the vocabulary used in instructional language is limited and there is a strong relationship between the given ingredients and the recipes, Pointer performs better than Attseq on almost all the evaluation metrics. Pointer may sometimes directly copy ingredients from the given inputs during generation, and it achieves a rather high Prec. With excessive attention on a limited number of ingredients, however, Pointer generates recipes with low Rec. In contrast, Retrieve outputs the recipes in the training set that contain the most input ingredients, which results in the highest Rec. But gaps exist between the training set and the test set, and the gold recipes from the two sets might differ even if the inputs are the same, let alone that the retrieved outputs always correspond to supersets of the given inputs in the test set. Retrieve thus obtains the lowest Prec. In all the collected data sets, the ingredient names in the ingredient lists may disagree with the corresponding expressions in the recipes. We use ingredient mapping rather than word mapping to calculate the automatic evaluation metrics, because "green onion" is not the same ingredient as "onion". However, "basil leaf" in the ingredient list is used as "basil" in the recipe, and both expressions represent the same ingredient. Due to this inconsistency, the Rec. of Retrieve is not 1. Besides, the outputs of Retrieve are all handcrafted recipes, with a higher diversity than those of the generative models.
To explore the efficacy of selective routing, we remove the routing module in the test phase. Due to the strong representation capability of the shared encoder, the BLEU-4, Prec. and F1 of w/o RM show a clear improvement over Attseq. Combined with RM, our RGM achieves the best performance on BLEU-4 and F1 in both cases. The recipes generated by RGM also have the highest diversity among all the automatically generated recipes. This ablation study demonstrates the empirical contribution of the routing algorithm. The automatic evaluation results suggest that our model learns the internal relationships among the ingredients well and can generate high-quality recipes.

Human Evaluation
Settings. Because recipes are a kind of instructional language that teaches people how to cook, performing well on the automatic metrics is not enough. For a more comprehensive evaluation, we sample 50 inputs, obtain the 50 recipes generated by each of the models above, and form 50 groups of recipes (25 for each case) for the human evaluation. We ask 9 judges on Amazon Mechanical Turk to rate the recipes on a Likert scale ($\in [1, 5]$). Three native English speakers are asked to score each recipe according to the following descriptions: Readability denotes whether the recipe is fluent and easy to understand. Accuracy denotes whether the given ingredients are correctly used in the recipe. Feasibility denotes whether the recipe is feasible. Creativity denotes whether the recipe is innovative. Overall denotes the overall quality of the recipe.

Analysis
The results in Table 2 show that there is a large gap between automatic evaluation and human evaluation for Pointer. It outperforms Attseq on automatic evaluation but gets the worst rating scores on human evaluation. This may be because Pointer sometimes reuses the same ingredient several times in a recipe or even repeats phrases, which does not affect the automatic metrics much but discourages people from reading. This inconsistency points to the deficiencies of automatic evaluation. Another interesting finding is that Retrieve achieves lower scores than our model on all the aspects. As Retrieve returns handcrafted recipes as outputs, its Readability and Feasibility scores would be expected to be higher than our model's. We compare the outputs of both models and find that Retrieve outputs rather long recipes. On the test sets, the average output lengths are 173.63 and 96.95 words for Retrieve and RGM respectively, while gold recipes contain 104.41 words on average. Longer texts mean that there are more operation steps or more ingredients used, which makes the recipe difficult to understand and follow. Moreover, Retrieve outputs common recipes from the data set, while RGM generates novel recipes. Therefore our model also beats Retrieve on Creativity. The human evaluation indicates that effectively calculated routing weights of ingredients are informative for recipe generation.

Case Study
For an intuitive comparison, we show some examples in Figure 3. As Pointer achieves rather low scores in human evaluation, we only show the recipes generated by the two stronger baseline models here. We abbreviate some outputs with ellipses because Retrieve always outputs long texts (please refer to the appendix for details). The noise ingredients (given in the Mixed Inputs but not expected to be used) are in purple, and extraneous ingredients (not in the given inputs) are in red. The ingredients that are correctly used in the recipes are in bold black. Italic words denote ingredients that are supposed to appear in the recipes and have been used before. In both cases, Attseq generates recipes containing few expected ingredients. In particular, it introduces extraneous ingredients when given Mixed Inputs and incorrectly reuses "sugar" when given Standard Inputs. Retrieve introduces quite a few extraneous ingredients in both cases. It outputs complicated recipes with too many ingredients and lengthy operation steps, which are not practical for most people. In contrast, RGM uses most of the expected ingredients and only one noise ingredient, which indicates that our routing module is effective for content selection. Moreover, compared to the attention-based model Attseq, RGM alleviates the problem of inappropriately repeating words.

Controllable Recipe Generation
Our model differs from existing methods mainly in two aspects: the routing-based selection model, and recipe generation in accordance with constraints (user demands). For all the results discussed above, RGM selects ingredients and generates the recipes following the routing weights. In this section, we assign constraints on ingredients or categories, as Figure 4 shows. In the first example, the recipe generated without constraints does not use "mustard". We constrain this ingredient by raising its initial weight in the selective routing algorithm. As a result, RGM generates a recipe with the expected mustard. Further, we conduct experiments on constraining several ingredients at once, and the results confirm the validity of the selective routing algorithm. In the second example, RGM uses 6 of the input ingredients without any constraints. If a user especially prefers some ingredients, such as "thyme", "butter" and "roast", we set the corresponding constraints on them. The generated recipe then contains exactly the assigned ingredients. People with special tastes or demands need dishes of certain categories, such as "Low Sugar". In the third example, RGM first generates the recipe following the maximum likelihood without any constraints. The generated recipe is a "High Fiber" one. We then add the constraint "Low Sugar". The new recipe contains almost the same ingredients as before, except for "honey". Honey is a sweet substance produced by bees and some related insects, so it should not appear in a "Low Sugar" recipe. The operation steps are adjusted accordingly. The results of these extended experiments show that RGM is able to generate reasonable recipes that follow user demands.

Conclusion and Future Work
In this paper, we introduce a routing algorithm to enforce the recipe generation model. We model the internal relationships between ingredients with a selective routing algorithm. Given ingredients with noise, our model selects reasonable ingredient collocations and generates recipes based on user demands. Extensive experiments show that the generated recipes are not only fluent and feasible, but also creative. There are several directions to explore in the future. For example, the clustering ability of the routing algorithm could be used to control the style of generated texts.