A Framework for Procedural Text Understanding



Introduction
Among the many sorts of natural language texts, procedural texts are clear and closely related to the real world. They are thus a suitable first target for natural language understanding (NLU). A procedural text is a sequence of sentences describing instructions to create an object or to change an object into a certain state. If a computer could understand procedural texts, there would be many potential applications: an intelligent search engine for how-to texts, more intelligent computer vision (Ramanathan et al., 2013), a work help system that tells the operator what to do next (Hashimoto et al., 2008), etc.
General natural language processing (NLP) tries to solve the understanding problem through a long series of sub-problems: word identification, part-of-speech tagging, parsing, semantic analysis, and so on. Contrary to this design, in this paper we propose a concise framework of NLU focusing on procedural texts. There have been a few attempts at procedural text understanding. Momouchi (1980) tried to convert various procedural texts into a so-called PT-chart against the background of automatic programming. Hamada et al. (2000) proposed a method for interpreting cooking instruction texts (recipes) in order to schedule two or more recipes. Although their definition of understanding was not clear and their approaches were based on domain-specific heuristic rules, these pioneering works inspired us to tackle a major problem of NLP, text understanding.
* This work was done when the first author was at Kyoto University.
As the meaning representation of a procedural text we adopt a flow graph. Its vertices are important concepts consisting of word sequences denoting materials, tools, actions, etc., and its arcs denote relationships among them. It has a special vertex, root, corresponding to the final product. The problem we try to solve in this paper is to convert a procedural text into the appropriate flow graph. The input of our NLU system is the entire text, not a single sentence.
Our framework first segments sentences into words (word segmentation; abbreviated to WS hereafter). This process is needed only for languages without clear word boundaries. Then we identify concepts in the texts and classify them into categories (concept identification; abbreviated to CI hereafter). Finally we connect them with labeled arcs. For the first process, WS, we adapt an existing tool to the target domain and achieve a sufficiently high accuracy. The second process, CI, can be solved by the named entity recognition (NER) technique given an annotated corpus (training data). The major difference is the definition of named entities (NE). Contrary to many other NER methods, we propose a method that does not require part-of-speech (POS) tags. This keeps our text understanding framework simple. For the final process we extend a graph-based parsing method to deal with the entire text, a sequence of sentences, at once. The differences from sentence parsing are that the vertices are concepts rather than words, and that words not covered by any concept function as clues for the structure.
As a representative of procedural texts, we selected cooking recipes, because there are many available resources not only in the NLP area but also in the computer vision (CV) area. For example, the TACoS dataset (Regneri et al., 2013) is a collection of short videos recording fundamental actions in cooking, with descriptions written by Amazon Mechanical Turk workers. Another example, the KUSK dataset (Hashimoto et al., 2014), contains 40 videos recording entire executions (20 recipes by two persons). The recipes in the KUSK dataset are taken from the r-FG corpus, in which each recipe text is annotated with its "meaning." We tested our framework on recipe texts manually annotated with word boundary information, concepts, and a flow graph. We compare a naive application of an MST dependency parser with our extension for flow graph estimation. We also measure the accuracy at each step with gold input, that is, assuming that the preceding steps are perfect. Finally we evaluate the fully automatic process of building a flow graph from a raw text. Our result can be a solid baseline for future improvement in the procedural text understanding problem.

Related Work
Some attempts at procedural text understanding were proposed in the early 1980s (Momouchi, 1980). Then Hamada et al. (2000) proposed a tree-based representation of cooking instruction texts (recipes) from the application point of view. These approaches used rule-based methods, but they, along with the current success of the machine learning approach, inspired us to conceive that procedural text understanding can be a tractable problem for current NLP.
In our framework the procedural text understanding problem is decomposed into three processes. The first process is the well-known WS. Many studies have reported high accuracies in various languages based on the corpus-based approach (Merialdo, 1994; Neubig et al., 2011, inter alia). The second one is CI, which can be solved in the same way as NER (Chinchor, 1998) with a different definition of named entities. The accuracy of general NER is lower than that of WS but is more than 90% when a large annotated corpus is available (Sang and Meulder, 2003, inter alia). So we can say that CI can also be solved given an annotated corpus. The only open question is how many examples are required to achieve a practically high accuracy. This paper addresses this question empirically. The third one is our original text parsing, which takes a text and the concepts in it as the input and outputs a flow graph. To solve this problem, we follow the idea of graph-based dependency parsing (McDonald et al., 2006; McDonald et al., 2005). Dependency parsing attempts to connect all the words in an input sentence with labeled arcs to form a rooted tree. In our method, the units are concepts instead of words and the input is an entire text (a sequence of sentences), not a single sentence. The words not forming concepts (mainly function words) are referred to only as features to estimate the flow graph. We also add another module to form a directed acyclic graph (DAG).
From the NLP viewpoint, the major problems we are solving are 1) dependency parsing (Buchholz and Marsi, 2006) among concepts only, 2) predicate-argument structure analysis (Taira et al., 2010;Yoshino et al., 2013), 3) semantic parsing (Wong and Mooney, 2007;Zettlemoyer and Collins, 2005), and 4) coreference, anaphora, and ellipsis resolution (Nielsen, 2004;Fernández et al., 2004). For dependency parsing we resolve the target of modifiers such as quantities, durations, timing clauses. For predicate-argument structure analysis, we figure out which action is applied to what object with what tools, even if it is stated in passive form or just by a past participle. For semantic parsing we resolve the relationships between concepts. For coreference, anaphora, and ellipsis resolution, our DAG constructor links an action to another action that takes the result of the former action or an abstract expression to a concrete intermediate product. Our method solves these problems focusing on important notions at once.
The understanding of procedural texts may allow a more sophisticated combination of NLP and CV. Recently there have been some attempts at aligning videos and natural language descriptions (Naim et al., 2014). In these studies, the NLP part is very naive: they just identify the nouns in the text and apply a sequence-based alignment tool. Meanwhile, the machine translation community is shifting to the tree-based approach to capture structural differences between two languages. The flow graph representation enables grounding of tuples consisting of an action and its target objects, and also absorbs the difference between the order of instructions in a procedural text and the execution order in the video recording.
Although NLU is the major scientific problem of AI, procedural text understanding is important from the viewpoint of applications as well. For cooking recipes, on which we test our framework in this paper, we can for example realize a more intelligent search engine, summarization, or a help system (Yamakata et al., 2013; Hashimoto et al., 2008).

[Figure 1: Examples of a procedural text and its flow graph. English glosses of the annotated text, with the concept tag after each slash: 1. Add/Ac broth/F, the water/F, macaroni/F, and pepper/F, and simmer/Ac until the pasta/F is/Af tender/Sf. 3. Sprinkle/Ac the snipped/Ac sage/F.]

Recipe Flow Graph Corpus
As a test bed for the text parsing problem, we adopt the recipe flow graph corpus (r-FG corpus). To the best of our knowledge, this is the only corpus annotated with flow graphs that matches our requirements. In addition, cooking recipes are representative procedural texts describing very familiar activities of our daily life, and their meaning representation has various applications. Our framework is, however, not limited to this corpus.

r-FG Corpus
The r-FG corpus contains randomly crawled recipes in Japanese from a famous Internet recipe site. The specification of the corpus is shown in Table 1. The text part of a recipe consists of a sequence of steps, and each step consists of some sentences. All the concepts (entities and actions) appearing in the sentences are identified and annotated with a concept tag. The text part is annotated with a rooted DAG representing its meaning, as shown in Figure 1.

Table 1: Specification of the r-FG corpus.
#recipes | #sentences | #NEs | #words
200 | 1,303 | 7,268 | 25,446

Vertices
Each vertex of a flow graph corresponds to a concept represented by a word sequence in the text and a concept type such as food, tool, or action. Table 2 lists the concept types along with the average number of occurrences per recipe. There is one special vertex, root, corresponding to the final dish. In the example in Figure 1, the node of "sprinkle" is the root.

Arcs
An arc between two vertices indicates that they have a certain relationship. An arc has a label denoting its relationship type.

Table 3: Arc labels with frequencies per recipe.

In Figure 1, for example, "macaroni" is equal to "pasta." According to world knowledge, macaroni is a sort of pasta, but in this recipe they are identical. An example of a null-instantiated argument is the relationship between "heat" and "add." Celery etc. should be added not to the initial cold Dutch oven without oil but to the hot Dutch oven with oil, which is the implicit result of the action "heat."

Overview of Procedural Text Understanding
Our framework of procedural text understanding consists of the following three processes combined in a cascaded manner.

1. Word segmentation (WS)
2. Concept identification (CI)
3. Flow graph estimation

The input of WS is a raw sentence and the output is a word sequence. For example, WS takes the first sentence in Figure 1, without any tags, as the input and outputs the corresponding word sequence separated by whitespace.
The input of CI is the word sequence output by WS. CI identifies concepts, which are non-overlapping spans of words, each annotated with its type. For the above example, CI outputs three concepts. This part is similar to NER. Contrary to normal NER, however, our method does not require POS tags for the words in the input. Thus we do not need to adapt a POS tagger to the target domain. For English or other languages with obvious word boundaries, we can start from CI. Now we have a text consisting of some sentences with concepts identified. An example is the left-hand side of Figure 1. This is the input of the flow graph estimation step, and the output is a flow graph, as shown on the right-hand side of Figure 1.
In the traditional NLP approach, many sub-problems follow NER. Syntactic parsing clarifies the intra-sentential relationships among NEs, then anaphora/coreference resolution figures out their inter-sentential relationships. In contrast, we process the entire text at once. In the subsequent sections, we describe the above three processes in detail.

Word Segmentation
Some languages, such as Japanese or Chinese, have no obvious word boundary like whitespace in English. The first step of our framework is therefore WS. For many European languages this process is almost trivial, and instead of WS we only need to decompose some special words like "isn't" into "is" + "not" in English or "du" into "de" + "le" in French.
For WS we adopt the pointwise method (Neubig et al., 2011) because of its flexibility in adding language resources. This characteristic is especially suitable for domain adaptation. Below we briefly explain pointwise WS and our method to improve its accuracy for user-generated recipes.

Table 4 (columns: Type, Feature setting): Features for word segmentation. The function c(·) maps a character into one of six character types: symbol, alphabet, arabic number, hiragana, katakana, and kanji. The function d(·) returns whether the string is in the dictionary or not. The functions L(·) and R(·) return whether substrings of any length on the left-hand side or right-hand side match a dictionary entry.

Pointwise Method
The pointwise method formulates WS as a binary classification problem, estimating the boundary tags $b_1^{I-1}$ for an input sequence of $I$ characters. Tag $b_i = 1$ indicates that a word boundary exists between characters $x_i$ and $x_{i+1}$, while $b_i = 0$ indicates that a word boundary does not exist. This classification problem can be solved by tools in the standard machine learning toolbox such as support vector machines (SVMs).
The features are character n-grams surrounding the decision point $i$, which are substrings of $x_{i-2} x_{i-1} x_i x_{i+1} x_{i+2} x_{i+3}$, character type n-grams, and whether the character n-grams match an entry in the dictionary or not. Table 4 lists the features.
As we can see, pointwise WS does not refer to the other decisions. We can therefore train it from partially segmented sentences, in which only some points between characters are annotated with word boundary information.
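The feature extraction at a single decision point can be sketched as follows. This is a minimal illustration, not the implementation of Neubig et al. (2011): the function names, the feature string format, and the Unicode ranges used for the character types are assumptions, and the dictionary features d(·), L(·), and R(·) are omitted.

```python
def char_type(ch):
    """Map a character to a coarse type (a stand-in for the paper's c(.))."""
    if "\u3040" <= ch <= "\u309f":
        return "hiragana"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    if ch.isdigit():
        return "number"
    if ch.isalpha():
        return "alphabet"
    return "symbol"


def boundary_features(text, i, max_n=3):
    """Features for the decision point between text[i] and text[i + 1] (0-based).

    The window is text[i - 2 .. i + 3], i.e. three characters on each side,
    clipped at the sentence edges.
    """
    left = max(0, i - 2)
    right = min(len(text), i + 4)
    feats = []
    for start in range(left, right):
        for n in range(1, max_n + 1):
            end = start + n
            if end > right:
                break
            offset = start - i  # position relative to the decision point
            gram = text[start:end]
            feats.append(f"char[{offset}]={gram}")
            feats.append(f"type[{offset}]=" + "/".join(char_type(c) for c in gram))
    return feats


# Decision point between the 2nd and 3rd characters of a five-character string.
for feat in boundary_features("水を加える", 1)[:6]:
    print(feat)
```

Because each decision point yields its own feature list and label, any binary classifier can be trained on whichever points happen to be annotated, which is what makes partially segmented sentences usable.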

Domain Adaptation
To adapt WS to recipes, we convert the r-FG corpus into partially segmented sentences following Mori and Neubig (2014). In the corpus only r-NEs are segmented into words. That is to say, only both edges of the r-NEs and the inside of the r-NEs are annotated with word boundary information. If the r-NE in focus is composed of two words, for instance, the annotated points are its two edges, the boundary between its two words, and the points inside each word; all other points carry no annotation. In the notation of partially segmented sentences, the symbols "|," "-," and " " mean word boundary, no word boundary, and no information, respectively. We then use the partially annotated sentences obtained in this way as an additional language resource to train the model.
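The conversion from r-NE annotations to partially segmented sentences can be sketched as below. This is an illustrative sketch under assumptions: the input format (a character offset plus the word list of each r-NE) and the function names are hypothetical, and Latin letters stand in for Japanese characters.

```python
def partial_annotation(text, entities):
    """Return one mark per point between adjacent characters of text.

    entities: list of (start, words) pairs, where start is the character
        offset of an r-NE and words is its segmentation into words.
    Marks: "|" word boundary, "-" no word boundary, " " no information.
    """
    marks = [" "] * (len(text) - 1)  # marks[k] sits between text[k] and text[k + 1]

    def set_mark(point, mark):
        # Points outside the sentence (before the first or after the last
        # character) are simply ignored.
        if 0 <= point < len(marks):
            marks[point] = mark

    for start, words in entities:
        pos = start
        set_mark(pos - 1, "|")  # left edge of the r-NE
        for word in words:
            if text[pos:pos + len(word)] != word:
                raise ValueError(f"{word!r} not found at offset {pos}")
            for k in range(pos, pos + len(word) - 1):
                set_mark(k, "-")  # inside a word
            pos += len(word)
            set_mark(pos - 1, "|")  # end of the word; the last one is the right edge
    return marks


def render(text, marks):
    """Interleave characters and marks for display."""
    return "".join(ch + mark for ch, mark in zip(text, marks + [""]))


# An r-NE made of the two words "cd" and "e", starting at offset 2.
print(render("abcdefg", partial_annotation("abcdefg", [(2, ["cd", "e"])])))
```

The points left as " " contribute no training instance to the pointwise classifier, so the r-NE annotation is used without committing to any segmentation of the surrounding text.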

Concept Identification
The second step is the concept identification. The concept in the text parsing problem is a span of words without overlap annotated with its type. Thus the concept identification (CI) can be solved in the same manner as the named entity recognition (NER). NER is a sequence labeling problem and many solutions have been proposed so far (Borthwick, 1999;Sang and Meulder, 2003, inter alia).
The standard NER method is based on linear-chain conditional random fields (CRFs). In this paper we use an NER method which accepts a partially annotated corpus as training data as well as a normal fully annotated corpus. In the training step this NER estimates the parameters of a classifier based on logistic regression (Fan et al., 2008) from sentences fully (or partially) annotated with NEs (concepts). The features are word n-grams surrounding the word in focus, $w_i$; Table 5 lists them. At run-time, given a word sequence, the classifier enumerates all possible BIO2 tags $t_i$ for each word $w_i$ with their probabilities $P(t_i \mid w^-, w_i, w^+)$, where $w^-$ and $w^+$ are the word sequences preceding and following $w_i$, respectively. Then this NER searches for the tag sequence of the highest probability satisfying the tag sequence constraints (see footnote 5).
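The search for the best tag sequence under the BIO2 constraints can be sketched with a small dynamic program. This is a sketch under assumptions, not the tool used in the paper: the tag spelling ("F-B", "F-I", "O") follows the example in footnote 5, the sequence probability is assumed to be the product of the per-word tag probabilities, and the function names are hypothetical.

```python
import math


def is_valid_transition(prev_tag, tag):
    """A tag X-I may only follow X-B or X-I of the same concept type X."""
    if tag.endswith("-I"):
        concept = tag[:-2]
        return prev_tag in (concept + "-B", concept + "-I")
    return True


def best_tag_sequence(tag_probs):
    """tag_probs: one dict per word, mapping each tag to its probability.

    Returns the valid tag sequence that maximizes the product of the
    per-word probabilities.
    """
    # best[tag] = (log probability, path) of the best valid path ending in tag
    best = {}
    for tag, p in tag_probs[0].items():
        if p > 0 and is_valid_transition("O", tag):  # sentence start acts like "O"
            best[tag] = (math.log(p), [tag])
    for probs in tag_probs[1:]:
        new_best = {}
        for tag, p in probs.items():
            if p <= 0:
                continue
            candidates = [
                (score + math.log(p), path + [tag])
                for prev, (score, path) in best.items()
                if is_valid_transition(prev, tag)
            ]
            if candidates:
                new_best[tag] = max(candidates, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])[1]


# Picking the most probable tag of each word independently would give the
# invalid sequence F-I T-I; the constrained search avoids it.
probs = [
    {"F-B": 0.4, "F-I": 0.5, "O": 0.1},
    {"T-I": 0.6, "F-I": 0.3, "O": 0.1},
]
print(best_tag_sequence(probs))
```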

Parsing an Entire Text
The final step is to build a flow graph. The input is a text whose sentences are segmented into words and in which all the concepts are identified. We call this part text parsing. As we mentioned in Section 1, text parsing deals with various language phenomena at once, such as dependency, predicate-argument structure, and anaphora/coreference.
For text parsing we extend an MST parser (McDonald et al., 2005). Since the flow graph is a labeled DAG, we add some labeled arcs to the MST. Below we explain the processes one by one.

Spanning Tree Estimation
We first build a labeled spanning tree covering all the concepts (vertices) of the input text. Let $V$ be a set of vertices and $\mathcal{G}$ be the set of possible spanning trees on $V$. We assume that there exists a score function $s(u, v, l)$ which represents the likelihood of making a labeled arc from $u$ to $v$ with label $l$. Then the maximum spanning tree (MST) can be found as follows:

$$\hat{G} = \operatorname*{argmax}_{G \in \mathcal{G}} \sum_{(u, v, l) \in G} s(u, v, l)$$

We solve this problem using the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967). We define the score function $s(u, v, l)$ as a probability (see footnote 6):

$$s(u, v, l) = \frac{\exp(\Theta \cdot f(u, v, l))}{\sum_{v' \in V} \sum_{l' \in L} \exp(\Theta \cdot f(u, v', l'))}$$

Here $L$ is the arc label set (see Table 3), $\Theta$ is a vector of weight parameters, and $f(u, v, l)$ is a function that maps a labeled arc into a feature vector. The score function $s(u, v, l)$ computes the probability of making a labeled arc from $u$ to $v$ with label $l$, referring to their word sequences, concept tags, the surrounding words in the original recipe text, and the label $l$. A detailed description is given in Subsection 7.3.

Figure 2: Algorithm of DAG estimation.
1: G ← Maximum spanning tree of V.
2: A ← Sequence of arcs that can be added to G without violating the acyclic condition.
3: Sort A in the descending order of the value of the score function s.
...
8: n ← n + 1
9: end if
10: end for
11: return G

5 For example, the tag sequence F-I T-I is invalid.
6 This is the probability of a directed arc with label l from a fixed vertex u, but not a probability over all the directed arcs. We have tried the latter scoring function, but the result was worse than that of the former scoring function, which we report in this paper.
We use a log-linear model (Berger et al., 1996) in order to train the weight parameters $\Theta$. Let $\{(V_t, u_t, v_t, l_t)\}_{t=1}^{T}$ denote a set of $T$ training instances, where $V_t$ is a set of vertices and $(u_t, v_t, l_t)$ is a gold standard arc with label $l_t$ in the $t$-th training instance. Given these training examples, the weight parameters are estimated so that they maximize the following likelihood:

$$\prod_{t=1}^{T} \frac{\exp(\Theta \cdot f(u_t, v_t, l_t))}{\sum_{v' \in V_t} \sum_{l' \in L} \exp(\Theta \cdot f(u_t, v', l'))}$$

Because our flow graph is not a tree but a DAG, there can be more than one arc outgoing from a single vertex. In other words, it may contain two arcs, $(u, v, l)$ and $(u, v', l')$, which share the same start vertex $u$. In these cases, we add both $(V, u, v, l)$ and $(V, u, v', l')$ to the training data.
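The scoring of candidate arcs from a fixed start vertex can be sketched as a softmax over (end vertex, label) pairs. This is a minimal sketch under assumptions: features are represented as lists of active feature names, the weight vector is a dictionary, self-loops are excluded from the candidates, and the vertex names, label names, and feature function are hypothetical placeholders.

```python
import math


def arc_scores(u, vertices, labels, theta, features):
    """Return {(v, l): s(u, v, l)} for a fixed start vertex u.

    theta: dict mapping a feature name to its weight.
    features: callable (u, v, l) -> list of active feature names.
    """
    raw = {}
    for v in vertices:
        if v == u:
            continue  # assumption: an arc never connects a vertex to itself
        for l in labels:
            raw[(v, l)] = sum(theta.get(name, 0.0) for name in features(u, v, l))
    top = max(raw.values())  # subtract the maximum for numerical stability
    exp = {key: math.exp(value - top) for key, value in raw.items()}
    z = sum(exp.values())
    return {key: value / z for key, value in exp.items()}


def toy_features(u, v, l):
    # Hypothetical single feature: the label combined with both vertex names.
    return [f"{l}^{u}->{v}"]


theta = {"label_a^broth->add": 2.0}
scores = arc_scores("broth", ["broth", "add", "simmer"], ["label_a", "label_b"],
                    theta, toy_features)
print(max(scores, key=scores.get))
```

The scores for one start vertex sum to one, which matches the reading that the probability is taken over the arcs leaving a fixed vertex rather than over all directed arcs.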

Arc Addition
Given the labeled MST G, we add some labeled arcs to form a flow graph of a labeled DAG with a root. Our algorithm for adding arcs is described in Figure 2.
The most important point is to choose the best arcs one by one among those which do not create cycles and add them to the MST. In Figure 2 s(v, w, l) is the same score function used in the MST estimation and p(n) is a function that gives a penalty when the n-th additional arc is added to G. Thus the arc of the highest s is added if the s value is larger than the penalty p(n).
We adopt an exponentially increasing function as the penalty function $p(n)$ with parameters $\lambda$ and $\xi$ as follows (see footnote 7):

$$p(n) = \frac{\xi}{\lambda e^{-\lambda n}}$$

The denominator on the right-hand side is an exponential distribution with parameter $\lambda$. The numerator $\xi$ is a scale parameter. At the training step we first estimate the $\lambda$ which minimizes the square error between the training-data distribution of the number of arcs added to the tree and the exponential distribution. Then we choose the $\xi$ which maximizes the F-measure on held-out data, a small portion of the training data, as is done in the deleted interpolation method (Jelinek, 1985).
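The arc addition step can be sketched as a greedy loop over candidate arcs sorted by score. This is one reading of the description above, not a transcription of Figure 2: the penalty is taken to be the scale parameter ξ divided by the exponential density λe^{-λn}, the counter n is assumed to start at 1, at most one arc per ordered vertex pair is assumed, and all function names are hypothetical.

```python
import math


def penalty(n, lam, xi):
    """p(n) = xi / (lam * exp(-lam * n)); grows exponentially with n."""
    return xi / (lam * math.exp(-lam * n))


def creates_cycle(arcs, u, v):
    """True if adding the arc u -> v would close a directed cycle,
    i.e. if u is already reachable from v."""
    children = {}
    for a, b, _ in arcs:
        children.setdefault(a, []).append(b)
    stack, seen = [v], set()
    while stack:
        node = stack.pop()
        if node == u:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(children.get(node, []))
    return False


def add_arcs(tree_arcs, candidates, score, lam, xi):
    """tree_arcs: labeled arcs (u, v, l) of the spanning tree.
    candidates: labeled arcs outside the tree.
    score: callable (u, v, l) -> float, the same s as in the MST step."""
    graph = list(tree_arcs)
    n = 1  # index of the next additional arc
    for u, v, l in sorted(candidates, key=lambda a: score(*a), reverse=True):
        if any((a, b) == (u, v) for a, b, _ in graph):
            continue  # assumption: at most one arc per ordered vertex pair
        if score(u, v, l) > penalty(n, lam, xi) and not creates_cycle(graph, u, v):
            graph.append((u, v, l))
            n += 1
    return graph


tree = [("a", "b", "x"), ("b", "c", "x")]
scores = {("a", "c", "y"): 0.9, ("c", "a", "y"): 0.8, ("a", "c", "z"): 0.75}
print(add_arcs(tree, list(scores), lambda u, v, l: scores[(u, v, l)], lam=1.0, xi=0.1))
```

Since the candidates are visited in descending order of score while the penalty only grows with each accepted arc, the loop effectively stops adding arcs once the best remaining score falls below the current penalty.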

Features
The feature function outputs a high dimensional feature vector that represents a characteristic of a labeled arc (u, v, l).
We extract features from labeled arcs (u, v, l) by two processes: first we extract features from the arc and the input recipe text, and then we concatenate the label l to each feature we extract. First the following features are extracted from the input arc (u, v) and the recipe text:
...
F7: concept tag of u ∧ concept tag of v ∧ whether u and v are in the same sentence,
F8: concept tag of u ∧ concept tag of v ∧ whether Ac exists between u and v ∧ whether a function word exists between u and v,
F9: concept tag of u ∧ concept tag of v ∧ function words between u and v.

7 The reason is that, in small training data, the relationship between the number of additional arcs and the number of flow graphs matched an exponentially decreasing function well.
8 The pronunciations are automatically estimated based on the method described in

Table 6: Accuracies of the baseline and proposed methods.
Here the symbol ∧ indicates the combination of individual features. Next we simply concatenate the label l with each feature to construct the feature vectors of labeled arcs. For example, we extract a feature "concept tag of u ∧ concept tag of v." This feature then becomes "l ∧ concept tag of u ∧ concept tag of v" by concatenating the label l. The same concatenation of a label is done for the other features.
So-called higher-order features, which refer to the neighboring vertices in the DAG, are common in work on dependency parsing (McDonald et al., 2006; Koo and Collins, 2010), but we do not use these kinds of features because we only have 200 recipes annotated with DAGs. This number is much smaller than the roughly 40,000 sentences of the Wall Street Journal which are commonly used to train dependency parsers (Marcus et al., 1994).
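The two-stage construction of labeled-arc features can be sketched as follows. This is only an illustration: the string encoding of features, the function names, and the label name "LABEL" are assumptions, and only feature F7 is shown.

```python
def arc_features(tag_u, tag_v, same_sentence):
    """Unlabeled features of the arc (u, v); only F7 is sketched here."""
    return [f"F7:{tag_u}^{tag_v}^same_sentence={same_sentence}"]


def labeled_features(label, unlabeled):
    """Concatenate the arc label l to every unlabeled feature."""
    return [f"{label}^{feat}" for feat in unlabeled]


# A food concept (F) and an action concept (Ac) in the same sentence.
print(labeled_features("LABEL", arc_features("F", "Ac", True)))
```

Keeping the label as a prefix of every feature lets a single weight vector score all (arc, label) combinations, at the cost of multiplying the number of features by the number of labels.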

Evaluation
We evaluated our framework on the r-FG corpus described in Table 1. We executed 10-fold cross validation for more reliable results.
DAG estimation accuracy is measured by the F-measure of labeled arcs, which is the harmonic mean of precision and recall. Let $N_{sys}$, $N_{ref}$, and $N_{int}$ be the number of the estimated arcs, the number of the gold standard arcs, and the size of their intersection, respectively. Then precision $= N_{int} / N_{sys}$, recall $= N_{int} / N_{ref}$, and F-measure $= 2 N_{int} / (N_{ref} + N_{sys})$.

Table 7: F-measure of each task and the overall task.
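The arc-level measures can be computed directly from the two arc sets. The sketch below assumes that arcs are represented as hashable tuples (u, v, l); the function name and the toy arcs are hypothetical.

```python
def arc_f_measure(system_arcs, gold_arcs):
    """Return (precision, recall, F-measure) over labeled arcs."""
    system, gold = set(system_arcs), set(gold_arcs)
    n_int = len(system & gold)  # arcs matching in both end points and label
    precision = n_int / len(system) if system else 0.0
    recall = n_int / len(gold) if gold else 0.0
    total = len(system) + len(gold)
    f_measure = 2 * n_int / total if total else 0.0
    return precision, recall, f_measure


system = [("a", "b", "x"), ("b", "c", "x"), ("a", "c", "y")]
gold = [("a", "b", "x"), ("b", "c", "x"), ("a", "c", "z"), ("d", "c", "x")]
print(arc_f_measure(system, gold))
```

An arc with the correct end points but a wrong label, such as ("a", "c", "y") above, counts as an error for both precision and recall.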

Flow Graph Estimation
As the first evaluation, we compared the simple application of the MST parser and our extension. We assumed the gold WS and CI results.

Settings of Flow Graph Estimation
We compared two methods in the following way.
Baseline: A naive method for text parsing is a simple application of an MST parser (McDonald et al., 2005) to a concept sequence input. An MST parser takes a sequence of words annotated with POSs and outputs a labeled tree connecting all the words. Thus our baseline for flow graph estimation takes a sequence of concepts (pairs of a word sequence and a concept type) as the input. The output of an MST parser is a tree, not a DAG, so we add our arc addition module for a fair comparison. As the implementation, we modified a Japanese dependency parser (Flannery et al., 2012) that uses logistic regression as the scoring function.
Proposed: This combines the spanning tree estimation (Subsection 7.1) and the arc addition (Subsection 7.2) in a cascaded manner. Different from the Baseline, this method refers to words not included in concepts, such as function words, as features.
Table 6 shows the accuracies of the baseline method and our MST extension. We can see that there is a significant difference in accuracy between Baseline and Proposed. The major difference between these two methods is whether or not they refer to the words not covered by the concepts in the original texts, as in F5 for example. These words are mainly function words. Therefore we can say that function words are, as we know intuitively, important for the relationships among the concepts.

Text Parsing on a Raw Text
We also executed the text parsing taking a raw text as the input. For this problem, we combine WS, CI, and flow graph estimation (Proposed) in the cascaded manner.
The performance measure is the F-measure. Different from the first experiment, the unit is a triplet $(\langle w_s, c_s \rangle, \langle w_e, c_e \rangle, l)$. Here, $w_s$ and $c_s$ are the word sequence of the out-going vertex of the arc and its concept type, respectively; $w_e$ and $c_e$ are those of its in-coming vertex; and $l$ is its label. A triplet is correct if and only if all these elements match those of an arc in the manually annotated data.

Settings of Word Segmentation and Concept Identification
We built an automatic word segmenter and an automatic concept identifier in the following way.

WS: The word segmenter (see Section 5) is trained on the following corpora.

1. Balanced Corpus of Contemporary Written Japanese (Maekawa, 2008), containing 53,899 fully segmented sentences from newspaper articles, books, magazines, whitepapers, Web logs, and Web QAs.
2. The partially segmented sentences derived from 208 recipes in the r-FG corpus and an additional 208 recipes annotated with concept types. In the experiment, we excluded the test part in 10-fold cross validation. Thus we built 10 models in total.
3. 1,651 partially annotated sentences crawled from another recipe Web site.

CI: The concept identifier (see Section 6) is trained, in the same way, on corpus 2 used in the WS training. Thus we built 10 models in total for 10-fold cross validation.

Table 7 shows the accuracies of WS, CI, and flow graph estimation on the gold results of the preceding task, and that of the combination of the three tasks starting from a raw text and ending with a flow graph.

Results
As we see in the table, the flow graph estimation task is the most difficult and has much room for improvement. The accuracy of WS without adaptation was about 95%, and our adaptation technique raised it to about 99%, which is as high as in the general domain case. The accuracy of CI, trained on fewer than 3,000 sentences, is as high as that of general NER, whose accuracy is about 90% with a model trained on about 10,000 sentences. This is still lower than the WS accuracy, so the concept identifier is also a target of improvement.
From Table 7, the accuracy of the cascaded combination of the three tasks (WS + CI + flow graph estimation) is 51.6. This value is lower than the result of the simple multiplication that assumes independence among the tasks, $57.2 = (0.986^{1.27} \times 0.907)^2 \times 0.721 \times 100$, where 1.27 is the average word length of the concepts. This indicates that it is worth investigating joint methods for WS, CI, and flow graph estimation.
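The independence estimate can be checked with a few lines of arithmetic. The interpretation in the comments is one possible reading rather than a statement from the text: each arc has two end vertices, and each vertex is assumed to be correct only if all of its words (1.27 on average) are segmented correctly and its concept is identified correctly. The variable names are hypothetical.

```python
ws_accuracy = 0.986      # word segmentation
ci_accuracy = 0.907      # concept identification
graph_accuracy = 0.721   # flow graph estimation on gold input
avg_concept_length = 1.27  # average number of words per concept

# One vertex: its words are segmented correctly and its concept is identified.
vertex_correct = ws_accuracy ** avg_concept_length * ci_accuracy

# One arc: both end vertices are correct and the arc itself is estimated correctly.
independent_estimate = vertex_correct ** 2 * graph_accuracy * 100

print(round(independent_estimate, 1))  # 57.2
```

The observed value of 51.6 is below this estimate, which suggests that errors in the earlier steps hurt the later steps more than independence would predict.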

Conclusion
In this paper, we proposed a framework of procedural text understanding consisting of three processes. The first process is the well-known word segmentation. The second one is concept identification, which can be solved in the same way as named entity recognition with a different definition of named entities. The third one is our original text parsing, which estimates a flow graph taking a text and the concepts in it as the input.
We tested our framework on recipe texts manually annotated with a flow graph. The results showed that our method outperforms a naive application of an MST dependency parser. Thus we can say that the simple application of dependency parsing to flow graph estimation does not work well, and that it is important to focus on not only concepts but also words surrounding them. Finally we evaluated the full automatic process of building a flow graph from a raw text. Our result can be a solid baseline for future improvement in the procedural text understanding problem.
Our method is applicable to various procedural texts, allowing us to realize a more intelligent search engine for how-to texts, more sophisticated symbol grounding by combining NLP and CV, etc.