Analogous Process Structure Induction for Sub-event Sequence Prediction

Computational and cognitive studies of event understanding suggest that identifying, comprehending, and predicting events depend on having structured representations of a sequence of events and on conceptualizing (abstracting) its components into (soft) event categories. Thus, knowledge about a known process such as "buying a car" can be used in the context of a new but analogous process such as "buying a house". Nevertheless, most event understanding work in NLP is still at the ground level and does not consider abstraction. In this paper, we propose an Analogous Process Structure Induction (APSI) framework, which leverages analogies among processes and conceptualization of sub-event instances to predict the whole sub-event sequence of previously unseen open-domain processes. As our experiments and analysis indicate, APSI supports the generation of meaningful sub-event sequences for unseen processes and can help predict missing events.


Introduction
Understanding events has long been a challenging task in NLP, to which many efforts have been devoted by the community. However, most existing work focuses on procedural (or horizontal) event prediction tasks. Examples include predicting the next event given an observed event sequence (Radinsky et al., 2012) and identifying the effect of a biological process (i.e., a sequence of events) on involved entities (Berant et al., 2014). These tasks mostly focus on predicting related events in a procedure based on their statistical correlations in previously observed text. As a result, understanding the meaning of an event might not be crucial for these horizontal tasks. For example, simply selecting the most frequently co-occurring event can offer acceptable performance on the event prediction task (Granroth-Wilding and Clark, 2016).
Computational and cognitive studies (Schank and Abelson, 1977; Zacks and Tversky, 2001) suggest that inducing and utilizing the hierarchical structure of events is a crucial component of how humans understand new events and can help many aforementioned horizontal event prediction tasks. Consider the example in Figure 1. Assume that one has never bought a house, but is familiar with how to "buy a car" and "rent a house"; referring to analogous steps in these two relevant processes would still provide guidance for the target process of "buy a house". Motivated by this hypothesis, our work proposes to directly evaluate a model's event understanding ability. We define this as the ability to identify vertical relations, that is, to predict the sub-event sequence of a new process. We require models to generate the sub-event sequence for a previously unobserved process given observed processes along with their sub-event sequences, which we refer to as "the observed process graphs" in the rest of this paper. This task is more challenging than "conventional" event prediction tasks, since it requires the generation of a sub-event sequence given a new, previously unobserved, process definition.

Figure 2: Demonstration of the proposed APSI framework. Given a target process P, we first decompose its semantics into two dimensions (i.e., predicate and argument) by grouping processes that share a predicate or an argument. For each such group of processes, we then leverage the observed process graphs G to generate an abstract and probabilistic representation for their sub-event sequences. In the last step, we merge them with an instantiation module to produce the sub-event sequence of P.
To address this problem, we propose an Analogous Process Structure Induction (APSI) framework. Given a new process definition (e.g., "buy a house"), we first decompose it into two dimensions: predicate and argument. For each of these, we collect a group of processes that share the same predicate (i.e., "buy-ARG") or the same argument (i.e., "PRE-house"), and then induce an abstract and probabilistic sub-event representation for each group. Our underlying assumption is that processes that share the same predicate or argument could be analogous to each other, and thus could share similar sub-event structures. Finally, we merge these two abstract representations, using an instantiation module, to predict the sub-event structure of the target process. By doing so, we only need a small number of analogous processes (as we show, 20 on average) to generate unseen sub-events for the target process. Intrinsic and extrinsic evaluations show that APSI outperforms all baseline methods and can generate meaningful sub-event sequences for unseen processes, which prove helpful for predicting missing events.
The rest of the paper is organized as follows. Section 2 introduces the Analogous Process Structure Induction (APSI) framework. Section 3 describes our intrinsic and extrinsic evaluations, demonstrating the effectiveness of APSI and the quality of the induced process knowledge. We discuss related work in Section 4 and conclude in Section 5.

The APSI Framework
Figure 2 illustrates the details of the proposed APSI framework. Given an unseen process P, a target sub-event sequence length k, and a set of observed process graphs G, the task is to predict a k-step sub-event sequence [E_1, E_2, ..., E_k] for P. Each process graph G ∈ G in the input contains a process definition P_G and an n-step temporally ordered sub-event sequence [E_1^G, E_2^G, ..., E_n^G]. We assume that each process P is described as a combination of a predicate and an argument (e.g., "buy + house") and that each sub-event E is given as a verb-centric dependency graph, as used in Zhang et al. (2020b) (see examples in Figure 3). In APSI, we decompose the target process into two dimensions (i.e., predicate and argument). For each target process, we collect a group of observed process graphs that share either the predicate or the argument with the target process; we assume that the processes in these groups carry sufficient information for predicting the structure of the target process. We then leverage an event conceptualization module to induce an abstract representation of each process group. Finally, we merge the two abstract, probabilistic representations and instantiate them to generate a ground sub-event sequence as the final prediction. Detailed descriptions of the APSI components follow.

Semantic Decomposition
Each process definition P is given as a predicate and its argument, which we term below the two "dimensions" of the process definition.We then collect all process graphs in G that have the same predicate as P into G p and those that have the same argument into G a .We assume that these two sets provide the information needed to generate an abstract process representation that would guide the instantiation of the event steps for P .
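As a minimal sketch of this grouping step, the decomposition can be implemented as a simple partition over process definitions (the process names and the `decompose` helper below are illustrative, not the paper's code):

```python
def decompose(target, graphs):
    """Split observed process graphs by shared predicate or shared argument.

    `target` is a (predicate, argument) pair such as ("buy", "house");
    `graphs` maps (predicate, argument) definitions to sub-event lists.
    """
    pred, arg = target
    # G_p: processes sharing the predicate; G_a: processes sharing the argument.
    g_p = {p: seq for p, seq in graphs.items() if p != target and p[0] == pred}
    g_a = {p: seq for p, seq in graphs.items() if p != target and p[1] == arg}
    return g_p, g_a

graphs = {
    ("buy", "car"): ["research car", "test drive car", "pay money"],
    ("rent", "house"): ["search house", "inspect house", "sign lease"],
    ("buy", "food"): ["go to store", "pick food", "pay money"],
}
g_p, g_a = decompose(("buy", "house"), graphs)
# g_p holds the "buy" processes; g_a holds the "house" processes.
```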

Semantic Abstraction
The goal of the semantic abstraction step is to acquire abstract representations S_p and S_a for G_p and G_a, respectively, to help transfer knowledge from the grounded observed processes to the new target process. To do so, we first need to conceptualize the observed sub-events in G_p and G_a (e.g., "eat an apple") to a more abstract level (e.g., "eat fruit"). Clearly, each event could be conceptualized to multiple abstract events. For example, "eat an apple" can be conceptualized to "eat fruit" but also to "eat food", and the challenge is to determine the appropriate level of abstraction. On the one hand, the conceptualized event cannot be too general, as we do not want to lose touch with the original event; on the other hand, if it is too specific, we will not aggregate enough sub-event instances into it and will thus have difficulty transferring knowledge to the new unseen process. To automatically balance these conflicting requirements and select the best abstract event for each observed sub-event, we model the problem as a weighted mutually exclusive set cover problem (Lu and Lu, 2014) and propose an efficient algorithm, described below, to solve it. We then merge the repeated conceptualized events and determine their relative positions.

Modeling Event Conceptualization
For each event E, we first identify all potential events that it can be conceptualized to. If two sub-events E_1 and E_2 can be conceptualized to the same event C, we place E_1 and E_2 into the set E_C. To qualitatively guide the abstraction process, we introduce below a notion of semantic loss that we incur as we move up to more abstract representations. To measure the semantic loss during the conceptualization, we assign a weight to each set:

W(E_C) = Σ_{E ∈ E_C} (1 − F(E, C)),    (1)

where F(E, C) is a scoring function, defined below in Eq. 2, that captures the amount of "semantic details" preserved when abstracting from E to C. With this definition, the event conceptualization problem can be formalized as finding mutually exclusive sets (such as E_C) that cover all observed events with minimum total weight. In the rest of this section, we first introduce how to collect potential conceptualized events for each E, how we define F, and how we solve this discrete optimization problem.
Identifying Potential Conceptualizations Assume that sub-event E contains m words w_1^E, w_2^E, ..., w_m^E, each corresponding to a node in Figure 3; for each of these, we can retrieve a list of hypernym paths from WordNet (Miller, 1998). For example, given the word "house", WordNet returns two hypernym paths: (1) "house"→"building"→"structure"→...; (2) "house"→"firm"→"business"→.... As a result, we can find Π_{w ∈ E} L(w) potential conceptualized events for E, where L(w) is the number of w's hypernyms. We denote the potential conceptualized event set for E as C_E and the overall set as C.
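To illustrate the combinatorics, the sketch below enumerates candidate conceptualized events from per-word hypernym lists; the `HYPERNYMS` table is a toy stand-in for WordNet chains, not real lexicon data:

```python
from itertools import product

# Toy hypernym lists standing in for WordNet paths; illustrative only.
HYPERNYMS = {
    "buy": ["acquire", "get"],
    "house": ["building", "structure"],
}

def candidate_conceptualizations(event):
    """Enumerate abstract versions of an event (a tuple of words).

    Each word is replaced by one of its L(w) hypernyms, so an event with
    words w_1..w_m yields prod_w L(w) candidate conceptualized events.
    """
    choices = [HYPERNYMS[w] for w in event]
    return [tuple(c) for c in product(*choices)]

cands = candidate_conceptualizations(("buy", "house"))
# 2 hypernyms per word -> 2 * 2 = 4 candidates
```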
Algorithm 1: Event Conceptualization. INPUT: a set of events E, where each E ∈ E is associated with a set of potential conceptualization events C_E, and the overall conceptualized event set C.
Conceptualization Scoring As mentioned above, for each pair of a sub-event E and its potential conceptualization C, we propose a scoring function F(E, C) to measure how much "semantic information" is preserved after the conceptualization. Motivated by Budanitsky and Hirst (2006) and based on the assumption that the more abstract the conceptualized event is, the more semantic details are lost, we define F(E, C) to be:

F(E, C) = Π_{i=1}^{m} w^{D(w_i^E, w_i^C)},    (2)

where D(w_i^E, w_i^C) is the depth from w_i^E to w_i^C on the taxonomy path, and w is a hyper-parameter (w_v for verbs and w_n for nouns) measuring how much "semantics" is preserved following each step of the conceptualization.

Conceptualization Assignment Now we are able to model the procedure of finding proper conceptualized events as a weighted mutually exclusive set cover problem. Note that this is an NP-complete problem, and obtaining the optimal solution requires a prohibitive computational cost (Karp, 1972). To obtain an efficient solution that is empirically sufficient for assigning conceptualized events with a reasonable number of instances, we develop the greedy procedure described in Algorithm 1. For each retrieved process graph set G_p or G_a, we collect all its sub-events as E and use it as the input for the conceptualization algorithm. In each iteration, we first compute the conceptualization score F for all (E, C) pairs and then compute the weight score for all conceptualization sets E_C. After selecting the set with minimum weight, E_{C_min}, we remove all the events covered by it from E and repeat the process until no event is left. After the conceptualization, we merge sub-events that are conceptualized to the same event and represent them with the resulting conceptualized event C, whose weight is defined to be W(E_C). Compared with the naive algorithm, which first expands all possible subsets (i.e., it includes all subsets of E_C for all C) and then leverages a sort-and-filter technique to select the final subsets, we greatly reduce the time complexity, since the number of conceptualized events n is typically much smaller than |E|.
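A minimal sketch of the scoring function and the greedy cover loop, under the forms of Eqs. 1-2 as reconstructed above; the toy depth table and event names are illustrative assumptions, not the paper's data:

```python
# Toy depth table: hypernym steps from a word to its abstraction.
DEPTH = {
    ("eat", "eat"): 0,
    ("apple", "fruit"): 1, ("pear", "fruit"): 1,
    ("apple", "food"): 2, ("pear", "food"): 2, ("rice", "food"): 1,
}

def F(event, concept, w=0.5):
    """Semantic preservation F(E, C): product of w**D over aligned words."""
    score = 1.0
    for we, wc in zip(event, concept):
        score *= w ** DEPTH[(we, wc)]
    return score

def greedy_conceptualize(events, candidates):
    """Greedy weighted mutually exclusive set cover (Algorithm 1 sketch).

    `candidates` maps each concept C to the event set E_C it can cover.
    Each round picks the set with minimum W(E_C) = sum(1 - F(E, C)) over
    the still-uncovered events, until every event is covered.
    """
    remaining = set(events)
    chosen = {}
    while remaining:
        best = None
        for concept, e_c in candidates.items():
            covered = e_c & remaining
            if not covered:
                continue
            weight = sum(1.0 - F(e, concept) for e in covered)
            if best is None or weight < best[1]:
                best = (concept, weight, covered)
        concept, _, covered = best
        chosen[concept] = covered
        remaining -= covered
    return chosen

events = [("eat", "apple"), ("eat", "pear"), ("eat", "rice")]
candidates = {
    ("eat", "fruit"): {("eat", "apple"), ("eat", "pear")},
    ("eat", "food"): set(events),
}
assignment = greedy_conceptualize(events, candidates)
# "eat fruit" wins the first round (lower weight), "eat food" covers the rest.
```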

Conceptualized Event Ordering
After conceptualizing and merging all sub-events, we need to determine their loose temporal order (e.g., whether they typically appear at the beginning or the end of the observed sub-event sequences). Let the set of selected conceptualized events be C*. For each C ∈ C*, we define its order score T(C), indicating how likely C is to appear first, as:

T(C) = Σ_{C' ∈ C*, C' ≠ C} θ( t(E_C, E_{C'}) − t(E_{C'}, E_C) ),    (3)

where θ is the unit step function and t(E_C, E_{C'}) represents how many times an event of E_C appears before an event of E_{C'} in an observed process graph.
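A sketch of the order-score computation under the formula as reconstructed above; the concept names and pairwise counts below are illustrative:

```python
def order_scores(concepts, before):
    """Order score T(C): number of rival concepts that C tends to precede.

    `before[(a, b)]` plays the role of t(E_a, E_b): how often events of a
    appear before events of b in the observed graphs; theta is the unit
    step function.
    """
    theta = lambda x: 1.0 if x > 0 else 0.0
    return {
        c: sum(theta(before.get((c, o), 0) - before.get((o, c), 0))
               for o in concepts if o != c)
        for c in concepts
    }

concepts = ["search house", "inspect house", "sign contract"]
before = {
    ("search house", "inspect house"): 5, ("inspect house", "search house"): 1,
    ("search house", "sign contract"): 6,
    ("inspect house", "sign contract"): 4, ("sign contract", "inspect house"): 1,
}
T = order_scores(concepts, before)
# "search house" usually comes first, so it receives the highest score.
```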

Sub-event Sequence Prediction
In the last step, we leverage the two abstract representations obtained for the predicate and argument of the target process definition to predict its final sub-events. To do so, we propose the following instantiation procedure. We are given the abstract representations S_p and S_a, for the predicate and argument, respectively. Each is a set of conceptualized events associated with weights and order scores. For each conceptualized event C_p ∈ S_p, using each event C_a ∈ S_a, we can generate a new instantiated event Ĉ_p. For example, if C_p is "cut fruit" and C_a is "buy an apple", then our model would create the new event "cut an apple". Specifically, for each w ∈ C_p, if we can find a word ŵ ∈ C_a such that ŵ is a hyponym of w, we replace w with ŵ, and we repeat this process until no hyponym can be detected in C_p. We denote the generated event by Ĉ_p. To account for the semantic loss during the instantiation procedure, we define the weight and order score of Ĉ_p as

W(Ĉ_p) = W(C_p) · F(Ĉ_p, C_p),  T(Ĉ_p) = T(C_p).    (4)

Similarly, we apply the same procedure to C_a with C_p, and denote the resulting event Ĉ_a. We then repeatedly merge instantiated events by summing up their weights and averaging their order scores.
In the end, we select the top k sub-events based on the weights and sort them based on the order scores to form the sub-event sequence prediction.
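The instantiation step and the final selection can be sketched as follows; the toy hyponym table is an illustrative stand-in for WordNet, and the weight bookkeeping is simplified:

```python
# Toy hyponym table standing in for WordNet; illustrative only.
HYPONYMS = {"fruit": {"apple", "pear"}, "food": {"apple", "rice"}}

def instantiate(c_p, c_a):
    """Replace each word of the abstract event c_p with a hyponym found in c_a."""
    words_a = set(c_a.split())
    out = []
    for w in c_p.split():
        repl = next((h for h in words_a if h in HYPONYMS.get(w, ())), None)
        out.append(repl if repl is not None else w)
    return " ".join(out)

def predict_sequence(scored_events, k):
    """Select the top-k events by weight, then order them by order score."""
    top = sorted(scored_events, key=lambda e: e["weight"], reverse=True)[:k]
    top.sort(key=lambda e: e["order"], reverse=True)
    return [e["event"] for e in top]

step = instantiate("cut fruit", "buy an apple")  # "fruit" -> "apple"
ranked = predict_sequence(
    [{"event": "search house", "weight": 0.9, "order": 2.0},
     {"event": "sign contract", "weight": 0.8, "order": 0.0},
     {"event": "inspect house", "weight": 0.6, "order": 1.0},
     {"event": "paint fence", "weight": 0.1, "order": 3.0}],
    k=3,
)
```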

Evaluation
In this section, we conduct intrinsic and extrinsic evaluations to show that APSI can generate meaningful sub-event sequences for unseen processes, which can help predict the missing events.

Dataset
We collect process graphs from the WikiHow website (Koupaee and Wang, 2018). In WikiHow, each process is associated with a sequence of temporally ordered human-created steps. For each step, as shown in Figure 3, we use the tool released with ASER (Zhang et al., 2020b) to extract events and construct the process graphs. We select all processes where each step contains one and only one event, and randomly split them into train and test data. As a result, we obtained 13,501 training process graphs and 1,316 test process graphs, whose average sub-event sequence length is 3.56.

Baseline Methods
We compare with the following baseline methods. Sequence to sequence (Seq2seq): One intuitive solution to the sub-event sequence prediction task is to model it as a sequence-to-sequence problem, where the process is treated as the input and the sub-event sequence as the output. Here we adopt the standard GRU-based encoder-decoder framework (Sutskever et al., 2014) as the base framework and change the generation unit from words to events. For each process or sub-event, we leverage pre-trained word embeddings (i.e., GloVe-6B-300d (Pennington et al., 2014)) or language models (i.e., RoBERTa-base (Liu et al., 2019)) as the representation, denoted as Seq2seq (GloVe) and Seq2seq (RoBERTa).
Top One Similar Process: Another baseline is the "top one similar process". For each new process, we find the most similar observed process and use its sub-event sequence as the prediction. We employ different methods (i.e., the token-level Jaccard coefficient, or the cosine similarity of GloVe/RoBERTa process representations) to measure the process similarity, denoted as Top one similar process (Jaccard), (GloVe), and (RoBERTa), respectively.
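The Jaccard variant of this baseline reduces to a few lines (the process names below are illustrative):

```python
def jaccard(a, b):
    """Token-level Jaccard coefficient between two process names."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def top_one_similar(target, graphs):
    """Copy the sub-event sequence of the most similar observed process."""
    best = max(graphs, key=lambda name: jaccard(target, name))
    return graphs[best]

graphs = {
    "rent house": ["search house", "inspect house", "sign lease"],
    "buy car": ["research car", "test drive car", "pay money"],
}
pred = top_one_similar("rent big house", graphs)
# "rent house" overlaps on two of three tokens, so its sequence is copied.
```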
For each process, we also present a randomly generated sequence and a human-generated sequence as the lower bound and upper bound for sub-event sequence prediction models.

Intrinsic Evaluation
We first present the intrinsic evaluation to show the quality of the predicted sub-event sequences for unseen processes. For each test process, we provide the process name and the sub-event sequence length to the evaluated systems and ask them to generate a fixed-length sub-event sequence.

Evaluation Metric
Motivated by the ROUGE score (Lin, 2004), we propose an event-based ROUGE (E-ROUGE) to evaluate the quality of the predicted sub-event sequence. Specifically, similar to ROUGE, which evaluates generation quality based on n-gram token occurrence, we evaluate what percentage of the sub-events and time-ordered sub-event pairs in the induced sequence are covered by the human-provided references. We denote the evaluation over single events and event pairs as E-ROUGE1 and E-ROUGE2, respectively. We also provide two covering standards to better understand the prediction quality: (1) "String Match": all words in the predicted event/pair must be the same as in the referent event/pair; (2) "Hypernym Allowed": the predicted and referent events must have the same dependency structure, and the words in the same graph position must be hypernyms of, or identical to, each other. For example, if the referent event is "eat apple" and the predicted event is "eat fruit", we still count it as a match. The "String Match" setting is stricter, but the "Hypernym Allowed" setting also has its unique value in helping understand whether our system predicts relevant sub-events.

Table 1: Intrinsic evaluation results of the induced process structures. On average, we have 1.7 human-generated sub-event sequences as the references for each test process. Best performing models are marked in bold.
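Under the "String Match" standard, E-ROUGE1/E-ROUGE2 can be sketched as the fraction of predicted events and ordered event pairs found in any reference (a simplified sketch; the event strings below are illustrative):

```python
def e_rouge(predicted, references):
    """E-ROUGE1 and E-ROUGE2 under exact event matching (a sketch).

    `predicted` is a list of events; `references` is a list of
    human-provided event sequences.
    """
    ref_events = {e for ref in references for e in ref}
    ref_pairs = {(r[i], r[j]) for r in references
                 for i in range(len(r)) for j in range(i + 1, len(r))}
    # E-ROUGE1: single-event coverage.
    e1 = sum(e in ref_events for e in predicted) / len(predicted)
    # E-ROUGE2: coverage of time-ordered event pairs.
    pred_pairs = [(predicted[i], predicted[j])
                  for i in range(len(predicted))
                  for j in range(i + 1, len(predicted))]
    e2 = (sum(p in ref_pairs for p in pred_pairs) / len(pred_pairs)
          if pred_pairs else 0.0)
    return e1, e2

e1, e2 = e_rouge(["search house", "inspect house", "paint fence"],
                 [["search house", "view house", "inspect house"]])
```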

Implementation Details
In terms of training, we set both w_v and w_n to 0.5 for our model. For the seq2seq baselines, we set the learning rate to 0.001 and train the models until they converge on the training data. All other hyper-parameters follow the original paper. In terms of evaluation, we also provide two settings: (1) Basic: following previous work (Glavas et al., 2014), we predict and evaluate events based on verbs only; (2) Advanced: we predict and evaluate events based on all words.

Result Analysis
We show the results in Table 1. In general, there is still a notable gap between current models' performance and human performance, but the proposed APSI framework can indeed generate sufficiently relevant sub-events. Consider, for example, the case where we only evaluate the verb. Even in the string match setting, 14.8% of the predicted events and 6.6% of the ordered event pairs are covered by the references, which is much better than random guessing and nearly half of human performance. If hypernyms are allowed, 36% of the predicted events and 19% of the event pairs are covered. If we instead take all words in the event into consideration, the task becomes more challenging. Specifically, even humans only achieve 11.63 E-ROUGE1 and 5.59 E-ROUGE2, which suggests that the low scores achieved by current models are probably due in part to the limitations of the current dataset (e.g., on average, we only have 1.7 references for each test process). If more references were provided, the performance of all models would also increase. In the rest of the intrinsic evaluation, we present a more detailed analysis based on the advanced setting (string match) and a case study to help better understand the performance of APSI.

Effect of the Instantiation Module
One key step in our framework is how to leverage the two abstract representations to predict the final sub-event sequence. In APSI, we propose an instantiation module, which jointly leverages the two representations to generate detailed events. To show its effect, we compare it with two other options: (1) Simple Merge: merge the two representations and select the top k sub-events based on the weight; (2) Normalized: first normalize the weights of all sub-events within each representation and then select the top k sub-events.

Figure 4: Hyper-parameter influence on the quality of APSI-generated sub-event sequences. For both w_v and w_n, 0 indicates no conceptualization, and the larger the value, the deeper the conceptualization. Best performing ranges are marked with red boxes, which indicate that a suitable conceptualization level is the key to APSI's success.
From the results in Table 2, we can see that, due to the imbalanced distribution of the two representations, simply choosing the most heavily weighted sub-events is problematic. On average, we can collect 18.04 processes for each predicate, but only 1.92 processes for each argument. As a result, the sub-events in the predicate representation typically have larger weights, so if we simply merge the representations, most of the predicted sub-events will come from the predicate representation. Ideally, the "Normalized" method could eliminate the influence of such imbalance, but it also amplifies the noise and achieves worse empirical performance. In contrast, the proposed instantiation module uses events in one representation as the reference to help instantiate the events in the other. As a result, we jointly use the two representations to generate a group of detailed events, from which we select the top k generated new events. By doing so, we not only obtain detailed events from the abstract representations but also avoid the imbalanced distribution issue.

Hyper-parameter Analysis
In APSI, we use two hyper-parameters, w_v and w_n, to control the conceptualization and instantiation depth over verbs and nouns, respectively. A value of 0 means no conceptualization, and larger values encourage deeper conceptualization. We show the performance of APSI with different hyper-parameter combinations in Figure 4, from which we can see that a suitable level of conceptualization is the key to the success of APSI. If no conceptualization is allowed, all predicted events are restricted to the observed sub-events; thus we cannot predict "search house" after seeing "search car" and some events about the house. On the other hand, if we do not restrict the depth of conceptualization, all the sub-events will be conceptualized to events that are too general. As a result, even with the instantiation module, we could not predict the detailed sub-events we want.

Case Study
Figure 5 shows an example that we use to analyze the current limitations of APSI. We can see that APSI successfully predicts events like "identify symptoms", but fails to predict the event "identify causes". Instead, it predicts "take supplements". This is because APSI learns to predict such sequences from other processes in the observed process graphs, such as "treat diarrhea" or treating other diseases. Treating those diseases typically does not involve identifying the cause, which is not the case for treating pain. Moreover, treating diseases often involves taking medicines, which can be conceptualized to "take supplements". As no events about pain help instantiate "supplements", APSI simply predicts it.

Extrinsic Evaluation
As discussed by Rumelhart (1975), knowledge about processes and sub-events can help in understanding event sequences. Thus, in this section, we investigate whether the induced process knowledge can help predict missing events. Given a sub-event sequence, for each event in the sequence, we use the rest of the sequence as the context and ask models to select the correct event against one negative example. To make the task challenging, instead of random sampling, we follow Zellers et al. (2019) and select similar but wrong negative candidates based on their representation similarity (i.e., BERT (Devlin et al., 2019)). We use the same training and test split as in the intrinsic experiment; as a result, we obtained 13,501 training sequences and 7,148 test questions.
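This similarity-based negative selection can be sketched with plain cosine similarity over event vectors; the toy vectors and candidate names below are illustrative (the paper uses BERT representations):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def hardest_negative(gold_vec, candidates):
    """Pick the candidate event most similar to the gold event."""
    return max(candidates, key=lambda name: cosine(gold_vec, candidates[name]))

gold = [0.9, 0.1, 0.0]
candidates = {
    "pay deposit": [0.8, 0.2, 0.1],   # close to the gold event
    "walk the dog": [0.0, 0.1, 0.9],  # unrelated
}
neg = hardest_negative(gold, candidates)
```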
The baseline method we compare with is an event-based masked language model, illustrated in Figure 6. We use pre-trained RoBERTa-base (Liu et al., 2019) to initialize the tokenizer and transformer layers, and all sequences of the training processes as the training data. To show the value of understanding the relationship between a process and its sub-event sequence, for each sub-event sequence in the test data, we first leverage the process name and the different structure prediction methods to predict sub-event sequences, and then use them as additional context to help the event masked LM predict the missing event. To show the upper bound of the effect of adding process knowledge, we also tried adding the process structure provided by human beings as the context, which is denoted as '+Human'. All models are evaluated based on accuracy.
From the results in Table 3, we can make the following observations. First, adding high-quality process knowledge (i.e., APSI and Human) significantly helps the baseline model, which indicates that knowledge about the process can help models better understand the event sequence. Second, the effect of process knowledge is positively correlated with its quality, as shown in Table 1. Adding a low-quality process structure may hurt the performance of the baseline model due to the extra noise it introduces. Third, the current way of using process knowledge is still very simple, and there is room for better usage; as the research focus of this paper is predicting process structure rather than applying it, we leave that for future work.
Related Work

Considering the importance of events in understanding human language (e.g., commonsense knowledge (Zhang et al., 2020a)), many efforts have been devoted to defining, representing, and understanding events. For example, VerbNet (Schuler, 2005) created a verb lexicon to represent the semantic relations among verbs. Later, FrameNet (Baker et al., 1998) proposed to represent event semantics with schemas, each of which has one predicate and several arguments. Apart from the structure of events, understanding events by predicting relations among them has also become a popular research topic (e.g., TimeBank (Pustejovsky et al., 2003) for temporal relations and Event2Mind (Rashkin et al., 2018) for causal relations). Different from these horizontal relations between events, in this paper, we propose to understand events vertically by treating each event as a process and trying to understand what happens (i.e., the sub-events) inside the target event. Such knowledge is also referred to as event schemata (Zacks and Tversky, 2001) and has been shown to be crucial for how humans understand events (Abbott et al., 1985). One line of related work in the NLP community extracts super-sub event relations from textual corpora (Hovy et al., 2013; Glavas et al., 2014). The difference between that work and ours is that we try to understand events by directly generating sub-event sequences rather than extracting such information from text. Another line of related work is narrative schema prediction (Chambers and Jurafsky, 2008), which also holds the assumption that event schemata can help understand events. However, its research focus is using the overall process implicitly to help predict future events, while this work tries to understand events by knowing the relation between processes and their sub-event sequences explicitly.

Conclusion
In this paper, we try to understand events vertically by viewing them as processes and predicting their sub-event sequences. Our APSI framework is motivated by the notion of analogous processes and attempts to transfer knowledge from (a very small number of) familiar processes to a new one. The intrinsic evaluation demonstrates the effectiveness of APSI and the quality of the predicted sub-event sequences. Moreover, the extrinsic evaluation shows that, even with a naive application method, the process knowledge can help better predict missing events.

Figure 1: An illustration of leveraging known processes to predict the sub-event sequence of a new process.

Figure 5: Case study. We mark the covered and not covered predictions with green and red colors, respectively.

Figure 6: Demonstration of the event masked LM. Pre-trained language models are trained to predict the masked event given other events as the context.

Table 2: Performance of different merging methods.

Table 3: Results on the event prediction task. † and ‡ indicate statistical significance over the baseline with p-values smaller than 0.01 and 0.001, respectively.