What Are You Trying to Do? Semantic Typing of Event Processes

This paper studies a new cognitively motivated semantic typing task, multi-axis event process typing, that, given an event process, attempts to infer free-form type labels describing (i) the type of action made by the process and (ii) the type of object the process seeks to affect. This task is inspired by computational and cognitive studies of event understanding, which suggest that understanding processes of events is often directed by recognizing the goals, plans or intentions of the protagonist(s). We develop a large dataset containing over 60k event processes, featuring ultra fine-grained typing on both the action and object type axes with very large (10^3 to 10^4) label vocabularies. We then propose a hybrid learning framework, P2GT, which addresses the challenging typing problem with indirect supervision from glosses and a joint learning-to-rank framework. As our experiments indicate, P2GT supports identifying the intent of processes, as well as the fine semantic type of the affected object. It also demonstrates the capability of handling few-shot cases, and strong generalizability on out-of-domain processes.


Introduction
Events are the fundamental building blocks of natural languages. To help machines understand events, extensive research effort has been devoted to inducing how events described in text are procedurally connected (Ning et al., 2017; Radinsky et al., 2012), and how they form event processes (Pichotta and Mooney, 2014; Berant et al., 2014; Jindal and Roth, 2013). Consequently, such prototypical schematic sequences of events have found important use cases including storyline construction (Do et al., 2012; Radinsky and Horvitz, 2013), narrative cloze (Chaturvedi et al., 2017; Lee and Goldwasser, 2019), biological process comprehension (Berant et al., 2014) and diagnostic prediction (Zhang et al., 2020b).
Nonetheless, understanding an event process is not just about inducing temporal relations between events or inferring missing steps in an event sequence. As suggested by cognitive studies (Zacks et al., 2001; Zacks and Tversky, 2001; Kurby and Zacks, 2008), a process of events is defined more by the goals, plans, intentions, or traits of its performer than by physical characteristics. For example, the series of events digging a hole, putting in some seeds, filling with soil and watering the soil occurs in a specific sequence because these steps are directed towards the performer's central goal of planting a plant. Similarly, we can tell that making a dough, adding toppings, preheating the oven and baking the dough is likely a chain of actions aimed at cooking pizza. Indeed, the aforementioned studies show that humans understand a plausible event process by hypothesizing the objectives those co-occurring events aim for, or the ultimate consequence the process seeks to accomplish. Accordingly, we suggest that computational methods for event understanding would benefit from conceptualizing the intentions behind the processes. Moreover, inducing intentions is crucial to a rich understanding of text (Rashkin et al., 2018), and could potentially support other applications such as commonsense reasoning (Sap et al., 2019), summarization (Daumé III and Marcu, 2006), reading comprehension (Berant et al., 2014) and schema induction (Huang et al., 2016).
To understand the intentions of event processes, the first contribution of this paper is to propose a new semantic typing task. The event process typing task seeks to retrieve ultra fine-grained type information to summarize the goal and intention of the associated events. Specifically, each event process is typed along two axes: the action type that describes the type of action the process takes, and the object type that semantically types the object(s) that the process seeks to affect. Figure 1 shows several accordingly typed event processes. Motivated by recent works on entity typing (Choi et al., 2018; Zhou et al., 2018), our task employs large type vocabularies supporting diverse free-form semantic labels for both axes.
To facilitate related research, we developed a large dataset extracted from wikiHow, as the second contribution of this paper. This dataset contains over 60,000 processes of primitive events, and features fine-grained action and object type labels for each process. While the dataset aims at creating rich examples of event process intentions, it is also a challenging dataset from two perspectives. First, vocabularies on both type axes are remarkably diverse, giving over 1,000 action type labels and over 10,000 object type labels. Moreover, these fine-grained type labels occur quite sparsely: around 68% of action types and 88% of object types occur fewer than 10 times. This leads to a few-shot learning scenario and, in nearly half of the cases, a one-shot scenario. Second, the free-form type labels are generally external to the lexical content of the associated events appearing in a process. Hence, this typing task cannot be easily handled with an extractive method (Nenkova and McKeown, 2012).
While the task and dataset pose a non-trivial learning problem, the free-form type system allows for a practical form of indirect supervision based on gloss knowledge. As the third contribution, we propose a hybrid learning framework, P2GT (i.e., process-to-gloss based typing), to leverage such indirect supervision for event process typing. Instead of directly inferring the multi-axis type labels, we find it much easier to seize on the semantic relatedness between the process-gloss pair, as the gloss provides richer semantic information than the label itself. For few-shot cases, gloss definitions also represent useful side information to jump-start inducing labels that are rarely seen or completely unseen in training.
The proposed framework fine-tunes a pre-trained language model to capture the relatedness of an event process and the gloss of types with a ranking task objective. To incorporate more precise gloss information, the training process deploys a word sense disambiguation (WSD) module for both verbs and nouns. Joint learning for both action and object types is enforced to further complement scarce supervision signals. Based on extensive experimental evaluation, the proposed framework exhibits promising performance in inferring the fine-grained multi-axis type information. Specifically, it outperforms a strong RoBERTa-based baseline by a factor of 2.4-3.0 in recall@1. We also show that the incorporated gloss knowledge supports few-shot case prediction, and benefits our model's generalization to out-of-domain event processes.

Task and Dataset
We hereby formulate the task of multi-axis event process typing, and introduce the contributed dataset.

Task Definition
Following Chambers and Jurafsky (2008), we define a process as a sequence of primitive events $P = [e_1, e_2, \ldots, e_l]$ performed by one common protagonist (or performer). Since the protagonist is shared among the events, each event $e_i$ thereof contains a predicate $a_i$ mentioning an action performed by the protagonist, and an object $o_i$ describing the object(s) that the action is taken upon. The goal is to conceptualize the overall intention behind the process $P$ into two labels, i.e., $A$ from the verb vocabulary that describes the overall action of $P$, and $O$ from the noun vocabulary that describes what object(s) the process is most likely to affect. Such type inference is important to applications that require commonsense reasoning based on chains of activities, including event-based summarization, narrative prediction and open-domain QA.
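To make the task definition concrete, the following is a minimal sketch of how a typed event process could be held in memory. The class and field names (`Event`, `TypedProcess`, `predicate`, `obj`) are illustrative assumptions rather than part of the released dataset format; the example instance follows the planting example from the Introduction.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    """One primitive event e_i: a predicate a_i and the object o_i it acts upon."""
    predicate: str
    obj: str


@dataclass(frozen=True)
class TypedProcess:
    """A process P = [e_1, ..., e_l] with its two type labels (A, O)."""
    events: List[Event]
    action_type: str  # A, drawn from the verb vocabulary
    object_type: str  # O, drawn from the noun vocabulary


# Hypothetical instance mirroring the planting example in the text.
planting = TypedProcess(
    events=[
        Event("dig", "hole"),
        Event("put", "seed"),
        Event("fill", "soil"),
        Event("water", "soil"),
    ],
    action_type="plant",
    object_type="plant",
)
```

Note that both labels here are external to the event content: neither "plant" as a verb nor "plant" as a noun appears among the predicates or objects of the events.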

Dataset
We construct a large corpus of typed event processes based on wikiHow, an online wiki-style community containing a collection of professionally edited how-to guideline articles.
Construction. A set of articles is crawled from wikiHow, where each included article describes ordered steps of activities to complete a central goal (e.g., the article "How to book a flight" describes the steps necessary to complete an airline booking). Each described step of an article forms a standalone section, which provides an easy-to-consume format for obtaining event processes with clear intentions. We use AllenNLP (Gardner et al., 2018) to perform semantic role labeling (SRL) on the section titles of a goal-step article, and extract the VERB (predicate $a_i$) and ARG1 (object $o_i$) outputs from the section titles to form the corresponding sequence of (primitive) events. Note that some articles may contain multiple step sequences for the same goal, e.g., booking a flight can be separated into two alternatives, booking online or booking via phone call. In such cases, each alternative is extracted as a separate process. Moreover, we only preserve processes where every primitive event contains both VERB and ARG1. Any ARG0 is omitted, however, since all events in a process share the same protagonist.
To obtain the type information, we first run SRL on the clause after "how to" in the article titles, from which the VERB term is taken as the action type label. Then, on the ARG1 output of SRL, we fetch only the lemmatized head word based on dependency parsing and lemmatization (Bird and Loper, 2004). This typically gives us the singular noun that represents the object type, whereas other dependents including modifiers are dropped. Consider the clause in "How to make a birthday cake": after make is fetched with SRL, the head word cake is preserved from the ARG1 "a birthday cake", providing an adequately abstracted label for object typing while being consistent with the task definition.

Statistics
The above effort obtains 62,277 clean event processes, each of which is labeled with both action and object types. The lengths of the processes vary, and their distribution is plotted in Figure 2. While the dataset gives a rich variety of instances for processes and intentions, it features a challenging type system for several reasons:
• Diversity. The fine-grained type vocabularies consist of 1,336 action types and 10,441 object types. As shown in Figure 3, both sets of labels generally form long-tail distributions.
• Few-shot cases. 68.3% of action type labels and 88.2% of object type labels occur fewer than 10 times across all processes. This indicates extreme few-shot cases that are challenging for learning and inference.
• External labels. In around 91.2% of processes, the action type labels are different from the predicates of the associated events, while 84.2% of processes have object type labels that do not appear as event objects. Such generally external labels easily cause extractive or sequence-to-label prediction methods to fall short.
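The few-shot and external-label statistics above are simple counts over the label annotations. The sketch below shows one way they could be computed; the function names and the toy inputs are hypothetical and only illustrate the arithmetic, not the actual dataset.

```python
from collections import Counter


def few_shot_share(labels, threshold=10):
    """Fraction of distinct label types that occur fewer than `threshold` times."""
    counts = Counter(labels)
    rare = sum(1 for c in counts.values() if c < threshold)
    return rare / len(counts)


def external_share(processes):
    """Fraction of processes whose label does not appear among its event terms.

    `processes` is an iterable of (event_terms, label) pairs, where event_terms
    holds the predicates (for the action axis) or objects (for the object axis).
    """
    processes = list(processes)
    external = sum(1 for terms, label in processes if label not in terms)
    return external / len(processes)


# Toy data (invented for illustration): 3 label types, 2 of them rare.
toy_labels = ["make"] * 12 + ["plant"] * 3 + ["book"]
toy_share = few_shot_share(toy_labels)

toy_processes = [
    (["dig", "put", "fill", "water"], "plant"),   # label is external
    (["make", "add", "preheat", "bake"], "make"),  # label appears as a predicate
]
toy_external = external_share(toy_processes)
```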

Process Typing with Gloss Knowledge
In this section, we present our method for the multi-axis event process typing task. The proposed P2GT framework conducts learning in three steps. A pre-trained language model is first used to produce the representations of processes. Then, the gloss information of type vocabularies is encoded as intermediate representations for type labels using the same language model, for which WSD is performed to refine the gloss information of polysemous labels during training. Finally, the language model is fine-tuned with a ranking task objective to capture the association of process-gloss pairs. In this last step, joint learning is performed for typing on both axes to complement the scarce supervision signals, where a process representation is separately projected and handled for action and object types in the latent space. Figure 4 displays the overall model architecture.
In the rest of this section, we introduce the technical details of each step for learning and inference.

Process Representation
We use the officially released RoBERTa-base (Liu et al., 2019) for representations of event processes. RoBERTa improves the original BERT (Devlin et al., 2019) with a modified training procedure. It is considered one of the SOTA models for semantic representation of lexical sequences.
To encode a process $P$, we concatenate the predicate and object ($a_i$ and $o_i$) of each event $e_i$. Then the contents of all primitive events in $P$ are sequentially concatenated, with the RoBERTa separator token </s> added between the contents of every two consecutive events. The entire lexical sequence is enclosed between the tokens <s> and </s> to denote the beginning and end of the sequence. Following convention (Bommasani et al., 2020), mean-pooling of hidden states produces the encoded representation of the process, denoted $\mathbf{P}$.
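The linearization and pooling described above can be sketched at the string level as follows. This is an illustration under stated assumptions: `linearize_process` and `mean_pool` are hypothetical helper names, the special tokens are written out as plain strings, and a real tokenizer would normally insert them itself; the hidden states are stand-in lists of floats rather than actual RoBERTa outputs.

```python
def linearize_process(events, bos="<s>", sep="</s>"):
    """Join (predicate, object) pairs into one sequence.

    The separator is placed between consecutive events, and the whole
    sequence is enclosed between the begin and end tokens.
    """
    parts = [f"{pred} {obj}" for pred, obj in events]
    return f"{bos} " + f" {sep} ".join(parts) + f" {sep}"


def mean_pool(hidden_states):
    """Average a list of equally sized token vectors into one vector."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(vec[j] for vec in hidden_states) / n for j in range(dim)]


sequence = linearize_process([("dig", "hole"), ("put", "seed")])
pooled = mean_pool([[1.0, 3.0], [3.0, 5.0]])  # stand-in hidden states
```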

Label Representation
In our problem setting, directly capturing the association between a process and a free-form type label can be difficult. Hence, we propose a way of indirect supervision by using gloss knowledge as intermediate representations of type labels. The sense definitions in the glosses contain much richer semantic information than the labels themselves. Therefore, leveraging intermediate representations seeks to better characterize the semantic relatedness of processes and labels, especially when the labels are often external to the event content. Glosses also provide side information to jump-start few-shot label representations.
Given a label $L$ for either type axis, we use the same RoBERTa model (with shared parameters) as for process representation to encode its gloss sense definition. The mean-pooling result produces the gloss-based label representation, denoted $\mathbf{S}_L$. Since the verb and noun terms in the label vocabularies can be polysemous, we employ either of the following two techniques to select the glosses in the learning phase:
• Pre-trained WSD models. One technique is to employ off-the-shelf WSD models that handle both verb and noun senses (Hadiwinoto et al., 2019; Huang et al., 2019). This could more precisely find the right definition for each label given the specific context of a process, and allows each (polysemous) label to have varied representations when typing different processes. During the training of P2GT, we run WSD on the concatenation of type labels [A, O] to select the glosses of A and O for each training case.
• Most frequent senses (MFS). If a WSD model is not available, the default way is to match a label only to its most frequent (or predominant) sense in sense-annotated corpora (Langone et al., 2004; Camacho-Collados et al., 2016). The MFS method has been a very strong baseline for unsupervised WSD (Tripodi and Navigli, 2019), as words in natural language text generally express their predominant senses in most cases (McCarthy et al., 2007). Specifically for our task, the purpose is not to infer the exact sense, but rather to generate a semantically rich (and allowably noisy) representation for type labels. In practice, we find this simple technique to perform reasonably well when typing the event processes (§4.3).
Besides these two techniques, we also tried other ways to represent a label, including concatenating all its gloss sense definitions, or concatenating its two or more most frequent senses. However, they do not perform as well as the aforementioned two techniques, presumably due to the noise introduced to label representations. More technical details about WSD and the source inventory of glosses are described in Experiments (§4.1).
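As a small illustration of the MFS option, the sketch below picks the gloss of the most frequent sense from a sense inventory. The inventory structure, the gloss wording and the frequency counts are all invented for this example; in practice they would come from a sense-annotated resource.

```python
# Hypothetical sense inventory: label -> list of (gloss, corpus frequency).
# Glosses and counts are made up for illustration only.
TOY_INVENTORY = {
    "plant": [
        ("put or set seeds or seedlings into the ground", 30),
        ("fix or set something securely or deeply", 7),
    ],
    "book": [
        ("arrange for and reserve something in advance", 12),
    ],
}


def most_frequent_gloss(label, inventory):
    """Return the gloss of the label's most frequent sense, or None if unknown."""
    senses = inventory.get(label)
    if not senses:
        return None
    gloss, _count = max(senses, key=lambda sense: sense[1])
    return gloss


plant_gloss = most_frequent_gloss("plant", TOY_INVENTORY)
missing_gloss = most_frequent_gloss("carfax", TOY_INVENTORY)
```

With a WSD model available, the `max` over frequencies would instead be replaced by the sense that the model selects for the label in the context of each training case.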

Learning Objective
Let $(P, A, O)$ be a process $P$ annotated with action and object labels $A$ and $O$. Our model captures the semantic associations between the RoBERTa-encoded process representation $\mathbf{P}$ and the label glosses $\mathbf{S}_A$ and $\mathbf{S}_O$ by optimizing a ranking task objective. In detail, we define the margin ranking loss for action typing as

$$L_P^1 = \left[ s(M_1 \mathbf{P}, \mathbf{S}_{A'}) - s(M_1 \mathbf{P}, \mathbf{S}_A) + \gamma_1 \right]_+$$

and that for object typing as

$$L_P^2 = \left[ s(M_2 \mathbf{P}, \mathbf{S}_{O'}) - s(M_2 \mathbf{P}, \mathbf{S}_O) + \gamma_2 \right]_+$$

Here $[x]_+$ denotes the positive part of the input $x$ (i.e., $\max(x, 0)$). $\gamma_1$ and $\gamma_2$ are two positive constant margins. $M_1$ and $M_2$ are two learnable linear projections dedicated to the two type axes respectively. $s(\cdot)$ is the cosine similarity measure. $A' \in V$ and $O' \in N$ are negative-sample labels drawn from the verb and noun vocabularies. In the setting with WSD deployed in training, negative sampling randomly fetches from all glosses of labels that appear in the training data, except for the gloss(es) of the positive label. This allows different glosses of a polysemous label to serve as negative samples. In the MFS setting, the only gloss of every negative-sample label is used.
The eventual learning objective is to optimize the following joint loss, where $D$ denotes the dataset:

$$L = \frac{1}{|D|} \sum_{P \in D} \left( L_P^1 + L_P^2 \right)$$

Note that the two margins can already trade off between $L_P^1$ and $L_P^2$, hence we do not use weight coefficients to combine these two ranking loss terms.
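A per-process version of this objective can be sketched in plain Python as below. This is a simplified illustration, not the training code: it assumes the linear projection is applied to the process representation only (as the inference description suggests), uses a single negative sample per axis, and represents vectors and matrices as lists of floats. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity s(u, v); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def project(M, x):
    """Apply a linear projection M (list of rows) to vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def margin_ranking_loss(P, S_pos, S_neg, M, gamma):
    """[s(MP, S_neg) - s(MP, S_pos) + gamma]_+ for one axis."""
    z = project(M, P)
    return max(0.0, cosine(z, S_neg) - cosine(z, S_pos) + gamma)


def joint_loss(P, S_A, S_A_neg, S_O, S_O_neg, M1, M2, gamma1=0.1, gamma2=0.1):
    """Unweighted sum of the action-axis and object-axis ranking losses."""
    return (margin_ranking_loss(P, S_A, S_A_neg, M1, gamma1)
            + margin_ranking_loss(P, S_O, S_O_neg, M2, gamma2))


identity = [[1.0, 0.0], [0.0, 1.0]]
# Positive gloss aligned with the process: the margin is satisfied, loss is zero.
easy = margin_ranking_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], identity, 0.1)
# Negative gloss aligned with the process instead: the loss is 1 - 0 + 0.1.
hard = margin_ranking_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], identity, 0.1)
```

Averaging `joint_loss` over all training processes would give the dataset-level objective; in an actual implementation the projections and the encoder would be updated by gradient descent.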

Inference
The inference phase of P2GT performs a nearest neighbor search to type a process $P$. Let $M$ refer to either $M_1$ for the action type or $M_2$ for the object type. Our framework finds the gloss-based label representation that is closest to $M \mathbf{P}$ from the corresponding vocabulary. Specifically for the setting with polysemous label representations, it is sufficient to consider for each label only the gloss that is embedded most closely to $M \mathbf{P}$, so as not to redundantly consider candidate labels.
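The nearest neighbor search with polysemous labels can be sketched as follows. The function name `rank_labels` and the toy vectors are assumptions for illustration; the query stands for an already projected process representation, and each label is scored by its best-matching gloss so that it appears only once in the ranking.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_labels(projected_process, gloss_vectors, k=10):
    """Rank labels by similarity to the projected process representation.

    `gloss_vectors` maps each label to a list of gloss embeddings (one per
    sense). A label's score is the similarity of its closest gloss.
    """
    scores = {
        label: max(cosine(projected_process, g) for g in glosses)
        for label, glosses in gloss_vectors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy embeddings (invented): "plant" has two senses, "bake" has one.
toy_glosses = {
    "plant": [[1.0, 0.0], [0.0, 1.0]],
    "bake": [[0.5, 0.5]],
}
top = rank_labels([0.0, 1.0], toy_glosses, k=2)
```

An exhaustive scan like this is linear in the number of glosses; for vocabularies of the size described in the paper, an approximate nearest neighbor index would be a reasonable substitute.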

Experiments
To evaluate the proposed P2GT framework for event process typing, we conduct several experiments on the contributed dataset, and compare with a wide selection of baseline methods (§4.1-§4.3). A case study is also provided on typing processes from an external dataset (§4.4).

Experimental Settings
Similar to Rashkin et al. (2018), we randomly separate the 62,277 processes into training, development and test sets using an 80/10/10% split. We report three ranking metrics, i.e., MRR (mean reciprocal rank), recall@1 and recall@10. For all metrics, higher values indicate better performance.
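For reference, the three ranking metrics can be computed as in the following sketch. The function names are illustrative, and the convention that a gold label missing from the ranked list contributes a reciprocal rank of zero is an assumption made here for completeness.

```python
def reciprocal_rank(gold, ranked):
    """1 / (1-based position of the gold label), or 0.0 if it is not ranked."""
    for position, label in enumerate(ranked, start=1):
        if label == gold:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(golds, rankings):
    """Average reciprocal rank over all test cases."""
    return sum(reciprocal_rank(g, r) for g, r in zip(golds, rankings)) / len(golds)


def recall_at_k(golds, rankings, k):
    """Fraction of test cases whose gold label is within the top k predictions."""
    hits = sum(1 for g, r in zip(golds, rankings) if g in r[:k])
    return hits / len(golds)


# Toy predictions (invented): the first case is ranked 1st, the second 2nd.
golds = ["plant", "make"]
rankings = [["plant", "grow"], ["cook", "make"]]
mrr = mean_reciprocal_rank(golds, rankings)
r_at_1 = recall_at_k(golds, rankings, 1)
r_at_10 = recall_at_k(golds, rankings, 10)
```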
We compare our framework with a number of its variants obtained through the following modifications: (i) simplifying the framework by separately learning for the two type axes, instead of performing joint training; (ii) different settings of gloss selection in training, using either WSD or MFS; (iii) different information used to represent each primitive event $e_i$, e.g., using only either $a_i$ or $o_i$ (marked with partial event) according to the type axis, instead of using both. Besides, we compare with sequence-to-label (S2L) generators (Rashkin et al., 2018). Such a method is an encoder-decoder architecture trained to directly map from sequences to unigrams of the type vocabulary, originally used by Rashkin et al. (2018) to infer intentions from a single-clause description of a primitive event. Specifically, we employ three variants of S2L using different encoders. Besides one based on RoBERTa (marked as S2L-RoBERTa), the two others are the BiGRU encoder (S2L-BiGRU) and the mean-pooling encoder (S2L-mean) with Skip-Gram word embeddings used by Rashkin et al. (2018). Note that to train S2L models, the original paper uses a cross-entropy loss to model the distribution of unigrams. We instead train the process encoder to directly fit the embeddings of label surface forms, similar to a reverse dictionary (Hill et al., 2016; Chen et al., 2019), which offers notably better performance.

Model Configuration
We use sense definitions from WordNet (Miller, 1995) to define the labels. While such glosses cover all verbs in the action type vocabularies, there are 7.92% of processes where object type labels do not find WordNet senses. For each such case, we select from WordNet the lexeme that is embedded most closely to the label, and use the predominant sense of that lexeme to generate the label representation. For the training setting with WSD, we use the BERT-NN model (Hadiwinoto et al., 2019), which is one of the SOTA WSD methods and is trained on the SemCor corpus (Langone et al., 2004). In fact, apart from the ones that are dedicated to nouns (Scarlini et al., 2020; Pasini and Navigli, 2017), other SOTA methods for WSD (Huang et al., 2019; Maru et al., 2019; Tripodi and Navigli, 2019) may also apply to our framework, and we leave their investigation to future work. We use AMSGrad (Reddi et al., 2018) to optimize the learning objective, with the learning rate set to 0.0001. The batch size is set to 64 to fit the memory of one Titan RTX 6000 GPU. Training is limited to 50 epochs, which is enough for all models to converge. Margins are chosen from 0.0 to 0.4 with a step of 0.1, based on recall@1 performance on the development set. Accordingly, $\gamma_1 = 0.2$ and $\gamma_2 = 0.1$ are selected for Single P2GT methods, while both margins are set to 0.1 for the joint-learning P2GT.

Results
We report the results of event process typing on both axes in Table 1, whereof the results for typing actions are generally better than those for the object axis due to different sizes of candidate spaces.

Table 3: Case study for typing event processes in the news domain. The predictions are given by Joint P2GT-WSD trained on our full dataset. Each case is given top 3 predictions on both axes, whereof reasonably correct ones are boldfaced, and relevant ones are italic. Few-shot labels appearing up to 10 times in our dataset are in blue.

The results by the S2L baseline methods show that incorporating pre-trained RoBERTa offers noticeably better performance than other encoding techniques. However, it is drastically outperformed by the Single P2GT setting without WSD-based gloss selection. When typing the action, with the same representation of event processes, P2GT outperforms S2L-RoBERTa with an absolute increase in MRR of 15.47% (ca. 1.88× relative increase), and in recall@1 of 14.36% (ca. 2.7× relative increase). For object typing, the absolute increments are 8.83% in MRR (ca. 1.82× relatively) and 5.62% in recall@1 (ca. 1.72× relatively). This also indicates that incorporating gloss knowledge for label representations brings the most substantial improvement to the task, inasmuch as glosses carry rich semantic information to jump-start type labels and realistically support sequence-level learning-to-rank.
On the other hand, incorporating WSD for gloss selection in training yields a slight absolute improvement of up to 1.73% in MRR for action typing and 0.48% in MRR for object typing. This is partly because the predominant sense definitions generally provide precise or close definitions to represent the labels in most cases. Hence, sense selection provides a smaller improvement, especially when the candidate space is large. It is noteworthy that giving only the predicate or only the object information of the associated events is not enough to infer the type information. In fact, the performance drop is in accordance with human cognition, as giving a chain of predicates only or objects only is not enough to predict the intentions of the event process. Consider the first example in Figure 1: observing only the chain of the protagonist's actions dig, put, fill and water, or only the participating objects hole, seed and soil, is clearly not enough to infer the overall action and the objective that directs the entire process. Accordingly, the partial event representation causes a significant performance drop of 3.35-7.76% in terms of MRR, and 2.49-5.86% in terms of recall@1. Lastly, joint learning brings a performance gain of 1.51-4.47% in MRR and 0.96-1.76% in recall@1, indicating the effectiveness of leveraging complementary supervision signals. Note that the evaluation strictly enforces exact match in large candidate spaces, thus underestimating the system performance. While it is difficult for the model to always rank the ground-truth labels on the top, it can often infer reasonably close labels as top predictions, for which a couple of examples are shown in Table 2.
To understand how differently our method performs on processes of different characteristics, we additionally perform an error analysis. In Figure 5, we compare action prediction by P2GT with joint learning and MFS-based gloss selection on different portions of the test set. As expected, the performance on more frequent labels is better than on infrequent ones due to ampler training cases. Nonetheless, on the extremely challenging one-shot cases, our method still performs reasonably well, and far exceeds the overall results of the baseline methods. Additionally, we observe that typing longer event processes is easier, as they provide more contextual information of associated events to help infer the central goal. In contrast, as short processes are less informative, MRR scores for those of length 2 and 3 are 24.17% and 25.41%, respectively.

Case Study
We conduct a case study using a subset of the NYT narrative cloze dataset provided by Lee and Goldwasser (2019). This dataset includes a series of event processes extracted from news reports, and we use those processes to showcase the prediction of P2GT on out-of-domain processes. According to Table 3, although the content and concepts of processes in military and political news are mostly irrelevant to the intentional goals in our dataset, P2GT is able to infer reasonably correct type information on both axes. Particularly, many of the top predictions give few-shot labels. This further shows that gloss knowledge is effective in improving the generalization of the typing model, both in terms of handling domain shift and few-shot cases. The case study also points out a direction for further study: how well gloss-based label representations can generally benefit domain adaptation and few-shot learning in natural language understanding tasks.

Related Work
Prediction tasks on event processes have attracted much attention recently, while many existing works focus on extraction and completion of event processes. For example, Radinsky et al. (2012; 2013) mine sequences of frequently co-occurring events from multiple temporally connected documents, and use the sequence knowledge to predict the future event(s) of a process. Berant et al. (2014) propose to extract biological processes with SRL, and help machine reading comprehension for biological articles. A series of other works learn sequential event prediction using language models (Chaturvedi et al., 2017; Peng et al., 2019) or association rules (Letham et al., 2013), and further cope with downstream tasks such as narrative cloze tests. In contrast, fewer efforts have been made to infer the intentions or central goals behind a composite of events. A recent work by Rashkin et al. (2018) is particularly relevant to this topic: it learns a sequence-to-label generator to predict the intention of one primitive event based on a single-clause description. This is, however, essentially different from our focus on processes of multiple events.
Semantic typing has been investigated for language components other than events, such as entities and word senses. Due to the large body of work in this line of research, we can only provide a highly selective summary of the most recent outcomes. For entity typing, recent research has coped with highly challenging problem settings. Those include few-shot or zero-shot typing with contextual distant supervision (Zhou et al., 2018) and description-based label embeddings (Obeidat et al., 2019). Others realize ultra-fine type systems with the help of head-word supervision (Choi et al., 2018), hierarchical learning-to-rank (Chen et al., 2020) and structured label representations (Xiong et al., 2019; Hao et al., 2019). Several of the aforementioned techniques are also applied to supersense typing (Levine et al., 2020; Peters et al., 2019) and POS tagging (Owoputi et al., 2013). In terms of type labeling, our work is inspired by the way Choi et al. (2018) leverage free-form lexemes for ultra-fine entity types. Nevertheless, besides typing on a different modality, our work is also distinguished by the multi-axis typing system and the way of leveraging gloss-based indirect supervision.
Representation learning of gloss knowledge has been incorporated in various tasks. A number of works encode gloss definitions for monolingual (Hill et al., 2016; Noraset et al., 2017; Pilehvar, 2019; Hedderich et al., 2019) and cross-lingual (Chen et al., 2019; Zhang et al., 2020a) reverse dictionary prediction, as well as out-of-vocabulary lexical representation (Kumar et al., 2019; Prokhorov et al., 2019; Bahdanau et al., 2017). Definitions have also been leveraged to generate zero-shot entity representations in knowledge graphs (Kartsaklis et al., 2018; Chen et al., 2018; Long et al., 2017). Some other works inject gloss representations to improve WSD (Huang et al., 2019; Luo et al., 2018; Blevins and Zettlemoyer, 2020). Gloss-BERT (Huang et al., 2019) thereof formalizes the WSD problem as classifying context-gloss pairs. Our learning approach on process-gloss pairs is connected to that approach, whereas we handle a learning-to-rank objective, and make inference in a much larger candidate space than the sense space of a single word.

Conclusion
We propose a new task of event process understanding, by semantically typing the intended action of an event process and the object(s) it seeks to affect. To facilitate research in this direction, we develop a new dataset, gathering over 60 thousand event processes with ultra fine-grained type vocabularies. We further propose a hybrid learning framework, which leverages indirect supervision from gloss knowledge. The proposed P2GT framework fine-tunes RoBERTa to capture the association of process-gloss pairs. Label gloss selection mechanisms and joint training are incorporated to further improve the performance. Experiments show that P2GT offers promising performance on inferring the fine-grained type information, and exhibits satisfactory generalizability on out-of-domain event processes.
For future work, we are interested in identifying salient events in processes, i.e., those that most significantly define the central goals. Incorporating process typing into downstream tasks such as summarization and commonsense QA is also an important direction.

Figure 1: Examples of type inference for event processes.

Figure 3: Distribution of action and object types. Frequencies are shown in brackets.
Figure 4: A gloss selection module selects the proper glosses of the training labels. Then a RoBERTa language model captures the event process, and separately generates gloss-based representations for positive and negative sampled labels. The entire learning process conducts joint learning-to-rank on both type axes.
⇒ Obtain a container ⇒ Obtain shrapnel ⇒ Install a trigger
A: detonate, assemble, blacken
O: grenade, blaster, mine

Ignore order ⇒ Enter area ⇒ Enforce blockade ⇒ Force to retreat from area
A: conquer, disarm, invade
O: barrier, soldier, fortress

Capture two opposition posts ⇒ Kill many fighters ⇒ Destroy three armed trucks ⇒ Confiscate artillery guns
A: kill, demolish, fight
O: melee, conflict, stronghold

Cooperate with the counsel investigation ⇒ Open his remarks ⇒ Apologize many times ⇒ Try to restore public trust
A: respond, disagree, accept
O: apology, disagreement, slander

Travel in a presidential motorcade ⇒ Be shot once in the back ⇒ Be taken to hospital ⇒ Be pronounced dead
A: survive, die, tackle
O: assassin, crash, roadkill

Give advance notice ⇒ Give notice ⇒ Issue dividends
A: honor, pay, reward
O: finance, equity, subsidy

Target quotes ⇒ Target shares quotes ⇒ Ask to clarify offer ⇒ Challenge to merge agreement ⇒ Challenge to merge businesses
A: compare, maximize, negotiate
O: prospectus, quote, settlement

Clean windows ⇒ Buy plants ⇒ Hang pictures ⇒ Paint walls ⇒ Carpet floors
A: redecorate, decorate, refurbish
O: room, bedroom, makeover

Figure 5: Comparison of action typing on different portions of the test set. We compare results by Joint P2GT-MFS for the top 100 frequent types, one-shot types and the rest, as well as results on processes of different lengths. The median length of processes in the dataset is 5.

Table 1: Results for multi-axis event process typing. S2L methods with different encoding techniques are original or adapted from Event2Mind (Rashkin et al., 2018). Partial event marks the cases where only $a_i$ (or $o_i$) is encoded for each event $e_i$ in the process to infer the action (or object) type. Joint or Single denotes whether joint training is used for both type axes or not. MFS and WSD mark the ways of gloss selection in training.
A: strop, highlight, thread, blunt, sharpen
O: unibrow, eyebrow, straightener, eyelash, razor

Learn how to strum ⇒ Use a metronome ⇒ Play to recorded songs ⇒ Grow skills
A: play, practice, strum, tune, box
O: cymbal, mandolin, guitar, dulcimer, flute

Get a referral ⇒ Verify the specialist's qualifications ⇒ Ask questions ⇒ Assess whether treatment is working
A: find, choose, use, apply, drink
O: therapist, physician, specialist, surgeon, psychiatrist

Go to DMV ⇒ Take photos ⇒ Take vision test ⇒ Take permit test ⇒ Take road test
A: obtain, verify, explore, drive, polish
O: license, check, visa, carfax, toll

Create your clan ⇒ Maintain your clan ⇒ Add another clan ⇒ Defend the borders ⇒ Do the hunting
A: adopt, create, spawn, homestead, become
O: clan, warrior, headhunter, skirmish, necrons

Prepare the jack ⇒ Locate the filler hole ⇒ Fill the oil ⇒ Close the filler hole
A: bleed, grease, add, fill, inflate
O: oil, pump, biodiesel, blowing, choke

Table 2: Top 5 predictions on examples of test cases by Joint-P2GT-WSD. Ground truths are underscored, reasonably correct labels are boldfaced, and close ones are italic. Few-shot labels appearing ≤10 times are in blue.