Cross Sentence Inference for Process Knowledge

For AI systems to reason about real world situations, they need to recognize which processes are at play and which entities play key roles in them. Our goal is to extract this kind of role-based knowledge about processes from multiple sentence-level descriptions. This knowledge is hard to acquire: while semantic role labeling (SRL) systems can extract sentence-level role information about individual mentions of a process, their results are often noisy and they do not attempt to create a globally consistent characterization of a process. To overcome this, we extend standard within sentence joint inference to inference across multiple sentences. This cross sentence inference promotes role assignments that are compatible across different descriptions of the same process. When formulated as an Integer Linear Program, this leads to improvements over within-sentence inference by nearly 3% in F1. The resulting role-based knowledge is of high quality (with an F1 of nearly 82).


Introduction
Knowledge about processes is essential for AI systems in order to understand and reason about the world. At the simplest level, even knowing which class of entities play key roles can be useful for tasks involving recognition and reasoning about processes. For instance, given a description "a puddle drying in the sun", one can recognize this as an instance of the process evaporation using macro-level role knowledge: among others, the typical undergoer of evaporation is a kind of liquid (the puddle), and the enabler is usually a heat source (the sun).
Table 1: Example sentences on evaporation.
1) Evaporation is the process by which liquids are converted to their gaseous forms.
2) Evaporation is the process by which water is converted into water vapor.
3) Water vapor rises from water due to evaporation.
4) Clouds arise as water evaporates in the sun.
Our goal is to acquire this kind of role-based knowledge about processes from sentence-level descriptions in grade level texts. Semantic role labeling (SRL) systems can be trained to identify these process specific roles. However, these were developed for sentence-level interpretation and only ensure within sentence consistency of labels (Punyakanok et al., 2004;Toutanova et al., 2005;Lewis et al., 2015), limiting their ability to generate coherent characterizations of the process overall. In particular, the same process participant may appear in text at different syntactic positions, with different wording, and with different verbs, which makes it hard to extract globally consistent descriptions. In this work, we propose a cross sentence inference method to address this problem.
To illustrate the challenge, consider some example sentences on evaporation shown in Table 1. The underlined spans correspond to fillers for an undergoer role, i.e., the main entity that is undergoing evaporation. However, the filler water occurs as different syntactic arguments with different main actions. Without large amounts of process-specific training data, a supervised classifier will not be able to learn these variations reliably. Nevertheless, since all these sentences are describing evaporation, it is highly likely that water plays a single role. This expectation can be encoded as a factor during inference to promote consistency and improve accuracy, and is the basis of our approach.
We formalize this cross sentence joint inference idea as an Integer Linear Program (ILP). Our central idea is to collect all sentences for a single process, generate candidate arguments, and assign roles that are globally consistent for all arguments within the process. This requires a notion of consistency, which we model as pairwise alignment of arguments that should receive the same label. Argument-level entailment alone turns out to be ineffective for this purpose.
Therefore, we develop an alignment classifier that uses the compatibility of contexts in which the candidate arguments are embedded. We transform the original role-label training data to create alignment pairs from arguments that get assigned the same label, thus avoiding the need for additional labeling. Finally, the ILP combines the output of the SRL classifier and the alignment classifier in an objective function in order to find globally consistent assignments.
An empirical evaluation on a process dataset shows that the proposed cross sentence formulation outperforms a strong within sentence joint inference baseline, which uses scores from a custom built role classifier that is better suited for the target domain.
In summary, this work makes the following contributions:
1. A cross-sentence, collective role-labeling and alignment method for harvesting process knowledge.
2. A high quality semantic resource that provides knowledge about scientific processes discussed in grade-level texts including physical, biological, and natural processes.
3. An evaluation which shows that the proposed cross sentence inference yields high quality process knowledge.

Related Work
Role-based representations have been shown to be useful for open-domain factoid question answering (Shen and Lapata, 2007; Pizzato and Mollá, 2008), grade-level science exams (Jauhar et al., 2016), and comprehension questions on process descriptions (Berant et al., 2014). Similar to process comprehension work, we target semantic representations about processes, but we focus only on a high-level summary of the process rather than a detailed sequential representation of the sub-events involved. Moreover, we seek to aggregate knowledge from multiple descriptions rather than understand a single discourse about each process. There has been substantial prior work on semantic role labeling itself, which we leverage in this work. First, there are several systems trained on the PropBank dataset, e.g., EasySRL (Lewis et al., 2015), Mate (Björkelund et al., 2009), and Generalized-Inference (Punyakanok et al., 2004). Although useful, the PropBank roles are verb (predicate) specific, and thus do not produce consistent labels for a process (that may be expressed using several different verbs). In contrast, frame-semantic parsers, e.g., SEMAFOR (Das et al., 2010), trained on FrameNet-annotated data (Baker et al., 1998) do produce concept (frame)-specific labels, but the FrameNet training data has poor (< 50%) coverage of the grade science process terms. Building a resource like FrameNet for a list of scientific processes is expensive.
Several unsupervised, and semi-supervised approaches have been proposed to address these issues for PropBank style predicate-specific roles (Swier and Stevenson, 2004;Lang and Lapata, 2011;Fürstenau and Lapata, 2009;Fürstenau and Lapata, 2012;Lang and Lapata, 2010;Klementiev, 2012). A key idea here is to cluster syntactic signatures of the arguments and use the discovered clusters as roles. Another line of research has sought to perform joint training for syntactic parsing and semantic role labeling (Lewis et al., 2015), and in using PropBank role labels to improve FrameNet processing using pivot features (Kshirsagar et al., 2015).
Some SRL methods account for context information from multiple sentences (Ruppenhofer et al., 2010; Roth and Lapata, 2015). They focus on annotating individual event mentions in a document using discourse-level evidence such as co-reference chains. Our task is to aggregate knowledge about processes from multiple sentences in different documents. Although both tasks require raw SRL-style input, the different nature of the process task means that a different solution framework is needed.
Our goal is to acquire high quality semantic role based knowledge about processes. This allows us a unique opportunity to jointly interpret sentences that are discussing the same process. To propose a cross-sentence inference technique, we build on ideas from previous within sentence joint inference (Punyakanok et al., 2004), argument similarity notions in semi-supervised and unsupervised approaches (Fürstenau and Lapata, 2012), and combining PropBank roles (Kshirsagar et al., 2015). The inference can be integrated with existing trained supervised learning pipelines, which can provide a score for role assignments for a given span.

Approach
Processes are complex events with many participating entities and inter-related sub-events. In this work, we aim for relatively simple macro-level role-based knowledge about processes. Our task is to find classes of entities that are likely to fill key roles within a process, namely the undergoer, enabler, result, and action (different verbs denoting the main action when the process is occurring, e.g., "dry"). We select these roles based on an initial analysis of grade science questions that involve recognizing instances of processes from their descriptions. Table 2 shows some examples of the target knowledge roles.
We develop a scalable pipeline for gathering such role-based process knowledge. The input to our system is the name of a process, e.g., "evaporate". Then we use a set of query patterns to find sentences that describe the process. A semantic role classifier then identifies the target roles in these sentences. The output is a list of typical fillers for the four process roles.
This setting presents a unique opportunity, where the goal is to perform semantic role labeling on a set of closely related sentences, sentences that describe the same process. This allows us to design a joint inference method that can promote expectations of consistency amongst the extracted role fillers.
There is no large scale training data that can be readily used for this task. Because we target process-specific and not verb-specific semantic roles, existing PropBank (Kingsbury and Palmer, 2003) trained SRL systems cannot be used directly. Frame-semantic parsers trained on FrameNet data (Baker et al., 1998) are also not directly usable because FrameNet lacks coverage for many of the processes discussed in the science domain. Therefore, we create a process dataset that covers a relatively small number of processes, but demonstrate that the role extraction generalizes to previously unseen processes as well.

Cross-Sentence Inference
Given a set of sentences about a process, we want to extract role fillers that are globally consistent, i.e., we want role assignments that are compatible. Our approach is based on two observations: First, any given role is likely to have similar fillers for a particular process. For instance, the undergoers of the evaporation process are likely to be similar - they are usually liquids. Second, similar arguments are unlikely to fill different roles for the same process. In evaporation, for example, it is highly unlikely that water is an undergoer in one sentence but is a result in another. These role-specific selectional preferences vary for each process and can be learned if there are enough example role fillers for each process during training (Zapirain et al., 2009; Zapirain et al., 2013). Since we wish to handle processes for which we have no training data, we approximate this by modeling whether two arguments should receive the same role given their similarity and their context similarity.
Figure 1 illustrates the cross sentence inference problem using a factor graph. The S_ij random variables denote the role label assignment for the j-th argument in sentence i. Each assignment to an argument S_ij is scored by a combination of the role classifier's score (factor φ_role), and its pairwise compatibility with the assignments to other arguments (factor φ_align). The factors φ_sent capture two basic within sentence constraints.
Figure 1: A factor graph representation of cross sentence inference. S_11 and S_12 denote role assignments for arguments a_11 and a_12 in one sentence, and S_21 and S_22 denote those for arguments a_21 and a_22 in another. The φ_role factors score each role assignment to the arguments, and the φ_align factors score the compatibility of the connected arguments. The φ_sent factors encode sentence level constraints.

Inference using ILP
We formulate the cross sentence inference task using an Integer Linear Program shown in Figure 2. The ILP seeks an assignment that maximizes a combination of individual role assignment scores and their global compatibility, which is measured as the similarity of fillers for the same role minus the similarity of fillers of different roles.
[Figure 2: ILP formulation. Annotated parts: role classifier score; compatibility with same roles; compatibility with other roles (penalty when role n ≠ k); constraints.]
The decision variables z_ijk denote role assignments to arguments. When z_ijk is set, it denotes that argument j in sentence i (a_ij) has been assigned role k. The objective function uses three components to assign scores to an assignment.
1. Classifier Score φ_role(a_ij, k) - This is the score of a sentence-level role classifier for assigning role k to argument a_ij.
2. Within Role Compatibility ∆(a_ij, k) - This is a measure of argument a_ij's compatibility with other arguments which have also been assigned the same role k. We measure compatibility using a notion of alignment. An argument is said to align with another if they are similar to each other in some respect (either lexically or semantically). As we show later, we develop an alignment classifier which predicts an alignment score φ_align for each pair of arguments. The compatibility is defined as a normalized sum of the alignment scores for argument a_ij paired with other arguments that have also been assigned the role k. Without some normalization, roles with many arguments will receive higher compatibility scores. To avoid this, we normalize by (1/Ñ_k), where Ñ_k refers to the number of arguments that the role classifier originally labeled with role k, an approximation to the number of arguments that are currently assigned role k by the ILP.
3. Across Role Incompatibility ∇(a_ij, k) - This is a measure of how well a_ij aligns with the other arguments that are assigned a different role (n ≠ k). For good assignments this quantity should be low. Therefore we add this as a penalty to the objective. As with ∆, we use an approximation for the normalization, here the product of the number of other roles and the number of arguments in other sentences that can receive these roles. Because this normalizer is typically higher than the one used for ∆, we boost this score by 2 to balance against ∆.
Last, we use two sets of hard constraints to enforce the standard within-sentence expectations for roles: 1) A single argument can receive only one role label, and 2) A sentence cannot have more than one argument with the same label, except for the NONE role.
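The three components and two constraints described above can be written out as a single program. The following is a sketch of one form consistent with that description, not necessarily the exact program of Figure 2: the placement of the weight λ (the ILP parameter tuned in the evaluation), the restriction of the sums to other sentences (l ≠ i), the separate normalizer symbol \tilde{N}_{\bar{k}} for the penalty term, and the equality in the first constraint (which treats NONE as an ordinary label) are all assumptions made for illustration.

```latex
\begin{align*}
\max_{z}\quad & \sum_{i}\sum_{j}\sum_{k} z_{ijk}\,
  \Big[\lambda\,\phi_{role}(a_{ij},k)
  + (1-\lambda)\big(\Delta(a_{ij},k) - 2\,\nabla(a_{ij},k)\big)\Big] \\
\Delta(a_{ij},k) \;=\;& \frac{1}{\tilde{N}_{k}}
  \sum_{l \neq i}\sum_{m} z_{lmk}\,\phi_{align}(a_{ij},a_{lm}) \\
\nabla(a_{ij},k) \;=\;& \frac{1}{\tilde{N}_{\bar{k}}}
  \sum_{l \neq i}\sum_{m}\sum_{n \neq k} z_{lmn}\,\phi_{align}(a_{ij},a_{lm}) \\
\text{subject to}\quad
  & \sum_{k} z_{ijk} = 1 \qquad \forall\, i, j \\
  & \sum_{j} z_{ijk} \le 1 \qquad \forall\, i,\ \forall\, k \neq \text{NONE} \\
  & z_{ijk} \in \{0, 1\}
\end{align*}
```

Written this way, the objective contains products of decision variables (z_ijk · z_lmk). A solver for linear programs would need these products replaced by auxiliary binary variables with linking constraints; that linearization step is standard but is not described in the text, so it is left out of the sketch.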
We use an off-the-shelf solver in Gurobi (www.gurobi.com) to find an approximate solution to the resulting optimization problem.
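For intuition about how cross sentence compatibility can override a locally preferred label, the objective can be evaluated exhaustively on a toy instance. The sketch below is not the Gurobi-based implementation: it brute-forces every labeling of two short sentences under the two hard constraints. The scores, the weight `lam`, the reduced role set, the omission of the normalizers and of the factor 2, and the treatment of NONE as an ordinary label are all assumptions chosen to keep the example small.

```python
from itertools import product

ROLES = ["undergoer", "enabler", "NONE"]  # reduced role set, for illustration only


def align_score(align, a, b):
    """Symmetric lookup of a (hypothetical) alignment classifier score."""
    return align.get((a, b), align.get((b, a), 0.0))


def joint_score(assign, role_score, align, lam):
    """Role scores plus same-role alignment minus different-role alignment."""
    total = 0.0
    for a, k in assign.items():
        same = other = 0.0
        for b, kb in assign.items():
            if b[0] == a[0]:  # only compare arguments from different sentences
                continue
            if kb == k:
                same += align_score(align, a, b)
            else:
                other += align_score(align, a, b)
        total += lam * role_score[a][k] + (1 - lam) * (same - other)
    return total


def valid(assign):
    """A sentence may not use the same non-NONE role twice."""
    seen = set()
    for (sent, _), k in assign.items():
        if k != "NONE" and (sent, k) in seen:
            return False
        seen.add((sent, k))
    return True


def best_assignment(role_score, align, lam=0.5):
    """Exhaustive search; each argument gets exactly one label by construction."""
    args = sorted(role_score)
    candidates = (dict(zip(args, labels)) for labels in product(ROLES, repeat=len(args)))
    return max((c for c in candidates if valid(c)),
               key=lambda c: joint_score(c, role_score, align, lam))


# Arguments are keyed by (sentence index, argument index). Made-up scores:
# the role classifier wrongly prefers "enabler" for "water" in sentence 1.
role_score = {
    (0, 0): {"undergoer": 0.8, "enabler": 0.1, "NONE": 0.1},  # "water"
    (0, 1): {"undergoer": 0.1, "enabler": 0.7, "NONE": 0.2},  # "the sun"
    (1, 0): {"undergoer": 0.4, "enabler": 0.5, "NONE": 0.1},  # "water"
    (1, 1): {"undergoer": 0.2, "enabler": 0.4, "NONE": 0.4},  # "heat"
}
align = {
    ((0, 0), (1, 0)): 0.9, ((0, 1), (1, 1)): 0.8,
    ((0, 0), (1, 1)): 0.1, ((0, 1), (1, 0)): 0.1,
}

within_only = best_assignment(role_score, align, lam=1.0)
cross = best_assignment(role_score, align, lam=0.5)
```

With `lam=1.0` only the classifier scores count and the second "water" keeps its incorrect enabler label; with `lam=0.5` the strong alignment with the first "water" flips it to undergoer. Exhaustive search is exponential in the number of arguments, which is why a solver is needed at realistic sizes.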

Role Classifier (Φ_role)
The role classifier provides a score for each role label for a given argument. Although existing SRL and frame semantic parsers do not directly produce the role information we need (Section 2), we build on them by using their outputs for building a process role classifier.
Before we can assign role labels, we first need to generate candidate arguments.
Using EasySRL (Lewis et al., 2015), a state-of-the-art SRL system, we generate the candidate argument spans for each predicate (verb) in the sentence. Then, using a linear SVM classifier (Fan et al., 2008), we score the candidate arguments and the predicates for our four roles and a special NONE role, which indicates that the argument is not one of the four. The classifier is trained with a set of annotated examples (see Section 4) with the following sets of features.
i) Lexical and Syntactic - We use a small set of standard SRL features such as lexical and syntactic contexts of arguments (e.g., head word, its POS tag) and predicate-argument path features (e.g., dependency paths). We also add features that are specific to the nature of the process sentences. In particular, we encode syntactic relationships of arguments with respect to the process name mention in the sentence. We use the Stanford CoreNLP toolkit to obtain POS tags and dependency parses to build these features.
ii) PropBank roles - While they do not have a 1-to-1 correspondence with process roles, we use the EasySRL roles coupled with the specific predicate as a feature to provide useful evidence towards the process role.
iii) FrameNet Frames - We use the frames evoked by the words in the sentence to allow better feature sharing among related processes. For instance, the contexts of undergoers in evaporation and condensation are likely to be similar as they are both state changes which evoke the same Undergo Change frame in FrameNet.
iv) Query patterns - We use query patterns to find sentences that are likely to contain the target roles of interest. The query pattern that retrieved a sentence can help bias the classifier towards roles that are likely to be expressed in it.

Alignment Classifier (Φ_align)
Our goal with alignment is to identify arguments that should receive similar role labels. One way to do this argument alignment is to use techniques developed for phrase level entailment or similarity, which often use resources such as WordNet and distributional embeddings such as word2vec (Mikolov et al., 2013) vectors. It turns out that this simple entailment or argument similarity, by itself, is not enough in many cases for our task. Moreover, the enabler and the result roles are often long phrases whose text-based similarity is not reliable. A more robust approach is necessary. Lexical and syntactic similarity of arguments and the context in which they are embedded can provide valuable additional information. Table 3 shows a complete list of features we use to train the alignment classifier.

Table 3: Features used to train the alignment classifier.
Lexical
• Entailment score of arguments.
• Word2Vec similarity of argument vectors.
• Word2Vec similarity of head nodes of arguments.
• Word2Vec similarity of parent of the head nodes.
• Word2Vec similarity of verbs of argument sentences.
• Jaccard similarity of children of the head node.
Syntactic
• Similarities of frames to right and left of arguments.
• Jaccard similarity of POS tags of argument.
• POS tag of head nodes match (boolean).
• POS tag of head node parents match (boolean).
• Similarity of dep. path from arg to process name.
• Similarity of POS tags on arg to process name path.
• Similarity of POS tags of arg's children.
• Similarity of the dependencies of the arg's head.
Sentence
• Query patterns match argument sentences (boolean).

Fortunately, learning this classifier does not require any additional training data. The original data with annotated semantic role labels can be easily transformed to generate supervision for this classifier. For any given process, we consider all pairs of arguments in different sentences (i.e., (a_ij, a_lm) : i ≠ l) and label them as aligned if they are labeled with the same role, or unaligned otherwise.
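The transformation from role-labeled arguments to alignment supervision can be sketched in a few lines. This is a minimal illustration under stated assumptions: the tuple layout `(sentence_id, argument_text, role)` and the function name `alignment_pairs` are hypothetical, and the role labels in the example are illustrative rather than taken from the annotated dataset.

```python
from itertools import combinations


def alignment_pairs(labeled_args):
    """Build alignment training pairs for ONE process.

    labeled_args: list of (sentence_id, argument_text, role) tuples.
    Returns (arg_a, arg_b, aligned) triples for arguments that come from
    different sentences; aligned is True when the two roles are identical.
    """
    pairs = []
    for (s1, a1, r1), (s2, a2, r2) in combinations(labeled_args, 2):
        if s1 == s2:  # pairs within the same sentence are not used
            continue
        pairs.append((a1, a2, r1 == r2))
    return pairs


# Illustrative labels for two evaporation sentences.
evaporation = [
    (1, "liquids", "undergoer"),
    (2, "water", "undergoer"),
    (2, "water vapor", "result"),
]
pairs = alignment_pairs(evaporation)
```

Here `pairs` contains one aligned pair ("liquids", "water") and one unaligned pair ("liquids", "water vapor"); "water" and "water vapor" are never paired because they occur in the same sentence. Whether two arguments that are both labeled NONE should count as aligned is not stated in the text; the sketch applies the same-role rule literally.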

Evaluation
Our goal is to generate knowledge about processes discussed in grade-level science exams. Since existing semantic resources such as FrameNet do not provide adequate coverage for these, we created a dataset of process sentences annotated with the four process roles: undergoer, enabler, action, and result.

Dataset
This dataset consists of 1205 role fillers extracted from 537 sentences retrieved from the web. We first compiled the target processes from a list of process-oriented questions found in two collections: (i) New York Regents science exams (Clark, 2015), and (ii) helpteaching.com, a Web-based collection of practice questions. Then, we identified 127 process questions from which we obtained a set of 180 unique target processes. For each target process, we queried the web using Google to find definition-style sentences, which describe the target process. For each process we discarded some noisy sentences through a combination of automatic and manual filtering. Table 4 shows some examples of the 14 query patterns that we used to find process descriptions. Because these patterns are not process-specific, they work for unseen processes as well.
Table 4: Query Patterns.
<name> is the process of <x>
<name> is the process by which <x>
<name> {occurs when} <x>
<name> {helps to | causes} <x>
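Because the query patterns are templates over the process name, they can be instantiated for any process and also used to recover the trailing span from a retrieved sentence. The sketch below covers only the example patterns listed in Table 4, not all 14; the regular expressions, the pattern labels, and the function name `match_pattern` are assumptions made for illustration, and the actual retrieval went through web search rather than local matching.

```python
import re

# Pattern label -> regex template; {name} is filled with the process name.
PATTERNS = {
    "is_the_process_of": r"{name} is the process of (?P<x>.+)",
    "is_the_process_by_which": r"{name} is the process by which (?P<x>.+)",
    "occurs_when": r"{name} occurs when (?P<x>.+)",
    "helps_to_or_causes": r"{name} (?:helps to|causes) (?P<x>.+)",
}


def match_pattern(process_name, sentence):
    """Return (pattern_label, x_span) for the first matching pattern, else None."""
    for label, template in PATTERNS.items():
        regex = template.format(name=re.escape(process_name))
        m = re.search(regex, sentence, flags=re.IGNORECASE)
        if m:
            return label, m.group("x").rstrip(".")
    return None


result = match_pattern(
    "evaporation",
    "Evaporation is the process by which liquids are converted to their gaseous forms.",
)
```

The returned label is the kind of signal that the query pattern feature of the role classifier could use, and the captured span is where an additional candidate argument might be found.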
To find role fillers from these sentences, we first processed each sentence using EasySRL (Lewis et al., 2015) to generate candidate arguments. Some of the query patterns can be used to generate additional arguments. For example, in the pattern "<name> is the process of <x>", if <x> is a noun then it is likely to be an undergoer, and thus can be a good candidate. Then two annotators annotated the candidate arguments with the target roles if one was applicable and marked them as NONE otherwise. Disagreements were resolved by a third annotator. The annotations spanned a random subset of 54 target processes. The role label distribution is shown below:
We conducted five-fold cross validation experiments to test role extraction. To ensure that we are testing the generalization of the approach to unseen processes, we generated the folds such that the processes in the test fold were unseen during training. We compared the basic role classifier described in Section 3.3, the within sentence inference model, and the cross sentence inference model. We tune the ILP parameter λ for cross sentence inference based on a coarse-grained sweep on the training folds.
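Keeping test processes unseen during training amounts to splitting by process rather than by sentence. A minimal sketch of such a grouped split follows; the round-robin assignment of processes to folds and the `(process_name, example)` layout are assumptions, since the text does not say how processes were distributed across folds.

```python
def process_folds(examples, n_folds=5):
    """Split examples so that no process appears in both train and test.

    examples: list of (process_name, example) pairs.
    Returns a list of (train, test) splits, one per fold.
    """
    processes = sorted({name for name, _ in examples})
    # Round-robin assignment of whole processes to folds (an assumption).
    fold_of = {name: i % n_folds for i, name in enumerate(processes)}
    splits = []
    for fold in range(n_folds):
        train = [ex for ex in examples if fold_of[ex[0]] != fold]
        test = [ex for ex in examples if fold_of[ex[0]] == fold]
        splits.append((train, test))
    return splits


examples = [
    ("evaporation", "s1"), ("evaporation", "s2"), ("condensation", "s3"),
    ("erosion", "s4"), ("digestion", "s5"), ("hibernation", "s6"),
    ("camouflage", "s7"),
]
splits = process_folds(examples)
```

Every sentence of a given process lands in the same fold, so a model evaluated on a test fold has never seen role fillers for those processes.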
We also compared with a simple baseline that learned a mapping from PropBank roles produced by the EasySRL system to the process roles by using the roles and the verb as features. We also add the FrameNet frames invoked by the lexical unit in the sentence. Note this is essentially a subset of the features we use in our role classifier. As a second baseline, we compare with a (nearly) out-of-the-box application of SEMAFOR (Das et al., 2010), a FrameNet based frame-semantic parser. We modified SEMAFOR to override the frame identification step since the process frame information is already associated with the test sentences.
Table 6 compares performance of the different methods. The learned role mapping of shallow semantic roles performs better than SEMAFOR but worse than the simple role classifier. SEMAFOR uses a large set of features which help it scale for a diverse set of frames in FrameNet. However, many of these may not be well suited for the process sentences in our relatively smaller dataset. Therefore, we use our custom role classifier as a strong baseline to demonstrate within and cross sentence gains. Enforcing sentence-level consistency through joint
Figure 3 shows the precision/recall plots for the basic role classifier and within and cross sentence inference. Both inference models trade recall for gains in precision. Cross sentence inference yields higher precision at most recall levels, for a smaller overall loss in recall compared to within sentence inference (1.6 versus 4.9).
[Figure 3: Precision/recall plots for the basic role classifier, within sentence inference, and cross sentence inference. The y-axis is truncated at 0.7 to better visualize the differences.]
We also studied the effect of varying the number of arguments that the ILP uses to measure the compatibility of role assignments. Specifically, we allow inference to use just the top k alignments from the alignment classifier. Figure 4 shows the main trend. Using just the top similar argument already yields a 1 point gain in F1. Using more arguments tends to increase gains in general but with some fluctuations.
Finding an assignment that respects all compatibilities across many argument pairs can be difficult. As seen in the figure, at some of the shorter span lengths we see a slightly larger gain (+0.3) compared to using all spans. This hints at benefits of a more flexible formulation that makes joint decisions on alignment and role label assignments.
Table 9 shows an ablation of the alignment classifier features. Entailment of arguments is the most informative feature for argument alignment. Adding lexical and syntactic context compatibilities adds significant boosts in precision and recall. Knowing that the arguments are retrieved by the same query pattern (sentence feature) only provides minor improvements. Even though the overall classification performance is far from perfect, cross sentence inference can benefit from alignment as long as it provides a higher score for argument pairs that should align compared to those that should not.
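The top k restriction on alignments studied earlier in this section can be pictured as a pruning step applied to the alignment scores before inference. The sketch below assumes the pruning is done per argument and that scores are stored as a nested dictionary; both the data layout and the name `top_k_alignments` are hypothetical, and the text does not specify whether the top k are chosen per argument or globally.

```python
import heapq


def top_k_alignments(align_scores, k):
    """Keep only the k highest-scoring alignment partners of each argument.

    align_scores: dict mapping an argument to {other_argument: score}.
    """
    return {
        arg: dict(heapq.nlargest(k, partners.items(), key=lambda item: item[1]))
        for arg, partners in align_scores.items()
    }


# Made-up alignment scores for one argument.
scores = {"water": {"liquids": 0.9, "the sun": 0.1, "heat": 0.2}}
top1 = top_k_alignments(scores, 1)
top2 = top_k_alignments(scores, 2)
```

With `k=1` only the most similar partner of each argument contributes to the compatibility terms, which corresponds to the setting reported to already give a 1 point gain in F1.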

Error Analysis
We conduct an error analysis over a random set of 50 errors observed for cross sentence inference. In addition to issues from noisy web sentences and nested arguments from bad candidate extraction, we find the following main types of errors:
• Dissimilar role fillers (27.5%) - In some processes, the fillers for the result role have high levels of variability, which makes alignment error prone. For the process camouflage, for instance, the result roles include 'disorientation', 'protect from predator', 'remain undetected', etc.
• Bad role classifier scores (37.5%) - For some instances the role classifier assigns high scores to incorrect labels, effectively preventing the ILP from flipping to the correct role. For example, the argument that follows "causes" tends to be a result in many cases but not always, leading to high scoring errors. In the sentence with "...when heat from the sun causes water on earth's ...", the role classifier incorrectly assigns 'water' to a result role with high confidence.
• Improper weighting (7.5%) - Sometimes the ILP does not improve upon a bad top choice from the role classifier. In some of these cases, rather than the fixed lambda, a different weighted combination of role and alignment classifier scores would have helped the ILP to flip. For example, the argument 'under autumn conditions' from the sentence 'hibernation occurs when the insects are maintained under autumn conditions.' has a good role score and is similar to other correctly labeled enablers such as 'cold , winter conditions', and yet the ILP is unable to improve its label.
The rest (27.5%) are due to noisy web sentences, incorrect argument extraction, and errors outside the scope of cross sentence inference.

Conclusions
Simple role-based knowledge is essential for recognizing and reasoning about situations involving processes. In this work we developed a cross sentence inference method for automatically acquiring such role-based knowledge for new processes. The main idea is to enforce compatibility among roles extracted from sentences belonging to a single process. We find that the compatibility can be effectively assessed using an alignment classifier built without any additional supervision. Empirical evaluation on a process dataset shows that cross sentence inference using an Integer Linear Program helps improve the accuracy of process knowledge extraction.