Coarse Lexical Frame Acquisition at the Syntax–Semantics Interface Using a Latent-Variable PCFG Model

We present a method for unsupervised lexical frame acquisition at the syntax–semantics interface. Given a set of input strings derived from dependency parses, our method generates a set of clusters that resemble lexical frame structures. Our work is motivated not only by practical applications (e.g., building, or expanding the coverage of, lexical frame databases), but also by the goal of gaining linguistic insight into frame structures with respect to lexical distributions in relation to grammatical structures. We model our task using a hierarchical Bayesian network and employ tools and methods from latent-variable probabilistic context-free grammars (L-PCFGs) for statistical inference and parameter fitting, for which we propose a new split-and-merge procedure. We show that our model outperforms several baselines on a portion of the Wall Street Journal sentences that we have newly annotated for evaluation purposes.


Introduction
We propose a method for building coarse lexical frames automatically from dependency-parsed sentences, i.e., without using any explicit semantic information as training data. The task involves grouping verbs that evoke the same frame (i.e., are considered to be the head of this frame) and further clustering their syntactic arguments into latent semantic roles. Hence, our target structures stand between FrameNet (Ruppenhofer et al., 2016) and PropBank (Palmer et al., 2005) frames. Like FrameNet and in contrast to PropBank, we assume a many-to-many relationship between verb types and frame types. But like PropBank, we aim to cluster syntactic arguments into general semantic roles instead of FrameNet's frame-specific slot types. This allows us to generalize across frames with respect to semantic roles. As part of this, we study possible ways to automatically generate more abstract lexical-semantic representations from lexicalized dependency structures. (* Both authors contributed equally to this work.)
In our task, grouping verb tokens into frames requires not only distinguishing between different senses of verbs, but also identifying a range of lexical relationships (e.g., synonymy, antonymy, troponymy, etc.) among them. Hence, as in (Modi et al., 2012; Green et al., 2004), our problem definition differs from most work on unsupervised fine-grained frame induction via verb sense disambiguation (e.g., Kawahara et al., 2014; Peng et al., 2017). Similarly, forming role clusters generalizes over several alternate linkings between semantic roles and their syntactic realizations. Given, for instance, an occurrence of the verb 'pack' and its syntactic arguments, we aim not only to distinguish different senses of 'pack' (e.g., as used to evoke the FILLING frame or the PLACING frame), but also to group these instances of 'pack' with other verbs that evoke the same frame (e.g., to group instances of 'pack' that evoke the frame PLACING with instances of the verbs 'load', 'pile', 'place', and so on, when used to evoke the same PLACING frame).
The motivation for this work is twofold. On the one hand, the frame induction techniques we propose can be useful in the context of applications such as text summarization (Cheung and Penn, 2013), question answering (Frank et al., 2007;Shen and Lapata, 2007), and so on, for languages where we lack a frame-annotated resource for supervised frame induction, or to expand the coverage of already existing resources. On the other hand, we are interested in theoretical linguistic insights into frame structure. In this sense, our work is a step towards an empirical investigation of frames and semantic roles including hierarchical relations between them.
We cast the frame induction task as unsupervised learning using an L-PCFG (Johnson, 1998; Matsuzaki et al., 2005; Petrov et al., 2006; Cohen, 2017). As input, our model takes syntactic dependency trees and extracts input strings corresponding to instances of frame expressions, which are subsequently grouped into latent semantic frames and roles using an L-PCFG. We use the inside-outside (i.e., Expectation-Maximization (Dempster et al., 1977; Do and Batzoglou, 2008)) algorithm and a split-merge procedure (Petrov et al., 2006) for dynamically adapting the number of frames and roles to the data, for which we employ new heuristics. As implied, one advantage of the L-PCFG framework is that we can adapt and reuse statistical inference techniques developed for learning PCFGs in syntactic parsing applications (e.g., split-merge). Our experiments show that the method outperforms a number of baselines, including frame grouping by lexical heads and one based on agglomerative clustering.
The main contributions of this paper are a) using L-PCFGs for coarse lexical frame acquisition; b) a new split-merge routine adapted for this task; and c) a new dataset for evaluating the induced lexical frame-role groupings. In the remainder of the paper, § 2 describes our statistical model and its formalization as an L-PCFG. § 3 describes the procedures used for statistical inference. § 4 describes our evaluation dataset and reports experimental results. § 5 discusses related work, followed by a conclusion in § 6.

From a Latent Model to L-PCFG
We assume that frames and semantic roles are the latent variables of a probabilistic model. Given the probability mass function pmf(F_1, ..., F_n, R_1, ..., R_k, D_1, ..., D_m; C, θ) as our model, we denote latent frames F_i, 1 ≤ i ≤ n, and roles R_i, 1 ≤ i ≤ k, for observations that are annotated syntactically using D_i, 1 ≤ i ≤ m, in the input corpus C. Inspired by prior work, we approximate the probability of a specific frame f with head v, semantic roles r_1, ..., r_k filled by words w_1, ..., w_k, and corresponding syntactic dependencies d_1, ..., d_k (with parameters θ) as:

p(f, v, r_1, ..., r_k, w_1, ..., w_k, d_1, ..., d_k) = p(f) · p(v | f) · ∏_{i=1}^{k} p(r_i | f) · p(w_i | r_i) · p(d_i | r_i)    (1)

Figure 1: Sample frame structure for (1).

To estimate the parameters of our model, we translate Eq. (1) to an L-PCFG that captures the required conditional and joint probabilities.
First, we convert the input lexicalized dependency parses to a set of strings E. Given a verb v and its dependents w_i in a dependency parse tree, we build input strings of the form v root w_1 d_1 ... w_l d_l EOS, where we assume that v lexicalizes the head of a frame, w_1, ..., w_l are the argument fillers, and d_1, ..., d_l are the respective dependencies that link these fillers to the head v; EOS is a special symbol marking the end of the string. For the step from the sentence to input strings, we assume that dependencies are ordered (e.g., subj precedes dobj and iobj, and these precede prepositional and complement dependents, i.e., nmod:* and *comp).¹ Consider (1) as an example; the corresponding string is the yield of the tree in Fig. 1.
(1) John_subj offer_root flowers_dobj to Mary_nmod:to

Given the fixed structure of input strings, we design a CFG that rewrites them to our expected hierarchical frame structure consisting of the elements F, R, and D while capturing the conditional probabilities from Eq. (1). The tree assigning a frame F of type x with semantic roles of type a, b, and c to (1) is given in Fig. 1. More generally, given finite sets of frames F and of semantic roles R, our underlying CFG G = ⟨N, T, P, S⟩ is as follows:

• T = T_v ∪ T_n ∪ D ∪ {root, EOS}, where T_v is the set of possible verbal heads, T_n is the set of possible lexicalizations (fillers) for arguments, and D is a finite set of dependency relations; root and EOS are special symbols.

¹ Phrasal arguments are reduced to their syntactic heads as given by the underlying UD parser. We normalize passive structures by replacing nsubjpass with dobj. Other syntactic dependents (e.g., case, aux, conj, etc.) are removed. In the case of identical dependencies, surface order in the sentence is relevant. If necessary, conjunctions are treated when transforming dependency parses to input strings.
• P contains rules that rewrite S to the frame head and the remainder of the frame (e.g., S → F_h^f F_rem^f), rules that lexicalize the head (V_f → v), and rules that rewrite role pre-terminals R_r to fillers and their dependencies.

With this grammar, an input string derived from a dependency-parsed sentence fully determines the shape of the tree, and the node labels are fixed except for the choice of the frame f and the semantic roles r of the k fillers (i.e., x, a, b, and c in Fig. 1).
The probabilities of the rules correspond to the conditional probabilities in Eq. (1). The probability of S → F_h^f F_rem^f gives p(F = f), the probability of V_f → v gives p(V = v | F = f), and so on. During the subsequent inside-outside (IO) split-and-merge training procedure, the inventory of frames and roles and the probabilities of our rules are estimated so that the overall likelihood of the observations is maximized.
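The string extraction of this section can be sketched as follows. All function names, the relation-ordering table, and the tie handling are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch of the input-string extraction described in § 2.
# A parse is assumed to be given as a verb plus (filler, relation) pairs.

# Core arguments come first, in this fixed order; prepositional and
# complement dependents (nmod:*, *comp) and anything else follow.
CORE_ORDER = {"subj": 0, "dobj": 1, "iobj": 2}

def relation_rank(relation):
    """Core relations keep a fixed order; all others come after them."""
    return CORE_ORDER.get(relation, len(CORE_ORDER))

def to_input_string(verb, dependents):
    """Build 'v root w1 d1 ... wl dl EOS' from a verb and its dependents.

    `dependents` is a list of (filler, relation) pairs in surface order;
    sorted() is stable, so equal-rank dependents keep that order.
    """
    ordered = sorted(dependents, key=lambda fd: relation_rank(fd[1]))
    tokens = [verb, "root"]
    for filler, relation in ordered:
        tokens.extend([filler, relation])
    tokens.append("EOS")
    return " ".join(tokens)
```

For example (1), `to_input_string("offer", [("flowers", "dobj"), ("John", "subj"), ("Mary", "nmod:to")])` yields `"offer root John subj flowers dobj Mary nmod:to EOS"`.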

Method
This section describes statistical inference methods used for inducing latent frame clusters from input strings. The scenario we used for parameter fitting (split, merge, smoothing, and generating clusters) is described in § 3.1. In § 3.2, we describe our method for computing embedding-based similarity between frames, which we use during the merge process and in our baseline system.

Parameter Fitting
Given an input corpus parsed into Universal Dependencies (UD) and converted into a set of input strings E, we instantiate a model G according to § 2. We set |F| = 1 and |R| = |D|; D, T_v, and T_n are automatically extracted from E. Starting from this, we iteratively perform split-merge sequences (with an IO parameter estimation in between) and cluster E into disjoint subsets E_i by finding the most likely derivations that G yields. We detail this process in the following subsections.

The IO Algorithm
As a solution to the sparsity of observations, we modify the IO algorithm slightly. We adopt the procedures described in (Eisner, 2016), with the exception that for computing inside and outside probabilities, instead of mapping terminals to non-terminals using an exact matching of the right-hand sides of the rules (and, respectively, their assigned parameters), we use embedding-based similarities. That is, for computing inside probabilities, given a terminal a as input, we do not only consider rules A → a and assert their parameters θ in the parse chart; we also consider rules B → b for other terminals b, asserting α × θ in the chart, where α is the r² coefficient of determination between the embeddings of a and b. During the outside procedure, the θs are updated proportionally w.r.t. the αs used during the inside procedure.
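The modified pre-terminal step of the inside pass can be sketched as below. This is a minimal illustration under our own assumptions (a flat lexicon of (non-terminal, word) → θ entries, and r² read as the squared Pearson correlation of the two embedding vectors), not the authors' implementation:

```python
import numpy as np

def r2(u, v):
    """Squared Pearson correlation between two embedding vectors."""
    c = np.corrcoef(u, v)[0, 1]
    return float(c * c)

def inside_preterminal(token, lexicon, embeddings):
    """Chart cell for one terminal.

    Besides exact rules A -> token (asserting their parameter theta), we also
    admit rules B -> b for other words b, weighted by alpha = r2(token, b).
    `lexicon` maps (nonterminal, word) to theta; `embeddings` maps words to
    vectors.
    """
    cell = {}
    for (nonterminal, word), theta in lexicon.items():
        if word == token:
            cell[nonterminal] = cell.get(nonterminal, 0.0) + theta
        elif token in embeddings and word in embeddings:
            alpha = r2(embeddings[token], embeddings[word])
            cell[nonterminal] = cell.get(nonterminal, 0.0) + alpha * theta
    return cell
```

An unseen but distributionally similar word thus inherits (down-weighted) probability mass from the rules of its neighbors, which is the intended remedy for observation sparsity.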

Split
We alter the splitting procedure of (Klein and Manning, 2003a,b; Petrov et al., 2006) for our application. There, during a split, a non-terminal symbol (which represents a random variable in the underlying probabilistic model) is split and its related production rules are duplicated independently of its parent or sibling nodes. We can apply such a context-independent split only to the R_r non-terminals; the F^f symbols must be split jointly with the sibling nodes that define the frame structure. Therefore, to split frame x into two frames y and z, we replace the entire set of rules that mention x (e.g., S → F_h^x F_rem^x and V_x → v) with two similar sets in which x is replaced with y and z, respectively. The parameters of the new rules are set to half of the value of the parameters of the rules they originated from, with the addition (or subtraction) of a small random value (e.g., 1e-7) to break symmetry.
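The frame-dependent split can be sketched as follows. The rule encoding (frame tags like `[x]` embedded in symbol names) is our own illustrative assumption:

```python
import random

def split_frame(rules, old, new_tags, epsilon=1e-7, rng=random):
    """Split frame `old` into the frames in `new_tags`.

    Every rule whose left- or right-hand side mentions the old frame tag is
    duplicated for each new tag, with half the original parameter plus a tiny
    random perturbation to break symmetry. Other rules are kept unchanged.
    `rules` maps (lhs, rhs) string pairs to parameters.
    """
    out = {}
    for (lhs, rhs), theta in rules.items():
        if old in lhs or old in rhs:
            for tag in new_tags:
                key = (lhs.replace(old, tag), rhs.replace(old, tag))
                out[key] = theta / 2.0 + rng.uniform(-epsilon, epsilon)
        else:
            out[(lhs, rhs)] = theta
    return out
```

Note that the whole set of frame-x rules is replaced at once, reflecting the dependence between the sibling symbols that make up a frame.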
Moreover, in our application, training the split grammar on the whole input is ineffective and, at a certain point, computationally intractable. The problem is due to the scope of the split grammar and the large portion of input strings in E that it spans. Splitting a frame's rules, unlike parameter fitting for syntactic rules, increases the number of possible derivations for all of E, to the extent that after a number of split iterations the computation of the derivations becomes intractable even for a small E of short strings. We address this problem with a new splitting strategy: we split not only the grammar, but also the input training data.
Before each split, we cluster the input strings E into the clusters E_i that G gives (§ 3.1.4) at that point. For the input strings in each cluster E_i, we instantiate a new G_i and perform parameter fitting and splitting independently of the other E_i s. The corresponding probabilistic G_i is initialized by assigning random parameters to its rules and then smoothing them (§ 3.1.3) with the fitted parameters of G. We apply this process several times, until the number of independently generated clusters is at least twice as large as |T_v|. At the end of each split iteration, we collect the elicited E_i clusters (and their respective G_i s) for the next merge process. Given the independence assumption between roles and frames, pre-terminals that rewrite roles are split similarly to (Petrov et al., 2006).

Smoothing
We apply a notion of smoothing by interpolating parameters obtained in the (n−1)th iteration of split-merge with parameters that are randomly initialized at the beginning of each split-merge iteration and, as mentioned in § 3.1.2, when deriving new G_i s from G: for each rule in G_i or G_n (i.e., the G instantiated for the next split-merge iteration) with parameter θ, we smooth θ using θ′ = αθ + (1 − α)θ_{n−1}, where θ_{n−1} is the already known, fitted parameter of the corresponding rule in G. We choose α = 0.1.
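The interpolation above amounts to a one-liner per rule; the fallback for rules without a fitted counterpart is our own assumption:

```python
def smooth_parameters(random_params, fitted_params, alpha=0.1):
    """theta' = alpha * theta + (1 - alpha) * theta_{n-1}.

    Interpolate freshly (randomly) initialized rule parameters with the
    fitted parameters of the corresponding rules from the previous
    split-merge iteration. Rules with no fitted counterpart keep their
    random value (an assumption made for this sketch).
    """
    return {rule: alpha * theta + (1.0 - alpha) * fitted_params.get(rule, theta)
            for rule, theta in random_params.items()}
```

With α = 0.1, the previous iteration dominates, so the random initialization acts only as a mild perturbation of the fitted model.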

Generating Clusters from G
After fitting the parameters of G, the frame structure for an input string is given by its most likely Viterbi derivation with respect to G. The verb that is rewritten by F_h^f is placed in frame type/cluster f. Similarly, lexical items that are argument fillers are assigned to type/cluster r, where r is the structural annotation of the pre-terminal R_r that rewrites them. For example, assuming Fig. 1 is the most likely derivation for (1), the verb 'offer' is categorized as frame x and its arguments as roles a, b, and c.

Merge
The model resulting from the split process generates a relatively large number of 'homogeneous' clusters that are 'incomplete'. A subsequent merge process unifies these homogeneous-but-incomplete clusters to achieve a clustering that is both homogeneous and complete. To this end, we use heuristics based both on the estimated loss in likelihood from merging two symbols that span the same input sequence (as proposed by Petrov et al. (2006)) and on 'discriminative similarities' between the obtained clusters.
Merge by likelihood does not work: The merge heuristic of (Petrov et al., 2006) (i.e., minimizing the loss in training likelihood using 'locally' estimated inside and outside probabilities) is based on the assumptions that a) pre-terminals appearing in different places in derivations are nearly independent, and b) their approximation (according to the method proposed by Petrov et al. (2006)) requires less computation than computing full derivation trees. Neither holds in our case: a) most pre-terminals in our model depend on each other, and b) computing the loss in likelihood from a cluster merge requires computing full derivation trees (given the interdependence between the pre-terminals that define frame structures). More importantly, in our application, the outside probabilities for clusters are always 1.0, and differences in the sums of inside probabilities are often negligible, since input strings are spanned more or less by the same set of derivations. For these reasons (i.e., computation cost and the lack of sufficient statistics), the 'estimated loss in likelihood' heuristic is futile for guiding the merge process in our application. We resolve this problem by leveraging discriminative similarities between the obtained frame clusters and proposing a hybrid method.
Our merge approach: At the beginning of a merge process, we conflate the G_i s obtained from the previous split procedure to form a G that spans all input strings. Where applicable, we set the parameters of rules in G to the arithmetic mean of the corresponding ones obtained from the split process and normalize them such that the parameters of rules with the same pre-terminal sum to 1.0.
We then iterate the following process. Using the method proposed in § 3.2 below, the frame instances in the clusters are converted to tensors, and similarities among them are computed. Motivated by the assumption that split clusters are homogeneous, for every pair of clusters c_x and c_y (x ≠ y) with instances a_i ∈ c_x and b_j ∈ c_y, we find max_{i,j} sim(a_i, b_j) and min_{i,j} sim(a_i, b_j) (sim is given by Eq. 2 below) and take their harmonic mean as the similarity s_c between c_x and c_y. Cluster pairs are sorted in descending order of s_c. Given a threshold δ, for all pairs with s_c(c_x, c_y) > δ, the corresponding production rules (i.e., the similar sets of rules mentioned in the split procedure) are merged and their parameters are updated to the arithmetic mean of those of their origin rules.
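The s_c computation and the thresholded pair selection can be sketched as below (a minimal illustration assuming non-negative similarities; function names are ours):

```python
def cluster_similarity(cx, cy, sim):
    """s_c: harmonic mean of the highest and the lowest pairwise similarity
    between the instances of two clusters."""
    pairs = [sim(a, b) for a in cx for b in cy]
    hi, lo = max(pairs), min(pairs)
    return 0.0 if hi + lo == 0.0 else 2.0 * hi * lo / (hi + lo)

def mergeable_pairs(clusters, sim, delta):
    """All cluster pairs whose s_c exceeds the threshold delta,
    sorted in descending order of s_c."""
    scored = [((i, j), cluster_similarity(clusters[i], clusters[j], sim))
              for i in range(len(clusters))
              for j in range(i + 1, len(clusters))]
    return sorted([p for p in scored if p[1] > delta],
                  key=lambda p: p[1], reverse=True)
```

The harmonic mean is dominated by the smaller of the two extremes, so two clusters only score high when even their least similar members are close, in line with the homogeneity assumption.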
Parameters for this new merged G are updated through a few IO iterations (in an incremental fashion (Liang and Klein, 2009)), and finally G is used to obtain a new clustering. The process is repeated on the newly obtained clustering until all resulting cluster-wise s_c similarities fall below a threshold β.
Computing all derivations for each input string is time-consuming and makes the merge process computationally expensive, particularly in the first few iterations. We resolve this issue using stratified random sampling, performing the aforementioned iterations only on a random subset of the input strings in each cluster. Each cluster in the output of the split process is taken as a stratum and its size is reduced by 90% through random sampling; the random sample is redrawn in each iteration (we use a similar strategy for parameter estimation, i.e., we update the samples in each estimation iteration). This reduces the time required for the merge drastically without hurting its overall outcome. It is worth mentioning that after merging clusters c_x and c_y, the output does not necessarily contain a cluster c_x ∪ c_y. Instead, the resulting clustering reflects the effect of merging the rules that rewrite c_x and c_y in the whole model.
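The stratified sampling step is straightforward; the at-least-one-instance floor is an assumption we add so that tiny strata never vanish:

```python
import random

def stratified_sample(clusters, keep=0.1, rng=random):
    """Draw a fresh stratified sample: each cluster from the split output is
    a stratum, reduced to ~10% of its members (but at least one instance).
    Call again in each iteration to redraw the sample."""
    return [rng.sample(stratum, max(1, int(len(stratum) * keep)))
            for stratum in clusters]
```

Because every stratum stays represented, the sampled data preserves the cluster structure that the merge heuristics rely on, at a tenth of the cost.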
To merge the R_r categories, we use the merge method of (Petrov et al., 2006), based on the obtained likelihoods. After merging frame clusters, we reduce the number of R_r s by 50%. Since our method for merging role categories closely follows (Petrov et al., 2006), we do not describe it further here.

Similarity Between Frame Instances
When necessary (such as during merge), we compute embedding-based similarities between frame instances, similar to the methods proposed in (Mitchell and Lapata, 2008; Clark, 2013). We build an n-dimensional embedding for each word appearing in our input strings from large web corpora. Each frame instance is then represented by an (m + 1, n)-tensor, in which m is the total number of argument types/clusters given by our model at its current stage and n is the dimensionality of the embeddings that represent the words filling these arguments. To this we add the embedding of the verb that lexicalizes the head of the frame, which gives us the final (m + 1, n)-tensor.
For two frame instances represented by tensors a and b, the similarity of their arguments is

sim_args(a, b) = (1/k) Σ_i r²(v_i^a, v_i^b),

in which the v_i are the embeddings of the ith argument filler (only argument slots with nonzero embeddings in both instances, i.e., Σ_{j=1}^{n} v_j^i ≠ 0, are considered), r² is the coefficient of determination, and k is the number of such shared argument slots. If an argument is lexicalized by more than one filler, we replace r² with the arithmetic mean of the r²s computed over each distinct pair of fillers. The overall similarity between a and b is

sim(a, b) = w_1 · r²(v_h^a, v_h^b) + w_2 · sim_args(a, b),    (2)

where the v_h are the embeddings of the lexical heads (i.e., verbs), and w_1 and w_2 are two tunable hyper-parameters.
For instance, for the two hypothetical structures F_a: [Head: travel, [Arg_1: John, Arg_2: London]] and F_b: [Head: walk, [Arg_1: Mary, Arg_3: home]], the similarity between F_a and F_b is w_1 r²(travel, walk) + w_2 r²(John, Mary), given that all these vectors have at least one nonzero component. During merge we use w_1 = w_2 = 0.50.
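A sketch of this similarity under our reading of Eq. (2) (r² taken as the squared Pearson correlation; slot labels and function names are illustrative assumptions):

```python
import numpy as np

def r2(u, v):
    """Squared Pearson correlation between two embedding vectors."""
    c = np.corrcoef(u, v)[0, 1]
    return float(c * c)

def frame_similarity(head_a, args_a, head_b, args_b, w1=0.5, w2=0.5):
    """Head similarity plus the mean r2 over argument slots that are filled
    (with nonzero embeddings) in both instances.

    `args_a`/`args_b` map slot labels (e.g. 'Arg1') to embedding vectors.
    """
    shared = [s for s in args_a
              if s in args_b and np.any(args_a[s]) and np.any(args_b[s])]
    arg_sim = (sum(r2(args_a[s], args_b[s]) for s in shared) / len(shared)
               if shared else 0.0)
    return w1 * r2(head_a, head_b) + w2 * arg_sim
```

In the travel/walk example above, only Arg_1 is shared, so the argument term reduces to r²(John, Mary), matching the text.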
We build our lexical embeddings of dimensionality n = 900 using the hash-based embedding learning technique proposed in (QasemiZadeh and Kallmeyer, 2017); before using these embeddings, we weight them using positive pointwise mutual information (PPMI). During evaluation, this combination of PPMI-weighted hash-based embeddings and the r² estimator consistently yielded better results than other popular choices such as the cosine of word2vec vectors. We attribute this observation to the imbalanced frequencies of the lexical items used in our experiments in the corpora used to train the embeddings (i.e., an English web corpus (Schäfer, 2015) and the PTB's WSJ).

Dataset
We derive our evaluation data from the PTB's WSJ sections, parsed (using Schuster and Manning, 2016) into the enhanced UD format. We augment these sentences with semantic role annotations obtained from the Prague Semantic Dependencies (PSD) (Cinkova et al., 2012) in the SDP resource (Oepen et al., 2016). Using EngVallex (Cinková et al., 2014) and SemLink (Bonial et al., 2013), we semi-automatically annotate verbs with FrameNet frames (Baker et al., 1998). We choose 1k random sentences and manually verify the semi-automatic mappings to eventually build our evaluation dataset of approximately 5k instances (all). From this data, we use a random subset of 200 instances (dev) during development and for parameter tuning (see Table 1). For these gold instances, we extract input strings from their UD parses according to § 2. Since we discard verbs without syntactic arguments, use automatic parses, and do not distinguish arguments from adjuncts, the input strings do not exactly match the gold-data argument structures. We report results only for the portion of the gold data that appears in the extracted input strings. Table 2 reports the statistics of the induced input strings and their agreement with the gold data (in terms of precision and recall).
The input strings are hard to cluster in the sense that a) all frames are lexicalized by at least two different verb lemmas, b) many verbs lexicalize at least two different types of frames, c) the verb lemmas that lexicalize a frame have long-tailed distributions, i.e., a large proportion of the instances of a frame are realized at the surface by one lemma while in the remaining instances the frame is evoked by different lemmas, and d) last but not least, the frame types themselves have a long-tailed distribution. Table 3 shows examples of frames and the verb lemmas that lexicalize them; in the table, the most frequent lemma for each frame type is italicized.

Table 2: Input strings extracted from the UD parses: GR, AIG, R_F, R_A, and P_A denote, respectively, the number of distinct grammatical relations, the number of syntactic arguments that are a semantic role in the gold data, recall for frames and arguments, and precision for arguments. The remaining symbols are as in Table 1.

Evaluation Measures
We evaluate our method's performance on a) clustering input strings into frame types, and b) clustering syntactic arguments into semantic role types.
To this end, we report the harmonic mean of B-Cubed precision and recall (BCF) (Bagga and Baldwin, 1998), as well as purity (PU), inverse purity (IPU), and their harmonic mean (FPU) (Steinbach et al., 2000) as figures of merit. These measures reflect a notion of similarity between the distribution of instances in the obtained clusters and in the gold/evaluation data based on certain criteria, and on their own may lack sufficient information for a fair understanding of a system's performance. While PU and IPU are easy to interpret (by an analogy with precision and recall in classification tasks), they may be deceiving under certain conditions (as explained by Amigó et al., 2009, under the notions of the homogeneity, completeness, rag-bag, and 'size vs. quantity' constraints). Reporting BCF alongside FPU ensures that these pitfalls are not overlooked when our system's output is compared quantitatively with the baselines.
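PU, IPU, and FPU can be computed as in the following minimal sketch (B-Cubed is omitted; the data layout, a list of clusters plus a gold-label map, is our own assumption):

```python
from collections import Counter

def purity(system, gold):
    """Purity: fraction of instances falling into their cluster's majority
    gold class. `system` is a list of clusters (lists of item ids); `gold`
    maps item ids to gold class labels."""
    total = sum(len(c) for c in system)
    hits = sum(Counter(gold[i] for i in c).most_common(1)[0][1] for c in system)
    return hits / total

def f_purity(system, gold):
    """FPU: harmonic mean of purity and inverse purity. Inverse purity
    swaps the roles of system clusters and gold classes."""
    classes = {}
    for item, label in gold.items():
        classes.setdefault(label, []).append(item)
    membership = {i: ci for ci, c in enumerate(system) for i in c}
    pu = purity(system, gold)
    ipu = purity(list(classes.values()), membership)
    return 2.0 * pu * ipu / (pu + ipu)
```

As the text notes, a degenerate clustering can still score high on one of the two components (e.g., one-cluster-per-instance has PU = 1), which is why PU and IPU are combined and reported alongside BCF.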

Baselines
As baselines, we report the standard all-in-one-class clustering (ALLIN1) and one-cluster-per-instance (1CPERI) baselines, as well as a random baseline (R_n) in which instances are randomly partitioned into n clusters (n being the number of clusters generated in our system's output). Moreover, for frame type clustering, we report a one-cluster-per-lexical-head baseline (1CPERHEAD). For role clustering, we report an additional one-cluster-per-syntactic-category baseline (1CPERGR). Similar to the most-frequent-sense baseline in word sense induction and disambiguation, the latter two (1CPERHEAD and 1CPERGR) are particularly hard to beat given the heavy-tailed distribution of lexical items in frame and role categories.

Table 3: Examples of frames in our evaluation set and verbs that evoke them; #T and #V denote the total number of instances for the frame and the number of distinct verb lemmas that evoke them, respectively.
For both subtasks, an additional baseline from (Modi et al., 2012) and from the system we mark TK-URL below would allow an interesting comparison of our method with the state of the art in frame head clustering and unsupervised semantic role labeling, particularly given that these systems employ Gibbs sampling for statistical inference, whereas we use the IO algorithm. Unfortunately, we were not able to access the code for Modi et al. (2012), and the TK-URL system relies on features engineered for treebanks in the format and formalisms of the CoNLL-2008 shared task. As explained by Oepen et al. (2016), mapping between the formalisms used in CoNLL-2008 and those of SDP-PSD (used in this paper) is a non-trivial task. We expect that an automatic conversion of our data to the CoNLL-2008 format as input for this system would not reflect the best performance of their method. Nonetheless, we report the result of this experiment (marked as TK-URL) later in this section, not as a baseline, but to confirm the observation of Oepen et al. (2016).
Lastly, as an extra baseline for frame type clustering, we report the performance of a hierarchical agglomerative clustering (HAC) method, described below (§ 4.3.1).

A Baseline HAC Method
To build a frame clustering using HAC, we begin by initializing one cluster per instance and iteratively merge the pair of clusters with the lowest distance, using the average-link cluster distance. For two clusters A and B with instances f_1, ..., f_l, where l = |A| + |B|, we define their distance as

d(A, B) = 1 − (2 / (l(l − 1))) Σ_{i<j} sim(f_i, f_j),

in which sim(f_i, f_j) is given by Eq. 2. We iteratively update the distance matrix and agglomerate clusters until we reach a single cluster. During the iterations, we keep track of the merges/linkages, which we later use to flatten the built hierarchy into q clusters. To set our baseline, constraining w_1 + w_2 = 1 in Eq. (2), we build cluster hierarchies for different w_1 and w_2 (starting with w_1 = 0.0, w_2 = 1 − w_1 and gradually increasing w_1 by 0.1 until w_1 = 1.0) and find the w_1, w_2, and q that yield the 'best' clustering according to the BCF metric (w_1 = 0.8, w_2 = 0.2, q = 140). For this baseline, argument types are defined by their syntactic relation to their heads, e.g., subj, dobj, and so on.
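A minimal sketch of average-link HAC follows. It uses the standard cross-cluster average of pairwise similarities; the exact distance normalization in the baseline may differ, and the O(n^3)-ish brute-force search is for illustration only:

```python
def average_link_hac(items, sim, q):
    """Bottom-up agglomerative clustering.

    Start with one cluster per item and repeatedly merge the pair with the
    highest average pairwise similarity (i.e., the lowest average-link
    distance), stopping once q clusters remain.
    """
    clusters = [[x] for x in items]
    while len(clusters) > q:
        best, best_pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = (sum(sim(a, b) for a in clusters[i] for b in clusters[j])
                     / (len(clusters[i]) * len(clusters[j])))
                if best is None or s > best:
                    best, best_pair = s, (i, j)
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters
```

Flattening at a chosen q corresponds to cutting the recorded linkage hierarchy, as done when tuning q against BCF.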

Results
Since our method involves stochastic decisions, its performance varies slightly from run to run. Hence, we report the mean and standard deviation of the performance over 4 independent runs. The reported results are based on the output of the system after 7 split-merge iterations. After tuning parameters on the dev set, we choose δ = 0.55 during merge and, in each inner merge iteration, decrease δ by 0.01 until δ < β = 0.42.
Quantitative comparison with baselines: Tables 4 and 5 show the results for clustering input strings into frame types and semantic roles, respectively. On frame type clustering, our method (denoted L-PCFG) outperforms all the baselines. FPU and BCF for our system are simultaneously higher than for all the baselines, which indicates that the output contains only a small proportion of 'rag-bag' clusters. The system, however, tends to generate many incomplete yet homogeneous clusters (as we discuss below). With respect to roles, the method's performance and its output remain very similar to the syntactic baseline (BCF = 97.3).
What is in the clusters? The ability of the system to successfully cluster instances varies from one gold frame category to another. The most problematic cases are the frame types ACTIVITY_START and PROCESS_START, as well as PLACING. While the system puts instances of the verbs 'start' and 'begin' in one cluster, it fails to distinguish between ACTIVITY_START and PROCESS_START. Regarding the PLACING frame, the system places the verbs that evoke this frame in different clusters, each consisting of instances of one verb lemma. In our opinion, this is due to frequent idiomatic usages of these verbs, e.g., 'to lay claim', 'to place listing', 'to position oneself as', and so on. That said, the system is capable of distinguishing between different readings of polysemous verbs; e.g., instances of the verb 'pack' that evoke the FILLING frame end up in different clusters than those evoking the PLACING frame. Additionally, for a number of frame types, we observe that the system can successfully group synonymous (and antonymous) verbs that evoke the same frame into one cluster: representative examples are the instances of the CHANGE_POSITION_ON_A_SCALE frame that are evoked by different verb lemmas such as 'decline', 'drop', 'fall', 'gain', 'jump', 'rise', and so on, which all end up in one cluster. The output of the system also contains a large number of small clusters (consisting of only one or two instances); we observe that these instances usually have wrong (or incomplete) dependency parses.

Related Work
Our work differs from most work on word sense induction (WSI), e.g., (Goyal and Hovy, 2014; Lau et al., 2012; Manandhar et al., 2010; Van de Cruys and Apidianaki, 2011), in that we not only discern different senses of a lexical item but also group the induced senses into more general meaning categories (i.e., FrameNet's groupings). Hence, our model must be able to capture lexical relationships other than polysemy, e.g., synonymy, antonymy (opposite verbs), troponymy, etc. However, our method can be adapted to WSI, too. First, we can assume that word senses are 'incompatible' and thus necessarily evoke different frames; the induced frame clusters can then be seen directly as clusters of word senses. Alternatively, the proposed method can be adapted for WSI by altering its initialization, e.g., by building one model at a time for each word form (i.e., simply altering the input).
Despite similarities between our method and those previously proposed for unsupervised semantic role induction (Carreras and Marquez, 2005; Lang and Lapata, 2010, 2011; Swier and Stevenson, 2004), our method differs from them in that we attempt to include frame head grouping information when inducing the roles associated with frames. In other words, these methods leave the problem of sense/frame grouping out of their models.
Our work differs in objective from methods for unsupervised template induction in information extraction (IE) (e.g., MUC-style frames in Chambers and Jurafsky (2009, 2011) and later refinements such as (Chambers, 2013; Balasubramanian et al., 2013)), and, in a broader sense, from attempts at ontology learning and population from text (Cimiano et al., 2005). Our focus is on lexicalized elementary syntactic structures, identifying lexical semantic relationships, and thereby finding salient patterns at the syntax–semantics interface. In IE tasks, by contrast, the aim is to build structured summaries of text. Therefore, the pre- and post-processing in these induction models are often more complex than, or different from, ours (e.g., they require anaphora resolution, identification of discourse relations, etc.). Lastly, we deal with a broader set of verbs and domains and with more general frame definitions than these methods.
As stated earlier, Modi et al. (2012) propose the work most similar to ours. They adapt an earlier unsupervised role induction model to learn FrameNet-style head and role groupings. Modi et al. (2012) assume roles to be frame-specific, while our role clusters are defined independently of frame groupings (as expressed in Eq. 1). Last, with respect to research such as (Pennacchiotti et al., 2008; Green et al., 2004), in which lexical resources such as WordNet are used (in supervised or unsupervised settings) to refine and extend existing frame repositories such as FrameNet, our model learns and bootstraps a frame repository in an unsupervised way from text annotated only with syntactic structure.

Conclusion
We proposed an unsupervised method for coarse lexical frame induction from dependency-parsed sentences using an L-PCFG. We converted lexicalized dependency trees of sentences into a set of input strings of fixed, predetermined structure, consisting of a verbal head, its arguments, and their syntactic dependencies. We then used a CFG model (subsequently an L-PCFG) to capture frame structures from these strings. We adapted EM parameter estimation techniques from PCFGs while relaxing independence assumptions, including appropriate methods for splitting and merging frames and semantic roles, and used word embeddings for better generalization. In empirical evaluations, our model outperforms several baselines.