Frame-Based Continuous Lexical Semantics through Exponential Family Tensor Factorization and Semantic Proto-Roles

We study how different frame annotations complement one another when learning continuous lexical semantics. We learn the representations from a tensorized skip-gram model that consistently encodes syntactic-semantic content better than baselines, with multiple gains of over 10%.


Introduction
Consider "Bill" in Fig. 1: what is his involvement with the words "would try," and what does this involvement mean?Word embeddings represent such meaning as points in a real-valued vector space (Deerwester et al., 1990;Mikolov et al., 2013).These representations are often learned by exploiting the frequency that the word cooccurs with contexts, often within a user-defined window (Harris, 1954;Turney and Pantel, 2010).When built from large-scale sources, like Wikipedia or web crawls, embeddings capture general characteristics of words and allow for robust downstream applications (Kim, 2014;Das et al., 2015).
Frame semantics generalizes word meaning to the analysis of structured, interconnected, and labeled "concepts" and abstractions (Minsky, 1974; Fillmore, 1976, 1982). These concepts, or roles, implicitly encode expected properties of that word. In a frame semantic analysis of Fig. 1, the segment "would try" triggers the ATTEMPT frame, filling the expected roles AGENT and GOAL with "Bill" and "the same tactic," respectively. While frame semantics provides a structured form for analyzing words with crisp, categorically labeled concepts, the encoded properties and expectations are implicit. What does it mean to fill a frame's role?
[Figure 1: A simple frame analysis. In "She said Bill would try the same tactic again," the segment "would try" triggers the ATTEMPT frame, with AGENT filled by "Bill" and GOAL filled by "the same tactic."]

Semantic proto-role (SPR) theory, motivated by Dowty (1991)'s thematic proto-role theory, offers an answer to this. SPR replaces categorical roles with judgments about multiple underlying properties of what is likely true of the entity filling the role. For example, SPR asks how likely it is for Bill to be a willing participant in the ATTEMPT. The answers to this and other simple judgments characterize Bill and his involvement.
Since SPR both captures the likelihood of certain properties and characterizes roles as groupings of properties, we can view SPR as representing a type of continuous frame semantics.
We are interested in capturing these SPR-based properties and expectations within word embeddings. We present a method that learns frame-enriched embeddings from millions of documents that have been semantically parsed with multiple different frame analyzers (Ferraro et al., 2014). Our method leverages Cotterell et al. (2017)'s formulation of Mikolov et al. (2013)'s popular skip-gram model as exponential family principal component analysis (EPCA) and tensor factorization. This paper's primary contributions are: (i) enriching learned word embeddings with multiple, automatically obtained frames from large, disparate corpora; and (ii) demonstrating that these enriched embeddings better capture SPR-based properties. In so doing, we also generalize Cotterell et al.'s method to arbitrary tensor dimensions, allowing us to include an arbitrary amount of semantic information when learning embeddings. Our variable-size tensor factorization code is available at https://github.com/fmof/tensor-factorization.


Frames and Proto-Roles
The frame semantics currently used in NLP have a rich history in the linguistic literature. Fillmore (1976)'s frames are based on a word's context and the prototypical concepts that an individual word evokes; they intend to represent the meaning of lexical items by mapping words to real-world concepts and shared experiences. Frame-based semantics have inspired many semantic annotation schemata and datasets, such as FrameNet (Baker et al., 1998), PropBank (Palmer et al., 2005), and VerbNet (Schuler, 2005), as well as composite resources (Hovy et al., 2006; Palmer, 2009; Banarescu et al., 2012).

Thematic Roles and Proto-Roles
These resources map words to their meanings through discrete, categorically labeled frames and roles; sometimes, as in FrameNet, the roles can be very descriptive (e.g., the DEGREE role for the AFFIRM OR DENY frame), while in other cases, as in PropBank, the roles can be quite general (e.g., ARG0). Regardless of the actual schema, the roles are based on thematic roles, which map a predicate's arguments to a semantic representation that makes various semantic distinctions among the arguments (Dowty, 1989). Dowty (1991) claims that thematic role distinctions are not atomic, i.e., they can be deconstructed and analyzed at a lower level. Instead of many discrete thematic roles, Dowty (1991) argues for proto-thematic roles, e.g., PROTO-AGENT rather than AGENT, where distinctions in proto-roles are based on clusterings of logical entailments. That is, PROTO-AGENTs often have certain properties in common, e.g., manipulating other objects or willingly participating in an action; PROTO-PATIENTs are often changed or affected by some action. By decomposing the meaning of roles into properties or expectations that can be reasoned about, proto-roles can be seen as including a form of vector representation within structured frame semantics.

Continuous Lexical Semantics
Word embeddings represent word meanings as elements of a (real-valued) vector space (Deerwester et al., 1990). Mikolov et al. (2013)'s word2vec methods, skip-gram (SG) and continuous bag of words (CBOW), repopularized these methods. We focus on SG, which predicts the context i around a word j, with learned representations $c_i$ and $w_j$, respectively, as

$p(\text{context } i \mid \text{word } j) \propto \exp(c_i^\top w_j) = \exp\left(\mathbf{1}^\top (c_i \odot w_j)\right),$

where $\odot$ is the Hadamard (pointwise) product. Traditionally, the context words i are those words within a small window around j, and the model is trained with negative sampling (Goldberg and Levy, 2014).
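To make the pointwise-product view concrete, here is a minimal NumPy sketch (illustrative, not the authors' code) of the identity above: the dot product $c_i^\top w_j$ equals the all-ones vector dotted with the Hadamard product $c_i \odot w_j$. All names and dimensions are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100                    # embedding dimensionality (illustrative)
c_i = rng.normal(size=d)   # context embedding
w_j = rng.normal(size=d)   # target word embedding

# Standard skip-gram logit: dot product of context and word vectors.
score_dot = c_i @ w_j

# Equivalent form: all-ones vector dotted with the Hadamard product.
# Writing it this way exposes the tensor generalization: higher-order
# models simply multiply in more factors pointwise.
score_hadamard = np.ones(d) @ (c_i * w_j)

assert np.isclose(score_dot, score_hadamard)
```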

Skip-Gram as Matrix Factorization
Levy and Goldberg (2014b), and subsequently Keerthi et al. (2015), showed that vectors learned under SG with negative sampling are, under certain conditions, a factorization of the (shifted) positive pointwise mutual information matrix. Cotterell et al. (2017) showed that SG is a form of exponential family PCA that factorizes the matrix of word/context cooccurrence counts (rather than shifted positive PMI values). With this interpretation, they generalize SG from matrix to tensor factorization, providing a theoretical basis for modeling higher-order SG (i.e., additional context, such as morphological features of words) within a word embeddings framework.
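The Levy and Goldberg (2014b) matrix view can be made concrete in a few lines; the following is a generic shifted-PPMI/SVD sketch, not any of the cited implementations, with all names and values illustrative:

```python
import numpy as np

def sppmi_embeddings(counts, dim, shift=np.log(15)):
    """Factorize the shifted positive PMI matrix of a cooccurrence
    count matrix via truncated SVD; rows are word embeddings."""
    total = counts.sum()
    p_w = counts.sum(axis=1, keepdims=True) / total   # word marginals
    p_c = counts.sum(axis=0, keepdims=True) / total   # context marginals
    with np.errstate(divide="ignore"):
        pmi = np.log((counts / total) / (p_w * p_c))
    sppmi = np.maximum(pmi - shift, 0)  # shift ~ log(# negative samples)
    U, S, _ = np.linalg.svd(sppmi, full_matrices=False)
    return U[:, :dim] * np.sqrt(S[:dim])

counts = np.random.default_rng(3).integers(0, 20, size=(50, 50)).astype(float)
emb = sppmi_embeddings(counts, dim=10)
print(emb.shape)  # (50, 10)
```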
Specifically, Cotterell et al. recast higher-order SG as maximizing the log-likelihood

$\sum_{i,j,k} X_{ijk} \log p(i \mid j, k),$   (1)

$p(i \mid j, k) \propto \exp\left(\mathbf{1}^\top (c_i \odot w_j \odot a_k)\right),$   (2)

where $X_{ijk}$ is a cooccurrence count 3-tensor over words j, surrounding contexts i, and features k, and $a_k$ is a learned embedding of feature k.

Skip-Gram as n-Tensor Factorization
When factorizing an n-dimensional tensor to include an arbitrary number L of annotations, we replace feature k in Equation (1) and $a_k$ in Equation (2) with each included annotation type l and vector $\alpha_l$. $X_{i,j,k}$ becomes $X_{i,j,l_1,\ldots,l_L}$, representing the number of times word j appeared in context i with features $l_1$ through $l_L$. We maximize the log-likelihood

$\sum_{i,j,l_1,\ldots,l_L} X_{i,j,l_1,\ldots,l_L} \log p(i \mid j, l_1, \ldots, l_L), \quad p(i \mid j, l_1, \ldots, l_L) \propto \exp\left(\mathbf{1}^\top \left(c_i \odot w_j \odot \alpha_{l_1} \odot \cdots \odot \alpha_{l_L}\right)\right).$
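As a minimal sketch of this n-tensor scoring (based on the objective above; this is not the released implementation, and all names are illustrative), the unnormalized score of one tensor cell is the all-ones dot product of the pointwise product of every participating embedding:

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(1)
d = 100

def cell_score(context_vec, word_vec, annotation_vecs):
    """Unnormalized log-probability of context i given word j and L
    annotation embeddings: 1^T (c_i ⊙ w_j ⊙ α_1 ⊙ ... ⊙ α_L)."""
    return reduce(np.multiply, [context_vec, word_vec, *annotation_vecs]).sum()

# Embeddings for one (i, j, l1, l2) cell, e.g., a frame and a role label.
c_i = rng.normal(size=d, scale=0.1)
w_j = rng.normal(size=d, scale=0.1)
alpha_frame = rng.normal(size=d, scale=0.1)
alpha_role = rng.normal(size=d, scale=0.1)

print(cell_score(c_i, w_j, [alpha_frame, alpha_role]))
# With L = 1 this reduces to Cotterell et al. (2017)'s third-order
# skip-gram; with L = 0, to standard skip-gram.
```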

Experiments
Our end goal is to use multiple kinds of automatically obtained, "in-the-wild" frame semantic parses in order to improve the semantic content, specifically SPR-type information, within learned lexical embeddings. We use the majority of the Concretely Annotated New York Times and Wikipedia corpora from Ferraro et al. (2014). These have been annotated with three frame semantic parses: FrameNet from Das et al. (2010), and both FrameNet and PropBank from Wolfe et al. (2016). In total, we use nearly five million frame-annotated documents.

Extracting Counts
The baseline extraction we consider is a standard sliding window: for each word $w_j$ seen at least T times, extract all words $w_i$ within two tokens to the left or right of $w_j$. These counts, forming a matrix, are then used within standard word2vec.
We also follow Cotterell et al. (2017) and augment the above with the signed number of tokens separating $w_i$ and $w_j$ (e.g., recording that $w_i$ appeared two tokens to the left of $w_j$); these counts form a 3-tensor.
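A minimal sketch of both windowed extractions (illustrative only; the "seen at least T times" vocabulary threshold and corpus-scale streaming are omitted):

```python
from collections import Counter

def window_counts(tokens, window=2, with_distance=False):
    """Sliding-window cooccurrence counts. Without distances, the counts
    form a (word, context) matrix; with signed separations, a 3-tensor."""
    counts = Counter()
    for j, w_j in enumerate(tokens):
        for i in range(max(0, j - window), min(len(tokens), j + window + 1)):
            if i == j:
                continue
            key = (w_j, tokens[i], i - j) if with_distance else (w_j, tokens[i])
            counts[key] += 1
    return counts

sent = "she said bill would try the same tactic again".split()
print(window_counts(sent)[("try", "would")])                          # matrix cell
print(window_counts(sent, with_distance=True)[("try", "would", -1)])  # 3-tensor cell
```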
To turn semantic parses into tensor counts, we first identify relevant information from the parses. We consider all parses that are triggered by the target word $w_j$ (seen at least T times) and that have at least one role filled by some word in the sentence. We organize the extraction around roles and what fills them. We extract every word $w_r$ that fills a role of any triggered frame; each of those frame and role labels; and the distance between filler $w_r$ and trigger $w_j$. This process yields a 9-tensor X. Although we always treat the trigger as the "original" word (e.g., word j, with vector $w_j$), we later consider (1) what to include from X, (2) what to predict (i.e., what to treat as the "context" word i), and (3) what to treat as auxiliary features.
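To illustrate the frame-based extraction, the sketch below assumes a lightweight stand-in parse representation (the Role and Parse records are hypothetical, not the Concretely Annotated format); aggregating these keys over the corpus populates the count tensor X:

```python
from collections import Counter, namedtuple

Role = namedtuple("Role", "label filler_index")                  # hypothetical
Parse = namedtuple("Parse", "schema frame trigger_index roles")  # hypothetical

def frame_counts(tokens, parses):
    """Count (trigger, filler, separation, schema, frame, role) tuples
    for every role of every triggered parse in one sentence."""
    counts = Counter()
    for p in parses:
        trigger = tokens[p.trigger_index]
        for role in p.roles:
            sep = role.filler_index - p.trigger_index  # signed distance
            counts[(trigger, tokens[role.filler_index], sep,
                    p.schema, p.frame, role.label)] += 1
    return counts

sent = "she said bill would try the same tactic again".split()
parses = [Parse("framenet", "ATTEMPT", 4, [Role("AGENT", 2), Role("GOAL", 7)])]
print(frame_counts(sent, parses))
```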

Data Discussion
The baseline extraction methods result in roughly symmetric target and surrounding word counts. This is not the case for the frame extraction. Our target words must trigger some semantic parse, so our target words are actually target triggers. However, the surrounding context words are those words that fill semantic roles. As shown in Table 1, there are an order of magnitude fewer triggers than target words, but up to an order of magnitude more surrounding words.

We extend Cotterell et al. (2017)'s implementation to enable any arbitrary-dimensional tensor factorization, as described in §3.2. We learn 100-dimensional embeddings for words that appear at least 100 times, using 15 negative samples. The implementation is available at https://github.com/fmof/tensor-factorization.
Metric We evaluate our learned (trigger) embeddings w via QVEC (Tsvetkov et al., 2015). QVEC uses canonical correlation analysis to measure the Pearson correlation between w and a collection of oracle lexical vectors o. These oracle vectors are derived from a human-annotated resource; crossing the SPR properties with the syntactic relations nsubj, dobj, iobj, and nsubjpass results in 80-dimensional oracle vectors.
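The QVEC-CCA score can be sketched with off-the-shelf tools; the following is a minimal approximation (not Tsvetkov et al.'s released code), assuming the learned and oracle matrices share row-aligned vocabularies and using synthetic data for shape:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(2)
n_words = 500
W = rng.normal(size=(n_words, 100))  # learned trigger embeddings
O = rng.normal(size=(n_words, 80))   # oracle SPR-derived vectors

# Project both views onto their top canonical component and report
# the Pearson correlation of the projections.
W_c, O_c = CCA(n_components=1).fit_transform(W, O)
r, _ = pearsonr(W_c[:, 0], O_c[:, 0])
print(f"QVEC-CCA-style score: {r:.3f}")
```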
Predict Fillers or Roles? Since SPR judgments are between predicates and arguments, we predict the words filling the roles, and treat all other frame information as auxiliary features. SPR annotations were originally based on (gold-standard) PropBank annotations, so we also train a model to predict PropBank frames and roles, thereby treating role-filling text and all other frame information as auxiliary features. In early experiments, we found it beneficial to treat the FrameNet annotations additively and not distinguish one system's output from another; treating the annotations additively serves as a type of collapsing operation. Although X started as a 9-tensor, we only consider up to 6-tensors: trigger, role filler, token separation between the trigger and filler, PropBank frame and role, FrameNet frame, and FrameNet role.

Results Fig. 2 shows the overall percent change in SPR-QVEC for the filler and role prediction models, on newswire (Fig. 2a) and Wikipedia (Fig. 2b), across different ablation models. We indicate additional contextual features with a +: sep uses the token separation distance between the frame trigger and role filler, fn-frame uses FrameNet frames, fn-role uses FrameNet roles, filler uses the tokens filling the frame role, and none indicates no additional information is used when predicting. The 0 line represents a plain word2vec baseline and the dashed line represents the 3-tensor baseline of Cotterell et al. (2017). Both of these baselines are windowed: they are restricted to a local context and cannot take advantage of frames or any lexical signal that can be derived from frames.

Overall, we obtain large improvements from models trained on lexical signals derived from frame output (sep and none), even if the model itself does not incorporate any frame labels. The embeddings that predict the role-filling lexical items (the green triangles) correlate more strongly with the SPR oracles than the embeddings that predict PropBank frames and roles (red circles). Examining Fig. 2a, we see that both model types outperform both the word2vec and Cotterell et al. (2017) baselines in nearly all model configurations and ablations. We see the largest improvement when predicting role fillers given the frame trigger and the number of tokens separating the two (the green triangles in the sep rows).
Comparing Fig. 2a to Fig. 2b, we see that newswire is more amenable to predicting PropBank frames and roles. We posit this is a type of out-of-domain error, as the PropBank parser was trained on newswire. We also find that newswire is overall more amenable to incorporating limited frame-based features, particularly when predicting PropBank using lexical role fillers as part of the contextual features. We hypothesize this is due to the significantly larger vocabulary of the Wikipedia role fillers (cf. Table 1). Note, however, that by using all available schema information when predicting PropBank, we are able to compensate for the increased vocabulary.
In Fig. 3 we display the ten nearest neighbors of three randomly sampled trigger words according to two of the highest-performing newswire models. Each model conditions on the trigger and the role filler/trigger separation; these correspond to the sep rows of Fig. 2a. The left column of Fig. 3 predicts the role filler, while the right column predicts PropBank annotations. While both models learn inflectional relations, this quality is more prominent in the model that predicts PropBank information, whereas the model predicting role fillers learns more non-inflectional paraphrases.

Related Work
The recent popularity of word embeddings has inspired others to consider leveraging linguistic annotations and resources to learn embeddings. Both Cotterell et al. (2017) and Levy and Goldberg (2014a) incorporate additional syntactic and morphological information in their word embeddings. Rothe and Schütze (2015) use lexical resource entries, such as WordNet synsets, to improve pre-computed word embeddings. Through generalized CCA, Rastogi et al. (2015) incorporate paraphrased FrameNet training data. On the applied side, Wang and Yang (2015) used frame embeddings, produced by training word2vec on tweet-derived semantic frame names, as additional features in downstream prediction. Teichert et al. (2017) similarly explored the relationship between semantic frames and thematic proto-roles. They proposed using a conditional random field (Lafferty et al., 2001) to jointly and conditionally model SPR and SRL, demonstrating slight improvements in jointly and conditionally predicting PropBank (Bonial et al., 2013)'s semantic role labels and Reisinger et al. (2015)'s proto-role labels.

Conclusion
We presented a way to learn embeddings enriched with multiple, automatically obtained frames from large, disparate corpora. We also presented a QVEC evaluation for semantic proto-roles. As demonstrated by our experiments, our extension of Cotterell et al. (2017)'s tensor factorization enriches word embeddings by including syntactic-semantic information not often captured, resulting in consistently higher SPR-based correlations. The implementation is available at https://github.com/fmof/tensor-factorization.
Figure 2: Effect of frame-extracted tensor counts on our SPR-QVEC evaluation. (a) Changes in SPR-QVEC for Annotated NYT. (b) Changes in SPR-QVEC for Wikipedia. Deltas are shown as relative percent changes vs. the word2vec baseline. The dashed line represents the 3-tensor word2vec method of Cotterell et al. (2017). Each row represents an ablation model: sep means the prediction relies on the token separation distance between the frame trigger and role filler, fn-frame means the prediction uses FrameNet frames, fn-role means the prediction uses FrameNet roles, and filler means the prediction uses the tokens filling the frame role. Read from top to bottom, additional contextual features are denoted with a +. Note that when filler is used, we only predict PropBank roles.

Figure 3: K-nearest neighbors for three randomly sampled trigger words, from two newswire models.