Affordance Extraction and Inference based on Semantic Role Labeling

Common-sense reasoning is becoming increasingly important for the advancement of Natural Language Processing. While word embeddings have been very successful, they cannot explain which aspects of ‘coffee’ and ‘tea’ make them similar, or how they could be related to ‘shop’. In this paper, we propose an explicit word representation that builds upon the Distributional Hypothesis to represent meaning from semantic roles, and that allows inference of relations from their meshing, as supported by the affordance-based Indexical Hypothesis. We find that our model improves on the state of the art in unsupervised word similarity tasks while allowing for direct inference of new relations from the same vector space.


Introduction
The word representations used more recently in Natural Language Processing (NLP) have been based on the Distributional Hypothesis (DH) (Harris, 1954): "words that occur in the same contexts tend to have similar meanings". This simple idea has led to the development of powerful word embedding models, starting with Latent Semantic Analysis (LSA) (Landauer and Dumais, 1997) and later, the popular word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) models. Although effective at quantifying the similarity between words (and phrases) such as 'tea' and 'coffee', these models cannot relate that judgement to the fact that both can be sold, for instance. Furthermore, current representations cannot inform us about possible relations between words occurring in mostly distinct contexts, such as using a 'newspaper' to cover a 'face'. While there have been substantial improvements to word embedding models over the years, these shortcomings have endured (Camacho-Collados and Pilehvar, 2018).

Word Pairs (w1, w2)    Affordances (w1 as ARG0, w2 as ARG1)
shop, tea              sell, import, cure
doctor, patient        diagnose, prescribe, treat
newspaper, face        cover, expose, poke
man, cup               drink, pour, spill

Table 1: Results from affordance meshing (coordination) using automatically labelled semantic roles.

Glenberg et al. (2000) identified these issues soon after LSA was introduced, and cautioned that high-dimensional word representations, such as those based on the DH, lack the necessary grounding to be proper semantic analogues. Instead, Glenberg proposed the Indexical Hypothesis (IH), which holds that meaning is constructed by (a) indexing words and phrases to real objects or perceptual, analog symbols; (b) deriving affordances from the objects and symbols; and (c) meshing the affordances under the guidance of syntax. Following Glenberg et al. (2000), this work considers an object's affordances as its possibilities for action constrained by its context, including actions which may not be directly perceived, which differs slightly from the original definition of Gibson (1979). Even though the language grounding advocated by the IH is beyond the reach of NLP by itself, we believe that its representation of meaning through affordances can still be captured to a useful extent.
Our contribution is a word-level representation that allows for the affordance correspondence and meshing supported by the IH. These affordances are approximated from occurrences of semantic roles in corpora through an adaptation of models based on the DH. Our work is motivated by two observations: (1) a pressing need to integrate common-sense knowledge in NLP models and (2) recent improvements to Semantic Role Labeling (SRL) have made affordance extraction from raw corpora sufficiently reliable. We find that our model (A2Avecs) performs competitively on word similarity tasks while enabling novel 'who-does-what-to-whom' style inferences (Table 1).

Related Work
This work is closely related to the research area of selectional preferences, where the goal is to predict the likelihood of a verb taking a certain argument for a particular role (e.g. the likelihood of man being an agent of drive). Most notably, Erk et al. (2010) proposed a distributional model of selectional preferences that used SRL annotations as a primary set of verb-role-arguments from which to generalize using word representations based on the DH and several word similarity metrics. Progress in selectional preferences is usually measured through correlations with human thematic fit judgements and, more recently, neural approaches (de Cruys, 2014; Tilk et al., 2016) obtained state-of-the-art results.
While this work shares some of these same elements (i.e. SRL and word embeddings), they are used to predict potential affordances instead of selectional preferences. Consequently, our representations are designed to enable the meshing proposed by the IH, allowing us to infer affordances that would not be likely under a selectional preference learning scheme (e.g. newspaper-cover-face from Table 1). Additionally, this work is concerned with showing that information derived from SRL is complementary to information derived from DH methods, and thus focuses its evaluation on tasks related to lexical similarity rather than thematic fit correlations.

Method
Our word representations are modelled using Predicate-Argument Structures (PASs). These structures are obtained through SRL of raw corpora, and used to populate a sparse word/context co-occurrence matrix W where roles serve as contexts (features), and argument spans serve as the co-occurrence windows. The roles are predicates specified by argument type (e.g. eat|ARG0) and used in place of affordances. See Table 2 for a comparison of this context definition with the traditional lexical definition.
After computing our co-occurrence matrix we follow up with the additional steps employed by traditional bag-of-words models. We use Positive Pointwise Mutual Information (PPMI) to improve co-occurrence statistics, as used successfully by Bullinaria and Levy (2007) and Levy and Goldberg (2014b), and maintain explicit high-dimensional representations in order to preserve the context information required for affordance meshing. Previous works, such as Levy and Goldberg (2014a) and Stanovsky et al. (2015), have also produced word representations from syntactic context definitions (dependency parse trees and open information extraction, respectively) but have opted for following up with word2vec's SkipGram (SG) model, presumably influenced by a much higher number of contexts in their approaches.
We reduce the sparsity of our explicit PPMI matrix by linear combination and interpolation of semantically related vectors. The semantic relatedness is obtained from the cosine similarity of SG vectors. As evidenced by Baroni et al. (2014), SG seems best suited for estimating relatedness (or association). These steps are further described in the remainder of this section (see Fig. 1).

Extracting PASs
We use the AllenNLP (Gardner et al., 2017) implementation of the state-of-the-art SRL model of He et al. (2017) to extract PASs from an English Wikipedia dump from April 2010 (1B words). Since the automatic identification of predicates by an end-to-end SRL model may produce erroneous results, we ensure that these predicates are valid by restricting them to the set of verbs tagged in the Brown corpus (Francis and Kucera, 1979). We also use the spaCy parser (Honnibal and Montani, 2017) to reduce each argument phrase to its head noun phrase, reducing the dilution of the more relevant noun and predicate co-occurrence statistics (see Fig. 2). Additionally, we lemmatize the predicates (verbs) to their root form using WordNet's Morphy lemmatizer (Miller, 1992). Finally, we trim the vocabulary size and the number of roles by discarding those which occur fewer than 100 times, and consider only core and adjunct argument types. The result is a set C of observed contexts, such as <chase|ARG1, (the, cat)>, used to populate W.
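As a rough illustration of how a set of observed contexts could populate W, the following minimal sketch counts (word, role) pairs from already-extracted PASs. The input tuples, the function name `build_cooccurrences` and the way occurrences are counted for trimming are assumptions made for this example; the SRL, head-phrase reduction and lemmatization steps are taken as already done.

```python
from collections import Counter

# Hypothetical, already-processed PASs: (predicate lemma, argument type, head-phrase tokens).
pas_list = [
    ("chase", "ARG0", ["the", "dog"]),
    ("chase", "ARG1", ["the", "cat"]),
    ("chase", "ARG1", ["the", "cat"]),
    ("drink", "ARG1", ["red", "wine"]),
]

def build_cooccurrences(pas_list, min_count=1):
    """Count (word, role) pairs, where a role is written 'predicate|ARGTYPE'."""
    counts = Counter()
    for predicate, arg_type, tokens in pas_list:
        role = f"{predicate}|{arg_type}"
        for word in tokens:
            counts[(word, role)] += 1

    # Assumed trimming rule: drop words and roles whose total count is below
    # min_count (the text uses a threshold of 100 on a full corpus).
    word_freq, role_freq = Counter(), Counter()
    for (word, role), n in counts.items():
        word_freq[word] += n
        role_freq[role] += n
    return {
        (word, role): n
        for (word, role), n in counts.items()
        if word_freq[word] >= min_count and role_freq[role] >= min_count
    }

W = build_cooccurrences(pas_list)
W_trimmed = build_cooccurrences(pas_list, min_count=3)
```

A dictionary keyed by (word, role) keeps the sketch short; at corpus scale a sparse matrix with integer row and column indices would be the more natural container.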

Argument-specific PPMI
The authors of PropBank (Kingsbury and Palmer, 2002), which provides the annotations used for learning SRL, state that arguments are predicate-specific. Still, they also acknowledge that there are some trends in the argument assignments. For instance, the most consistent trends are that ARG0 is usually the agent, and ARG1 is the direct object or theme of the relation. This realisation leads us to adapt the PPMI measure to better account for the correlations between roles of the same argument type. Thus, we segment C by argument type, and apply PPMI independently considering each $C_{ARG}$, such that for each $W_{w,r}$:

$$\mathrm{PMI}(w, r) = \log \frac{f(w, r)}{f(w)\, f(r)}$$

$$\mathrm{PPMI}(w, r) = \max(\mathrm{PMI}(w, r), 0)$$

where w is a word from the vocabulary V, r is a role (context) from the set R of the same argument type as $C_{ARG}$, and f is the probability function.
The resulting matrix $M = \mathrm{PPMI}(W)$ maintains the dimensions of W and is slightly sparser.
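The segmentation by argument type can be sketched as follows: columns of W are grouped by the argument type of their role, and PPMI is computed within each group using probabilities estimated from that group alone. This is a minimal dense NumPy sketch under that reading; the function names are illustrative, and a corpus-scale version would need sparse matrices.

```python
import numpy as np

def ppmi(block):
    """PPMI of a dense count block; cells with zero counts map to 0."""
    total = block.sum()
    if total == 0:
        return np.zeros_like(block, dtype=float)
    p_wr = block / total
    p_w = p_wr.sum(axis=1, keepdims=True)   # marginal over roles in this block
    p_r = p_wr.sum(axis=0, keepdims=True)   # marginal over words in this block
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wr / (p_w * p_r))
    pmi[~np.isfinite(pmi)] = 0.0            # log(0) and 0/0 cells carry no evidence
    return np.maximum(pmi, 0.0)

def arg_specific_ppmi(W, roles):
    """Apply PPMI independently to the columns of each argument type."""
    W = np.asarray(W, dtype=float)
    M = np.zeros_like(W)
    arg_types = [role.split("|")[1] for role in roles]
    for arg in set(arg_types):
        cols = [j for j, a in enumerate(arg_types) if a == arg]
        M[:, cols] = ppmi(W[:, cols])
    return M

# Toy example: two words, three roles (two ARG0 roles, one ARG1 role).
roles = ["eat|ARG0", "drink|ARG0", "eat|ARG1"]
W = [[2, 0, 1],
     [0, 2, 1]]
M = arg_specific_ppmi(W, roles)
```

In the toy example the single ARG1 column is shared equally by both words, so its PPMI is zero, while each ARG0 role is specific to one word and receives a positive score.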

Leveraging Association
The constraints imposed by SRL yield a very reduced number of PAS-based contexts that can be extracted from a corpus, in comparison to lexical adjacency-based contexts. Moreover, the post-processing steps we perform, while otherwise beneficial (see Table 3), further trim this information.
To mitigate this issue, we also compute an embedding matrix A (see Section 3 for parameters), using the state-of-the-art lexical-based SG model of Bojanowski et al. (2017), and use those embeddings to obtain similarity values that can be used to interpolate missing values in M, through weighted linear combination. This way, existing vectors are re-computed as a linear combination of related vectors weighted by coefficients $\alpha_i$, which are defined from $\cos_A$, the cosine similarity in the SG representations.
The similarity threshold is tested on a few natural choices (0.5 ± 0.1) and validated from results on a single word similarity task (see Table 3). This approach is also used to define representations for words that are out-of-vocabulary (OOV) for M, but can be interpolated from related representations, similarly to Zhao et al. (2015). In conjunction with the interpolation, we apply half-down rounding to the vectors, before and after re-computing them, so that our representations remain efficiently sparse while benefitting from improved performance. Finally, we apply a quadratic transformation to enlarge the influence of meaningful co-occurrences, obtaining $M^{+} = \mathrm{interpolate}(M, A)^{2}$.
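The interpolation pipeline can be illustrated with a small sketch. The exact weighting is not reproduced here, so the scheme below is an assumption: each weight is the SG cosine similarity when it reaches the threshold and zero otherwise, and each vector becomes the weight-normalised sum of the related vectors. Rounding to integers with exact halves rounded down is likewise one reading of "half-down rounding". The rows of M and A are assumed to be aligned to the same vocabulary.

```python
import numpy as np

def round_half_down(X):
    """Round to the nearest integer, sending exact halves downwards."""
    return np.ceil(X - 0.5)

def interpolate(M, A, threshold=0.5):
    """Recompute each row of M as a similarity-weighted mean of related rows.

    Assumed weighting: alpha = cos_A if cos_A >= threshold, else 0.
    """
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    cos = A_unit @ A_unit.T
    alpha = np.where(cos >= threshold, cos, 0.0)
    M = round_half_down(M)                       # rounding before re-computing
    M_new = (alpha @ M) / alpha.sum(axis=1, keepdims=True)
    return round_half_down(M_new)                # rounding after re-computing

# Toy data: word 1 has an empty PPMI row but is close to word 0 in SG space.
M = np.array([[2.0, 0.0],
              [0.0, 0.0],
              [0.0, 3.0]])
A = np.array([[1.0, 0.0],
              [1.0, 0.1],
              [0.0, 1.0]])

M_plus = interpolate(M, A) ** 2                  # quadratic transformation
```

In this toy case the empty row of word 1 inherits a value from word 0, which is the intended effect for sparse or out-of-vocabulary words, while the unrelated word 2 is left unchanged apart from the squaring.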

Inferring Relations
The examples shown in Table 1 are easily obtained with our model through a simple procedure (see Algorithm 1) that matches different arguments of the same predicates. As was the case with argument-specific PPMI, this procedure is made possible by the fact that a significant portion of argument assignments remain consistent across predicates.
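The matching of different arguments of the same predicate can be sketched as below. This is not a reproduction of Algorithm 1: the product used for ranking, the function name `mesh_affordances` and the toy vectors are all assumptions chosen to show the idea of pairing the ARG0 roles of one word with the ARG1 roles of another.

```python
def mesh_affordances(vectors, w1, w2, top_k=3):
    """Rank predicates p for which w1 scores on p|ARG0 and w2 scores on p|ARG1."""
    v1, v2 = vectors[w1], vectors[w2]
    scores = {}
    for role, s1 in v1.items():
        predicate, arg_type = role.split("|")
        if arg_type != "ARG0":
            continue
        s2 = v2.get(f"{predicate}|ARG1", 0.0)
        if s1 > 0 and s2 > 0:
            scores[predicate] = s1 * s2   # assumed ranking score: product of both values
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical sparse vectors (role -> value); the numbers are made up.
vectors = {
    "shop": {"sell|ARG0": 3.0, "import|ARG0": 2.0, "close|ARG1": 4.0},
    "tea": {"sell|ARG1": 2.0, "import|ARG1": 2.0, "drink|ARG1": 5.0},
}

shared = mesh_affordances(vectors, "shop", "tea")
```

Because only predicates present in both argument positions survive, a role such as drink|ARG1 for 'tea' is ignored when 'shop' has no corresponding drink|ARG0 value.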

Evaluation and Experiments
The A2Avecs model introduced in this paper is used to generate 155,183 word vectors of 18,179 affordance dimensions. This section compares our model with lexical-based models (word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2017)) and other syntactic-based models (Deps (Levy and Goldberg, 2014a) and OpenIE (Stanovsky et al., 2015)). We use the Deps and OpenIE embeddings that the respective authors trained on a Wikipedia corpus and distributed online. Lexical models were trained using the same parameters, wherever applicable: Wikipedia corpus from April 2010 (same as mentioned in section 2.1); minimum word frequency of 100; window size of 2; 300 vector dimensions; 15 negative samples; and n-grams sized from 3 to 6 characters. We also show that our approach can make use of high-quality pretrained embeddings. We experiment with a fastText model pretrained on 600B tokens, referred to as 'fastText 600B' in contrast with the fastText model trained on Wikipedia.

Model Introspection
The explicit nature of the representations produced by our model makes them directly interpretable, similarly to other sparse representations such as those of Faruqui and Dyer (2015b). The examples presented in Table 4 demonstrate the relational capacity of our model, beyond associating meaningful predicates. In this introspection we highlight the top role contexts for a set of words, inspired by Levy and Goldberg (2014a), who presented the top syntactic contexts for the same words, and note that this introspection produces results that should correspond to the inverse selectional preferences of Erk et al. (2010).
Our online demonstration provides access to additional introspection queries, such as top words for given affordances, or which affordances are most distinguishable between a pair of words (determined by absolute difference).
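Because each dimension is a named role, introspection queries of this kind reduce to sorting rows of the matrix. The sketch below shows two of them on made-up values: the top roles of a word, and the roles that differ most between two words by absolute difference. The function names and the toy matrix are illustrative assumptions and do not describe the online demonstration.

```python
import numpy as np

def top_roles(M, roles, word_idx, k=3):
    """Roles with the highest values for one word."""
    order = np.argsort(-M[word_idx], kind="stable")[:k]
    return [roles[j] for j in order]

def most_distinguishing(M, roles, i, j, k=3):
    """Roles with the largest absolute difference between two words."""
    diff = np.abs(M[i] - M[j])
    order = np.argsort(-diff, kind="stable")[:k]
    return [roles[t] for t in order]

# Toy matrix: rows are words, columns are roles (values are made up).
roles = ["drink|ARG1", "sell|ARG1", "read|ARG1"]
words = ["tea", "newspaper"]
M = np.array([[5.0, 1.0, 0.0],
              [1.0, 1.0, 3.0]])

tea_top = top_roles(M, roles, words.index("tea"), k=2)
contrast = most_distinguishing(M, roles, 0, 1, k=2)
```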

Word Similarity
The results presented in Table 5 show that our model can outperform lexical and syntactic models, in spite of maintaining an explicit representation. In fact, applying Singular Value Decomposition (SVD) to obtain dense 300-dimensional embeddings degrades performance. We achieve the best results with the concatenation of the fastText 600B vectors with our model interpolated using those same vectors for the vocabulary $V_{M^{+}} \cap V_{A}$, after normalizing both to unit length (L2). Interestingly, the same concatenation process with Deps embeddings does not seem as beneficial, suggesting that our representations are more complementary.
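The concatenation step can be sketched as follows: both matrices are restricted to the shared vocabulary, each row is scaled to unit L2 norm, and the rows are joined side by side. The helper names and the toy data are assumptions for illustration only.

```python
import numpy as np

def l2_normalize(X):
    """Scale each row to unit L2 norm, leaving all-zero rows untouched."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return X / norms

def concat_shared(M_plus, vocab_m, A, vocab_a):
    """Concatenate normalised rows of M_plus and A over the shared vocabulary."""
    idx_m = {w: i for i, w in enumerate(vocab_m)}
    idx_a = {w: i for i, w in enumerate(vocab_a)}
    shared = [w for w in vocab_m if w in idx_a]
    rows_m = l2_normalize(M_plus[[idx_m[w] for w in shared]])
    rows_a = l2_normalize(A[[idx_a[w] for w in shared]])
    return shared, np.hstack([rows_m, rows_a])

# Toy data: 'cup' is missing from the dense embeddings and is therefore dropped.
vocab_m = ["tea", "cup", "shop"]
M_plus = np.array([[4.0, 0.0, 3.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 2.0, 0.0]])
vocab_a = ["shop", "tea"]
A = np.array([[0.0, 5.0],
              [3.0, 4.0]])

shared, X = concat_shared(M_plus, vocab_m, A, vocab_a)
```

Normalising before concatenation gives both halves equal weight in a cosine comparison, regardless of the very different dimensionality and value ranges of the two representations.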

Conclusion
Our results suggest that semantic similarity can be captured in a vector space that also allows for the inference of new relations through affordance-based representations, which opens up exciting possibilities for the field. In the process, we presented more evidence that information obtained from SRL is complementary to information obtained from adjacency-based contexts, or even contexts based on syntactic dependencies. We believe this work helps bridge the gap between selectional preferences and semantic plausibility, beyond frequentist generalizations based on the DH. In the near term, we expect that specific tasks such as Entity Disambiguation and Coreference can benefit from these representations. With further developments, semantic plausibility assessments should also be useful for broader tasks such as Fact Verification and Story Understanding.

Figure 2: Parse tree for the sentence 'The dog with white stripes chased the cat.' The label for ARG0 is repositioned to the smaller subtree.

Table 2: Different context definitions applied to the sentence 'John drinks red wine slowly'. Top: our proposed definition; bottom: lexical adjacency definition (with window size of 2).
a: Failed after using too much memory.

Table 3: Sensitivity/impact analysis for some parameters of our approach.

Table 4: Top role contexts for a set of words, using the same words from the introspection of Levy and Goldberg (2014a) to clarify the difference in the representations of both approaches.

Table 5: Results for word similarity tasks (see Faruqui and Dyer (2014) for task descriptions). The top section shows results from training on the Wikipedia corpus exclusively. The bottom section shows results where we used SG embeddings (A) trained on a larger corpus for performing interpolation and concatenation on the same set of roles used above. For comparison, we also show results for Deps concatenated with those embeddings.